How to Implement the New Speech-to-Text in Chatbots

SAP Conversational AI has added speech-to-text support to its chatbot, but what is included in this support, and what are the different ways you can use it?

The documentation is available in GitHub (good information, including about other Web Client APIs), but below is my deconstruction, plus an example of how I implemented speech-to-text. You can also follow the two new speech-to-text tutorials.

What’s available with STT?

The speech-to-text capability provides a set of features, and you can use some of them or all of them. Here they are:

Microphone Button

Simply by creating the window.sapcai.webclientBridge.sttGetConfig object, you will get the microphone displayed.

The object can be empty and you will still get the microphone. It may not do anything, but you’ll get it.
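For example, this is roughly the smallest bridge that makes the button appear (a sketch; only the sttGetConfig name comes from the documentation):

```javascript
// Minimal bridge: defining sttGetConfig is enough for the Web Client
// to display the microphone button, even with an empty config.
const webclientBridge = {
  sttGetConfig: async () => {
    return {} // empty config: the microphone shows up but does nothing yet
  },
}

// In the page hosting the Web Client, expose it where the client looks for it:
// window.sapcai = { webclientBridge }
```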

Trigger UI Events

Along with the microphone, you are able to capture when someone presses it by implementing window.sapcai.webclientBridge.sttStartListening.

In this method, you can start your speech-to-text service, set up callback methods for when audio is transcribed, open the browser microphone, and do any start-up tasks you need.

Display Interim Results

If your STT service provides interim results while the user talks, the chatbot provides a place to put these interim transcriptions.

The Web Client provides a method you can call, onSTTResult. The method takes the text you want to display and a boolean to indicate whether these are interim results (displayed in the interim transcription window) or final (sent to the chatbot conversation as the next utterance).
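In other words, you can funnel every transcription through a small handler like this (a sketch; handleTranscript is my own name, and the onSTTResult call shape matches the IBM example later in this post):

```javascript
// Forward transcriptions to the Web Client.
// onSTTResult takes the text plus a flag marking it as final.
function handleTranscript(text, isFinal) {
  window.sapcai.webclient.onSTTResult({ text, final: isFinal })
}

// handleTranscript('exams ge', false) // shown in the interim transcription window
// handleTranscript('exams geek', true) // sent to the conversation as the next utterance
```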

The interim transcription window contains the abort and stop buttons, and you can capture when the user clicks these, via sttAbort and sttStopListening, and do any needed cleanup.

Capture Audio (Media Recorder)

If you want, the chatbot can handle interacting with the browser to capture the audio. This is configured in the window.sapcai.webclientBridge.sttGetConfig method. We’ll see an example where we handle the audio capture ourselves in the example below.

If the chatbot does handle the audio, then you can implement a few methods for handling this:

  • sttOnInterimAudioData, which receives an interim audio blob that you can pass on to your STT service
  • sttOnFinalAudioData, which receives a final audio blob when the user stops talking, which you can likewise pass on to your STT service
  • sttStopListening, which is called when the user stops talking and in which you can close websockets and do any other cleanup work
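The call sequence can be sketched like this (the method names are from the bridge; the buffering logic is purely illustrative):

```javascript
// Sketch of the callback sequence when the chatbot captures the audio.
const chunks = []
const bridgeImpl = {
  // interim chunks arrive repeatedly while the user is still talking
  sttOnInterimAudioData: async ([blob]) => { chunks.push(blob) },
  // the final chunk arrives once, when the user stops
  sttOnFinalAudioData: async ([blob]) => { chunks.push(blob) },
  // called last: close connections, release resources
  sttStopListening: async () => chunks.length,
}
```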

Using IBM Speech-to-Text Service

I used the recent SAP Community Code Challenge image editor project as the starting point. One of my added features was to enable a chatbot to select a community profile avatar to be loaded into the application.

  • You open the chatbot and type the community ID, and the image editor loads the corresponding avatar.
  • The chatbot checks whether it is a valid ID and lets the user know. If not valid, the chatbot still loads the default avatar.
  • The chatbot also fixes some formatting by eliminating spaces and converting to lowercase.
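That cleanup amounts to a one-line normalization, sketched here (the helper name is mine; the chatbot performs the equivalent step):

```javascript
// Strip all whitespace and lowercase the community ID before lookup.
function normalizeCommunityId(raw) {
  return raw.replace(/\s+/g, '').toLowerCase()
}
```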

Here’s how I enabled speech to text for the chatbot. If you want to replicate this, you would need to get an IBM Cloud account, get a service plan for speech to text (there is a free one), and generate tokens for connecting.

I created a file called webclient.js with just the tokens I needed. In real life I would have hidden the tokens and created a service to generate the IBM token, which expires every 30 minutes.

const data_expander_preferences = "<chatbot preferences>";
const data_channel_id = "<chatbot channel ID>";
const data_token = "<chatbot token>";
const ibmtoken = "<IBM token>";
const ibmurl = "<IBM service URL for your tenant>";

I then created a file called webClientBridge.js with the Web Client bridge object, whose STT methods delegate to my implementation file.

const webclientBridge = {

  // STT METHODS
  //--------------------
  callImplMethod: async (name, ...args) => {
    console.log(name)
    if (window.webclientBridgeImpl && window.webclientBridgeImpl[name]) {
      return window.webclientBridgeImpl[name](...args)
    }
  },

  // if this function returns an object, WebClient will enable the microphone button
  sttGetConfig: async (...args) => {
    return webclientBridge.callImplMethod('sttGetConfig', ...args)
  },
  sttStartListening: async (...args) => {
    return webclientBridge.callImplMethod('sttStartListening', ...args)
  },
  sttStopListening: async (...args) => {
    return webclientBridge.callImplMethod('sttStopListening', ...args)
  },
  sttAbort: async (...args) => {
    return webclientBridge.callImplMethod('sttAbort', ...args)
  },

  // only called if useMediaRecorder = true in sttGetConfig
  sttOnFinalAudioData: async (...args) => {
    return webclientBridge.callImplMethod('sttOnFinalAudioData', ...args)
  },
  // only called if useMediaRecorder = true in sttGetConfig
  sttOnInterimAudioData: async (...args) => {
    // send interim blob to STT service
    return webclientBridge.callImplMethod('sttOnInterimAudioData', ...args)
  },

  // OTHER BRIDGE METHODS
  //--------------------
  // called on each message
  onMessage: (payload) => {
    payload.messages.forEach(element => {
      if (element.participant.isBot && element.attachment.content.text.startsWith("SENDING AVATAR")) {
        const profile = element.attachment.content.text.substring(19);
        window.sapcai.webclientBridge.imageeditor.setSrc("https://avatars.services.sap.com/images/" + profile + ".png")
      }
    });
  },
}

window.sapcai = {
  webclientBridge,
}

And the last new file I created was webClientBridgeImpl.js with the implementation (I could have combined the last two files). This file opens a websocket to the IBM service when the user clicks the microphone and sends interim and final audio blobs to the service. The websocket callbacks put the transcribed texts into the interim transcription area or the conversation.

const IBM_URL = ibmurl
const access_token = ibmtoken

let wsclient = null

const sttIBMWebsocket = {

  sttGetConfig: async () => {
    return {
      useMediaRecorder: true,
      interimResultTime: 50,
    }
  },

  sttStartListening: async (params) => {
    const [metadata] = params
    const sttConfig = await sttIBMWebsocket.sttGetConfig()
    const interim_results = sttConfig.interimResultTime

    wsclient = new WebSocket(`wss://${IBM_URL}?access_token=${access_token}`)

    wsclient.onopen = (event) => {
      wsclient.send(JSON.stringify({
        action: 'start',
        interim_results,
        'content-type': `audio/${metadata.audioMetadata.fileFormat}`,
      }))
    }

    wsclient.onmessage = (event) => {
      const data = JSON.parse(event.data)
      const results = _.get(data, 'results', [])
      if (results.length > 0) {
        const lastresult = _.get(results, `[${results.length - 1}]`)
        const m = {
          text: _.get(lastresult, 'alternatives[0].transcript', ''),
          final: _.get(lastresult, 'final'),
        }
        window.sapcai.webclient.onSTTResult(m)
      }
    }

    wsclient.onclose = (event) => {
      console.log('OnClose')
    }

    wsclient.onerror = (event) => {
      console.log('OnError', JSON.stringify(event.data))
    }
  },

  sttStopListening: async () => {
    const client = wsclient
    setTimeout(() => {
      if (client) {
        client.close()
      }
    }, 5000)
  },

  sttAbort: async () => {
    if (wsclient) {
      wsclient.close()
      wsclient = null
    }
  },

  sttOnInterimAudioData: async (params) => {
    if (wsclient) {
      const [blob, metadata] = params
      wsclient.send(blob)
    }
  },

  sttOnFinalAudioData: async (params) => {
    if (wsclient) {
      const [blob, metadata] = params
      wsclient.send(blob)
      wsclient.send(JSON.stringify({
        action: 'stop',
      }))
    }
  },
}

window.webclientBridgeImpl = sttIBMWebsocket

In the SAPUI5 application, I did the following:

  • Loaded these JavaScript files in the sap.ui.define.
  • Loaded the chatbot script after rendering the view.
  • Saved a reference to the image editor in the window object, so I could reference it within the chatbot client APIs. I imagine there is a better way to do this.

If you want to see all the details, see the tutorial at Add Speech-to-Text to Your Chatbot (with recorder).

Result

I open the Community Contest image editor, and open the chatbot.

Then click on the microphone.

I say the name of a community ID, like “exams geek”. The transcribed words go into the interim transcription area.

When I stop talking, the text is transferred into the conversation, and my client-side onMessage method captures the text, checks what it says, and then updates the image editor picture with the proper avatar.