Speech-to-text API

Added in 0.8.2

jambonz provides native support for many speech recognition vendors. If you want to integrate with a vendor we don't yet support, you can do so by writing a server that implements our API.

The STT API is based on Websockets.

jambonz opens a Websocket connection towards a URL that you specify, and sends audio as well as JSON control text frames to your server. Your server is responsible for implementing the interface to your chosen speech vendor and returning results in JSON format back over the Websocket connection to jambonz.

Important Note: your server is responsible for closing the websocket connection. Generally, this is done after receiving the stop control message from jambonz.

Want to look at some working code? Check out these examples.

Authentication

jambonz sends an Authorization header on the HTTP request that creates the Websocket connection. The Authorization header contains an API key, e.g.

Authorization: Bearer <apiKey>

When you create a custom speech vendor in the jambonz portal, you specify an API key, which is then provided in the Authorization header whenever that custom speech vendor is used in your application.
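
As an illustration, a server might check the header along these lines before accepting the upgrade. This is a minimal sketch, not part of jambonz: the function names are hypothetical, the key is the example value used on this page, and how you read the header depends on your Websocket library.

```python
import hmac
from typing import Optional

# Example value from this page; in practice, load the key you entered
# in the jambonz portal from configuration.
EXPECTED_API_KEY = "foobarbazzle"


def extract_bearer_token(authorization: Optional[str]) -> Optional[str]:
    """Return the token from a 'Bearer <token>' header value, or None."""
    if not authorization:
        return None
    scheme, _, token = authorization.strip().partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def is_authorized(authorization: Optional[str],
                  expected_key: str = EXPECTED_API_KEY) -> bool:
    """Compare the presented key with the expected one in constant time."""
    token = extract_bearer_token(authorization)
    if token is None:
        return False
    return hmac.compare_digest(token.encode(), expected_key.encode())
```

If the check fails, the server would typically reject the HTTP request instead of upgrading it to a Websocket connection.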

In the example below, we create a custom speech service for AssemblyAI and add an API key of 'foobarbazzle'.

Note: this is not the API key that you may get from AssemblyAI to use their service.

[Figure: Creating a custom STT vendor]

Control messages sent by jambonz

Control messages are sent as JSON text frames. Audio is sent as binary frames containing LINEAR16 PCM-encoded audio sampled at 8 kHz.
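
To make the audio format concrete, here is a small sketch of working with one binary frame. LINEAR16 means 16-bit signed samples, so each sample occupies two bytes. The helper names are hypothetical, and the sketch assumes mono, little-endian audio, which this page does not state explicitly.

```python
import struct
from typing import List

SAMPLE_RATE_HZ = 8000   # per this page, currently always 8000
BYTES_PER_SAMPLE = 2    # LINEAR16: 16-bit signed PCM


def frame_duration_ms(frame: bytes, channels: int = 1) -> float:
    """Duration of the audio in one binary frame, in milliseconds.

    Assumes mono unless told otherwise (an assumption, not from the docs).
    """
    samples = len(frame) // (BYTES_PER_SAMPLE * channels)
    return samples * 1000.0 / SAMPLE_RATE_HZ


def decode_samples(frame: bytes) -> List[int]:
    """Unpack a frame into signed 16-bit integers.

    Assumes little-endian byte order (an assumption, not from the docs).
    Any trailing odd byte is ignored.
    """
    count = len(frame) // BYTES_PER_SAMPLE
    return list(struct.unpack(f"<{count}h", frame[:count * BYTES_PER_SAMPLE]))
```

Many vendors accept the raw bytes unchanged, in which case decoding is unnecessary and the frame can be forwarded as received.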

The first message that you will receive from jambonz after accepting and upgrading the HTTP request to a Websocket connection is a "start" control message, followed by binary audio frames.

Start control message

| property | type | description |
| --- | --- | --- |
| type | String | "start" |
| language | String | ISO language code (e.g. "en-US") |
| format | String | Defines audio format. Currently will always be "raw" |
| encoding | String | Defines how the audio is encoded. Currently will always be "LINEAR16" |
| interimResults | Boolean | whether or not interim (partial) results are being requested |
| sampleRateHz | Number | Sample rate of audio. Currently will always be 8000. |
| options | Object | This will contain any options that the application is passing on to the recognizer. This object may be empty. |
| options.hints | Array or Object | Any dynamic hints provided by the application. |
| options.hintsBoost | Number | A boost number to apply to the provided hints. |

Stop control message

jambonz sends a "stop" message when it is time to stop speech recognition.

jambonz does not close the socket after sending this control message, which allows your speech recognizer to return a final transcript if necessary. When you receive the stop control message, close and clean up the connection to the speech recognition service you are using, return a final transcript if there is one, and then close the Websocket with a normal close.
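
The message flow from jambonz (start, then audio, then stop) can be sketched as a small per-connection state object. This is an illustrative outline only: `StartParams` and `SttSession` are hypothetical names, the vendor integration is reduced to buffering audio, and the surrounding Websocket server is left to your library of choice.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Union


@dataclass
class StartParams:
    """Fields taken from the "start" control message."""
    language: str
    interim_results: bool
    sample_rate_hz: int
    encoding: str
    options: Dict[str, Any] = field(default_factory=dict)


class SttSession:
    """Tracks one jambonz connection: start, then audio, then stop."""

    def __init__(self) -> None:
        self.params: Optional[StartParams] = None
        self.audio = bytearray()
        self.should_close = False

    def on_frame(self, frame: Union[str, bytes]) -> None:
        # Binary frames carry audio; text frames carry JSON control messages.
        if isinstance(frame, (bytes, bytearray)):
            if self.params is None:
                raise ValueError("audio received before start message")
            # A real server would forward these bytes to the speech vendor.
            self.audio.extend(frame)
            return

        msg = json.loads(frame)
        kind = msg.get("type")
        if kind == "start":
            self.params = StartParams(
                language=msg["language"],
                interim_results=bool(msg.get("interimResults", False)),
                sample_rate_hz=int(msg.get("sampleRateHz", 8000)),
                encoding=msg.get("encoding", "LINEAR16"),
                options=msg.get("options") or {},
            )
        elif kind == "stop":
            # The caller should now shut down the vendor connection, send any
            # final transcript, and close the Websocket normally (code 1000).
            self.should_close = True
        # Other message types are ignored in this sketch.
```

In a receive loop, the server would pass each incoming frame to `on_frame` and, once `should_close` is set, flush any final transcript and close the connection itself, since jambonz leaves the socket open.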

| property | type | description |
| --- | --- | --- |
| type | String | "stop" |

Control messages sent to jambonz

Your server is responsible for sending transcriptions, as well as any errors, to jambonz.

Transcription control message

| property | type | description |
| --- | --- | --- |
| type | String | "transcription" |
| is_final | Boolean | indicates whether this is a final or interim transcription. |
| alternatives | Array | an ordered list of alternative transcriptions (must contain at least one). |
| alternatives[n].transcript | String | A transcript of the speaker's utterance. |
| alternatives[n].confidence | Number | A confidence probability, between 0 and 1. |
| language | String | the language that was recognized. |
| channel | Number | The channel number (only relevant if diarization is being performed, default to 1). |

Error control message

| property | type | description |
| --- | --- | --- |
| type | String | "error" |
| error | String | detailed error message. |
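
As a rough sketch, the two messages your server sends could be built with helpers like these. The function names are hypothetical; the field names follow the tables in this section, and the returned strings are meant to be sent as text frames.

```python
import json
from typing import List, Tuple


def transcription_message(
    alternatives: List[Tuple[str, float]],
    is_final: bool,
    language: str,
    channel: int = 1,
) -> str:
    """Build a "transcription" control message as a JSON string.

    `alternatives` is an ordered list of (transcript, confidence) pairs.
    """
    if not alternatives:
        raise ValueError("at least one alternative is required")
    for _, confidence in alternatives:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
    return json.dumps({
        "type": "transcription",
        "is_final": is_final,
        "alternatives": [
            {"transcript": text, "confidence": confidence}
            for text, confidence in alternatives
        ],
        "language": language,
        "channel": channel,
    })


def error_message(detail: str) -> str:
    """Build an "error" control message as a JSON string."""
    return json.dumps({"type": "error", "error": detail})
```

For example, `transcription_message([("hello world", 0.92)], True, "en-US")` produces a final transcription with a single alternative on channel 1.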