The recognizer property is used in multiple verbs (gather, transcribe, dial). It selects and configures the speech recognizer.
It is an object containing the following properties:
option | description | required |
---|---|---|
vendor | Speech vendor to use (see list below, along with any others you add via the custom speech API) | no |
language | Language code to use for speech detection. Defaults to the application level setting | no |
fallbackVendor | Fallback Speech vendor to use (see list below, along with any others you add via the custom speech API) | no |
fallbackLanguage | Fallback Language code to use for speech detection. Defaults to the application level setting | no |
interim | If true, interim transcriptions are sent | no (default: false) |
hints | (google, microsoft, deepgram, nvidia, soniox) Array of words or phrases to assist speech detection. See examples below. | no |
hintsBoost | (google, nvidia) Number indicating the strength to assign to the configured hints. See examples below. | no |
profanityFilter | (google, deepgram, nuance, nvidia) If true, filter profanity from speech transcription. Default: no | no |
singleUtterance | (google) If true, return only a single utterance/transcript | no (default: true for gather) |
vad.enable | If true, delay connecting to cloud recognizer until speech is detected | no |
vad.voiceMs | If vad is enabled, the number of milliseconds of speech required before connecting to cloud recognizer | no |
vad.mode | If vad is enabled, this setting governs the sensitivity of the voice activity detector; value must be between 0 and 3 inclusive, where lower numbers mean more sensitive | no |
separateRecognitionPerChannel | If true, recognize both caller and called party speech using separate recognition sessions | no |
altLanguages | (google, microsoft) An array of alternative languages that the speaker may be using | no |
punctuation | (google) Enable automatic punctuation | no |
model | (google) speech recognition model to use | no (default: phone_call) |
enhancedModel | (google) Use enhanced model | no |
words | (google) Enable word offsets | no |
diarization | (google) Enable speaker diarization | no |
diarizationMinSpeakers | (google) Set the minimum speaker count | no |
diarizationMaxSpeakers | (google) Set the maximum speaker count | no |
interactionType | (google) Set the interaction type: discussion, presentation, phone_call, voicemail, professionally_produced, voice_search, voice_command, dictation | no |
naicsCode | (google) set an industry NAICS code that is relevant to the speech | no |
vocabularyName | (aws) The name of a vocabulary to use when processing the speech. | no |
vocabularyFilterName | (aws) The name of a vocabulary filter to use when processing the speech. | no |
filterMethod | (aws) The method to use when filtering the speech: remove, mask, or tag. | no |
languageModelName | (aws) The name of the custom language model when processing speech. | no |
identifyChannels | (aws) Enable channel identification. | no |
profanityOption | (microsoft) masked, removed, or raw. Default: raw | no |
outputFormat | (microsoft) simple or detailed. Default: simple | no |
requestSnr | (microsoft) Request signal to noise information | no |
initialSpeechTimeoutMs | (microsoft) Initial speech timeout in milliseconds | no |
minConfidence | If provided, final transcripts with confidence less than this value return a reason of 'stt-low-confidence' in webhook | no |
transcriptionHook | Webhook to receive an HTTP POST when an interim or final transcription is received. | yes |
asrTimeout | timeout value for continuous ASR feature | no |
asrDtmfTerminationDigit | DTMF key that terminates continuous ASR feature | no |
azureServiceEndpoint | Custom service endpoint to connect to, instead of hosted Microsoft regional endpoints | no |
azureOptions (added in 0.8.5) | Azure-specific speech recognition options (see below) | no |
deepgramOptions (added in 0.8.0) | Deepgram-specific speech recognition options (see below) | no |
ibmOptions (added in 0.8.0) | IBM Watson-specific speech recognition options (see below) | no |
nuanceOptions (added in 0.8.0) | Nuance-specific speech recognition options (see below) | no |
nvidiaOptions (added in 0.8.0) | Nvidia-specific speech recognition options (see below) | no |
sonioxOptions (added in 0.8.2) | Soniox-specific speech recognition options (see below) | no |
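As a rough illustration of how these properties fit together, the sketch below builds a gather verb that carries a recognizer object and prints it as JSON. Python is used only to assemble the payload. The hook path and all values are hypothetical, and the dotted names in the table (vad.enable and so on) are assumed to map to a nested vad object.

```python
import json

# Recognizer settings taken from the table above; values are illustrative.
recognizer = {
    "vendor": "google",
    "language": "en-US",
    "fallbackVendor": "microsoft",
    "fallbackLanguage": "en-US",
    "interim": False,
    # Assumption: vad.enable / vad.voiceMs / vad.mode nest under "vad".
    "vad": {"enable": True, "voiceMs": 250, "mode": 2},
}

# A webhook response containing a single gather verb.
app = [
    {
        "verb": "gather",
        "input": ["speech"],
        "actionHook": "/transcript",  # hypothetical path
        "recognizer": recognizer,
    }
]

payload = json.dumps(app, indent=2)
print(payload)
```

The same recognizer object could be attached to a transcribe or dial verb in place of gather.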
jambonz natively supports the following speech-to-text services:
- assemblyai
- aws
- azure
- cobalt
- deepgram
- ibm
- nuance
- nvidia
- soniox
Note: Microsoft supports on-prem and private link options for deploying the speech service in addition to the hosted Microsoft service.
google, microsoft, deepgram, and nvidia all support providing a dynamic list of words or phrases that should be "boosted" by the recognizer, i.e. the recognizer should be more likely to detect these terms and return them in the transcript. A boost factor can also be applied. In the most basic implementation it would look like this:
"hints": ["benign", "malignant", "biopsy"],
"hintsBoost": 50
Additionally, google and nvidia allow a boost factor to be specified at the phrase level, e.g.
"hints": [
{"phrase": "benign", "boost": 50},
{"phrase": "malignant", "boost": 10},
{"phrase": "biopsy", "boost": 20}
]
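The two hint styles above can be generated from application data. The following is a minimal sketch; the helper names uniform_hints and per_phrase_hints are hypothetical and not part of jambonz.

```python
def uniform_hints(phrases, boost=None):
    """Return recognizer fields for hints that share one boost value."""
    fields = {"hints": list(phrases)}
    if boost is not None:
        fields["hintsBoost"] = boost
    return fields


def per_phrase_hints(boosts):
    """Return recognizer fields with a boost per phrase (google, nvidia)."""
    return {"hints": [{"phrase": p, "boost": b} for p, b in boosts.items()]}


recognizer = {"vendor": "google", "language": "en-US"}
recognizer.update(per_phrase_hints({"benign": 50, "malignant": 10, "biopsy": 20}))
print(recognizer)
```

Building the list programmatically also avoids hand-editing mistakes such as a trailing comma, which is not valid JSON.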
azureOptions is an object with the following properties. This option is available in jambonz 0.8.5 or above.
option | description | required |
---|---|---|
speechSegmentationSilenceTimeoutMs | Duration (in ms) of nonspeech audio within a phrase that's currently being spoken before that phrase is considered "done." See here for details | no |
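A short sketch of a recognizer that sets this option. The 800 ms value is an arbitrary example, not a recommendation.

```python
import json

# Illustrative values only.
recognizer = {
    "vendor": "microsoft",
    "language": "en-US",
    "azureOptions": {"speechSegmentationSilenceTimeoutMs": 800},
}

print(json.dumps({"recognizer": recognizer}, indent=2))
```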
nuanceOptions is an object with the following properties. Please refer to the Nuance Documentation for detailed descriptions. This option is available in jambonz 0.8.0 or above.
option | description | required |
---|---|---|
clientId | Nuance client ID to authenticate with (overrides setting in jambonz portal) | no |
secret | Nuance secret to authenticate with (overrides setting in jambonz portal) | no |
kryptonEndpoint | Endpoint of on-prem Krypton endpoint to connect to | no (defaults to hosted service) |
topic | specialized language model | no |
utteranceDetectionMode | How many sentences (utterances) within the audio stream are processed ('single', 'multiple', 'disabled') | no (default: single) |
punctuation | Whether to enable auto punctuation | no |
includeTokenization | Whether to include tokenized recognition result. | no |
discardSpeakerAdaptation | If speaker profiles are used, whether to discard updated speaker data. By default, data is stored. | no |
suppressCallRecording | Whether to disable call logging and audio capture. By default, call logs, audio, and metadata are collected. | no |
maskLoadFailures | Whether to terminate recognition when failing to load external resources | no |
suppressInitialCapitalization | When true, the first word in a sentence is not automatically capitalized. | no |
allowZeroBaseLmWeight | When true, custom resources (DLMs, wordsets, etc.) can use the entire weight space, disabling the base LM contribution | no |
filterWakeupWord | Whether to remove the wakeup word from the final result. | no |
resultType | The level of recognition results ('final', 'partial', 'immutable_partial') | no (default: final) |
noInputTimeoutMs | Maximum silence, in milliseconds, allowed while waiting for user input after recognition timers are started. | no |
recognitionTimeoutMs | Maximum duration, in milliseconds, of recognition turn | no |
utteranceEndSilenceMs | Minimum silence, in milliseconds, that determines the end of a sentence | no |
maxHypotheses | Maximum number of n-best hypotheses to return | no |
speechDomain | Mapping to internal weight sets for language models in the data pack | no |
userId | Identifies a specific user within the application | no |
speechDetectionSensitivity | A balance between detecting speech and noise (breathing, etc.), 0 to 1. 0 means ignore all noise, 1 means interpret all noise as speech | no (default: 0.5) |
clientData | An object containing arbitrary key, value pairs to inject into the call log. | no |
formatting.scheme | Keyword for a formatting type defined in the data pack | no |
formatting.options | Object containing key, value pairs of formatting options and values defined in the data pack | no |
resource | An array of zero or more recognition resources (domain LMs, wordsets, etc.) to improve recognition | no |
resource[].inlineWordset | Inline wordset JSON resource. See Wordsets for details | no |
resource[].builtin | Name of a builtin resource in the data pack | no |
resource[].inlineGrammar | Inline grammar, SRGS XML format | no |
resource[].wakeupWord | Array of wakeup words | no |
resource[].weightName | Input field setting the weight of the domain LM or builtin relative to the data pack ('defaultWeight', 'lowest', 'low', 'medium', 'high', 'highest') | no (default: medium) |
resource[].weightValue | Weight of DLM or builtin as a numeric value from 0 to 1 | no (default: 0.25) |
resource[].reuse | Whether the resource will be used multiple times ('undefined_reuse', 'low_reuse', 'high_reuse') | no (default: low_reuse) |
resource[].externalReference | An external DLM or settings file for creating or updating a speaker profile | no |
resource[].externalReference.type | Resource type ('undefined_resource_type', 'wordset', 'compiled_wordset', 'domain_lm', 'speaker_profile', 'grammar', 'settings') | no |
resource[].externalReference.uri | Location of the resource as a URN reference | no |
resource[].externalReference.maxLoadFailures | When true, allow transcription to proceed if resource loading fails | no |
resource[].externalReference.requestTimeoutMs | Time to wait when downloading resources | no |
resource[].externalReference.headers | An object containing HTTP cache-control directives (e.g. max-age etc) | no |
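Several of the resource fields accept only enumerated values, so it can help to check them before sending the payload. The sketch below is one way to do that. The helper nuance_resource and the builtin name "CALENDARX" are hypothetical, and resource[] in the table is assumed to be an array under the resource key.

```python
# Allowed values copied from the table above.
WEIGHT_NAMES = {"defaultWeight", "lowest", "low", "medium", "high", "highest"}
REUSE_VALUES = {"undefined_reuse", "low_reuse", "high_reuse"}


def nuance_resource(builtin, weight_name="medium", reuse="low_reuse"):
    """Build one resource[] entry for a builtin, checking enumerated values."""
    if weight_name not in WEIGHT_NAMES:
        raise ValueError(f"unknown weightName: {weight_name!r}")
    if reuse not in REUSE_VALUES:
        raise ValueError(f"unknown reuse: {reuse!r}")
    return {"builtin": builtin, "weightName": weight_name, "reuse": reuse}


recognizer = {
    "vendor": "nuance",
    "language": "en-US",
    "nuanceOptions": {
        "utteranceDetectionMode": "single",
        "resultType": "final",
        "speechDetectionSensitivity": 0.5,
        # "CALENDARX" is a placeholder builtin name for illustration.
        "resource": [nuance_resource("CALENDARX", weight_name="high")],
    },
}
print(recognizer)
```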
deepgramOptions is an object with the following properties. Please refer to the Deepgram Documentation for detailed descriptions. This option is available in jambonz 0.8.0 or above.
option | description | required |
---|---|---|
apiKey | Deepgram api key to authenticate with (overrides setting in jambonz portal) | no |
tier | Deepgram tier you would like to use ('enhanced', 'base') | no (default: base) |
model | Deepgram model used to process submitted audio ('general', 'meeting', 'phonecall', 'voicemail', 'finance', 'conversationalai', 'video', 'custom') | no (default: general) |
endpointing | The number of milliseconds of silence Deepgram uses to determine that a speaker has finished saying a word or phrase. The value must be either a number of milliseconds or 'false' to disable the feature entirely. Note: Deepgram's default endpointing value is 10 milliseconds. You can set a higher value to require more silence before a final transcript is returned, but we suggest a value of 1000 (one second) or less, as we have observed strange behaviors with higher values. To allow more time for pauses during a conversation before returning a transcript, we suggest using the utteranceEndMs feature described below instead. | no (default: 10ms) |
customModel | Id of custom model | no |
version | Deepgram version of model to use | no (default: latest) |
punctuate | Indicates whether to add punctuation and capitalization to the transcript | no |
profanityFilter | Indicates whether to remove profanity from the transcript | no |
redact | Whether to redact information from transcripts ('pci', 'numbers', 'true', 'ssn') | no |
diarize | Whether to assign a speaker to each word in the transcript | no |
diarizeVersion | if set to '2021-07-14.0' the legacy diarization feature will be used | no |
multichannel | Indicates whether to transcribe each audio channel independently | no |
alternatives | Number of alternative transcripts to return | no |
numerals | Indicates whether to convert numbers from written format (e.g., one) to numerical format (e.g., 1) | no |
search | An array of terms or phrases to search for in the submitted audio | no |
replace | An array of terms or phrases to search for in the submitted audio and replace | no |
keywords | An array of keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context | no |
tag | A tag to associate with the request. Tags appear in usage reports | no |
utteranceEndMs (added in 0.8.5) | The number of milliseconds of silence that Deepgram will wait after the last word was spoken before returning an UtteranceEnd event, which jambonz uses to trigger the transcript webhook if this property is supplied. This is essentially Deepgram's version of continuous ASR (in fact, enabling continuous ASR on Deepgram works by enabling this property) | no |
shortUtterance (added in 0.8.5) | Causes a transcript to be returned as soon as the Deepgram is_final property is set. This should only be used in scenarios where you are expecting a very short confirmation or directed command and you want minimal latency | no |
smartFormatting (added in 0.8.5) | Indicates whether to enable Deepgram's Smart Formatting feature. | no |
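The guidance on endpointing and utteranceEndMs can be encoded in a small helper. This is a sketch only. The function deepgram_options is hypothetical, and rejecting values above 1000 ms is a policy choice of this example based on the suggestion in the table, not a jambonz or Deepgram rule.

```python
def deepgram_options(endpointing=None, utterance_end_ms=None):
    """Build a deepgramOptions object from endpointing and utteranceEndMs."""
    opts = {}
    if endpointing is not None:
        if endpointing is not False:
            # bool is a subclass of int in Python, so reject True explicitly.
            if (isinstance(endpointing, bool)
                    or not isinstance(endpointing, int)
                    or endpointing < 0):
                raise ValueError("endpointing must be False or milliseconds >= 0")
            if endpointing > 1000:
                raise ValueError("use utteranceEndMs for pauses longer than 1000 ms")
        opts["endpointing"] = endpointing
    if utterance_end_ms is not None:
        opts["utteranceEndMs"] = utterance_end_ms
    return opts


recognizer = {
    "vendor": "deepgram",
    "language": "en-US",
    # Short endpointing, with a longer pause allowance via utteranceEndMs.
    "deepgramOptions": deepgram_options(endpointing=500, utterance_end_ms=1500),
}
print(recognizer)
```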
ibmOptions is an object with the following properties. Please refer to the IBM Watson Documentation for detailed descriptions. This option is available in jambonz 0.8.0 or above.
option | description | required |
---|---|---|
sttApiKey | IBM api key to authenticate with (overrides setting in jambonz portal) | no |
sttRegion | IBM region (overrides setting in jambonz portal) | no |
instanceId | IBM speech instance id (overrides setting in jambonz portal) | no |
model | The model to use for speech recognition | no |
languageCustomizationId | Id of a custom language model | no |
acousticCustomizationId | Id of a custom acoustic model | no |
baseModelVersion | Base model to be used | no |
watsonMetadata | a tag value to apply to the request data provided | no |
watsonLearningOptOut | set to true to prevent IBM from using your api request data to improve their service | no |
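A brief sketch of a recognizer using ibmOptions. The region and model strings are example values and should be checked against your IBM account. Credentials are omitted here on the assumption that they are configured in the jambonz portal.

```python
import json

# Illustrative values only; verify region and model names with IBM.
recognizer = {
    "vendor": "ibm",
    "language": "en-US",
    "ibmOptions": {
        "sttRegion": "us-south",
        "model": "en-US_Telephony",
        "watsonLearningOptOut": True,
    },
}

print(json.dumps({"recognizer": recognizer}, indent=2))
```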
nvidiaOptions is an object with the following properties. Please refer to the Nvidia Riva Documentation for detailed descriptions. This option is available in jambonz 0.8.0 or above.
option | description | required |
---|---|---|
rivaUri | gRPC endpoint (ip:port) that Nvidia Riva is listening on | no |
maxAlternatives | number of alternatives to return | no |
profanityFilter | Indicates whether to remove profanity from the transcript | no |
punctuation | Indicates whether to provide punctuation in the transcripts | no |
wordTimeOffsets | indicates whether to provide word-level detail | no |
verbatimTranscripts | Indicates whether to provide verbatim transcripts | no |
customConfiguration | An object of key-value pairs that can be sent to Nvidia for custom configuration | no |
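A short sketch that points the recognizer at a self-hosted Riva server. The address is a placeholder for your own deployment, and the other values are illustrative.

```python
import json

# Placeholder endpoint; replace with the ip:port of your Riva server.
recognizer = {
    "vendor": "nvidia",
    "language": "en-US",
    "hints": [{"phrase": "biopsy", "boost": 20}],
    "nvidiaOptions": {
        "rivaUri": "127.0.0.1:50051",
        "maxAlternatives": 1,
        "punctuation": True,
        "wordTimeOffsets": False,
    },
}

print(json.dumps({"recognizer": recognizer}, indent=2))
```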
sonioxOptions is an object with the following properties. Please refer to the Soniox Documentation for detailed descriptions. This option is available in jambonz 0.8.2 or above.
option | description | required |
---|---|---|
api_key | Soniox api key | no |
model | Soniox model to use | no (default: precision_ivr) |
profanityFilter | Indicates whether to remove profanity from the transcript | no |
storage | Properties that dictate whether to store audio and/or transcripts. Can be useful for debugging purposes. | no |
storage.id | storage identifier | no |
storage.title | storage title | no |
storage.disableStoreAudio | if true do not store audio | no (default: false) |
storage.disableStoreTranscript | if true do not store transcript | no (default: false) |
storage.disableSearch | if true do not allow search | no (default: false) |
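A sketch of sonioxOptions with storage settings, for example while debugging. The dotted storage.* names in the table are assumed to map to a nested storage object, and the id and title values are hypothetical.

```python
import json

# Assumption: storage.id, storage.title, etc. nest under "storage".
recognizer = {
    "vendor": "soniox",
    "language": "en-US",
    "sonioxOptions": {
        "model": "precision_ivr",
        "profanityFilter": False,
        "storage": {
            "id": "debug-call-001",        # hypothetical identifier
            "title": "debugging session",  # hypothetical title
            "disableStoreAudio": True,     # keep transcripts, drop audio
            "disableStoreTranscript": False,
            "disableSearch": False,
        },
    },
}

print(json.dumps({"recognizer": recognizer}, indent=2))
```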