The open source voice gateway for conversational AI providers

recognizer

The recognizer property is used in multiple verbs (gather, transcribe, dial). It selects and configures the speech recognizer.

It is an object containing the following properties:

option	description	required
vendor	Speech vendor to use (see list below, along with any others you add via the custom speech API)	no
language	Language code to use for speech detection. Defaults to the application level setting	no
fallbackVendor	Fallback Speech vendor to use (see list below, along with any others you add via the custom speech API)	no
fallbackLanguage	Fallback Language code to use for speech detection. Defaults to the application level setting	no
interim	If true, interim transcriptions are sent	no (default: false)
hints	(google, microsoft, deepgram, nvidia, soniox) Array of words or phrases to assist speech detection. See examples below.	no
hintsBoost	(google, nvidia) Number indicating the strength to assign to the configured hints. See examples below.	no
profanityFilter	(google, deepgram, nuance, nvidia) If true, filter profanity from speech transcription . Default: no	no
singleUtterance	(google) If true, return only a single utterance/transcript	no (default: true for gather)
vad.enable	If true, delay connecting to cloud recognizer until speech is detected	no
vad.voiceMs	If vad is enabled, the number of milliseconds of speech required before connecting to cloud recognizer	no
vad.mode	If vad is enabled, this setting governs the sensitivity of the voice activity detector; value must be between 0 to 3 inclusive, lower numbers mean more sensitive	no
separateRecognitionPerChannel	If true, recognize both caller and called party speech using separate recognition sessions	no
altLanguages	(google, microsoft) An array of alternative languages that the speaker may be using	no
punctuation	(google) Enable automatic punctuation	no
model	(google) speech recognition model to use	no (default: phone_call)
enhancedModel	(google) Use enhanced model	no
words	(google) Enable word offsets	no
diarization	(google) Enable speaker diarization	no
diarizationMinSpeakers	(google) Set the minimum speaker count	no
diarizationMaxSpeakers	(google) Set the maximum speaker count	no
interactionType	(google) Set the interaction type: discussion, presentation, phone_call, voicemail, professionally_produced, voice_search, voice_command, dictation	no
naicsCode	(google) set an industry NAICS code that is relevant to the speech	no
vocabularyName	(aws) The name of a vocabulary to use when processing the speech.	no
vocabularyFilterName	(aws) The name of a vocabulary filter to use when processing the speech.	no
filterMethod	(aws) The method to use when filtering the speech: remove, mask, or tag.	no
languageModelName	(aws) The name of the custom language model when processing speech.	no
identifyChannels	(aws) Enable channel identification.	no
profanityOption	(microsoft) masked, removed, or raw. Default: raw	no
outputFormat	(microsoft) simple or detailed. Default: simple	no
requestSnr	(microsoft) Request signal to noise information	no
initialSpeechTimeoutMs	(microsoft) Initial speech timeout in milliseconds	no
minConfidence	If provided, final transcripts with confidence less than this value return a reason of 'stt-low-confidence' in webhook	no
transcriptionHook	Webhook to receive an HTPP POST when an interim or final transcription is received.	yes
asrTimeout	timeout value for continuous ASR feature	no
asrDtmfTerminationDigit	DMTF key that terminates continuous ASR feature	no
azureServiceEndpoint	Custom service endpoint to connect to, instead of hosted Microsoft regional endpoints	no
azureOptions (added in 0.8.5)	Azure-specific speech recognition options (see below)	no
deepgramOptions (added in 0.8.0)	Deepgram-specific speech recognition options (see below)	no
ibmOptions (added in 0.8.0)	IBM Watson-specific speech recognition options (see below)	no
nuanceOptions (added in 0.8.0)	Nuance-specific speech recognition options (see below)	no
nvidiaOptions (added in 0.8.0)	Nvidia-specific speech recognition options (see below)	no
sonioxOptions (added in 0.8.2)	Soniox-specific speech recognition options (see below)	no

Speech-to-text vendors

jambonz natively supports the following speech-to-text services:

assemblyai
aws
azure
cobalt
deepgram
google
ibm
nuance
nvidia
sonoix

Note: Microsoft supports on-prem and private link options for deploying the speech service in addition to the hosted Microsoft service.

google, microsoft, deepgram, and nvidia all support the ability to provide a dynamic list of words or phrases that should be "boosted" by the recognizer, i.e. the recognizer should be more likely to detect this terms and return them in the transcript. A boost factor can also be applied. In the most basic implementation it would look like this:

"hints": ["benign", "malignant", "biopsy"],
"hintsBoost": 50

Additionally, google and nvidia allow a boost factor to be specified at the phrase level, e.g.

"hints": [
  {"phrase": "benign", "boost": 50},
  {"phrase": "malignant", "boost": 10},
  {"phrase": "biopsy", "boost": 20},
]

azureOptions is an object with the following properties. This option is available in jambonz 0.8.5 or above.

option	description	required
speechSegmentationSilenceTimeoutMs	Duration (in ms) of nonspeech audio within a phrase that's currently being spoken before that phrase is considered "done." See here for details	no

nuanceOptions is an object with the following properties. Please refer to the Nuance Documentation for detailed descriptions. This option is available in jambonz 0.8.0 or above.

option	description	required
clientId	Nuance client ID to authenticate with (overrides setting in jambonz portal)	no
secret	Nuance secret to authenticate with (overrides setting in jambonz portal)	no
kryptonEndpoint	Endpoint of on-prem Krypton endpoint to connect to	no (defaults to hosted service)
topic	specialized language model	no
utteranceDetectionMode	How many sentences (utterances) within the audio stream are processed ('single', 'multiple', 'disabled')	no (default: single
punctuation	Whether to enable auto punctuation	no
includeTokenization	Whether to include tokenized recognition result.	no
discardSpeakerAdaptation	If speaker profiles are used, whether to discard updated speaker data. By default, data is stored.	no
suppressCallRecording	Whether to disable call logging and audio capture. By default, call logs, audio, and metadata are collected.	no
maskLoadFailures	whether to terminate recogition when failing to load external resources	no
suppressInitialCapitalization	When true, the first word in a sentence is not automatically capitalized.	no
allowZeroBaseLmWeight	When true, custom resources (DLMs, wordsets, etc.) can use the entire	no
filterWakeupWord	Whether to remove the wakeup word from the final result.	no
resultType	The level of recognition results ('final', 'partial', 'immutable_partial')	no (default: final)
noInputTimeoutMs	Maximum silence, in milliseconds, allowed while waiting for user input after recognition timers are started.	no
recognitionTimeoutMs	Maximum duration, in milliseconds, of recognition turn	no
utteranceEndSilenceMs	Minimum silence, in milliseconds, that determines the end of a sentence	no
maxHypotheses	Maximum number of n-best hypotheses to return	no
speechDomain	Mapping to internal weight sets for language models in the data pack	no
userId	Identifies a specific user within the application	no
speechDetectionSensitivity	A balance between detecting speech and noise (breathing, etc.), 0 to 1. 0 means ignore all noise, 1 means interpret all noise as speech	no (default: 0.5)
clientData	An object containing arbitrary key, value pairs to inject into the call log.	no
formatting.scheme	Keyword for a formatting type defined in the data pack	no
formatting.options	Object containing key, value pairs of formatting options and values defined in the data pack	no
resource	An array of zero or more recognition resources (domain LMs, wordsets, etc.) to improve recognition	no
resource[].inlineWordset	Inline wordset JSON resource. See Wordsets for details	no
resource[].builtin	Name of a builtin resource in the data pack	no
resource[].inlineGrammar	Inline grammar, SRGS XML format	no
resource[].wakeupWord	Array of wakeup words	no
resource[].weightName	input field setting the weight of the domain LM or builtin relative to the data pack ('defaultWeight', 'lowest', 'low', 'medium', 'high', 'highest')	no (default = MEDIUM
resource[].weightValue	Weight of DLM or builtin as a numeric value from 0 to 1	no (default: 0.25)
resource[].reuse	Whether the resource will be used multiple times ('undefined_reuse', 'low_reuse','high_reuse')	no (default: low_reuse
resource[].externalReference	An external DLM or settings file for creating or updating a speaker profile	no
resource[].externalReference.type	Resource type ('undefined_resource_type', 'wordset', 'compiled_wordset', 'domain_lm', 'speaker_profile', 'grammar', 'settings')	no
resource[].externalReference.uri	Location of the resource as a URN reference	no
resource[].externalReference.maxLoadFailures	when true allow transcription to proceed resource loading fails	no
resource[].externalReference.requestTimeoutMs	Time to wait when downloading resources	no
resource[].externalReference.headers	An object containing HTTP cache-control directives (e.g. max-age etc)	no

deepgramOptions is an object with the following properties. Please refer to the Deepgram Documentation for detailed descriptions. This option is available in jambonz 0.8.0 or above.

option	description	required
apiKey	Deepgram api key to authenticate with (overrides setting in jambonz portal)	no
tier	Deepgram tier you would like to use ('enhanced', 'base')	no (default: base)
model	Deepgram model used to process submitted audio ('general', 'meeting', 'phonecall', 'voicemail', 'finance', 'conversationalai', 'video', 'custom')	no (default: general)
endpointing	Indicates the number of milliseconds of silence Deepgram will use to determine a speaker has finished saying a word or phrase. The value provided must be iether a number of milliseconds or 'false' to disable the feature entirely. Note: the default endpointing value that Deepgram uses is 10 milliseconds. You can set this value higher to allow to require more silence before a final transcript is returned but we suggest a value of 1000 (one second) or less, as we have observed strange behaviors with higher values. If you wish to allow more time for pauses during a conversation before returning a transcript, we suggest using the utteranceEndMs feature instead that is described below.	no (default: 10ms)
customModel	Id of custom model	no
version	Deepgram version of model to use	no (default: latest)
punctuate	Indicates whether to add punctuation and capitalization to the transcript	no
profanityFilter	Indicates whether to remove profanity from the transcript	no
redact	Whether to redact information from transcripts ('pci', 'numbers', 'true', 'ssn')	no
diarize	Wehther to assign a speaker to each word in the transcript	no
diarizeVersion	if set to '2021-07-14.0' the legacy diarization feature will be used	no
multichannel	Indicates whether to transcribe each audio channel independently	no
alternatives	Number of alternative transcripts to return	no
numerals	Indicates whether to convert numbers from written format (e.g., one) to numerical format (e.g., 1)	no
search	An array of terms or phrases to search for in the submitted audio	no
replace	An array of terms or phrases to search for in the submitted audio and replace	no
keywords	An array keywords to which the model should pay particular attention to boosting or suppressing to help it understand context	no
tag	A tag to associate with the request. Tags appear in usage reports	no
utteranceEndMs (added in 08.5)	a number of milliseconds of silence that deepgram will wait after the last word was spoken before returning an UtteranceEnd event, which is used by jambonz to trigger the transcript webhook if this proprety is supplied. This is essentially Deepgram's version of continous ASR (and in fact if you enable continuos ASR on Deepgram it will work by enabling this property)	no
shortUtterance (added in 08.5)	Causes a transcript to be returned as soon as the Deepgram is_final property is set. This should only be used in scenarios where you are expecting a very short confirmation or directed command and you want minimal latency	no
smartFormatting (added in 08.5)	Indicates whether to enable Deepgram's Smart Formatting feature.	no
fillerWords (added in 0.9.3)	Indicates if Deepgram should transcribe filler words	no

ibmOptions is an object with the following properties. Please refer to the IBM Watson Documentation for detailed descriptions. This option is available in jambonz 0.8.0 or above.

option	description	required
sttApiKey	IBM api key to authenticate with (overrides setting in jambonz portal)	no
sttRegion	IBM region (overrides setting in jambonz portal)	no
instanceId	IBM speech instance id (overrides setting in jambonz portal)	no
model	The model to use for speech recognition	no
languageCustomizationId	Id of a custom language model	no
acousticCustomizationId	Id of a custom acoustic model	no
baseModelVersion	Base model to be used	no
watsonMetadata	a tag value to apply to the request data provided	no
watsonLearningOptOut	set to true to prevent IBM from using your api request data to improve their service	no

nvidiaOptions is an object with the following properties. Please refer to the Nvidia Riva Documentation for detailed descriptions. This option is available in jambonz 0.8.0 or above.

option	description	required
rivaUri	grcp endpoint (ip:port) that Nvidia Riva is listening on	no
maxAlternatives	number of alternatives to return	no
profanityFilter	Indicates whether to remove profanity from the transcript	no
punctuation	Indicates whether to provide puncutation in the transcripts	no
wordTimeOffsets	indicates whether to provide word-level detail	no
verbatimTranscripts	Indicates whether to provide verbatim transcripts	no
customConfiguration	An object of key-value pairs that can be sent to Nvidia for custom configuration	no

sonioxOptions is an object with the following properties. Please refer to the Soniox Documentation for detailed descriptions. This option is available in jambonz 0.8.2 or above.

option	description	required
api_key	Soniox api key	no
model	Soniox model to use	no (default: precision_ivr)
profanityFilter	Indicates whether to remove profanity from the transcript	no
storage	properties that dictate whether to audio and/or transcripts. Can be useful for debugging purposes.	no
storage.id	storage identifier	no
storage.title	storage title	no
storage.disableStoreAudio	if true do not store audio	no (default: false)
storage.disableStoreTranscript	if true do not store transcript	no (default: false)
storage.disableSearch	if true do not allow search	no (default: false)