Speech Parser using Web APIs

Speech recognition (or speech parsing) is the technology for turning speech (audio) into text. It involves detecting the sounds in the audio and matching them to words using a language grammar.

Voice assistants use this technology, and they perform reasonably well, but it's hard to do: it involves machine learning algorithms and a lot of language data, which makes it rather expensive to build. However, browsers are implementing a native web API that handles all this with "ease", providing the methods and language tools needed for transcribing audio.

// Fall back to the webkit-prefixed constructors where the unprefixed ones are missing
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition
const SpeechGrammarList = window.SpeechGrammarList || window.webkitSpeechGrammarList
const recognition = new SpeechRecognition()
recognition.grammars = new SpeechGrammarList()
recognition.lang = "en-US"
recognition.interimResults = false // Only deliver final results
recognition.maxAlternatives = 1 // Only the best alternative per result
// Register the handlers before starting the recognition
recognition.onresult = event=> {
  const result = event.results[0][0]
  const { transcript, confidence } = result
  // The "transcript" contains the recognized text
  // The "confidence" is a 0 (lowest) to 1 (highest) scale
}
recognition.onspeechend = ()=> recognition.stop()
recognition.start()

This is the minimal code necessary to handle speech parsing. To actually do something useful, a number of additional steps should be performed. For one, support sucks: as of early 2020 this only works in browsers that expose the webkit-prefixed API. So the first step is checking that the API is available, because otherwise creating the SpeechRecognition instance throws an error.
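As a minimal sketch of that availability check: the helper names getSpeechRecognition and createRecognition are hypothetical (they are not part of the API), and the scope parameter is only an assumption made here so the check can also be exercised outside a browser.

```javascript
// Hypothetical helper: returns the SpeechRecognition constructor,
// or null when neither the standard nor the webkit-prefixed name exists.
function getSpeechRecognition(scope = globalThis) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null
}

// Hypothetical helper: only creates an instance when the API is available,
// so the caller gets null instead of an exception.
function createRecognition(scope = globalThis) {
  const Recognition = getSpeechRecognition(scope)
  if (!Recognition) return null
  const instance = new Recognition()
  instance.lang = "en-US"
  return instance
}

const supportedRecognition = createRecognition()
if (!supportedRecognition) {
  console.log("[Speech Parser] Speech recognition is not supported in this browser")
}
```

Returning null instead of throwing lets the page fall back to something else, such as a plain text input, when the API is missing.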

There are also a number of scenarios we haven't taken into account, such as errors, noisy audio, or foreign words. The API provides many additional event listeners, and each scenario should get its own handler.

recognition.onaudiostart = ()=> console.log("[Speech Parser] Now capturing live audio")
recognition.onaudioend = ()=> console.log("[Speech Parser] Finished capturing live audio")
recognition.onend = ()=> console.log("[Speech Parser] Recognition service has disconnected")
recognition.onnomatch = ()=> console.log("[Speech Parser] No comprehensible speech found (Either noise or not passing the threshold)")
recognition.onsoundstart = ()=> console.log("[Speech Parser] Sounds (speech or noise) detected")
recognition.onsoundend = ()=> console.log("[Speech Parser] Sounds (speech or noise) not detected anymore")
recognition.onspeechstart = ()=> console.log("[Speech Parser] Comprehensible speech detected")
recognition.onstart = ()=> console.log("[Speech Parser] Started")
recognition.onerror = event=> console.log("[Speech Parser] Error:", event.error)
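The error event carries an error property with a short code describing what went wrong, so a single handler can report something more specific than a generic message. The sketch below assumes the error codes defined by the Web Speech API. The message wording and the helper names describeSpeechError and attachErrorLogging are hypothetical, made up for illustration.

```javascript
// Error codes from the Web Speech API mapped to illustrative messages.
// A Map is used so that unrelated keys (like "constructor") never match.
const SPEECH_ERROR_MESSAGES = new Map([
  ["no-speech", "No speech was detected"],
  ["aborted", "Recognition was aborted"],
  ["audio-capture", "Audio capture failed"],
  ["network", "Network communication for the recognition failed"],
  ["not-allowed", "Speech input was blocked (for example, permission denied)"],
  ["service-not-allowed", "The recognition service is not allowed"],
  ["bad-grammar", "There was an error in the speech grammar"],
  ["language-not-supported", "The requested language is not supported"],
])

// Hypothetical helper: turns an error code into a readable message.
function describeSpeechError(code) {
  return SPEECH_ERROR_MESSAGES.get(code) || `Unknown error (${code})`
}

// Hypothetical helper: registers an error handler on a recognition instance.
// The "log" parameter defaults to console.log and exists only to ease testing.
function attachErrorLogging(target, log = console.log) {
  target.onerror = event=> log(`[Speech Parser] ${describeSpeechError(event.error)}`)
  return target
}
```

With a real instance this would be used as attachErrorLogging(recognition) before calling recognition.start().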

As I've mentioned before, support for this as of today (early 2020) is bad, to say the least. Don't present this as the killer cross-browser feature that's going to change the game, because it's still too soon. Check out all the details here: Can I Use
