Run-time context strings

Any suggestions for how to incorporate context strings into an e2e transformer ASR model?

For example, we have a training set where each utterance and transcription also has a set of possible words that, if present, should be transcribed a particular way. A similar word list will be prepared for each inference request as well. Essentially I’d like a way to pass ‘hints’ in, so that if we know at inference time that a name will be “Smythe” rather than “Smith”, the model can output the correct spelling.

Ideally I’d like the model/training to learn how to use (or not use) the context hints, rather than doing some sort of post-processing LM rescoring.

There’s been some work on this for attention-based models in general (not just transformers); see e.g. [1808.02480] Deep context: end-to-end contextual speech recognition. The idea is that the decoder attends to the list of context words in addition to the audio, so the model can learn when a hint is relevant.
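As a rough illustration of that idea, here is a minimal numpy sketch of a bias-attention step in the spirit of that paper: the decoder state attends over embeddings of the context phrases, with an extra all-zero "no-bias" row prepended so the model can learn to ignore the hints entirely. All names, dimensions, and the projection matrices here are hypothetical placeholders, not the paper's actual parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_attention(decoder_state, context_embs, W_q, W_k):
    """Attend over context-phrase embeddings and return a bias
    vector that would be concatenated with the usual audio
    attention context. Hypothetical sketch, not the paper's code."""
    # Prepend a zero 'no-bias' row: attending to it contributes
    # nothing, which lets the model ignore the hints when they
    # don't match the audio.
    entries = np.vstack([np.zeros_like(context_embs[0]), context_embs])
    q = decoder_state @ W_q            # query from decoder state, (d_att,)
    k = entries @ W_k                  # keys from context entries, (n+1, d_att)
    scores = k @ q / np.sqrt(len(q))   # scaled dot-product scores
    weights = softmax(scores)          # attention over hints + no-bias
    return weights @ entries           # weighted bias vector, (d_emb,)

# Toy usage: 2 context phrases with embedding dim 4, decoder dim 8.
rng = np.random.default_rng(0)
dec = rng.normal(size=8)
ctx = rng.normal(size=(2, 4))
Wq = rng.normal(size=(8, 4))
Wk = rng.normal(size=(4, 4))
bias = context_attention(dec, ctx, Wq, Wk)
print(bias.shape)
```

At training time you'd sample hint lists per utterance (sometimes including the target spelling, sometimes distractors, sometimes nothing) so the model learns both to use and to ignore the bias, which matches what you're after.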
