Any suggestions for how to incorporate context strings into an e2e transformer ASR model?
For example, we have a training set where each utterance/transcription pair also comes with a set of possible words that, if present, should be transcribed a particular way. A similar word list will be available for each inference request as well. Essentially, I’d like a way to pass ‘hints’ in, so that if we know at inference time that a name will be “Smythe” rather than “Smith”, the model can output the correct spelling.
Ideally, I’d like the model/training to learn how to use (or ignore) the context hints, rather than relying on some sort of post-processing LM rescoring.
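For concreteness, here’s the rough shape of what I’m imagining (just a toy numpy sketch, not a real implementation): encode each hint phrase into a vector, let the decoder state attend over those vectors, and include an extra all-zero “no-bias” slot so training can learn to ignore the hints when they don’t apply. The `embed_phrase` character-hash embedding below is a stand-in for whatever learned phrase encoder would actually be used.

```python
import numpy as np

def embed_phrase(phrase, dim=16, seed=0):
    # Toy character-hash embedding; a real system would use a
    # learned subword/phrase encoder shared with the ASR model.
    rng = np.random.default_rng(seed)
    table = rng.standard_normal((128, dim))
    vecs = [table[min(ord(c), 127)] for c in phrase.lower()]
    return np.mean(vecs, axis=0)

def context_bias(decoder_state, context_phrases, dim=16):
    """One step of cross-attention from the decoder state over the
    context-phrase embeddings. The appended zero vector is a 'no-bias'
    slot: attending to it contributes nothing, which is how the model
    could learn to ignore irrelevant hints."""
    keys = np.stack(
        [embed_phrase(p, dim) for p in context_phrases] + [np.zeros(dim)]
    )
    scores = keys @ decoder_state          # (num_phrases + 1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over hints + no-bias slot
    return weights @ keys                  # bias vector, shape (dim,)

# The bias vector would be added to the decoder state (or logits)
# before the output softmax at each decoding step.
bias = context_bias(np.ones(16), ["Smythe", "Smith"])
```

With no hints supplied, only the zero slot remains, so the bias collapses to zero and decoding is unaffected; that degenerate case is part of why the no-bias slot seems attractive for training the model to use hints only when they help.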