Does SpeechBrain allow vocal recognition of phonemes?


With my lab, we want to develop tests that use vocal recognition to diagnose dyslexia in children. Children will be asked to read aloud a list of words and pseudowords, and we want our speech recognition tool to be able to transcribe what the child actually said, even if he/she made some errors.
As an example, if the child correctly reads the pseudoword ‘stet’, we want the tool to recognize it as ‘stet’ and distinguish it from the real words ‘step’ or ‘set’ that may also be included in the task. On the other hand, if the child reads the word ‘nerve’ as ‘nevre’, we also want the tool to recognize that what was read was ‘nevre’ and not ‘nerve’.
That’s why we are looking for a framework we could use for phoneme recognition. Would it be possible to do this with SpeechBrain?
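To make the goal concrete, here is a rough sketch (plain Python; the phoneme labels and sequences are made up for illustration, not real recognizer output) of the kind of comparison we would like to run once a phoneme recognizer gives us a sequence for each utterance:

```python
# Sketch: score a child's reading by comparing the recognized phoneme
# sequence against the target pronunciation. Labels are illustrative
# ARPAbet-like symbols, not output from any real recognizer.

def edit_distance(target, recognized):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(target), len(recognized)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # cost of deleting i phonemes
    for j in range(n + 1):
        d[0][j] = j          # cost of inserting j phonemes
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if target[i - 1] == recognized[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n]

# 'nerve' read correctly vs. read as 'nevre':
target  = ["N", "ER", "V"]
correct = ["N", "ER", "V"]
misread = ["N", "EH", "V", "R"]  # hypothetical output for 'nevre'

print(edit_distance(target, correct))  # 0 -> read correctly
print(edit_distance(target, misread))  # 2 -> reading error detected
```

A distance of zero would mean the word was read as expected; anything above zero flags a candidate reading error for a clinician to review.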

Thank you for your answer


I’m pretty sure that you can train a phoneme recogniser with almost any available dataset as long as you have a correct phonetizer. @lugosch might know more about this.

For this particular use case, I think your best bet would be Allosaurus: GitHub - xinjli/allosaurus: Allosaurus is a pretrained universal phone recognizer for more than 2000 languages

(SpeechBrain does have a basic forced aligner, but it assumes that the speaker has correctly used one of the pronunciations for the word listed in the lexicon, which is not your use case.)


More generally, for others who want to train their own phoneme recognizer, you would need one of the following:

  1. audio labeled with phoneme sequences, in which case you could train a model very easily by modifying our TIMIT recipe or any CTC recipe—though note that there are very few datasets labeled with phonemes.


  2. audio labeled with transcripts, plus a lexicon (= a pronunciation dictionary with one or more phoneme sequences for each word). It’s not a straightforward one-to-one mapping between the transcript and the phoneme sequence for a given utterance, because each word typically has multiple possible pronunciations, and there may or may not be silence between each pair of consecutive words. So you need to run forced alignment to get the sequence of phoneme labels for an utterance (set up a chain of HMMs with each possible pronunciation for each word of the sentence and run the Viterbi algorithm to find the most likely sequence of pronunciations). We have a forced alignment recipe for TIMIT: speechbrain/recipes/TIMIT/Alignment at develop · speechbrain/speechbrain · GitHub
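To illustrate what "a chain of HMMs with each possible pronunciation" means in practice, here is a toy sketch (plain Python; the two-word lexicon is made up) that expands a transcript into every phoneme sequence consistent with it, including optional silence between words. Forced alignment would then score each candidate against the audio with Viterbi and keep the most likely one:

```python
from itertools import product

# Toy lexicon: each word maps to one or more pronunciations.
# Entries are illustrative, not from a real pronunciation dictionary.
lexicon = {
    "the":   [["DH", "AH"], ["DH", "IY"]],
    "nerve": [["N", "ER", "V"]],
}

def candidate_phone_sequences(words, lexicon, optional_sil="SIL"):
    """Enumerate every phoneme sequence consistent with the transcript:
    one pronunciation per word, with or without silence in each gap
    between consecutive words."""
    candidates = []
    pron_choices = [lexicon[w] for w in words]
    n_gaps = len(words) - 1
    for prons in product(*pron_choices):
        for sil_pattern in product([False, True], repeat=n_gaps):
            seq = list(prons[0])
            for has_sil, pron in zip(sil_pattern, prons[1:]):
                if has_sil:
                    seq.append(optional_sil)
                seq.extend(pron)
            candidates.append(seq)
    return candidates

cands = candidate_phone_sequences(["the", "nerve"], lexicon)
# 2 pronunciations of "the" x 2 silence options = 4 candidates.
print(len(cands))  # 4
```

The candidate set grows quickly with sentence length, which is why real aligners search the combined HMM lattice with Viterbi instead of enumerating sequences explicitly; this sketch is only meant to show what the search space contains.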

Thanks a lot for all this advice, we will have a look.

Hi lugosch,
Thanks for sharing this phoneme detection model.
Do you know any model that does the same thing for live audio? Or are there any tutorials/materials on training such a model?


Thank you so much for this information

Can I use the forced alignment recipe for TIMIT with AISHELL-1?

Or should the dataset have a phoneme transcription as well?