SpeechBrain: A Community Roadmap

Why a Roadmap?
As a community-based and open-source project, SpeechBrain needs the help of its community to grow in the right direction. Opening the roadmap to our users enables the toolkit to benefit from new ideas, new research directions, or even new technologies. The roadmap lists the changes and updates planned for the current version of SpeechBrain.

How can I contribute?
If you don’t have a precise idea of what you would like to do with SpeechBrain, feel free to pick one or more items from the list and implement or solve them! Do not hesitate to contact us to learn more about the progress of the items that interest you; there is certainly room for your help!

How do I propose a new item?
The roadmap heavily relies on the needs expressed by the community, so we need you to keep expanding it! If you think an item should be added to the roadmap, simply open a topic in the right category to foster some discussion. If enough people are interested, or if the task leader is convinced by your idea, they will add it to the roadmap.

The Roadmap

General Architecture

  • Measure the performance of mixed-precision training
  • Facilitate the use of multiple optimizers
  • Facilitate partial and gradual unfreezing of architectures
  • Ensure batch-independent evaluations
  • Facilitate the use of multiple dataloaders
  • Dynamic batching
  • Make the beamformer JIT-able (TorchScript)
  • Improve and expand tutorials
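Dynamic batching, for instance, groups utterances so that each batch stays under a fixed frame budget rather than holding a fixed utterance count. A minimal illustrative sketch, assuming utterance lengths in frames are known up front (the `dynamic_batches` helper and its frame budget are hypothetical, not SpeechBrain API):

```python
def dynamic_batches(lengths, max_frames=2000):
    """Group utterance indices so each padded batch stays under a frame budget.

    Sorting by length first keeps utterances of similar duration together,
    which minimizes the padding wasted in each batch.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, cur = [], []
    for i in order:
        # With ascending sort, lengths[i] is the longest utterance in the
        # batch, so the padded cost is lengths[i] * (batch size + 1).
        if cur and lengths[i] * (len(cur) + 1) > max_frames:
            batches.append(cur)
            cur = []
        cur.append(i)
    if cur:
        batches.append(cur)
    return batches

# Short utterances get packed together; long ones end up alone.
print(dynamic_batches([100, 900, 300, 1000], max_frames=1000))
# → [[0, 2], [1], [3]]
```

The design choice here is a greedy pass over length-sorted data; a production version would also shuffle the resulting batches between epochs to avoid always presenting utterances in length order.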

E2E Speech Recognition

  • Online decoding
  • Extend N-Gram decoding / rescoring to word-level decoding
  • Refactor the transformer interface for even more transparency.
  • Refactor the rescoring interface for even more transparency.
  • Windowed attention for faster training and decoding with attention
  • Scale wav2vec 2.0 experiments for ASR (various datasets, architectures …)
  • Other types of efficient transformers
  • Jasper and QuartzNet
  • Optimize and test for production scenarios (benchmarking)
  • Real-time CTC decoding
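Greedy (best-path) CTC decoding collapses repeated labels and removes blanks from the frame-wise argmax sequence; a real-time decoder applies the same rule incrementally as frames arrive. A minimal sketch over a precomputed argmax sequence (the function name is illustrative, not SpeechBrain API):

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path CTC decoding over per-frame argmax label indices:
    collapse consecutive repeats, then drop blank tokens."""
    out, prev = [], None
    for tok in frame_labels:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out

# Frames: blank, 1, 1 (repeat), blank, 2, 2 (repeat)
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2]))  # → [1, 2]
```

Because the rule only needs the previous frame's label, streaming it is a matter of carrying `prev` across chunks, which is what makes greedy CTC attractive for real-time use.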

HMM Speech Recognition

  • K2 integration

Self-Supervised Learning

  • Full implementation of wav2vec 2.0 (not only loading from Fairseq or HuggingFace)
  • Full implementation of PASE+
  • Fine-tuned models for more languages

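At the heart of wav2vec 2.0 is a contrastive (InfoNCE-style) objective: the model must identify the true quantized target among distractors sampled from other time steps. A simplified scalar sketch of that loss, assuming the similarity scores (e.g. cosine similarities) are already computed; this is nowhere near the full implementation:

```python
import math

def contrastive_loss(pos_sim, neg_sims, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive
    similarity against the distractors, with temperature scaling."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

# When the positive clearly beats the distractors, the loss approaches 0;
# when all candidates look alike, it sits at log(num_candidates).
print(contrastive_loss(1.0, [-1.0, -1.0]))   # near 0
print(contrastive_loss(0.5, [0.5, 0.5]))     # → log(3) ≈ 1.0986
```

A full implementation would add the diversity loss on the codebook and batch the computation over masked time steps, which is exactly the kind of contribution this roadmap item calls for.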
Spoken Language Understanding

  • Add MEDIA and PortMedia recipes.

Speech Enhancement

  • All done for now!

Speaker Identification and Verification

  • Anonymization
  • Couple Diarization pipeline with VAD (+ put model on HuggingFace)
  • Benchmark the ECAPA-TDNN architecture

Source Separation

  • Speech separation with a varying number of speakers

Speech Processing

  • Add more acoustic features (PLP, pitch …).


Text-to-Speech

  • DeepVoice
  • Tacotron 2
  • WaveNet
  • HiFi-GAN


Grapheme-to-Phoneme

  • G2P on HuggingFace