Realtime Speech Enhancement on Jetson Nano

Hello hello!

I’m building a realtime speech enhancement pipeline to run on an NVIDIA Jetson Nano. The goal is to prototype a deep-learning-based hearing aid. I have three questions.

How would I go about implementing realtime inference? I’ve seen a few posts here mentioning it, but I’m not sure where to start. Do you know of any example code or tutorials I could draw inspiration from?
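To make the question concrete, here’s the kind of loop I have in mind, as a minimal numpy sketch. `enhance_chunk` is just a stand-in (a fixed attenuation); on the Nano the real model inference and a microphone callback would go in its place:

```python
import numpy as np

def enhance_chunk(chunk: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for the enhancement model.
    A real model would map a noisy chunk to an enhanced chunk
    of the same length; here we just attenuate."""
    return chunk * 0.5

def stream_enhance(signal: np.ndarray, chunk_size: int = 480) -> np.ndarray:
    """Feed the signal to the model one fixed-size chunk at a time,
    the way frames would arrive from an audio callback (480 samples
    = 30 ms at 16 kHz)."""
    out = np.zeros_like(signal)
    for start in range(0, len(signal), chunk_size):
        chunk = signal[start:start + chunk_size]
        out[start:start + len(chunk)] = enhance_chunk(chunk)
    return out

if __name__ == "__main__":
    sig = np.random.randn(16000).astype(np.float32)  # 1 s at 16 kHz
    enhanced = stream_enhance(sig)
    print(enhanced.shape)  # same length as the input
```

Is this chunk-by-chunk structure roughly right, or is there a standard way to do the buffering on the Nano?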

Am I understanding correctly that I can’t use transformer-based models? Since I’d need to divide the audio into chunks, and self-attention needs future timesteps, that wouldn’t work, would it? Is there an architecture you’d recommend?
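From what I’ve read, a causal (lower-triangular) attention mask is supposed to stop self-attention from looking at future timesteps — is that the right way to think about it? A toy numpy sketch of what I mean (single head, no learned projections, purely to illustrate the masking):

```python
import numpy as np

def causal_self_attention(x: np.ndarray) -> np.ndarray:
    """Toy single-head self-attention with a causal (lower-triangular)
    mask, so each timestep attends only to itself and the past."""
    t, d = x.shape
    scores = x @ x.T / np.sqrt(d)                      # (t, t) attention logits
    mask = np.triu(np.ones((t, t), dtype=bool), k=1)   # strictly upper triangle
    scores[mask] = -np.inf                             # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 4))
    y = causal_self_attention(x)
    # Changing a future timestep should not change earlier outputs:
    x2 = x.copy()
    x2[-1] += 1.0
    y2 = causal_self_attention(x2)
    print(np.allclose(y[:-1], y2[:-1]))  # True
```

If causal masking works like this, does that mean a streaming transformer is feasible on the Nano after all, or is the latency/compute still prohibitive?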

Am I also understanding correctly that I can’t use the mtl-mimic-voicebank pre-trained model (speechbrain/mtl-mimic-voicebank · Hugging Face) if my speech is not in English, because the model relies on features learned for an ASR task?