DNN Beamforming and other front-end modules

Hi,
Does SpeechBrain in its current state support multi-channel end-to-end speech recognition?
Does it have any beamforming (DNN-based), masking, or biasing capabilities?

Hi! Yes it does, we have an example in our tutorials Here

However, I don’t think we currently have a recipe for that. @mravanelli could you confirm?

Not Found

The requested URL was not found on this server.

We support different beamformers (e.g., delay-and-sum, MVDR, GeV) and there is a tutorial on that. Some of my students are now working on connecting the beamformers to the speech recognition pipelines in different ways. It is something quite straightforward to do though (just connecting existing functionalities).
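
Something along these lines is what I mean by connecting the pieces (untested sketch: the input file name and the pretrained model are only placeholders, not an official recipe):

import torch
from speechbrain.dataio.dataio import read_audio
from speechbrain.processing.features import STFT, ISTFT
from speechbrain.processing.multi_mic import Covariance, GccPhat, DelaySum
from speechbrain.pretrained import EncoderDecoderASR

fs = 16000
xs = read_audio('multichannel.wav').unsqueeze(0)  # placeholder file, [batch, time, channels]

stft = STFT(sample_rate=fs)
cov = Covariance()
gccphat = GccPhat()
delaysum = DelaySum()
istft = ISTFT(sample_rate=fs)

Xs = stft(xs)               # per-channel STFT
tdoas = gccphat(cov(Xs))    # TDOAs from the covariance matrices
Ys = delaysum(Xs, tdoas)    # steer and sum in the frequency domain
ys = istft(Ys)              # enhanced single-channel waveform

asr = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech")
print(asr.transcribe_batch(ys.squeeze(-1), torch.tensor([1.0])))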

@stuartiannaylor oops: https://colab.research.google.com/drive/1UVoYDUiIrwMpBTghQPbA6rC1mc9IBzi6?usp=sharing


Does anyone have a guide on how to install on Arm64? Whatever I do I always seem to hit Intel MKL problems.

I just followed this install:

https://mathinf.eu/pytorch/arm64/2021-01/ after several failed attempts of my own, but still:

(venv) pi@raspberrypi:~/speech-brain/speechbrain-0.5.7 $ python3 delay-sum.py
/home/pi/speech-brain/venv/lib/python3.7/site-packages/torch/functional.py:585: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at  ../aten/src/ATen/native/SpectralOps.cpp:483.)
  normalized, onesided, return_complex)
Traceback (most recent call last):
  File "delay-sum.py", line 38, in <module>
    Xs = stft(xs)
  File "/home/pi/speech-brain/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 880, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pi/speech-brain/speechbrain-0.5.7/speechbrain/processing/features.py", line 171, in forward
    return_complex=False,
  File "/home/pi/speech-brain/venv/lib/python3.7/site-packages/torch/functional.py", line 585, in stft
    normalized, onesided, return_complex)
RuntimeError: fft: ATen not compiled with MKL support

Do you think you guys could port some of the libs to work without torch dependencies, so that we can get the satellites to connect to a torch-driven central brain and the brain can have ears?

I just want to use a simple low-load KWS and use the DOA of the KWS hit to focus while a command sentence is spoken, then reset for the next KWS hit.
I am trying to do this on a Pi3a+ that does just KWS & beamforming, then websockets to talk to a central brain, using common low-cost 2-mic hats.
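
The satellite side only needs to push the beamformed audio upstream; something like this is all I have in mind (rough sketch, the URI and the chunk source are made up):

import asyncio
import websockets  # pip install websockets

BRAIN_URI = "ws://brain.local:8765/audio"  # hypothetical endpoint on the central brain

async def stream_chunks(chunks):
    # send beamformed 16 kHz mono chunks (raw bytes) to the central brain
    async with websockets.connect(BRAIN_URI) as ws:
        for chunk in chunks:
            await ws.send(chunk)

# asyncio.run(stream_chunks(chunk_source()))  # chunk_source() is whatever yields audio bytes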

Has anyone got an Arm64 (RaspiOS64 even better) install guide that actually does work? It would be so amazing to get some simple satellites working.
PS: also, if William Aris or Francois Grondin could maybe look at a simple 2-mic subset for ESP32, getting some low-cost ears to a central brain would just be awesome, and even if the attenuation is small it can still make a huge difference in recognition.

We can’t due to resources being focused on other things. I’ll ping Francois to see if he can help.


According to the PyTorch devs it is to be done, with no firm ETA yet, as Torch Mobile also hits the same problems.

I think what Francois Grondin did in terms of examples is likely to garner some adoption at one point, but it would be great if we could be free of any large modules such as torch & torchaudio for standard math functions.

This makes sense. For now we use the torch modules, as SpeechBrain is aimed at offline processing. From what I understand, you would like to do some DoA estimation online with low-cost hardware. May I suggest you look at the ODAS project (odas.io)? This is a framework I designed in C for DoA estimation that seems to match your needs in terms of real-time processing. It can compile and run on an RPi3.

Yeah, I had a bit of a play, Francois, but again it was just swapping out for alternative modules (scipy.signal), as the idea is real-time offline networked KWS to a central brain, which can be Intel and offline in terms of ASR/TTS, and would route intents to various services.

from scipy import signal
import numpy as np

# STFT of each channel of the hard-coded 2-mic buffer (the common USB/hat case)
f1, t1, Zxx1 = signal.stft(buffer[:, 0], sample_rate)
f2, t2, Zxx2 = signal.stft(buffer[:, 1], sample_rate)
# covariance across time of the stacked frequency rows of both spectrograms
cov = np.cov(Zxx1, Zxx2)

I didn’t get very far, as I wasn’t particularly sure about GCC-PHAT, and I also wanted to try to stay away from yet another module and, for Arm64, just stay with numpy, as I was looking at this.
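
For reference, the GCC-PHAT part on its own only needs numpy; something along these lines is what I am aiming at (rough sketch, the function is my own, not from any library):

import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    # zero-pad so the circular cross-correlation covers all useful lags
    n = 2 * len(x1)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    R = X1 * np.conj(X2)
    R /= np.abs(R) + 1e-12                   # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:                  # physical limit, e.g. 0.05 / 343 for 50 mm spacing
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                        # estimated delay between the two channels, in seconds

# tdoa = gcc_phat_tdoa(buffer[:, 0], buffer[:, 1], 16000, max_tau=0.05 / 343)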

It’s not microcontroller-light, but I was also thinking that maybe with a bit of tinkering with Numba it could actually be quite light and programmatically accessible using the examples you have given. With a few more comments in the current code, even a dummy like me might be able to have a stab at it.

I did try ODAS quite a while ago and, apologies, but my memory is that it does run on a Pi3 yet doesn’t leave much room for anything else (I have this vague memory of using a web browser and just connecting, and ODAS had all sorts of tracking and was pretty amazing, but maybe a bit too much load and a lot that maybe isn’t needed).

The idea is to do something super simple: use a streaming KWS to capture the KW, and only run the TDOA sampling and calculation on a 1 s buffer that is filled from the same chunked windows fed to the streaming KWS.
On a KW hit, run a one-shot TDOA that drives a beamformer to help capture the command sentence, then return to a no-beamformer KWS listening state.

The processing section, for me, lends itself to being a companion module, and I really like what was done, as someone with a tad more Python skill than me is likely to get streaming working now that the brain-dizzying math has already been programmatically described.

I will have a look at ODAS again, but I am thinking it does far more than I need for what are distributed ‘ears’ :ear:

Ok; please let me know if you have further questions down the road. Cheers!

The code itself as-is looks like the scipy.signal.stft and numpy/torch equivalents are interchangeable.

Can I just ask: am I right in presuming that the time difference of the phase (the directional angle across the frequency steps) can be applied to a chunked stream?
A file-based version for a 4-mic array was provided as an example, but if you ignore the battle of doing the load in real time, it should still work?

Yes, you can use the STFT and then compute the TDOA for each pair of microphones. Then the system scans for the most likely DoA. If you ignore the burden of doing it in real time with very low latency, you can do it for a chunk of signal.
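
For a single chunk it boils down to something like this (sketch only; I am assuming two mics 50 mm apart on the x axis, with random data as a stand-in for the buffered audio):

import torch
from speechbrain.processing.features import STFT
from speechbrain.processing.multi_mic import Covariance, GccPhat, SrpPhat

fs = 16000
chunk = torch.randn(1, fs, 2)  # one second of 2-channel audio, [batch, time, channels]

mics = torch.zeros((2, 3), dtype=torch.float)
mics[0, :] = torch.FloatTensor([-0.025, 0.000, 0.000])
mics[1, :] = torch.FloatTensor([+0.025, 0.000, 0.000])

stft = STFT(sample_rate=fs)
cov = Covariance()
gccphat = GccPhat()
srpphat = SrpPhat(mics=mics)

Xs = stft(chunk)      # per-channel STFT
XXs = cov(Xs)         # cross-spectra for each mic pair
tdoas = gccphat(XXs)  # TDOA estimates per frame and pair
doas = srpphat(XXs)   # scan for the most likely DoA per frame
print(tdoas.shape, doas.shape)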

The idea, to just provide the lowest common denominator that is fit for purpose, is to have a ring buffer that is KW length and a bit more.
The buffer is frozen on a KW hit and used to provide TDOAs, which are used by a delay-and-sum beamformer on the presumption that the previous ~1 s contains mainly the KW from the voice actor.

It’s far from perfect, but single-word KWS often offers more resilience to noise and better overall accuracy than complex phonetic-based ASR.
I am not going to try and scan for the most likely DoA, but just presume it is contained in the KW ‘record’ buffer and use that for a short-duration voice command (post-KW until silence or timeout).
I guess KWS will just be driven from a single mic, and only the ASR will use the beamformer to help with directive noise, roughly as in the sketch below.
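
As a sketch of the whole loop (my own rough numpy pseudocode; estimate_tdoa stands for whatever GCC-PHAT routine ends up being used):

import collections
import numpy as np

FS = 16000
CHUNK = 320                   # 20 ms capture blocks
RING_CHUNKS = FS // CHUNK     # keep roughly 1 s, about the KW length

ring = collections.deque(maxlen=RING_CHUNKS)

def on_audio_chunk(chunk_2ch):
    # called for every capture block, chunk_2ch has shape [CHUNK, 2]
    ring.append(chunk_2ch)

def on_kw_hit(estimate_tdoa):
    # freeze the ring buffer, estimate the delay once, return a beamformer for the command
    frozen = np.concatenate(list(ring), axis=0)   # ~1 s that should mostly contain the KW
    delay = int(round(estimate_tdoa(frozen[:, 0], frozen[:, 1]) * FS))

    def delay_and_sum(chunk_2ch):
        # crude per-chunk integer shift; a real version would carry state across chunks
        shifted = np.roll(chunk_2ch[:, 1], delay)
        return 0.5 * (chunk_2ch[:, 0] + shifted)

    return delay_and_sum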

My last question, and apologies about all this and my noobness: if I was to use a ‘2 mic’ with only an x axis, does MVDR + SRP-PHAT offer any advantage over delay-and-sum beamforming + GCC-PHAT for a simple 2-mic single axis, or is delay-and-sum + GCC-PHAT the right presumption?

If you are dealing with only two mics, my gut feeling tells me that MVDR should be quite similar to delay-and-sum in terms of enhancement performance.

Basically, are 2 mics just basic, and does triangulation (more points) provide exponential accuracy gains with mic count, or is it approximately linear?
The examples show a 4-mic ‘square’ array with 50 mm spacing using 16 kHz samples; with 343 m/s at 16 kHz that gives ≈ 21 mm per sample. Is there any sort of golden ratio? Was 50 mm chosen as an example because it is a good minimum at that sample rate?
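
i.e. the arithmetic I am going from (my numbers, assuming c = 343 m/s, fs = 16 kHz, d = 50 mm):

c, fs, d = 343.0, 16000, 0.05
print(c / fs * 1000)   # ~21.4 mm of sound travel per sample
print(d / c * 1e6)     # ~145.8 us maximum inter-mic delay (endfire)
print(d / c * fs)      # ~2.3 samples of maximum lag at 16 kHz without interpolation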

Has anyone ever created a model for KW multichannel phase detection, or even modelled a beamformer similar to, say, some of the current NS & AEC models?

GitHub - SaneBow/PiDTLN: Apply machine learning model DTLN for noise suppression and acoustic echo cancellation on Raspberry Pi. I was talking to sanebow, who has done an amazing job of optimising the amazing GitHub - breizhn/DTLN: Tensorflow 2.x implementation of the DTLN real time speech denoising model. With TF-lite, ONNX and real-time audio processing support. Even though it is TensorFlow Lite, it still shows what is capable with quantised ‘mobile’ NN frameworks, and that is in Python.

@FrancoisGrondin @titouan.parcollet

PyTorch 1.10 now seems to work on Arm64, which is great, even if the KumaTea repo still has a problem with breakpad (but only for debug), so that is disabled.

I am going to give raspios_arm64 a go, as I have been waiting for this to be added for what seems a long time now :slight_smile:

from speechbrain.dataio.dataio import read_audio
from speechbrain.processing.features import STFT
from speechbrain.processing.multi_mic import Covariance
from speechbrain.processing.multi_mic import SrpPhat

import torch
import torchaudio

xs_speech = read_audio('speech_-0.82918_0.55279_-0.082918.flac') # [time, channels]
xs_speech = xs_speech.unsqueeze(0) # [batch, time, channels]
xs_noise_diff = read_audio('noise_diffuse.flac') # [time, channels]
xs_noise_diff = xs_noise_diff.unsqueeze(0) # [batch, time, channels]
xs_noise_loc = read_audio('noise_0.70225_-0.70225_0.11704.flac') # [time, channels]
xs_noise_loc = xs_noise_loc.unsqueeze(0) # [batch, time, channels]
fs = 16000 # sampling rate

ss = xs_speech
nn_diff = 0.05 * xs_noise_diff
nn_loc = 0.05 * xs_noise_loc
xs_diffused_noise = ss + nn_diff
xs_localized_noise = ss + nn_loc

mics = torch.zeros((4,3), dtype=torch.float)
mics[0,:] = torch.FloatTensor([-0.05, -0.05, +0.00])
mics[1,:] = torch.FloatTensor([-0.05, +0.05, +0.00])
mics[2,:] = torch.FloatTensor([+0.05, +0.05, +0.00])
mics[3,:] = torch.FloatTensor([+0.05, -0.05, +0.00]) # fourth corner of the square array (distinct from mic 2)

stft = STFT(sample_rate=fs)
cov = Covariance()
srpphat = SrpPhat(mics=mics)

Xs = stft(xs_diffused_noise)
XXs = cov(Xs)
doas = srpphat(XXs)

print(doas)

All seems to install OK and run but the result is

(venv) ubuntu@ubuntu:~/speechbrain$ python3 bf-test.py
tensor([[[0., 0., 1.],
         [0., 0., 1.],
         [0., 0., 1.],
         ...,
         [0., 0., 1.],
         [0., 0., 1.],
         [0., 0., 1.]]])

but expecting

tensor([[[-0.8284,  0.5570,  0.0588],
         [-0.8284,  0.5570,  0.0588],
         [-0.8284,  0.5570,  0.0588],
         ...,
         [-0.8284,  0.5570,  0.0588],
         [-0.8284,  0.5570,  0.0588],
         [-0.8284,  0.5570,  0.0588]]])

Does that mean torch 1.10 & torchaudio 0.10 are not supported, or is it just that the Arm64 version doesn’t seem to be working correctly?