We were hoping to use SpeechBrain to fine-tune a pre-trained ASR model on an emotion recognition dataset like RAVDESS to classify the emotion of a speech signal. We’ve poked around the great tutorials but didn’t find a good match to get us going. Is SpeechBrain an appropriate toolkit for this task? Any pointers would be greatly appreciated!
Hey! Yes, SpeechBrain can do this. We actually have many people interested in doing emotion recognition. Doesn’t the last part of this tutorial help you understand how to do it? https://colab.research.google.com/drive/1LN7R3U3xneDgDRK2gC5MzGkLysCWxuC3?usp=sharing
Thank you for the speedy response. Hah, that’s the one we zeroed in on, but we weren’t sure it was the closest match. We’ll take a closer look now that we know our efforts aren’t going in the wrong direction. I will report back if we get stuck. Thanks again.
Hey, this looks really cool. I’m working on a similar project but making my own dataset, so I have just been cleaning that up for the last few days. I would love to see your notebook if you solve your problem. I would be happy to work with you on this problem over Discord if you want. I’m just trying to get better at working with this library, but I have been stuck fixing my data, so this looks like a cool way to actually get working with it. Good luck.
That sounds great @CupOfGeo. More brainpower is always a good thing. New to the SpeechBrain library as well. I’ve used HuggingFace but only with text. Hopefully, that’ll help smooth the learning curve a bit. If I get something working or get stuck I’ll let you know. Ditto for you. I’ve been working on it in fits and starts so it may be a bit before I reach out.
@CupOfGeo That’s cool that you’re working on this too
One hopeful subsequent goal after recognizing emotion may be recognizing social cues – things like signalling playfulness, or signalling seriousness, or signalling command… those are about cadence, a certain pattern in time of the sounds… and they’re about pitch, and about attack in certain spots, etc… Such cues actually layer on top of emotion… they are sort of like “side channels”…
Starting with Emotion may be a good first step towards that, because it has training samples available, so you can figure out the logistics with that…
Then, hopefully, training with the emotion will have helped to get the right kind of features needed for social cue patterns. At that point, it might be possible to generate training samples for social cues by finding snippets in YouTube or other videos… or perhaps make custom ones, exaggerated recordings that make it easier to learn…
Does that make sense?
(“social cues” may not be obvious… just think about the way a police officer talks to you when they pull you over… the sound of the voice is very different from the way someone flirting speaks… or the way someone being playful speaks… )
I’m doing a literature review this week for my undergrad final year project (crime and threat detection from audio with an RNN or LSTM, or maybe something else). I was wondering if anyone could suggest some papers that would give me some idea about how to approach the problem. Any paper that you’ve read or written yourself that deals with audio and some sort of classification task could help me with this one, I guess. I think the task of emotion recognition aligns with what I’m trying to do, which is why I’m writing here. Thanks🙂
I’m gonna try to look at the dataset later today. Down to chat about it tomorrow on Discord; I’ll DM you my name and times.
@seanhalle my idea is that I’m trying to clone a cartoon character’s voice and capture his different tones/emotions/inflections, like when he’s happy, angry, drunk and so on, but honestly I think that would be too much for me, so I was just starting with cloning the voice.
@khalid277144 I don’t know of any technical papers per se on using the technology. I do remember when it came out that, using just the words spoken during traffic stops by the Oakland Police Department (not even the audio), they could determine what race someone is. I did a quick Google search and found this PBS news article: Study slams troubled Oakland police department for racial bias | PBS NewsHour
Maybe Rebecca Hetey, from Stanford University, might have something for you, or at least for your paper. Bias is very important in AI (and easy to talk about to fill up a page).
@khalid277144 that’s a very interesting use… Is the goal to detect when someone’s speech pattern suggests they may be likely to commit a crime?
If so, it seems very aligned with the eventual goals of social cues. Threat appears to also be communicated by cadence, pitch, emphasis, volume, etc… thinking off the top of my head, I would guess that several kinds of cue combine… there’s more to think about here…
@CupOfGeo that sounds like fun. Then you can generate new cartoons without a voice actor, yes?
For that, we sound aligned on the first step, of effective emotion recognition.
Yeah, but that’s not what I want. It would be great if they could use it to help them make more toons, but I don’t want them to be automated away . . . I just want to clone the voice so that I can have it. I make my music for me, kinda deal.
I started making a notebook working with RAVDESS. I have never used PyTorch before. I started out with TensorFlow and haven’t ever had a reason to look at anything else till now. I’m looking into how to add some extra layers as a network head so the output has a shape of 8 (one score per emotion class) instead of a seq2seq output. I see in the git repo that they have other recipes for ASR, but for now I’m just playing around and screwing with stuff.
As you may have guessed, I didn’t go through the intro tutorials as much as I should have. https://gab41.lab41.org/speech-recognition-you-down-with-ctc-8d3b558943f0 Just learned what CTC is, so I got that going for me lol.
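For anyone following along: the 8 in that output shape matches the eight emotion categories in RAVDESS, which are encoded in the file names. Here is a minimal sketch of turning a file name into a class index, assuming the standard RAVDESS naming scheme (seven dash-separated fields, with the third one being the emotion code from 01 to 08). The function name `ravdess_label` and the example path are made up for illustration.

```python
from pathlib import Path

# Emotion codes 01..08 in RAVDESS file names, in order.
EMOTIONS = [
    "neutral", "calm", "happy", "sad",
    "angry", "fearful", "disgust", "surprised",
]


def ravdess_label(path):
    """Return (class_index, emotion_name) for a RAVDESS-style file name.

    Assumes names like '03-01-06-01-02-01-12.wav', where the third
    field is the emotion code (01 = neutral ... 08 = surprised).
    """
    fields = Path(path).stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"unexpected file name: {path}")
    code = int(fields[2])
    if not 1 <= code <= len(EMOTIONS):
        raise ValueError(f"unknown emotion code {code} in {path}")
    index = code - 1  # zero-based class index for the classifier
    return index, EMOTIONS[index]


# Hypothetical example path; the third field is 06, i.e. "fearful".
print(ravdess_label("Actor_12/03-01-06-01-02-01-12.wav"))  # (5, 'fearful')
```

The zero-based index is what a cross-entropy loss in PyTorch would expect as the target.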
I would actually not advise using a pre-trained ASR model for emotion recognition. An ASR model will probably throw out any information in the input speech not relevant to predicting the transcript (so, speaker identity, emotion, environmental noise…)
You would probably have better luck with an unsupervised feature extractor like wav2vec 2.0 (@titouan.parcollet just added it to SpeechBrain)
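One way to picture the suggestion above is to keep a pretrained feature extractor frozen and train only a small classification head on time-pooled features. The sketch below is plain PyTorch, not SpeechBrain's API. `StandInEncoder` is a hypothetical placeholder (a single untrained convolution) standing in for a real pretrained model such as wav2vec 2.0, and the feature size, batch, and labels are made up for illustration.

```python
import torch
import torch.nn as nn


class StandInEncoder(nn.Module):
    """Hypothetical placeholder for a pretrained self-supervised encoder.

    Maps raw audio [batch, samples] to frame features [batch, frames, dim].
    """

    def __init__(self, dim=32):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=400, stride=320)

    def forward(self, wav):
        feats = self.conv(wav.unsqueeze(1))  # [batch, dim, frames]
        return feats.transpose(1, 2)         # [batch, frames, dim]


class EmotionClassifier(nn.Module):
    """Frozen encoder, mean pooling over time, linear head."""

    def __init__(self, encoder, feat_dim, n_classes=8):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # train the head only
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, wav):
        with torch.no_grad():
            feats = self.encoder(wav)        # [batch, frames, dim]
        pooled = feats.mean(dim=1)           # one vector per utterance
        return self.head(pooled)             # [batch, n_classes]


model = EmotionClassifier(StandInEncoder(dim=32), feat_dim=32)
wav = torch.randn(4, 16000)                  # four fake 1-second clips at 16 kHz
labels = torch.tensor([0, 3, 4, 7])          # made-up class indices in 0..7
logits = model(wav)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                              # gradients reach the head only
print(logits.shape)
```

With a real pretrained encoder you would likely also put it in eval mode, and unfreezing it for fine-tuning is an option once the head trains sensibly.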
This tutorial might help you with that:
In general, we are interested in emotion recognition (e.g. IEMOCAP, etc.) and it would be great if some of you could share a baseline on this task.
In the end, how did you manage to use SB for emotion recognition?
Still in motion…stay tuned.
Well…to be honest…we’ve stalled out but hope to restart the engine again. A bit of a false start. Other priorities took over. We’ll update once we make some progress.
What about using PASE+ as a feature extractor plus some classifier, all as a simple starter?
I will try to help within 1-2 months, since I am finishing work on using pretrained models to fine-tune on small multilingual datasets, and also on using self-supervised representations to obtain emotion classification.
Question is: is simply branching and creating notebooks OK for a contribution?
Or do you have stricter, more guided rules?
It depends on what you want to do. Colab notebooks are fine for tutorials. If you want to contribute to the code of SpeechBrain, it gets more sophisticated, of course (but not hard).