FORMAT OF DATASET FOR MAKING A CUSTOM-LANGUAGE ASR SYSTEM (Speech Emotion and Speech-to-Text)

Hello.

I'm working on building an ASR system from scratch in Urdu and Punjabi as my thesis work (so I am not much of an expert). I have a huge amount of WAV data available from a call center, well above 2000 hours of audio. I want to build a Speech Emotion Recognition system (with voice fingerprinting) and a transcription system (on which I apply sentiment analysis). It is to be applied to live calls and to saved data as well.

Can anyone please show me a sample of how to set up the dataset (even if it is in English, Hindi, Mandarin, or some other language) so that I don't face implementation issues? I just want to understand the architecture so that I do the dataset work once and then don't have to worry about it again.

Regards

What do you mean by “set up the dataset”?

Hi, you don’t have to “set up” the dataset. I suggest you take a look at our data input pipeline tutorial first :slight_smile:

In practice you just need to create a CSV or JSON file that contains all the required fields. Have a look at the ones provided with each of our recipes for examples :smiley:
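
For instance, a minimal JSON manifest could look like the sketch below. The exact field names depend on the recipe you start from; `wav`, `length`, and `words` here are just illustrative placeholders, and so are the paths:

```json
{
  "spkr1_sent1": {
    "wav": "/data/calls/spkr1_sent1.wav",
    "length": 3.42,
    "words": "transcript of the first sentence"
  },
  "spkr1_sent2": {
    "wav": "/data/calls/spkr1_sent2.wav",
    "length": 5.10,
    "words": "transcript of the second sentence"
  }
}
```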

I've gone through the tutorial. I just need to understand this thing. Here we go:

I want to build a language model and an acoustic model for my ASR system (speech-to-text) in Urdu. I have a huge database of call records (more than 5000 hours of audio).

Do I need to break the audio into words? Sentences? Or complete paragraphs?

Do I have to make a corresponding transcript for each audio file?

Where do I keep the Urdu language dictionary?
Where do I save the audio?
I just need to see a sample so that I can relate it to my own data and do it myself. I'm unable to connect everything in the tutorial to the Urdu language.

Ok. Isn’t the ASR from scratch tutorial helping at all? Nor this template? :slight_smile:

The best thing is to have sentence-aligned audio/transcript pairs. Sentences are usually long enough to enable training and short enough to be tractable in terms of resources. As for the dictionary, it all comes down to the type of ASR system you want to train. If it is based on BPE, you won't have to manage one, as BPE will discover it for you. The same goes if you use characters instead of BPE.
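
As a rough sketch of the BPE route: the SentencePiece library is one common way to let the subword vocabulary be discovered directly from your transcripts. The file names and vocabulary size below are assumptions for illustration, not values from any particular recipe:

```python
# Minimal sketch: learn a BPE vocabulary from the transcripts, so no
# hand-made dictionary is needed. File names and vocab_size below are
# illustrative assumptions, not recipe defaults.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="transcripts.txt",     # one Urdu sentence per line (UTF-8)
    model_prefix="urdu_bpe",     # writes urdu_bpe.model and urdu_bpe.vocab
    vocab_size=1000,             # tune to the size of your corpus
    model_type="bpe",            # "char" would give a character-level model
    character_coverage=1.0,      # full coverage matters for non-Latin scripts
)

# The trained model then splits any sentence into subword pieces:
sp = spm.SentencePieceProcessor(model_file="urdu_bpe.model")
print(sp.encode("a sample sentence", out_type=str))
```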

Audio can be stored anywhere, as long as the correct path is given for each sentence in your CSV or JSON input files :slight_smile:
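
For instance, a CSV manifest with absolute paths might look like this; the column names are again only an assumption, so check the recipe you start from (note the two different storage locations, which is fine as long as each path is correct):

```csv
ID,duration,wav,spk_id,wrd
spkr1_sent1,3.42,/data/calls/spkr1_sent1.wav,spkr1,transcript of the first sentence
spkr1_sent2,5.10,/mnt/archive/spkr1_sent2.wav,spkr1,transcript of the second sentence
```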

Thank you so much, sir, for replying. I have access to more than 5000 hours of audio data in Urdu. I need a good amount of data for good training, and I finally have it now from a call center :wink:

So what I understand is: I keep a folder containing audio files of single spoken sentences together with their transcripts. Each audio file will be named like Spkr#_Sent#.wav (with # running from 1 to 1 million, lol). The transcript of each will be saved as a .txt file alongside it.

There will be a JSON file that gives the location of each audio file and its corresponding text (as in step 1 of the Google Colab).
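
Something like the sketch below is what I have in mind for generating that JSON file; the folder path and the key names are placeholders I made up, to be matched to whatever the recipe expects:

```python
# Rough sketch: build a JSON manifest from a folder of Spkr#_Sent#.wav
# clips, each sitting next to a Spkr#_Sent#.txt transcript. The key
# names ("wav", "length", "words") are placeholders, not fixed names.
import json
import wave
from pathlib import Path

data_dir = Path("/data/urdu_calls")  # made-up location of the clips
manifest = {}

for wav_path in sorted(data_dir.glob("*.wav")):
    txt_path = wav_path.with_suffix(".txt")
    if not txt_path.exists():
        continue  # skip clips that have no transcript yet
    with wave.open(str(wav_path)) as w:
        length = w.getnframes() / w.getframerate()  # duration in seconds
    manifest[wav_path.stem] = {
        "wav": str(wav_path),
        "length": round(length, 2),
        "words": txt_path.read_text(encoding="utf-8").strip(),
    }

Path("train.json").write_text(
    json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8"
)
```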

The next step will be tokenization, which will be done by the script, so I don't need to do much on that front. I will also not be going for BPE, so there is no need for a dictionary, word2vec, seq, phoneme set, utterance dictionary, or anything else. All I need is the audio and the corresponding transcripts.

Moreover, the template shown can be used for languages other than English, right?