I'm working on building an ASR system from scratch for Urdu and Punjabi as my thesis work (so I am not much of an expert). I have a large collection of WAV files from a call center, well above 2000 hours of audio. I want to build a Speech Emotion Recognition system (with voice fingerprinting) and a transcription system (to whose output I will apply sentiment analysis). Both need to work on live calls as well as on saved recordings.
Can anyone please show me a sample of how to set up the dataset (even if it is for English, Hindi, Mandarin, or some other language) so that I don't run into implementation issues? I just want to understand the architecture so that I only have to prepare the dataset once and don't have to worry about it again.