The template YAML for ASRFromScratch uses a pretrained LM on line 38. Following the ASRFromScratch Colab, we train an LM locally, which I would like to use instead. The instructions in the YAML say simply to replace line 38 with "a local path pointing to a directory containing the lm.ckpt and tokenizer.ckpt". However:
- The tokenizer output in Tokenizer/save contains 1000_unigram.model and 1000_unigram.vocab; there is no file named tokenizer.ckpt. I found this clue, which I read as saying that I can ignore 1000_unigram.vocab and rename 1000_unigram.model to tokenizer.ckpt: how to set asr model trained from zero, tokenizer is 4257_unigram.model - Giters
- The LM run produces LM/results/RNNLM/save/CKPT+2022-01-31+10-42-23+01/model.ckpt; there is no file named lm.ckpt.
- The LM config file RNNLM.yaml uses embedding dimension 256, but ASR/train.yaml has cnn_channels: (128,256) on line 81 and emb_size: 128 on line 91. It seems that one or both of these must be changed to 256 to be compatible with the locally trained LM.
- The ASR config file also references a pretrained asr.ckpt, and commenting that out makes the training process fail. So the demo Colab is actually fine-tuning a pre-existing model. How do we create asr.ckpt from scratch?
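To make the renaming I'm guessing at concrete, here is a self-contained sketch of my current workaround. Dummy files in a scratch directory stand in for the real tokenizer/LM outputs; the `demo/` layout is hypothetical, and I'm assuming that renaming really is all the instructions intend:

```shell
# Scratch demo of the renaming I think the instructions imply.
# Dummy files stand in for the real outputs of the tokenizer and LM runs.
mkdir -p demo/Tokenizer/save demo/LM/save demo/local_pretrained
touch demo/Tokenizer/save/1000_unigram.model
touch demo/LM/save/model.ckpt

# Copy under the filenames the template YAML expects to find:
cp demo/Tokenizer/save/1000_unigram.model demo/local_pretrained/tokenizer.ckpt
cp demo/LM/save/model.ckpt demo/local_pretrained/lm.ckpt

ls demo/local_pretrained
```

Is assembling a directory like `demo/local_pretrained` (and pointing line 38 at it) the intended setup, or is there a step that actually produces tokenizer.ckpt and lm.ckpt?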
Please help me sort this out. Where should I find, or separately build, tokenizer.ckpt and lm.ckpt? And what should I do with lines 81 and 91 of ASR/train.yaml?
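For concreteness, this is roughly the pretrainer block I expect to end up with in ASR/train.yaml. The `pretrained_path` value and the directory name are hypothetical (my own local layout); the loadable names reflect my reading of the template and may not match it exactly:

```yaml
# Hypothetical: local_pretrained/ holds the renamed local checkpoints.
pretrained_path: ./local_pretrained

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>
    paths:
        lm: !ref <pretrained_path>/lm.ckpt
        tokenizer: !ref <pretrained_path>/tokenizer.ckpt
```

If training from scratch means I should simply drop the `model`/asr.ckpt loadable rather than comment out other lines, that would answer my last bullet above.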