How can I use my trained model after the 4th step?

Hello,

I managed to train the ASR model (all 4 steps of ASR from scratch).
I now have all the saved results in the output folder defined in the ./templates/speech_recognition/ASR/train.yaml file, specifically: output_folder: !ref results/CRDNN_BPE_960h_LM/<seed>.

Now that the model is trained, I have this structure in the ASR directory:

What is the next step if I want to transcribe an audio file with the model I trained, after the 4th step?

I have tried this:

from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="<same as output_folder from train.yaml file>", savedir="<directory that contains lm.ckpt, tokenizer.ckpt, hyperparams.yaml, model.ckpt>")
audio_file = '<some_audio_file>'
asr_model.transcribe_file(audio_file)

When I run that, all I get are errors.

Can you please explain how I can use the trained files from the 4th step to transcribe a file (do I need to make a custom hparams.yaml, and if so, how)?

Hi, it should be explained in Step 5 of the ASR tutorial.

@titouan.parcollet

Yeah, I followed every step:

  1. created a training.py file (in ASR) with this content:
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="./results/CRDNN_BPE_960h_LM/2602/", savedir="pretrained_model")
audio_file = './19-198-0000.flac'
asr_model.transcribe_file(audio_file)
  2. then I copied all the results inside <path_to_ASR>/ASR/:
    [screenshot of the directory contents]

  3. this is my hyperparams.yaml in 2602:

# Generated 2021-07-12 from:
# yamllint disable
# ############################################################################
# Model: E2E ASR with attention-based ASR
# Encoder: CRDNN
# Decoder: GRU + beamsearch + RNNLM
# Tokens: 1000 BPE
# losses: CTC+ NLL
# Training: mini-librispeech
# Pre-Training: librispeech 960h
# Authors:  Ju-Chieh Chou, Mirco Ravanelli, Abdel Heba, Peter Plantinga, Samuele Cornell 2020
# ############################################################################

# Seed needs to be set at top of yaml, before objects with parameters are instantiated
seed: 2602
__set_seed: !apply:torch.manual_seed [2602]

# If you plan to train a system on an HPC cluster with a big dataset,
# we strongly suggest doing the following:
# 1- Compress the dataset in a single tar or zip file.
# 2- Copy your dataset locally (i.e., the local disk of the computing node).
# 3- Uncompress the dataset in the local folder.
# 4- Set data_folder with the local path
# Reading data from the local disk of the compute node (e.g. $SLURM_TMPDIR with SLURM-based clusters) is very important.
# It allows you to read the data much faster without slowing down the shared filesystem.

data_folder: ../data # In this case, data will be automatically downloaded here.
data_folder_rirs: ../data            # noise/RIR dataset will automatically be downloaded here
output_folder: results/CRDNN_BPE_960h_LM/2602
wer_file: results/CRDNN_BPE_960h_LM/2602/wer.txt
save_folder: results/CRDNN_BPE_960h_LM/2602/save
train_log: results/CRDNN_BPE_960h_LM/2602/train_log.txt

# Language model (LM) pretraining
# NB: To avoid mismatch, the speech recognizer must be trained with the same
# tokenizer used for LM training. Here, we download everything from the
# speechbrain HuggingFace repository. However, a local path pointing to a
# directory containing the lm.ckpt and tokenizer.ckpt may also be specified
# instead. E.g if you want to use your own LM / tokenizer.
pretrained_path: speechbrain/asr-crdnn-rnnlm-librispeech


# Path where data manifest files will be stored. The data manifest files are created by the
# data preparation script
train_annotation: ../train.json
valid_annotation: ../valid.json
test_annotation: ../test.json

# The train logger writes training statistics to a file, as well as stdout.
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
  save_file: results/CRDNN_BPE_960h_LM/2602/train_log.txt

# Training parameters
number_of_epochs: 15
number_of_ctc_epochs: 5
batch_size: 2
lr: 1.0
ctc_weight: 0.5
sorting: ascending
ckpt_interval_minutes: 15 # save checkpoint every N min
label_smoothing: 0.1

# Dataloader options
train_dataloader_opts:
  batch_size: 2

valid_dataloader_opts:
  batch_size: 2

test_dataloader_opts:
  batch_size: 2


# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 40

# Model parameters
activation: &id001 !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 2
cnn_channels: (128, 256)
inter_layer_pooling_size: (2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: &id002 !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 4
rnn_neurons: 1024
rnn_bidirectional: true
dnn_blocks: 2
dnn_neurons: 512
emb_size: 128
dec_neurons: 1024
output_neurons: 1000  # Number of tokens (same as LM)
blank_index: 0
bos_index: 0
eos_index: 0
unk_index: 0

# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 1.0
valid_beam_size: 8
test_beam_size: 80
eos_threshold: 1.5
using_max_attn_shift: true
max_attn_shift: 240
lm_weight: 0.50
ctc_weight_decode: 0.0
coverage_penalty: 1.5
temperature: 1.25
temperature_lm: 1.25

# The first object passed to the Brain class is this "Epoch Counter"
# which is saved by the Checkpointer so that training can be resumed
# if it gets interrupted at any point.
epoch_counter: &id013 !new:speechbrain.utils.epoch_loop.EpochCounter
  limit: 15

# Feature extraction
compute_features: !new:speechbrain.lobes.features.Fbank
  sample_rate: 16000
  n_fft: 400
  n_mels: 40

# Feature normalization (mean and std)
normalize: &id008 !new:speechbrain.processing.features.InputNormalization
  norm_type: global

# Added noise and reverb come from OpenRIR dataset, automatically
# downloaded and prepared with this Environmental Corruption class.
env_corrupt: &id009 !new:speechbrain.lobes.augment.EnvCorrupt
  openrir_folder: ../data
  babble_prob: 0.0
  reverb_prob: 0.0
  noise_prob: 1.0
  noise_snr_low: 0
  noise_snr_high: 15

# Adds speed change + time and frequency dropouts (time-domain implementation).
augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
  sample_rate: 16000
  speeds: [95, 100, 105]

# The CRDNN model is an encoder that combines CNNs, RNNs, and DNNs.
encoder: &id003 !new:speechbrain.lobes.models.CRDNN.CRDNN
  input_shape: [null, null, 40]
  activation: *id001
  dropout: 0.15
  cnn_blocks: 2
  cnn_channels: (128, 256)
  cnn_kernelsize: (3, 3)
  inter_layer_pooling_size: (2, 2)
  time_pooling: true
  using_2d_pooling: false
  time_pooling_size: 4
  rnn_class: *id002
  rnn_layers: 4
  rnn_neurons: 1024
  rnn_bidirectional: true
  rnn_re_init: true
  dnn_blocks: 2
  dnn_neurons: 512
  use_rnnp: false

# Embedding (from indexes to an embedding space of dimension emb_size).
embedding: &id004 !new:speechbrain.nnet.embedding.Embedding
  num_embeddings: 1000
  embedding_dim: 128

# Attention-based RNN decoder.
decoder: &id005 !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
  enc_dim: 512
  input_size: 128
  rnn_type: gru
  attn_type: location
  hidden_size: 1024
  attn_dim: 1024
  num_layers: 1
  scaling: 1.0
  channels: 10
  kernel_size: 100
  re_init: true
  dropout: 0.15

# Linear transformation on the top of the encoder.
ctc_lin: &id006 !new:speechbrain.nnet.linear.Linear
  input_size: 512
  n_neurons: 1000

# Linear transformation on the top of the decoder.
seq_lin: &id007 !new:speechbrain.nnet.linear.Linear
  input_size: 1024
  n_neurons: 1000

# Final softmax (for log posteriors computation).
log_softmax: !new:speechbrain.nnet.activations.Softmax
  apply_log: true

# Cost definition for the CTC part.
ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
  blank_index: 0


# Tokenizer initialization
tokenizer: &id014 !new:sentencepiece.SentencePieceProcessor

# Objects in "modules" dict will have their parameters moved to the correct
# device, as well as having train()/eval() called on them by the Brain class
modules:
  encoder: *id003
  embedding: *id004
  decoder: *id005
  ctc_lin: *id006
  seq_lin: *id007
  normalize: *id008
  env_corrupt: *id009
  # This is the RNNLM that is used according to the Huggingface repository
  # NB: It has to match the pre-trained RNNLM!!
  lm_model: &id010 !new:speechbrain.lobes.models.RNNLM.RNNLM
    output_neurons: 1000
    embedding_dim: 128
    activation: !name:torch.nn.LeakyReLU
    dropout: 0.0
    rnn_layers: 2
    rnn_neurons: 2048
    dnn_blocks: 1
    dnn_neurons: 512
    return_hidden: true  # For inference

# Gathering all the submodels in a single model object.
model: &id011 !new:torch.nn.ModuleList
- - *id003
  - *id004
  - *id005
  - *id006
  - *id007
lm_model: *id010

# Beamsearch is applied on the top of the decoder. If the language model is
# given, a language model is applied (with a weight specified in lm_weight).
# If ctc_weight is set, the decoder uses CTC + attention beamsearch. This
# improves the performance, but slows down decoding. For a description of
# the other parameters, please see speechbrain.decoders.S2SRNNBeamSearchLM.

# It makes sense to have a lighter search during validation. In this case,
# we don't use the LM and CTC probabilities during decoding.
valid_search: !new:speechbrain.decoders.S2SRNNBeamSearcher
  embedding: *id004
  decoder: *id005
  linear: *id007
  ctc_linear: *id006
  bos_index: 0
  eos_index: 0
  blank_index: 0
  min_decode_ratio: 0.0
  max_decode_ratio: 1.0
  beam_size: 8
  eos_threshold: 1.5
  using_max_attn_shift: true
  max_attn_shift: 240
  coverage_penalty: 1.5
  temperature: 1.25

# The final decoding on the test set can be more computationally demanding.
# In this case, we use the LM + CTC probabilities during decoding as well.
# Please, remove this part if you need a faster decoder.
test_search: !new:speechbrain.decoders.S2SRNNBeamSearchLM
  embedding: *id004
  decoder: *id005
  linear: *id007
  ctc_linear: *id006
  language_model: *id010
  bos_index: 0
  eos_index: 0
  blank_index: 0
  min_decode_ratio: 0.0
  max_decode_ratio: 1.0
  beam_size: 80
  eos_threshold: 1.5
  using_max_attn_shift: true
  max_attn_shift: 240
  coverage_penalty: 1.5
  lm_weight: 0.50
  ctc_weight: 0.0
  temperature: 1.25
  temperature_lm: 1.25

# This function manages learning rate annealing over the epochs.
# We here use the NewBoB algorithm, that anneals the learning rate if
# the improvements over two consecutive epochs is less than the defined
# threshold.
lr_annealing: &id012 !new:speechbrain.nnet.schedulers.NewBobScheduler
  initial_value: 1.0
  improvement_threshold: 0.0025
  annealing_factor: 0.8
  patient: 0

# This optimizer will be constructed by the Brain class after all parameters
# are moved to the correct device. Then it will be added to the checkpointer.
opt_class: !name:torch.optim.Adadelta
  lr: 1.0
  rho: 0.95
  eps: 1.e-8

# Functions that compute the statistics to track during the validation step.
error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
  split_tokens: true

# This object is used for saving the state of training both so that it
# can be resumed if it gets interrupted, and also so that the best checkpoint
# can be later loaded for evaluation or inference.
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
  checkpoints_dir: results/CRDNN_BPE_960h_LM/2602/save
  recoverables:
    model: *id011
    scheduler: *id012
    normalizer: *id008
    counter: *id013

# This object is used to pretrain the language model and the tokenizers
# (defined above). In this case, we also pretrain the ASR model (to make
# sure the model converges on a small amount of data).
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
  collect_in: results/CRDNN_BPE_960h_LM/2602/save
  loadables:
    lm: *id010
    tokenizer: *id014
    model: *id011
  paths:
    lm: ./results/CRDNN_BPE_960h_LM/2602/save/lm.ckpt
    tokenizer: ./results/CRDNN_BPE_960h_LM/2602/save/tokenizer.ckpt
    model: ./results/CRDNN_BPE_960h_LM/2602/save/model.ckpt
  4. in ASR/ I placed 19-198-0000.flac

  5. after that I ran:
    python3 training.py
    and got this error:

Can you please help me @titouan.parcollet? I do not understand what I am doing wrong.

@Satwik Do you maybe know how to use the pretrained model from the 4th step?

I am facing the same issue. If you (@Grga) have found any solution, please share!

@titouan.parcollet can you please help with this?

Sorry for the late reply; the yaml is wrong. The key here is the yaml: the training and the deployment YAML files are different. As the tutorial explains, the encoder defined in the deployment yaml is:

encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
    input_shape: [null, null, !ref <n_mels>]
    compute_features: !ref <compute_features>
    normalize: !ref <normalize>
    model: !ref <enc>

You can see that the encoder is the full encoding pipeline, not just the CRDNN: it is features / normalize / CRDNN. This enables users to have very different encoder pipelines while still being able to use this interface. Have a look at the HuggingFace .yaml files we have for our ASR models to get a better idea. (But the tutorial should be sufficient :()
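For this recipe, the remaining deployment-side entries could look roughly like the sketch below. This is only an illustration, not the exact tutorial or HuggingFace yaml: it assumes the training-time objects are defined earlier in the same file under names like <enc> (the CRDNN), <emb>, <dec> (the AttentionalRNNDecoder), <ctc_lin>, <seq_lin>, <lm_model>, <tokenizer> and <asr_model> (the ModuleList gathering the submodels), so that the top-level encoder and decoder names are free for the inference pipeline, and that the checkpoint paths point at your local save folder.

# Deployment decoder: the same LM-rescored beamsearch as test_search above.
decoder: !new:speechbrain.decoders.S2SRNNBeamSearchLM
    embedding: !ref <emb>
    decoder: !ref <dec>
    linear: !ref <seq_lin>
    ctc_linear: !ref <ctc_lin>
    language_model: !ref <lm_model>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    coverage_penalty: !ref <coverage_penalty>
    lm_weight: !ref <lm_weight>
    ctc_weight: !ref <ctc_weight_decode>
    temperature: !ref <temperature>
    temperature_lm: !ref <temperature_lm>

# Module names used by the pretrained interface (encoder / decoder), plus the
# LM that the beamsearcher uses internally.
modules:
    encoder: !ref <encoder>
    decoder: !ref <decoder>
    lm_model: !ref <lm_model>

# Load your own checkpoints from the 4th step instead of the HuggingFace ones.
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        model: !ref <asr_model>
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>
    paths:
        model: ./results/CRDNN_BPE_960h_LM/2602/save/model.ckpt
        lm: ./results/CRDNN_BPE_960h_LM/2602/save/lm.ckpt
        tokenizer: ./results/CRDNN_BPE_960h_LM/2602/save/tokenizer.ckpt

The beam-search settings are simply the ones from your test_search block, so nothing new has to be invented there, and if I remember correctly the interface also expects a top-level tokenizer entry, which your yaml already has.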

So what I should do here is replace the encoder, decoder, and modules fields in the hyperparams.yaml with the ones from the tutorial, right?

If you have the same setup, yes. Otherwise, you should describe yours. The yaml is what describes your experiment.
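Once the deployment yaml describes your setup, the call from the first post should work against that folder. A minimal sketch (the folder name here is just an example; it must contain the adapted hyperparams.yaml together with the model.ckpt, lm.ckpt and tokenizer.ckpt saved by training):

from speechbrain.pretrained import EncoderDecoderASR

# "my_asr_inference" is a hypothetical local folder holding the deployment
# hyperparams.yaml plus the checkpoints copied from
# results/CRDNN_BPE_960h_LM/2602/save/.
asr_model = EncoderDecoderASR.from_hparams(
    source="my_asr_inference",
    savedir="pretrained_model",
)
print(asr_model.transcribe_file("./19-198-0000.flac"))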

So after I modify the encoder field like this:

I get this error instead:

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

TypeError: forward() missing 1 required positional argument: 'wav_len'

I’m confused :slight_smile:

This is the inference yaml of that model.
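Regarding the wav_len error above: if I read the pretrained interface correctly, transcribe_file ends up calling the deployment encoder with both the batched waveforms and their relative lengths, so whatever you put under encoder has to accept (and forward) that second argument. That is what LengthsCapableSequential is for; a container that only takes the features can raise exactly this kind of missing-argument error. A rough sketch of the call path, simplified and with a dummy waveform (not the actual library source):

import torch
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_model",
)

wavs = torch.randn(1, 16000)      # dummy 1-second batch of audio
wav_lens = torch.tensor([1.0])    # relative lengths in [0, 1]

# The "encoder" from the yaml is called with both tensors, so it must be a
# lengths-aware container (features -> normalize -> CRDNN).
encoder_out = asr_model.mods.encoder(wavs, wav_lens)

# The "decoder" (the beamsearcher) also receives the lengths.
predicted_tokens, scores = asr_model.mods.decoder(encoder_out, wav_lens)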