Missing ARPA file for LM fine-tuning (NeMo / KenLM workflow)

#6
by kimthangg - opened

Hi, I’m trying to follow the NVIDIA NeMo / Riva tutorial for n-gram LM training and fine-tuning:
https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/asr-python-advanced-nemo-ngram-training-and-finetuning.html

From the tutorial, the workflow for LM “fine-tuning” requires:

  1. Generating an intermediate ARPA file
  2. Then using ngram_merge.py to interpolate two ARPA LMs (base + domain)
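For context, step 2 is a linear interpolation of two n-gram LMs. The toy sketch below shows the idea on unigram-only ARPA snippets (log10 probabilities, interpolated in probability space). It is purely illustrative and is not NeMo's ngram_merge.py; the word "xin_chao" and the toy probabilities are made up:

```python
import math

# Two toy unigram ARPA LMs (log10 probabilities). In a real workflow these
# would be the base LM and a domain LM, each trained with KenLM's lmplz.
BASE_ARPA = """\\data\\
ngram 1=2

\\1-grams:
-0.301030\thello
-0.301030\tworld

\\end\\
"""

DOMAIN_ARPA = """\\data\\
ngram 1=2

\\1-grams:
-0.301030\thello
-0.301030\txin_chao

\\end\\
"""

def parse_unigrams(arpa: str) -> dict:
    """Read word -> log10 probability pairs from the \\1-grams: section."""
    probs, in_section = {}, False
    for line in arpa.splitlines():
        line = line.strip()
        if line == "\\1-grams:":
            in_section = True
        elif in_section:
            if not line or line.startswith("\\"):
                break  # end of the 1-grams section
            logp, word = line.split()[:2]
            probs[word] = float(logp)
    return probs

def interpolate(lm_a: dict, lm_b: dict, lam: float = 0.5) -> dict:
    """Linear interpolation in probability space: p = lam*pa + (1-lam)*pb."""
    merged = {}
    for w in set(lm_a) | set(lm_b):
        pa = 10.0 ** lm_a[w] if w in lm_a else 0.0
        pb = 10.0 ** lm_b[w] if w in lm_b else 0.0
        merged[w] = math.log10(lam * pa + (1.0 - lam) * pb)
    return merged

merged = interpolate(parse_unigrams(BASE_ARPA), parse_unigrams(DOMAIN_ARPA))
for word, logp in sorted(merged.items()):
    print(f"{logp:.6f}\t{word}")
```

Both inputs here are needed in plain-text ARPA form, which is exactly why the missing .arpa file blocks this step.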

However, for this model (nvidia/parakeet-ctc-0.6b-Vietnamese), the repository only provides a KenLM .bin file and a lexicon; no .arpa file is included.

My questions:

  • Is the original ARPA LM for this model available somewhere?
  • If not, what is the recommended way to “fine-tune” or adapt the provided LM?
  • Should we retrain a new LM from text and use it directly?
  • Or is there a way to recover ARPA from the provided .bin?
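In case it helps frame the "retrain from text" option: as far as I know, KenLM's commands are lmplz (e.g. lmplz -o 4 < corpus.txt > lm.arpa) followed by build_binary lm.arpa lm.bin, and there is no shipped tool to dump a .bin back to .arpa. The toy sketch below only shows the ARPA file shape that pipeline produces, using unsmoothed unigram MLE on a made-up corpus; real training should of course use lmplz, which applies Kneser-Ney smoothing and higher orders:

```python
import math
from collections import Counter

def unigram_arpa(corpus: str) -> str:
    """Emit a minimal unigram ARPA LM (MLE, no smoothing) from raw text.
    Purely illustrative of the file format, not a replacement for lmplz."""
    counts = Counter(corpus.split())
    total = sum(counts.values())
    lines = ["\\data\\", f"ngram 1={len(counts)}", "", "\\1-grams:"]
    for word, c in counts.most_common():
        # ARPA stores log10 probabilities, tab-separated from the word
        lines.append(f"{math.log10(c / total):.6f}\t{word}")
    lines += ["", "\\end\\", ""]
    return "\n".join(lines)

arpa = unigram_arpa("xin chao xin loi cam on")
print(arpa)
```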

It seems that without the original ARPA, the official NeMo workflow for LM interpolation cannot be applied directly.

Any guidance would be appreciated. Thanks!
