Missing ARPA file for LM fine-tuning (NeMo / KenLM workflow)

#6
by kimthangg - opened

Hi, I’m trying to follow the NVIDIA NeMo / Riva tutorial for n-gram LM training and fine-tuning:
https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/asr-python-advanced-nemo-ngram-training-and-finetuning.html

From the tutorial, the workflow for LM “fine-tuning” requires:

  1. Generating an intermediate ARPA file
  2. Then using ngram_merge.py to interpolate two ARPA LMs (base + domain)
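For context, step 2 is a linear interpolation of two n-gram LMs. The toy sketch below shows the idea on unigram-only ARPA snippets (log10 probabilities, interpolated in probability space). It is purely illustrative and is not NeMo's ngram_merge.py; the word "xin_chao" and the toy probabilities are made up:

```python
import math

# Two toy unigram ARPA LMs (log10 probabilities). In a real workflow these
# would be the base LM and a domain LM, each trained with KenLM's lmplz.
BASE_ARPA = """\\data\\
ngram 1=2

\\1-grams:
-0.301030\thello
-0.301030\tworld

\\end\\
"""

DOMAIN_ARPA = """\\data\\
ngram 1=2

\\1-grams:
-0.301030\thello
-0.301030\txin_chao

\\end\\
"""

def parse_unigrams(arpa: str) -> dict:
    """Read word -> log10 probability pairs from the \\1-grams: section."""
    probs, in_section = {}, False
    for line in arpa.splitlines():
        line = line.strip()
        if line == "\\1-grams:":
            in_section = True
        elif in_section:
            if not line or line.startswith("\\"):
                break  # end of the 1-grams section
            logp, word = line.split()[:2]
            probs[word] = float(logp)
    return probs

def interpolate(lm_a: dict, lm_b: dict, lam: float = 0.5) -> dict:
    """Linear interpolation in probability space: p = lam*pa + (1-lam)*pb."""
    merged = {}
    for w in set(lm_a) | set(lm_b):
        pa = 10.0 ** lm_a[w] if w in lm_a else 0.0
        pb = 10.0 ** lm_b[w] if w in lm_b else 0.0
        merged[w] = math.log10(lam * pa + (1.0 - lam) * pb)
    return merged

merged = interpolate(parse_unigrams(BASE_ARPA), parse_unigrams(DOMAIN_ARPA))
for word, logp in sorted(merged.items()):
    print(f"{logp:.6f}\t{word}")
```

Both inputs here are needed in plain-text ARPA form, which is exactly why the missing .arpa file blocks this step.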

However, for this model (nvidia/parakeet-ctc-0.6b-Vietnamese), the repository only provides a KenLM .bin file and a lexicon; no .arpa file is included.

My questions:

  • Is the original ARPA LM for this model available somewhere?
  • If not, what is the recommended way to “fine-tune” or adapt the provided LM?
  • Should we retrain a new LM from text and use it directly?
  • Or is there a way to recover ARPA from the provided .bin?
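In case it helps frame the "retrain from text" option: as far as I know, KenLM's commands are lmplz (e.g. lmplz -o 4 < corpus.txt > lm.arpa) followed by build_binary lm.arpa lm.bin, and there is no shipped tool to dump a .bin back to .arpa. The toy sketch below only shows the ARPA file shape that pipeline produces, using unsmoothed unigram MLE on a made-up corpus; real training should of course use lmplz, which applies Kneser-Ney smoothing and higher orders:

```python
import math
from collections import Counter

def unigram_arpa(corpus: str) -> str:
    """Emit a minimal unigram ARPA LM (MLE, no smoothing) from raw text.
    Purely illustrative of the file format, not a replacement for lmplz."""
    counts = Counter(corpus.split())
    total = sum(counts.values())
    lines = ["\\data\\", f"ngram 1={len(counts)}", "", "\\1-grams:"]
    for word, c in counts.most_common():
        # ARPA stores log10 probabilities, tab-separated from the word
        lines.append(f"{math.log10(c / total):.6f}\t{word}")
    lines += ["", "\\end\\", ""]
    return "\n".join(lines)

arpa = unigram_arpa("xin chao xin loi cam on")
print(arpa)
```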

It seems that without the original ARPA, the official NeMo workflow for LM interpolation cannot be applied directly.

Any guidance would be appreciated. Thanks!
