Export ConvEmformer transducer models to ncnn

We use the pre-trained model from https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05 as an example.

We will show you step by step how to export it to ncnn and run it with sherpa-ncnn.

Hint

We use Ubuntu 18.04, torch 1.13, and Python 3.8 for testing.

Caution

Please use a more recent version of PyTorch. For instance, torch 1.8 may not work.

1. Download the pre-trained model

Hint

You can also refer to https://k2-fsa.github.io/sherpa/cpp/pretrained_models/online_transducer.html#icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05 to download the pre-trained model.

You have to install git-lfs before you continue.

cd egs/librispeech/ASR

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05

git lfs pull --include "exp/pretrained-epoch-30-avg-10-averaged.pt"
git lfs pull --include "data/lang_bpe_500/bpe.model"

cd ..

Note

We downloaded exp/pretrained-xxx.pt, not exp/cpu-jit_xxx.pt.

In the above code, we downloaded the pre-trained model into the directory egs/librispeech/ASR/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05.

2. Install ncnn and pnnx

# We put ncnn into $HOME/open-source/ncnn
# You can change it to anywhere you like

cd $HOME
mkdir -p open-source
cd open-source

git clone https://github.com/csukuangfj/ncnn
cd ncnn
git submodule update --recursive --init

# Note: We don't use "python setup.py install" or "pip install ." here

mkdir -p build-wheel
cd build-wheel

cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DNCNN_PYTHON=ON \
  -DNCNN_BUILD_BENCHMARK=OFF \
  -DNCNN_BUILD_EXAMPLES=OFF \
  -DNCNN_BUILD_TOOLS=ON \
..

make -j4

cd ..

# Note: $PWD here is $HOME/open-source/ncnn

export PYTHONPATH=$PWD/python:$PYTHONPATH
export PATH=$PWD/tools/pnnx/build/src:$PATH
export PATH=$PWD/build-wheel/tools/quantize:$PATH

# Now build pnnx
cd tools/pnnx
mkdir build
cd build
cmake ..
make -j4

./src/pnnx

Congratulations! You have successfully installed the following components:

  • pnnx, which is an executable located in $HOME/open-source/ncnn/tools/pnnx/build/src. We will use it to convert models exported by torch.jit.trace().

  • ncnn2int8, which is an executable located in $HOME/open-source/ncnn/build-wheel/tools/quantize. We will use it to quantize our models to int8.

  • ncnn.cpython-38-x86_64-linux-gnu.so, which is a Python module located in $HOME/open-source/ncnn/python/ncnn.

    Note

    I am using Python 3.8, so it is ncnn.cpython-38-x86_64-linux-gnu.so. If you use a different version, say, Python 3.9, the name would be ncnn.cpython-39-x86_64-linux-gnu.so.

    Also, if you are not using Linux, the file name would also be different. But that does not matter. As long as you can compile it, it should work.

We have set up PYTHONPATH so that you can use import ncnn in your Python code. We have also set up PATH so that you can use pnnx and ncnn2int8 later in your terminal.
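If you are not sure what the compiled module will be called for your Python version and platform, Python itself can tell you. This is just an optional quick check; sysconfig is from the standard library and nothing here depends on ncnn:

```python
import sysconfig

# The compiled ncnn module is named "ncnn" + EXT_SUFFIX for your platform.
# With Python 3.8 on x86_64 Linux this prints
# "ncnn.cpython-38-x86_64-linux-gnu.so"; on other platforms it differs.
suffix = sysconfig.get_config_var("EXT_SUFFIX")
print("Expected module file name:", "ncnn" + suffix)
```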

Caution

Please don’t use https://github.com/tencent/ncnn. We have made some modifications to the official ncnn.

We will synchronize https://github.com/csukuangfj/ncnn periodically with the official one.

3. Export the model via torch.jit.trace()

First, let us rename our pre-trained model:

cd egs/librispeech/ASR

cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp

ln -s pretrained-epoch-30-avg-10-averaged.pt epoch-30.pt

cd ../..

Next, we use the following code to export our model:

dir=./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/

./conv_emformer_transducer_stateless2/export-for-ncnn.py \
  --exp-dir $dir/exp \
  --tokens $dir/data/lang_bpe_500/tokens.txt \
  --epoch 30 \
  --avg 1 \
  --use-averaged-model 0 \
  --num-encoder-layers 12 \
  --chunk-length 32 \
  --cnn-module-kernel 31 \
  --left-context-length 32 \
  --right-context-length 8 \
  --memory-size 32 \
  --encoder-dim 512

Caution

If your model has different configuration parameters, please change them accordingly.

Hint

We have renamed our model to epoch-30.pt so that we can use --epoch 30. There is only one pre-trained model, so we use --avg 1 --use-averaged-model 0.

If you have trained a model yourself and have all checkpoints available, please first use decode.py to tune --epoch and --avg, and select the best combination with --use-averaged-model 1.

Note

You will see the following log output:

2023-01-11 12:15:38,677 INFO [export-for-ncnn.py:220] device: cpu
2023-01-11 12:15:38,681 INFO [export-for-ncnn.py:229] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 3000, 'env_info': {'k2-version': '1.23.2', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'a34171ed85605b0926eebbd0463d059431f4f74a', 'k2-git-date': 'Wed Dec 14 00:06:38 2022', 'lhotse-version': '1.12.0.dev+missing.version.file', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': False, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'fix-stateless3-train-2022-12-27', 'icefall-git-sha1': '530e8a1-dirty', 'icefall-git-date': 'Tue Dec 27 13:59:18 2022', 'icefall-path': '/star-fj/fangjun/open-source/icefall', 'k2-path': '/star-fj/fangjun/open-source/k2/k2/python/k2/__init__.py', 'lhotse-path': '/star-fj/fangjun/open-source/lhotse/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-3-1220120619-7695ff496b-s9n4w', 'IP address': '127.0.0.1'}, 'epoch': 30, 'iter': 0, 'avg': 1, 'exp_dir': PosixPath('icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp'), 'bpe_model': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05//data/lang_bpe_500/bpe.model', 'jit': False, 'context_size': 2, 'use_averaged_model': False, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'cnn_module_kernel': 31, 'left_context_length': 32, 'chunk_length': 32, 'right_context_length': 8, 'memory_size': 32, 'blank_id': 0, 'vocab_size': 500}
2023-01-11 12:15:38,681 INFO [export-for-ncnn.py:231] About to create model
2023-01-11 12:15:40,053 INFO [checkpoint.py:112] Loading checkpoint from icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/epoch-30.pt
2023-01-11 12:15:40,708 INFO [export-for-ncnn.py:315] Number of model parameters: 75490012
2023-01-11 12:15:41,681 INFO [export-for-ncnn.py:318] Using torch.jit.trace()
2023-01-11 12:15:41,681 INFO [export-for-ncnn.py:320] Exporting encoder
2023-01-11 12:15:41,682 INFO [export-for-ncnn.py:149] chunk_length: 32, right_context_length: 8

The log shows the model has 75490012 parameters, i.e., ~75 M.

ls -lh icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/pretrained-epoch-30-avg-10-averaged.pt

-rw-r--r-- 1 kuangfangjun root 289M Jan 11 12:05 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/pretrained-epoch-30-avg-10-averaged.pt

You can see that the file size of the pre-trained model is 289 MB, which is roughly equal to 75490012*4/1024/1024 = 287.97 MB.
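The estimate above can be reproduced with a few lines of Python:

```python
num_params = 75490012  # number of model parameters, from the export log

# Each parameter is stored as a float32, which occupies 4 bytes.
size_mb = num_params * 4 / 1024 / 1024
print(f"{size_mb:.2f} MB")  # 287.97 MB, close to the 289 MB file size
```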

After running conv_emformer_transducer_stateless2/export-for-ncnn.py, we will get the following files:

ls -lh icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/*pnnx*

-rw-r--r-- 1 kuangfangjun root 1010K Jan 11 12:15 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.pt
-rw-r--r-- 1 kuangfangjun root  283M Jan 11 12:15 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.pt
-rw-r--r-- 1 kuangfangjun root  3.0M Jan 11 12:15 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.pt

4. Export torchscript model via pnnx

Hint

Make sure you have set up the PATH environment variable. Otherwise, it will throw an error saying that pnnx could not be found.

Now, it’s time to export our models to ncnn via pnnx.

cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/

pnnx ./encoder_jit_trace-pnnx.pt
pnnx ./decoder_jit_trace-pnnx.pt
pnnx ./joiner_jit_trace-pnnx.pt

It will generate the following files:

ls -lh  icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/*ncnn*{bin,param}

-rw-r--r-- 1 kuangfangjun root 503K Jan 11 12:38 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root  437 Jan 11 12:38 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 142M Jan 11 12:36 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root  79K Jan 11 12:36 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 1.5M Jan 11 12:38 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root  488 Jan 11 12:38 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.param

There are two types of files:

  • param: It is a text file containing the model architectures. You can use a text editor to view its content.

  • bin: It is a binary file containing the model parameters.

We compare the file sizes of the models below before and after converting via pnnx:

File name                        File size
encoder_jit_trace-pnnx.pt        283 MB
decoder_jit_trace-pnnx.pt        1010 KB
joiner_jit_trace-pnnx.pt         3.0 MB
encoder_jit_trace-pnnx.ncnn.bin  142 MB
decoder_jit_trace-pnnx.ncnn.bin  503 KB
joiner_jit_trace-pnnx.ncnn.bin   1.5 MB

You can see that the file sizes of the models after conversion are about half of those before conversion:

  • encoder: 283 MB vs 142 MB

  • decoder: 1010 KB vs 503 KB

  • joiner: 3.0 MB vs 1.5 MB

The reason is that, by default, pnnx converts float32 parameters to float16. A float32 parameter occupies 4 bytes, while a float16 parameter occupies only 2 bytes. Thus, the converted model is roughly half the size.
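You can verify the estimate for the encoder with a quick calculation:

```python
fp32_bytes = 4  # bytes per float32 parameter
fp16_bytes = 2  # bytes per float16 parameter

encoder_fp32_mb = 283  # size of encoder_jit_trace-pnnx.pt
expected_fp16_mb = encoder_fp32_mb * fp16_bytes / fp32_bytes
print(expected_fp16_mb, "MB")  # 141.5 MB, close to the observed 142 MB
```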

Hint

If you use pnnx ./encoder_jit_trace-pnnx.pt fp16=0, then pnnx won’t convert float32 to float16.

5. Test the exported models in icefall

Note

We assume you have set up the environment variable PYTHONPATH when building ncnn.

Now we have successfully converted our pre-trained model to the ncnn format. The 6 generated files are all we need. You can use the following command to test the converted models:

./conv_emformer_transducer_stateless2/streaming-ncnn-decode.py \
  --tokens ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/data/lang_bpe_500/tokens.txt \
  --encoder-param-filename ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.param \
  --encoder-bin-filename ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.bin \
  --decoder-param-filename ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.param \
  --decoder-bin-filename ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.bin \
  --joiner-param-filename ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.param \
  --joiner-bin-filename ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.bin \
  ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/test_wavs/1089-134686-0001.wav

Hint

ncnn supports only batch size == 1, so streaming-ncnn-decode.py accepts only 1 wave file as input.

The output is given below:

2023-01-11 14:02:12,216 INFO [streaming-ncnn-decode.py:320] {'tokens': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/data/lang_bpe_500/tokens.txt', 'encoder_param_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.param', 'encoder_bin_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.bin', 'decoder_param_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.param', 'decoder_bin_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.bin', 'joiner_param_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.param', 'joiner_bin_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.bin', 'sound_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/test_wavs/1089-134686-0001.wav'}
T 51 32
2023-01-11 14:02:13,141 INFO [streaming-ncnn-decode.py:328] Constructing Fbank computer
2023-01-11 14:02:13,151 INFO [streaming-ncnn-decode.py:331] Reading sound files: ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/test_wavs/1089-134686-0001.wav
2023-01-11 14:02:13,176 INFO [streaming-ncnn-decode.py:336] torch.Size([106000])
2023-01-11 14:02:17,581 INFO [streaming-ncnn-decode.py:380] ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/test_wavs/1089-134686-0001.wav
2023-01-11 14:02:17,581 INFO [streaming-ncnn-decode.py:381] AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS

Congratulations! You have successfully exported a model from PyTorch to ncnn!

6. Modify the exported encoder for sherpa-ncnn

In order to use the exported models in sherpa-ncnn, we have to modify encoder_jit_trace-pnnx.ncnn.param.

Let us have a look at the first few lines of encoder_jit_trace-pnnx.ncnn.param:

7767517
1060 1342
Input                    in0                      0 1 in0

Explanation of the above three lines:

  1. 7767517, it is a magic number and should not be changed.

  2. 1060 1342, the first number, 1060, specifies the number of layers in this file, while 1342 specifies the number of intermediate outputs of this file.

  3. Input in0 0 1 in0, Input is the layer type of this layer; in0 is the layer name of this layer; 0 means this layer has no input; 1 means this layer has one output; in0 is the output name of this layer.
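For illustration, the header can be parsed with a few lines of Python. This is only a sketch of the format described above, not part of the ncnn tooling:

```python
def parse_param_header(text: str):
    """Parse the first two lines of an ncnn .param file."""
    lines = text.splitlines()
    magic = int(lines[0])
    assert magic == 7767517, "not an ncnn .param file"
    num_layers, num_blobs = map(int, lines[1].split())
    return magic, num_layers, num_blobs

header = "7767517\n1060 1342\nInput                    in0                      0 1 in0\n"
print(parse_param_header(header))  # (7767517, 1060, 1342)
```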

We need to add 1 extra line and also increment the number of layers. The result looks like below:

7767517
1061 1342
SherpaMetaData           sherpa_meta_data1        0 0 0=1 1=12 2=32 3=31 4=8 5=32 6=8 7=512
Input                    in0                      0 1 in0

Explanation

  1. 7767517, it is still the same

  2. 1061 1342, we have added an extra layer, so we need to update 1060 to 1061. We don’t need to change 1342 since the newly added layer has no inputs or outputs.

  3. SherpaMetaData  sherpa_meta_data1  0 0 0=1 1=12 2=32 3=31 4=8 5=32 6=8 7=512 This line is newly added. Its explanation is given below:

    • SherpaMetaData is the type of this layer. Must be SherpaMetaData.

    • sherpa_meta_data1 is the name of this layer. Must be sherpa_meta_data1.

    • 0 0 means this layer has no inputs and no outputs. Must be 0 0.

    • 0=1, 0 is the key and 1 is the value. MUST be 0=1.

    • 1=12, 1 is the key and 12 is the value of the parameter --num-encoder-layers that you provided when running conv_emformer_transducer_stateless2/export-for-ncnn.py.

    • 2=32, 2 is the key and 32 is the value of the parameter --memory-size that you provided when running conv_emformer_transducer_stateless2/export-for-ncnn.py.

    • 3=31, 3 is the key and 31 is the value of the parameter --cnn-module-kernel that you provided when running conv_emformer_transducer_stateless2/export-for-ncnn.py.

    • 4=8, 4 is the key and 8 is the value of the parameter --left-context-length that you provided when running conv_emformer_transducer_stateless2/export-for-ncnn.py.

    • 5=32, 5 is the key and 32 is the value of the parameter --chunk-length that you provided when running conv_emformer_transducer_stateless2/export-for-ncnn.py.

    • 6=8, 6 is the key and 8 is the value of the parameter --right-context-length that you provided when running conv_emformer_transducer_stateless2/export-for-ncnn.py.

    • 7=512, 7 is the key and 512 is the value of the parameter --encoder-dim that you provided when running conv_emformer_transducer_stateless2/export-for-ncnn.py.

    For ease of reference, we list the key-value pairs that you need to add in the following table. If your model has a different setting, please change the values for SherpaMetaData accordingly. Otherwise, you will be SAD.

    key    value
    0      1 (fixed)
    1      --num-encoder-layers
    2      --memory-size
    3      --cnn-module-kernel
    4      --left-context-length
    5      --chunk-length
    6      --right-context-length
    7      --encoder-dim

  4. Input in0 0 1 in0. No need to change it.

Caution

When you add a new layer SherpaMetaData, please remember to update the number of layers. In our case, update 1060 to 1061. Otherwise, you will be SAD later.
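If you prefer to script the edit instead of using a text editor, both changes (inserting the SherpaMetaData line and bumping the layer count) can be done together so that neither is forgotten. A minimal sketch; the metadata values below are the ones for this particular model, so change them for yours:

```python
def add_sherpa_meta_data(param_text: str, meta_line: str) -> str:
    """Insert meta_line after the header of an ncnn .param file
    and increment the layer count accordingly."""
    lines = param_text.splitlines()
    num_layers, num_blobs = map(int, lines[1].split())
    lines[1] = f"{num_layers + 1} {num_blobs}"  # one extra layer, same blob count
    lines.insert(2, meta_line)  # goes right before the Input layer
    return "\n".join(lines) + "\n"

meta = ("SherpaMetaData           sherpa_meta_data1        "
        "0 0 0=1 1=12 2=32 3=31 4=8 5=32 6=8 7=512")
param = "7767517\n1060 1342\nInput                    in0                      0 1 in0\n"
print(add_sherpa_meta_data(param, meta))
```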

Hint

After adding the new layer SherpaMetaData, you cannot use this model with streaming-ncnn-decode.py anymore since SherpaMetaData is supported only in sherpa-ncnn.

Hint

ncnn is very flexible. You can add new layers to it just by text-editing the param file! You don’t need to change the bin file.

Now you can use this model in sherpa-ncnn. Please refer to the following documentation:

We have a list of pre-trained models that have been exported for sherpa-ncnn:

7. (Optional) int8 quantization with sherpa-ncnn

This step is optional.

In this step, we describe how to quantize our model with int8.

First, go back to 4. Export torchscript model via pnnx and disable fp16 when running pnnx:

cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/

pnnx ./encoder_jit_trace-pnnx.pt fp16=0
pnnx ./decoder_jit_trace-pnnx.pt
pnnx ./joiner_jit_trace-pnnx.pt fp16=0

Note

We add fp16=0 when exporting the encoder and joiner. ncnn does not yet support quantizing the decoder model. We will update this documentation once ncnn supports it (maybe this year, 2023).

It will generate the following files:

ls -lh icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/*_jit_trace-pnnx.ncnn.{param,bin}

-rw-r--r-- 1 kuangfangjun root 503K Jan 11 15:56 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root  437 Jan 11 15:56 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 283M Jan 11 15:56 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root  79K Jan 11 15:56 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 3.0M Jan 11 15:56 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root  488 Jan 11 15:56 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.param

Let us compare again the file sizes:

File name                               File size
encoder_jit_trace-pnnx.pt               283 MB
decoder_jit_trace-pnnx.pt               1010 KB
joiner_jit_trace-pnnx.pt                3.0 MB
encoder_jit_trace-pnnx.ncnn.bin (fp16)  142 MB
decoder_jit_trace-pnnx.ncnn.bin (fp16)  503 KB
joiner_jit_trace-pnnx.ncnn.bin (fp16)   1.5 MB
encoder_jit_trace-pnnx.ncnn.bin (fp32)  283 MB
joiner_jit_trace-pnnx.ncnn.bin (fp32)   3.0 MB

You can see that the file sizes are doubled when we disable fp16.

Note

You can again use streaming-ncnn-decode.py to test the exported models.

Next, follow 6. Modify the exported encoder for sherpa-ncnn to modify encoder_jit_trace-pnnx.ncnn.param.

Change

7767517
1060 1342
Input                    in0                      0 1 in0

to

7767517
1061 1342
SherpaMetaData           sherpa_meta_data1        0 0 0=1 1=12 2=32 3=31 4=8 5=32 6=8 7=512
Input                    in0                      0 1 in0

Caution

Please follow 6. Modify the exported encoder for sherpa-ncnn to change the values for SherpaMetaData if your model uses a different setting.

Next, let us compile sherpa-ncnn since we will quantize our models within sherpa-ncnn.

# We will download sherpa-ncnn to $HOME/open-source/
# You can change it to anywhere you like.
cd $HOME
mkdir -p open-source

cd open-source
git clone https://github.com/k2-fsa/sherpa-ncnn
cd sherpa-ncnn
mkdir build
cd build
cmake ..
make -j 4

./bin/generate-int8-scale-table

export PATH=$HOME/open-source/sherpa-ncnn/build/bin:$PATH

The output of the above commands is:

(py38) kuangfangjun:build$ generate-int8-scale-table
Please provide 10 arg. Currently given: 1
Usage:
generate-int8-scale-table encoder.param encoder.bin decoder.param decoder.bin joiner.param joiner.bin encoder-scale-table.txt joiner-scale-table.txt wave_filenames.txt

Each line in wave_filenames.txt is a path to some 16k Hz mono wave file.

We need to create a file wave_filenames.txt containing paths to some calibration wave files. For testing purposes, we use the files in test_wavs from the pre-trained model repository https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05:

cd egs/librispeech/ASR
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/

cat <<EOF > wave_filenames.txt
../test_wavs/1089-134686-0001.wav
../test_wavs/1221-135766-0001.wav
../test_wavs/1221-135766-0002.wav
EOF
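Since every calibration file must be a 16 kHz mono wave file, it can be worth sanity-checking them before running the calibration. A small optional helper using only Python's standard wave module (this is a convenience sketch, not part of icefall):

```python
import os
import wave

def check_wave(path: str) -> None:
    """Verify that path points to a 16 kHz, mono, 16-bit wave file."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000, f"{path}: expected 16 kHz"
        assert w.getnchannels() == 1, f"{path}: expected mono"
        assert w.getsampwidth() == 2, f"{path}: expected 16-bit samples"

# Check every file listed in wave_filenames.txt, if present.
if os.path.exists("wave_filenames.txt"):
    with open("wave_filenames.txt") as f:
        for line in f:
            check_wave(line.strip())
```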

Now we can calculate the scales needed for quantization with the calibration data:

cd egs/librispeech/ASR
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/

generate-int8-scale-table \
  ./encoder_jit_trace-pnnx.ncnn.param \
  ./encoder_jit_trace-pnnx.ncnn.bin \
  ./decoder_jit_trace-pnnx.ncnn.param \
  ./decoder_jit_trace-pnnx.ncnn.bin \
  ./joiner_jit_trace-pnnx.ncnn.param \
  ./joiner_jit_trace-pnnx.ncnn.bin \
  ./encoder-scale-table.txt \
  ./joiner-scale-table.txt \
  ./wave_filenames.txt

The output logs are given below:

Don't Use GPU. has_gpu: 0, config.use_vulkan_compute: 1
num encoder conv layers: 88
num joiner conv layers: 3
num files: 3
Processing ../test_wavs/1089-134686-0001.wav
Processing ../test_wavs/1221-135766-0001.wav
Processing ../test_wavs/1221-135766-0002.wav
Processing ../test_wavs/1089-134686-0001.wav
Processing ../test_wavs/1221-135766-0001.wav
Processing ../test_wavs/1221-135766-0002.wav
----------encoder----------
conv_87                                  : max = 15.942385        threshold = 15.938493        scale = 7.968131
conv_88                                  : max = 35.442448        threshold = 15.549335        scale = 8.167552
conv_89                                  : max = 23.228289        threshold = 8.001738         scale = 15.871552
linear_90                                : max = 3.976146         threshold = 1.101789         scale = 115.267128
linear_91                                : max = 6.962030         threshold = 5.162033         scale = 24.602713
linear_92                                : max = 12.323041        threshold = 3.853959         scale = 32.953129
linear_94                                : max = 6.905416         threshold = 4.648006         scale = 27.323545
linear_93                                : max = 6.905416         threshold = 5.474093         scale = 23.200188
linear_95                                : max = 1.888012         threshold = 1.403563         scale = 90.483986
linear_96                                : max = 6.856741         threshold = 5.398679         scale = 23.524273
linear_97                                : max = 9.635942         threshold = 2.613655         scale = 48.590950
linear_98                                : max = 6.460340         threshold = 5.670146         scale = 22.398010
linear_99                                : max = 9.532276         threshold = 2.585537         scale = 49.119396
linear_101                               : max = 6.585871         threshold = 5.719224         scale = 22.205809
linear_100                               : max = 6.585871         threshold = 5.751382         scale = 22.081648
linear_102                               : max = 1.593344         threshold = 1.450581         scale = 87.551147
linear_103                               : max = 6.592681         threshold = 5.705824         scale = 22.257959
linear_104                               : max = 8.752957         threshold = 1.980955         scale = 64.110489
linear_105                               : max = 6.696240         threshold = 5.877193         scale = 21.608953
linear_106                               : max = 9.059659         threshold = 2.643138         scale = 48.048950
linear_108                               : max = 6.975461         threshold = 4.589567         scale = 27.671457
linear_107                               : max = 6.975461         threshold = 6.190381         scale = 20.515701
linear_109                               : max = 3.710759         threshold = 2.305635         scale = 55.082436
linear_110                               : max = 7.531228         threshold = 5.731162         scale = 22.159557
linear_111                               : max = 10.528083        threshold = 2.259322         scale = 56.211544
linear_112                               : max = 8.148807         threshold = 5.500842         scale = 23.087374
linear_113                               : max = 8.592566         threshold = 1.948851         scale = 65.166611
linear_115                               : max = 8.437109         threshold = 5.608947         scale = 22.642395
linear_114                               : max = 8.437109         threshold = 6.193942         scale = 20.503904
linear_116                               : max = 3.966980         threshold = 3.200896         scale = 39.676392
linear_117                               : max = 9.451303         threshold = 6.061664         scale = 20.951344
linear_118                               : max = 12.077262        threshold = 3.965800         scale = 32.023804
linear_119                               : max = 9.671615         threshold = 4.847613         scale = 26.198460
linear_120                               : max = 8.625638         threshold = 3.131427         scale = 40.556595
linear_122                               : max = 10.274080        threshold = 4.888716         scale = 25.978189
linear_121                               : max = 10.274080        threshold = 5.420480         scale = 23.429659
linear_123                               : max = 4.826197         threshold = 3.599617         scale = 35.281532
linear_124                               : max = 11.396383        threshold = 7.325849         scale = 17.335875
linear_125                               : max = 9.337198         threshold = 3.941410         scale = 32.221970
linear_126                               : max = 9.699965         threshold = 4.842878         scale = 26.224073
linear_127                               : max = 8.775370         threshold = 3.884215         scale = 32.696438
linear_129                               : max = 9.872276         threshold = 4.837319         scale = 26.254213
linear_128                               : max = 9.872276         threshold = 7.180057         scale = 17.687883
linear_130                               : max = 4.150427         threshold = 3.454298         scale = 36.765789
linear_131                               : max = 11.112692        threshold = 7.924847         scale = 16.025545
linear_132                               : max = 11.852893        threshold = 3.116593         scale = 40.749626
linear_133                               : max = 11.517084        threshold = 5.024665         scale = 25.275314
linear_134                               : max = 10.683807        threshold = 3.878618         scale = 32.743618
linear_136                               : max = 12.421055        threshold = 6.322729         scale = 20.086264
linear_135                               : max = 12.421055        threshold = 5.309880         scale = 23.917679
linear_137                               : max = 4.827781         threshold = 3.744595         scale = 33.915554
linear_138                               : max = 14.422395        threshold = 7.742882         scale = 16.402161
linear_139                               : max = 8.527538         threshold = 3.866123         scale = 32.849449
linear_140                               : max = 12.128619        threshold = 4.657793         scale = 27.266134
linear_141                               : max = 9.839593         threshold = 3.845993         scale = 33.021378
linear_143                               : max = 12.442304        threshold = 7.099039         scale = 17.889746
linear_142                               : max = 12.442304        threshold = 5.325038         scale = 23.849592
linear_144                               : max = 5.929444         threshold = 5.618206         scale = 22.605080
linear_145                               : max = 13.382126        threshold = 9.321095         scale = 13.625010
linear_146                               : max = 9.894987         threshold = 3.867645         scale = 32.836517
linear_147                               : max = 10.915313        threshold = 4.906028         scale = 25.886522
linear_148                               : max = 9.614287         threshold = 3.908151         scale = 32.496181
linear_150                               : max = 11.724932        threshold = 4.485588         scale = 28.312899
linear_149                               : max = 11.724932        threshold = 5.161146         scale = 24.606939
linear_151                               : max = 7.164453         threshold = 5.847355         scale = 21.719223
linear_152                               : max = 13.086471        threshold = 5.984121         scale = 21.222834
linear_153                               : max = 11.099524        threshold = 3.991601         scale = 31.816805
linear_154                               : max = 10.054585        threshold = 4.489706         scale = 28.286930
linear_155                               : max = 12.389185        threshold = 3.100321         scale = 40.963501
linear_157                               : max = 9.982999         threshold = 5.154796         scale = 24.637253
linear_156                               : max = 9.982999         threshold = 8.537706         scale = 14.875190
linear_158                               : max = 8.420287         threshold = 6.502287         scale = 19.531588
linear_159                               : max = 25.014746        threshold = 9.423280         scale = 13.477261
linear_160                               : max = 45.633553        threshold = 5.715335         scale = 22.220921
linear_161                               : max = 20.371849        threshold = 5.117830         scale = 24.815203
linear_162                               : max = 12.492933        threshold = 3.126283         scale = 40.623318
linear_164                               : max = 20.697504        threshold = 4.825712         scale = 26.317358
linear_163                               : max = 20.697504        threshold = 5.078367         scale = 25.008038
linear_165                               : max = 9.023975         threshold = 6.836278         scale = 18.577358
linear_166                               : max = 34.860619        threshold = 7.259792         scale = 17.493614
linear_167                               : max = 30.380934        threshold = 5.496160         scale = 23.107042
linear_168                               : max = 20.691216        threshold = 4.733317         scale = 26.831076
linear_169                               : max = 9.723948         threshold = 3.952728         scale = 32.129707
linear_171                               : max = 21.034811        threshold = 5.366547         scale = 23.665123
linear_170                               : max = 21.034811        threshold = 5.356277         scale = 23.710501
linear_172                               : max = 10.556884        threshold = 5.729481         scale = 22.166058
linear_173                               : max = 20.033039        threshold = 10.207264        scale = 12.442120
linear_174                               : max = 11.597379        threshold = 2.658676         scale = 47.768131
----------joiner----------
linear_2                                 : max = 19.293503        threshold = 14.305265        scale = 8.877850
linear_1                                 : max = 10.812222        threshold = 8.766452         scale = 14.487047
linear_3                                 : max = 0.999999         threshold = 0.999755         scale = 127.031174
ncnn int8 calibration table create success, best wish for your int8 inference has a low accuracy loss...\(^0^)/...233...

It generates the following two files:

$ ls -lh encoder-scale-table.txt joiner-scale-table.txt
-rw-r--r-- 1 kuangfangjun root 955K Jan 11 17:28 encoder-scale-table.txt
-rw-r--r-- 1 kuangfangjun root  18K Jan 11 17:28 joiner-scale-table.txt
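In the calibration log above, each printed scale is simply 127 divided by the corresponding threshold, since signed int8 activations use the range [-127, 127]. A minimal sketch checking this relation against a few values copied from the log:

```python
# Verify that scale == 127 / threshold for entries copied from the log above.
entries = [
    # (layer, threshold, scale)
    ("linear_150", 4.485588, 28.312899),
    ("linear_174", 2.658676, 47.768131),
    ("linear_3", 0.999755, 127.031174),
]

for layer, threshold, scale in entries:
    computed = 127.0 / threshold
    # Allow a tiny rounding difference vs. the printed value
    assert abs(computed - scale) < 1e-3, (layer, computed, scale)
    print(f"{layer}: 127 / {threshold} = {computed:.6f}")
```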

Caution

In practice, you should use much more calibration data to compute an accurate scale table.

Finally, let us use the scale table to quantize our models into int8.

ncnn2int8

usage: ncnn2int8 [inparam] [inbin] [outparam] [outbin] [calibration table]
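Since both the encoder and the joiner are quantized with the same five-argument pattern, a small helper that assembles the argument list can reduce copy-paste mistakes. This is a hypothetical convenience sketch (the function name `ncnn2int8_cmd` is ours); it assumes the usual `<stem>.ncnn.param`/`<stem>.ncnn.bin` naming used in this tutorial and that `ncnn2int8` is on your `PATH`:

```python
import subprocess


def ncnn2int8_cmd(stem: str, table: str) -> list[str]:
    """Build the ncnn2int8 argument list for a <stem>.ncnn.param/.bin pair.

    Follows: ncnn2int8 [inparam] [inbin] [outparam] [outbin] [calibration table]
    """
    return [
        "ncnn2int8",
        f"{stem}.ncnn.param",
        f"{stem}.ncnn.bin",
        f"{stem}.ncnn.int8.param",  # the int8 outputs must later be used as a pair
        f"{stem}.ncnn.int8.bin",
        table,
    ]


# To actually run it (requires ncnn2int8 on PATH):
# subprocess.run(ncnn2int8_cmd("encoder_jit_trace-pnnx", "encoder-scale-table.txt"), check=True)
# subprocess.run(ncnn2int8_cmd("joiner_jit_trace-pnnx", "joiner-scale-table.txt"), check=True)
print(ncnn2int8_cmd("encoder_jit_trace-pnnx", "encoder-scale-table.txt"))
```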

First, we quantize the encoder model:

cd egs/librispeech/ASR
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/

ncnn2int8 \
  ./encoder_jit_trace-pnnx.ncnn.param \
  ./encoder_jit_trace-pnnx.ncnn.bin \
  ./encoder_jit_trace-pnnx.ncnn.int8.param \
  ./encoder_jit_trace-pnnx.ncnn.int8.bin \
  ./encoder-scale-table.txt

Next, we quantize the joiner model:

ncnn2int8 \
  ./joiner_jit_trace-pnnx.ncnn.param \
  ./joiner_jit_trace-pnnx.ncnn.bin \
  ./joiner_jit_trace-pnnx.ncnn.int8.param \
  ./joiner_jit_trace-pnnx.ncnn.int8.bin \
  ./joiner-scale-table.txt

The above two commands generate the following four files:

-rw-r--r-- 1 kuangfangjun root  99M Jan 11 17:34 encoder_jit_trace-pnnx.ncnn.int8.bin
-rw-r--r-- 1 kuangfangjun root  78K Jan 11 17:34 encoder_jit_trace-pnnx.ncnn.int8.param
-rw-r--r-- 1 kuangfangjun root 774K Jan 11 17:35 joiner_jit_trace-pnnx.ncnn.int8.bin
-rw-r--r-- 1 kuangfangjun root  496 Jan 11 17:35 joiner_jit_trace-pnnx.ncnn.int8.param

Congratulations! You have successfully quantized your model from float32 to int8.

Caution

ncnn.int8.param and ncnn.int8.bin must be used in pairs.

You can replace ncnn.param and ncnn.bin with ncnn.int8.param and ncnn.int8.bin in sherpa-ncnn if you like.

For instance, to use only the int8 encoder in sherpa-ncnn, you can replace the following invocation:

cd egs/librispeech/ASR
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/

sherpa-ncnn \
  ../data/lang_bpe_500/tokens.txt \
  ./encoder_jit_trace-pnnx.ncnn.param \
  ./encoder_jit_trace-pnnx.ncnn.bin \
  ./decoder_jit_trace-pnnx.ncnn.param \
  ./decoder_jit_trace-pnnx.ncnn.bin \
  ./joiner_jit_trace-pnnx.ncnn.param \
  ./joiner_jit_trace-pnnx.ncnn.bin \
  ../test_wavs/1089-134686-0001.wav

with

cd egs/librispeech/ASR
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/

sherpa-ncnn \
  ../data/lang_bpe_500/tokens.txt \
  ./encoder_jit_trace-pnnx.ncnn.int8.param \
  ./encoder_jit_trace-pnnx.ncnn.int8.bin \
  ./decoder_jit_trace-pnnx.ncnn.param \
  ./decoder_jit_trace-pnnx.ncnn.bin \
  ./joiner_jit_trace-pnnx.ncnn.param \
  ./joiner_jit_trace-pnnx.ncnn.bin \
  ../test_wavs/1089-134686-0001.wav

The following table summarizes the file sizes before and after quantization:

File name                                 File size
encoder_jit_trace-pnnx.pt                 283 MB
decoder_jit_trace-pnnx.pt                 1010 KB
joiner_jit_trace-pnnx.pt                  3.0 MB
encoder_jit_trace-pnnx.ncnn.bin (fp16)    142 MB
decoder_jit_trace-pnnx.ncnn.bin (fp16)    503 KB
joiner_jit_trace-pnnx.ncnn.bin (fp16)     1.5 MB
encoder_jit_trace-pnnx.ncnn.bin (fp32)    283 MB
joiner_jit_trace-pnnx.ncnn.bin (fp32)     3.0 MB
encoder_jit_trace-pnnx.ncnn.int8.bin      99 MB
joiner_jit_trace-pnnx.ncnn.int8.bin       774 KB

You can see that the int8-quantized models are much smaller than their fp16 and fp32 counterparts.

Hint

Currently, only linear layers and convolutional layers are quantized with int8, so you don’t see an exact 4x reduction in file sizes.
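Using the sizes from the comparison table above, a quick calculation shows how far the actual compression falls short of the ideal 4x (fp32 is 4 bytes per weight, int8 is 1):

```python
# Sizes taken from the file-size table above, converted to KB.
fp32_kb = {"encoder": 283 * 1024, "joiner": 3.0 * 1024}
int8_kb = {"encoder": 99 * 1024, "joiner": 774}

for name in fp32_kb:
    ratio = fp32_kb[name] / int8_kb[name]
    # Less than 4x because only linear and convolutional layers are quantized
    print(f"{name}: {ratio:.2f}x smaller")
```

This prints roughly 2.86x for the encoder and 3.97x for the joiner: the joiner is almost entirely linear layers, so it gets close to the ideal ratio, while the encoder contains more unquantized parts.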

Note

You need to test the recognition accuracy after int8 quantization.

You can find the speed comparison at https://github.com/k2-fsa/sherpa-ncnn/issues/44.

That’s it! Have fun with sherpa-ncnn!