Convert and quantize AI models - Qualcomm Dragonwing Documentation


Port a model using Qualcomm Neural Processing Engine SDK

Model conversion

A pretrained floating point, 32-bit precision model from PyTorch, ONNX, TensorFlow, or TFLite is input to SNPE converter tools (snpe-<framework>-to-dlc) to convert the model to a Qualcomm-specific intermediate representation of the model called a deep learning container (DLC).

In addition to the input model from a source framework, the converters require additional details about the input model, such as the input node name, its corresponding input dimensions, and any output tensor names (for models with multiple outputs).

Refer to converters for all available configurable parameters or see the command line help by running:

snpe-<framework>-to-dlc --help

Output:

required arguments:

-d INPUT_NAME INPUT_DIM, --input_dim INPUT_NAME INPUT_DIM
    The names and dimensions of the network input layers specified in the format
    [input_name comma-separated-dimensions], for example:
    'data' 1,224,224,3
     Note that the quotes should always be included in order to handle special
     characters, spaces, etc.
     For multiple inputs specify multiple --input_dim on the command line like:
     --input_dim 'data1' 1,224,224,3 --input_dim 'data2' 1,50,100,3
--out_node OUT_NAMES, --out_name OUT_NAMES
     Name of the graph's output Tensor Names. Multiple output names should be
     provided separately like:
     --out_name out_1 --out_name out_2
--input_network INPUT_NETWORK, -i INPUT_NETWORK
     Path to the source framework model.

If the yaml package is not present in your working environment, install it using the following command:

pip install pyyaml

The following example uses an ONNX model (inception_v3_opset16.onnx) downloaded from the ONNX Model Zoo. Download the model as inception_v3.onnx to your workspace. In this example, we download the model to the ~/models directory.

Run the following command to generate the inception_v3.dlc model.

${QAIRT_ROOT}/bin/x86_64-linux-clang/snpe-onnx-to-dlc --input_network ~/models/inception_v3.onnx --output_path ~/models/inception_v3.dlc --input_dim 'x' 1,3,299,299
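
As a rough illustration of the argument format described in the help text, the following Python sketch assembles a converter command line with one --input_dim pair per input and one --out_name per output. The helper name build_convert_cmd and the /opt/qairt fallback path are hypothetical; only flags that appear in the help output above are used, and nothing is executed.

```python
import os

def build_convert_cmd(tool, model_path, dlc_path, inputs, outputs=()):
    """Return an argv list for a snpe-<framework>-to-dlc style converter.

    inputs  : mapping of input name -> tuple of dimensions
    outputs : iterable of output tensor names (optional)
    """
    cmd = [tool, "--input_network", model_path, "--output_path", dlc_path]
    for name, dims in inputs.items():
        # Each input is passed as: --input_dim <name> <comma-separated dims>
        cmd += ["--input_dim", name, ",".join(str(d) for d in dims)]
    for name in outputs:
        # Each output name gets its own --out_name flag
        cmd += ["--out_name", name]
    return cmd

qairt_root = os.environ.get("QAIRT_ROOT", "/opt/qairt")  # placeholder default
cmd = build_convert_cmd(
    os.path.join(qairt_root, "bin/x86_64-linux-clang/snpe-onnx-to-dlc"),
    os.path.expanduser("~/models/inception_v3.onnx"),
    os.path.expanduser("~/models/inception_v3.dlc"),
    inputs={"x": (1, 3, 299, 299)},
)
print(" ".join(cmd))
```

Building the command as a list means it can later be handed to subprocess.run(cmd, check=True) without shell quoting around the input names.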

Model quantization

To run a model on Hexagon Tensor Processor (HTP), the converted DLC must be quantized. SNPE offers a tool (snpe-dlc-quant) to quantize a DLC model to INT8/INT16 DLC using its own quantization algorithm. For more information about SNPE quantization, see Quantized models.

The quantization process in SNPE requires two steps:

Quantization of weights and biases within the model. Quantization of weights and biases is a static step, i.e., no additional input data is required from the user.
Quantization of activation layers (or layers with no weights). Quantizing activation layers requires a set of input images from a training dataset as calibration data. These calibration dataset images are input as a list of preprocessed image files in .raw format. The file sizes of these input .raw files must match the input size of the model.
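
To make the role of calibration data concrete, here is a minimal sketch of generic asymmetric min/max quantization in NumPy. It is not the algorithm snpe-dlc-quant uses (this page does not describe it), and the function names are made up for illustration. It only shows why activations need representative inputs: the observed value range determines the scale and zero point.

```python
import numpy as np

def calibrate_range(samples):
    """Observed (min, max) over calibration samples, widened to include 0."""
    lo = min(float(np.min(s)) for s in samples)
    hi = max(float(np.max(s)) for s in samples)
    return min(lo, 0.0), max(hi, 0.0)

def quantize(x, lo, hi, bits=8):
    """Map float values in [lo, hi] to unsigned integers of the given width."""
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.int64)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the integer encoding."""
    return (q - zero_point) * scale

# Random stand-ins for activations recorded while running calibration inputs
rng = np.random.default_rng(0)
calibration = [rng.normal(size=1000) for _ in range(10)]
lo, hi = calibrate_range(calibration)

x = calibration[0]
for bits in (8, 16):
    q, scale, zp = quantize(x, lo, hi, bits)
    err = np.max(np.abs(dequantize(q, scale, zp) - x))
    print(f"{bits}-bit: scale={scale:.6f}, max abs error={err:.6f}")
```

The 16-bit run uses a much smaller step size than the 8-bit run, which is the trade-off behind choosing a wider bit width.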

Inputs to snpe-dlc-quant are a converted DLC model and a plain text file with the paths to the calibration dataset images. This input list holds paths to preprocessed images saved as NumPy arrays in .raw format. The size of the preprocessed image must match the input resolution of the model.

The output of the snpe-dlc-quant tool is a quantized DLC.

[ --input_dlc=<val> ]
             Path to the dlc container containing the model for which fixed-point
             encoding metadata should be generated. This argument is required.
[ --input_list=<val> ]
             Path to a file specifying the trial inputs. This file should be a plain
             text file, containing one or more absolute file paths per line. These
             files will be taken to constitute the trial set. Each path is expected to
             point to a binary file containing one trial input in the 'raw' format,
             ready to be consumed by the tool without any further modifications. This
             is similar to how input is provided to snpe-net-run application.
[ --output_dlc=<val> ]
             Path at which the metadata-included quantized model container should be
             written. If this argument is omitted, the quantized model will be written
             at <unquantized_model_name>_quantized.dlc.

Use the Netron graph visualization tool to identify the model’s input/output layer dimensions.
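
Because each .raw file is a headerless dump of the tensor, the size requirement reduces to simple arithmetic: element count times bytes per element. The sketch below (helper names are hypothetical) computes the expected size for a float32 1 x 299 x 299 x 3 input and reports any files in an input list that do not match. It assumes a single-input model with one path per line.

```python
import os
import numpy as np

def expected_raw_size(shape, dtype=np.float32):
    """Bytes in a headerless dump of a tensor with this shape and dtype."""
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

def find_size_mismatches(input_list_path, shape, dtype=np.float32):
    """Return (path, actual_size) for every listed file with the wrong size."""
    want = expected_raw_size(shape, dtype)
    bad = []
    with open(input_list_path) as f:
        for line in f:
            path = line.strip()
            if not path:
                continue  # skip blank lines
            size = os.path.getsize(path)
            if size != want:
                bad.append((path, size))
    return bad

print(expected_raw_size((1, 299, 299, 3)))  # 1 * 299 * 299 * 3 * 4 = 1072812
```

Once an input list exists, calling find_size_mismatches(os.path.expanduser("~/models/input_list.txt"), (1, 299, 299, 3)) should return an empty list if every file has the expected size.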

For demo purposes, we can evaluate the quantization process with random input files. The input file can be generated using a simple Python script shown below for the inception_v3.onnx model. Save the script as generate_random_input.py in your workspace ~/models/ directory and run it using python ~/models/generate_random_input.py on your host computer.

The following example Python code creates an input_list that holds paths to calibration dataset images used to quantize the model.

import os
import numpy as np

input_path_list = []
BASE_PATH = "/tmp/RandomInputsForInceptionV3"

# create the output directory if it does not exist yet
os.makedirs(BASE_PATH, exist_ok=True)

# generate 10 random inputs and save them as raw binary files
NUM_IMAGES = 10

for img in range(NUM_IMAGES):
    filename = os.path.join(BASE_PATH, "input_{}.raw".format(img))
    # float32 tensor of shape 1 x 299 x 299 x 3
    random_tensor = np.random.random((1, 299, 299, 3)).astype(np.float32)
    random_tensor.tofile(filename)
    input_path_list.append(filename)

# write input_list.txt next to this script so that its location does not
# depend on the current working directory
script_dir = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(script_dir, "input_list.txt"), "w") as f:
    for path in input_path_list:
        f.write(path + "\n")

The above script generates 10 sample input files saved in the /tmp/RandomInputsForInceptionV3/ directory and an input_list.txt file that contains the path to each sample generated.

Now that all needed inputs to the snpe-dlc-quant tool are available, the model can be quantized.

${QAIRT_ROOT}/bin/x86_64-linux-clang/snpe-dlc-quant --input_dlc ~/models/inception_v3.dlc --output_dlc ~/models/inception_v3_quantized.dlc --input_list ~/models/input_list.txt

This generates a quantized inception_v3 DLC model (inception_v3_quantized.dlc). By default, the model is quantized to INT8. To use 16-bit quantization instead, pass --act_bitwidth 16 and/or --weights_bitwidth 16 to the snpe-dlc-quant tool.

Refer to the snpe-dlc-quant tool documentation, or run snpe-dlc-quant --help to view all available customizations including quantization modes, optimizations, etc.

Model optimization

A quantized DLC requires a graph preparation step that optimizes the model for execution on HTP. To prepare the DLC for HTP, SNPE provides the snpe-dlc-graph-prepare tool, which takes a quantized model and hardware-specific details, such as the chipset, as input.

Optimizations for hardware such as HTP depend on the specific version of HTP present on the chipset. To ensure the correct set of optimizations is applied to the execution graph for optimal utilization of the HTP, it is important to provide the correct chipset ID to the snpe-dlc-graph-prepare tool.

Based on the HTP version and chipset ID, the tool creates a cache that contains an execution strategy for running the model DLC on the HTP hardware. Without this step, network initialization incurs additional overhead because the SNPE runtime has to create an execution strategy on the fly.

${QAIRT_ROOT}/bin/x86_64-linux-clang/snpe-dlc-graph-prepare --input_dlc ~/models/inception_v3_quantized.dlc --output_dlc ~/models/inception_v3_quantized_with_htp_cache.dlc --htp_socs qcs6490
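
The three SNPE steps chain together through their output files. As a sketch of that flow, the following Python snippet derives the intermediate file names and assembles the convert, quantize, and graph-prepare command lines in order. The function name, the /opt/qairt fallback path, and the print-only structure are illustrative; the flags are the ones used in the commands in this section.

```python
import os

def snpe_htp_pipeline(qairt_root, onnx_path, input_list, soc, input_name, input_dims):
    """Return the convert, quantize, and graph-prepare commands as argv lists."""
    bin_dir = os.path.join(qairt_root, "bin", "x86_64-linux-clang")
    stem = os.path.splitext(onnx_path)[0]
    dlc = stem + ".dlc"
    quantized = stem + "_quantized.dlc"
    prepared = stem + "_quantized_with_htp_cache.dlc"
    dims = ",".join(str(d) for d in input_dims)
    return [
        # Step 1: source framework model -> floating point DLC
        [os.path.join(bin_dir, "snpe-onnx-to-dlc"),
         "--input_network", onnx_path, "--output_path", dlc,
         "--input_dim", input_name, dims],
        # Step 2: floating point DLC + calibration list -> quantized DLC
        [os.path.join(bin_dir, "snpe-dlc-quant"),
         "--input_dlc", dlc, "--output_dlc", quantized,
         "--input_list", input_list],
        # Step 3: quantized DLC + chipset -> DLC with HTP cache
        [os.path.join(bin_dir, "snpe-dlc-graph-prepare"),
         "--input_dlc", quantized, "--output_dlc", prepared,
         "--htp_socs", soc],
    ]

steps = snpe_htp_pipeline(
    os.environ.get("QAIRT_ROOT", "/opt/qairt"),  # placeholder default
    os.path.expanduser("~/models/inception_v3.onnx"),
    os.path.expanduser("~/models/input_list.txt"),
    soc="qcs6490", input_name="x", input_dims=(1, 3, 299, 299),
)
for cmd in steps:
    print(" ".join(cmd))  # swap for subprocess.run(cmd, check=True) to execute
```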

HTP cache information

Once the snpe-dlc-graph-prepare step is completed, the HTP cache record is added to the DLC. This cache information can be viewed using the snpe-dlc-info tool.

${QAIRT_ROOT}/bin/x86_64-linux-clang/snpe-dlc-info -i ~/models/inception_v3_quantized_with_htp_cache.dlc

Port a model using AI Engine Direct

Model conversion and quantization

A pretrained FP32 model from PyTorch, ONNX, TensorFlow, or TFLite is input to the QNN converter tool (qnn-<framework>-converter), which converts it to a QNN graph representation in the form of a high-level, readable C++ graph.

When accelerating the model on HTP, the model must be quantized. Model quantization can be done in the same step as conversion. Static quantization requires a calibration dataset; to enable quantization along with conversion, pass the calibration input list with the --input_list INPUT_LIST option. For more information, see quantization support.

The following example uses an ONNX model (inception_v3_opset16.onnx) downloaded from the ONNX Model Zoo. Download the model as inception_v3.onnx to your workspace. In this example, the model is downloaded to the ~/models directory.

Model conversion: CPU backend

To convert the model to run on x86/Arm-based CPU, run the following command to generate inception_v3.cpp and inception_v3.bin.

${QAIRT_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter --input_network ~/models/inception_v3.onnx --output_path ~/models/inception_v3.cpp --input_dim 'x' 1,3,299,299

The inception_v3.cpp file contains a high-level graph representation of the converted model. The inception_v3.bin file contains weights/biases from the model.

Model conversion and quantization: HTP backend

To run the model on HTP, the quantization step is required. For quantization in the AI Engine Direct (QNN) SDK, a representative set of 50 to 200 images from a training dataset is provided to the QNN converter as a calibration dataset. The images in the calibration dataset are preprocessed (resized, normalized, etc.) and saved as NumPy arrays in .raw format. The size of these input .raw files must match the input size of the model.
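
As an illustration of the preprocessing described above, the following sketch converts an already-decoded RGB image (a uint8 array of height x width x 3) into a float32 .raw file. The helper name, the nearest-neighbour resize, the [0, 1] scaling, and the 1 x 299 x 299 x 3 output layout are assumptions chosen to keep the example dependency-free; use whatever preprocessing the model was trained with.

```python
import os
import tempfile
import numpy as np

def preprocess_to_raw(image_u8, out_path, size=299):
    """Resize, normalize, and dump one RGB image as a float32 .raw file."""
    h, w, _ = image_u8.shape
    # Nearest-neighbour resize: pick a source row/column for each output pixel
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image_u8[rows][:, cols]
    # Scale to [0, 1]; real models often need mean/std normalization instead
    tensor = resized.astype(np.float32) / 255.0
    # Add the batch dimension: (1, size, size, 3)
    tensor = tensor[np.newaxis, ...]
    tensor.tofile(out_path)
    return tensor

# Stand-in for a decoded photo; a real pipeline would load an image file here
fake_image = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
out_path = os.path.join(tempfile.gettempdir(), "sample_0.raw")
t = preprocess_to_raw(fake_image, out_path)
print(t.shape, t.dtype, os.path.getsize(out_path))
```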

Use the Netron graph visualization tool to identify the model’s input/output layer dimensions.

For demonstration purposes, you can evaluate the quantization process with random input files. The input files can be generated using the Python script shown below for the inception_v3.onnx model. Save the script as generate_random_input.py in the ~/models/ directory and run it using python ~/models/generate_random_input.py.

The following Python code creates an input_list that contains the calibration dataset used to quantize the model.

import os
import numpy as np

input_path_list = []

BASE_PATH = "/tmp/RandomInputsForInceptionV3"

# create the output directory if it does not exist yet
os.makedirs(BASE_PATH, exist_ok=True)

# generate 10 random inputs and save them as raw binary files
NUM_IMAGES = 10

for img in range(NUM_IMAGES):
    filename = os.path.join(BASE_PATH, "input_{}.raw".format(img))
    # float32 tensor of shape 1 x 299 x 299 x 3
    random_tensor = np.random.random((1, 299, 299, 3)).astype(np.float32)
    random_tensor.tofile(filename)
    input_path_list.append(filename)

# write input_list.txt next to this script so that its location does not
# depend on the current working directory
script_dir = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(script_dir, "input_list.txt"), "w") as f:
    for path in input_path_list:
        f.write(path + "\n")

Run the following command to convert and quantize the model. By default, the model is quantized to INT8. You can specify --act_bitwidth 16 and/or --weights_bitwidth 16 to use INT16 quantization.

${QAIRT_ROOT}/bin/x86_64-linux-clang/qnn-onnx-converter --input_network ~/models/inception_v3.onnx --output_path ~/models/inception_v3_quantized.cpp --input_list ~/models/input_list.txt --input_dim "x" 1,3,299,299

This generates inception_v3_quantized.cpp and inception_v3_quantized.bin files in the ~/models/ directory.

See qnn-<framework>-converter or run qnn-<framework>-converter --help to view all available customizations to quantization, including quantization modes, optimizations, etc.

Model compilation

Once the conversion/quantization step is complete, qnn-model-lib-generator is used to compile the generated C++ graph into a shared object (.so), enabling the model to be dynamically loaded by an application to perform inference.

For x86, the Clang compiler toolchain is used to compile the C++ graph into a .so library. For a Linux Embedded device, such as Qualcomm Dragonwing™ RB3 Gen 2 and IQ-9075, the appropriate compiler toolchain (aarch64-oe-linux-gcc11.2) must be used.

Compiling a model to run on x86

Install the GNU standard C++ library development package for GCC version 12.
```
sudo apt install libstdc++-12-dev
```
This package includes standard library headers (like <limits>, <vector>, and <string>), static libraries, and support files necessary to compile and link C++ programs.
Generate a shared object model to run on an x86-based Linux machine.
```
${QAIRT_ROOT}/bin/x86_64-linux-clang/qnn-model-lib-generator -c ~/models/inception_v3.cpp -b ~/models/inception_v3.bin -o ~/models/libs/ -t x86_64-linux-clang
```
This generates libinception_v3.so, using the Clang-14 compiler toolchain to compile the C++ graph into a QNN model .so compatible with the x86 host computer.

Compiling a model to run on target

When compiling a model for on-device execution (aarch64 architecture), it is important to use the right cross-compiler toolchain to ensure the compiled shared object (.so) is compatible with the device OS.

The following steps set up the cross-compiler toolchain required to compile a model .cpp file into a .so library. Instructions to install the appropriate cross-compiler toolchain are available under Download and install the Platform SDK.

After installing the platform SDK, set up the cross-compiler environment in a new command line terminal.

Source the environment setup script under $SDK_ROOT.

source $SDK_ROOT/environment-setup-armv8a-qcom-linux

Check that the environment is set up properly.
```
echo $SDKTARGETSYSROOT
```
```
echo $TARGET_PREFIX
```
If the above environment variables are not populated, source the environment setup script again in a new command line terminal.
Set up the QAIRT environment.
```
source ${QAIRT_ROOT}/bin/envsetup.sh
```
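
The same check can be scripted. This sketch (the helper name is hypothetical) reports which of the variables mentioned above are unset or empty in the current environment; QAIRT_ROOT is included because the commands in this guide reference it.

```python
import os

REQUIRED_VARS = ("SDKTARGETSYSROOT", "TARGET_PREFIX", "QAIRT_ROOT")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("Cross-compiler and QAIRT environment variables are set.")
```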

Compiling a model to run on Arm-based CPU

Once the cross-compiler is set up, use the following command to generate libinception_v3.so in ~/models/libs/aarch64-oe-linux-gcc11.2. The output directory (~/models/libs) is passed to the qnn-model-lib-generator tool through the -o command line argument.

The compiler toolchain used here is aarch64-oe-linux-gcc11.2.

${QAIRT_ROOT}/bin/x86_64-linux-clang/qnn-model-lib-generator -c ~/models/inception_v3.cpp -b ~/models/inception_v3.bin -o ~/models/libs -t aarch64-oe-linux-gcc11.2

Compiling a model to run on HTP

To run the model on HTP, the following command generates libinception_v3_quantized.so in ~/models/libs/aarch64-oe-linux-gcc11.2.

The compiler toolchain used here is aarch64-oe-linux-gcc11.2.

${QAIRT_ROOT}/bin/x86_64-linux-clang/qnn-model-lib-generator -c ~/models/inception_v3_quantized.cpp -b ~/models/inception_v3_quantized.bin -o ~/models/libs/ -t aarch64-oe-linux-gcc11.2
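
Judging from the file names given in this section, qnn-model-lib-generator writes its output to <output dir>/<target>/lib<model name>.so. The following sketch encodes that observed pattern so a script can locate the library after compilation. The helper name is hypothetical, and the naming rule is an inference from the examples here, not a documented guarantee.

```python
import os

def expected_model_lib(cpp_path, out_dir, target):
    """Guess where the compiled .so lands, based on the examples in this guide."""
    model_name = os.path.splitext(os.path.basename(cpp_path))[0]
    return os.path.join(out_dir, target, "lib" + model_name + ".so")

lib = expected_model_lib(
    os.path.expanduser("~/models/inception_v3_quantized.cpp"),
    os.path.expanduser("~/models/libs"),
    "aarch64-oe-linux-gcc11.2",
)
print(lib, "(exists)" if os.path.exists(lib) else "(not found yet)")
```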
