Prepare a GenAI model using AI Hub - Qualcomm Dragonwing Documentation

Qualcomm AI Hub provides a streamlined workflow to prepare and deploy large language models (LLMs) on Qualcomm Dragonwing™ products using Qualcomm GenerativeAI Inference Extensions (Genie). This approach enables efficient on-device execution of generative AI models by leveraging the neural processing unit (NPU) and optimized binaries. The following image shows the high-level GenAI model workflow from preparation to execution.

High-level GenAI model workflow from preparation to execution using AI Hub

The following are the steps to create LLM model binaries using AI Hub:

The following is an overview of LLM on-device deployment:

Prepare the model. a. Start with the desired LLM (for example, Llama 3.x series) from Hugging Face or another source.
1. Use the qai_hub_models Python package to export the model: This process: i. Downloads the model weights.
  1. Uploads them to AI Hub for compilation.
  2. Generates QNN binaries split into multiple parts for NPU execution.
  3. Creates a deployable folder (genie_bundle) with all required assets (context binaries, configs, tokenizer).
Compile and quantize the model. a. AI Hub compiles models into optimized binaries for Qualcomm AI Runtime (QAIRT) SDK.
1. AI Hub supports quantization (typically 4-bit internally, though weights may be stored as 8-bit for compatibility).
2. Export scripts handle splitting large models into prompt processors and token generator components.
Deploy the model. The following are high-level steps of the deployment process. For detailed instructions and commands, see Run LLMs with Genie. a. Install the Qualcomm AI Runtime (QAIRT) SDK on the target device (Android, Windows, Linux).
1. Copy the compiled binaries and configuration files to the device.
2. Use Qualcomm GenerativeAI Inference Extensions (Genie) CLI tools (for example, genie-t2t-run) or Genie dialog API for inference.
3. Ensure the target device meets the following requirements. The steps in this section are validated for QCS9100, which uses Hexagon architecture V73.
  - Hexagon architecture: v73 or newer
  - Required RAM:
    - 16 GB for 7B models
    - ~12 GB for 3B models
Run the model on-device using Genie APIs integrated with Qualcomm AI Engine Direct.
- Genie manages multiple binaries and execution orders for optimal NPU utilization.

Important notes

AI Hub advantages
- automatically handles model compilation, quantization, and splitting.
- Provides pre-optimized models and bring your own model (BYOM) support.
Genie
- Simplifies inference by abstracting complex execution steps.
- Offers APIs for text-to-text and dialogue-based interactions.
Customization
- Export flow defaults to 4-bit quantization for runtime efficiency.
- No direct option to store weights as 4-bit; they remain 8-bit but load as 4-bit during execution.

​Important notes

Important notes