# Prepare a GenAI model using AI Hub

> Use Qualcomm AI Hub to prepare and deploy large language models (LLMs) on Qualcomm Dragonwing IoT platforms using the Genie framework.

Qualcomm AI Hub provides a streamlined workflow to prepare and deploy
large language models (LLMs) on Qualcomm Dragonwing™ products using
Qualcomm GenerativeAI Inference Extensions (Genie).

This approach enables efficient on-device execution of generative AI
models by leveraging the neural processing unit (NPU) and optimized binaries.

The following image shows the high-level GenAI model workflow from preparation to
execution.

<img src="https://mintcdn.com/qualcomm-prod/L-jqwrTTz49ZAgVX/Key-Documents/AI-Developer-Workflow/_images/genai-prepare-ai-hub.png?fit=max&auto=format&n=L-jqwrTTz49ZAgVX&q=85&s=e31c5b0d6df9dbba3e64efcc79e67be0" alt="High-level GenAI model workflow from preparation to execution using AI Hub" width="2929" height="411" data-path="Key-Documents/AI-Developer-Workflow/_images/genai-prepare-ai-hub.png" />

The following links describe how to create LLM binaries using AI Hub:

* [Prerequisites](https://github.com/qualcomm/ai-hub-apps/tree/main/tutorials/llm_on_genie#requirements)

* [Detailed instructions](https://github.com/qualcomm/ai-hub-apps/tree/main/tutorials/llm_on_genie#step-2-export-qairt-compatible-llm-models-on-the-host-machine)

The following is an overview of on-device LLM deployment:

1. Prepare the model.

   1. Start with the desired LLM (for example, Llama 3.x series) from Hugging Face or another source.
   2. Use the `qai_hub_models` Python package to export the model. This process:

      1. Downloads the model weights.
      2. Uploads them to AI Hub for compilation.
      3. Generates QNN binaries split into multiple parts for NPU execution.
      4. Creates a deployable folder (`genie_bundle`) with all required assets (context binaries,
         configs, tokenizer).

2. Compile and quantize the model.

   1. AI Hub compiles models into optimized binaries for the [Qualcomm AI Runtime (QAIRT) SDK](https://docs.qualcomm.com/doc/80-63442-10/).
   2. AI Hub supports quantization (typically 4-bit internally, though weights may be stored as 8-bit for compatibility).
   3. Export scripts split large models into prompt processor and token generator components.

3. Deploy the model.

   The following are high-level steps of the deployment process. For detailed instructions and commands, see [Run LLMs with Genie](#run-llms-with-genie).

   1. Install the Qualcomm AI Runtime (QAIRT) SDK on the target device (Android, Windows, or Linux).
   2. Copy the compiled binaries and configuration files to the device.
   3. Use [Qualcomm GenerativeAI Inference Extensions (Genie) CLI tools](https://docs.qualcomm.com/doc/80-63442-10/topic/tools_tools.html)
      (for example, `genie-t2t-run`) or the [Genie dialog API](https://docs.qualcomm.com/doc/80-63442-10/topic/api-rst_file_include_Genie_GenieDialog_h.html#file-include-Genie-GenieDialog.h) for inference.
   4. Ensure the target device meets the following requirements. The steps in this section are validated on QCS9100, which uses Hexagon architecture v73.

      * Hexagon architecture: v73 or newer
      * Required RAM:

        * 16 GB for 7B models
        * \~12 GB for 3B models

4. Run the model on-device using Genie APIs integrated with [Qualcomm AI Engine Direct](https://docs.qualcomm.com/doc/80-63442-10/topic/index_QNN.html).

   * Genie manages the multiple binaries and their execution order for optimal NPU utilization.
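
The host-side export and the device-side run described above can be sketched as a small Python helper that only assembles the two command lines. This is a minimal sketch, not the tutorial's own script. The model ID `llama_v3_2_3b_instruct`, the device name placeholder, the export flags (`--device`, `--skip-profiling`, `--output-dir`), the `genie-t2t-run` flags (`-c`, `-p`), and the file name `genie_config.json` are assumptions to verify against the linked tutorial and each tool's `--help` output.

```python
import shlex


def build_export_command(model_id, device, output_dir="genie_bundle"):
    """Assemble the host-side export command for a qai_hub_models model.

    Assumption: the model provides an `export` module under
    qai_hub_models.models.<model_id>, and the flags below match the
    installed package version.
    """
    return [
        "python", "-m", f"qai_hub_models.models.{model_id}.export",
        "--device", device,           # AI Hub device name (placeholder in the demo)
        "--skip-profiling",           # assumed flag: skip on-device profiling
        "--output-dir", output_dir,   # folder that receives the deployable bundle
    ]


def build_genie_command(prompt, config="genie_config.json"):
    """Assemble the device-side genie-t2t-run command.

    Assumption: the bundle contains a config file named genie_config.json.
    """
    return ["genie-t2t-run", "-c", config, "-p", prompt]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    export_cmd = build_export_command("llama_v3_2_3b_instruct", "<AI Hub device name>")
    run_cmd = build_genie_command("What is an NPU?")
    # Print shell-quoted commands instead of running them.
    print(shlex.join(export_cmd))
    print(shlex.join(run_cmd))
```

The helper prints the commands rather than executing them, so they can be reviewed first. The prompt format is model-specific, so a plain-text prompt such as the one above may need to be wrapped in the model's chat template.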

### Important notes

* AI Hub advantages

  * Automatically handles model compilation, quantization, and splitting.
  * Provides pre-optimized models and bring-your-own-model (BYOM) support.

* Genie

  * Simplifies inference by abstracting complex execution steps.
  * Offers APIs for text-to-text and dialog-based interactions.

* Customization

  * The export flow defaults to 4-bit quantization for runtime efficiency.
  * There is no direct option to store weights as 4-bit; they are stored as 8-bit but loaded as 4-bit during execution.
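
To make the storage trade-off above concrete, the following sketch computes the raw size of a model's weights at a given bit width. It is plain arithmetic (parameters × bits ÷ 8) and assumes nominal parameter counts of 3 billion and 7 billion. It ignores activations, the KV cache, and runtime overhead, so it does not replace the RAM requirements listed earlier.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Return the raw weight size in gigabytes (10^9 bytes) at a fixed bit width."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Nominal parameter counts; real models differ slightly from these round numbers.
    for label, params in (("3B", 3e9), ("7B", 7e9)):
        for bits in (4, 8):
            size = weight_size_gb(params, bits)
            print(f"{label} model, {bits}-bit weights: {size:.1f} GB")
```

Under these assumptions, 8-bit storage doubles the weight size relative to 4-bit: 7.0 GB versus 3.5 GB for a nominal 7B model.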