Use GenAI models with Qualcomm Generative AI (GenAI) Inference Extensions (Genie)

Genie is a high-level framework to run GenAI models, such as LLMs, vision transformers, and multimodal models on Qualcomm platforms. It abstracts the complexity of managing multiple binaries and orchestrates execution across heterogeneous compute units (CPU, GPU, NPU), delivering low latency, power efficiency, and user simplicity.

Genie framework architecture for running GenAI models on Qualcomm platforms

The following JSON configurations, Genie tools, and C APIs are the essential components for preparing, executing, and managing GenAI models on-device.

Component	Purpose
JSON configurations	Key functions:	Specifies backend selection (CPU or NPU).	Configures model paths, tokenizer settings, and memory allocation.	Manages dialogue session parameters for multi-turn conversations. Example fields:	`backend`: “CPU” or “NPU”	`model_bundle_path`: Path to Genie model binaries	`max_tokens`: Token generation limit For more details, see Qualcomm AI Engine Direct
Genie tools	Command like tools for model execution, and profiling. Common tools:	`genie-t2t-run`: Text-to-text inference.	`genie-profile`: Performance profiling For more details, see Qualcomm AI Runtime (QAIRT) SDK.
Genie API	Provides programmatic access to integrate Genie into GenAI applications. Features:	Dialogue API: Supports multi-turn conversational workflows.	Token generation API: Handles incremental token streaming for LLMs. Advantages:	Enables custom application logic.	Offers fine-grained control over inference sessions. Integration:	Works with QAIRT SDK for backend execution.	Supports both CPU and NPU targets For more details, see QAIRT

Run LLMs with Genie

Once your large language model (LLM) has been prepared and optimized (using AI Hub or the Jupyter notebooks), Genie provides a streamlined way to execute it on Qualcomm platforms.

Prerequisites

Before running a language model with Genie, confirm that the following prerequisites are met.

The model bundle is exported and prepared for the correct backend (CPU or NPU) and it includes AI Engine Direct (QNN) binaries, tokenizer files, and configuration files.
QAIRT is installed on the target device and Genie tools and libraries are available as part of the SDK.
The target hardware uses a Qualcomm platform with sufficient memory (RAM and storage to copy and execute model binaries).

The following image shows the required inputs to use Genie to run an LLM with genie-t2t-run.

Required inputs to run an LLM with genie-t2t-run

The genie-t2t-run tool is a test application to do text-to-text inference on a provided LLM network. It takes a user prompt in text format and outputs the result in the text format. It provides a ready-to-use command line interface (CLI) to run LLM inference on supported Qualcomm devices using CPU, GPU, and Hexagon Tensor Processor (HTP) backends. Genie streamlines multibinary LLM execution into a single job using pre-optimized model assets. The following snippet shows a sample genie-t2t-run command.

genie-t2t-run -c genie_config.json -p "Tell me about Qualcomm"

The following image shows the call flow of the genie-t2t-run command.

The following steps detail how to push the LLM artifacts to the target device and run the LLM model with Genie. Before running genie-t2t-run you need to copy the genie-bundle generated from the model preparation step to the target device.

From the host computer, connect to the target device using its IP address.
```
ssh root@<IP_ADDRESS_OF_TARGET_DEVICE>
```
On the target device, in the SSH session from the previous step, create a directory to hold the model artifacts:
```
mkdir -p /tmp
```

From the host computer, push the libraries and binaries needed to run the model to the target device.

scp ${QNN_SDK_ROOT}/bin/aarch64-oe-linux-gcc11.2/genie-t2t-run root@<TARGET_IP>:/tmp/

scp ${QNN_SDK_ROOT}/lib/aarch64-oe-linux-gcc11.2/libGenie.so root@<TARGET_IP>:/tmp/

scp ${QNN_SDK_ROOT}/lib/aarch64-oe-linux-gcc11.2/libQnnHtp.so root@<TARGET_IP>:/tmp/

scp ${QNN_SDK_ROOT}/lib/aarch64-oe-linux-gcc11.2/libQnnSystem.so root@<TARGET_IP>:/tmp/

scp ${QNN_SDK_ROOT}/lib/aarch64-oe-linux-gcc11.2/libQnnHtpPrepare.so root@<TARGET_IP>:/tmp/

scp ${QNN_SDK_ROOT}/lib/aarch64-oe-linux-gcc11.2/libQnnHtpNetRunExtensions.so root@<TARGET_IP>:/tmp/

In the following commands, replace <ARCHITECTURE> with the DSP Hexagon architecture library version.

Device	Architecture
IQ-8275	`75`
IQ-9075/QCS9100	`73`
Qualcomm Dragonwing™ RB3 Gen 2	`68`

scp ${QNN_SDK_ROOT}/lib/aarch64-oe-linux-gcc11.2/libQnnHtpV<ARCHITECTURE>Stub.so root@<TARGET_IP>:/tmp/

scp ${QNN_SDK_ROOT}/lib/hexagon-v<ARCHITECTURE>/unsigned/libQnnHtpV<ARCHITECTURE>Skel.so root@<TARGET_IP>:/tmp/

From the host computer, push model binaries and configuration files to the target device.

scp <path to htp_backend_ext_config.json> root@<TARGET_IP>:/tmp/

scp <path to genie_config.json> root@<TARGET_IP>:/tmp/

scp <path to tokenizer.json> root@<TARGET_IP>:/tmp/

scp <path to model bin files> root@<TARGET_IP>:/tmp/

From the target device, run the model.

export LD_LIBRARY_PATH=/tmp/

export PATH=$LD_LIBRARY_PATH:$PATH

cd $LD_LIBRARY_PATH

./genie-t2t-run -c <path to genie_config.json> -p "What's the most popular cookie in the world?"

Select the runtime using the backend::type parameter from the JSON configuration file

Parameter	Backends	Description
`backend::type`	All backends	Engine to use: QNN HTP backend: `QnnHtp`, QNN AI transformer backend: `QnnGenAiTransformer`, QNN GPU backend: `QnnGpu`

Sample Genie configurations for different models are available through AI Hub.

Run multimodal models with Genie

To run multimodal models with Genie, use the GenAI tutorials (available in Qualcomm Package Manager) to generate the models and run them with Genie tools.

No multimodal models are hosted in AI Hub. Users must use the Jupyter notebooks to generate and run the models.

Multimodal execution with Genie happens with the Genie pipeline.

Genie Node APIs: Used to create individual blocks. Each node creation requires a standalone JSON configuration, similar to the dialog configuration.
- text-encoder
- image-encoder
- text-generator
Genie Pipeline APIs: Used to connect nodes and streamline execution. The Genie pipeline manages internal data type translations, re-quantization, and concatenation operations.

Genie pipeline for multimodal model execution with text and image encoder nodes

Genie pipeline stages

Create the nodes. a. Create the node configuration. Node JSON configurations vary based on the node type. For more information, see node JSON

Genie_Status_t GenieNodeConfig_createFromJson(const char* str, 
                                              GenieNodeConfig_Handle_t* configHandle);

Create the node.

Genie_Status_t GenieNode_create(const GenieNodeConfig_Handle_t nodeConfigHandle, 
                                GenieNode_Handle_t* nodeHandle);

Construct the pipeline. a. Create a pipeline from the node configuration.

Genie_Status_t GeniePipelineConfig_createFromJson(const char* str, 
                                                  GeniePipelineConfig_Handle_t* configHandle);

Add nodes to the pipeline.

Genie_Status_t GeniePipeline_create(const GeniePipelineConfig_Handle_t configHandle, 
                                    GeniePipeline_Handle_t* pipelineHandle);

Connect the nodes.
- Each node type has a set of predefined IO names. These are defined in GenieNode.h. For example, the text generator node has one of the two possible inputs and one output:
  - Input: GENIE_NODE_TEXT_GENERATOR_TEXT_INPUT
  - Input: GENIE_NODE_TEXT_GENERATOR_EMBEDDING_INPUT
  - Output: GENIE_NODE_TEXT_GENERATOR_TEXT_OUTPUT

The Genie pipeline connect API defines one connection from one producer node output to one consumer node input. For example:

GeniePipeline_connect(pipelineHandle, 
                      lutEncoder, 
                      GENIE_NODE_TEXT_ENCODER_EMBEDDING_OUTPUT,
                      textGenerator,
                      GENIE_NODE_TEXT_GENERATOR_EMBEDDING_INPUT)

Run the pipeline. a. Set data for the input nodes.

Genie_Status_t GenieNode_setData(const GenieNode_Handle_t nodeHandle, 
                                 const GenieNode_IOName_t nodeIOName, 
                                 const void* data, 
                                 const size_t dataSize, 
                                 const char* dataConfig);

Use the predefined IO name (as in connect API)
The IO name implicitly defines the data type

Run the pipeline.

Genie_Status_t GeniePipeline_execute(const GeniePipeline_Handle_t pipelineHandle, 
                                     void* userData);

Callbacks registered to the output node(s) return the output.

Genie API

The Genie API is a high-level interface for running LLM pipelines on Qualcomm devices. It wraps the tokenizer, engine (QNN backend), KV-cache management, decoding, and sampling into dialog and token-generation flows, while delegating device execution to QAIRT with HTP/NPU and CPU backends. The following diagram shows the high-level functions handled internally by Genie APIs. You can use the Genie C APIs to configure each component according to your LLM application needs.

The Genie API has the following components. For complete function definitions and header-level details, see Genie API documentation

GeniePipeline: Orchestrates the entire inference workflow by chaining tokenizer, engine, sampler, and other nodes into a runnable pipeline. It manages data flow and execution order for text generation or embedding tasks.
GenieNode: Represents an individual processing unit (tokenizer, engine) within the pipeline. Nodes encapsulate specific functionality and can be used to build custom inference graphs.
GenieDialog: A high-level abstraction for conversational tasks. It wires together tokenizer, model, backend, and sampler using JSON configuration and provides APIs for token generation and embeddings.
GenieEmbedding: Handles embedding queries for retrieval-augmented generation (RAG) or semantic search. Converts input text into dense vector representations using the loaded model.
GenieProfile: Stores configuration and runtime parameters such as backend selection, sampling strategies, and performance settings. Profiles allow quick switching between different inference setups.
GenieSampler: Implements decoding strategies (for example, greedy, top-k, or top-p sampling) to convert model logits into output tokens. Supports advanced speedup techniques like speculative decoding (SPD), self-speculative decoding (SSD), and look ahead decoding (LADE).
GenieEngine: Executes the model forward pass on the chosen backend (HTP, GenAI transformer, or CPU). It manages graph execution, memory allocation, and hardware acceleration.
GenieTokenizer: Converts text to token IDs and back to text during inference. Works with model-specific vocabularies and supports efficient encoding/decoding for multi-turn dialogs.

The following diagram shows a sample call flow for the Genie API in an end-to-end LLM chatbot application.

Call flow for the Genie API in a chatbot application

​Run LLMs with Genie

​Prerequisites

​Run multimodal models with Genie

​Genie pipeline stages

​Genie API

Run LLMs with Genie

Prerequisites

Run multimodal models with Genie

Genie pipeline stages

Genie API