
# Run Gemma-4 E2B on the IQ8 NPU with LiteRT-LM

> Build LiteRT-LM from source and run Google's Gemma-4 E2B on the Hexagon NPU of a Dragonwing IQ-8275, from an Ubuntu prototype to a Qualcomm Linux production image.

<div style={{ marginBottom: "2rem" }}>
  <div
    style={{
fontSize: "0.72rem",
fontWeight: 700,
color: "#31017D",
letterSpacing: "1.5px",
textTransform: "uppercase",
marginBottom: "0.5rem"
}}
  >
    AI / ML
  </div>

  <div style={{ fontSize: "0.85rem", color: "#888", display: "flex", gap: "0.5rem", flexWrap: "wrap", alignItems: "center" }}>
    <span>Dragonwing Team</span>
    <span>·</span>
    <span>Jun 29, 2026</span>
  </div>
</div>

<hr style={{ border: "none", borderTop: "1px solid #eee", margin: "0 0 2rem" }} />

This guide runs Google's **Gemma-4 E2B** on the **Hexagon NPU** of a Dragonwing **IQ-8275 (QCS8275)**. You build Google's **LiteRT-LM** runtime from source against the *exact* Qualcomm AI runtime the model was compiled with, then run the public, Apache-2.0 `.litertlm` directly on the NPU at about **28 tokens per second** decode. Every command is copy-pasteable.

* **Target:** IQ-8275 EVK, Ubuntu 24.04 (aarch64), Hexagon **v75** NSP.
* **Model:** [`litert-community/gemma-4-E2B-it-litert-lm`](https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm), specifically `gemma-4-E2B-it_qualcomm_qcs8275.litertlm` (3.29 GB). This file is **NPU-only**.

Total time is about 15 minutes of device setup plus a one-time source build (around 45 minutes building on the IQ8 itself, faster on a beefier aarch64 box).

<Note>
  **How to read this guide.** Every command runs **on the IQ-8275** as the `ubuntu` user. All work lives in one directory, `~/iq8-gemma`, and **every code block starts with its own `cd`**, so you can paste any block into any fresh terminal, in order, without tracking which directory you are in. Nothing needs editing before you paste it.
</Note>

## What you will do

1. Set up the IQ8 device and confirm FastRPC is present.
2. Download the public NPU model and read the exact QAIRT version it needs.
3. Build `litert_lm_main` from LiteRT-LM against that QAIRT, with two Linux-enablement patches.
4. Assemble a run directory and run the model on the NPU.
5. (Optional) Move from the Ubuntu prototype to a Qualcomm Linux (Yocto) production image.

## Prerequisites

Before anything below works, the board needs its Qualcomm peripherals enabled and the AI runtime installed: the **FastRPC userland** (`libcdsprpc.so`), the **QNN** libraries (`libqnn-dev`, `qnn-tools`, `snpe-tools`, `tensorflow-lite-qcom-apps`), `qcom-libdmabufheap`, and the GStreamer QCOM plugins, plus a firmware update and reboot.

Set this up first by following the IQ8 device pages, then come back here:

* [First-time setup for the Dragonwing IQ8](/Ubuntu/devices/iq8275-evk/setup)
* [Install the required software packages](/Ubuntu/devices/iq8275-evk/Install_required_software_packages)

After the reboot, reconnect to the board and confirm FastRPC is present:

```bash theme={null}
ls /dev/fastrpc-cdsp                 # must exist
ldconfig -p | grep cdsprpc           # libcdsprpc.so[.1] present
```

If `/dev/fastrpc-*` does not exist, the kernel lacks FastRPC support. Stop here: that is a BSP or image problem, not something you can fix in userland.

## Get the model and find the QAIRT version it needs

Do not guess the version. Create the working directory and download the public NPU model into it (3.29 GB; `-C -` resumes if the connection drops):

```bash theme={null}
mkdir -p ~/iq8-gemma
cd ~/iq8-gemma
curl -fL -C - -o gemma-4-E2B-it_qualcomm_qcs8275.litertlm \
  "https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/resolve/main/gemma-4-E2B-it_qualcomm_qcs8275.litertlm"
```

A Qualcomm `.litertlm` embeds **QNN context binaries** that are version-locked to the QAIRT release they were compiled with, and the on-device runtime must match. The version is not advertised up front, and LiteRT's source pins a *newer* one, so read it straight from the file:

```bash theme={null}
cd ~/iq8-gemma
strings gemma-4-E2B-it_qualcomm_qcs8275.litertlm | grep -Eo '2\.4[0-9]\.0\.[0-9]{6}' | sort -u
# -> 2.44.0.260225
```
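
The pipeline above prints every distinct match, so it is worth confirming that exactly one build ID came back before downloading an SDK. A minimal sketch, assuming GNU `grep` (its `-a` flag stands in for `strings`); the helper name `qairt_version_of` is hypothetical, not part of any tool:

```bash
# Hypothetical helper: print the single QAIRT build ID embedded in a file,
# or fail if zero or several distinct IDs are found.
qairt_version_of() {
  local file="$1" versions count
  # -a: treat the binary as text; -o: print only the matching part
  versions=$(grep -aEo '2\.4[0-9]\.0\.[0-9]{6}' "$file" | sort -u || true)
  # count non-empty lines (grep -c exits 1 on zero matches, hence `|| true`)
  count=$(printf '%s\n' "$versions" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one QAIRT version, found $count" >&2
    return 1
  fi
  printf '%s\n' "$versions"
}

# Usage (on the board):
#   cd ~/iq8-gemma && qairt_version_of gemma-4-E2B-it_qualcomm_qcs8275.litertlm
```

If it reports more than one version, inspect the raw `strings` output by hand rather than picking one.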

This model needs **QAIRT 2.44.0.260225**. Download that exact SDK and unpack it:

```bash theme={null}
cd ~/iq8-gemma
curl -fL -o v2.44.0.260225.zip \
  "https://softwarecenter.qualcomm.com/api/download/software/sdks/Qualcomm_AI_Runtime_Community/All/2.44.0.260225/v2.44.0.260225.zip"
mkdir -p v2.44.0.260225 && unzip -q v2.44.0.260225.zip -d v2.44.0.260225
# QAIRT root is now: ~/iq8-gemma/v2.44.0.260225/qairt/2.44.0.260225
```
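
Before building, it can help to confirm the unpacked SDK has the two directories the run step depends on. The sketch below is an assumption-laden check, not an official layout test: the directory names come from the run command in this guide, and it assumes `libQnnHtp.so` sits in the host lib directory and `libQnnHtpV75Skel.so` in the v75 skel directory. `check_qairt_root` is a hypothetical name.

```bash
# Hypothetical helper: confirm an unpacked QAIRT root has the host-side and
# DSP-side libraries this guide relies on. File locations are assumptions.
check_qairt_root() {
  local root="$1" missing=0 f
  for f in \
    "lib/aarch64-oe-linux-gcc11.2/libQnnHtp.so" \
    "lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so"; do
    if [ ! -e "$root/$f" ]; then
      echo "missing: $root/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (on the board):
#   check_qairt_root ~/iq8-gemma/v2.44.0.260225/qairt/2.44.0.260225 && echo "QAIRT layout OK"
```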

## Build `litert_lm_main` against that QAIRT

### Build toolchain

The Prerequisites setup does not install a compiler or Bazel, so add them (LiteRT-LM builds with **clang-18**; the resulting binaries link GNU `libstdc++`):

```bash theme={null}
sudo apt-get install -y build-essential curl git git-lfs openjdk-17-jdk python3 python3-pip \
  python3-dev unzip wget zip llvm-18 clang-18 libc++-dev libc++abi-dev

# bazelisk as `bazel` (LiteRT-LM pins its Bazel version via .bazeliskrc)
curl -L -o /tmp/bazelisk \
  https://github.com/bazelbuild/bazelisk/releases/latest/download/bazelisk-linux-arm64
chmod +x /tmp/bazelisk && sudo mv /tmp/bazelisk /usr/local/bin/bazel
```

### Check out the LiteRT-LM commit that pins your QAIRT

`LITERT_QAIRT_SDK` lets Bazel use a local SDK, but the workspace's `strip_prefix` must match the *pinned* version's layout, so check out the commit on the **2.44 line**. The newest such commit is **`cbf463d97fa3`** (it pins LiteRT `d865fd82` to QAIRT 2.44.0.260225; the next commit jumps to 2.46).

```bash theme={null}
cd ~/iq8-gemma
git clone https://github.com/google-ai-edge/LiteRT-LM.git
cd ~/iq8-gemma/LiteRT-LM
git checkout cbf463d97fa3
git lfs install && git lfs pull          # fetches the real libGemmaModelConstraintProvider.so (22 MB ELF, not an LFS pointer)
```

### Apply two Linux-enablement patches

LiteRT-LM gates two pieces of Qualcomm setup behind `#if defined(__ANDROID__)`, so on desktop or embedded **Linux aarch64** they silently do not run. Both are one-liners that add `|| defined(__linux__)`.

**Patch 1: dispatch-library directory.** LiteRT-LM only derives the directory where it finds `libLiteRtDispatch_Qualcomm.so` under `__ANDROID__` or `__EMSCRIPTEN__`. Without this the NPU accelerator never registers and `DISPATCH_OP` stays unresolved:

```bash theme={null}
cd ~/iq8-gemma/LiteRT-LM
sed -i 's/#if defined(__ANDROID__) || defined(__EMSCRIPTEN__)$/#if defined(__ANDROID__) || defined(__EMSCRIPTEN__) || defined(__linux__)/' \
  runtime/util/litert_util.cc
```
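
`sed -i` exits successfully even when its pattern matches nothing, so a different commit could leave the file unpatched without any error. A small sketch of a follow-up check; `has_linux_gate` is a hypothetical helper, and the string it looks for is simply the replacement text from the `sed` command above:

```bash
# Hypothetical helper: succeed only if the file contains the widened Patch 1 gate.
has_linux_gate() {
  grep -qF '#if defined(__ANDROID__) || defined(__EMSCRIPTEN__) || defined(__linux__)' "$1"
}

# Usage (from ~/iq8-gemma/LiteRT-LM):
#   has_linux_gate runtime/util/litert_util.cc && echo "Patch 1 applied"
```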

**Patch 2: HTP burst mode.** This is the difference between 16 and about 28 tokens per second. `CreateLiteRtNpuOptions()` calls `SetHtpPerformanceMode(kBurst)` and `SetLogLevel(kOff)` **only** under `#if defined(__ANDROID__)` (there is an in-source `TODO … Bug: 498622107` admitting it). On Linux those calls are skipped, so the dispatch plugin gets `HtpPerformanceMode::kDefault`: the DSP never votes itself up to burst and decode runs at about **16 tokens per second** with QNN debug logs spamming stdout. Enable the block for Linux:

```bash theme={null}
cd ~/iq8-gemma/LiteRT-LM
python3 - runtime/executor/llm_litert_npu_compiled_model_executor.cc <<'PY'
import sys
p = sys.argv[1]  # the path given after `python3 -` on the command line above
s = open(p).read()
anchor = ("#if defined(__ANDROID__)\n"
          "  LITERT_ASSIGN_OR_RETURN(::litert::qualcomm::QualcommOptions & qnn_opts,\n"
          "                          options.GetQualcommOptions());")
assert anchor in s, "anchor not found (different commit?)"
s = s.replace(anchor, anchor.replace("#if defined(__ANDROID__)",
                                     "#if defined(__ANDROID__) || defined(__linux__)", 1), 1)
open(p, "w").write(s)
print("burst patch applied")
PY
```

(It is a targeted patch rather than a `sed` because the file has other bare `#if defined(__ANDROID__)` lines we must not touch.)

### Build

```bash theme={null}
cd ~/iq8-gemma/LiteRT-LM
export LITERT_QAIRT_SDK="$HOME/iq8-gemma/v2.44.0.260225/"     # TRAILING SLASH is required

bazel build -c opt --repo_env=CC=clang-18 --repo_env=CXX=clang++-18 \
  //runtime/engine:litert_lm_main \
  @litert//litert/vendors/qualcomm/dispatch:dispatch_api_so
```

Outputs (under `~/iq8-gemma/LiteRT-LM/bazel-bin/`):

* `runtime/engine/litert_lm_main`
* `libLiteRtDispatch_Qualcomm.so` (under the `.../qualcomm/dispatch/` tree)
* `libLiteRt.so` (the core LiteRT runtime lib)

## Assemble a run directory

Collect the binary, the dispatch plugin, the core LiteRT lib, the constraint provider, and the model into `~/iq8-gemma/run`. The `find` calls locate the build outputs wherever Bazel placed them, so this block works as is:

```bash theme={null}
cd ~/iq8-gemma/LiteRT-LM
mkdir -p ~/iq8-gemma/run
cp -fL bazel-bin/runtime/engine/litert_lm_main ~/iq8-gemma/run/
cp -fL "$(find -L bazel-bin -name libLiteRtDispatch_Qualcomm.so | head -n1)" ~/iq8-gemma/run/
cp -fL "$(find -L bazel-bin -name libLiteRt.so | head -n1)" ~/iq8-gemma/run/
cp -fL prebuilt/linux_arm64/libGemmaModelConstraintProvider.so ~/iq8-gemma/run/
ln -sf ~/iq8-gemma/gemma-4-E2B-it_qualcomm_qcs8275.litertlm ~/iq8-gemma/run/
```

The run directory now holds:

```text theme={null}
~/iq8-gemma/run/
├── litert_lm_main
├── libLiteRtDispatch_Qualcomm.so
├── libLiteRt.so
├── libGemmaModelConstraintProvider.so      # from prebuilt/linux_arm64/
└── gemma-4-E2B-it_qualcomm_qcs8275.litertlm # symlink to the 3.29 GB model
```
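
If a `find` call matches nothing, its `cp` fails while the rest of the block carries on, and `ln -sf` will happily create a symlink to a model that is not there. A minimal sketch of a completeness check, using only the five file names listed above; `check_run_dir` is a hypothetical helper:

```bash
# Hypothetical helper: report anything missing from the run directory.
check_run_dir() {
  local dir="$1" missing=0 f
  for f in litert_lm_main libLiteRtDispatch_Qualcomm.so libLiteRt.so \
           libGemmaModelConstraintProvider.so \
           gemma-4-E2B-it_qualcomm_qcs8275.litertlm; do
    # -e follows symlinks, so a dangling model symlink counts as missing
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (on the board):
#   check_run_dir ~/iq8-gemma/run && echo "run directory complete"
```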

Verify the plugin's shared-library deps all resolve. A clang-18 build links **GNU `libstdc++`**, which is already present from `build-essential`, so this should print nothing:

```bash theme={null}
cd ~/iq8-gemma/run
ldd libLiteRtDispatch_Qualcomm.so | grep 'not found'   # should print nothing
```

(If you built with `-stdlib=libc++` instead, you would need `sudo apt-get install -y libc++1 libc++abi1`; the default build here uses `libstdc++`, so you do not.)

## Run it on the NPU

This block points `LD_LIBRARY_PATH` at the run dir plus the matching QAIRT host libs, and `ADSP_LIBRARY_PATH` at the **Hexagon v75** skel, then runs as root (FastRPC and cDSP need it). The `$HOME` and `$PWD` paths are expanded by your shell before `sudo`, so it works unedited:

```bash theme={null}
cd ~/iq8-gemma/run
QAIRT="$HOME/iq8-gemma/v2.44.0.260225/qairt/2.44.0.260225"
sudo -E env \
  LD_LIBRARY_PATH="$PWD:$QAIRT/lib/aarch64-oe-linux-gcc11.2:/usr/lib" \
  ADSP_LIBRARY_PATH="$QAIRT/lib/hexagon-v75/unsigned" \
  ./litert_lm_main --backend npu \
    --model_path "$PWD/gemma-4-E2B-it_qualcomm_qcs8275.litertlm" \
    --input_prompt "Explain what Qualcomm is in two sentences."
```

Expected:

```text theme={null}
Qualcomm is a global technology company that designs, develops, and sells wireless
communication solutions and processors. They are a leading provider of technology for
smartphones, tablets, IoT, and other mobile devices, as well as for various other industries.

BenchmarkInfo:
  Time to first token: 0.08 s
  Prefill Turns (Total 1 turns):
    Prefill Turn 1: Processed 17 tokens in 39.4ms duration.
      Prefill Speed: ~1240 tokens/sec.
  Decode Turns (Total 1 turns):
    Decode Turn 1: Processed 46 tokens in ~1.6s duration.
      Decode Speed: ~28 tokens/sec.
```

(`litert_lm_main` prints the benchmark by default. With Patch 2 applied you will see `Set HTP performance mode: 2` early in the run and the QNN debug logs go quiet: that is burst mode engaging.)
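
To check both signals without scrolling, you can save the run output to a file and summarize it. This is a sketch under assumptions: `summarize_run` is a hypothetical helper, and its two patterns are modelled on the sample lines shown in this guide, so real output may need a different pattern.

```bash
# Hypothetical helper: read a saved litert_lm_main log and report whether the
# burst-mode line appeared and which decode speed was printed.
summarize_run() {
  local log="$1" speed
  if grep -qF 'Set HTP performance mode: 2' "$log"; then
    echo "burst: yes"
  else
    echo "burst: no"
  fi
  # Accept "Decode Speed: ~28 tokens/sec." with or without the leading "~"
  speed=$(sed -n 's/.*Decode Speed: ~\{0,1\}\([0-9.]*[0-9]\) tokens\/sec.*/\1/p' "$log" | head -n1)
  echo "decode_tok_s: ${speed:-unknown}"
}
```

One way to produce the log is to append `2>&1 | tee run.log` to the run command, then call `summarize_run run.log`.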

## Confirm it is really on the NPU

The model is **NPU-only**: it has no CPU graph. Asking for the CPU backend proves it:

```bash theme={null}
cd ~/iq8-gemma/run
LD_LIBRARY_PATH="$PWD:/usr/lib" ./litert_lm_main --backend cpu \
  --model_path "$PWD/gemma-4-E2B-it_qualcomm_qcs8275.litertlm" \
  --input_prompt "hi"
# INVALID_ARGUMENT: Main backend constraint mismatch.
#                   Model requires one of [npu] but Main backend is CPU
```

Since it refuses CPU and still generates correct text under `--backend npu`, execution is on the Hexagon NSP. With Patch 2 applied it runs in HTP burst mode, which the `Set HTP performance mode: 2` log line confirms.

## Performance

Measured on the IQ-8275, public model, NPU backend, both Linux patches applied, warm, CPU governor pinned to `performance`:

| Metric                           |                  Measured |
| -------------------------------- | ------------------------: |
| Decode (short output, \~200 tok) | **\~28 tok/s** (26 to 29) |
| Decode (long output, \~800 tok)  |            **\~25 tok/s** |
| Time to first token              |              **\~0.08 s** |
| Prefill (49-tok prompt)          |         **\~1,240 tok/s** |
| Model load (executor init)       |                     \~2 s |

Notes:

* **Burst mode is everything.** *Without* Patch 2 the from-source build runs decode at about **16 tok/s**; with it, 25 to 29. If you see about 16 and a wall of QNN logs, Patch 2 did not take.
* **Decode falls off as the output grows.** Each decode step attends over the whole KV cache, so a 200-token answer averages about 28 tok/s and an 800-token one about 25; the first tokens (short context) are the fastest. This is not thermal: the SoC sits around 43 °C throughout.
* **Pin the CPU governor** for stable numbers (the DSP still ramps its own clock on the first inference after boot):
  ```bash theme={null}
  echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  ```
* **Prefill tok/s is not comparable across prompt lengths.** On a one-line prompt the \~1,240 figure is dominated by fixed overhead, so it is not meaningful on its own.
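
The average decode rates in the table translate directly into wall-clock time per answer (tokens divided by tokens per second). A small worked example; `decode_seconds` is a hypothetical helper that only does the division:

```bash
# Hypothetical helper: seconds needed to decode N tokens at an average tok/s.
decode_seconds() {
  LC_ALL=C awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r }'
}

decode_seconds 200 28   # -> 7.1
decode_seconds 800 25   # -> 32.0
```

So the short answer in the table takes roughly 7 seconds to stream and the long one about half a minute.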

## Under the hood: what is actually happening

Worth understanding, because every step above maps to a layer in this stack.

### The `.litertlm` is a container, not a tflite file

The file starts with the magic bytes `LITERTLM`. Inside it bundles everything the runtime needs: the **SentencePiece tokenizer**, model metadata (chat template, EOS/EOA tokens, the backend constraint that makes it NPU-only), the LiteRT model graph, and, the important part for NPU, pre-compiled **QNN context binaries**. For this model those are two graphs, `qnn_partition_0` and `qnn_partition_1` (the transformer is split across two HTP contexts). The weights are **w4a16** (4-bit weights, 16-bit activations): that is how a roughly 2B-parameter model fits and runs fast on the NSP.

### The execution path, layer by layer

```text theme={null}
litert_lm_main
  └─ LiteRT-LM Engine (tokenizer, sampler, KV-cache, prefill/decode loop)
       └─ LiteRT CompiledModel  ── graph contains a custom op: DISPATCH_OP
            └─ Dispatch delegate  → libLiteRtDispatch_Qualcomm.so   (the "NPU accelerator")
                 └─ QNN HTP backend → libQnnHtp.so / libQnnSystem.so   (host side)
                      └─ FastRPC → libcdsprpc.so → /dev/fastrpc-cdsp   (the RPC transport)
                           └─ Hexagon v75 NSP runs libQnnHtpV75Skel.so (the DSP side)
```

The LiteRT graph is not a normal tflite network: it is mostly a single **`DISPATCH_OP`**, a custom op that is a placeholder for "run this pre-compiled vendor graph." When the NPU accelerator registers, LiteRT loads `libLiteRtDispatch_Qualcomm.so`, which hands the QNN context binary to the **QNN HTP backend**. QNN talks to the Hexagon NSP over **FastRPC** (a remote-procedure-call transport to the DSP, via `libcdsprpc.so` and `/dev/fastrpc-cdsp`); the actual matmuls execute inside `libQnnHtpV75Skel.so`, the QNN "skeleton" loaded **on** the v75 NSP and found via `ADSP_LIBRARY_PATH`. So three things must agree: the **host** QNN libs, the **DSP** skel, and the context binary inside the model, all the same QAIRT release. That is why the version step matters.

### Why the version must match exactly

A QNN context binary is **ahead-of-time compiled and serialized** for one QAIRT version: its graph format, op-package set, and the skel ABI it expects are all baked in. Load it on a different runtime and, best case, deserialization is refused. The version is not documented up front, and LiteRT's `main` pins a newer release (2.47), so neither the documentation nor the pin is authoritative. The serialized build-id inside the file (`v2.44.0.260225…`) is, which is why we run `strings` on the model rather than trust a pin.
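
The match can be checked mechanically by comparing the build ID embedded in the model with the version component of the staged SDK path. A minimal sketch, assuming the SDK root ends in `qairt/<version>` as it does in this guide and that GNU `grep -a` is available; `versions_match` is a hypothetical helper:

```bash
# Hypothetical helper: compare the build ID embedded in a model file with the
# last path component of a QAIRT root (<unzip dir>/qairt/<version>).
versions_match() {
  local model="$1" qairt_root="$2" model_ver sdk_ver
  model_ver=$(grep -aEo '2\.4[0-9]\.0\.[0-9]{6}' "$model" | sort -u | head -n1 || true)
  sdk_ver=$(basename "$qairt_root")
  echo "model: ${model_ver:-none}  sdk: $sdk_ver"
  [ -n "$model_ver" ] && [ "$model_ver" = "$sdk_ver" ]
}

# Usage (on the board):
#   versions_match ~/iq8-gemma/gemma-4-E2B-it_qualcomm_qcs8275.litertlm \
#     ~/iq8-gemma/v2.44.0.260225/qairt/2.44.0.260225
```

This only compares names; it does not inspect the libraries inside the SDK.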

### Prefill versus decode, and the KV cache

Generation is two phases. **Prefill** runs the whole prompt through the transformer once to build the **KV cache** (per-layer key/value tensors): it is compute-bound and embarrassingly parallel, so it is fast per token. **Decode** then generates one token at a time, each step attending over the growing KV cache: it is memory-bandwidth-bound (you stream the 4-bit weights through the NSP every token), which is why decode (about 28 tok/s on Ubuntu, about 32 on QLI) is far slower per token than prefill and is the number that actually bounds interactive latency.

### Burst mode: why the build is 16 tok/s until you patch it

The Hexagon NSP runs under **DCVS** (dynamic clock and voltage scaling): left alone it idles at a low clock and only ramps under sustained load. QNN exposes a **performance mode** to override that. `HtpPerformanceMode::kBurst` makes the runtime *vote* the DSP up to its top clock and hold it there (plus RPC-polling to cut FastRPC latency). LiteRT-LM's NPU executor does request burst, but only inside `#if defined(__ANDROID__)` (Patch 2). On Linux, with that block compiled out, the dispatch plugin reports `Failed to parse qnn options … Null Qualcomm options`, falls back to `HtpPerformanceMode::kDefault`, and the DSP runs at its lazy default clock: decode about 16 tok/s. Apply Patch 2 and the log shows `Set HTP performance mode: 2`; decode jumps to 25 to 29. This single gate is the largest performance lever in the whole stack, far bigger than anything else here.

### The `err 1002` weight-buffer message

During graph init you will see `fastrpc memory map for fd: … length: 1172307968 failed … err 1002`. That is QNN trying to map the roughly **1.17 GB persistent weight buffer** into the cDSP's IOMMU in one shot via FastRPC. The stock Ubuntu BSP does not reserve a large FastRPC DMA region (`dmesg`: `no reserved DMA memory for FASTRPC`, CMA only about 164 MB), so that *single* map request is rejected. It is **non-fatal**: QNN falls back to another path to get the weights to the NSP and the graphs still execute on the v75 (the model generates correct text either way). Whether it leaves decode throughput on the table is hard to isolate from the KV-cache-length falloff above; on this BSP, with burst mode on, decode lands at 25 to 29 tok/s with the message present.
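
To put the two numbers side by side, you can compare the requested mapping length with the CMA pool the kernel reports. Treat this as a rough hint only: the sketch assumes `/proc/meminfo` exposes a `CmaTotal` line, and this guide does not establish that the FastRPC mapping is served from CMA. `cma_fits` is a hypothetical helper.

```bash
# Hypothetical helper: would a single buffer of N bytes fit in the CMA pool
# reported by a meminfo-style file? Prints both sizes in kB.
cma_fits() {
  local meminfo="$1" bytes="$2" cma_kb
  cma_kb=$(awk '/^CmaTotal:/ { print $2 }' "$meminfo")
  echo "CmaTotal: ${cma_kb:-0} kB, buffer: $(( bytes / 1024 )) kB"
  [ "${cma_kb:-0}" -ge $(( bytes / 1024 )) ]
}

# Usage (on the board), with the length from the log message:
#   cma_fits /proc/meminfo 1172307968 || echo "single-shot map will not fit"
```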

### Why decode slows as the answer grows

Decode is **memory-bandwidth-bound** and gets *slower per token as the sequence lengthens*: every step attends over the entire KV cache, which grows with each token emitted. So a 200-token answer averages about 28 tok/s while an 800-token one averages about 25. The first tokens (short context) are the fastest. There is also a small first-inference-after-boot ramp while DCVS spins up; pinning the CPU governor to `performance` and warming up collapses that part.

## Troubleshooting

| Symptom                                                                                                     | Cause and fix                                                                                                                                                                                                    |
| ----------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `NPU accelerator could not be loaded and registered: InvalidArgument`, then `DISPATCH_OP failed to prepare` | The Linux dispatch-dir gate. Apply the `\|\| defined(__linux__)` patch (Patch 1) and rebuild.                                                                                                                    |
| Decode stuck at **\~16 tok/s**, plus walls of QNN `[INFO]` logs and `Null Qualcomm options`                 | Burst mode is not set. Apply **Patch 2** and rebuild; you should then see `Set HTP performance mode: 2` and the logs go quiet.                                                                                   |
| `Failed to create device`, or no FastRPC                                                                    | Missing FastRPC userland. Re-run the software-package install so `libcdsprpc.so` is present.                                                                                                                     |
| `fastrpc memory map … err 1002`, or `Failed to map weights buffer`                                          | **Non-fatal: the model still runs and generates correct text.** It is QNN failing to map the 1.17 GB weight buffer in one shot (no reserved FastRPC DMA region on the stock BSP); it falls back to another path. |
| `TF_LITE_PREFILL_DECODE not found`, or mmap errors                                                          | Truncated `.litertlm`. Re-download; check the size is 3.29 GB and the sha256 matches Hugging Face.                                                                                                               |
| `Main backend constraint mismatch … requires [npu]`                                                         | Expected: the model is NPU-only. Use `--backend npu`.                                                                                                                                                            |

## Part 1 in one line

**Match the runtime to the model** (QAIRT 2.44.0.260225, read from the file), **build LiteRT-LM at the commit that pins it** (with the two Linux-enablement patches: dispatch dir plus **HTP burst mode**), **stage the matching QAIRT host libs and the Hexagon v75 skel**, and run `--backend npu`. Burst mode is what turns a 16 tok/s build into a roughly 28 tok/s one; the scary `err 1002` is non-fatal.

## Part 2: Production on Qualcomm Linux (Yocto)

Ubuntu (Part 1) is the fast way to prototype. **Qualcomm Linux (QLI) 2.0** is the Yocto-based embedded OS you would actually **ship** on these boards: a from-source image you build and control, with the Qualcomm AI stack baked in. Same model, same QAIRT 2.44, same `--backend npu`, but QLI exacts two small, specific costs that Ubuntu did not: **one rebuild patch** (a SoC-config fix for the QNN `14001` error) and **one runtime line** (`ulimit -l unlimited`). So the arc is: build the image, flash it, rebuild the binary with the SoC patch, then stage and run.

What is different from Ubuntu:

* QLI ships the **FastRPC userland, cDSP firmware, and QNN runtime natively** (no `apt`; it is in the image), so you skip the Prerequisites package install from Part 1.
* The rootfs is a Yocto image, not Debian, so you **stage** the LiteRT-LM binary, QAIRT 2.44, and model onto it (scp or a data partition) rather than `apt install`.
* You build the OS image yourself on a Linux PC, then flash it to the board.

### What changes from the Ubuntu build (the QLI delta)

You do not start over for QLI; you **carry the Part 1 work forward**. The model, QAIRT 2.44, the dispatch to QNN to FastRPC to Hexagon path, `--backend npu`, and the two Part 1 patches (dispatch-dir plus burst) are all **unchanged**. There are exactly **two functional deltas** to go from the working Ubuntu binary to a working QLI run, one at build time and one at runtime, plus the packaging change (Yocto image instead of `apt`):

| In Part 1 (Ubuntu)                             | For QLI you additionally need                                                                                                       | Why it is needed                                                                                                                             |
| ---------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| 2 build patches: dispatch-dir plus burst       | **+1 build patch:** an `htp_backend.cc` SoC-guard, then rebuild                                                                     | QLI's QNN runtime **rejects** the forced SoC config that Ubuntu silently tolerated, giving `QnnDevice_create 14001`                          |
| root mlock is unlimited by default             | **`ulimit -l unlimited`** before launching                                                                                          | QLI caps locked memory at 8 MB; FastRPC must *pin* the roughly 1.17 GB weight buffer, giving `Could not allocate persistent weights buffer!` |
| Prerequisites plus `apt install` of the stack  | **bake and stage** instead: build the Yocto image, stage QAIRT 2.44 plus the model (no `apt`; `libstdc++` is already in the rootfs) | QLI is a from-source Yocto rootfs, not Debian                                                                                                |
| (none)                                         | *(optional)* `download_mode=0` while bringing up                                                                                    | so a cDSP fault reboots instead of dropping to EDL ramdump (`900e`) needing a power-cycle                                                    |

Note the **burst patch is not a QLI thing**: it is required for *any* from-source Linux build, Ubuntu included (it is the 16 to 28 tok/s fix from Part 1). The genuinely QLI-specific deltas are just the **SoC-guard rebuild** and the **`ulimit -l`** line. The same binary that did about 28 tok/s on Ubuntu, rebuilt with that one extra patch, does about 32 on QLI.
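
The `ulimit -l` requirement is simple arithmetic: the shell reports the locked-memory limit in KiB, and it has to cover the roughly 1.17 GB weight buffer. A minimal sketch of that comparison; `memlock_ok` is a hypothetical helper, and the byte count is the mapping length quoted in the `err 1002` message from Part 1:

```bash
# Hypothetical helper: is a `ulimit -l` value (KiB, or "unlimited") large
# enough to lock a buffer of the given size in bytes?
memlock_ok() {
  local limit="$1" bytes="$2"
  [ "$limit" = "unlimited" ] && return 0
  # round the buffer size up to whole KiB before comparing
  [ "$limit" -ge $(( (bytes + 1023) / 1024 )) ]
}

# Usage (on the board, in the shell that will launch litert_lm_main):
#   memlock_ok "$(ulimit -l)" 1172307968 || ulimit -l unlimited
```

With QLI's 8 MB default (8192 KiB) the check fails, which matches the allocation error in the table.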

### Build host: requirements and what to expect

**You do not build the image on the IQ8.** You build it on a Linux PC (the "build host") and flash the result to the board. Everything runs inside a container via `kas-container`, so the only host dependency is Docker.

**Build host prerequisites:**

|         | Requirement                                | Notes                                                                                                        |
| ------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------ |
| OS      | Any x86\_64 Linux                          | The build runs in a kas/Docker container, so the host distro barely matters.                                 |
| Docker  | Installed, your user in the `docker` group | `kas-container` uses it. (Podman also works.)                                                                |
| Disk    | **\~250 GB free**                          | downloads about 30 GB, sstate cache about 30 GB, `build/tmp` about 100 to 150 GB. An SSD/NVMe matters a lot. |
| RAM     | **32 GB min, 64 GB comfortable**           | Parallel compiles and linking (LLVM, mesa, the kernel) are memory-hungry.                                    |
| CPU     | **As many cores as you can get**           | Yocto compiles about 14,800 tasks; it scales almost linearly with cores.                                     |
| Tools   | `git`, `wget`/`curl`, `docker`             | plus the `kas-container` script (one download, below).                                                       |
| Network | Fast, unmetered                            | the first build downloads tens of GB of sources.                                                             |

**Machine recommendation:** this is the one job where a **many-core workstation pays for itself**. A **Threadripper or EPYC (32 to 64 cores)** chews through a cold build in about **1 to 2 hours**; the same build on a typical 8-core laptop is an **all-afternoon (about 8 to 10 h)** affair. More cores means proportionally less wall-clock.

**Estimated build times:**

| Scenario                                     | 8-core laptop              | 16-core server | 32 to 64-core Threadripper/EPYC |
| -------------------------------------------- | -------------------------- | -------------- | ------------------------------- |
| **Cold** (empty cache, first ever build)     | \~8 to 10 h plus downloads | \~3 to 4 h     | **\~1 to 2 h**                  |
| **Warm** (sstate cache present, incremental) | \~20 to 40 min             | **\~7 min** ✅  | \~5 min                         |

<Tip>
  The **about 7 min warm** figure is measured here on a 16-core AMD EPYC 7763 with a populated `sstate-cache` and `downloads`. The cold figures are estimates: the variable is cores and download speed, not much else. Keep your `sstate-cache` and `downloads` directories between builds (point `SSTATE_DIR` and `DL_DIR` at them); that is the difference between 7 minutes and 4 hours.
</Tip>
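
One way to keep those caches is to create them outside the build tree and export the two variables in the shell that later runs `kas-container`. This is a sketch, not the only layout: `CACHE_ROOT` and the `yocto-cache` path are hypothetical names, and it assumes `kas-container` picks up `SSTATE_DIR` and `DL_DIR` from the environment, as the tip implies.

```bash
# Hypothetical cache location; any path on a large, fast disk works.
CACHE_ROOT="${CACHE_ROOT:-$HOME/yocto-cache}"
mkdir -p "$CACHE_ROOT/sstate-cache" "$CACHE_ROOT/downloads"

# Export so the build started from this shell can see them.
export SSTATE_DIR="$CACHE_ROOT/sstate-cache"
export DL_DIR="$CACHE_ROOT/downloads"
```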

### Set up the build tree

Install Docker (once), grab `kas-container`, and pull the QLI 2.0 layers at their locked revisions:

```bash theme={null}
# Docker (Ubuntu host example), once
sudo apt-get update && sudo apt-get install -y docker.io git
sudo usermod -aG docker "$USER"   # log out/in for this to take effect

# kas-container (the only build tool you need on the host)
wget -qO kas-container https://raw.githubusercontent.com/siemens/kas/refs/tags/5.1/kas-container
chmod +x kas-container

# QLI 2.0 release manifest plus all meta layers, pinned to one lockfile
git clone -b qli-2.0 https://github.com/qualcomm-linux/meta-qcom-releases
./kas-container checkout meta-qcom-releases/lock.yml     # clones meta-qcom + all deps at locked commits
cp meta-qcom-releases/lock.yml meta-qcom/ci/lock.yml
```

### Build the IQ-8275 image

The build target is a colon-joined list of kas config fragments: machine, image, kernel, lockfile:

```bash theme={null}
export KAS_CONTAINER_ENGINE=docker
./kas-container build \
  meta-qcom/ci/iq-8275-evk.yml:\
meta-qcom/ci/qcom-distro-multimedia-image.yml:\
meta-qcom/ci/linux-qcom-6.18.yml:\
meta-qcom/ci/lock.yml
```

This produces the flashable bundle (about 927 MB):

```text theme={null}
build/tmp/deploy/images/iq-8275-evk/qcom-multimedia-image-iq-8275-evk.rootfs.qcomflash.tar.gz
```

Inside it: the firehose programmer (`prog_firehose_ddr.elf`), partition tables, the SAIL bootloader chain, and `rawprogram*.xml`/`patch*.xml`, everything `qdl` needs. Confirm the AI stack is in the image:

```bash theme={null}
grep -E 'fastrpc|qairt|hexagon-dsp-binaries|tensorflow' \
  build/tmp/deploy/images/iq-8275-evk/qcom-multimedia-image-*.manifest
# fastrpc 1.0.4 / kernel-module-fastrpc / hexagon-dsp-binaries-…-iq8275-evk-cdsp / qairt-sdk-hexagon-v75 2.43 …
```

Note the image ships **QAIRT 2.43**; our model needs **2.44**, so (as on Ubuntu) we stage 2.44 ourselves below.

### Flash the board (EDL plus qdl)

Put the IQ8 into **EDL (emergency download) mode** and flash with **`qdl`** (Linux/macOS) or `qdl.exe` (Windows). The QLI build guide has the full matrix; the short path:

```bash theme={null}
# 1) extract the bundle
tar -xzf qcom-multimedia-image-iq-8275-evk.rootfs.qcomflash.tar.gz
cd qcom-multimedia-image-iq-8275-evk

# 2) put the board in EDL: from a running shell `sudo reboot edl`, or the boot button method
#    (host then enumerates a "Qualcomm HS-USB QDLoader 9008" device)

# 3) flash
qdl prog_firehose_ddr.elf rawprogram*.xml patch*.xml
```
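
On a Linux host you can confirm the board actually entered EDL before running `qdl`: a device in that mode enumerates with USB ID `05c6:9008`. A minimal sketch that filters `lsusb` output; `edl_present` is a hypothetical helper:

```bash
# Hypothetical helper: read `lsusb` output on stdin and succeed if a device in
# Qualcomm EDL mode (USB ID 05c6:9008) is listed.
edl_present() {
  grep -qi 'ID 05c6:9008'
}

# Usage (on the flashing host, before running qdl):
#   lsusb | edl_present && echo "board is in EDL"
```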

Power-cycle out of EDL; the board boots QLI 2.0. Log in as **`root`** (password `oelinux123` on this image), then confirm:

```bash theme={null}
tr '\0' '\n' < /proc/device-tree/compatible   # qcom,monaco-evk / qcom,qcs8300
cat /sys/devices/soc0/machine                 # QCS8275
ls /dev/fastrpc-cdsp                           # FastRPC present (shipped in the image)
```

Note the **device tree calls the SoC `qcs8300`** even though `machine` reads `QCS8275` (they are the same v75 part; `soc_id` 675). That naming split is the backdrop to the QNN failure we patch next.

### One more patch for QLI: the SoC-config fix (`QnnDevice_create` 14001)

The Part 1 binary runs on Ubuntu, but on QLI it dies at init with `Failed to set up QNN manager` or `QnnDevice_create … 14001`. Root cause: LiteRT's QNN backend, on **aarch64**, *forces* a `QnnHtpDevice_CustomConfig` SOC option (`htp_backend.cc`) built from the online-detected SoC. The Ubuntu QNN runtime tolerates that forced override; **QLI's rejects it**. (It is not even a wrong value: the SoC table maps both `QCS8275` and `QCS8300` to the same enum, v75, and 8 MB VTCM. QLI simply will not accept an explicit SOC override on this path.) The fix is to let aarch64 auto-detect by compiling the forced block out. It lives in the `@litert` external, so patch *after* `bazel fetch` and *before* `bazel build`:

```bash theme={null}
cd ~/iq8-gemma/LiteRT-LM
export LITERT_QAIRT_SDK="$HOME/iq8-gemma/v2.44.0.260225/"

# 1) materialize the @litert external
bazel fetch -c opt --repo_env=CC=clang-18 --repo_env=CXX=clang++-18 \
  //runtime/engine:litert_lm_main @litert//litert/vendors/qualcomm/dispatch:dispatch_api_so

# 2) guard the forced SOC custom-config with #if x86 (so aarch64 auto-detects)
HB=$(find "$(bazel info output_base)/external" -path '*qualcomm/core/backends/htp_backend.cc' | head -1)
chmod u+w "$HB"                            # make the fetched source writable BEFORE patching it
python3 - "$HB" <<'PY'
import sys; p=sys.argv[1]; s=open(p).read()
guard="#if defined(__x86_64__) || defined(_M_X64)\n"
a1="  std::vector<QnnDevice_CustomConfig_t> device_custom_configs;\n"
a2=("  device_custom_configs.emplace_back(\n"
    "      static_cast<QnnDevice_CustomConfig_t>(htp_device_custom_config));\n")
assert a1 in s and a2 in s, "anchors not found (different commit?)"
# re-running must not insert a second guard
if a1+guard in s: print("already patched, nothing to do"); sys.exit(0)
s=s.replace(a1, a1+guard,1)
s=s.replace(a2, a2+"#endif\n",1)
open(p,"w").write(s); print("htp soc-config patch applied")
PY

# 3) rebuild WITHOUT re-fetching (keeps the patched external)
bazel build --nofetch -c opt --repo_env=CC=clang-18 --repo_env=CXX=clang++-18 \
  //runtime/engine:litert_lm_main \
  @litert//litert/vendors/qualcomm/dispatch:dispatch_api_so
```

This binary is a **superset**: dropping the forced SOC config is a no-op on Ubuntu, so the one binary runs on both operating systems. (If you only target QLI, build with all three patches from the start.)
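
Since the patch edits a file inside Bazel's external directory, it is worth checking that the guard is really in place before a long rebuild. A minimal sketch, assuming the guard line appears in `htp_backend.cc` only when the patch has been applied; `soc_patch_applied` is a hypothetical helper, not part of LiteRT.

```bash
# Hypothetical helper: succeed if the x86-only guard line is present in the given file.
# -x matches the whole line, -F treats the pattern as a fixed string (no regex).
soc_patch_applied() {
  grep -qxF '#if defined(__x86_64__) || defined(_M_X64)' "$1"
}

# If HB is still set from the patch step, report the state of that file.
if [ -n "${HB:-}" ]; then
  if soc_patch_applied "$HB"; then
    echo "SoC-config guard present in $HB"
  else
    echo "SoC-config guard missing in $HB"
  fi
fi
```

A fresh `bazel fetch` can replace the external and silently drop the guard, which is the case this check is meant to catch.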

### Stage the runtime on QLI

QLI gives you FastRPC plus cDSP firmware for free, but it is a Yocto rootfs with no `apt`. You stage three things the image does not ship: the **`litert_lm_main` binary and its `.so` files** (the SoC-patched rebuild), the **QAIRT 2.44** SDK, and the **model**. There is no C++ runtime to add: the binaries link GNU `libstdc++.so.6`, which the rootfs already has. The board has networking, so pull the big files directly:

```bash theme={null}
# on the QLI board
mkdir -p /opt/iq8-gemma/run && cd /opt/iq8-gemma

# QAIRT 2.44 (matches the model)
curl -fL -o q.zip "https://softwarecenter.qualcomm.com/api/download/software/sdks/Qualcomm_AI_Runtime_Community/All/2.44.0.260225/v2.44.0.260225.zip"
mkdir -p v2.44.0.260225 && unzip -q q.zip -d v2.44.0.260225

# the model
curl -fL -o run/gemma-4-E2B-it_qualcomm_qcs8275.litertlm \
  "https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/resolve/main/gemma-4-E2B-it_qualcomm_qcs8275.litertlm"
```

Then `scp` the three rebuilt artifacts into `run/`: `litert_lm_main`, `libLiteRtDispatch_Qualcomm.so`, and `libGemmaModelConstraintProvider.so`.
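
A forgotten `.so` tends to surface only as a loader error at run time, so a quick presence check after copying can save a round trip. This is a small sketch; `missing_artifacts` is a hypothetical helper that only tests that the three files named above exist, not that they are the patched builds.

```bash
# Hypothetical helper: print the name of each rebuilt artifact missing from a directory.
# Prints nothing when all three are present.
missing_artifacts() {
  local dir="$1" f
  for f in litert_lm_main libLiteRtDispatch_Qualcomm.so libGemmaModelConstraintProvider.so; do
    [ -e "$dir/$f" ] || echo "$f"
  done
}
```

On the board, `missing_artifacts /opt/iq8-gemma/run` should print nothing before you move on to the run step.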

### Run on the NPU (plus the one runtime gotcha: `ulimit -l`)

QLI's default **`max locked memory` is 8 MB** (`ulimit -l` returns 8192, in KB); on Ubuntu, root's limit is unlimited. FastRPC **pins** the model's roughly 1.17 GB persistent weight buffer, which blows past 8 MB, so QNN reports `Could not allocate persistent weights buffer!` and the load aborts (the first cold attempt actually faulted the cDSP). **Raise the limit before running** and the `err 1002` weights-map error degrades to the same harmless fallback you saw on Ubuntu:

```bash theme={null}
cd /opt/iq8-gemma/run
ulimit -l unlimited                       # <-- the QLI fix; without it the model won't load
QAIRT=/opt/iq8-gemma/v2.44.0.260225/qairt/2.44.0.260225
LD_LIBRARY_PATH="$PWD:$QAIRT/lib/aarch64-oe-linux-gcc11.2:/usr/lib" \
ADSP_LIBRARY_PATH="$QAIRT/lib/hexagon-v75/unsigned" \
  ./litert_lm_main --backend npu \
    --model_path "$PWD/gemma-4-E2B-it_qualcomm_qcs8275.litertlm" \
    --input_prompt "Explain what Qualcomm is in two sentences."
```
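
If you wrap the run in a script, a pre-flight check on the locked-memory limit avoids hitting the allocation failure at all. The following is a sketch under assumptions: `memlock_fits` and `NEED_KIB` are hypothetical names, and the threshold is only an approximation of the roughly 1.17 GB buffer quoted above, not a value reported by QNN.

```bash
# Hypothetical helper: succeed if a locked-memory limit can hold a buffer.
# $1 = limit as printed by `ulimit -l` (KiB, or the word "unlimited")
# $2 = required size in KiB
memlock_fits() {
  local limit="$1" need_kib="$2"
  [ "$limit" = "unlimited" ] && return 0
  [ "$limit" -ge "$need_kib" ]
}

# Roughly 1.17 GB of pinned weights, expressed in KiB (approximate).
NEED_KIB=$((1170 * 1024))

if memlock_fits "$(ulimit -l)" "$NEED_KIB"; then
  echo "memlock OK"
else
  echo "memlock too low: run 'ulimit -l unlimited' first"
fi
```

With QLI's default of 8192 the check fails, and after `ulimit -l unlimited` it passes.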

<Warning>
  **Crash safety while bringing this up.** QLI ships `qcom_scm.download_mode=1`, so a cDSP or kernel fault dumps the SoC into EDL ramdump mode (USB `900e`) and needs a *physical* power-cycle to recover. Run `echo 0 > /sys/module/qcom_scm/parameters/download_mode` so a fault just reboots instead. (With `ulimit -l` raised there is no fault; this is only a safety net.)
</Warning>

You will see `HtpPerformanceMode: 2`, **no 14001**, correct generated text, and on a long run:

```text theme={null}
BenchmarkInfo:
  Time to first token: 0.06 s
  Prefill Turn 1: Processed 49 tokens in 32.3ms duration.
    Prefill Speed: ~1520 tokens/sec.
  Decode  Turn 1: Processed 799 tokens in 24.6s duration.
    Decode  Speed: 32.49 tokens/sec.
```
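
The reported speeds are just token count divided by duration, which makes them easy to sanity-check from the other two numbers in the log. A minimal sketch, with `tok_per_sec` as a hypothetical helper built on `awk`:

```bash
# Hypothetical helper: tokens per second from a token count and a duration in seconds.
tok_per_sec() {
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.2f\n", n / s }'
}

tok_per_sec 799 24.6      # decode: 799 tokens in 24.6 s
tok_per_sec 49 0.0323     # prefill: 49 tokens in 32.3 ms, expressed in seconds
```

This prints `32.48` and `1517.03`. Both agree with the logged `32.49` and `~1520` to within the rounding of the displayed durations.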

### Prototype versus production: the payoff

Same model, same NPU, same QAIRT, measured on the **same physical board**, first as the Ubuntu prototype, then reflashed to QLI 2.0 (governor `performance`, warm):

| Metric                   |     Ubuntu (measured) |        QLI 2.0 (measured) |
| ------------------------ | --------------------: | ------------------------: |
| Decode, \~200-tok output | \~28 tok/s (26 to 29) | **\~32 tok/s (31 to 33)** |
| Decode, \~800-tok output |            \~25 tok/s |          **\~32.5 tok/s** |
| Time to first token      |                0.08 s |                **0.06 s** |
| Prefill (49-tok prompt)  |         \~1,240 tok/s |         **\~1,520 tok/s** |

The punchline of the prototype-to-production arc: **QLI 2.0 is not a downgrade, it is faster and steadier.** Unlike Ubuntu, it **holds about 32 tok/s even across an 800-token generation** instead of sagging to about 25. The likely reason is exactly what makes it a production target: the lean, single-purpose image leaves the CPU and scheduler far less contended, so the NSP's DCVS holds its clock. You give up `apt` convenience and pay two QLI-specific costs (one rebuild patch for the SoC config and one runtime line, `ulimit -l`), and in return you get a reproducible, version-pinned, from-source image that runs the model *better* than the box you prototyped on.
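
To put the table in relative terms, the gain of QLI over Ubuntu can be computed from each pair of throughput figures. This is a small sketch using the approximate values from the table; `pct_gain` is a hypothetical helper, and the results inherit the rounding of the inputs.

```bash
# Hypothetical helper: percentage gain of a new throughput over a baseline.
pct_gain() {
  awk -v base="$1" -v now="$2" 'BEGIN { printf "%.1f\n", (now - base) * 100 / base }'
}

pct_gain 28 32        # decode, ~200-token output
pct_gain 25 32.5      # decode, ~800-token output
pct_gain 1240 1520    # prefill, 49-token prompt
```

This prints `14.3`, `30.0`, and `22.6`, so the largest relative gain is on the long generation, where Ubuntu sags and QLI does not.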

## Next steps

* Swap in your own prompt, or wire `litert_lm_main` behind a small local API for an on-device assistant.
* Try other LiteRT-LM models built for the v75 NSP, reading each one's required QAIRT version from the file before staging.
* For the production path, fold the SoC-config patch in from the start and bake your run directory into the Yocto image.
* Compare throughput on other Dragonwing parts by repeating the version-match step for that device's NSP.
