Run Gemma-4 E2B on the IQ8 NPU with LiteRT-LM - Qualcomm Dragonwing Documentation

Dragonwing Team · Jun 29, 2026

This guide runs Google’s Gemma-4 E2B on the Hexagon NPU of a Dragonwing IQ-8275 (QCS8275). You build Google’s LiteRT-LM runtime from source against the exact Qualcomm AI runtime the model was compiled with, then run the public, Apache-2.0 .litertlm directly on the NPU at about 28 tokens per second decode. Every command is copy-pasteable.

Target: IQ-8275 EVK, Ubuntu 24.04 (aarch64), Hexagon v75 NSP.
Model: litert-community/gemma-4-E2B-it-litert-lm, specifically gemma-4-E2B-it_qualcomm_qcs8275.litertlm (3.29 GB). This file is NPU-only.

Total time is about 15 minutes of device setup plus a one-time source build (around 45 minutes building on the IQ8 itself, faster on a beefier aarch64 box).

How to read this guide. Every command runs on the IQ-8275 as the ubuntu user. All work lives in one directory, ~/iq8-gemma, and every code block starts with its own cd, so you can paste any block into any fresh terminal, in order, without tracking which directory you are in. Nothing needs editing before you paste it.

What you will do

Set up the IQ8 device and confirm FastRPC is present.
Download the public NPU model and read the exact QAIRT version it needs.
Build litert_lm_main from LiteRT-LM against that QAIRT, with two Linux-enablement patches.
Assemble a run directory and run the model on the NPU.
(Optional) Move from the Ubuntu prototype to a Qualcomm Linux (Yocto) production image.

Prerequisites

Before anything below works, the board needs its Qualcomm peripherals enabled and the AI runtime installed: the FastRPC userland (libcdsprpc.so), the QNN libraries (libqnn-dev, qnn-tools, snpe-tools, tensorflow-lite-qcom-apps), qcom-libdmabufheap, and the GStreamer QCOM plugins, plus a firmware update and reboot. Set this up first by following the IQ8 device pages, then come back here.

After the reboot, reconnect to the board and confirm FastRPC is present:

ls /dev/fastrpc-cdsp                 # must exist
ldconfig -p | grep cdsprpc           # libcdsprpc.so[.1] present

If /dev/fastrpc-* does not exist, the kernel lacks FastRPC support. Stop here: that is a BSP or image problem, not something you can fix in userland.

Get the model and find the QAIRT version it needs

Do not guess the version. Create the working directory and download the public NPU model into it (3.29 GB; -C - resumes if the connection drops):

mkdir -p ~/iq8-gemma
cd ~/iq8-gemma
curl -fL -C - -o gemma-4-E2B-it_qualcomm_qcs8275.litertlm \
  "https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/resolve/main/gemma-4-E2B-it_qualcomm_qcs8275.litertlm"

A Qualcomm .litertlm embeds QNN context binaries that are version-locked to the QAIRT release they were compiled with, and the on-device runtime must match. The version is not advertised up front, and LiteRT’s source pins a newer one, so read it straight from the file:

cd ~/iq8-gemma
strings gemma-4-E2B-it_qualcomm_qcs8275.litertlm | grep -Eo '2\.4[0-9]\.0\.[0-9]{6}' | sort -u
# -> 2.44.0.260225
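If strings is not available, the same lookup can be done with a short script. This is a minimal sketch, not part of LiteRT-LM or QAIRT: `find_qairt_versions` is an illustrative helper that applies the same pattern as the grep above to the raw bytes, reading in chunks so the 3.29 GB file never has to fit in memory.

```python
import re

# Same pattern as the grep above: 2.4x.0.<six-digit build id>
QAIRT_VERSION_RE = re.compile(rb"2\.4[0-9]\.0\.[0-9]{6}")


def find_qairt_versions(path, chunk_size=1 << 20, overlap=32):
    """Return the sorted, de-duplicated QAIRT version strings found in a file."""
    found = set()
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            window = tail + chunk
            found.update(m.decode("ascii") for m in QAIRT_VERSION_RE.findall(window))
            # Keep a short tail so a version split across two chunks is still matched.
            tail = window[-overlap:]
    return sorted(found)
```

Calling `find_qairt_versions("gemma-4-E2B-it_qualcomm_qcs8275.litertlm")` from ~/iq8-gemma should list the same version the shell pipeline prints. If it returns more than one entry, treat that as a reason to inspect the file rather than pick one.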

This model needs QAIRT 2.44.0.260225. Download that exact SDK and unpack it:

cd ~/iq8-gemma
curl -fL -o v2.44.0.260225.zip \
  "https://softwarecenter.qualcomm.com/api/download/software/sdks/Qualcomm_AI_Runtime_Community/All/2.44.0.260225/v2.44.0.260225.zip"
mkdir -p v2.44.0.260225 && unzip -q v2.44.0.260225.zip -d v2.44.0.260225
# QAIRT root is now: ~/iq8-gemma/v2.44.0.260225/qairt/2.44.0.260225
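Before moving on to the build, it can be worth confirming that the model download is complete, since a truncated file only shows up later as a load error. The sketch below is an illustration rather than part of any tool used here: `sha256_of` is a hypothetical helper, and the expected hash is something you would copy yourself from the model's Hugging Face file page.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a multi-GB model never sits in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run it on ~/iq8-gemma/gemma-4-E2B-it_qualcomm_qcs8275.litertlm and compare the hex digest with the published one. Hashing 3.29 GB on the board takes a little while, but far less than a failed build-and-run cycle.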

Build `litert_lm_main` against that QAIRT

Build toolchain

The Pre Requisites setup does not install a compiler or Bazel, so add them (LiteRT-LM builds with clang-18; the resulting binaries link GNU libstdc++):

sudo apt-get install -y build-essential curl git git-lfs openjdk-17-jdk python3 python3-pip \
  python3-dev unzip wget zip llvm-18 clang-18 libc++-dev libc++abi-dev

# bazelisk as `bazel` (LiteRT-LM pins its Bazel version via .bazeliskrc)
curl -L -o /tmp/bazelisk \
  https://github.com/bazelbuild/bazelisk/releases/latest/download/bazelisk-linux-arm64
chmod +x /tmp/bazelisk && sudo mv /tmp/bazelisk /usr/local/bin/bazel

Check out the LiteRT-LM commit that pins your QAIRT

LITERT_QAIRT_SDK lets Bazel use a local SDK, but the workspace’s strip_prefix must match the pinned version’s layout, so check out the commit on the 2.44 line. The newest such commit is cbf463d97fa3 (it pins LiteRT d865fd82 to QAIRT 2.44.0.260225; the next commit jumps to 2.46).

cd ~/iq8-gemma
git clone https://github.com/google-ai-edge/LiteRT-LM.git
cd ~/iq8-gemma/LiteRT-LM
git checkout cbf463d97fa3
git lfs install && git lfs pull          # fetches the real libGemmaModelConstraintProvider.so (22 MB ELF, not an LFS pointer)

Apply two Linux-enablement patches

LiteRT-LM gates two pieces of Qualcomm setup behind #if defined(__ANDROID__), so on desktop or embedded Linux aarch64 they silently do not run. Both are one-liners that add || defined(__linux__).

Patch 1: dispatch-library directory. LiteRT-LM only derives the directory where it finds libLiteRtDispatch_Qualcomm.so under __ANDROID__ or __EMSCRIPTEN__. Without this the NPU accelerator never registers and DISPATCH_OP stays unresolved:

cd ~/iq8-gemma/LiteRT-LM
sed -i 's/#if defined(__ANDROID__) || defined(__EMSCRIPTEN__)$/#if defined(__ANDROID__) || defined(__EMSCRIPTEN__) || defined(__linux__)/' \
  runtime/util/litert_util.cc

Patch 2: HTP burst mode. This is the difference between 16 and about 28 tokens per second. CreateLiteRtNpuOptions() calls SetHtpPerformanceMode(kBurst) and SetLogLevel(kOff) only under #if defined(__ANDROID__) (there is an in-source TODO … Bug: 498622107 admitting it). On Linux those calls are skipped, so the dispatch plugin gets HtpPerformanceMode::kDefault: the DSP never votes itself up to burst and decode runs at about 16 tokens per second with QNN debug logs spamming stdout. Enable the block for Linux:

cd ~/iq8-gemma/LiteRT-LM
python3 - runtime/executor/llm_litert_npu_compiled_model_executor.cc <<'PY'
p = "runtime/executor/llm_litert_npu_compiled_model_executor.cc"
s = open(p).read()
anchor = ("#if defined(__ANDROID__)\n"
          "  LITERT_ASSIGN_OR_RETURN(::litert::qualcomm::QualcommOptions & qnn_opts,\n"
          "                          options.GetQualcommOptions());")
assert anchor in s, "anchor not found (different commit?)"
s = s.replace(anchor, anchor.replace("#if defined(__ANDROID__)",
                                     "#if defined(__ANDROID__) || defined(__linux__)", 1), 1)
open(p, "w").write(s)
print("burst patch applied")
PY

(It is a targeted patch rather than a sed because the file has other bare #if defined(__ANDROID__) lines we must not touch.)
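The anchoring idea generalizes: match a multi-line context that occurs exactly once, then change one token inside it. Here is a minimal sketch of that technique on a made-up two-guard source string. `patch_once`, `UnrelatedAndroidOnlyCall`, and `ConfigureNpu` are hypothetical names used only for illustration, not LiteRT-LM code.

```python
def patch_once(text, anchor, old, new):
    """Replace `old` with `new` only inside the single occurrence of `anchor`."""
    if text.count(anchor) != 1:
        raise ValueError("anchor must appear exactly once")
    if old not in anchor:
        raise ValueError("old text is not part of the anchor")
    return text.replace(anchor, anchor.replace(old, new, 1), 1)


# Hypothetical source with two bare Android guards; only the second should change.
SAMPLE = (
    "#if defined(__ANDROID__)\n"
    "  UnrelatedAndroidOnlyCall();\n"
    "#endif\n"
    "#if defined(__ANDROID__)\n"
    "  ConfigureNpu();\n"
    "#endif\n"
)

patched = patch_once(
    SAMPLE,
    anchor="#if defined(__ANDROID__)\n  ConfigureNpu();",
    old="#if defined(__ANDROID__)",
    new="#if defined(__ANDROID__) || defined(__linux__)",
)
print(patched)
```

A plain sed over `#if defined(__ANDROID__)` would have rewritten both guards. The anchored version leaves the first one alone, and it fails loudly if it is run twice or against a commit where the surrounding code has changed.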

Build

cd ~/iq8-gemma/LiteRT-LM
export LITERT_QAIRT_SDK="$HOME/iq8-gemma/v2.44.0.260225/"     # TRAILING SLASH is required

bazel build -c opt --repo_env=CC=clang-18 --repo_env=CXX=clang++-18 \
  //runtime/engine:litert_lm_main \
  @litert//litert/vendors/qualcomm/dispatch:dispatch_api_so

Outputs (under ~/iq8-gemma/LiteRT-LM/bazel-bin/):

runtime/engine/litert_lm_main
libLiteRtDispatch_Qualcomm.so (under the .../qualcomm/dispatch/ tree)
libLiteRt.so (the core LiteRT runtime lib)

Assemble a run directory

Collect the binary, the dispatch plugin, the core LiteRT lib, the constraint provider, and the model into ~/iq8-gemma/run. The find calls locate the build outputs wherever Bazel placed them, so this block works as is:

cd ~/iq8-gemma/LiteRT-LM
mkdir -p ~/iq8-gemma/run
cp -fL bazel-bin/runtime/engine/litert_lm_main ~/iq8-gemma/run/
cp -fL "$(find -L bazel-bin -name libLiteRtDispatch_Qualcomm.so | head -n1)" ~/iq8-gemma/run/
cp -fL "$(find -L bazel-bin -name libLiteRt.so | head -n1)" ~/iq8-gemma/run/
cp -fL prebuilt/linux_arm64/libGemmaModelConstraintProvider.so ~/iq8-gemma/run/
ln -sf ~/iq8-gemma/gemma-4-E2B-it_qualcomm_qcs8275.litertlm ~/iq8-gemma/run/

The run directory now holds:

~/iq8-gemma/run/
├── litert_lm_main
├── libLiteRtDispatch_Qualcomm.so
├── libLiteRt.so
├── libGemmaModelConstraintProvider.so      # from prebuilt/linux_arm64/
└── gemma-4-E2B-it_qualcomm_qcs8275.litertlm # symlink to the 3.29 GB model

Verify the plugin’s shared-library deps all resolve. A clang-18 build links GNU libstdc++, which is already present from build-essential, so this should print nothing:

cd ~/iq8-gemma/run
ldd libLiteRtDispatch_Qualcomm.so | grep 'not found'   # should print nothing

(If you built with -stdlib=libc++ instead, you would need sudo apt-get install -y libc++1 libc++abi1; the default build here uses libstdc++, so you do not.)
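If you want the same dependency check inside a script, for example before launching from a service, the ldd output is easy to parse. This is a small sketch assuming the usual glibc ldd line format (`name => path (address)` or `name => not found`). `missing_libraries` is an illustrative helper.

```python
def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing
```

Feed it the text from `subprocess.run(["ldd", "libLiteRtDispatch_Qualcomm.so"], capture_output=True, text=True).stdout`. An empty list corresponds to the grep above printing nothing.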

Run it on the NPU

This block points LD_LIBRARY_PATH at the run dir plus the matching QAIRT host libs, and ADSP_LIBRARY_PATH at the Hexagon v75 skel, then runs as root (FastRPC and cDSP need it). The $HOME and $PWD paths are expanded by your shell before sudo, so it works unedited:

cd ~/iq8-gemma/run
QAIRT="$HOME/iq8-gemma/v2.44.0.260225/qairt/2.44.0.260225"
sudo -E env \
  LD_LIBRARY_PATH="$PWD:$QAIRT/lib/aarch64-oe-linux-gcc11.2:/usr/lib" \
  ADSP_LIBRARY_PATH="$QAIRT/lib/hexagon-v75/unsigned" \
  ./litert_lm_main --backend npu \
    --model_path "$PWD/gemma-4-E2B-it_qualcomm_qcs8275.litertlm" \
    --input_prompt "Explain what Qualcomm is in two sentences."

Expected:

Qualcomm is a global technology company that designs, develops, and sells wireless
communication solutions and processors. They are a leading provider of technology for
smartphones, tablets, IoT, and other mobile devices, as well as for various other industries.

BenchmarkInfo:
  Time to first token: 0.08 s
  Prefill Turns (Total 1 turns):
    Prefill Turn 1: Processed 49 tokens in 39.4ms duration.
      Prefill Speed: ~1240 tokens/sec.
  Decode Turns (Total 1 turns):
    Decode Turn 1: Processed 46 tokens in ~1.6s duration.
      Decode Speed: ~28 tokens/sec.

(litert_lm_main prints the benchmark by default. With Patch 2 applied you will see Set HTP performance mode: 2 early in the run and the QNN debug logs go quiet: that is burst mode engaging.)

Confirm it is really on the NPU

The model is NPU-only: it has no CPU graph. Asking for the CPU backend proves it:

cd ~/iq8-gemma/run
LD_LIBRARY_PATH="$PWD:/usr/lib" ./litert_lm_main --backend cpu \
  --model_path "$PWD/gemma-4-E2B-it_qualcomm_qcs8275.litertlm" \
  --input_prompt "hi"
# INVALID_ARGUMENT: Main backend constraint mismatch.
#                   Model requires one of [npu] but Main backend is CPU

Since it refuses CPU and still generates correct text under --backend npu, execution is on the Hexagon NSP (HTP burst mode, HtpPerformanceMode: 2, visible in the logs).

Performance

Measured on the IQ-8275, public model, NPU backend, both Linux patches applied, warm, CPU governor pinned to performance:

Metric	Measured
Decode (short output, ~200 tok)	~28 tok/s (26 to 29)
Decode (long output, ~800 tok)	~25 tok/s
Time to first token	~0.08 s
Prefill (49-tok prompt)	~1,240 tok/s
Model load (executor init)	~2 s

Notes:

Burst mode is everything. Without Patch 2 the from-source build runs decode at about 16 tok/s; with it, 25 to 29. If you see about 16 and a wall of QNN logs, Patch 2 did not take.
Decode falls off as the output grows. Each decode step attends over the whole KV cache, so a 200-token answer averages about 28 tok/s and an 800-token one about 25; the first tokens (short context) are the fastest. This is not thermal: the SoC sits around 43 °C throughout.
Pin the CPU governor for stable numbers (the DSP still ramps its own clock on the first inference after boot):
```
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```
Prefill tok/s is not comparable across prompt lengths. On a one-line prompt the ~1,240 figure is dominated by fixed overhead, so it is not meaningful on its own.

Under the hood: what is actually happening

Worth understanding, because every step above maps to a layer in this stack.

The `.litertlm` is a container, not a tflite file

The file starts with the magic bytes LITERTLM. Inside it bundles everything the runtime needs: the SentencePiece tokenizer, model metadata (chat template, EOS/EOA tokens, the backend constraint that makes it NPU-only), the LiteRT model graph, and, the important part for NPU, pre-compiled QNN context binaries. For this model those are two graphs, qnn_partition_0 and qnn_partition_1 (the transformer is split across two HTP contexts). The weights are w4a16 (4-bit weights, 16-bit activations): that is how a roughly 2B-parameter model fits and runs fast on the NSP.
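The magic bytes give a cheap sanity check before you hand a file to the runtime. This sketch assumes only what is stated above, that the file begins with the ASCII bytes LITERTLM. `looks_like_litertlm` is an illustrative helper and does not parse the rest of the container.

```python
LITERTLM_MAGIC = b"LITERTLM"


def looks_like_litertlm(path):
    """True if the file begins with the LITERTLM magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(LITERTLM_MAGIC)) == LITERTLM_MAGIC
```

A False result usually points at a bad download, such as an HTML error page saved under the model's name. A True result says nothing about whether the file is complete.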

The execution path, layer by layer

litert_lm_main
  └─ LiteRT-LM Engine (tokenizer, sampler, KV-cache, prefill/decode loop)
       └─ LiteRT CompiledModel  ── graph contains a custom op: DISPATCH_OP
            └─ Dispatch delegate  → libLiteRtDispatch_Qualcomm.so   (the "NPU accelerator")
                 └─ QNN HTP backend → libQnnHtp.so / libQnnSystem.so   (host side)
                      └─ FastRPC → libcdsprpc.so → /dev/fastrpc-cdsp   (the RPC transport)
                           └─ Hexagon v75 NSP runs libQnnHtpV75Skel.so (the DSP side)

The LiteRT graph is not a normal tflite network: it is mostly a single DISPATCH_OP, a custom op that is a placeholder for “run this pre-compiled vendor graph.” When the NPU accelerator registers, LiteRT loads libLiteRtDispatch_Qualcomm.so, which hands the QNN context binary to the QNN HTP backend. QNN talks to the Hexagon NSP over FastRPC (a remote-procedure-call transport to the DSP, via libcdsprpc.so and /dev/fastrpc-cdsp); the actual matmuls execute inside libQnnHtpV75Skel.so, the QNN “skeleton” loaded on the v75 NSP and found via ADSP_LIBRARY_PATH. So three things must agree: the host QNN libs, the DSP skel, and the context binary inside the model, all the same QAIRT release. That is why the version step matters.
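A quick way to catch a mis-staged SDK is to check that the host and DSP pieces named above are present under one QAIRT root. This is a sketch under stated assumptions: the library names come from this guide, and their pairing with the two directories (host libs under lib/aarch64-oe-linux-gcc11.2, skel under lib/hexagon-v75/unsigned) follows the paths used in the run command. `missing_qairt_files` is an illustrative helper.

```python
from pathlib import Path

# Directory -> files expected there (assumed pairing, see the lead-in).
REQUIRED = {
    "lib/aarch64-oe-linux-gcc11.2": ["libQnnHtp.so", "libQnnSystem.so"],
    "lib/hexagon-v75/unsigned": ["libQnnHtpV75Skel.so"],
}


def missing_qairt_files(qairt_root):
    """Return the relative paths that are absent under the given QAIRT root."""
    root = Path(qairt_root)
    return [
        f"{subdir}/{name}"
        for subdir, names in REQUIRED.items()
        for name in names
        if not (root / subdir / name).is_file()
    ]
```

Because all three files are resolved under the same root, passing this check also means the host libs and the skel come from the same unpacked release. That is two of the three things that must agree. The third, the context binary in the model, is covered by the version read from the file.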

Why the version must match exactly

A QNN context binary is ahead-of-time compiled and serialized for one QAIRT version: its graph format, op-package set, and the skel ABI it expects are all baked in. Load it on a different runtime and, best case, deserialization is refused. The version is not documented up front and LiteRT’s main pins a newer 2.47, so neither is authoritative. The serialized build ID inside the file (v2.44.0.260225…) is authoritative, which is why we run strings on the file rather than trust a pin.

Prefill versus decode, and the KV cache

Generation is two phases. Prefill runs the whole prompt through the transformer once to build the KV cache (per-layer key/value tensors): it is compute-bound and embarrassingly parallel, so it is fast per token. Decode then generates one token at a time, each step attending over the growing KV cache: it is memory-bandwidth-bound (you stream the 4-bit weights through the NSP every token), which is why decode (about 28 tok/s on Ubuntu, about 32 on QLI) is far slower per token than prefill and is the number that actually bounds interactive latency.

Burst mode: why the build is 16 tok/s until you patch it

The Hexagon NSP runs under DCVS (dynamic clock and voltage scaling): left alone it idles at a low clock and only ramps under sustained load. QNN exposes a performance mode to override that. HtpPerformanceMode::kBurst makes the runtime vote the DSP up to its top clock and hold it there (plus RPC-polling to cut FastRPC latency). LiteRT-LM’s NPU executor does request burst, but only inside #if defined(__ANDROID__) (Patch 2). On Linux, with that block compiled out, the dispatch plugin reports Failed to parse qnn options … Null Qualcomm options, falls back to HtpPerformanceMode::kDefault, and the DSP runs at its lazy default clock: decode about 16 tok/s. Apply Patch 2 and the log shows Set HTP performance mode: 2; decode jumps to 25 to 29. This single gate is the largest performance lever in the whole stack, far bigger than anything else here.

The `err 1002` weight-buffer message

During graph init you will see fastrpc memory map for fd: … length: 1172307968 failed … err 1002. That is QNN trying to map the roughly 1.17 GB persistent weight buffer into the cDSP’s IOMMU in one shot via FastRPC. The stock Ubuntu BSP does not reserve a large FastRPC DMA region (dmesg: no reserved DMA memory for FASTRPC, CMA only about 164 MB), so that single map request is rejected. It is non-fatal: QNN falls back to another path to get the weights to the NSP and the graphs still execute on the v75 (the model generates correct text either way). Whether it leaves decode throughput on the table is hard to isolate from the KV-cache-length falloff above; on this BSP, with burst mode on, decode lands at 25 to 29 tok/s with the message present.

Why decode slows as the answer grows

Decode is memory-bandwidth-bound and gets slower per token as the sequence lengthens: every step attends over the entire KV cache, which grows with each token emitted. So a 200-token answer averages about 28 tok/s while an 800-token one averages about 25. The first tokens (short context) are the fastest. There is also a small first-inference-after-boot ramp while DCVS spins up; pinning the CPU governor to performance and warming up collapses that part.

Troubleshooting

Symptom	Cause and fix
`NPU accelerator could not be loaded and registered: InvalidArgument`, then `DISPATCH_OP failed to prepare`	The Linux dispatch-dir gate. Apply the `|| defined(__linux__)` patch (Patch 1) and rebuild.
Decode stuck at ~16 tok/s, plus walls of QNN `[INFO]` logs and `Null Qualcomm options`	Burst mode is not set. Apply Patch 2 and rebuild; you should then see `Set HTP performance mode: 2` and the logs go quiet.
`Failed to create device`, or no FastRPC	Missing FastRPC userland. Re-run the software-package install so `libcdsprpc.so` is present.
`fastrpc memory map … err 1002`, or `Failed to map weights buffer`	Non-fatal: the model still runs and generates correct text. It is QNN failing to map the 1.17 GB weight buffer in one shot (no reserved FastRPC DMA region on the stock BSP); it falls back to another path.
`TF_LITE_PREFILL_DECODE not found`, or mmap errors	Truncated `.litertlm`. Re-download; check the size is 3.29 GB and the sha256 matches Hugging Face.
`Main backend constraint mismatch … requires [npu]`	Expected: the model is NPU-only. Use `--backend npu`.
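The table above can be turned into a small log triage helper. This is only a sketch: the substrings are the ones quoted in the table, and the real log lines may carry extra text around them. `triage` and `KNOWN_SYMPTOMS` are illustrative names.

```python
# (substring to look for, hint) pairs taken from the troubleshooting table.
KNOWN_SYMPTOMS = [
    ("NPU accelerator could not be loaded and registered",
     "Patch 1 (dispatch dir) missing: patch and rebuild"),
    ("Null Qualcomm options",
     "Patch 2 (burst mode) missing: patch and rebuild"),
    ("Failed to create device",
     "FastRPC userland missing: reinstall so libcdsprpc.so is present"),
    ("err 1002",
     "weight-buffer map failed: non-fatal, the model still runs"),
    ("Failed to map weights buffer",
     "weight-buffer map failed: non-fatal, the model still runs"),
    ("TF_LITE_PREFILL_DECODE not found",
     "truncated .litertlm: re-download and verify size and sha256"),
    ("Main backend constraint mismatch",
     "model is NPU-only: use --backend npu"),
]


def triage(log_text):
    """Return the hints whose symptom substring appears in the log, in table order."""
    hints = []
    for needle, hint in KNOWN_SYMPTOMS:
        if needle in log_text and hint not in hints:
            hints.append(hint)
    return hints
```

On a healthy run this typically reports only the non-fatal weight-buffer entry. Any other hint is worth acting on before looking at performance.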

Part 1 in one line

Match the runtime to the model (QAIRT 2.44.0.260225, read from the file), build LiteRT-LM at the commit that pins it (with the two Linux-enablement patches: dispatch dir plus HTP burst mode), stage the matching QAIRT host libs and the Hexagon v75 skel, and run --backend npu. Burst mode is what turns a 16 tok/s build into a roughly 28 tok/s one; the scary err 1002 is non-fatal.

Part 2: Production on Qualcomm Linux (Yocto)

Ubuntu (Part 1) is the fast way to prototype. Qualcomm Linux (QLI) 2.0 is the Yocto-based embedded OS you would actually ship on these boards: a from-source image you build and control, with the Qualcomm AI stack baked in. Same model, same QAIRT 2.44, same --backend npu, but QLI exacts two small, specific costs that Ubuntu did not: one rebuild patch (a SoC-config fix for a QNN 14001) and one runtime line (ulimit -l unlimited). So the arc is: build the image, flash it, rebuild the binary with the SoC patch, then stage and run.

What is different from Ubuntu:

QLI ships the FastRPC userland, cDSP firmware, and QNN runtime natively (no apt; it is in the image). You do not run Pre Requisites.
The rootfs is a Yocto image, not Debian, so you stage the LiteRT-LM binary, QAIRT 2.44, and model onto it (scp or a data partition) rather than apt install.
You build the OS image yourself on a Linux PC, then flash it to the board.

What changes from the Ubuntu build (the QLI delta)

You do not start over for QLI; you carry the Part 1 work forward. The model, QAIRT 2.44, the dispatch to QNN to FastRPC to Hexagon path, --backend npu, and the two Part 1 patches (dispatch-dir plus burst) are all unchanged. There are exactly two functional deltas to go from the working Ubuntu binary to a working QLI run, one at build time and one at runtime, plus the packaging change (Yocto image instead of apt):

In Part 1 (Ubuntu)	For QLI you additionally need	Why it is needed
2 build patches: dispatch-dir plus burst	+1 build patch: an `htp_backend.cc` SoC-guard, then rebuild	QLI’s QNN runtime rejects the forced SoC config that Ubuntu silently tolerated, giving `QnnDevice_create 14001`
root mlock is unlimited by default	`ulimit -l unlimited` before launching	QLI caps locked memory at 8 MB; FastRPC must pin the roughly 1.17 GB weight buffer, giving `Could not allocate persistent weights buffer!`
`Pre Requisites` plus `apt install` the stack	bake and stage instead: build the Yocto image, stage QAIRT 2.44 plus the model (no `apt`; `libstdc++` is already in the rootfs)	QLI is a from-source Yocto rootfs, not Debian
(none)	(optional) `download_mode=0` while bringing up	so a cDSP fault reboots instead of dropping to EDL ramdump (`900e`) needing a power-cycle

Note the burst patch is not a QLI thing: it is required for any from-source Linux build, Ubuntu included (it is the 16 to 28 tok/s fix from Part 1). The genuinely QLI-specific deltas are just the SoC-guard rebuild and the ulimit -l line. The same binary that did about 28 tok/s on Ubuntu, rebuilt with that one extra patch, does about 32 on QLI.

Build host: requirements and what to expect

You do not build the image on the IQ8. You build it on a Linux PC (the “build host”) and flash the result to the board. Everything runs inside a container via kas-container, so the only host dependency is Docker.

Build host prerequisites:

Item	Requirement	Notes
OS	Any x86_64 Linux	The build runs in a kas/Docker container, so the host distro barely matters.
Docker	Installed, your user in the `docker` group	`kas-container` uses it. (Podman also works.)
Disk	~250 GB free	Downloads about 30 GB, sstate cache about 30 GB, `build/tmp` about 100 to 150 GB. An SSD/NVMe matters a lot.
RAM	32 GB min, 64 GB comfortable	Parallel compiles and linking (LLVM, mesa, the kernel) are memory-hungry.
CPU	As many cores as you can get	Yocto compiles about 14,800 tasks; it scales almost linearly with cores.
Tools	`git`, `wget`/`curl`, `docker`	Plus the `kas-container` script (one download, below).
Network	Fast, unmetered	The first build downloads tens of GB of sources.

Machine recommendation: this is the one job where a many-core workstation pays for itself. A Threadripper or EPYC (32 to 64 cores) chews through a cold build in about 1 to 2 hours; the same build on a typical 8-core laptop is an all-afternoon (about 8 to 10 h) affair. More cores means proportionally less wall-clock.

Estimated build times:

Scenario	8-core laptop	16-core server	32 to 64-core Threadripper/EPYC
Cold (empty cache, first ever build)	~8 to 10 h plus downloads	~3 to 4 h	~1 to 2 h
Warm (sstate cache present, incremental)	~20 to 40 min	~7 min ✅	~5 min

The roughly 7-minute warm figure is measured here on a 16-core AMD EPYC 7763 with a populated sstate-cache and downloads. The cold figures are estimates: the variable is cores and download speed, not much else. Keep your sstate-cache and downloads directories between builds (point SSTATE_DIR and DL_DIR at them); that is the difference between 7 minutes and 4 hours.

Set up the build tree

Install Docker (once), grab kas-container, and pull the QLI 2.0 layers at their locked revisions:

# Docker (Ubuntu host example), once
sudo apt-get update && sudo apt-get install -y docker.io git
sudo usermod -aG docker "$USER"   # log out/in for this to take effect

# kas-container (the only build tool you need on the host)
wget -qO kas-container https://raw.githubusercontent.com/siemens/kas/refs/tags/5.1/kas-container
chmod +x kas-container

# QLI 2.0 release manifest plus all meta layers, pinned to one lockfile
git clone -b qli-2.0 https://github.com/qualcomm-linux/meta-qcom-releases
./kas-container checkout meta-qcom-releases/lock.yml     # clones meta-qcom + all deps at locked commits
cp meta-qcom-releases/lock.yml meta-qcom/ci/lock.yml

Build the IQ-8275 image

The build target is a colon-joined list of kas config fragments: machine, image, kernel, lockfile:

export KAS_CONTAINER_ENGINE=docker
./kas-container build \
  meta-qcom/ci/iq-8275-evk.yml:\
meta-qcom/ci/qcom-distro-multimedia-image.yml:\
meta-qcom/ci/linux-qcom-6.18.yml:\
meta-qcom/ci/lock.yml

This produces the flashable bundle (about 927 MB):

build/tmp/deploy/images/iq-8275-evk/qcom-multimedia-image-iq-8275-evk.rootfs.qcomflash.tar.gz

Inside it: the firehose programmer (prog_firehose_ddr.elf), partition tables, the SAIL bootloader chain, and rawprogram*.xml/patch*.xml, everything qdl needs. Confirm the AI stack is in the image:

grep -E 'fastrpc|qairt|hexagon-dsp-binaries|tensorflow' \
  build/tmp/deploy/images/iq-8275-evk/qcom-multimedia-image-*.manifest
# fastrpc 1.0.4 / kernel-module-fastrpc / hexagon-dsp-binaries-…-iq8275-evk-cdsp / qairt-sdk-hexagon-v75 2.43 …

Note the image ships QAIRT 2.43; our model needs 2.44, so (as on Ubuntu) we stage 2.44 ourselves below.

Flash the board (EDL plus qdl)

Put the IQ8 into EDL (emergency download) mode and flash with qdl (Linux/macOS) or qdl.exe (Windows). The QLI build guide has the full matrix; the short path:

# 1) extract the bundle
tar -xzf qcom-multimedia-image-iq-8275-evk.rootfs.qcomflash.tar.gz
cd qcom-multimedia-image-iq-8275-evk

# 2) put the board in EDL: from a running shell `sudo reboot edl`, or the boot button method
#    (host then enumerates a "Qualcomm HS-USB QDLoader 9008" device)

# 3) flash
qdl prog_firehose_ddr.elf rawprogram*.xml patch*.xml

Power-cycle out of EDL; the board boots QLI 2.0. Log in as root (password oelinux123 on this image), then confirm:

tr '\0' '\n' < /proc/device-tree/compatible   # qcom,monaco-evk / qcom,qcs8300
cat /sys/devices/soc0/machine                 # QCS8275
ls /dev/fastrpc-cdsp                           # FastRPC present (shipped in the image)

Note the device tree calls the SoC qcs8300 even though machine reads QCS8275 (they are the same v75 part; soc_id 675). That naming is what trips the QNN bug we patch next.
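The compatible property is a list of NUL-terminated strings, which is why the shell command above pipes it through tr. If you need the same check from a script, a minimal sketch looks like this. `parse_compatible` is an illustrative helper.

```python
def parse_compatible(raw):
    """Split a device-tree 'compatible' property (NUL-separated bytes) into strings."""
    return [s.decode("ascii") for s in raw.split(b"\0") if s]


# Example with the values this guide reports for the IQ-8275 EVK on QLI.
print(parse_compatible(b"qcom,monaco-evk\0qcom,qcs8300\0"))
# -> ['qcom,monaco-evk', 'qcom,qcs8300']
```

On the board you would pass it the bytes read from /proc/device-tree/compatible, opened in binary mode.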

One more patch for QLI: the SoC-config fix (`QnnDevice_create` 14001)

The Part 1 binary runs on Ubuntu, but on QLI it dies at init with Failed to set up QNN manager or QnnDevice_create … 14001. Root cause: LiteRT’s QNN backend, on aarch64, forces a QnnHtpDevice_CustomConfig SOC option (htp_backend.cc) built from the online-detected SoC. The Ubuntu QNN runtime tolerates that forced override; QLI’s rejects it. (It is not even a wrong value: the SoC table maps both QCS8275 and QCS8300 to the same enum, v75, and 8 MB VTCM. QLI simply will not accept an explicit SOC override on this path.) The fix is to let aarch64 auto-detect by compiling the forced block out. It lives in the @litert external, so patch after bazel fetch and before bazel build:

cd ~/iq8-gemma/LiteRT-LM
export LITERT_QAIRT_SDK="$HOME/iq8-gemma/v2.44.0.260225/"

# 1) materialize the @litert external
bazel fetch -c opt --repo_env=CC=clang-18 --repo_env=CXX=clang++-18 \
  //runtime/engine:litert_lm_main @litert//litert/vendors/qualcomm/dispatch:dispatch_api_so

# 2) guard the forced SOC custom-config with #if x86 (so aarch64 auto-detects)
HB=$(find "$(bazel info output_base)/external" -path '*qualcomm/core/backends/htp_backend.cc' | head -1)
chmod u+w "$HB"          # make the fetched file writable before patching it
python3 - "$HB" <<'PY'
import sys; p=sys.argv[1]; s=open(p).read()
a1="  std::vector<QnnDevice_CustomConfig_t> device_custom_configs;\n"
a2=("  device_custom_configs.emplace_back(\n"
    "      static_cast<QnnDevice_CustomConfig_t>(htp_device_custom_config));\n")
assert a1 in s and a2 in s, "anchors not found (different commit?)"
s=s.replace(a1, a1+"#if defined(__x86_64__) || defined(_M_X64)\n",1)
s=s.replace(a2, a2+"#endif\n",1)
open(p,"w").write(s); print("htp soc-config patch applied")
PY

# 3) rebuild WITHOUT re-fetching (keeps the patched external)
bazel build --nofetch -c opt --repo_env=CC=clang-18 --repo_env=CXX=clang++-18 \
  //runtime/engine:litert_lm_main \
  @litert//litert/vendors/qualcomm/dispatch:dispatch_api_so

This binary is a superset: dropping the forced SOC config is a no-op on Ubuntu, so the one binary runs on both operating systems. (If you only target QLI, build with all three patches from the start.)

Stage the runtime on QLI

QLI gives you FastRPC plus cDSP firmware for free, but it is a Yocto rootfs with no apt. You stage three things the image does not ship: the litert_lm_main and its .so files (the SoC-patched rebuild), the QAIRT 2.44 SDK, and the model. There is no C++ runtime to add: the binaries link GNU libstdc++.so.6, which the rootfs already has. The board has networking, so pull the big files directly:

# on the QLI board
mkdir -p /opt/iq8-gemma/run && cd /opt/iq8-gemma

# QAIRT 2.44 (matches the model)
curl -fL -o q.zip "https://softwarecenter.qualcomm.com/api/download/software/sdks/Qualcomm_AI_Runtime_Community/All/2.44.0.260225/v2.44.0.260225.zip"
mkdir -p v2.44.0.260225 && unzip -q q.zip -d v2.44.0.260225

# the model
curl -fL -o run/gemma-4-E2B-it_qualcomm_qcs8275.litertlm \
  "https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/resolve/main/gemma-4-E2B-it_qualcomm_qcs8275.litertlm"

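Then scp the binary and its libraries from the SoC-patched rebuild into run/: litert_lm_main, libLiteRtDispatch_Qualcomm.so, libLiteRt.so, and libGemmaModelConstraintProvider.so (the same four files the Part 1 run directory holds alongside the model).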

Run on the NPU (plus the one runtime gotcha: `ulimit -l`)

QLI’s default max locked memory is 8 MB (ulimit -l returns 8192); Ubuntu’s root is unlimited. FastRPC pins the model’s roughly 1.17 GB persistent weight buffer, which blows past 8 MB, so QNN reports Could not allocate persistent weights buffer! and the load aborts (the first cold attempt actually faulted the cDSP). Raise it before running and the err 1002 weights-map degrades to the exact same harmless fallback you saw on Ubuntu:

cd /opt/iq8-gemma/run
ulimit -l unlimited                       # <-- the QLI fix; without it the model won't load
QAIRT=/opt/iq8-gemma/v2.44.0.260225/qairt/2.44.0.260225
LD_LIBRARY_PATH="$PWD:$QAIRT/lib/aarch64-oe-linux-gcc11.2:/usr/lib" \
ADSP_LIBRARY_PATH="$QAIRT/lib/hexagon-v75/unsigned" \
  ./litert_lm_main --backend npu \
    --model_path "$PWD/gemma-4-E2B-it_qualcomm_qcs8275.litertlm" \
    --input_prompt "Explain what Qualcomm is in two sentences."
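If you launch the model from a service or script rather than an interactive shell, it helps to check the locked-memory limit up front instead of discovering it through a failed load. The sketch below uses Python's standard resource module and makes one assumption: it takes the 1,172,307,968-byte length from the err 1002 log line as the amount that must be pinned. The real requirement may be higher, so treat it as a lower bound. `memlock_is_enough` is an illustrative helper.

```python
import resource

WEIGHT_BUFFER_BYTES = 1_172_307_968  # length reported in the err 1002 log line


def memlock_is_enough(soft_limit, needed=WEIGHT_BUFFER_BYTES):
    """RLIMIT_MEMLOCK is expressed in bytes; RLIM_INFINITY means unlimited."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= needed


soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
if memlock_is_enough(soft):
    print("locked-memory limit looks sufficient")
else:
    print("raise the limit first, e.g. `ulimit -l unlimited`")
```

Note the unit difference: the shell's ulimit -l reports kilobytes (8192 for the QLI default), while getrlimit returns bytes. The limit is inherited by child processes, so the check belongs in the same process that will start litert_lm_main.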

Crash safety while bringing this up. QLI ships qcom_scm.download_mode=1, so a cDSP or kernel fault dumps the SoC into EDL ramdump mode (USB 900e) and needs a physical power-cycle to recover. Run echo 0 > /sys/module/qcom_scm/parameters/download_mode so a fault just reboots instead. (With ulimit -l raised there is no fault; this is only a safety net.)

You will see HtpPerformanceMode: 2, no 14001, correct generated text, and on a long run:

BenchmarkInfo:
  Time to first token: 0.06 s
  Prefill Turn 1: Processed 49 tokens in 32.3ms duration.
    Prefill Speed: ~1520 tokens/sec.
  Decode  Turn 1: Processed 799 tokens in 24.6s duration.
    Decode  Speed: 32.49 tokens/sec.
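The speeds in that block are just tokens divided by duration, which makes them easy to recompute when you compare runs. A small sketch using the rounded figures printed above:

```python
def tokens_per_second(tokens, seconds):
    """Throughput as reported in BenchmarkInfo: tokens processed per second."""
    return tokens / seconds


prefill = tokens_per_second(49, 0.0323)  # 49 tokens in 32.3 ms
decode = tokens_per_second(799, 24.6)    # 799 tokens in 24.6 s
print(f"prefill ~{prefill:.0f} tok/s, decode ~{decode:.1f} tok/s")
# -> prefill ~1517 tok/s, decode ~32.5 tok/s
```

The small gap between this result and the logged 32.49 is presumably because the log computes from the unrounded duration.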

Prototype versus production: the payoff

Same model, same NPU, same QAIRT, measured on the same physical board, first as the Ubuntu prototype, then reflashed to QLI 2.0 (governor performance, warm):

Metric	Ubuntu (measured)	QLI 2.0 (measured)
Decode, ~200-tok output	~28 tok/s (26 to 29)	~32 tok/s (31 to 33)
Decode, ~800-tok output	~25 tok/s	~32.5 tok/s
Time to first token	0.08 s	0.06 s
Prefill (49-tok prompt)	~1,240 tok/s	~1,520 tok/s

The punchline of the prototype-to-production arc: QLI 2.0 is not a downgrade, it is faster and steadier. Unlike Ubuntu, it holds about 32 tok/s even across an 800-token generation instead of sagging to about 25. The likely reason is exactly what makes it a production target: the lean, single-purpose image leaves the CPU and scheduler far less contended, so the NSP’s DCVS holds its clock. You give up apt convenience and pay two QLI-specific costs (one rebuild patch for the SoC config and one runtime line, ulimit -l), and in return you get a reproducible, version-pinned, from-source image that runs the model better than the box you prototyped on.

Next steps

Swap in your own prompt, or wire litert_lm_main behind a small local API for an on-device assistant.
Try other LiteRT-LM models built for the v75 NSP, reading each one’s required QAIRT version from the file before staging.
For the production path, fold the SoC-config patch in from the start and bake your run directory into the Yocto image.
Compare throughput on other Dragonwing parts by repeating the version-match step for that device’s NSP.
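As a starting point for the first item, here is one way a wrapper could assemble the same invocation used in this guide. It is a sketch, not an official API: `build_npu_invocation` is a hypothetical helper, and it only uses the flags and environment variables already shown in the run commands (--backend, --model_path, --input_prompt, LD_LIBRARY_PATH, ADSP_LIBRARY_PATH).

```python
import os
from pathlib import Path

MODEL_NAME = "gemma-4-E2B-it_qualcomm_qcs8275.litertlm"


def build_npu_invocation(run_dir, qairt_root, prompt):
    """Return (argv, env) mirroring the run command used in this guide."""
    run_dir, qairt_root = Path(run_dir), Path(qairt_root)
    env = dict(os.environ)
    env["LD_LIBRARY_PATH"] = ":".join(
        [str(run_dir), str(qairt_root / "lib" / "aarch64-oe-linux-gcc11.2"), "/usr/lib"]
    )
    env["ADSP_LIBRARY_PATH"] = str(qairt_root / "lib" / "hexagon-v75" / "unsigned")
    argv = [
        str(run_dir / "litert_lm_main"),
        "--backend", "npu",
        "--model_path", str(run_dir / MODEL_NAME),
        "--input_prompt", prompt,
    ]
    return argv, env
```

The pair can be handed to `subprocess.run(argv, env=env, capture_output=True, text=True)` from a process that already runs as root. Passing the prompt as its own argv element avoids shell quoting problems with user-supplied text. Each call pays the model load again, so a long-lived service would want a different design than one process per request.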
