# Run LLMs with llama.cpp on Dragonwing

> Build, install, and run llama.cpp with Hexagon HTP acceleration on Dragonwing IQ8, IQ9, and Ventuno Q devices.

AI / ML

Dragonwing Team · Jun 25, 2026

This guide shows how to build and install [llama.cpp](https://github.com/ggml-org/llama.cpp) for Dragonwing devices, then run GGUF large language models with the Hexagon HTP backend.

These instructions focus on the Snapdragon and Dragonwing llama.cpp build that exposes the Hexagon HTP device as `HTP0`. This is different from the OpenCL GPU workflow.

## What you will do

1. Prepare the Dragonwing target device.
2. Build a current llama.cpp package using the Snapdragon toolchain container.
3. Copy the package to the Dragonwing device.
4. Make the packaged `llama-cli` and `llama-server` the default commands on the device.
5. Download a GGUF model and run inference on `HTP0`.

## Prerequisites

Before you begin, make sure you have:

* Completed the first time setup for your Dragonwing device:
  * [Dragonwing IQ8 setup](/Ubuntu/devices/iq8275-evk/setup)
  * [Dragonwing IQ9 setup](/Ubuntu/devices/iq9075-evk/setup)
* Access to the device by SSH, or by a directly connected display, keyboard, and mouse.
  * The setup guides linked above include instructions for networking, serial console access, display setup, and SSH access.
* Installed the required Dragonwing software packages on the device:
  * [Dragonwing IQ8](/Ubuntu/devices/iq8275-evk/Install_required_software_packages)
  * [Dragonwing IQ9](/Ubuntu/devices/iq9075-evk/Install_required_software_packages)
* Installed Docker on the build host.
* Enough free space for the build output and models. Plan for several GB per model.

The required software package setup installs the QNN runtime and tools, including `libqnn-dev` and `qnn-tools`, that llama.cpp needs for accelerated inference.

## Prepare the build host

On your build host, clone llama.cpp or update an existing checkout.

```shell
mkdir -p ~/src
cd ~/src

# Clone if this is your first build
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# For repeat builds, update to the latest upstream changes
git fetch origin
git checkout master
git pull --ff-only
```

If you need a reproducible build, record the commit you built:

```shell
git rev-parse HEAD
```

## Build llama.cpp with the Snapdragon toolchain container

The easiest way to build llama.cpp for Dragonwing is to use the Snapdragon ARM64 Linux toolchain container. The container includes the ARM64 cross compiler, CMake, OpenCL SDK, and Hexagon SDK pieces needed by the Snapdragon preset.

The Docker command below explicitly requests the `linux/amd64` image.

From the root of your llama.cpp checkout, start the container:

```shell
docker run -it \
  -u $(id -u):$(id -g) \
  --volume $(pwd):/workspace \
  --platform linux/amd64 \
  ghcr.io/snapdragon-toolchain/arm64-linux:v0.1
```

Inside the container, configure and build llama.cpp:

```shell
cd /workspace
cp docs/backend/snapdragon/CMakeUserPresets.json .
cmake --preset arm64-linux-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon -j $(nproc)
```

Create an installable package:

```shell
cmake --install build-snapdragon --prefix pkg-snapdragon
zip -r pkg-snapdragon.zip pkg-snapdragon
```

Exit the container when the package is complete:

```shell
exit
```

The package archive is now available on the host at:

```text
~/src/llama.cpp/pkg-snapdragon.zip
```

## Rebuild when upstream llama.cpp changes

llama.cpp changes frequently.
To rebuild with the latest upstream code, repeat this update and build flow from your host checkout:

```shell
cd ~/src/llama.cpp
git fetch origin
git checkout master
git pull --ff-only

docker run -it \
  -u $(id -u):$(id -g) \
  --volume $(pwd):/workspace \
  --platform linux/amd64 \
  ghcr.io/snapdragon-toolchain/arm64-linux:v0.1
```

Then, inside the container:

```shell
cd /workspace
cp docs/backend/snapdragon/CMakeUserPresets.json .
rm -rf build-snapdragon pkg-snapdragon pkg-snapdragon.zip
cmake --preset arm64-linux-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon -j $(nproc)
cmake --install build-snapdragon --prefix pkg-snapdragon
zip -r pkg-snapdragon.zip pkg-snapdragon
```

If you are testing a branch, tag, or local llama.cpp changes, check out that source before running the Docker build command.

## Copy the package to the Dragonwing device

Replace `ubuntu@DEVICE_IP` with your SSH user and target IP address.

```shell
scp ~/src/llama.cpp/pkg-snapdragon.zip ubuntu@DEVICE_IP:/home/ubuntu/
```

Log in to the target device:

```shell
ssh ubuntu@DEVICE_IP
```

Unpack the package:

```shell
cd ~
unzip pkg-snapdragon.zip
cd pkg-snapdragon
```

Set the runtime library paths for the current shell:

```shell
export LD_LIBRARY_PATH=$PWD/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export ADSP_LIBRARY_PATH=$PWD/lib${ADSP_LIBRARY_PATH:+:$ADSP_LIBRARY_PATH}
```

Verify that the package runs:

```shell
./bin/llama-cli --version
```

List the available llama.cpp devices:

```shell
./bin/llama-cli --list-devices
```

Expected output includes:

```text
Available devices:
  HTP0: Hexagon
```

## Make this the default llama.cpp install on the device

The packaged binaries need `LD_LIBRARY_PATH` and `ADSP_LIBRARY_PATH` so they can find the packaged llama.cpp and Hexagon backend libraries.
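
The `${VAR:+:$VAR}` expansion in the `export` lines used earlier appends the old value, with a leading colon, only when the variable is already set and non-empty. This avoids a stray trailing colon, which the dynamic loader would treat as an empty entry meaning the current directory. The following is a minimal sketch of that idiom; the helper name `prepend_path` and the example directories are illustrative assumptions, not part of the llama.cpp package.

```shell
# Hypothetical helper: print NEW_DIR followed by the existing value, if any.
# Mirrors the ${VAR:+:$VAR} idiom from the export lines.
prepend_path() {
  new_dir=$1
  current=${2:-}
  printf '%s\n' "${new_dir}${current:+:$current}"
}

# Empty current value: no trailing colon is added
prepend_path /opt/llama.cpp-snapdragon/lib ""

# Existing value: the new directory goes first, separated by a colon
prepend_path /opt/llama.cpp-snapdragon/lib /usr/lib/aarch64-linux-gnu
```

The first call prints only the new directory. The second prints the new directory, a colon, and then the existing value.
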
The safest way to make this the default install is to copy the package into `/opt` and create wrapper commands in `/usr/local/bin`.

Run the following on the Dragonwing device:

```shell
cd ~
sudo rm -rf /opt/llama.cpp-snapdragon
sudo mkdir -p /opt
sudo cp -a ~/pkg-snapdragon /opt/llama.cpp-snapdragon
sudo chown -R root:root /opt/llama.cpp-snapdragon
```

Create a shared environment file:

```shell
sudo tee /etc/llama.cpp-snapdragon.env >/dev/null <<'EOF'
LLAMA_CPP_SNAPDRAGON_HOME=/opt/llama.cpp-snapdragon
LD_LIBRARY_PATH=/opt/llama.cpp-snapdragon/lib
ADSP_LIBRARY_PATH=/opt/llama.cpp-snapdragon/lib
EOF
```

Create wrapper commands for `llama-cli` and `llama-server`:

```shell
sudo tee /usr/local/bin/llama-cli >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Export every variable assigned while sourcing the environment file
set -a
source /etc/llama.cpp-snapdragon.env
set +a
exec "$LLAMA_CPP_SNAPDRAGON_HOME/bin/llama-cli" "$@"
EOF

sudo tee /usr/local/bin/llama-server >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Export every variable assigned while sourcing the environment file
set -a
source /etc/llama.cpp-snapdragon.env
set +a
exec "$LLAMA_CPP_SNAPDRAGON_HOME/bin/llama-server" "$@"
EOF

sudo chmod +x /usr/local/bin/llama-cli /usr/local/bin/llama-server
hash -r
```

The environment file contains plain `NAME=value` assignments, so the wrappers source it between `set -a` and `set +a`. Without that, the variables would stay local to the wrapper shell and the llama.cpp binaries would not see `LD_LIBRARY_PATH` or `ADSP_LIBRARY_PATH`.

Confirm that the default commands now resolve to the wrappers:

```shell
which llama-cli
which llama-server
llama-cli --version
llama-cli --list-devices
```

Expected paths:

```text
/usr/local/bin/llama-cli
/usr/local/bin/llama-server
```

`/usr/local/bin` usually appears before `/usr/bin` in `PATH`, so these wrappers become the default commands without replacing system packages.

## Download a model

llama.cpp uses models in GGUF format. A small instruct model is a good first test.
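
GGUF files begin with the four ASCII bytes `GGUF`. If a download is interrupted or returns an error page instead of the model, checking that header can catch the problem before llama.cpp reports a load failure. The sketch below assumes a hypothetical helper name `is_gguf` and a hypothetical model path. It checks only the magic bytes, not the rest of the file.

```shell
# Hypothetical helper: succeed if FILE starts with the 4-byte GGUF magic.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example usage with a hypothetical path; adjust to your model file
if is_gguf "$HOME/models/model.gguf"; then
  echo "looks like a GGUF file"
else
  echo "missing or not a GGUF file"
fi
```
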

Create a model directory on the Dragonwing device:

```shell
mkdir -p ~/models
cd ~/models
```

Download a Llama 3.2 3B instruct GGUF model:

```shell
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf
```

Model support and performance vary by architecture, quantization, context length, and llama.cpp commit. We are continually testing new models. Community testing and performance reports are encouraged, especially for model families and quantization formats that run well on Dragonwing devices.

## Run your first prompt on HTP0

Run `llama-cli` and offload layers to the Hexagon HTP device:

```shell
llama-cli \
  -m ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf \
  --device HTP0 \
  -ngl 99 \
  -p "What is the most popular cookie in the world?"
```

Useful options:

* `--device HTP0` selects the Hexagon HTP backend.
* `-ngl 99` asks llama.cpp to offload up to 99 model layers to the selected device, which covers every layer of the model used here.
* `-m` points to your GGUF model file.
* `-p` passes a prompt for single-prompt testing.

## Start llama-server

`llama-server` exposes a local web UI and an OpenAI-compatible API.

Start a server on the Dragonwing device:

```shell
llama-server \
  -m ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf \
  --device HTP0 \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080
```

Find the device IP address:

```shell
hostname -I
```

From another machine on the same network, open:

```text
http://DEVICE_IP:8080
```

You can also test the API with `curl`:

```shell
curl http://DEVICE_IP:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Dragonwing in one sentence."}
    ],
    "temperature": 0.7,
    "max_tokens": 128
  }'
```

## Update the default install after a rebuild

After you rebuild and copy a new `pkg-snapdragon.zip` to the device, update `/opt/llama.cpp-snapdragon`:

```shell
cd ~
rm -rf pkg-snapdragon
unzip pkg-snapdragon.zip
sudo rm -rf /opt/llama.cpp-snapdragon
sudo cp -a ~/pkg-snapdragon /opt/llama.cpp-snapdragon
sudo chown -R root:root /opt/llama.cpp-snapdragon
llama-cli --version
llama-cli --list-devices
```

The wrapper commands in `/usr/local/bin` do not need to be recreated unless you change the install path.

## Troubleshooting

### `error while loading shared libraries`

If you run binaries directly from the package directory, set the library paths first:

```shell
cd ~/pkg-snapdragon
export LD_LIBRARY_PATH=$PWD/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export ADSP_LIBRARY_PATH=$PWD/lib${ADSP_LIBRARY_PATH:+:$ADSP_LIBRARY_PATH}
./bin/llama-cli --version
```

If you are using the default install wrappers, confirm the wrapper is being used:

```shell
which llama-cli
```

It should print `/usr/local/bin/llama-cli`.

### `HTP0` does not appear

Confirm that the required Dragonwing software packages are installed, especially `libqnn-dev` and `qnn-tools`.
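
One way to check for those packages is to ask `dpkg-query` for their install status. This sketch assumes `libqnn-dev` and `qnn-tools` are Debian packages managed by `dpkg` on the Ubuntu image. The helper name `missing_packages` is illustrative only.

```shell
# Hypothetical helper: print each named package that dpkg does not report as installed.
missing_packages() {
  for pkg in "$@"; do
    # A fully installed package reports the status "install ok installed"
    status=$(dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null || true)
    case $status in
      "install ok installed") ;;
      *) echo "$pkg" ;;
    esac
  done
}

missing=$(missing_packages libqnn-dev qnn-tools)
if [ -z "$missing" ]; then
  echo "QNN packages are installed"
else
  echo "Missing packages:"
  echo "$missing"
fi
```

If a package is listed as missing, revisit the required software package setup for your device before retrying.
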

Then check devices again:

```shell
llama-cli --list-devices
```

Also confirm the package contains Hexagon backend libraries:

```shell
ls /opt/llama.cpp-snapdragon/lib/libggml-hexagon.so*
ls /opt/llama.cpp-snapdragon/lib/libggml-htp-*.so
```

### The model does not load or runs out of memory

Try a smaller model or a smaller context length. Good first tests are 1B to 3B parameter instruct models with Q4 quantization.

### The command uses the wrong llama.cpp binary

Check command resolution:

```shell
type -a llama-cli
type -a llama-server
```

If another path appears before `/usr/local/bin`, update your `PATH` or call `/usr/local/bin/llama-cli` explicitly.

## Next steps

* Try additional GGUF models and quantization formats.
* Compare `llama-cli` and `llama-server` behavior for your application.
* Rebuild frequently to pick up llama.cpp backend fixes and model compatibility updates.
* Share model compatibility and performance findings with the community.