Dragonwing Team · Jun 25, 2026

This guide shows how to build and install llama.cpp for Dragonwing devices, then run GGUF large language models on the Hexagon Tensor Processor (HTP) backend.
These instructions cover the Snapdragon and Dragonwing llama.cpp build that exposes the Hexagon HTP device as HTP0. This workflow is separate from the OpenCL GPU workflow.

What you will do

  1. Prepare the Dragonwing target device.
  2. Build a current llama.cpp package using the Snapdragon toolchain container.
  3. Copy the package to the Dragonwing device.
  4. Make the packaged llama-cli and llama-server the default commands on the device.
  5. Download a GGUF model and run inference on HTP0.

Prerequisites

Before you begin, make sure you have:
  • Completed the first-time setup for your Dragonwing device.
  • Set up access to the device by SSH, or by a directly connected display, keyboard, and mouse.
    • The first-time setup guides include instructions for networking, serial console access, display setup, and SSH access.
  • Installed the required Dragonwing software packages on the device.
  • Installed Docker on the build host.
  • Enough free space for the build output and models. Plan for several GB per model.
The required software package setup installs the QNN runtime and tools, including libqnn-dev and qnn-tools, that llama.cpp needs for accelerated inference.
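To check the free-space prerequisite before you build or download anything, a small shell helper can compare the space available on a filesystem against a threshold. This is a minimal sketch: has_free_gb is a hypothetical helper name, not part of llama.cpp or the Dragonwing packages, and the 10 GB figure in the usage line is an arbitrary example.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# MIN_GB gigabytes available.
# Usage: has_free_gb DIR MIN_GB
has_free_gb() {
  local dir=$1 min_gb=$2 avail_kb
  # df -Pk prints POSIX-format output in 1024-byte blocks; field 4 of the
  # second line is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  # An empty value means df could not report on DIR.
  [ -n "$avail_kb" ] || return 2
  [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}
```

For example, has_free_gb "$HOME" 10 || echo "Less than 10 GB free" warns when the home filesystem is running low. The same check works on the build host and on the device.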

Prepare the build host

On your build host, clone llama.cpp or update an existing checkout.
mkdir -p ~/src
cd ~/src

# Clone if this is your first build
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# For repeat builds, update to the latest upstream changes
git fetch origin
git checkout master
git pull --ff-only
If you need a reproducible build, record the commit you built:
git rev-parse HEAD

Build llama.cpp with the Snapdragon toolchain container

The easiest way to build llama.cpp for Dragonwing is to use the Snapdragon ARM64 Linux toolchain container. The container includes the ARM64 cross compiler, CMake, the OpenCL SDK, and the Hexagon SDK components that the Snapdragon preset needs. The Docker command below explicitly requests the linux/amd64 image, so on a host that is not x86-64 Docker runs the container under emulation.
From the root of your llama.cpp checkout, start the container:
docker run -it \
  -u "$(id -u):$(id -g)" \
  --volume "$(pwd)":/workspace \
  --platform linux/amd64 \
  ghcr.io/snapdragon-toolchain/arm64-linux:v0.1
Inside the container, configure and build llama.cpp:
cd /workspace
cp docs/backend/snapdragon/CMakeUserPresets.json .

cmake --preset arm64-linux-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon -j $(nproc)
Create an installable package:
cmake --install build-snapdragon --prefix pkg-snapdragon
zip -r pkg-snapdragon.zip pkg-snapdragon
Exit the container when the package is complete:
exit
The package archive is now available on the host at:
~/src/llama.cpp/pkg-snapdragon.zip

Rebuild when upstream llama.cpp changes

llama.cpp changes frequently. To rebuild with the latest upstream code, repeat this update and build flow from your host checkout:
cd ~/src/llama.cpp
git fetch origin
git checkout master
git pull --ff-only

docker run -it \
  -u "$(id -u):$(id -g)" \
  --volume "$(pwd)":/workspace \
  --platform linux/amd64 \
  ghcr.io/snapdragon-toolchain/arm64-linux:v0.1
Then, inside the container:
cd /workspace
cp docs/backend/snapdragon/CMakeUserPresets.json .

rm -rf build-snapdragon pkg-snapdragon pkg-snapdragon.zip
cmake --preset arm64-linux-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon -j $(nproc)
cmake --install build-snapdragon --prefix pkg-snapdragon
zip -r pkg-snapdragon.zip pkg-snapdragon
If you are testing a branch, tag, or local llama.cpp changes, check out that source before running the Docker build command.

Copy the package to the Dragonwing device

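Replace ubuntu and DEVICE_IP with your SSH user and the IP address of the device. If your user is not ubuntu, also change the /home/ubuntu/ destination path to that user's home directory.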
scp ~/src/llama.cpp/pkg-snapdragon.zip ubuntu@DEVICE_IP:/home/ubuntu/
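If you want to confirm that the archive arrived intact, you can compare SHA-256 digests on both sides. The sketch below assumes sha256sum from GNU coreutils is available on the host and on the device. sha256_matches is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: succeed if FILE's SHA-256 digest equals EXPECTED_HEX.
# Usage: sha256_matches FILE EXPECTED_HEX
sha256_matches() {
  local file=$1 expected=$2 actual
  # sha256sum prints "<digest>  <filename>"; keep only the digest.
  actual=$(sha256sum "$file" | awk '{print $1}')
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

Run sha256sum ~/src/llama.cpp/pkg-snapdragon.zip on the build host, then on the device pass the printed digest as the second argument: sha256_matches ~/pkg-snapdragon.zip DIGEST_FROM_HOST && echo "archive OK".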
Log in to the target device:
ssh ubuntu@DEVICE_IP
Unpack the package:
cd ~
unzip pkg-snapdragon.zip
cd pkg-snapdragon
Set the runtime library paths for the current shell:
export LD_LIBRARY_PATH=$PWD/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export ADSP_LIBRARY_PATH=$PWD/lib${ADSP_LIBRARY_PATH:+:$ADSP_LIBRARY_PATH}
Verify that the package runs:
./bin/llama-cli --version
List the available llama.cpp devices:
./bin/llama-cli --list-devices
Expected output includes:
Available devices:
  HTP0: Hexagon
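For scripted checks, for example in a provisioning script, the device list can be tested for the HTP0 entry instead of being read by eye. This is a minimal sketch: has_htp0 is a hypothetical helper name, and it assumes the device line looks like the expected output shown above.

```shell
# Hypothetical helper: read device-list text on stdin and succeed if any line
# starts, after optional whitespace, with "HTP0:".
has_htp0() {
  grep -Eq '^[[:space:]]*HTP0:'
}
```

On the device this could be used as ./bin/llama-cli --list-devices 2>&1 | has_htp0 && echo "HTP0 available". The 2>&1 is there in case the list is written to standard error rather than standard output.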

Make this the default llama.cpp install on the device

The packaged binaries need LD_LIBRARY_PATH and ADSP_LIBRARY_PATH so that they can find the packaged llama.cpp and Hexagon backend libraries. To make this the default install without replacing any system packages, move the package into /opt and create wrapper commands in /usr/local/bin that set both variables.
Run the following on the Dragonwing device:
cd ~
sudo rm -rf /opt/llama.cpp-snapdragon
sudo mkdir -p /opt
sudo cp -a ~/pkg-snapdragon /opt/llama.cpp-snapdragon
sudo chown -R root:root /opt/llama.cpp-snapdragon
Create a shared environment file:
sudo tee /etc/llama.cpp-snapdragon.env >/dev/null <<'EOF'
LLAMA_CPP_SNAPDRAGON_HOME=/opt/llama.cpp-snapdragon
LD_LIBRARY_PATH=/opt/llama.cpp-snapdragon/lib
ADSP_LIBRARY_PATH=/opt/llama.cpp-snapdragon/lib
EOF
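Create wrapper commands for llama-cli and llama-server. Each wrapper turns on set -a while it sources the environment file, so the variables are exported to the binary rather than staying local to the wrapper shell:
sudo tee /usr/local/bin/llama-cli >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Export every variable assigned while sourcing the environment file.
set -a
source /etc/llama.cpp-snapdragon.env
set +a
exec "$LLAMA_CPP_SNAPDRAGON_HOME/bin/llama-cli" "$@"
EOF

sudo tee /usr/local/bin/llama-server >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Export every variable assigned while sourcing the environment file.
set -a
source /etc/llama.cpp-snapdragon.env
set +a
exec "$LLAMA_CPP_SNAPDRAGON_HOME/bin/llama-server" "$@"
EOF

sudo chmod +x /usr/local/bin/llama-cli /usr/local/bin/llama-server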
hash -r
Confirm that the default commands now resolve to the wrappers:
which llama-cli
which llama-server
llama-cli --version
llama-cli --list-devices
Expected paths:
/usr/local/bin/llama-cli
/usr/local/bin/llama-server
/usr/local/bin usually appears before /usr/bin in PATH, so these wrappers become the default commands without replacing system packages.
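If you want to verify that ordering rather than assume it, the following sketch walks a PATH-style string and reports which of two directories comes first. path_precedes is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: succeed if directory FIRST appears before directory
# SECOND in a colon-separated PATH-style string (default: the current $PATH).
# Usage: path_precedes FIRST SECOND [PATH_STRING]
path_precedes() {
  local first=$1 second=$2 path=${3:-$PATH} dir
  local IFS=:
  # With IFS set to ":", the unquoted expansion splits into one word per entry.
  for dir in $path; do
    [ "$dir" = "$first" ] && return 0
    [ "$dir" = "$second" ] && return 1
  done
  return 1   # FIRST never appeared
}
```

For example, path_precedes /usr/local/bin /usr/bin || echo "wrappers are shadowed" prints a warning when a system binary in /usr/bin would be found first.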

Download a model

llama.cpp uses models in GGUF format. A small instruct model is a good first test. Create a model directory on the Dragonwing device:
mkdir -p ~/models
cd ~/models
Download a Llama 3.2 3B Instruct model in GGUF format with Q4_0 quantization:
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf
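An interrupted or redirected download can leave a truncated file or an HTML error page under the .gguf name. GGUF files begin with the four ASCII bytes GGUF, so a quick sanity check is possible before you try to load the model. This is a minimal sketch: is_gguf is a hypothetical helper name, and the check confirms only the file header, not that the file is complete.

```shell
# Hypothetical helper: succeed if FILE begins with the 4-byte GGUF magic.
# Usage: is_gguf FILE
is_gguf() {
  local file=$1 magic
  # Read only the first four bytes; the magic is plain ASCII.
  magic=$(head -c 4 "$file" 2>/dev/null)
  [ "$magic" = "GGUF" ]
}
```

For example, is_gguf ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf || echo "not a GGUF file, download again".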
Model support and performance vary by architecture, quantization, context length, and llama.cpp commit. We are continually testing new models. Community testing and performance reports are encouraged, especially for model families and quantization formats that run well on Dragonwing devices.

Run your first prompt on HTP0

Run llama-cli and offload layers to the Hexagon HTP device:
llama-cli \
  -m ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf \
  --device HTP0 \
  -ngl 99 \
  -p "What is the most popular cookie in the world?"
Useful options:
  • -m points to your GGUF model file.
  • --device HTP0 selects the Hexagon HTP backend.
  • -ngl 99 asks llama.cpp to offload up to 99 model layers to the selected device. That is more layers than this model has, so llama.cpp attempts to offload every layer.
  • -p passes a prompt for single-prompt testing.

Start llama-server

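llama-server exposes a web UI and an OpenAI-compatible API over HTTP. Start a server on the Dragonwing device. The --host 0.0.0.0 option makes it listen on all network interfaces so that other machines on the network can reach it: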
llama-server \
  -m ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf \
  --device HTP0 \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080
Find the device IP address:
hostname -I
From another machine on the same network, open:
http://DEVICE_IP:8080
You can also test the API with curl:
curl http://DEVICE_IP:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Dragonwing in one sentence."}
    ],
    "temperature": 0.7,
    "max_tokens": 128
  }'
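The model can take some time to load, and requests sent before loading finishes can fail. llama-server provides a /health endpoint that returns an HTTP error status until the model is ready, so a script can poll it before sending the first request. The sketch below assumes curl is installed on the client. wait_for_http is a hypothetical helper name, and the retry count and delay defaults are arbitrary.

```shell
# Hypothetical helper: poll URL until it returns an HTTP success status.
# Gives up after TRIES attempts, sleeping DELAY seconds between attempts.
# Usage: wait_for_http URL [TRIES] [DELAY]
wait_for_http() {
  local url=$1 tries=${2:-30} delay=${3:-2} i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl exit non-zero on HTTP error statuses such as 503.
    if curl -fsS -o /dev/null --max-time 5 "$url" 2>/dev/null; then
      return 0
    fi
    # Do not sleep after the final attempt.
    if [ "$i" -lt "$tries" ]; then sleep "$delay"; fi
    i=$(( i + 1 ))
  done
  return 1
}
```

For example, wait_for_http http://DEVICE_IP:8080/health && echo "server ready" blocks for up to about a minute with the defaults, then reports whether the server answered.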

Update the default install after a rebuild

After you rebuild and copy a new pkg-snapdragon.zip to the device, update /opt/llama.cpp-snapdragon:
cd ~
rm -rf pkg-snapdragon
unzip pkg-snapdragon.zip

sudo rm -rf /opt/llama.cpp-snapdragon
sudo cp -a ~/pkg-snapdragon /opt/llama.cpp-snapdragon
sudo chown -R root:root /opt/llama.cpp-snapdragon

llama-cli --version
llama-cli --list-devices
The wrapper commands in /usr/local/bin do not need to be recreated unless you change the install path.

Troubleshooting

error while loading shared libraries

If you run binaries directly from the package directory, set the library paths first:
cd ~/pkg-snapdragon
export LD_LIBRARY_PATH=$PWD/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export ADSP_LIBRARY_PATH=$PWD/lib${ADSP_LIBRARY_PATH:+:$ADSP_LIBRARY_PATH}
./bin/llama-cli --version
If you are using the default install wrappers, confirm the wrapper is being used:
which llama-cli
It should print /usr/local/bin/llama-cli.

HTP0 does not appear

Confirm that the required Dragonwing software packages are installed, especially libqnn-dev and qnn-tools. Then check devices again:
llama-cli --list-devices
Also confirm the package contains Hexagon backend libraries:
ls /opt/llama.cpp-snapdragon/lib/libggml-hexagon.so*
ls /opt/llama.cpp-snapdragon/lib/libggml-htp-*.so

The model does not load or runs out of memory

Try a smaller model or a smaller context length. The -c option sets the context size, for example -c 2048. Good first tests are 1B to 3B parameter instruct models with Q4 quantization.
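As a rough first check, you can compare the model file size with the memory the kernel reports as available. This is a sketch under stated assumptions: model_fits_in_ram is a hypothetical helper name, it reads the MemAvailable field from /proc/meminfo on Linux, and it ignores the extra memory that the context and runtime buffers need, so passing this check does not guarantee that the model loads.

```shell
# Hypothetical helper: succeed if MODEL's file size is smaller than the
# MemAvailable value in MEMINFO (default: /proc/meminfo).
# Usage: model_fits_in_ram MODEL [MEMINFO]
model_fits_in_ram() {
  local model=$1 meminfo=${2:-/proc/meminfo} size_bytes avail_kb
  # A missing or unreadable model file is reported as status 2.
  size_bytes=$(wc -c < "$model") || return 2
  # MemAvailable is reported in kB; field 2 is the number.
  avail_kb=$(awk '/^MemAvailable:/ {print $2}' "$meminfo")
  [ -n "$avail_kb" ] || return 2
  [ $(( size_bytes / 1024 )) -lt "$avail_kb" ]
}
```

For example, model_fits_in_ram ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf || echo "model is larger than available memory".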

The command uses the wrong llama.cpp binary

Check command resolution:
type -a llama-cli
type -a llama-server
If another path appears before /usr/local/bin, update your PATH or call /usr/local/bin/llama-cli explicitly.

Next steps

  • Try additional GGUF models and quantization formats.
  • Compare llama-cli and llama-server behavior for your application.
  • Rebuild frequently to pick up llama.cpp backend fixes and model compatibility updates.
  • Share model compatibility and performance findings with the community.