Run LLMs with llama.cpp on Dragonwing

Dragonwing Team · Jun 25, 2026

This guide shows how to build and install llama.cpp for Dragonwing devices, then run GGUF large language models with the Hexagon HTP backend.

These instructions focus on the Snapdragon and Dragonwing llama.cpp build that exposes the Hexagon HTP device as HTP0. This is different from the OpenCL GPU workflow.

What you will do

- Prepare the Dragonwing target device.
- Build a current llama.cpp package using the Snapdragon toolchain container.
- Copy the package to the Dragonwing device.
- Make the packaged llama-cli and llama-server the default commands on the device.
- Download a GGUF model and run inference on HTP0.

Prerequisites

Before you begin, make sure you have:

- Completed the first-time setup for your Dragonwing device:
  - Dragonwing IQ8 setup
  - Dragonwing IQ9 setup
- Access to the device by SSH, or by a directly connected display, keyboard, and mouse.
  - The setup guides linked above include instructions for networking, serial console access, display setup, and SSH access.
- Installed the required Dragonwing software packages on the device:
  - Dragonwing IQ8
  - Dragonwing IQ9
- Installed Docker on the build host.
- Enough free space for the build output and models. Plan for several GB per model.

The required software package setup installs the QNN runtime and tools, including libqnn-dev and qnn-tools, that llama.cpp needs for accelerated inference.

Prepare the build host

On your build host, clone llama.cpp or update an existing checkout.

mkdir -p ~/src
cd ~/src

# Clone if this is your first build
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# For repeat builds, update to the latest upstream changes
git fetch origin
git checkout master
git pull --ff-only

If you need a reproducible build, record the commit you built:

git rev-parse HEAD

Build llama.cpp with the Snapdragon toolchain container

The easiest way to build llama.cpp for Dragonwing is to use the Snapdragon ARM64 Linux toolchain container. The container includes the ARM64 cross compiler, CMake, the OpenCL SDK, and the Hexagon SDK pieces needed by the Snapdragon preset.

The Docker command below explicitly requests the linux/amd64 image, so a build host that is not x86-64 needs Docker emulation support to run it. From the root of your llama.cpp checkout, start the container:

docker run -it \
  -u "$(id -u):$(id -g)" \
  --volume "$(pwd)":/workspace \
  --platform linux/amd64 \
  ghcr.io/snapdragon-toolchain/arm64-linux:v0.1

Inside the container, configure and build llama.cpp:

cd /workspace
cp docs/backend/snapdragon/CMakeUserPresets.json .

cmake --preset arm64-linux-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon -j $(nproc)

Create an installable package:

cmake --install build-snapdragon --prefix pkg-snapdragon
zip -r pkg-snapdragon.zip pkg-snapdragon

Exit the container when the package is complete:

exit

The package archive is now available on the host at:

~/src/llama.cpp/pkg-snapdragon.zip

Rebuild when upstream llama.cpp changes

llama.cpp changes frequently. To rebuild with the latest upstream code, repeat this update and build flow from your host checkout:

cd ~/src/llama.cpp
git fetch origin
git checkout master
git pull --ff-only

docker run -it \
  -u "$(id -u):$(id -g)" \
  --volume "$(pwd)":/workspace \
  --platform linux/amd64 \
  ghcr.io/snapdragon-toolchain/arm64-linux:v0.1

Then, inside the container:

cd /workspace
cp docs/backend/snapdragon/CMakeUserPresets.json .

rm -rf build-snapdragon pkg-snapdragon pkg-snapdragon.zip
cmake --preset arm64-linux-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon -j $(nproc)
cmake --install build-snapdragon --prefix pkg-snapdragon
zip -r pkg-snapdragon.zip pkg-snapdragon

If you are testing a branch, tag, or local llama.cpp changes, check out that source before running the Docker build command.

Copy the package to the Dragonwing device

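Replace ubuntu@DEVICE_IP with your SSH user and target IP address. If your user is not ubuntu, also change the /home/ubuntu/ destination to that user's home directory.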

scp ~/src/llama.cpp/pkg-snapdragon.zip ubuntu@DEVICE_IP:/home/ubuntu/

ssh ubuntu@DEVICE_IP

Unpack the package:

cd ~
unzip pkg-snapdragon.zip
cd pkg-snapdragon

Set the runtime library paths for the current shell:

export LD_LIBRARY_PATH=$PWD/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export ADSP_LIBRARY_PATH=$PWD/lib${ADSP_LIBRARY_PATH:+:$ADSP_LIBRARY_PATH}
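The `${VAR:+:$VAR}` form in the two export lines appends the previous value only when one exists. Without it, an empty variable would leave a trailing colon, and the dynamic loader treats an empty LD_LIBRARY_PATH entry as the current directory. The sketch below illustrates the expansion in isolation; `prepend_lib_dir` is a hypothetical helper name used only for this example, and the second directory is an arbitrary sample value.

```shell
# Sketch: join a new directory with an existing search path the same way
# the export lines do. prepend_lib_dir is a hypothetical helper name.
prepend_lib_dir() {
  new_dir="$1"
  existing="$2"
  # ${existing:+:$existing} expands to ":<existing>" only when existing is
  # set and non-empty, so an empty value never leaves a trailing colon.
  printf '%s\n' "${new_dir}${existing:+:$existing}"
}

# Empty previous value: only the new directory is printed.
prepend_lib_dir /home/ubuntu/pkg-snapdragon/lib ""
# Non-empty previous value: the new directory comes first, then a colon.
prepend_lib_dir /home/ubuntu/pkg-snapdragon/lib /usr/local/lib
```

Because the package directory is placed first, its libraries are found before any others with the same name.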

Verify that the package runs:

./bin/llama-cli --version

List the available llama.cpp devices:

./bin/llama-cli --list-devices

Expected output includes:

Available devices:
  HTP0: Hexagon

Make this the default llama.cpp install on the device

The packaged binaries need LD_LIBRARY_PATH and ADSP_LIBRARY_PATH so they can find the packaged llama.cpp and Hexagon backend libraries. A low-risk way to make this the default install is to move the package into /opt and create wrapper commands in /usr/local/bin, which leaves system packages untouched.

Run the following on the Dragonwing device:

cd ~
sudo rm -rf /opt/llama.cpp-snapdragon
sudo mkdir -p /opt
sudo cp -a ~/pkg-snapdragon /opt/llama.cpp-snapdragon
sudo chown -R root:root /opt/llama.cpp-snapdragon

Create a shared environment file:

sudo tee /etc/llama.cpp-snapdragon.env >/dev/null <<'EOF'
LLAMA_CPP_SNAPDRAGON_HOME=/opt/llama.cpp-snapdragon
LD_LIBRARY_PATH=/opt/llama.cpp-snapdragon/lib
ADSP_LIBRARY_PATH=/opt/llama.cpp-snapdragon/lib
EOF

Create wrapper commands for llama-cli and llama-server:

sudo tee /usr/local/bin/llama-cli >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Export every variable the env file assigns so the binary inherits it
set -a
source /etc/llama.cpp-snapdragon.env
set +a
exec "$LLAMA_CPP_SNAPDRAGON_HOME/bin/llama-cli" "$@"
EOF

sudo tee /usr/local/bin/llama-server >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Export every variable the env file assigns so the binary inherits it
set -a
source /etc/llama.cpp-snapdragon.env
set +a
exec "$LLAMA_CPP_SNAPDRAGON_HOME/bin/llama-server" "$@"
EOF

sudo chmod +x /usr/local/bin/llama-cli /usr/local/bin/llama-server
hash -r
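The binaries only see LD_LIBRARY_PATH and ADSP_LIBRARY_PATH if the wrapper passes them on as exported environment variables. A plain assignment read from a file creates a shell variable, which a child process does not inherit unless it is exported. The sketch below shows the difference with a throwaway file; `DEMO_LIB_PATH` is a hypothetical variable used only for this demonstration, and nothing here touches the real env file.

```shell
# Sketch: compare reading an env file with and without allexport (set -a).
# DEMO_LIB_PATH is a hypothetical variable used only for this demonstration.
env_file="$(mktemp)"
printf 'DEMO_LIB_PATH=/opt/llama.cpp-snapdragon/lib\n' > "$env_file"

# Plain read: the variable exists in the subshell but is not exported,
# so the child process started by sh -c does not see it.
plain="$(unset DEMO_LIB_PATH; . "$env_file"; sh -c 'printf "%s" "${DEMO_LIB_PATH:-unset}"')"

# With allexport: every assignment made while set -a is active is exported,
# so the child process inherits it.
exported="$(unset DEMO_LIB_PATH; set -a; . "$env_file"; set +a; sh -c 'printf "%s" "${DEMO_LIB_PATH:-unset}"')"

rm -f "$env_file"
echo "plain read:  $plain"
echo "set -a read: $exported"
```

If llama-cli reports missing shared libraries when started through a wrapper, checking whether the wrapper exports these variables is a reasonable first step.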

Confirm that the default commands now resolve to the wrappers:

which llama-cli
which llama-server
llama-cli --version
llama-cli --list-devices

Expected paths:

/usr/local/bin/llama-cli
/usr/local/bin/llama-server

/usr/local/bin usually appears before /usr/bin in PATH, so these wrappers become the default commands without replacing system packages.
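Since this depends on the order of entries in PATH, it can be worth checking on your device. The following is a minimal sketch, not part of the packaged tools; `path_precedes` is a hypothetical helper name that reports whether one directory comes before the first occurrence of another in a colon-separated list.

```shell
# Sketch: succeed when directory $1 appears before the first occurrence of
# directory $2 in the colon-separated list $3. path_precedes is a
# hypothetical helper name.
path_precedes() {
  padded=":$3:"
  # Keep only the part of the list that comes before the first ":$2:".
  before="${padded%%:"$2":*}:"
  case "$before" in
    *:"$1":*) return 0 ;;
    *) return 1 ;;
  esac
}

if path_precedes /usr/local/bin /usr/bin "$PATH"; then
  echo "/usr/local/bin comes before /usr/bin in PATH"
else
  echo "/usr/local/bin does not come before /usr/bin in PATH"
fi
```

When the check fails, calling /usr/local/bin/llama-cli by its full path avoids the ambiguity.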

Download a model

llama.cpp uses models in GGUF format. A small instruct model is a good first test. Create a model directory on the Dragonwing device:

mkdir -p ~/models
cd ~/models

Download a Llama 3.2 3B instruct GGUF model:

wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf

Model support and performance vary by architecture, quantization, context length, and llama.cpp commit. We are continually testing new models. Community testing and performance reports are encouraged, especially for model families and quantization formats that run well on Dragonwing devices.

Run your first prompt on HTP0

Run llama-cli and offload layers to the Hexagon HTP device:

llama-cli \
  -m ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf \
  --device HTP0 \
  -ngl 99 \
  -p "What is the most popular cookie in the world?"

Useful options:

- --device HTP0 selects the Hexagon HTP backend.
- -ngl 99 asks llama.cpp to offload up to 99 model layers to the selected device, which in practice means every layer of a small model like this one.
- -m points to your GGUF model file.
- -p passes a prompt for a quick single-prompt test.

Start llama-server

llama-server exposes a local web UI and an OpenAI-compatible API. Start a server on the Dragonwing device:

llama-server \
  -m ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf \
  --device HTP0 \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080

Find the device IP address:

hostname -I
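`hostname -I` can print several addresses on one line when the device has more than one network interface. As a rough sketch, the helper below takes the first address from that kind of output and builds the server URL from it. `first_address` is a hypothetical helper name, the sample addresses are made up, and the first address is not guaranteed to be the one reachable from your other machine.

```shell
# Sketch: take the first address from `hostname -I`-style output on stdin.
# first_address is a hypothetical helper name.
first_address() {
  awk '{ print $1; exit }'
}

# Sample input with two made-up addresses; on the device, pipe the output
# of `hostname -I` into first_address instead.
device_ip="$(printf '192.168.1.42 172.17.0.1 \n' | first_address)"
echo "http://${device_ip}:8080"
```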

From another machine on the same network, open:

http://DEVICE_IP:8080

You can also test the API with curl:

curl http://DEVICE_IP:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Dragonwing in one sentence."}
    ],
    "temperature": 0.7,
    "max_tokens": 128
  }'

Update the default install after a rebuild

After you rebuild and copy a new pkg-snapdragon.zip to the device, update /opt/llama.cpp-snapdragon:

cd ~
rm -rf pkg-snapdragon
unzip pkg-snapdragon.zip

sudo rm -rf /opt/llama.cpp-snapdragon
sudo cp -a ~/pkg-snapdragon /opt/llama.cpp-snapdragon
sudo chown -R root:root /opt/llama.cpp-snapdragon

llama-cli --version
llama-cli --list-devices

The wrapper commands in /usr/local/bin do not need to be recreated unless you change the install path.

Troubleshooting

`error while loading shared libraries`

If you run binaries directly from the package directory, set the library paths first:

cd ~/pkg-snapdragon
export LD_LIBRARY_PATH=$PWD/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export ADSP_LIBRARY_PATH=$PWD/lib${ADSP_LIBRARY_PATH:+:$ADSP_LIBRARY_PATH}
./bin/llama-cli --version

If you are using the default install wrappers, confirm the wrapper is being used:

which llama-cli

It should print /usr/local/bin/llama-cli.

`HTP0` does not appear

Confirm that the required Dragonwing software packages are installed, especially libqnn-dev and qnn-tools. Then check devices again:

llama-cli --list-devices

Also confirm the package contains Hexagon backend libraries:

ls /opt/llama.cpp-snapdragon/lib/libggml-hexagon.so*
ls /opt/llama.cpp-snapdragon/lib/libggml-htp-*.so

The model does not load or runs out of memory

Try a smaller model or a smaller context length, which you can set with -c (for example, -c 2048). Good first tests are 1B to 3B parameter instruct models with Q4 quantization.

The command uses the wrong llama.cpp binary

Check command resolution:

type -a llama-cli
type -a llama-server

If another path appears before /usr/local/bin, update your PATH or call /usr/local/bin/llama-cli explicitly.

Next steps

- Try additional GGUF models and quantization formats.
- Compare llama-cli and llama-server behavior for your application.
- Rebuild frequently to pick up llama.cpp backend fixes and model compatibility updates.
- Share model compatibility and performance findings with the community.
