# Run LLMs with llama.cpp on Dragonwing

> Build, install, and run llama.cpp with Hexagon HTP acceleration on Dragonwing IQ8, IQ9, and Ventuno Q devices.

<div style={{ marginBottom: "2rem" }}>
  <div
    style={{
fontSize: "0.72rem",
fontWeight: 700,
color: "#31017D",
letterSpacing: "1.5px",
textTransform: "uppercase",
marginBottom: "0.5rem"
}}
  >
    AI / ML
  </div>

  <div style={{ fontSize: "0.85rem", color: "#888", display: "flex", gap: "0.5rem", flexWrap: "wrap", alignItems: "center" }}>
    <span>Dragonwing Team</span>
    <span>·</span>
    <span>Jun 25, 2026</span>
  </div>
</div>

<hr style={{ border: "none", borderTop: "1px solid #eee", margin: "0 0 2rem" }} />

This guide shows how to build and install [llama.cpp](https://github.com/ggml-org/llama.cpp) for Dragonwing devices, then run GGUF large language models with the Hexagon HTP backend.

<Note>
  These instructions focus on the Snapdragon and Dragonwing llama.cpp build that exposes the Hexagon HTP device as `HTP0`. This is different from the OpenCL GPU workflow.
</Note>

## What you will do

1. Prepare the Dragonwing target device.
2. Build a current llama.cpp package using the Snapdragon toolchain container.
3. Copy the package to the Dragonwing device.
4. Make the packaged `llama-cli` and `llama-server` the default commands on the device.
5. Download a GGUF model and run inference on `HTP0`.

## Prerequisites

Before you begin, make sure you have:

* Completed the first-time setup for your Dragonwing device:
  * [Dragonwing IQ8 setup](/Ubuntu/devices/iq8275-evk/setup)
  * [Dragonwing IQ9 setup](/Ubuntu/devices/iq9075-evk/setup)
* Access to the device by SSH, or by a directly connected display, keyboard, and mouse.
  * The setup guides linked above include instructions for networking, serial console access, display setup, and SSH access.
* Installed the required Dragonwing software packages on the device:
  * [Dragonwing IQ8](/Ubuntu/devices/iq8275-evk/Install_required_software_packages)
  * [Dragonwing IQ9](/Ubuntu/devices/iq9075-evk/Install_required_software_packages)
* Installed Docker on the build host.
* Enough free space for the build output and models. Plan for several GB per model.

The required software package setup installs the QNN runtime and tools, including `libqnn-dev` and `qnn-tools`, that llama.cpp needs for accelerated inference.
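The host-side prerequisites can be checked from a shell before you start. The sketch below is one way to do that on the build host; `free_gb` is a hypothetical helper name, and the 10 GB threshold is an arbitrary assumption rather than a documented requirement.

```shell
# Print free space, in whole GB, on the filesystem that holds the given path.
free_gb() {
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1048576) }'
}

# Docker must be on PATH for the container build.
command -v docker >/dev/null 2>&1 || echo "docker not found on PATH"

# Warn if less than 10 GB is free under the home directory (threshold is an assumption).
if [ "$(free_gb "$HOME")" -lt 10 ]; then
  echo "less than 10 GB free under $HOME"
fi
```

The same `free_gb` helper can be reused on the device to check the filesystem that will hold `~/models`.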

## Prepare the build host

On your build host, clone llama.cpp or update an existing checkout.

```shell theme={null}
mkdir -p ~/src
cd ~/src

# Clone if this is your first build
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# For repeat builds, update to the latest upstream changes
git fetch origin
git checkout master
git pull --ff-only
```

If you need a reproducible build, record the commit you built:

```shell theme={null}
git rev-parse HEAD
```
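One way to keep that record with the build is to write the hash to a file and check the same commit out again later. This is only a sketch: `record_commit` and the `BUILD_COMMIT.txt` file name are hypothetical, and llama.cpp does not read this file.

```shell
# Write the current commit hash of a checkout ($1) to a file ($2).
record_commit() {
  git -C "$1" rev-parse HEAD > "$2"
}

# Example (paths are assumptions):
#   record_commit ~/src/llama.cpp ~/src/llama.cpp-BUILD_COMMIT.txt
# To rebuild the same source later:
#   git -C ~/src/llama.cpp checkout "$(cat ~/src/llama.cpp-BUILD_COMMIT.txt)"
```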

## Build llama.cpp with the Snapdragon toolchain container

The easiest way to build llama.cpp for Dragonwing is to use the Snapdragon ARM64 Linux toolchain container. The container includes the ARM64 cross compiler, CMake, the OpenCL SDK, and the Hexagon SDK components that the Snapdragon preset needs. The Docker command below explicitly requests the `linux/amd64` image, so on an ARM64 build host Docker runs the container under emulation.

From the root of your llama.cpp checkout, start the container:

```shell theme={null}
docker run -it \
  -u "$(id -u):$(id -g)" \
  --volume "$(pwd)":/workspace \
  --platform linux/amd64 \
  ghcr.io/snapdragon-toolchain/arm64-linux:v0.1
```

Inside the container, configure and build llama.cpp:

```shell theme={null}
cd /workspace
cp docs/backend/snapdragon/CMakeUserPresets.json .

cmake --preset arm64-linux-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon -j $(nproc)
```

Create an installable package:

```shell theme={null}
cmake --install build-snapdragon --prefix pkg-snapdragon
zip -r pkg-snapdragon.zip pkg-snapdragon
```

Exit the container when the package is complete:

```shell theme={null}
exit
```

The package archive is now available on the host at:

```text theme={null}
~/src/llama.cpp/pkg-snapdragon.zip
```

## Rebuild when upstream llama.cpp changes

llama.cpp changes frequently. To rebuild with the latest upstream code, repeat this update and build flow from your host checkout:

```shell theme={null}
cd ~/src/llama.cpp
git fetch origin
git checkout master
git pull --ff-only

docker run -it \
  -u "$(id -u):$(id -g)" \
  --volume "$(pwd)":/workspace \
  --platform linux/amd64 \
  ghcr.io/snapdragon-toolchain/arm64-linux:v0.1
```

Then, inside the container:

```shell theme={null}
cd /workspace
cp docs/backend/snapdragon/CMakeUserPresets.json .

rm -rf build-snapdragon pkg-snapdragon pkg-snapdragon.zip
cmake --preset arm64-linux-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon -j $(nproc)
cmake --install build-snapdragon --prefix pkg-snapdragon
zip -r pkg-snapdragon.zip pkg-snapdragon
```

<Note>
  If you are testing a branch, tag, or local llama.cpp changes, check out that source before running the Docker build command.
</Note>

## Copy the package to the Dragonwing device

Replace `ubuntu@DEVICE_IP` with your SSH user and the device's IP address. If your user is not `ubuntu`, adjust the `/home/ubuntu/` destination to match.

```shell theme={null}
scp ~/src/llama.cpp/pkg-snapdragon.zip ubuntu@DEVICE_IP:/home/ubuntu/
```
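Large archives occasionally arrive truncated over an unreliable link. A checksum comparison, sketched below with the standard `sha256sum` tool, is one way to confirm the copy. The helper names and the `.sha256` file name are assumptions.

```shell
# On the build host: write a checksum file next to the archive.
make_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# On the device: verify the archive against the copied checksum file.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" >/dev/null )
}

# Example:
#   make_checksum ~/src/llama.cpp/pkg-snapdragon.zip                  # host
#   scp ~/src/llama.cpp/pkg-snapdragon.zip.sha256 ubuntu@DEVICE_IP:/home/ubuntu/
#   verify_checksum ~/pkg-snapdragon.zip && echo "archive OK"         # device
```

Both helpers change into the archive's directory first, so the checksum file stores a bare file name and still verifies after the archive is copied to a different path.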

Log in to the target device:

```shell theme={null}
ssh ubuntu@DEVICE_IP
```

Unpack the package:

```shell theme={null}
cd ~
unzip pkg-snapdragon.zip
cd pkg-snapdragon
```

Set the runtime library paths for the current shell:

```shell theme={null}
export LD_LIBRARY_PATH="$PWD/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# FastRPC separates ADSP_LIBRARY_PATH entries with ';' rather than ':'
export ADSP_LIBRARY_PATH="$PWD/lib${ADSP_LIBRARY_PATH:+;$ADSP_LIBRARY_PATH}"
```

Verify that the package runs:

```shell theme={null}
./bin/llama-cli --version
```

List the available llama.cpp devices:

```shell theme={null}
./bin/llama-cli --list-devices
```

Expected output includes:

```text theme={null}
Available devices:
  HTP0: Hexagon
```
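In a provisioning script it can be useful to fail early when the device is missing. The following sketch checks the listing for a named device. `has_device` is a hypothetical helper, and it assumes the listing prints one `NAME: description` entry per line, as in the expected output above.

```shell
# Succeeds if stdin contains a line that starts with "<name>:" after optional indentation.
has_device() {
  grep -q "^[[:space:]]*$1:"
}

# Example (run from the package directory, with the library paths set):
#   ./bin/llama-cli --list-devices 2>&1 | has_device HTP0 || echo "HTP0 not found"
```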

## Make this the default llama.cpp install on the device

The packaged binaries need `LD_LIBRARY_PATH` and `ADSP_LIBRARY_PATH` so they can find the packaged llama.cpp and Hexagon backend libraries. The safest way to make this the default install is to move the package into `/opt` and create wrapper commands in `/usr/local/bin`.

Run the following on the Dragonwing device:

```shell theme={null}
cd ~
sudo rm -rf /opt/llama.cpp-snapdragon
sudo mkdir -p /opt
sudo cp -a ~/pkg-snapdragon /opt/llama.cpp-snapdragon
sudo chown -R root:root /opt/llama.cpp-snapdragon
```

Create a shared environment file:

```shell theme={null}
sudo tee /etc/llama.cpp-snapdragon.env >/dev/null <<'EOF'
LLAMA_CPP_SNAPDRAGON_HOME=/opt/llama.cpp-snapdragon
LD_LIBRARY_PATH=/opt/llama.cpp-snapdragon/lib
ADSP_LIBRARY_PATH=/opt/llama.cpp-snapdragon/lib
EOF
```

Create wrapper commands for `llama-cli` and `llama-server`:

```shell theme={null}
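sudo tee /usr/local/bin/llama-cli >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# The env file has no "export" keywords, so auto-export while sourcing it;
# otherwise the binary would not inherit the library paths.
set -a
source /etc/llama.cpp-snapdragon.env
set +a
exec "$LLAMA_CPP_SNAPDRAGON_HOME/bin/llama-cli" "$@"
EOF

sudo tee /usr/local/bin/llama-server >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Same as above: export the variables defined in the env file.
set -a
source /etc/llama.cpp-snapdragon.env
set +a
exec "$LLAMA_CPP_SNAPDRAGON_HOME/bin/llama-server" "$@"
EOF

sudo chmod +x /usr/local/bin/llama-cli /usr/local/bin/llama-server
hash -r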
```

Confirm that the default commands now resolve to the wrappers:

```shell theme={null}
which llama-cli
which llama-server
llama-cli --version
llama-cli --list-devices
```

Expected paths:

```text theme={null}
/usr/local/bin/llama-cli
/usr/local/bin/llama-server
```

<Tip>
  `/usr/local/bin` usually appears before `/usr/bin` in `PATH`, so these wrappers become the default commands without replacing system packages.
</Tip>
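To confirm the ordering on a particular device rather than rely on the usual default, a short helper can report where each directory sits in `PATH`. This is a sketch; `path_index` is a hypothetical name.

```shell
# Print the 1-based position of directory $1 in the colon-separated list $2.
# Prints nothing if the directory is not present.
path_index() {
  printf '%s\n' "$2" | tr ':' '\n' | grep -n -x -F -- "$1" | head -n 1 | cut -d: -f1
}

# Example: the first number should be smaller than the second.
#   path_index /usr/local/bin "$PATH"
#   path_index /usr/bin "$PATH"
```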

## Download a model

llama.cpp uses models in GGUF format. A small instruct model is a good first test.

Create a model directory on the Dragonwing device:

```shell theme={null}
mkdir -p ~/models
cd ~/models
```

Download a Llama 3.2 3B instruct GGUF model:

```shell theme={null}
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf
```
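A failed or redirected download can leave an HTML error page or a partial file under the expected name. GGUF files begin with the four ASCII bytes `GGUF`, so a quick check of the file header catches the most obvious cases. The sketch below uses a hypothetical helper named `is_gguf`; it does not validate the rest of the file.

```shell
# Succeeds if the file starts with the 4-byte GGUF magic.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example:
#   is_gguf ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf && echo "looks like GGUF"
```

If a download was interrupted, rerunning the same `wget` command with `-c` resumes it instead of starting over.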

<Note>
  Model support and performance vary by architecture, quantization, context length, and llama.cpp commit. We are continually testing new models. Community testing and performance reports are encouraged, especially for model families and quantization formats that run well on Dragonwing devices.
</Note>

## Run your first prompt on HTP0

Run `llama-cli` and offload layers to the Hexagon HTP device:

```shell theme={null}
llama-cli \
  -m ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf \
  --device HTP0 \
  -ngl 99 \
  -p "What is the most popular cookie in the world?"
```

Useful options:

* `--device HTP0` selects the Hexagon HTP backend.
* `-ngl 99` asks llama.cpp to offload up to 99 model layers to the selected device. This is more layers than this model has, so every layer is offloaded.
* `-m` points to your GGUF model file.
* `-p` passes a prompt for single-prompt testing.

## Start llama-server

`llama-server` exposes a local web UI and an OpenAI-compatible API.

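Start a server on the Dragonwing device. `--host 0.0.0.0` makes the server listen on all network interfaces so that other machines can reach it: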

```shell theme={null}
llama-server \
  -m ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf \
  --device HTP0 \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080
```

Find the device IP address:

```shell theme={null}
hostname -I
```

From another machine on the same network, open:

```text theme={null}
http://DEVICE_IP:8080
```

You can also test the API with `curl`:

```shell theme={null}
curl http://DEVICE_IP:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Dragonwing in one sentence."}
    ],
    "temperature": 0.7,
    "max_tokens": 128
  }'
```
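Loading a model can take a while, so a request sent right after startup may fail. A small retry loop, sketched below, is one way to wait until the server answers. `wait_until` is a hypothetical helper. The example polls the `/health` endpoint that current `llama-server` builds provide, which is worth confirming against your build.

```shell
# Run a command up to N times, one second apart, until it succeeds.
# Returns 0 on the first success, 1 if every attempt fails.
wait_until() {
  tries="$1"
  shift
  while [ "$tries" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Example: wait up to about 60 seconds for the server to report ready.
#   wait_until 60 curl -sf http://DEVICE_IP:8080/health && echo "server ready"
```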

## Update the default install after a rebuild

After you rebuild and copy a new `pkg-snapdragon.zip` to the device, update `/opt/llama.cpp-snapdragon`:

```shell theme={null}
cd ~
rm -rf pkg-snapdragon
unzip pkg-snapdragon.zip

sudo rm -rf /opt/llama.cpp-snapdragon
sudo cp -a ~/pkg-snapdragon /opt/llama.cpp-snapdragon
sudo chown -R root:root /opt/llama.cpp-snapdragon

llama-cli --version
llama-cli --list-devices
```

The wrapper commands in `/usr/local/bin` do not need to be recreated unless you change the install path.

## Troubleshooting

### `error while loading shared libraries`

If you run binaries directly from the package directory, set the library paths first:

```shell theme={null}
cd ~/pkg-snapdragon
export LD_LIBRARY_PATH="$PWD/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# FastRPC separates ADSP_LIBRARY_PATH entries with ';' rather than ':'
export ADSP_LIBRARY_PATH="$PWD/lib${ADSP_LIBRARY_PATH:+;$ADSP_LIBRARY_PATH}"
./bin/llama-cli --version
```

If you are using the default install wrappers, confirm the wrapper is being used:

```shell theme={null}
which llama-cli
```

It should print `/usr/local/bin/llama-cli`.

### `HTP0` does not appear

Confirm that the required Dragonwing software packages are installed, especially `libqnn-dev` and `qnn-tools`. Then check devices again:

```shell theme={null}
llama-cli --list-devices
```

Also confirm the package contains Hexagon backend libraries:

```shell theme={null}
ls /opt/llama.cpp-snapdragon/lib/libggml-hexagon.so*
ls /opt/llama.cpp-snapdragon/lib/libggml-htp-*.so
```

### The model does not load or runs out of memory

Try a smaller model or a smaller context length. Good first tests are 1B to 3B parameter instruct models with Q4 quantization.
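To get a rough sense of whether a model can fit, compare the model file size with the memory the kernel reports as available. The helpers below are a sketch with hypothetical names. The comparison is only a lower bound, because the KV cache and runtime buffers need memory beyond the file size.

```shell
# Print MemAvailable in kB from a meminfo-style file (defaults to /proc/meminfo).
mem_available_kb() {
  awk '/^MemAvailable:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Print a file's size in kB, rounded down.
file_size_kb() {
  echo $(( $(wc -c < "$1") / 1024 ))
}

# Example:
#   mem_available_kb
#   file_size_kb ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf
# Retry with a smaller context using the -c (--ctx-size) option:
#   llama-cli -m ~/models/Llama-3.2-3B-Instruct-Q4_0.gguf --device HTP0 -ngl 99 -c 2048 -p "Hello"
```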

### The command uses the wrong llama.cpp binary

Check command resolution:

```shell theme={null}
type -a llama-cli
type -a llama-server
```

If another path appears before `/usr/local/bin`, update your `PATH` or call `/usr/local/bin/llama-cli` explicitly.

## Next steps

* Try additional GGUF models and quantization formats.
* Compare `llama-cli` and `llama-server` behavior for your application.
* Rebuild frequently to pick up llama.cpp backend fixes and model compatibility updates.
* Share model compatibility and performance findings with the community.
