# LLMs/VLMs using Llama.cpp

You can run a wide range of Large Language Models (LLMs) and Vision Language Models (VLMs) on your Dragonwing development boards using [llama.cpp](https://github.com/ggml-org/llama.cpp). Models running under llama.cpp run on the *GPU*, not on the *NPU*. To run a subset of models on the NPU instead, use [GENIE](/ai-workflows/genie).

## Building llama.cpp

You'll need to build llama.cpp and some of its dependencies from source. Open a terminal on your development board (or an SSH session to it) and follow these steps:

1. Install build dependencies:

   ```
   sudo apt update
   sudo apt install -y cmake ninja-build curl libcurl4-openssl-dev build-essential
   ```

2. Install the OpenCL headers and ICD loader library:

   ```shell
   mkdir -p ~/dev/llm

   # Symlink the OpenCL shared library
   sudo rm -f /usr/lib/libOpenCL.so
   sudo ln -s /lib/aarch64-linux-gnu/libOpenCL.so.1.0.0 /usr/lib/libOpenCL.so

   # OpenCL headers
   cd ~/dev/llm
   git clone https://github.com/KhronosGroup/OpenCL-Headers
   cd OpenCL-Headers
   git checkout 5d52989617e7ca7b8bb83d7306525dc9f58cdd46
   mkdir -p build && cd build
   cmake .. -G Ninja \
       -DBUILD_TESTING=OFF \
       -DOPENCL_HEADERS_BUILD_TESTING=OFF \
       -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
       -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
   cmake --build . --target install

   # ICD Loader
   cd ~/dev/llm
   git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
   cd OpenCL-ICD-Loader
   git checkout 02134b05bdff750217bf0c4c11a9b13b63957b04
   mkdir -p build && cd build
   cmake .. -G Ninja \
       -DCMAKE_BUILD_TYPE=Release \
       -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
       -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
   cmake --build . --target install

   # Symlink OpenCL headers
   sudo rm -rf /usr/include/CL
   sudo ln -s ~/dev/llm/opencl/include/CL/ /usr/include/CL
   ```

3. Build llama.cpp with the OpenCL backend:

   ```
   cd ~/dev/llm

   # Clone repository
   git clone https://github.com/ggml-org/llama.cpp
   cd llama.cpp

   # This commit has been tested explicitly; try master if you want the bleeding edge
   git checkout f6da8cb86a28f0319b40d9d2a957a26a7d875f8c
   git rev-parse HEAD
   # Expected: f6da8cb86a28f0319b40d9d2a957a26a7d875f8c

   # Build
   mkdir -p build
   cd build
   cmake .. -G Ninja \
       -DCMAKE_BUILD_TYPE=Release \
       -DBUILD_SHARED_LIBS=OFF \
       -DGGML_OPENCL=ON
   ninja -j"$(nproc)"
   ```

4. Add the llama.cpp paths to your PATH:

   ```
   cd ~/dev/llm/llama.cpp/build/bin

   echo "" >> ~/.bash_profile
   echo "# Begin llama.cpp" >> ~/.bash_profile
   echo "export PATH=\$PATH:$PWD" >> ~/.bash_profile
   echo "# End llama.cpp" >> ~/.bash_profile
   echo "" >> ~/.bash_profile

   # To use the llama.cpp files in your current session
   source ~/.bash_profile
   ```

5. You now have llama.cpp. Verify the installation; the output should list the OpenCL platform and device:

   ```
   llama-cli --version
   # ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'
   # ggml_opencl: device: 'QUALCOMM Adreno(TM) 663 (OpenCL 3.0 Adreno(TM) 663)'
   # ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: 0808.0.7 Compiler E031.49.02.00
   # ggml_opencl: vector subgroup broadcast support: true
   ```
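
If `llama-cli` is not found, the PATH entry from step 4 was probably not picked up by your shell. As a minimal sketch, the following Python snippet reports which of the tools used on this page resolve on the current PATH. The helper name `find_tools` is hypothetical; the tool names are the ones this page uses.

```python
import shutil


def find_tools(names=("llama-cli", "llama-server", "llama-quantize")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND (try: source ~/.bash_profile)'}")
```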

### Downloading and quantizing a model

To run GPU-accelerated models, you'll want pure 4-bit quantized (`Q4_0`) models in GGUF, the model file format used by llama.cpp (see the [conversion guide](https://github.com/ggml-org/llama.cpp/discussions/2948)). You can either find pre-quantized models or quantize a model yourself using `llama-quantize`. For example, for Qwen2-1.5B-Instruct:

```
# Download fp16 model (https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF)
wget https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF/resolve/main/qwen2-1_5b-instruct-fp16.gguf

# Quantize (pure Q4_0)
llama-quantize --pure qwen2-1_5b-instruct-fp16.gguf qwen2-1_5b-instruct-q4_0-pure.gguf Q4_0
```
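
To get a feel for how much smaller the quantized file should be, the sketch below estimates weight storage for fp16 and `Q4_0`. It rests on two assumptions that are not stated on this page: the usual ggml `Q4_0` block layout (32 weights stored as 4-bit values plus one fp16 scale, so 18 bytes per block), and a parameter count of roughly 1.54 billion for Qwen2-1.5B. Real files will differ somewhat because of metadata and any tensors kept at higher precision.

```python
# Rough size estimate for fp16 vs. Q4_0 weights.
# Assumption: a Q4_0 block holds 32 weights as 4-bit values plus one fp16 scale.
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 2 + Q4_0_BLOCK_WEIGHTS // 2  # 2-byte scale + 16 bytes of nibbles


def bits_per_weight_q4_0():
    """Effective bits per weight, including the per-block scale."""
    return Q4_0_BLOCK_BYTES * 8 / Q4_0_BLOCK_WEIGHTS


def estimate_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 1.54e9  # assumed parameter count for Qwen2-1.5B
    print(f"Q4_0 bits per weight: {bits_per_weight_q4_0()}")
    print(f"fp16: ~{estimate_size_gb(n_params, 16):.2f} GB")
    print(f"Q4_0: ~{estimate_size_gb(n_params, bits_per_weight_q4_0()):.2f} GB")
```

Comparing the estimate with `ls -lh *.gguf` is a quick sanity check that the quantization step produced a file of the expected order of magnitude.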

### Running your first LLM using llama-cli

You're now ready to run the LLM via `llama-cli`. It'll automatically offload layers to the GPU:

```
llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off

# ... You'll see:
# load_tensors: offloaded 29/29 layers to GPU
# ...
# Knock knock, 11:59 pm ... rest of the story
```
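
If you script these runs, it can be useful to confirm that all layers were actually offloaded. The following is a minimal sketch that parses the `offloaded N/M layers to GPU` line shown above from captured log text. The function name `parse_offload` is hypothetical, and the sketch assumes the log line keeps the format shown in this example.

```python
import re

# Matches the log line shown above, e.g. "offloaded 29/29 layers to GPU"
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def parse_offload(log_text):
    """Return (offloaded, total) from llama.cpp log output, or None if absent."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    sample = "load_tensors: offloaded 29/29 layers to GPU"
    offloaded, total = parse_offload(sample)
    print(f"{offloaded} of {total} layers on GPU")
```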

🚀 You now have an LLM running on the GPU of your device!

### Serving LLMs using llama-server

Next, you can use `llama-server` to start a web server with a chat interface and an OpenAI-compatible chat completions API.

1. First, find the IP address of your development board:

   ```
   ifconfig | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1'

   # ... Example:
   # 192.168.1.253
   ```

2. Start the server via:

   ```
   llama-server -m ./qwen2-1_5b-instruct-q4_0-pure.gguf --no-warmup -b 128 -c 2048 -s 11 -n 128 --host 0.0.0.0 --port 9876
   ```

3. On your computer, open a web browser and navigate to `http://192.168.1.253:9876` (replace the IP address with the one you found in step 1):

   <Frame caption="Serving LLMs using llama-server">
     <img src="https://mintlify.s3.us-west-1.amazonaws.com/qualcomm-prod/images/ai-workflows/llamacpp1.png" />
   </Frame>

4. You can also access this server programmatically using the OpenAI Chat Completions API. For example, from Python:

   1. Create a new venv and install `requests`:

      ```
      python3 -m venv .venv-chat
      source .venv-chat/bin/activate
      pip3 install requests
      ```

   2. Create a new file `chat.py`:

      ```
      import requests

      # if running from your own computer, replace localhost with the IP address of your development board
      url = "http://localhost:9876/v1/chat/completions"

      payload = {
          "messages": [
              {"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": "Explain Qualcomm in one sentence."}
          ],
          "temperature": 0.7,
          "max_tokens": 200
      }

      response = requests.post(url, headers={ "Content-Type": "application/json" }, json=payload)
      print(response.json())
      ```

   3. Run `chat.py`:

      ```
      python3 chat.py

      # ...
      # {'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': 'Qualcomm is a leading global technology company that designs, develops, licenses, and markets semiconductor-based products and mobile platform technologies to major telecommunications and consumer electronics manufacturers worldwide.'}}], 'created': 1757073340, 'model': 'gpt-3.5-turbo', 'system_fingerprint': 'b6362-f6da8cb8', 'object': 'chat.completion', 'usage': {'completion_tokens': 34, 'prompt_tokens': 26, 'total_tokens': 60}, 'id': 'chatcmpl-3O7l005WG1DzN191FTNomJNweHMoH8Is', 'timings': {'prompt_n': 12, 'prompt_ms': 303.581, 'prompt_per_token_ms': 25.298416666666668, 'prompt_per_second': 39.52816546490064, 'predicted_n': 34, 'predicted_ms': 4052.23, 'predicted_per_token_ms': 119.18323529411765, 'predicted_per_second': 8.390441806116632}}
      ```
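
The response printed by `chat.py` is a chat completion object with an extra `timings` block, as seen in the output above. As a minimal sketch, the helper below pulls out just the assistant reply and the generation speed. The name `summarize_response` is hypothetical, and the field names are taken from the example output, so treat them as specific to this build.

```python
def summarize_response(data):
    """Return (reply_text, tokens_per_second) from a chat completion dict."""
    reply = data["choices"][0]["message"]["content"]
    # "timings" appears in the example output above; use .get() so the
    # helper still works if a server does not include it.
    tokens_per_second = data.get("timings", {}).get("predicted_per_second")
    return reply, tokens_per_second


if __name__ == "__main__":
    # Trimmed copy of the example response shown above
    sample = {
        "choices": [
            {
                "finish_reason": "stop",
                "index": 0,
                "message": {"role": "assistant", "content": "Qualcomm is a leading global technology company ..."},
            }
        ],
        "timings": {"predicted_n": 34, "predicted_per_second": 8.390441806116632},
    }
    reply, tps = summarize_response(sample)
    print(reply)
    print(f"{tps:.2f} tokens per second")
```

In `chat.py`, you could call it as `summarize_response(response.json())` instead of printing the full dictionary.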

### Serving multi-modal LLMs

You can also use multi-modal LLMs, for example [SmolVLM-500M-Instruct-GGUF](https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF). You need both the Q4\_0 quantized weights (download them pre-quantized, or quantize them yourself) and the CLIP encoder `mmproj-*.gguf` file. For example:

```
# Download weights
wget https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/SmolVLM-500M-Instruct-f16.gguf
wget https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-500M-Instruct-f16.gguf

# Quantize model (mmproj- models are not quantizable via llama-quantize, see below)
llama-quantize --pure SmolVLM-500M-Instruct-f16.gguf SmolVLM-500M-Instruct-q4_0-pure.gguf Q4_0

# Serve the model
llama-server -m ./SmolVLM-500M-Instruct-q4_0-pure.gguf --mmproj ./mmproj-SmolVLM-500M-Instruct-f16.gguf --no-warmup -b 128 -c 2048 -s 11 -n 128 --host 0.0.0.0 --port 9876
```

<Frame caption="Serving multi-modal LLMs using llama-server">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/qualcomm-prod/images/ai-workflows/llamacpp2.png" />
</Frame>

**CLIP model is still fp16:** The `mmproj` model is not quantized and remains fp16, so processing images will be slow. There is code to quantize the CLIP encoder in [older versions of llama.cpp](https://github.com/ggml-org/llama.cpp/pull/11644) that you can explore.
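
To query the multi-modal model from a script rather than the web interface, you can send an image through the chat completions endpoint. The sketch below only builds the request payload. It assumes that this server build accepts OpenAI-style `image_url` content parts with base64 data URIs when started with `--mmproj`, which this page does not confirm; the helper name `build_image_payload` is hypothetical.

```python
import base64


def build_image_payload(image_bytes, prompt, mime_type="image/jpeg"):
    """Build a chat completions payload with one text part and one image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        # Assumption: the server accepts base64 data URIs here
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                    },
                ],
            }
        ],
        "max_tokens": 200,
    }


if __name__ == "__main__":
    # Placeholder bytes; in practice read a real image file in binary mode
    payload = build_image_payload(b"not-a-real-image", "What is in this image?")
    print(payload["messages"][0]["content"][1]["image_url"]["url"][:40])
```

To use it, read an image file in binary mode and POST the returned dictionary as JSON to the same `/v1/chat/completions` URL used in `chat.py`.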

## Tips & tricks

### Comparing CPU performance

Add `-ngl 0` to the `llama-*` commands to skip offloading layers to the GPU. Models will run on CPU, and you can compare performance to that of the GPU.

For example, with the Qwen2-1.5B-Instruct Q4\_0 model:

**GPU:**

```
llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off

# llama_perf_sampler_print:    sampling time =      26.33 ms /   133 runs   (    0.20 ms per token,  5050.70 tokens per second)
# llama_perf_context_print:        load time =    3535.69 ms
# llama_perf_context_print: prompt eval time =     192.38 ms /     5 tokens (   38.48 ms per token,    25.99 tokens per second)
# llama_perf_context_print:        eval time =    5679.81 ms /   127 runs   (   44.72 ms per token,    22.36 tokens per second)
# llama_perf_context_print:       total time =    9276.10 ms /   132 tokens
# llama_perf_context_print:    graphs reused =        122
```

**CPU:**

```
llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off -ngl 0

# llama_perf_sampler_print:    sampling time =      15.44 ms /   133 runs   (    0.12 ms per token,  8615.66 tokens per second)
# llama_perf_context_print:        load time =    1061.95 ms
# llama_perf_context_print: prompt eval time =      51.75 ms /     5 tokens (   10.35 ms per token,    96.62 tokens per second)
# llama_perf_context_print:        eval time =    2789.13 ms /   127 runs   (   21.96 ms per token,    45.53 tokens per second)
# llama_perf_context_print:       total time =    3885.55 ms /   132 tokens
# llama_perf_context_print:    graphs reused =        122
```

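In this run, the CPU generates tokens about twice as fast as the GPU (45.53 vs. 22.36 tokens per second) and processes the prompt about 3.7 times as fast (96.62 vs. 25.99 tokens per second).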
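
If you repeat this comparison across models or settings, a small script saves reading the numbers by hand. The following is a minimal sketch that extracts the generation speed from the `eval time` line of the `llama_perf_context_print` output and computes the CPU-to-GPU ratio. The function name is hypothetical, and the sketch assumes the log format shown in the two runs above.

```python
import re

# Matches the "eval time" line but not "prompt eval time", because only
# whitespace is allowed between the log prefix and "eval time".
EVAL_RE = re.compile(
    r"llama_perf_context_print:\s+eval time\s*=.*?([\d.]+) tokens per second"
)


def eval_tokens_per_second(log_text):
    """Return generation speed in tokens per second, or None if not found."""
    match = EVAL_RE.search(log_text)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    gpu_log = (
        "llama_perf_context_print: prompt eval time =     192.38 ms /     5 tokens (   38.48 ms per token,    25.99 tokens per second)\n"
        "llama_perf_context_print:        eval time =    5679.81 ms /   127 runs   (   44.72 ms per token,    22.36 tokens per second)\n"
    )
    cpu_log = (
        "llama_perf_context_print: prompt eval time =      51.75 ms /     5 tokens (   10.35 ms per token,    96.62 tokens per second)\n"
        "llama_perf_context_print:        eval time =    2789.13 ms /   127 runs   (   21.96 ms per token,    45.53 tokens per second)\n"
    )
    gpu = eval_tokens_per_second(gpu_log)
    cpu = eval_tokens_per_second(cpu_log)
    print(f"GPU: {gpu} tok/s, CPU: {cpu} tok/s, CPU/GPU ratio: {cpu / gpu:.2f}")
```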
