Building llama.cpp
You’ll need to build some dependencies for llama.cpp. Open the terminal on your development board, or an ssh session to your development board, and run:-
Install build dependencies:
-
Install the OpenCL headers and ICD loader library:
-
Build llama.cpp with the OpenCL backend:
-
Add the llama.cpp paths to your PATH:
-
You now have llama.cpp:
Downloading and quantizing a model
To run GPU-accelerated models you’ll want pure 4-bit quantized (Q4_0) models in GGUF format (the llama.cpp format, conversion guide). You can either find pre-quantized models, or quantize a model yourself using llama-quantize. For example, for Qwen2-1.5B-Instruct:
Running your first LLM using llama-cli
You’re now ready to run the LLM viallama-cli. It’ll automatically offload layers to the GPU:
Serving LLMs using llama-server
Next, you can usellama-server to start a web server with a chat interface, and an OpenAI compatible chat completions API.
-
First, find the IP address of your development board:
-
Start the server via:
-
On your computer, open a web browser and navigate to
http://192.168.1.253:9876(replace the IP address with the one you found in 1.):
-
You can also programmatically access this server using the OpenAI Chat Completions API. E.g. from Python:
-
Create a new venv and install
requests: -
Create a new file
chat.py: -
Run
chat.py:
-
Create a new venv and install
Serving multi-modal LLMs
You can also use multi-modal LLMs. For example SmolVLM-500M-Instruct-GGUF. Download both the Q4_0 quantized weights (or quantize them yourself), and download the CLIP encodermmproj-*.gguf file. For example:

mmproj model is still fp16; and thus processing images will be slow. There is code to quantize the CLIP encoder in older versions of llama.cpp, that you can explore.
Tips & tricks
Comparing CPU performance
Add-ngl 0 to the llama-* commands to skip offloading layers to the GPU. Models will run on CPU, and you can compare performance to that of the GPU.
For example, the Qwen2-1.5B-Instruct Q4_0:
GPU:

