Qualcomm has created a containerized service that exposes an OpenAI-compatible API, allowing easy deployment and integration with OpenAI API-compatible frameworks such as LangChain and Open WebUI.

Key Features

OpenAI-Compatible API: Drop-in replacement for the OpenAI Chat Completions API
NPU Execution: Models run entirely on the NPU for fast performance, freeing up the CPU and GPU for other tasks.
Multi-Model Support: Supports LLMs such as qwen3-4b-instruct and llama3.1-8b as well as VLMs such as qwen3-4b-VL. Additional LLMs and VLMs can be added easily from AI Hub.
Automatic Context Management: Smart summarization when conversations get long
Thread-Based Sessions: Maintains conversation context across requests, allowing you to switch between LLM and VLM calls within the same container instance.

Setup

The steps below assume you have already set up your 9075 EVK using the instructions here. Once your device is set up, follow these steps to install and run the LLM/VLM microservice container. In this example, we will copy both an LLM and a VLM to the device and use a single instance of the container to run both.

Install required packages

# Add Qualcomm PPA repository
sudo add-apt-repository ppa:ubuntu-qcom-iot/qcom-ppa
sudo apt update
sudo apt install libqnn1

# Install docker-compose
sudo apt install docker-compose

Set up Docker

# Configure Docker group
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker

Download the LLM/VLM models

Let’s download the qwen3-4b-instruct model optimized for QCS9075 from Hugging Face to get started. You can find a list of available LLM and VLM models on Qualcomm’s AI Hub page; use the filters on the left to select the chipset and model type you are interested in.
Direct Download:

IQ-8275

LLM: Qwen3-4b-instruct for QCS8275 direct download link.
VLM: Qwen2.5-VL-7B-Instruct for QCS8275 direct download link.

IQ-9075

LLM: Qwen3-4b-instruct for QCS9075 direct download link.
VLM: Qwen2.5-VL-7B-Instruct for QCS9075 direct download link.

Install the LLM/VLM Models

Now that you have downloaded the zip files, copy them to your device. For this example, we will assume you have created a models directory at ~/models.

Note that the names of the zip files from step 3 may differ from what is shown below, so update the commands accordingly.

# Copy model(s) from host to device (assuming ~/models is your chosen models directory)
scp qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075.zip ubuntu@<device ip address>:~/models
scp qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075.zip ubuntu@<device ip address>:~/models

# SSH into device and unzip
ssh ubuntu@<device ip address>
cd ~/models
unzip qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075.zip
unzip qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075.zip

Your models directory on device should now look something like:
~/models/qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075/
~/models/qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075/

Install the docker compose file onto your device

Download the docker-compose.chatcompletion.yaml file for your device and OS, then push it to your device:
List of docker compose files

The compose file is currently named docker-compose-qcs9100-ubuntu.yaml in the deployment repository. Use that file for IQ-9075/QCS9075 unless a more specific compose file is published.

# Copy docker compose file to an existing directory on your device (~/tmp in this example)
scp docker-compose-qcs9100-ubuntu.yaml ubuntu@<device ip address>:~/tmp

Configure your LLM/VLM Container

Edit the docker compose file you copied to the device in the previous step (docker-compose-qcs9100-ubuntu.yaml in this example) to configure the following settings.
Under environment:

GENAI_PORT: The port on which the service is exposed. Change it if you want the service on a different port. (default: 9001)
MAX_ACTIVE_MODELS: Maximum number of distinct models the service will keep loaded simultaneously. When a new model is requested and the limit has been reached, the least-recently-used model is unloaded to make room. Each loaded model holds DSP/NPU resources and memory. (default: 2)
GENAI_CONTEXT_CAPPING: A safety cap on the context window each model is allowed to use. When set, every model’s native context size is reduced to min(original_size, cap). This limits memory usage and helps prevent out-of-memory situations on constrained devices. (default: 2048)
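
To make the effect of these two settings concrete, here is a small Python sketch of the behavior described above. It is an illustration only, not the service's actual implementation: the ModelCache class and effective_context function are hypothetical names, and "some_third_model" and the 4096-token native context size are made-up example values.

```python
from collections import OrderedDict
from typing import List, Optional


def effective_context(native_size: int, cap: Optional[int]) -> int:
    """Context window after GENAI_CONTEXT_CAPPING is applied (None = no cap)."""
    return native_size if cap is None else min(native_size, cap)


class ModelCache:
    """Toy least-recently-used cache mimicking MAX_ACTIVE_MODELS (hypothetical)."""

    def __init__(self, max_active_models: int = 2) -> None:
        self.max_active_models = max_active_models
        self._loaded = OrderedDict()  # model name -> True, oldest use first

    def request(self, model: str) -> Optional[str]:
        """Mark `model` as used; return the name of the evicted model, if any."""
        if model in self._loaded:
            self._loaded.move_to_end(model)  # now the most recently used
            return None
        evicted = None
        if len(self._loaded) >= self.max_active_models:
            evicted, _ = self._loaded.popitem(last=False)  # drop the LRU entry
        self._loaded[model] = True
        return evicted

    def loaded(self) -> List[str]:
        """Currently loaded models, least recently used first."""
        return list(self._loaded)


cache = ModelCache(max_active_models=2)
cache.request("qwen3_4b_instruct_2507")
cache.request("qwen2_5_vl_7b_instruct")
cache.request("qwen3_4b_instruct_2507")      # refreshes the LLM
evicted = cache.request("some_third_model")  # hypothetical third model
print(evicted)                               # qwen2_5_vl_7b_instruct
print(effective_context(4096, 2048))         # 2048
```

In this sketch the VLM is evicted because the LLM was used more recently. Raising MAX_ACTIVE_MODELS avoids such reloads at the cost of more NPU resources and memory held at once.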

Under volumes:

GENAI_MODEL_DIR: Change this to point to your models directory. In this example, we are using ~/models, so change the entry to:
${GENAI_MODEL_DIR:-~/models}:/mnt/work/models/
Make sure this points to your models directory, or the container won’t be able to find your models.
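
The ${GENAI_MODEL_DIR:-~/models} syntax means "use the GENAI_MODEL_DIR environment variable if it is set and non-empty, otherwise fall back to ~/models". As a quick sanity check before starting the container, the following Python sketch mirrors that rule and lists the folders that would be mounted under /mnt/work/models/. The helper names are hypothetical, and the sketch does not check what the service expects inside each model folder.

```python
import os
from pathlib import Path
from typing import Dict, List


def resolve_model_dir(env: Dict[str, str], default: str = "~/models") -> Path:
    """Mirror ${GENAI_MODEL_DIR:-~/models}: unset or empty falls back to default."""
    value = env.get("GENAI_MODEL_DIR") or default
    return Path(value).expanduser()


def list_model_folders(model_dir: Path) -> List[str]:
    """Names of the subdirectories that would appear under /mnt/work/models/."""
    if not model_dir.is_dir():
        return []
    return sorted(p.name for p in model_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    model_dir = resolve_model_dir(dict(os.environ))
    print(f"Host models directory: {model_dir}")
    for name in list_model_folders(model_dir):
        print(f"  {name}")
```

If the listing is empty, the container would most likely start without any models available, so fix the path before continuing.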

Start the container

Docker Compose will automatically pull the container image to your device if it is not already present and run it with your configured settings.

# Start the container (add -d at the end to run detached)
docker-compose -f docker-compose-qcs9100-ubuntu.yaml up
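
When the container is started detached or from a script, it can help to wait until the HTTP server responds before sending requests. Below is a minimal Python sketch that polls the /docs page used in the verification steps. The function names, retry count, and delay are illustrative assumptions, and a 200 response only shows that the web server is up, not that a model has finished loading.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def wait_until_ready(
    url: str,
    attempts: int = 30,
    delay: float = 2.0,
    probe: Callable[[str], int] = http_status,
) -> bool:
    """Poll `url` until it returns 200 or the attempts run out."""
    for _ in range(attempts):
        if probe(url) == 200:
            return True
        time.sleep(delay)
    return False
```

For example, wait_until_ready("http://<Device IP address>:9001/docs") returns True once the API browser page is reachable, assuming the default GENAI_PORT of 9001.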

Verify the LLM is working

Open a browser and navigate to http://<Device IP address>:<port>/docs to open the API browser. In the /v1/chat/completions/ API:

Select ‘Try it out’
Replace the request body with:

{
    "messages": [
        {
            "content": "You are a helpful assistant",
            "role": "system"
        },
        {
            "content": "What is the capital of Italy?",
            "role": "user"
        }
    ],
    "model": "qwen3_4b_instruct_2507",
    "stream": false
}

Click on ‘Execute’ to send the request. You should see the response on the page below.
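
The same request can also be sent from a script. The following is a minimal Python sketch using only the standard library. It assumes the endpoint is reachable at /v1/chat/completions on your configured port and that the response follows the OpenAI Chat Completions shape (choices[0].message.content), which the OpenAI-compatible API suggests but this guide does not show. The function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str,
    model: str,
    user_text: str,
    system_text: str = "You are a helpful assistant",
) -> urllib.request.Request:
    """Build a POST request carrying the same body as the example above."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response (assumed shape)."""
    return response_body["choices"][0]["message"]["content"]


def ask(base_url: str, model: str, user_text: str, timeout: float = 120.0) -> str:
    """Send the request to the device and return the model's reply."""
    request = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

A call such as ask("http://<Device IP address>:9001", "qwen3_4b_instruct_2507", "What is the capital of Italy?") would then return the reply text, with the address and port replaced by your own values.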

Verify the VLM is working

Open a browser and navigate to http://<Device IP address>:<port>/docs to open the API browser. In the /v1/chat/completions/ API:

Select ‘Try it out’
Replace the request body with:

{
    "model": "qwen2_5_vl_7b_instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://images.pexels.com/photos/210186/pexels-photo-210186.jpeg?cs=srgb&dl=cascade-clouds-cool-wallpaper-210186.jpg&fm=jpg"
                    }
                }
            ]
        }
    ]
}

Click on ‘Execute’ to send the request. You should see the response on the page below. Note that the VLM may take a few seconds to run.
The first time you call the VLM after calling the LLM (or vice versa), the response is slightly delayed while the container unloads the previously used model and loads the requested one. Subsequent calls to the same model execute much faster.
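
For scripted use, the multimodal request body can be built programmatically. The Python sketch below constructs the same text-plus-image message structure as the example above and includes a small timing helper for observing the model-switch delay. The function names and the example.com image URL are placeholders introduced for illustration.

```python
import json
import time
from typing import Any, Callable, Tuple


def build_vlm_payload(model: str, question: str, image_url: str) -> dict:
    """Build a chat request whose user message mixes text and an image URL."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def timed(call: Callable[[], Any]) -> Tuple[Any, float]:
    """Run `call` and return its result with the elapsed time in seconds."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


payload = build_vlm_payload(
    "qwen2_5_vl_7b_instruct",
    "What is in this image?",
    "https://example.com/photo.jpg",  # placeholder image URL
)
print(json.dumps(payload, indent=2))
```

Wrapping the function that sends this payload in timed(...) for two consecutive calls is one way to compare the first response after a model switch with later ones.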

Stopping the Container

When you are finished, you can stop the container using the following:

docker-compose -f docker-compose-qcs9100-ubuntu.yaml down
