# Running the LLM/VLM container

Qualcomm has created a containerized service that exposes an OpenAI-compatible API, allowing easy deployment and integration with any OpenAI API-compatible framework such as [LangChain](https://docs.langchain.com/) and [Open WebUI](https://github.com/open-webui/open-webui).

## Key Features

* OpenAI-Compatible API: Drop-in replacement for the OpenAI Chat Completions API
* Models run entirely on the NPU for fast performance, freeing up the CPU/GPU for other tasks.
* Multi-Model Support: Supports LLMs such as qwen3-4b-instruct and llama3.1-8b as well as VLMs such as qwen3-4b-VL. Additional LLMs/VLMs can be added easily from [AI Hub](http://aihub.qualcomm.com).
* Automatic Context Management: Smart summarization when conversations get long
* Thread-Based Sessions: Maintains conversation context across requests, allowing you to switch between calls to the LLM and VLM within the same container instance.

# Setup

The steps below assume you have already [set up your 9075 EVK using the instructions here](/devices/iq9075-evk/setup).

Once your device is set up, you can follow these steps to install and run the LLM/VLM microservice container. In this example, we will copy both an LLM and a VLM to the device and use a single instance of the container to run both.

```bash theme={null}
# Add Qualcomm PPA repository
sudo add-apt-repository ppa:ubuntu-qcom-iot/qcom-ppa
sudo apt update
sudo apt install libqnn1

# Install docker-compose
sudo apt install docker-compose
```

```bash theme={null}
# Configure Docker group
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
```

Let's download the qwen3-4b-instruct model optimized for QCS9075 from [Hugging Face](https://huggingface.co/qualcomm/Qwen3-4B-Instruct-2507) to get started.
You can find a list of available LLM and VLM models on [Qualcomm's AI Hub page](https://aihub.qualcomm.com/iot/models?domain=Generative+AI\&useCase=Text+Generation\&chipsets=qualcomm-qcs9075) - you can use the filters on the left to select the chipset and model type you are interested in.
**Direct Download:**

* **LLM:** [Qwen3-4b-instruct for QCS8275 direct download link.](https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-models/models/qwen3_4b_instruct_2507/releases/v0.53.1/qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs8275.zip)
* **VLM:** [Qwen2.5-VL-7B-Instruct for QCS8275 direct download link.](https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-models/models/qwen2_5_vl_7b_instruct/releases/v0.53.1/qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs8275.zip)
* **LLM:** [Qwen3-4b-instruct for QCS9075 direct download link.](https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-models/models/qwen3_4b_instruct_2507/releases/v0.53.1/qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075.zip)
* **VLM:** [Qwen2.5-VL-7B-Instruct for QCS9075 direct download link.](https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-models/models/qwen2_5_vl_7b_instruct/releases/v0.53.1/qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075.zip)

Now that you have downloaded the zip file, let's copy it to your device. For this example we will assume you have created a models directory at `~/models`.
Note that the names of the zip files you downloaded may differ from those shown below, so update the commands accordingly. Replace `<device-ip>` with the IP address or hostname of your device.

```bash theme={null}
# Copy model(s) from host to device (assuming ~/models is your chosen models directory)
scp -r qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075.zip ubuntu@<device-ip>:~/models
scp -r qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075.zip ubuntu@<device-ip>:~/models

# SSH into device and unzip
ssh ubuntu@<device-ip>
cd ~/models
unzip qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075.zip
unzip qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075.zip
```

Your models directory on device should now look something like:
`~/models/qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075/`
`~/models/qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075/`
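
Before moving on, it can help to confirm that both archives unpacked where the container will look for them. The following is a minimal sketch: `check_models` is a hypothetical helper (not part of the Qualcomm tooling), and the directory names in the example call assume the QCS9075 archives used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the Qualcomm tooling): report whether each
# expected model directory exists under the given models directory.
check_models() {
  local models_dir="$1"
  shift
  local name missing=0
  for name in "$@"; do
    if [ -d "${models_dir}/${name}" ]; then
      echo "ok: ${name}"
    else
      echo "missing: ${name}"
      missing=1
    fi
  done
  # Non-zero exit status if any directory was not found.
  return "${missing}"
}

# Example call on the device (directory names assume the QCS9075 archives):
#   check_models "$HOME/models" \
#     qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075 \
#     qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075
```

If a directory is reported as missing, check whether the archive unpacked into a differently named folder and adjust accordingly.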
Download the docker compose file for your device and OS, then push it to your device:
[List of docker compose files](https://git.codelinaro.org/clo/le/solutions-microservices/-/tree/iot-solutions.lnx.1.0/microservices/llm-vlm/chatcompletions/deploy)

The compose file is currently named `docker-compose-qcs9100-ubuntu.yaml` in the deployment repository. Use that file for IQ-9075/QCS9075 unless a more specific compose file is published.

```bash theme={null}
# Copy docker compose file to somewhere on your device
scp docker-compose-qcs9100-ubuntu.yaml ubuntu@<device-ip>:~/tmp
```

Edit the `docker-compose-qcs9100-ubuntu.yaml` file you copied to the device in the previous step to configure:
Under `environment:`

* **GENAI\_PORT**: The port on which the service is exposed. (default: 9001)
* **MAX\_ACTIVE\_MODELS**: Maximum number of distinct models the service will keep loaded simultaneously. When a new model is requested and the limit has been reached, the least-recently-used model is unloaded to make room. Each loaded model holds DSP/NPU resources and memory. (default: 2)
* **GENAI\_CONTEXT\_CAPPING**: A safety cap on the context window each model is allowed to use. When set, every model's native context size is reduced to min(original\_size, cap). This limits memory usage and helps prevent out-of-memory situations on constrained devices. (default: 2048)
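
To make the `min(original_size, cap)` rule concrete, here is a small sketch of the arithmetic. `effective_context` is a hypothetical helper, and the native context sizes are illustrative values, not figures taken from the shipped models.

```bash
#!/usr/bin/env bash
# Hypothetical helper illustrating GENAI_CONTEXT_CAPPING:
# effective context = min(native context size, cap).
effective_context() {
  local native="$1" cap="$2"
  if [ "${native}" -lt "${cap}" ]; then
    echo "${native}"
  else
    echo "${cap}"
  fi
}

# Illustrative native sizes (assumed, not taken from the model cards).
effective_context 4096 2048   # prints 2048: the cap applies
effective_context 1024 2048   # prints 1024: already below the cap
```

With the default cap of 2048, only models whose native context exceeds 2048 tokens are affected.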
Under `volumes:`

* **GENAI\_MODEL\_DIR**: Change this to point to your models directory. In this example, we are using `~/models`, so change it to:
`${GENAI_MODEL_DIR:-~/models}:/mnt/work/models/`

Make sure to set this to your models directory, or the container won't be able to find your models.

The docker compose file will automatically download the container image to your device if it is not already present and run it with your configured settings.

```bash theme={null}
# Start the container (add -d at the end to run detached)
docker-compose -f docker-compose-qcs9100-ubuntu.yaml up
```

To test the LLM, open a browser and navigate to `http://<device-ip>:<GENAI_PORT>/docs` to open the API browser. In the `/v1/chat/completions/` API:

1. Select 'Try it out'
2. Replace the request body with:

   ```json theme={null}
   {
     "messages": [
       {
         "content": "You are a helpful assistant",
         "role": "system"
       },
       {
         "content": "What is the capital of Italy?",
         "role": "user"
       }
     ],
     "model": "qwen3_4b_instruct_2507",
     "stream": false
   }
   ```

3. Click on 'Execute' to send the request. You should see the response on the page below.

[Image of LLM response]
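
The same request can also be sent from a terminal instead of the API browser. The sketch below assumes `curl` is installed and that the endpoint path follows the OpenAI convention (`/v1/chat/completions`); check the `/docs` page for the exact path exposed by your container. `chat_payload` and `send_chat` are hypothetical helper names, not part of the service.

```bash
#!/usr/bin/env bash
# Build the request body shown above for a given model and user prompt.
# Assumption: the arguments contain no double quotes or backslashes, since
# this sketch does no JSON escaping.
chat_payload() {
  local model="$1" prompt="$2"
  printf '{"model": "%s", "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant"}, {"role": "user", "content": "%s"}]}\n' \
    "${model}" "${prompt}"
}

# POST the body to the chat completions endpoint.
# Arguments: device host or IP, port (GENAI_PORT), model name, prompt.
send_chat() {
  local host="$1" port="$2" model="$3" prompt="$4"
  curl -sS -X POST "http://${host}:${port}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "${model}" "${prompt}")"
}

# Example call (replace <device-ip>; 9001 is the default GENAI_PORT):
#   send_chat <device-ip> 9001 qwen3_4b_instruct_2507 "What is the capital of Italy?"
```

Keeping the payload builder separate from the network call makes it easy to inspect the JSON before sending it.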

To test the VLM, open a browser and navigate to `http://<device-ip>:<GENAI_PORT>/docs` to open the API browser. In the `/v1/chat/completions/` API:

1. Select 'Try it out'
2. Replace the request body with:

   ```json theme={null}
   {
     "model": "qwen2_5_vl_7b_instruct",
     "messages": [
       {
         "role": "user",
         "content": [
           {
             "type": "text",
             "text": "What is in this image?"
           },
           {
             "type": "image_url",
             "image_url": {
               "url": "https://images.pexels.com/photos/210186/pexels-photo-210186.jpeg?cs=srgb&dl=cascade-clouds-cool-wallpaper-210186.jpg&fm=jpg"
             }
           }
         ]
       }
     ]
   }
   ```

3. Click on 'Execute' to send the request. You should see the response on the page below. Note that the VLM may take a few seconds to run.

[Image of VLM response]
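
A VLM request differs from the LLM request in that `content` becomes a list of typed parts (text plus `image_url`). The sketch below builds that body for an arbitrary image URL and posts it with `curl`. `vlm_payload` and `send_vlm` are hypothetical helper names, the endpoint path is assumed to follow the OpenAI convention, and no JSON escaping is applied to the arguments.

```bash
#!/usr/bin/env bash
# Build a VLM request body with one text part and one image_url part.
# Assumption: arguments contain no double quotes or backslashes.
vlm_payload() {
  local model="$1" question="$2" image_url="$3"
  printf '{"model": "%s", "messages": [{"role": "user", "content": [{"type": "text", "text": "%s"}, {"type": "image_url", "image_url": {"url": "%s"}}]}]}\n' \
    "${model}" "${question}" "${image_url}"
}

# POST the body to the chat completions endpoint.
# Arguments: device host or IP, port (GENAI_PORT), model, question, image URL.
send_vlm() {
  local host="$1" port="$2" model="$3" question="$4" image_url="$5"
  curl -sS -X POST "http://${host}:${port}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(vlm_payload "${model}" "${question}" "${image_url}")"
}

# Example call (replace <device-ip> and <image-url>):
#   send_vlm <device-ip> 9001 qwen2_5_vl_7b_instruct "What is in this image?" "<image-url>"
```

The image URL must be reachable from wherever the service fetches it, which is presumably the device itself.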

When you call the VLM for the first time after previously calling the LLM (or vice versa), there will be a slight delay in the response while the container loads the newly requested model and, if the `MAX_ACTIVE_MODELS` limit has been reached, unloads the least-recently-used one. Subsequent calls will execute much faster.

When you are finished, you can stop the container using the following:

```bash theme={null}
docker-compose -f docker-compose-qcs9100-ubuntu.yaml down
```