# Running the LLM/VLM container

Qualcomm provides a containerized service that exposes an OpenAI-compatible API, making it easy to deploy and integrate with OpenAI API-compatible frameworks such as [LangChain](https://docs.langchain.com/) and [Open WebUI](https://github.com/open-webui/open-webui).

## Key Features

* OpenAI-Compatible API: Drop-in replacement for OpenAI Chat Completions API
* Models run entirely on the NPU for fast performance, freeing up the CPU/GPU for other tasks.
* Multi-Model Support: Supports LLMs such as qwen3-4b-instruct and llama3.1-8b as well as VLMs such as qwen3-4b-VL. Additional LLMs and VLMs can be added from [AI Hub](http://aihub.qualcomm.com).
* Automatic Context Management: Smart summarization when conversations get long
* Thread-Based Sessions: Maintains conversation context across requests, allowing you to switch between calls to LLM and VLM within the same container instance.

## Setup

The steps below assume you have already [set up your 9075 EVK using the instructions here](/devices/iq9075-evk/setup).
Once your device is set up, follow these steps to install and run the LLM/VLM microservice container. In this example, we copy both an LLM and a VLM to the device and use a single container instance to run both.

<Steps>
  <Step title="Install required packages">
    ```bash theme={null}
    # Add Qualcomm PPA repository
    sudo add-apt-repository ppa:ubuntu-qcom-iot/qcom-ppa
    sudo apt update
    sudo apt install libqnn1

    # Install docker-compose
    sudo apt install docker-compose
    ```
  </Step>

  <Step title="Set up Docker">
    ```bash theme={null}
    # Configure Docker group (-f: do not fail if the group already exists)
    sudo groupadd -f docker
    sudo usermod -aG docker $USER
    newgrp docker
    ```
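
    Group membership changes only apply to new sessions, so it can help to confirm that the current shell already sees the `docker` group before continuing. The following is a minimal sketch; `has_group` is a hypothetical helper name, not part of Docker.

    ```bash
    # Hypothetical helper: succeed if the space-separated group list in $1 contains group $2.
    has_group() {
      local groups_list="$1" wanted="$2" g
      for g in $groups_list; do
        [ "$g" = "$wanted" ] && return 0
      done
      return 1
    }

    # `id -nG` prints the group names of the current shell session.
    if has_group "$(id -nG)" docker; then
      echo "docker group is active in this shell"
    else
      echo "docker group not active yet; log out and back in, or run 'newgrp docker'"
    fi
    ```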
  </Step>

  <Step title="Download the LLM/VLM models">
    Let's download the qwen3-4b-instruct model optimized for QCS9075 from [Hugging Face](https://huggingface.co/qualcomm/Qwen3-4B-Instruct-2507) to get started. You can find a list of available LLM and VLM models on [Qualcomm's AI Hub page](https://aihub.qualcomm.com/iot/models?domain=Generative+AI\&useCase=Text+Generation\&chipsets=qualcomm-qcs9075); use the filters on the left to select the chipset and model type you are interested in.<br />
    **Direct Download:**<br />

    <AccordionGroup>
      <Accordion title="IQ-8275">
        **LLM:** [Qwen3-4b-instruct for QCS8275 direct download link.](https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-models/models/qwen3_4b_instruct_2507/releases/v0.53.1/qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs8275.zip)<br />
        **VLM:** [Qwen2.5-VL-7B-Instruct for QCS8275 direct download link.](https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-models/models/qwen2_5_vl_7b_instruct/releases/v0.53.1/qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs8275.zip)<br />
      </Accordion>

      <Accordion title="IQ-9075">
        **LLM:** [Qwen3-4b-instruct for QCS9075 direct download link.](https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-models/models/qwen3_4b_instruct_2507/releases/v0.53.1/qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075.zip)<br />
        **VLM:** [Qwen2.5-VL-7B-Instruct for QCS9075 direct download link.](https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-models/models/qwen2_5_vl_7b_instruct/releases/v0.53.1/qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075.zip)<br />
      </Accordion>
    </AccordionGroup>
  </Step>

  <Step title="Install the LLM/VLM Models">
    Now that you have downloaded the zip files, let's copy them to your device. This example assumes you have created a models directory at `~/models` on the device.<br />
    <Note>The zip file names from step 3 may differ from those shown below; update the commands accordingly.</Note>

    ```bash theme={null}
    # Copy model(s) from host to device (assuming ~/models is your chosen models directory)
    scp qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075.zip ubuntu@<device ip address>:~/models
    scp qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075.zip ubuntu@<device ip address>:~/models

    # SSH into device and unzip
    ssh ubuntu@<device ip address>
    cd ~/models
    unzip qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075.zip
    unzip qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075.zip
    ```

    Your models directory on device should now look something like:<br />
    `~/models/qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075/`<br />
    `~/models/qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075/`<br />
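
    A quick way to confirm the layout on the device is to check that each expected directory exists. This is a minimal sketch; `missing_models` and `MODELS_DIR` are hypothetical names, and the directory names are the ones from this example, so adjust them if your zip files unpack differently.

    ```bash
    # Hypothetical helper: print each expected model directory under $1 that is missing.
    missing_models() {
      local base="$1" name
      shift
      for name in "$@"; do
        [ -d "$base/$name" ] || echo "$name"
      done
    }

    # Assumed location of the models directory used in this example.
    MODELS_DIR="${MODELS_DIR:-$HOME/models}"

    # Prints nothing when both model directories are present.
    missing_models "$MODELS_DIR" \
      qwen3_4b_instruct_2507-genie-w4a16-qualcomm_qcs9075 \
      qwen2_5_vl_7b_instruct-genie-w4a16-qualcomm_qcs9075
    ```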
  </Step>

  <Step title="Install the docker compose file onto your device">
    Download the chat completions Docker Compose file for your device and OS, then copy it to your device:<br />
    [List of docker compose files](https://git.codelinaro.org/clo/le/solutions-microservices/-/tree/iot-solutions.lnx.1.0/microservices/llm-vlm/chatcompletions/deploy)

    <Note>The compose file is currently named `docker-compose-qcs9100-ubuntu.yaml` in the deployment repository. Use that file for IQ-9075/QCS9075 unless a more specific compose file is published.</Note>

    ```bash theme={null}
    # Copy the Docker Compose file to the home directory on your device
    scp docker-compose-qcs9100-ubuntu.yaml ubuntu@<device ip address>:~/
    ```
  </Step>

  <Step title="Configure your LLM/VLM Container">
    Edit the Docker Compose file you copied to the device in the previous step (`docker-compose-qcs9100-ubuntu.yaml` in this example) to configure the following.<br />
    Under `environment:`

    * **GENAI\_PORT**: Port on which the service is exposed (default: 9001)<br />
    * **MAX\_ACTIVE\_MODELS**: Maximum number of distinct models the service will keep loaded simultaneously. When a new model is requested and the limit has been reached, the least-recently-used model is unloaded to make room.  Each loaded model holds DSP/NPU resources and memory. (default: 2)<br />
    * **GENAI\_CONTEXT\_CAPPING**: A safety cap on the context window each model is allowed to use. When set, every model's native context size is reduced to min(original\_size, cap). This limits memory usage and helps prevent out-of-memory situations on constrained devices. (default: 2048)<br />
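
    The capping rule for `GENAI_CONTEXT_CAPPING` can be illustrated with a small sketch. The `effective_context` helper and the native sizes below are hypothetical and only mirror the `min(original_size, cap)` rule described above; they are not part of the service.

    ```bash
    # Illustrative only: the effective context is min(native_size, cap).
    effective_context() {
      local native="$1" cap="$2"
      if [ "$native" -lt "$cap" ]; then
        echo "$native"
      else
        echo "$cap"
      fi
    }

    effective_context 4096 2048   # prints 2048 (capped)
    effective_context 1024 2048   # prints 1024 (native size is already below the cap)
    ```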

    Under `volumes:`

    * **GENAI\_MODEL\_DIR**: Point this at your models directory. This example uses \~/models, so change the entry to:<br /> `${GENAI_MODEL_DIR:-~/models}:/mnt/work/models/`
      <Warning>Make sure to set this to your models directory, or the container won't be able to find your models</Warning>
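
    The `${GENAI_MODEL_DIR:-~/models}` form follows the same default-value rule as shell parameter expansion: Compose uses the `GENAI_MODEL_DIR` environment variable when it is set and non-empty, and falls back to the path after `:-` otherwise. The sketch below mirrors that rule in plain shell; `resolve_model_dir` and `/data/models` are hypothetical and used only for illustration.

    ```bash
    # Hypothetical helper mirroring the ${VAR:-default} rule: use $1 when it is non-empty, else $2.
    resolve_model_dir() {
      local value="$1" default="$2"
      echo "${value:-$default}"
    }

    resolve_model_dir "" "$HOME/models"               # empty: falls back to the default
    resolve_model_dir "/data/models" "$HOME/models"   # set: the override wins
    ```

    Assuming the compose file keeps this form, exporting `GENAI_MODEL_DIR` in the shell before starting the container would mount a different directory without editing the file.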
  </Step>

  <Step title="Start the container">
    Docker Compose pulls the container image to your device if it is not already present, then runs it with your configured settings.

    ```bash theme={null}
    # Start the container from the directory that contains the compose file (add -d at the end to run detached)
    docker-compose -f docker-compose-qcs9100-ubuntu.yaml up
    ```
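
    When running detached, it can be handy to wait until the API browser page responds before sending requests. The following is a minimal sketch, assuming `curl` is installed; `docs_url` and `wait_for_service` are hypothetical helper names, and the host and port arguments stand for your device address and configured `GENAI_PORT`.

    ```bash
    # Hypothetical helper: build the API browser URL from a host ($1) and port ($2).
    docs_url() {
      echo "http://$1:$2/docs"
    }

    # Hypothetical helper: poll the API browser page until it responds.
    # $1 = host, $2 = port, $3 = number of attempts (default 30, two seconds apart)
    wait_for_service() {
      local url tries
      url="$(docs_url "$1" "$2")"
      tries="${3:-30}"
      while [ "$tries" -gt 0 ]; do
        # -f: treat HTTP errors as failures; -s: silent; -o /dev/null: discard the page body
        if curl -fs -o /dev/null "$url"; then
          echo "service is up at $url"
          return 0
        fi
        tries=$((tries - 1))
        sleep 2
      done
      echo "service did not respond at $url" >&2
      return 1
    }
    ```

    For example, `wait_for_service <device ip address> 9001` returns once the page at `/docs` is reachable, or fails after the attempts run out.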
  </Step>

  <Step title="Verify the LLM is working">
    Open a browser and navigate to `http://<Device IP address>:<port>/docs` to open the API browser.
    In the `/v1/chat/completions/` API:

    1. Select 'Try it out'
    2. Replace the request body with:

    ```json theme={null}
    {
      "messages": [
        {
          "content": "You are a helpful assistant",
          "role": "system"
        },
        {
          "content": "What is the capital of Italy?",
          "role": "user"
        }
      ],
      "model": "qwen3_4b_instruct_2507",
      "stream": false
    }
    ```

    3. Click on 'Execute' to send the request.  You should see the response on the page below.
           <img src="https://mintlify.s3.us-west-1.amazonaws.com/qualcomm-prod/images/running-building-ai-models/container-llm-response.png" alt="Image of LLM response" />
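
    The same request can also be sent from a terminal. This is a minimal sketch, assuming `curl` is installed and that the endpoint is reachable at `/v1/chat/completions` as in the OpenAI API (adjust the path if the API browser shows a different one). `DEVICE_IP`, its default value, `chat_payload`, and `ask_llm` are hypothetical names used for illustration.

    ```bash
    # Placeholders (assumptions): set these to your device address and configured GENAI_PORT.
    DEVICE_IP="${DEVICE_IP:-192.168.1.100}"
    GENAI_PORT="${GENAI_PORT:-9001}"

    # Print the request body. $1 = model name, $2 = user prompt.
    # No JSON escaping is done, so the prompt must not contain double quotes or backslashes.
    chat_payload() {
      printf '{"model": "%s", "stream": false, "messages": [' "$1"
      printf '{"role": "system", "content": "You are a helpful assistant"}, '
      printf '{"role": "user", "content": "%s"}]}\n' "$2"
    }

    # Send the prompt in $1 to the LLM; the body is read from stdin via -d @-.
    ask_llm() {
      chat_payload "qwen3_4b_instruct_2507" "$1" |
        curl -sS -X POST "http://${DEVICE_IP}:${GENAI_PORT}/v1/chat/completions" \
          -H "Content-Type: application/json" \
          -d @-
    }
    ```

    Running `ask_llm "What is the capital of Italy?"` prints the raw JSON response. If the response follows the OpenAI chat completion shape, the answer text should appear under `choices[0].message.content`.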
  </Step>

  <Step title="Verify the VLM is working">
    Open a browser and navigate to `http://<Device IP address>:<port>/docs` to open the API browser.
    In the `/v1/chat/completions/` API:

    1. Select 'Try it out'
    2. Replace the request body with:

    ```json theme={null}
    {
      "model": "qwen2_5_vl_7b_instruct",
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": "What is in this image?"
            },
            {
              "type": "image_url",
              "image_url": {
                "url": "https://images.pexels.com/photos/210186/pexels-photo-210186.jpeg?cs=srgb&dl=cascade-clouds-cool-wallpaper-210186.jpg&fm=jpg"
              }
            }
          ]
        }
      ]
    }
    ```

    3. Click on 'Execute' to send the request.  You should see the response on the page below.  Note that the VLM may take a few seconds to run.
           <img src="https://mintlify.s3.us-west-1.amazonaws.com/qualcomm-prod/images/running-building-ai-models/container-vlm-response.png" alt="Image of VLM response" />
       <Note>The first VLM call after an LLM call (or vice versa) is slightly delayed while the container loads the requested model and, if the `MAX_ACTIVE_MODELS` limit has been reached, unloads the least-recently-used one. Subsequent calls run much faster.</Note>
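
    The VLM request can be scripted in a similar way. This is a minimal sketch, assuming `curl` is installed and that the endpoint is reachable at `/v1/chat/completions`. `DEVICE_IP`, its default value, `vlm_payload`, and `ask_vlm` are hypothetical names used for illustration.

    ```bash
    # Placeholders (assumptions): set these to your device address and configured GENAI_PORT.
    DEVICE_IP="${DEVICE_IP:-192.168.1.100}"
    GENAI_PORT="${GENAI_PORT:-9001}"

    # Print the request body. $1 = model name, $2 = question, $3 = image URL.
    # No JSON escaping is done, so the arguments must not contain double quotes or backslashes.
    vlm_payload() {
      printf '{"model": "%s", "messages": [{"role": "user", "content": [' "$1"
      printf '{"type": "text", "text": "%s"}, ' "$2"
      printf '{"type": "image_url", "image_url": {"url": "%s"}}]}]}\n' "$3"
    }

    # Ask the VLM the question in $1 about the image at URL $2.
    ask_vlm() {
      vlm_payload "qwen2_5_vl_7b_instruct" "$1" "$2" |
        curl -sS -X POST "http://${DEVICE_IP}:${GENAI_PORT}/v1/chat/completions" \
          -H "Content-Type: application/json" \
          -d @-
    }
    ```

    For example, `ask_vlm "What is in this image?" "<image url>"` sends the same kind of request as the API browser and prints the raw JSON response.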
  </Step>

  <Step title="Stopping the Container">
    When you are finished, you can stop the container using the following:

    ```bash theme={null}
    docker-compose -f docker-compose-qcs9100-ubuntu.yaml down
    ```
  </Step>
</Steps>
