
# Qualcomm® Intelligent Multimedia (IM) SDK

> Build zero-copy AI vision pipelines using GStreamer plugins that run on the GPU and NPU of Dragonwing devices.

The Qualcomm® Intelligent Multimedia (IM) SDK is a set of GStreamer plugins that let you run computer vision operations on the GPU of your Dragonwing development board, and build AI pipelines that run fully on the GPU and NPU without ever having to yield back to the CPU (zero-copy). Together, this makes it possible to achieve much higher throughput than when you implement AI computer vision pipelines yourself in, for example, OpenCV + TFLite.

## So... GStreamer pipelines?

The IM SDK is built on top of GStreamer. GStreamer is a multimedia framework that lets you describe a processing pipeline for video or audio, and it takes care of running each step in order. In "normal Python" you might write OpenCV code that grabs a frame from a webcam, resizes and crops it, calls into an inference function, draws bounding boxes on the result, and then outputs or displays the frame again — with each step running on the CPU unless you explicitly wire up GPU/NPU APIs yourself. With GStreamer + IM SDK, you declare that same sequence in a pipeline string, and the framework streams frames through the chain for you.

What IM SDK adds on Qualcomm hardware is the ability for those steps to be transparently accelerated: resize/crop and drawing bounding boxes can run on the GPU, inference can run on the NPU, and whole chains of operations (e.g. crop → resize → NN inference) can execute without ever yielding back to the CPU (zero-copy). From your application you only need to configure the pipeline; the underlying framework handles frame-by-frame scheduling, synchronization, and accelerator offload.

The IM SDK provides the special GStreamer plugins that make this possible. For example, `qtivtransform` offloads color conversion, cropping, and resizing to the GPU, while `qtimltflite` handles inference on the NPU. This way, the same high-level pipeline you'd write with standard GStreamer can now run almost entirely on dedicated accelerators, giving you real-time throughput with minimal CPU load.
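
To make the idea of a pipeline string concrete, here is a minimal sketch in plain Python that assembles one from a list of element descriptions. The `build_pipeline` helper is hypothetical (it is not part of GStreamer or the IM SDK); it only joins the pieces with `!`, the link operator used in GStreamer pipeline descriptions. The element names are the ones used in the examples on this page.

```python
def build_pipeline(elements):
    """Join GStreamer element descriptions into one pipeline description string."""
    cleaned = [e.strip() for e in elements if e.strip()]
    if not cleaned:
        raise ValueError("a pipeline needs at least one element")
    return " ! ".join(cleaned)

# Hypothetical example: webcam -> GPU crop/resize -> appsink
pipeline = build_pipeline([
    "v4l2src device=/dev/video2",
    "video/x-raw,width=1920,height=1080",
    'qtivtransform crop="<420, 0, 1080, 1080>"',
    "video/x-raw,format=RGB,width=224,height=224",
    "appsink name=frame",
])
print(pipeline)
```

Writing the pipeline as a list makes it easy to swap a CPU element for its IM SDK counterpart without touching the rest of the chain.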

## Setting up GStreamer and the IM SDK

Alright, let's go build some applications using the IM SDK.

1. Install GStreamer, the IM SDK and some extra dependencies we'll need in this example. Open the terminal on your development board, or an SSH session to your development board, and run:

   ```shell theme={null}
   if [ ! -f /etc/apt/sources.list.d/ubuntu-qcom-iot-ubuntu-qcom-ppa-noble.list ]; then
       sudo apt-add-repository -y ppa:ubuntu-qcom-iot/qcom-ppa
   fi

   # Install GStreamer / IM SDK
   sudo apt update
   sudo apt install -y gstreamer1.0-tools gstreamer1.0-plugins-good gstreamer1.0-plugins-base gstreamer1.0-plugins-base-apps gstreamer1.0-plugins-qcom-good gstreamer1.0-qcom-sample-apps

   # Install Python bindings for GStreamer, and some build dependencies
   sudo apt install -y v4l-utils libcairo2-dev pkg-config python3-dev libgirepository1.0-dev gir1.2-gstreamer-1.0
   ```
2. Get the [Python examples](../python-examples/), extract them, create a venv, and install their dependencies:

   ```shell theme={null}
   # Download python_examples.zip from the examples page, then extract it.
   sudo apt install -y unzip
   mkdir -p ~/imsdk-python-examples
   cd ~/imsdk-python-examples
   unzip ~/Downloads/python_examples.zip

   # Create a new venv
   python3 -m venv .venv --system-site-packages
   source .venv/bin/activate

   # Install Python dependencies
   pip3 install -r requirements.txt
   ```
3. You'll need a camera: either a built-in one (like on the RB3 Gen 2 Vision Kit) or a USB webcam.
   * If you want to use a USB webcam:
     1. Find out the device ID:

        ```shell theme={null}
        v4l2-ctl --list-devices
        # msm_vidc_media (platform:aa00000.video-codec):
        #         /dev/media0
        #
        # msm_vidc_decoder (platform:msm_vidc_bus):
        #         /dev/video32
        #         /dev/video33
        #
        # C922 Pro Stream Webcam (usb-0000:01:00.0-2):
        #         /dev/video2     <-- So /dev/video2
        #         /dev/video3
        #         /dev/media3
        ```
     2. Set the environment variable (we'll use this in our examples):

        ```text theme={null}
        export IMSDK_VIDEO_SOURCE="v4l2src device=/dev/video2"
        ```
   * If you're on the RB3 Gen 2 Vision Kit, and want to use the built-in camera:

     ```text theme={null}
     export IMSDK_VIDEO_SOURCE="qtiqmmfsrc name=camsrc camera=0"
     ```
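
If you would rather find the USB webcam programmatically, the sketch below parses the text printed by `v4l2-ctl --list-devices`. It is an illustration only: the helper names are hypothetical, the sample text mirrors the listing above, and picking the first `/dev/video*` node of the first USB device is an assumption that matches this example but may not hold for every camera.

```python
def parse_v4l2_devices(listing):
    """Map each device header from `v4l2-ctl --list-devices` to its /dev nodes."""
    devices = {}
    current = None
    for line in listing.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            # Indented lines are device nodes that belong to the last header
            if current is not None:
                devices[current].append(line.split()[0])
        else:
            # Header line, e.g. "C922 Pro Stream Webcam (usb-0000:01:00.0-2):"
            current = line.rstrip().rstrip(":")
            devices[current] = []
    return devices

def first_usb_video_node(devices):
    """Return the first /dev/video* node of the first USB device, or None."""
    for name, nodes in devices.items():
        if "(usb-" in name:
            for node in nodes:
                if node.startswith("/dev/video"):
                    return node
    return None

# Sample listing (same devices as in the output shown above)
SAMPLE = """\
msm_vidc_media (platform:aa00000.video-codec):
\t/dev/media0

msm_vidc_decoder (platform:msm_vidc_bus):
\t/dev/video32
\t/dev/video33

C922 Pro Stream Webcam (usb-0000:01:00.0-2):
\t/dev/video2
\t/dev/video3
\t/dev/media3
"""

node = first_usb_video_node(parse_v4l2_devices(SAMPLE))
print(f'export IMSDK_VIDEO_SOURCE="v4l2src device={node}"')
```

On a real board you could feed the function the output of `subprocess.run(["v4l2-ctl", "--list-devices"], capture_output=True, text=True).stdout` instead of the sample text.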

## Example 1: Resizing and cropping on GPU vs. CPU

Let's show how much faster working on the GPU can be compared to the CPU. If you have a neural network that expects a 224x224 RGB input, you'll need to preprocess your data: first, grab the frame from the webcam (e.g. at a native resolution of 1920x1080), then crop it to a 1/1 aspect ratio (e.g. crop to 1080x1080), then resize it to the desired resolution (224x224), and finally create a NumPy array from the pixels.
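
The crop arithmetic in this preprocessing step can be sketched in a few lines. The `center_square_crop` helper below is hypothetical and only computes the rectangle; the `(x, y, width, height)` ordering is chosen to match the crop syntax used by `qtivtransform` in the pipelines on this page.

```python
def center_square_crop(width, height):
    """Return (x, y, w, h) of the largest centered square inside a width x height frame."""
    side = min(width, height)
    x = (width - side) // 2
    y = (height - side) // 2
    return x, y, side, side

x, y, w, h = center_square_crop(1920, 1080)
print(x, y, w, h)  # 420 0 1080 1080

# Format the rectangle the way the crop property is written in the pipelines on this page
crop_property = f'crop="<{x}, {y}, {w}, {h}>"'
print(crop_property)
```

For a 1920x1080 frame, the horizontal offset is (1920 - 1080) / 2 = 420, so 420 pixels are dropped on each side before the square is scaled down to 224x224.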

1. Create a new file `ex1.py`, and add:

   ```python theme={null}
   from gst_helper import gst_grouped_frames, atomic_save_image, timing_marks_to_str
   import time, argparse

   parser = argparse.ArgumentParser(description='GStreamer -> Python RGB frames')
   parser.add_argument('--video-source', type=str, required=True, help='GStreamer video source (e.g. "v4l2src device=/dev/video2" or "qtiqmmfsrc name=camsrc camera=0")')
   args, unknown = parser.parse_known_args()

   PIPELINE = (
       # Video source
       f"{args.video_source} ! "
       # Properties for the video source
       "video/x-raw,width=1920,height=1080 ! "
       # An identity element so we can track when a new frame is ready (so we can calc. processing time)
       "identity name=frame_ready_webcam silent=false ! "
       # Crop to square
       "videoconvert ! aspectratiocrop aspect-ratio=1/1 ! "
       # Scale to 224x224 and RGB
       "videoscale ! video/x-raw,format=RGB,width=224,height=224 ! "
       # Event when the crop/scale are done
       "identity name=transform_done silent=false ! "
       # Send out the resulting frame to an appsink (where we can pick it up from Python)
       "queue max-size-buffers=2 leaky=downstream ! "
       "appsink name=frame drop=true sync=false max-buffers=1 emit-signals=true"
   )

   for frames_by_sink, marks in gst_grouped_frames(PIPELINE):
       print(f"Frame ready")
       print('    Data:', end='')
       for key in list(frames_by_sink):
           print(f' name={key} {frames_by_sink[key].shape}', end='')
       print('')
       print('    Timings:', timing_marks_to_str(marks))

       # Save image to disk (frames_by_sink maps each appsink name to its frame)
       frame = frames_by_sink['frame']
       atomic_save_image(frame=frame, path='out/gstreamer.png')
   ```
2. Let's launch the Python script. This pipeline runs on the CPU (using vanilla GStreamer components):

   ```shell theme={null}
   python3 ex1.py --video-source "$IMSDK_VIDEO_SOURCE"

   # Frame ready
   #     Data: name=frame (224, 224, 3)
   #     Timings: frame_ready_webcam->transform_done: 22.16ms, transform_done->pipeline_finished: 2.14ms (total 24.31ms)
   # Frame ready
   #     Data: name=frame (224, 224, 3)
   #     Timings: frame_ready_webcam->transform_done: 22.21ms, transform_done->pipeline_finished: 1.25ms (total 23.46ms)
   ```

   Here you can see that the resize/crop takes about 22ms (measured on an IQ9 with a USB camera).
3. Now let's make this run on the GPU instead... Replace:

   ```python theme={null}
       # Crop to square
       "videoconvert ! aspectratiocrop aspect-ratio=1/1 ! "
       # Scale to 224x224 and RGB
       "videoscale ! video/x-raw,format=RGB,width=224,height=224 ! "
   ```

   With:

   ```python theme={null}
       # Crop (square), the crop syntax is ('<X, Y, WIDTH, HEIGHT >').
       # So here we use 1920x1080 input, then center crop to 1080x1080 ((1920-1080)/2 = 420 = x crop)
       f'qtivtransform crop="<420, 0, 1080, 1080>" ! '
       # then resize to 224x224
       "video/x-raw,format=RGB,width=224,height=224 ! "
   ```

   Here is the complete file `ex1_imsdk.py`:

   ```python theme={null}
   from gst_helper import gst_grouped_frames, atomic_save_image, timing_marks_to_str
   import time, argparse

   parser = argparse.ArgumentParser(description='GStreamer -> Python RGB frames')
   parser.add_argument('--video-source', type=str, required=True, help='GStreamer video source (e.g. "v4l2src device=/dev/video2" or "qtiqmmfsrc name=camsrc camera=0")')
   args, unknown = parser.parse_known_args()

   PIPELINE = (
       # Video source
       f"{args.video_source} ! "
       # Properties for the video source
       "video/x-raw,width=1920,height=1080 ! "
       # An identity element so we can track when a new frame is ready (so we can calc. processing time)
       "identity name=frame_ready_webcam silent=false ! "
       # Crop (square), the crop syntax is ('<X, Y, WIDTH, HEIGHT >').
       # So here we use 1920x1080 input, then center crop to 1080x1080 ((1920-1080)/2 = 420 = x crop)
       f'qtivtransform crop="<420, 0, 1080, 1080>" ! '
       # then resize to 224x224
       "video/x-raw,format=RGB,width=224,height=224 ! "
       # Event when the crop/scale are done
       "identity name=transform_done silent=false ! "
       # Send out the resulting frame to an appsink (where we can pick it up from Python)
       "queue max-size-buffers=2 leaky=downstream ! "
       "appsink name=frame drop=true sync=false max-buffers=1 emit-signals=true"
   )

   for frames_by_sink, marks in gst_grouped_frames(PIPELINE):
       print(f"Frame ready")
       print('    Data:', end='')
       for key in list(frames_by_sink):
           print(f' name={key} {frames_by_sink[key].shape}', end='')
       print('')
       print('    Timings:', timing_marks_to_str(marks))

       # Save image to disk (frames_by_sink maps each appsink name to its frame)
       frame = frames_by_sink['frame']
       atomic_save_image(frame=frame, path='out/gstreamer.png')
   ```
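4. Run this again (if you saved the changes as `ex1_imsdk.py` rather than editing `ex1.py` in place, run that file instead):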

   ```text theme={null}
   python3 ex1.py --video-source "$IMSDK_VIDEO_SOURCE"

   # Frame ready
   #     Data: name=frame (224, 224, 3)
   #     Timings: frame_ready_webcam->transform_done: 6.55ms, transform_done->pipeline_finished: 0.78ms (total 7.33ms)
   # Frame ready
   #     Data: name=frame (224, 224, 3)
   #     Timings: frame_ready_webcam->transform_done: 6.60ms, transform_done->pipeline_finished: 0.78ms (total 7.38ms)
   ```

   🚀  You've now sped up the crop/resize operation from \~22ms to \~6ms, with just two lines of code!
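
To compare the two runs without reading the numbers by eye, you can parse the `Timings:` lines. This is a small sketch, not part of `gst_helper`: the `parse_timings` function is hypothetical and assumes the `step: 12.34ms` format shown in the outputs above.

```python
import re

def parse_timings(line):
    """Parse 'a->b: 22.16ms, b->c: 2.14ms (total 24.31ms)' into {step: milliseconds}."""
    return {step: float(ms) for step, ms in re.findall(r"([\w>-]+): ([\d.]+)ms", line)}

# First timing line of each run, copied from the outputs above
cpu = parse_timings("frame_ready_webcam->transform_done: 22.16ms, transform_done->pipeline_finished: 2.14ms (total 24.31ms)")
gpu = parse_timings("frame_ready_webcam->transform_done: 6.55ms, transform_done->pipeline_finished: 0.78ms (total 7.33ms)")

step = "frame_ready_webcam->transform_done"
speedup = cpu[step] / gpu[step]
print(f"{step}: {cpu[step]:.2f}ms -> {gpu[step]:.2f}ms ({speedup:.1f}x faster)")
```

The `(total ...)` suffix is skipped on purpose, because it has no `name:` prefix and therefore does not match the pattern.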

## Example 2: Tee'ing streams and multiple outputs

In the pipeline above you've seen a few elements that will be relevant when interacting with your own code:

* **Identity** elements (e.g. `identity name=frame_ready_webcam silent=false`). These can be used to debug timing in a pipeline. The timestamp when they're emitted is saved, and then returned at the end of the pipeline in the `marks` element (k/v pair, key is the identity name, value is the timestamp).
* **Appsink** elements (e.g. `appsink name=frame`). These are used to send data from a GStreamer pipeline to your application. Here the element *before* the appsink is `video/x-raw,format=RGB,width=224,height=224`, so a 224x224 RGB array is sent to Python. You receive these in the `frames_by_sink` element (k/v pair, key is the appsink name, value is the data).
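
As an illustration of how the `marks` key/value pairs can be turned into per-step durations, here is a minimal sketch. It assumes that `marks` is an insertion-ordered dict of identity name to timestamp in seconds; the real `timing_marks_to_str` in `gst_helper` may work differently, and the timestamps below are made up.

```python
def marks_to_durations(marks):
    """Turn ordered {name: timestamp_seconds} marks into [(step, milliseconds)] pairs."""
    names = list(marks)
    durations = []
    for prev, cur in zip(names, names[1:]):
        # Each duration is the gap between two consecutive marks
        durations.append((f"{prev}->{cur}", (marks[cur] - marks[prev]) * 1000.0))
    return durations

# Made-up timestamps (seconds), for illustration only
marks = {
    "frame_ready_webcam": 10.0000,
    "transform_done": 10.0065,
    "pipeline_finished": 10.0073,
}
for step, ms in marks_to_durations(marks):
    print(f"{step}: {ms:.2f}ms")
```

Because the durations are computed between consecutive marks, adding another `identity` element to the pipeline would simply add one more step to the list.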

You can have multiple appsinks per pipeline. For example, you might want to grab the original 1920x1080 image as well. In that case you can split the pipeline in two right after `identity name=frame_ready_webcam`, sending one branch to a new appsink and the other branch through the resize/crop steps.

1. Create a new file `ex2.py` and add:

   ```python theme={null}
   from gst_helper import gst_grouped_frames, atomic_save_image, timing_marks_to_str
   import time, argparse

   parser = argparse.ArgumentParser(description='GStreamer -> Python RGB frames')
   parser.add_argument('--video-source', type=str, required=True, help='GStreamer video source (e.g. "v4l2src device=/dev/video2" or "qtiqmmfsrc name=camsrc camera=0")')
   args, unknown = parser.parse_known_args()

   PIPELINE = (
       # Video source
       f"{args.video_source} ! "
       # Properties for the video source
       "video/x-raw,width=1920,height=1080 ! "
       # An identity element so we can track when a new frame is ready (so we can calc. processing time)
       "identity name=frame_ready_webcam silent=false ! "

       # Split the stream
       "tee name=t "

       # Branch A) convert to RGB and send to original appsink
           "t. ! queue max-size-buffers=1 leaky=downstream ! "
           "qtivtransform ! video/x-raw,format=RGB ! "
           "appsink name=original drop=true sync=false max-buffers=1 emit-signals=true "

       # Branch B) resize/crop to 224x224 -> send to another appsink
           "t. ! queue max-size-buffers=1 leaky=downstream ! "
           # Crop (square), the crop syntax is ('<X, Y, WIDTH, HEIGHT >').
           # So here we use 1920x1080 input, then center crop to 1080x1080 ((1920-1080)/2 = 420 = x crop)
           f'qtivtransform crop="<420, 0, 1080, 1080>" ! '
           # then resize to 224x224
           "video/x-raw,format=RGB,width=224,height=224 ! "
           # Event when the crop/scale are done
           "identity name=transform_done silent=false ! "
           # Send out the resulting frame to an appsink (where we can pick it up from Python)
           "queue max-size-buffers=2 leaky=downstream ! "
           "appsink name=frame drop=true sync=false max-buffers=1 emit-signals=true "
   )

   for frames_by_sink, marks in gst_grouped_frames(PIPELINE):
       print(f"Frame ready")
       print('    Data:', end='')
       for key in list(frames_by_sink):
           print(f' name={key} {frames_by_sink[key].shape}', end='')
       print('')
       print('    Timings:', timing_marks_to_str(marks))

       # Save image to disk
       frame = frames_by_sink['frame']
       atomic_save_image(frame=frame, path='out/imsdk.png')
       original = frames_by_sink['original']
       atomic_save_image(frame=original, path='out/imsdk_original.png')
   ```
2. Run the script:

   ```text theme={null}
   python3 ex2.py --video-source "$IMSDK_VIDEO_SOURCE"

   # Frame ready
   #      Data: name=frame (224, 224, 3) name=original (1080, 1920, 3)
   #      Timings: frame_ready_webcam->transform_done: 5.42ms, transform_done->pipeline_finished: 4.22ms (total 9.64ms)
   # Frame ready
   #      Data: name=frame (224, 224, 3) name=original (1080, 1920, 3)
   #      Timings: frame_ready_webcam->transform_done: 5.51ms, transform_done->pipeline_finished: 4.41ms (total 9.92ms)
   ```

   (The `out/` directory has the last processed frames in both original and resized resolutions)

Alright! That gives you *two* outputs from a single pipeline. Now you know how to construct more complex applications in a single pipeline.
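
When a pipeline grows to several branches, it helps to know which keys to expect in `frames_by_sink`. The sketch below pulls the appsink names out of a pipeline string with a regular expression. The `appsink_names` helper is hypothetical, and the shortened pipeline is only an example in the style of `ex2.py`.

```python
import re

def appsink_names(pipeline):
    """Return the names of all named appsink elements in a pipeline description."""
    # Look for 'name=' among the properties of each appsink, stopping at the next '!'
    return re.findall(r"\bappsink\b[^!]*?\bname=([\w-]+)", pipeline)

# Shortened two-branch pipeline, for illustration only
PIPELINE = (
    "v4l2src device=/dev/video2 ! video/x-raw,width=1920,height=1080 ! "
    "tee name=t "
    "t. ! queue ! qtivtransform ! video/x-raw,format=RGB ! "
    "appsink name=original drop=true sync=false max-buffers=1 emit-signals=true "
    "t. ! queue ! "
    'qtivtransform crop="<420, 0, 1080, 1080>" ! '
    "video/x-raw,format=RGB,width=224,height=224 ! "
    "appsink name=frame drop=true sync=false max-buffers=1 emit-signals=true "
)

print(appsink_names(PIPELINE))
```

Each name returned here corresponds to one key in `frames_by_sink`, so a missing key usually points at a typo in the pipeline string.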

## Example 3: Run a neural network

Now that we have images streaming from the webcam in the correct resolution, let's add a neural network to the mix.

<Danger>
  The following workflows only work on Ubuntu Server edition, not on Ubuntu Desktop OS.
</Danger>

### 3.1: Neural network and compositing in Python

1. First we'll do a "normal" implementation, where we take the resized frame from the IM SDK pipeline, and then use [LiteRT](/ai-workflows/lite-rt) to run the model (on the NPU). Afterwards we'll draw the top prediction on the image and write it to disk. Create a new file `ex3_from_python.py` and add:

   ```python theme={null}
   from gst_helper import gst_grouped_frames, atomic_save_pillow_image, timing_marks_to_str, download_file_if_needed, softmax
   import time, argparse, numpy as np
   from ai_edge_litert.interpreter import Interpreter, load_delegate
   from PIL import ImageDraw, Image

   parser = argparse.ArgumentParser(description='GStreamer -> SqueezeNet')
   parser.add_argument('--video-source', type=str, required=True, help='GStreamer video source (e.g. "v4l2src device=/dev/video2" or "qtiqmmfsrc name=camsrc camera=0")')
   args, unknown = parser.parse_known_args()

   MODEL_PATH = download_file_if_needed('models/squeezenet1_1-squeezenet-1.1-w8a8.tflite', 'https://cdn.edgeimpulse.com/qc-ai-docs/models/squeezenet1_1-squeezenet-1.1-w8a8.tflite')
   LABELS_PATH = download_file_if_needed('models/SqueezeNet-1.1_labels.txt', 'https://cdn.edgeimpulse.com/qc-ai-docs/models/SqueezeNet-1.1_labels.txt')

   # Parse labels
   with open(LABELS_PATH, 'r') as f:
       labels = [line for line in f.read().splitlines() if line.strip()]

   # Load TFLite model and allocate tensors, note: this is a 224x224 model with uint8 input!
   # If your models are different, then you'll need to update the pipeline below.
   interpreter = Interpreter(
       model_path=MODEL_PATH,
       experimental_delegates=[load_delegate("libQnnTFLiteDelegate.so", options={"backend_type": "htp"})]     # Use NPU
   )
   interpreter.allocate_tensors()
   input_details = interpreter.get_input_details()
   output_details = interpreter.get_output_details()

   PIPELINE = (
       # Video source
       f"{args.video_source} ! "
       # Properties for the video source
       "video/x-raw,width=1920,height=1080 ! "
       # An identity element so we can track when a new frame is ready (so we can calc. processing time)
       "identity name=frame_ready_webcam silent=false ! "
       # Crop (square), the crop syntax is ('<X, Y, WIDTH, HEIGHT >').
       # So here we use 1920x1080 input, then center crop to 1080x1080 ((1920-1080)/2 = 420 = x crop)
       f'qtivtransform crop="<420, 0, 1080, 1080>" ! '
       # then resize to 224x224
       "video/x-raw,format=RGB,width=224,height=224 ! "
       # Event when the crop/scale are done
       "identity name=transform_done silent=false ! "
       # Send out the resulting frame to an appsink (where we can pick it up from Python)
       "queue max-size-buffers=2 leaky=downstream ! "
       "appsink name=frame drop=true sync=false max-buffers=1 emit-signals=true "
   )

   for frames_by_sink, marks in gst_grouped_frames(PIPELINE):
       print(f"Frame ready")
       print('    Data:', end='')
       for key in list(frames_by_sink):
           print(f' name={key} {frames_by_sink[key].shape}', end='')
       print('')

       # Begin inference timer
       inference_start = time.perf_counter()

       # Set tensor with the image received in "frames_by_sink['frame']", add batch dim, and run inference
       interpreter.set_tensor(input_details[0]['index'], frames_by_sink['frame'].reshape((1, 224, 224, 3)))
       interpreter.invoke()

       # Get prediction (dequantized)
       q_output = interpreter.get_tensor(output_details[0]['index'])
       scale, zero_point = output_details[0]['quantization']
       f_output = (q_output.astype(np.float32) - zero_point) * scale

       # Image classification models in AI Hub miss a Softmax() layer at the end of the model, so add it manually
       scores = softmax(f_output[0])

       # End inference timer
       inference_end = time.perf_counter()

       # Add an extra mark, so we have timing info for the complete pipeline
       marks['inference_done'] = list(marks.items())[-1][1] + (inference_end - inference_start)

       # Print top-5 predictions
       top_k = scores.argsort()[-5:][::-1]
       print(f"    Top-5 predictions:")
       for i in top_k:
           print(f"        Class {labels[i]}: score={scores[i]}")

       # Image composition timer
       image_composition_start = time.perf_counter()

       # Add the top prediction to the image, and save image to disk (for debug purposes)
       frame = frames_by_sink['frame']
       img = Image.fromarray(frame)
       img_draw = ImageDraw.Draw(img)
       img_draw.text((10, 10), f"{labels[top_k[0]]} ({scores[top_k[0]]:.2f})", fill="black")
       atomic_save_pillow_image(img=img, path='out/imsdk_with_prediction.png')

       image_composition_end = time.perf_counter()

       # Add an extra mark, so we have timing info for the complete pipeline
       marks['image_composition_end'] = list(marks.items())[-1][1] + (image_composition_end - image_composition_start)

       print('    Timings:', timing_marks_to_str(marks))
   ```
2. Now run this application:

   ```text theme={null}
   # We use '| grep -v "<W>"' to filter out some warnings - you can omit it if you want.
   python3 ex3_from_python.py --video-source "$IMSDK_VIDEO_SOURCE" | grep -v "<W>"

   # Frame ready
   #     Data: name=frame (224, 224, 3)
   # Top-5 predictions:
   #     Class laptop: score=0.4951600134372711
   #     Class notebook: score=0.33943524956703186
   #     Class computer keyboard: score=0.07495530694723129
   #     Class space bar: score=0.062059469521045685
   #     Class typewriter keyboard: score=0.007778045255690813
   # Timings: frame_ready_webcam->transform_done: 6.81ms, transform_done->pipeline_finished: 0.77ms, pipeline_finished->inference_done: 1.19ms, inference_done->image_composition_end: 37.85ms (total 46.61ms)
   ```

   <Frame caption="Image classification model with an overlay">
     <img src="https://mintlify.s3.us-west-1.amazonaws.com/qualcomm-prod/images/ai-workflows/imsdk_with_prediction.png" alt="" />
   </Frame>

   Not bad at all, but let's see if we can do better...
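
The post-processing in the script above (dequantize, softmax, top-k) can be followed step by step in plain Python. This is a sketch with made-up numbers: the quantized values, `scale`, and `zero_point` are illustrative and not taken from the SqueezeNet model, and the `softmax` here is a standard implementation that may differ from the one in `gst_helper`.

```python
import math

def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to floats: (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_values]

def softmax(logits):
    """Normalize logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Made-up quantized output for a 4-class model
q_output = [130, 128, 120, 140]
logits = dequantize(q_output, scale=0.5, zero_point=128)
scores = softmax(logits)
for i in top_k(scores, 2):
    print(f"class {i}: score={scores[i]:.4f}")
```

Softmax does not change the ranking of the classes; it only rescales the logits so the scores can be read as probabilities.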

### 3.2: Running the neural network with IM SDK

Let's move the neural network inference to the IM SDK. You do this through three plugins:

* `qtimlvconverter` - to convert the frame into an input tensor.
* `qtimltflite` - to run a neural network (in LiteRT format). If you send these results over an appsink you'll get the exact same tensor back as earlier (you just didn't need to hit the CPU to invoke the inference engine).
* An element like `qtimlpostprocess` to interpret the output. Here the plugin is used for image classification use cases (like the SqueezeNet model we use) with a `(1, n)` output shape. It outputs either text (with the predictions) or an overlay (to draw onto the original image).

<Note>
  **Note:** This element has a particular label format (see below).
</Note>

1. Create a new file `ex3_nn_imsdk.py` and add:

   ```python theme={null}
   from gst_helper import gst_grouped_frames, atomic_save_pillow_image, timing_marks_to_str, download_file_if_needed
   import argparse, numpy as np
   from PIL import Image

   parser = argparse.ArgumentParser(description='GStreamer -> SqueezeNet')
   parser.add_argument('--video-source', type=str, required=True, help='GStreamer video source (e.g. "v4l2src device=/dev/video2" or "qtiqmmfsrc name=camsrc camera=0")')
   args, unknown = parser.parse_known_args()

   MODEL_PATH = download_file_if_needed('models/squeezenet1_1-squeezenet-1.1-w8a8.tflite', 'https://cdn.edgeimpulse.com/qc-ai-docs/models/squeezenet1_1-squeezenet-1.1-w8a8.tflite')
   LABELS_PATH = download_file_if_needed('models/SqueezeNet-1.1_labels.txt', 'https://cdn.edgeimpulse.com/qc-ai-docs/models/SqueezeNet-1.1_labels.txt')

   # Parse labels
   with open(LABELS_PATH, 'r') as f:
       labels = [line for line in f.read().splitlines() if line.strip()]

   PIPELINE = (
       # Video source
       f"{args.video_source} ! "
       # Properties for the video source
       "video/x-raw,width=1920,height=1080 ! "
       # An identity element so we can track when a new frame is ready (so we can calc. processing time)
       "identity name=frame_ready_webcam silent=false ! "
       'qtivtransform ! '
       # NV12 for tightly packed buffer
       "video/x-raw,format=NV12 ! "
       # Mark after transform
       "identity name=transform_done silent=false ! "

       # turn into right format (UINT8 data type) and add batch dimension
       'qtimlvconverter ! neural-network/tensors,type=UINT8,dimensions=<<1,224,224,3>> ! '
       # Event when conversion is done
       "identity name=conversion_done silent=false ! "
       # run inference (using the QNN delegates to run on NPU)
       f'qtimltflite delegate=external external-delegate-path=libQnnTFLiteDelegate.so external-delegate-options="QNNExternalDelegate,backend_type=htp;" model="{MODEL_PATH}" ! '
       # Event when inference is done
       "identity name=inference_done silent=false ! "

       # Post-process (Mobilenet-style softmax).
       f'qtimlpostprocess name=postproc module=mobilenet-softmax '
       f'labels="{LABELS_PATH}" results=5 settings="{{\\"confidence\\": 10.0}}" ! '
       "text/x-raw,format=utf8 ! "
       # Send to application
       "queue max-size-buffers=2 leaky=downstream ! "
       'appsink name=qtimlpostprocess_text drop=true sync=false max-buffers=1 emit-signals=true '
   )

   for frames_by_sink, marks in gst_grouped_frames(PIPELINE):
       print("Frame ready")
       print(' Data:', end='')
       for key in list(frames_by_sink):
           print(f' name={key} {frames_by_sink[key].shape} ({frames_by_sink[key].dtype})', end='')
       print('')

       # Grab the qtimlpostprocess_text (utf8 text) with predictions from IM SDK
       cls_text = frames_by_sink['qtimlpostprocess_text'].tobytes().decode('utf-8')
       print(' qtimlpostprocess_text:', cls_text)
       print(' Timings:', timing_marks_to_str(marks))
   ```

**NV12:** We switched from `RGB` to `NV12` format here (after `qtivtransform`), as `qtimltflite` requires a tightly packed buffer - and the RGB output uses row-stride padding. These issues can be very hard to debug. Adding `GST_DEBUG=3` before your command (e.g. `GST_DEBUG=3 python3 ex3_nn_imsdk.py`) and feeding the pipeline and error into an LLM like ChatGPT can sometimes help you troubleshoot if needed.

**module=mobilenet-softmax:** This is used by `qtimlpostprocess` for classification models whose output is a FLOAT32 1×N logits vector. It applies Softmax to normalize logits into probabilities.

2. Now run this application:

   ```text theme={null}
   # We use '| grep -v "<W>"' to filter out some warnings - you can omit it if you want.
   python3 ex3_nn_imsdk.py --video-source "$IMSDK_VIDEO_SOURCE" | grep -v "<W>"

   # Frame ready
   #     Data: name=qtimlpostprocess_text (892,) (uint8)
   #     qtimlpostprocess_text: { (structure)"ImageClassification\,\ labels\=\(structure\)\<\ \"laptop\\\,\\\ id\\\=\\\(uint\\\)256\\\,\\\ confidence\\\=\\\(double\\\)45.182029724121094\\\,\\\ color\\\=\\\(uint\\\)3211364863\\\;\"\,\ \"notebook\\\,\\\ id\\\=\\\(uint\\\)256\\\,\\\ confidence\\\=\\\(double\\\)17.578998565673828\\\,\\\ color\\\=\\\(uint\\\)2015954943\\\;\"\,\ \"computer.keyboard\\\,\\\ id\\\=\\\(uint\\\)256\\\,\\\ confidence\\\=\\\(double\\\)2.6610369682312012\\\,\\\ color\\\=\\\(uint\\\)2861764863\\\;\"\,\ \"monitor\\\,\\\ id\\\=\\\(uint\\\)256\\\,\\\ confidence\\\=\\\(double\\\)2.2032134532928467\\\,\\\ color\\\=\\\(uint\\\)1091632127\\\;\"\,\ \"desktop.computer\\\,\\\ id\\\=\\\(uint\\\)256\\\,\\\ confidence\\\=\\\(double\\\)2.2032134532928467\\\,\\\ color\\\=\\\(uint\\\)3377702399\\\;\"\ \>\,\ timestamp\=\(guint64\)6025656009\,\ sequence-index\=\(uint\)1\,\ sequence-num-entries\=\(uint\)1\;" }
   #     Timings: frame_ready_webcam->transform_done: 7.29ms, transform_done->conversion_done: 0.83ms, conversion_done->inference_done: 1.25ms, inference_done->postproc_done: 0.45ms (total 9.82ms)
   ```

   OK! The model now runs on the NPU *inside* the IM SDK pipeline. If you'd rather have the top 5 outputs (like we did in 3.1), you can tee the stream after the `qtimltflite` element and send the raw output tensor back to the application as well.
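
The `qtimlpostprocess_text` string is heavily escaped, so here is one rough way to pull the labels and confidences out of it. This is a sketch under assumptions: it infers the format from the sample output above (shortened here to two labels), strips every backslash to undo the nested escaping, and would break on labels that contain quotes or commas. The `parse_classification_text` helper is hypothetical and not part of the IM SDK.

```python
import re

def parse_classification_text(text):
    """Extract (label, confidence) pairs from qtimlpostprocess text output."""
    # Assumption: dropping all backslashes is enough to undo the nested escaping
    unescaped = text.replace("\\", "")
    pattern = r'"([^",]+), id=\(uint\)\d+, confidence=\(double\)([0-9.]+)'
    return [(label, float(conf)) for label, conf in re.findall(pattern, unescaped)]

# Shortened copy of the sample output above (first two labels only)
SAMPLE = r'''{ (structure)"ImageClassification\,\ labels\=\(structure\)\<\ \"laptop\\\,\\\ id\\\=\\\(uint\\\)256\\\,\\\ confidence\\\=\\\(double\\\)45.182029724121094\\\,\\\ color\\\=\\\(uint\\\)3211364863\\\;\"\,\ \"notebook\\\,\\\ id\\\=\\\(uint\\\)256\\\,\\\ confidence\\\=\\\(double\\\)17.578998565673828\\\,\\\ color\\\=\\\(uint\\\)2015954943\\\;\"\ \>\,\ timestamp\=\(guint64\)6025656009\,\ sequence-index\=\(uint\)1\,\ sequence-num-entries\=\(uint\)1\;" }'''

for label, confidence in parse_classification_text(SAMPLE):
    print(f"{label}: {confidence:.2f}")
```

In the script above you would pass `cls_text` to this function. For production code, parsing the string with GStreamer's own structure APIs is likely more robust than a regular expression.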

### 3.3: Overlays

To mimic the output in 3.1 we also want to draw an overlay. Let's first demonstrate that with a static overlay image.

1. Download a semi-transparent image ([source](https://commons.wikimedia.org/wiki/File:PNG_transparency_demonstration_2.png)):

   ```shell theme={null}
   mkdir -p images
   wget -O images/imsdk-transparent-static.png https://cdn.edgeimpulse.com/qc-ai-docs/example-images/imsdk-transparent-static.png
   ```
2. Create a new file `ex3_overlay.py` and add:

   ```python theme={null}
   from gst_helper import gst_grouped_frames, atomic_save_image, timing_marks_to_str, download_file_if_needed, softmax
   import time, argparse, numpy as np
   from ai_edge_litert.interpreter import Interpreter, load_delegate
   from PIL import ImageDraw, Image

   parser = argparse.ArgumentParser(description='GStreamer -> SqueezeNet')
   parser.add_argument('--video-source', type=str, required=True, help='GStreamer video source (e.g. "v4l2src device=/dev/video2" or "qtiqmmfsrc name=camsrc camera=0")')
   args, unknown = parser.parse_known_args()

   if args.video_source.strip() == '':
       raise Exception('--video-source is empty, did you not set the IMSDK_VIDEO_SOURCE env variable? E.g.:\n' +
       '    export IMSDK_VIDEO_SOURCE="v4l2src device=/dev/video2"')

   # Source: https://commons.wikimedia.org/wiki/File:Arrow_png_image.png
   OVERLAY_IMAGE = download_file_if_needed('images/imsdk-transparent-static.png', 'https://cdn.edgeimpulse.com/qc-ai-docs/example-images/imsdk-transparent-static.png')
   OVERLAY_WIDTH = 128
   OVERLAY_HEIGHT = 96

   PIPELINE = (
       # Part 1: Create a qtivcomposer with two sinks (we'll write webcam to sink 0, overlay to sink 1)
       "qtivcomposer name=comp sink_0::zorder=0 "
           # Sink 1 (the overlay) will be at x=10, y=10; and sized 128x96
           f"sink_1::zorder=1 sink_1::alpha=1.0 sink_1::position=<10,10> sink_1::dimensions=<{OVERLAY_WIDTH},{OVERLAY_HEIGHT}> ! "
       "videoconvert ! "
       "video/x-raw,format=RGBA,width=224,height=224 ! "
       # Write frames to appsink
       "appsink name=overlay_raw drop=true sync=false max-buffers=1 emit-signals=true "

       # Part 2: Grab image from webcam and write to the composer
           # Video source
           f"{args.video_source} ! "
           # Properties for the video source
           "video/x-raw,width=1920,height=1080 ! "
           # An identity element so we can track when a new frame is ready (so we can calc. processing time)
           "identity name=frame_ready_webcam silent=false ! "
           # Crop (square), the crop syntax is ('<X, Y, WIDTH, HEIGHT >').
           # So here we use 1920x1080 input, then center crop to 1080x1080 ((1920-1080)/2 = 420 = x crop)
           f'qtivtransform crop="<420, 0, 1080, 1080>" ! '
           # then resize to 224x224
           "video/x-raw,width=224,height=224,format=NV12 ! "
           # Event when the crop/scale are done
           "identity name=transform_done silent=false ! "
           # Write to sink 0 on the composer
           "comp.sink_0 "

       # Part 3: Load overlay from disk and write to composer (sink 1)
           # Image (statically from disk)
           f'filesrc location="{OVERLAY_IMAGE}" ! '
           # Decode PNG
           "pngdec ! "
           # Turn into a video (scaled to 128x96, RGBA format so we keep transparency, requires a framerate)
           "imagefreeze ! "
           "videoscale ! "
           "videoconvert ! "
           f"video/x-raw,format=RGBA,width={OVERLAY_WIDTH},height={OVERLAY_HEIGHT},framerate=30/1 ! "
           # Write to sink 1 on the composer
           "comp.sink_1 "
   )

   for frames_by_sink, marks in gst_grouped_frames(PIPELINE):
       print(f"Frame ready")
       print('    Data:', end='')
       for key in list(frames_by_sink):
           print(f' name={key} {frames_by_sink[key].shape} ({frames_by_sink[key].dtype})', end='')
       print('')

       # Save image to disk
       save_image_start = time.perf_counter()
       frame = frames_by_sink['overlay_raw']
       atomic_save_image(frame=frame, path='out/webcam_with_overlay.png')
       save_image_end = time.perf_counter()

       # Add an extra mark, so we have timing info for the complete pipeline
       marks['save_image_end'] = list(marks.items())[-1][1] + (save_image_end - save_image_start)

       print('    Timings:', timing_marks_to_str(marks))
   ```
3. Run this application:

   ```text theme={null}
   # We use '| grep -v "<W>"' to filter out some warnings - you can omit it if you want.
   python3 ex3_overlay.py --video-source "$IMSDK_VIDEO_SOURCE" | grep -v "<W>"

   # Frame ready
   #     Data: name=overlay_raw (224, 224, 4) (uint8)
   #     Timings: frame_ready_webcam->transform_done: 1.03ms, transform_done->pipeline_finished: 3.28ms, pipeline_finished->save_image_end: 31.28ms (total 35.59ms)
   ```

   <Frame caption="Static overlay onto webcam image">
     <img src="https://mintlify.s3.us-west-1.amazonaws.com/qualcomm-prod/images/ai-workflows/imsdk-webcam_with_overlay.png" alt="" />
   </Frame>
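
The center-crop arithmetic in the pipeline comments ((1920-1080)/2 = 420) works the same way for any input resolution. Below is a minimal sketch of that calculation. `center_square_crop` is a hypothetical helper name (it is not part of the IM SDK or `gst_helper`), and it assumes the `<X, Y, WIDTH, HEIGHT>` crop syntax used in the example above.

```python
def center_square_crop(width: int, height: int) -> str:
    """Return a crop string for the largest centered square in a frame."""
    side = min(width, height)      # square edge = shorter dimension
    x = (width - side) // 2        # horizontal offset to center the square
    y = (height - side) // 2       # vertical offset to center the square
    return f"<{x}, {y}, {side}, {side}>"


crop = center_square_crop(1920, 1080)
print(f'qtivtransform crop="{crop}" ! ')
# prints: qtivtransform crop="<420, 0, 1080, 1080>" ! 
```

You could use the returned string in place of the hard-coded crop value if your camera delivers a different resolution.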

### 3.4: Combining neural network with overlay

You've now seen how to run a neural network as part of an IM SDK pipeline, and how to draw overlays. Let's combine the two in a single pipeline that overlays the prediction onto the image, all without ever touching the CPU.

1. Create a new file `ex3_from_imsdk.py` and add:

   ```python theme={null}
    from gst_helper import gst_grouped_frames, timing_marks_to_str, download_file_if_needed
    import argparse, os

    parser = argparse.ArgumentParser(description='GStreamer -> SqueezeNet')
    parser.add_argument('--video-source', type=str, required=True, help='GStreamer video source (e.g. "v4l2src device=/dev/video2" or "qtiqmmfsrc name=camsrc camera=0")')
    args, unknown = parser.parse_known_args()

    if args.video_source.strip() == '':
        raise Exception('--video-source is empty, did you not set the IMSDK_VIDEO_SOURCE env variable? E.g.:\n' +
        '    export IMSDK_VIDEO_SOURCE="v4l2src device=/dev/video2"')

    MODEL_PATH = download_file_if_needed('models/squeezenet1_1-squeezenet-1.1-w8a8.tflite', 'https://cdn.edgeimpulse.com/qc-ai-docs/models/squeezenet1_1-squeezenet-1.1-w8a8.tflite')
    LABELS_PATH = download_file_if_needed('models/SqueezeNet-1.1_labels.txt', 'https://cdn.edgeimpulse.com/qc-ai-docs/models/SqueezeNet-1.1_labels.txt')

    PIPELINE = (
        # Part 1: Create a qtivcomposer with two sinks (we'll write webcam to sink 0, overlay to sink 1)
        "qtivcomposer name=comp sink_0::zorder=0 "
        "sink_1::zorder=1 sink_1::alpha=1.0 ! "
        "video/x-raw,format=NV12,width=1920,height=1080 ! "
        "v4l2h264enc capture-io-mode=4 output-io-mode=4 ! "
        "queue ! h264parse ! mp4mux ! "
        "filesink location=output/out.mp4 "

        # Part 2: Video source, written to composer sink 0 and teed to the NN path
        f"{args.video_source} ! "
        # Properties for the video source
        "video/x-raw,width=1920,height=1080 ! "
        # An identity element so we can track when a new frame is ready (so we can calc. processing time)
        "identity name=frame_ready_webcam silent=false ! "
        "qtivtransform ! "
        "video/x-raw,format=NV12 ! "
        "identity name=transform_done silent=false ! "
        "tee name=v "
        "v. ! queue max-size-buffers=1 leaky=downstream ! "
        "comp.sink_0 "

        # Part 3: NN path -> postprocess overlay -> comp.sink_1 + nn_overlay appsink
        "v. ! queue max-size-buffers=1 leaky=downstream ! "
        # (1) Input of qtimlvconverter
        "identity name=converter_in silent=false ! "
        "qtimlvconverter ! neural-network/tensors,type=UINT8,dimensions=<<1,224,224,3>> ! "
        # (2) Output of qtimlvconverter
        "identity name=converter_out silent=false ! "
        # qtimltflite (inference on HTP via QNN delegate)
        f'qtimltflite delegate=external external-delegate-path=libQnnTFLiteDelegate.so '
        f'external-delegate-options="QNNExternalDelegate,backend_type=htp;" model="{MODEL_PATH}" ! '
        # (3) Output of qtimltflite
        "identity name=inference_done silent=false ! "
        # qtimlpostprocess (mobilenet-softmax): dequant + softmax + overlay render
        f'qtimlpostprocess name=postproc module=mobilenet-softmax labels="{LABELS_PATH}" '
        'results=1 settings="{\\"confidence\\": 10.0}" ! '
        # (4) Output of qtimlpostprocess
        "identity name=postproc_done silent=false ! "
        # Overlay frame (BGRA) that feeds both composer and the NN appsink
        "video/x-raw,format=BGRA,width=224,height=224 ! "
        "tee name=nn_t "

        # Branch A (to composer)
        "nn_t. ! queue max-size-buffers=1 leaky=downstream ! "
        "comp.sink_1 "

        # Branch B (to appsink on the NN branch)
        "nn_t. ! queue max-size-buffers=1 leaky=downstream ! "
        "appsink name=nn_overlay drop=true sync=false max-buffers=1 emit-signals=true "
    )

    os.makedirs('output', exist_ok=True)

    for frames_by_sink, marks in gst_grouped_frames(PIPELINE):
        # Consume/inspect the NN-branch overlay (BGRA 224x224) so groups progress
        if 'nn_overlay' in frames_by_sink:
            nn_overlay = frames_by_sink['nn_overlay']  # ndarray (224, 224, 4), dtype=uint8
            print(f"[appsink:nn_overlay] frame {nn_overlay.shape} {nn_overlay.dtype}")

        # Print timing markers (now includes converter_* / inference_done / postproc_done)
        print('Timings:', timing_marks_to_str(marks))
   ```
2. Run this application:

   ```shell theme={null}
   # We use '| grep -v "<W>"' to filter out some warnings - you can omit it if you want.
   python3 ex3_from_imsdk.py --video-source "$IMSDK_VIDEO_SOURCE" | grep -v "<W>"

   # Frame ready
   #     Timings: frame_ready_webcam->transform_done: 6.42ms, transform_done->converter_in: 0.40ms, converter_in->converter_out: 1.09ms, converter_out->inference_done: 1.29ms, inference_done->postproc_done: 0.63ms (total 9.84ms)

   ```

   Great! This whole pipeline now runs in the IM SDK. You can find the output file in `output/out.mp4`.

   <Frame caption="Image classification model with an overlay rendered by IM SDK">
     <img src="https://mintlify.s3.us-west-1.amazonaws.com/qualcomm-prod/images/ai-workflows/imsdk-webcam-nn-overlay.png" alt="" />
   </Frame>
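
The timing line in the output above is a quick way to spot the slowest stage. As a rough sketch, the snippet below parses a line in that format and prints each stage's share of the summed time. `parse_timings` is a hypothetical helper (it is not part of `gst_helper`), and the regular expression assumes the `a->b: 1.23ms` format seen in the example output.

```python
import re

# Matches segments such as "converter_in->converter_out: 1.09ms"
STAGE_RE = re.compile(r"(\w+)->(\w+): ([\d.]+)ms")


def parse_timings(line: str) -> dict:
    """Map 'start->end' stage names to their duration in milliseconds."""
    return {f"{a}->{b}": float(ms) for a, b, ms in STAGE_RE.findall(line)}


# Sample line copied from the example output above
line = ("Timings: frame_ready_webcam->transform_done: 6.42ms, "
        "transform_done->converter_in: 0.40ms, "
        "converter_in->converter_out: 1.09ms, "
        "converter_out->inference_done: 1.29ms, "
        "inference_done->postproc_done: 0.63ms (total 9.84ms)")

stages = parse_timings(line)
total = sum(stages.values())

# Print stages from slowest to fastest with their share of the summed time
for name, ms in sorted(stages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<40} {ms:5.2f}ms  {100 * ms / total:4.1f}%")
```

For the sample line, `frame_ready_webcam->transform_done` is the largest entry, so that stage would be the first place to look when tuning this pipeline.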

## Troubleshooting

### Pipeline does not yield anything

If the pipeline produces no output, set the `GST_DEBUG=3` environment variable to get more detailed debug information:

```text theme={null}
GST_DEBUG=3 python3 ex1.py
```

### QMMF Recorder StartCamera Failed / Failed to Open Camera

If you see a QMMF error like the one below:

```text theme={null}
0:00:00.058915726  7329     0x1faf28a0 ERROR             qtiqmmfsrc qmmf_source_context.cc:1426:gst_qmmf_context_open: QMMF Recorder StartCamera Failed!
0:00:00.058955986  7329     0x1faf28a0 WARN              qtiqmmfsrc qmmf_source.c:1206:qmmfsrc_change_state:<camsrc> error: Failed to Open Camera!
```

You can release the camera by running:

```text theme={null}
sudo killall cam-server
```
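
If you launch the examples from a wrapper script, you could check the captured GStreamer log for this error and print the hint automatically. This is a minimal sketch: `camera_is_busy` is a hypothetical helper, and the marker strings are copied from the sample log above, so other IM SDK versions may word them differently.

```python
# Error fragments copied from the sample log above (assumed, may vary by version)
QMMF_BUSY_MARKERS = (
    "QMMF Recorder StartCamera Failed!",
    "Failed to Open Camera!",
)


def camera_is_busy(log_text: str) -> bool:
    """Return True if the log contains one of the QMMF camera-open errors."""
    return any(marker in log_text for marker in QMMF_BUSY_MARKERS)


sample_log = (
    "0:00:00.058915726  7329     0x1faf28a0 ERROR             qtiqmmfsrc "
    "qmmf_source_context.cc:1426:gst_qmmf_context_open: "
    "QMMF Recorder StartCamera Failed!\n"
)

if camera_is_busy(sample_log):
    print("Camera could not be opened; try: sudo killall cam-server")
```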
