# Troubleshoot performance issues

To address performance issues, you can use both basic and advanced troubleshooting methods.

## **Basic troubleshooting**

Basic troubleshooting involves fundamental techniques at the application level. It's useful when developing applications using the Qualcomm development kits for educational and academic purposes. Basic troubleshooting can be applied to devices with Qualcomm<sup>®</sup> Linux<sup>®</sup> that operate without requiring root access.

For more complex issues, see [Advanced troubleshooting](#advanced-troubleshooting).

### **Analyze user space and kernel traces**

Tools such as Function tracer (ftrace), Trace Compass, and LTTng are commonly used to analyze traces on Linux for performance issues.

| **Performance debug tool** |                                                           **Reference**                                                           |
| :------------------------: | :-------------------------------------------------------------------------------------------------------------------------------: |
|        Trace Compass       | [Trace Compass User Guide](https://archive.eclipse.org/tracecompass/doc/stable/org.eclipse.tracecompass.doc.user/User-Guide.html) |
|            LTTng           |                                        [LTTng Documentation](https://lttng.org/docs/v2.13/)                                       |

You can compile your application with `-llttng-ust` and `-g -finstrument-functions` to display the function call stack.

For example, run the following command for compilation:

```text theme={null}
aarch64-qcom-linux-g++ <cpp source file> -o <output file> -llttng-ust -g -finstrument-functions
```

The following GCC and G++ compilers are available on the device after enabling them through [Compile performance tools](./get-started-with-performance-tuning-and-optimization#compile-performance-tools):

* `aarch64-qcom-linux-gcc`
* `aarch64-qcom-linux-g++`

### **Capture LTTng-UST trace**

To capture a trace using LTTng, follow these steps:

1. To display a call stack of the application with `liblttng-ust-cyg-profile.so`, create a session named `my-session` with the following command:
   ```text theme={null}
   lttng create my-session --output=/tmp/my-trace
   ```
   The traces are available at `/tmp/my-trace`.
2. Run the commands in the following sequence to capture the traces:
   ```text theme={null}
   lttng enable-event -u -a
   ```
   ```text theme={null}
   lttng enable-event -k -a
   ```
   ```text theme={null}
   lttng start
   ```
3. Preload the `liblttng-ust-cyg-profile` library when running your program:
   ```text theme={null}
   LD_PRELOAD=/usr/lib/liblttng-ust-cyg-profile.so ./test_executable
   ```
4. After the program exits, stop tracing and destroy the session:
   ```text theme={null}
   lttng stop
   ```
   ```text theme={null}
   lttng destroy my-session
   ```

### **Load LTTng traces**

1. To load and visualize the LTTng traces in Trace Compass, use secure copy protocol (SCP) or a similar tool to transfer the trace from the target to the host. Specify the IP address of your target, and use the output path of the session as the source. Here is an example command:
   ```text theme={null}
   scp -r root@10.92.162.185:/tmp/my-trace <store trace path>
   ```
2. Load the LTTng kernel and UST traces with Trace Compass on the host machine. From the Trace Compass tool, use the **File** menu option to open a trace.

   **Note:** The screenshots are provided for reference. The directory structure shown in the screenshots may vary depending on the Trace Compass tool version.
   <div className="flex flex-col items-center gap-1">
     <img src="https://mintcdn.com/qualcomm-prod/rpHTx_a6zriKQll9/System/Performance/media/k2l-performance/fig-6-1-trace-compass.jpg?fit=max&auto=format&n=rpHTx_a6zriKQll9&q=85&s=b4ed667eec16949f5331301465d869fd" width="299" height="276" data-path="System/Performance/media/k2l-performance/fig-6-1-trace-compass.jpg" />
   </div>
3. To select a trace type, right-click on the trace, and choose **Select Trace Type** > **Ftrace Format** > **Raw Textual Ftrace** as shown in the following figure:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-7.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=0621c1665e2ffd9c2739e2f753e3ea9a" width="901" height="375" data-path="System/Performance/media/k2l-performance/fig-6-7.jpg" />
</div>

4. Install the required add-ons in Trace Compass for ftrace analysis. Go to **Menu** > **Tools** > **Add-ons**, and select **Trace Compass ftrace**.

   **Note:** It's recommended to update the Trace Compass preferences. To print timestamps that match the raw ftrace, change **Tracing** > **Time Format** to **TTT** (seconds in epoch).
5. To display the kernel and UST traces in one view, create an experiment and add both traces to it.

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-2.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=271309aa8e84c52245fb33b558c4b75e" width="515" height="742" data-path="System/Performance/media/k2l-performance/fig-6-2.jpg" />
</div>

6. Select **Views** > **LTTng-UST-CallStack** > **Flame Chart**, and then **Views** > **Linux Kernel** > **Resources**. Trace Compass displays the kernel resources and the user space application function call stack as shown in the following figure:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-3.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=bbfc593df1cad9933e3de664daea6ab4" width="1290" height="692" data-path="System/Performance/media/k2l-performance/fig-6-3.jpg" />
</div>

7. Follow step 6 to open a trace for the CPU frequency. Select the **Resources** panel and the **Timeline** view of the process running on a specified CPU. There is a frequency number in the CPU frequency line. The following figure shows CPU0 to CPU2 running at 2 GHz and CPU3 to CPU5 running at 2.8 GHz.

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-4.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=d4951a6c4a4b29a7e50854066cca8109" width="1270" height="470" data-path="System/Performance/media/k2l-performance/fig-6-4.jpg" />
</div>

### **Monitor CPU consumption of user space application**

Several Linux utilities, such as `top` and `htop`, can be used to monitor the CPU usage.

### **Top**

`top` is a tool that reports the CPU usage of each application and the overall CPU usage. Per-task usage is expressed relative to a single core, so on an octa‑core platform a task can consume from 0% to 800% of the CPU.
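Because the per-task figure is relative to one core, it can help to normalize a reading against the number of cores. The following is a minimal sketch under that assumption; the `cpu_share` function name is hypothetical and isn't part of `top`.

```shell
# cpu_share PERCENT CORES
# Convert a top-style %CPU reading (100% = one fully used core)
# into a percentage of the total CPU capacity of the platform.
cpu_share() {
    awk -v pct="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) exit 1      # reject a zero or negative core count
        printf "%.1f\n", pct / cores
    }'
}
```

For example, `cpu_share 400 8` prints `50.0`, which means that a task shown at 400% uses half of the capacity of an octa-core platform.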

To set a terminal environment to run top, run the following commands on the device:

```text theme={null}
export TERM=xterm
```

```text theme={null}
top
```

The following figure shows the CPU usage as an output of the command:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-5.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=66dc0b539911335aa7c6525f4a987eb1" width="687" height="299" data-path="System/Performance/media/k2l-performance/fig-6-5.jpg" />
</div>

### **htop**

htop displays the per-core CPU usage and the CPU usage of each process. To compile htop on a build, see [Compile performance tools](./get-started-with-performance-tuning-and-optimization#compile-performance-tools).

To set a terminal environment for htop, run the following commands on the device:

```text theme={null}
export TERM=xterm
```

```text theme={null}
htop
```

The following figure shows the per-core CPU usage as an output of the command:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-6.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=195b50f4887096639b9683943b059231" width="680" height="466" data-path="System/Performance/media/k2l-performance/fig-6-6.jpg" />
</div>

### **CPU usage in Trace Compass**

1. Open the Trace Compass tool on the host computer and load a trace.
2. Right-click on the trace, and choose **Select Trace Type** > **Ftrace Format Type** > **Raw Textual Ftrace** as shown in the following figure:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-7.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=0621c1665e2ffd9c2739e2f753e3ea9a" width="901" height="375" data-path="System/Performance/media/k2l-performance/fig-6-7.jpg" />
</div>

3. Right-click on **Raw Textual Ftrace** and select **Open**.
4. Double-click on **CPU usage** to view the system-wide CPU usage. Select a task in the left panel to check the CPU usage per task as shown in the following figure:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-8.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=c907e77f7e844e5fcdbb80c99870d51d" width="1055" height="274" data-path="System/Performance/media/k2l-performance/fig-6-8.jpg" />
</div>

### **Monitor the memory consumption of user space application**

You can check the memory allocation and memory usage for various processes.

To check memory consumption of a process, run the following command on a device:

```text theme={null}
cat /proc/<pid>/smaps_rollup
```

The following figure shows an output of the command:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-9.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=5310c5a54544168d34b51195fad30139" width="612" height="411" data-path="System/Performance/media/k2l-performance/fig-6-9.jpg" />
</div>
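To track a single value from this file over time, the proportional set size (`Pss`) line can be extracted with a short helper. This is a minimal sketch: `pss_kb` is a hypothetical name, and it assumes the usual `smaps_rollup` layout in which each line has the form `Field: value kB`.

```shell
# pss_kb FILE
# Print the proportional set size (PSS), in kB, from an smaps_rollup file.
pss_kb() {
    # Match the "Pss:" field exactly so that Pss_Anon, Pss_File, and
    # similar breakdown fields are ignored.
    awk '$1 == "Pss:" { print $2 }' "$1"
}
```

On the device, the helper could be called as `pss_kb /proc/<pid>/smaps_rollup`, with `<pid>` replaced by the process ID.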

### **Procrank**

Procrank is a tool that displays memory consumption for each process. By default, it shows the following set sizes:

* VSS: Virtual set size
* RSS: Resident set size
* PSS: Proportional set size
* USS: Unique set size

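PSS is generally considered the actual memory consumption of a process, because it counts private pages in full and divides shared pages among the processes that share them.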

### **Build Procrank from source code**

Run the following commands on the host computer:

```text theme={null}
sudo apt install -y gcc-aarch64-linux-gnu
```

```text theme={null}
git clone https://github.com/cglmcu/procrank.git
```

```text theme={null}
cd procrank
```

```text theme={null}
export CC=aarch64-linux-gnu-gcc
```

```text theme={null}
aarch64-linux-gnu-gcc *.c -Os -o procrank -I.
```

ADB is included in the Qualcomm Linux build. To enable ADB, do the following:

1. Boot the device.
2. Log in to the serial shell.
3. Run the following command:
   ```text theme={null}
   touch /etc/usb-debugging-enabled
   ```
4. To start ADB, use one of the following options:
   * Option 1: Reboot the device.
   * Option 2: Run the following command:
     ```text theme={null}
     systemctl start android-tools-adbd
     ```

Once enabled, ADB remains active unless the `/etc/usb-debugging-enabled` file is removed and the device is rebooted.

Use Android Debug Bridge (ADB) or a similar tool to transfer the Procrank binary from the host to the device. Here are the example commands:

```text theme={null}
adb shell mount -o remount,rw /usr
```

```text theme={null}
adb push procrank /usr/bin
```

```text theme={null}
adb shell chmod a+x /usr/bin/procrank
```

<Note>
  If you use SCP or another network-based tool instead of ADB, ensure that you specify the target IP address in the command.
</Note>

Procrank command examples:

* To view the anonymous memory allocated by each process, run the following command on the device:
  ```text theme={null}
  procrank -C
  ```
* To show the file cache memory allocated by each process, run the following command on the device:
  ```text theme={null}
  procrank -c
  ```
* To view both the anonymous and file cache memories allocated by each process, run the following command on the device:
  ```text theme={null}
  procrank
  ```

The following figure shows an example output of the `procrank -C` command:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/rpHTx_a6zriKQll9/System/Performance/media/k2l-performance/fig-6-10.jpg?fit=max&auto=format&n=rpHTx_a6zriKQll9&q=85&s=839252f3ce4b421c0ec5dc9b59ed95a6" width="796" height="698" data-path="System/Performance/media/k2l-performance/fig-6-10.jpg" />
</div>

### **Check instructions per cycle of the application**

The perf utility calculates instructions per cycle (IPC) for an application using the hardware performance counters.

To compile the perf utility, see [Compile performance tools](./get-started-with-performance-tuning-and-optimization#compile-performance-tools).

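To calculate IPC, run the following command on the device. As written, the command counts events for the `sleep 5` process itself; to measure your application, replace `sleep 5` with the application command, or attach to a running process by adding `-p <pid>` before `sleep 5`: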

```text theme={null}
perf stat -e cycles,instructions sleep 5
```

The following figure shows an example output of the command:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-check-instruction-per-cycle-one.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=6bd1989cb3b09967dc6a456ce678780e" width="725" height="171" data-path="System/Performance/media/k2l-performance/fig-6-check-instruction-per-cycle-one.jpg" />
</div>

* If the IPC is less than 1.0, the application is likely stalled on memory accesses. In this case, Qualcomm Linux tuning strategies, such as reducing the memory I/O workload, can help improve performance.
* If the IPC is greater than 1.0, the application is likely instruction bound. In this case, reducing code execution by eliminating unnecessary work and cache operations can help improve performance.
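The IPC value is the instruction count divided by the cycle count reported by `perf stat`. As a minimal sketch, the following helper computes that ratio from the two counter values; the `ipc` function name is hypothetical and isn't part of perf.

```shell
# ipc CYCLES INSTRUCTIONS
# Print instructions per cycle from two perf stat counter values.
ipc() {
    awk -v cycles="$1" -v instructions="$2" 'BEGIN {
        if (cycles <= 0) exit 1     # avoid dividing by zero
        printf "%.2f\n", instructions / cycles
    }'
}
```

For example, `ipc 2000000 1500000` prints `0.75`, which is below 1.0 and points toward memory stalls. If perf prints the counters with thousands separators, remove the separators before passing the values.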

### **Check parts of code consuming most CPU**

The perf utility, together with the FlameGraph scripts, can generate a flame graph that helps visualize the stack and CPU usage of a thread with all the functions running on the CPU.

To generate a flame graph, do the following:

* On the device:
  1. Collect logs to generate a flame graph. To collect logs using the perf utility tool, run the following commands:
     ```text theme={null}
     perf record -g -o /tmp/perf.data -p <process pid> sleep 5
     ```
     ```text theme={null}
     cd /tmp
     ```
     ```text theme={null}
     perf script > /tmp/perf.script
     ```
  2. Use SCP or a similar tool to transfer `perf.script` from the target to the host. Ensure that you specify the target IP address in the command. Here is an example command:
     ```text theme={null}
     scp -r root@10.92.162.185:/tmp/perf.script /local/mnt/workspace/logs
     ```
* On the host:
  1. Run the following command to download the FlameGraph scripts:
     ```text theme={null}
     git clone https://github.com/brendangregg/FlameGraph.git
     ```
     Ensure that you install Perl on the host computer.
  2. Copy `perf.script` to the `FlameGraph` directory, and then run the following commands to fold the stacks and render the flame graph:
     ```text theme={null}
     cd FlameGraph
     ```
     ```text theme={null}
     perl stackcollapse-perf.pl perf.script > out.folded
     ```
     ```text theme={null}
     perl flamegraph.pl out.folded > perf.svg
     ```
  3. Open the SVG file in a browser to view the flame graph and see which functions consume the most CPU time:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/rpHTx_a6zriKQll9/System/Performance/media/k2l-performance/fig-6-11.jpg?fit=max&auto=format&n=rpHTx_a6zriKQll9&q=85&s=39f1ccbc770f39e113ed6233c726fa6b" width="798" height="421" data-path="System/Performance/media/k2l-performance/fig-6-11.jpg" />
</div>

### **Check memory consumed by functions in the user space application code**

[Valgrind](https://valgrind.org/docs/manual/ms-manual.html), an open-source tool, provides a utility called massif that helps to analyze the memory consumed by each function in a program.

The following is a sample code for memory allocation:

```text theme={null}
#include <stdlib.h>

void g(void) {
   malloc(4000);
}

void f(void) {
   malloc(2000);
   g();
}

int main(void) {
   int i;
   int* a[10];
   for (i = 0; i < 10; i++) {
      a[i] = malloc(1000);
   }
   f();
   g();
   for (i = 0; i < 10; i++) {
      free(a[i]);
   }
   return 0;
}
```

Compile the source code and run the following Valgrind command on the device:

```text theme={null}
valgrind --tool=massif ./test
```

The following is an excerpt of the Massif output file (`massif.out.<pid>`) for the sample code:

cat massif.out.1587\
…\
n3: 20000 (heap allocation functions) malloc/new/new\[], --alloc-fns, etc.\
n0: 10000 0x10882B: main (in /home/root/valgrind/test)\
n2: 8000 0x1087E7: g (in /home/root/valgrind/test)\
n1: 4000 0x108807: f (in /home/root/valgrind/test)\
n0: 4000 0x10885B: main (in /home/root/valgrind/test)\
n0: 4000 0x10885F: main (in /home/root/valgrind/test)\
n1: 2000 0x108803: f (in /home/root/valgrind/test)\
n0: 2000 0x10885B: main (in /home/root/valgrind/test)

For more information about Valgrind, see [Valgrind User Manual](https://valgrind.org/docs/manual/ms-manual.html).

### **Detect memory leaks in the user space application**

To detect memory leaks within a process, you can use the Valgrind tool with the leak‑check feature enabled.

The following is a sample code where memory has been allocated but not released:

```text theme={null}
#include <stdlib.h>

void do_alloc(void) {
    int *x = malloc(10 * sizeof(int)); /* never freed: simulates a leak */
    x[10] = 0; /* out-of-bounds write: one element past the end of the block */
}

int main() {
    do_alloc();
    return 0;
}
```

To detect memory leaks, compile the sample code and run the following command on the device:

```text theme={null}
valgrind --leak-check=yes ./test
```

The following is an output of the sample code:

```text theme={null}
==1512== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1512== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1512== Command: ./test
==1512==
==1512== Invalid write of size 4
==1512==    at 0x1087B4: do_alloc (in /home/root/valgrind/test)
==1512==    by 0x1087CF: main (in /home/root/valgrind/test)
==1512==  Address 0x4a36068 is 0 bytes after a block of size 40 alloc'd
==1512==    at 0x486551C: malloc (vg_replace_malloc.c:381)
==1512==    by 0x1087A7: do_alloc (in /home/root/valgrind/test)
==1512==    by 0x1087CF: main (in /home/root/valgrind/test)
==1512==
==1512==
==1512== HEAP SUMMARY:
==1512==     in use at exit: 40 bytes in 1 blocks
==1512==   total heap usage: 1 allocs, 0 frees, 40 bytes allocated
==1512==
==1512== 40 bytes in 1 blocks are definitely lost in loss record 1 of 1
==1512==    at 0x486551C: malloc (vg_replace_malloc.c:381)
==1512==    by 0x1087A7: do_alloc (in /home/root/valgrind/test)
==1512==    by 0x1087CF: main (in /home/root/valgrind/test)
```

## **Advanced troubleshooting**

Advanced troubleshooting methods are used at the system level. These methods are crucial for building a Qualcomm reference device and integrating Qualcomm Linux across all layers to produce a final product.

For related information, see [Basic troubleshooting](#basic-troubleshooting).

### **Boot time**

The phases of boot time and boot time log markers help in debugging and optimizing the boot process.

The Qualcomm Linux boot chain can be divided into two phases:

* Boot loader initialization and kernel loading: The boot loader is initiated and the kernel is loaded.
* Linux system initialization: The kernel, drivers, and user space services are initialized.

### **First-phase timelines (Boot loader initialization and kernel loading)**

During the device booting sequence, collect the serial logs. Parsing these logs can provide a better understanding of the milestones in this phase.

The time taken across the modules can be measured using the respective timestamps listed in the following table:

|   **Module**   |                 **Debug lines printed**                 |
| :------------: | :-----------------------------------------------------: |
|    PBL + XBL   |                  "UEFI Start" timestamp                 |
|    Core UEFI   | "UEFI Total" – time consumed is printed in milliseconds |
| Kernel loading | Difference between the "UEFI End" and "OS Loader" timestamps |

For more information about how to collect serial logs, see [Measure boot time](./performance-dashboards#measure-boot-time).

The following is an example of the sample serial logs and timelines:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-boot-time-example.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=197f9f6a48176c4e09715e54850e38d4" width="649" height="337" data-path="System/Performance/media/k2l-performance/fig-6-boot-time-example.jpg" />
</div>

### **Second-phase timelines (Linux system initialization)**

To capture performance statistics during system boot, use the [systemd-analyze tool](https://www.freedesktop.org/software/systemd/man/latest/systemd-analyze.html).

To install the tool, see [Analyze performance with tools](./analyze-performance-with-tools).

To analyze the initialization of drivers within the kernel, enable the `initcall_debug` flag in the kernel boot command line. Use the systemd-analyze tool to analyze the initialization details of user space services and applications.
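With `initcall_debug` enabled, the kernel log typically contains lines of the form `initcall <function> returned <code> after <N> usecs`. Assuming that format, the following sketch lists the slowest initcalls from a log read on standard input; the `slowest_initcalls` function name is hypothetical.

```shell
# slowest_initcalls [COUNT]
# Read a kernel log on stdin and print the COUNT slowest initcalls
# (default 10) as "<usecs> <function>", slowest first.
slowest_initcalls() {
    awk '/initcall .* returned .* after [0-9]+ usecs/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "initcall") name = $(i + 1)   # function follows the keyword
            if ($i == "after") usecs = $(i + 1)     # duration follows "after"
        }
        print usecs, name
    }' | sort -rn | head -n "${1:-10}"
}
```

On the device, the helper could be used as `dmesg | slowest_initcalls 5` to focus on the drivers that take the longest to initialize.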

The following are example commands that you can run on the device to use the systemd-analyze tool:

* To obtain the kernel and user space boot time, run the following command:
  ```text theme={null}
  systemd-analyze time
  ```
  The following is an output of the command:\
  Linux QCS6490 (Linux 6.6.0 #1 SMP PREEMPT Sun Feb 4 18:35:47 UTC 2024) arm64.\
  Startup finished in 4.238s (kernel) + 15.620s (userspace) = 19.859s\
  multi-user.target reached after 15.594s in userspace
* To obtain the time consumed by each subsystem during boot, run the following command:
  ```text theme={null}
  systemd-analyze blame
  ```
  The following is an output of the command:\
  4.982s android-tools-adbd.service\
  3.013s dev-disk-by\x2dpartlabel-system.device\
  1.418s systemd-modules-load.service\
  1.179s sshdgenkeys.service

### **Graphical view of system initialization time**

The `systemd-analyze plot` command provides a graphical breakdown of the system services that have started, along with their initialization times.

To obtain a graphical breakdown of the system services, run the following command on the device:

```text theme={null}
systemd-analyze plot > /var/lib/systemd-plot.svg
```

To visualize time consumption across the modules in the system initialization phase and analyze the performance, open the `systemd-plot.svg` file in any web browser. The following figure shows the example graph:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-12.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=066f44e8dce8a7d06e13ea5781d75c00" width="1497" height="810" data-path="System/Performance/media/k2l-performance/fig-6-12.jpg" />
</div>

### **Identify CPU bound use cases**

To verify that a task is running on the most capable CPUs at their maximum frequency, capture the scheduler and frequency ftrace.

The following is a sample code that loads the CPU using a `while` loop:

```text theme={null}
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* volatile keeps the compiler from optimizing the busy loop away */
    volatile unsigned int i = 0;
    while (1) {
        i++;
    }
    return 0;
}
```

You can collect an ftrace for the sample code and use Trace Compass to load the ftrace. It allows you to check if the test thread is running on the Prime core at the maximum CPU frequency of 2.7 GHz as shown in the following figure:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-13.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=cbfe3b4baa3f6746b4773cf68c30f06c" width="1688" height="684" data-path="System/Performance/media/k2l-performance/fig-6-13.jpg" />
</div>

### **Identify I/O bound use cases**

To obtain I/O statistics, use `/proc/diskstats`.

For more information, see [/proc/diskstats](https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats).

The following is an example of running lmdd on the device for the I/O-bound use case:

* Before running the use case, run the following command:
  ```text theme={null}
  cat /proc/diskstats
  ```
  The following is an output of the command:\
  8 10 sda10 715 544 15056 250 4394 413 4199944 135729 0 5508 135979 0 0 0 0 0 0

  Next, get `pgpgin` and `pgpgout` from vmstat:
  ```text theme={null}
  cat /proc/vmstat
  ```
  The following is an output of the command:\
  pgpgin 348632\
  pgpgout 2100056
* To run lmdd, you must first compile lmbench. For more information, see [Compile performance tools](./get-started-with-performance-tuning-and-optimization#compile-performance-tools). For the I/O-bound use case, run the following lmdd command:
  ```text theme={null}
  lmdd if=/mnt/overlay/2GB.file of=/mnt/overlay/2GB.file.copy fsync=1 bs=1M
  ```
* After running the use case, run the following command:
  ```text theme={null}
  cat /proc/diskstats
  ```
  The following is an output of the command:\
  8 10 sda10 4822 544 4209448 13018 8530 451 8394624 300094 0 11836 313112 0 0 0 0 0 0
* Next, check `pgpgin` and `pgpgout` again:
  ```text theme={null}
  cat /proc/vmstat
  ```
  The following is an output of the command:\
  pgpgin 2446172\
  pgpgout 4197396

The following is an example of the statistics for an I/O-bound use case:

```text theme={null}
Sectors read = (4209448 - 15056) = 4194392 sectors (512 bytes each) ≈ 2 GB
Time spent reading = (13018 - 250) = 12768 ms
Sectors written = (8394624 - 4199944) = 4194680 sectors (512 bytes each) ≈ 2 GB
Time spent writing = (300094 - 135729) = 164365 ms
Time spent on I/O = (11836 - 5508) = 6328 ms

pgpgin gap = (2446172 - 348632) = 2097540 KB ≈ 2 GB
pgpgout gap = (4197396 - 2100056) = 2097340 KB ≈ 2 GB
```

For more information, see [I/O statistics fields](https://www.kernel.org/doc/Documentation/iostats.txt).
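The subtraction shown above can be scripted. The following is a minimal sketch that takes the before and after `/proc/diskstats` lines for the same device and prints the deltas. The `diskstats_delta` name is hypothetical, and the sketch assumes the standard field order (sectors read, read time, sectors written, write time, and I/O time in columns 6, 7, 10, 11, and 13) and 512-byte sectors.

```shell
# diskstats_delta BEFORE AFTER
# Each argument is one full /proc/diskstats line for the same device.
diskstats_delta() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 1; i <= NF; i++) before[i] = $i }
        NR == 2 {
            # Sector counts are converted from 512-byte units to MiB.
            printf "read_mib=%d\n",  ($6  - before[6])  * 512 / 1048576
            printf "read_ms=%d\n",    $7  - before[7]
            printf "write_mib=%d\n", ($10 - before[10]) * 512 / 1048576
            printf "write_ms=%d\n",   $11 - before[11]
            printf "io_ms=%d\n",      $13 - before[13]
        }'
}
```

Called with the two `sda10` lines from this example, the helper prints `read_mib=2048`, `read_ms=12768`, `write_mib=2048`, `write_ms=164365`, and `io_ms=6328`.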

### **Vmstat**

`vmstat` is a Linux command that reports virtual memory statistics, including block input (bi) and block output (bo). The following figure shows an example of the vmstat output:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-14.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=c1415bfa1543117d910418d622917026" width="694" height="255" data-path="System/Performance/media/k2l-performance/fig-6-14.jpg" />
</div>

For more information, see [Transparent Hugepage Support](https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html).

### **Use large cores for heavy use cases**

When a heavy task runs for a long time on a Silver (small) core, it can impact performance. Pin such tasks to the larger (Gold) cores using `sched_setaffinity()`. Setting this task affinity can reduce the runtime of the task and enhance performance.

<Warning>
  Any modification made to the nodes can impact the power and the performance of the device. It's important to verify the impact across all relevant use cases before changing the nodes.
</Warning>

The following figure from Trace Compass shows an example of a thread test running for 12.9 milliseconds on CPU0 at a frequency of 1.9 GHz.

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-15.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=4f3e4a48fef5eeef43431967f55bf45e" width="926" height="375" data-path="System/Performance/media/k2l-performance/fig-6-15.jpg" />
</div>

To set task affinity to the Gold core using `sched_setaffinity()`, see [sched\_setaffinity(2) — Linux manual page](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html).

The following is a sample code where a task is affined to Gold core 7:

```text theme={null}
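#define _GNU_SOURCE                       /* needed for cpu_set_t and the CPU_* macros */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(7, &mask);                    /* allow only CPU7 */
    pid_t tid = syscall(SYS_gettid);      /* ID of the calling thread */
    if (sched_setaffinity(tid, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");
    return 0;
}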
```

After the task is affined with `sched_setaffinity()`, it runs on CPU7 and the runtime is reduced from 12.9 milliseconds to 2.9 milliseconds with a CPU frequency of 2.7 GHz.

The following figure shows the reduced runtime after setting the affinity with `sched_setaffinity()`:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-16.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=cc59ad1297277adfa5349f2cc9cdc66a" width="948" height="296" data-path="System/Performance/media/k2l-performance/fig-6-16.jpg" />
</div>

### **Mitigate impact of runnables on use cases**

When a task is ready to run but no CPU is available, the task is considered to be in a runnable state. Tasks typically enter this state when the CPU is under heavy load.

To visualize the status of threads, you can use the Trace Compass **Control Flow** view.

The following figure displays thread statuses represented by different colors:

* A dark red line indicates that the thread is in a runnable state
* Yellow lines represent the sleep state
* A red line indicates that the CPU is busy handling `irq` or `softirq`

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/mitigate_impact_runnables.png?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=288ede338b9a46c2ed907fc631aa0967" width="1416" height="316" data-path="System/Performance/media/k2l-performance/mitigate_impact_runnables.png" />
</div>

Types of runnables:

* Wake-up latency runnable refers to the time that a woken task spends in the runnable state before it actually runs on the CPU. This latency can be reduced by tuning the scheduler or disabling the low‑power mode of the CPU.
* Normal runnable occurs when the scheduler selects higher-priority processes to run instead of the waiting task. Increasing the priority of a task can help reduce the runnables.

The priority of a thread depends on its type:

* The kernel priority of a real-time (RT) thread ranges from 0 to 99, with a lower number indicating a higher priority. To change the real-time thread priority, use the `SCHED_FIFO` policy in `sched_setscheduler()`. The `sched_priority` value that you pass ranges from 1 to 99, with a higher value indicating a higher priority; the kernel priority is 99 minus `sched_priority`.
* The priority of a normal thread ranges from 100 to 139, with a lower number indicating a higher priority. To change the normal thread priority, use the `renice` Linux command; `sched_setscheduler()` with the `SCHED_OTHER` policy places a thread in the normal class. The nice values in the range –20 to +19 are mapped to the thread priorities in the range 100 to 139.
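The mappings between user-visible values and kernel priorities can be sketched as two small helpers. The function names are hypothetical; the formulas (120 plus the nice value for normal threads, and 99 minus `sched_priority` for `SCHED_FIFO` threads) are the usual Linux conventions and are assumed here.

```shell
# nice_to_prio NICE
# Map a nice value (-20..19) to the kernel priority of a normal thread.
nice_to_prio() {
    [ "$1" -ge -20 ] && [ "$1" -le 19 ] || return 1
    echo $((120 + $1))
}

# rt_to_prio SCHED_PRIORITY
# Map a SCHED_FIFO sched_priority (1..99) to the kernel priority.
rt_to_prio() {
    [ "$1" -ge 1 ] && [ "$1" -le 99 ] || return 1
    echo $((99 - $1))
}
```

For example, `nice_to_prio 0` prints `120`, the default priority of a process started from the shell, and `rt_to_prio 1` prints `98`.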

To reduce the runnable time by changing the thread priority, use `sched_setscheduler()`.

For `sched_setscheduler()`, see [sched\_setscheduler(2)—Linux manual page](https://man7.org/linux/man-pages/man2/sched_setscheduler.2.html).

The following is a sample code that reduces runnable time by changing the thread priority using `sched_setscheduler()`:

```text theme={null}
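#include <sched.h>
#include <stdio.h>
int main(void) {
    struct sched_param param = { .sched_priority = 1 };  /* lowest SCHED_FIFO priority */
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0)  /* 0 = calling task */
        perror("sched_setscheduler");  /* typically needs root or CAP_SYS_NICE */
    return 0;
}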
```

The first parameter is the task ID; 0 refers to the calling task. The second parameter is the scheduler policy; `SCHED_FIFO` is for RT threads. The `sched_priority` value is set to 1, which maps to kernel priority 98:

```text theme={null}
sched_priority --> kernel priority
1 --> 99-1 --> 98 (lowest RT priority)
2 --> 99-2 --> 97
..
99 --> 99-99 --> 0 (highest RT priority)
```

By default, the process priority is 120, inherited from the shell. In this example, the runnable time is 225 milliseconds and the runtime is 267 milliseconds at the default priority. After raising the process priority from 120 to 98 (real-time priority), the runnable duration drops to less than 2 milliseconds.

### **Speed up CPU ramp-up time**

A delay in ramping up to the required higher CPU frequency can impact performance. You can tune the `sched_util_clamp_min` scheduler node to speed up the CPU frequency ramp-up.

Set `sched_util_clamp_min` to a value in the range of 0 to 1024. Higher values can enhance performance but may also increase power consumption.

The following are examples of how the test thread performs on core 4:

* When `sched_util_clamp_min` is 0, the CPU frequency ramps up slowly from 691 MHz to 1.5 GHz and then to 1.7 GHz. You can set this value by running the following command on the device:
  ```text theme={null}
  echo 0 > /proc/sys/kernel/sched_util_clamp_min
  ```
  The following figure from Trace Compass shows the ramping up of the CPU frequency:
  <div className="flex flex-col items-center gap-1">
    <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-18.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=dd710934392853cb2eeeec4715bd9a19" width="1231" height="99" data-path="System/Performance/media/k2l-performance/fig-6-18.jpg" />
  </div>
* When `sched_util_clamp_min` is 512, the CPU frequency ramps up directly from 691 MHz to 1.9 GHz. You can set this value by running the following command on the device:
  ```text theme={null}
  echo 512 > /proc/sys/kernel/sched_util_clamp_min
  ```
  The following figure shows the ramping up of the CPU frequency to 1.9 GHz:
  <div className="flex flex-col items-center gap-1">
    <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-19.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=c6998371206dcfc33fb4025a459ddab0" width="1178" height="98" data-path="System/Performance/media/k2l-performance/fig-6-19.jpg" />
  </div>
* When `sched_util_clamp_min` is 1024, the CPU frequency ramps up from 691 MHz directly to the maximum frequency (FMAX) of 2.4 GHz. You can set this value by running the following command on the device:
  ```text theme={null}
  echo 1024 > /proc/sys/kernel/sched_util_clamp_min
  ```
  The following figure shows the ramping up of the CPU frequency directly from 691 MHz to FMAX 2.4 GHz:
  <div className="flex flex-col items-center gap-1">
    <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-20.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=a4294427aabec5234759756ed512a3ae" width="947" height="102" data-path="System/Performance/media/k2l-performance/fig-6-20.jpg" />
  </div>
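
When you sweep several clamp values from a test script, a small helper can validate the value before writing it. The following Python sketch assumes the same `/proc/sys/kernel/sched_util_clamp_min` node used in the commands above and sufficient permissions to write to it. The helper names are hypothetical.

```python
from pathlib import Path

# Node used in the echo commands in this section.
CLAMP_NODE = Path("/proc/sys/kernel/sched_util_clamp_min")
CLAMP_MAX = 1024


def check_clamp(value: int) -> int:
    """Return the value unchanged if it lies in the 0..1024 range."""
    if not 0 <= value <= CLAMP_MAX:
        raise ValueError(f"clamp value must be in 0..{CLAMP_MAX}, got {value}")
    return value


def clamp_percent(value: int) -> float:
    """Express a clamp value as a percentage of the full 0..1024 range."""
    return 100.0 * check_clamp(value) / CLAMP_MAX


def set_util_clamp_min(value: int, node: Path = CLAMP_NODE) -> None:
    """Write the clamp value to the node, like `echo <value> > <node>`."""
    node.write_text(f"{check_clamp(value)}\n")
```

For example, `set_util_clamp_min(512)` writes the midpoint of the range, which `clamp_percent(512)` reports as 50 percent.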

### **Determine cache residency for use cases**

The perf utility is used to analyze cache miss and cache refill counter statistics. This analysis helps to determine where a use case resides in the memory hierarchy, such as in the L2 cache, the L3 cache, the last level cache controller (LLCC), or DDR.

For instructions on how to compile the perf utility, see [Compile performance tools](./get-started-with-performance-tuning-and-optimization#compile-performance-tools).

To check the cache events available for the target, run the following command on the device:

```text theme={null}
perf list | grep cache
```

The following is an example command to obtain the cache residency:

```text theme={null}
perf stat -e l1d_cache_lmiss_rd -e l1i_cache_lmiss -e l2d_cache_lmiss_rd -e l3d_cache_lmiss_rd -e ll_cache_miss_rd sleep 5
```

A miss at one cache level is served by the next level in the CPU path (L1 → L2 → L3 → LLCC → DDR). Therefore, the miss counter of each level indicates how much of the use case resides in the subsequent level.

The following sample output shows the cache miss counter statistics:

```text theme={null}
Performance counter stats for '5 duration':

           5797      l1d_cache_lmiss_rd
          26699      l1i_cache_lmiss
          16200      l2d_cache_lmiss_rd
           8634      l3d_cache_lmiss_rd
           9710      ll_cache_miss_rd

    5.004388332 seconds time elapsed

    0.001599000 seconds user
    0.000000000 seconds sys
```
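
To compare runs, the counter lines can be parsed into a dictionary. The following Python sketch works on the counter lines of the sample output above. The function names are hypothetical, and the ratio is only a rough indicator because the events count different kinds of accesses.

```python
import re

# Counter lines copied from the sample output above.
SAMPLE = """\
           5797      l1d_cache_lmiss_rd
          26699      l1i_cache_lmiss
          16200      l2d_cache_lmiss_rd
           8634      l3d_cache_lmiss_rd
           9710      ll_cache_miss_rd
"""

# A counter line starts with an integer count followed by the event name.
COUNTER_LINE = re.compile(r"^\s*([\d,]+)\s+(\S+)")


def parse_perf_stat(text):
    """Return {event_name: count} for every counter line in the text."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1).replace(",", ""))
    return counters


def miss_ratio(counters, upper, lower):
    """Fraction of the misses at `upper` that also miss at `lower`."""
    return counters[lower] / counters[upper]


counters = parse_perf_stat(SAMPLE)
ratio = miss_ratio(counters, "l2d_cache_lmiss_rd", "l3d_cache_lmiss_rd")
print(f"{ratio:.3f}")  # 0.533
```

In this sample, roughly half of the L2 read misses also miss in L3, which suggests that the remaining half is served from L3.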

### **Identify lock contention**

Lock contention occurs when one thread (thread\_1) attempts to acquire a mutex that's already held by another thread (thread\_2).

In this situation, thread\_1 goes to sleep and wakes up when thread\_2 releases the mutex.
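
The following Python sketch reproduces this situation in a small test program, which can be useful for generating a known contention pattern to look for in a trace. It's an illustration only: the names are hypothetical, and the main thread plays the role of thread\_1.

```python
import threading
import time


def measure_lock_wait(hold_seconds=0.2):
    """Return how long thread_1 waits for a lock that thread_2 holds."""
    lock = threading.Lock()
    lock_taken = threading.Event()

    def thread_2():
        with lock:
            lock_taken.set()          # signal that the lock is now held
            time.sleep(hold_seconds)  # simulate work inside the critical section

    holder = threading.Thread(target=thread_2, name="thread_2")
    holder.start()
    lock_taken.wait()                 # make sure that thread_2 owns the lock first

    start = time.monotonic()
    with lock:                        # thread_1 sleeps here until the release
        waited = time.monotonic() - start
    holder.join()
    return waited


if __name__ == "__main__":
    print(f"thread_1 waited {measure_lock_wait():.3f} s for the lock")
```

The measured wait is close to the time that thread\_2 spends inside the critical section.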

To find the thread that releases the lock, go to **Trace Compass** and select **Select Previous State Change** as shown in the following figure:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-21.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=64c79ee8206720df21e8c70d8d9484e9" width="280" height="96" data-path="System/Performance/media/k2l-performance/fig-6-21.jpg" />
</div>

The following figure shows an instance where thread 2991 wakes up thread 2993:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-22.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=873a7ff72d7cb180b0a1dc4e286ed78a" width="983" height="378" data-path="System/Performance/media/k2l-performance/fig-6-22.jpg" />
</div>

### **Determine duration of pre-emption disabling**

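The kernel is pre-emptive, which means that a running task, even one that's executing kernel code, can be paused at any moment to make way for a higher-priority task. As a result, a new task can start running in the same critical region where a previous task was pre-empted. To protect such regions, kernel code temporarily disables pre-emption, and a long pre-emption-disabled period delays higher-priority tasks.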

The following procedure outlines how to record the duration during which pre-emption is disabled:

1. In the kernel configuration of the source code, enable `CONFIG_IRQSOFF_TRACER` and `CONFIG_PREEMPT_TRACER`.
2. To collect a trace, run the following commands on the device:
   ```text theme={null}
   echo preemptoff > /sys/kernel/tracing/current_tracer
   ```
   ```text theme={null}
   echo 1 > /sys/kernel/tracing/tracing_on
   ```
   ```text theme={null}
   cat /sys/kernel/tracing/trace
   ```

As shown in the following figure, a timestamp is recorded for each instance of pre-emption being disabled, marking the start and end points in the code:

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-print-timestamp.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=6f9d2270984aea8ee9d4e9eef18cb34d" width="643" height="452" data-path="System/Performance/media/k2l-performance/fig-6-print-timestamp.jpg" />
</div>

For more information about function tracer, see [ftrace - Function Tracer](https://www.kernel.org/doc/Documentation/trace/ftrace.txt).
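
To check the longest recorded duration from a script, you can read the maximum latency that the tracer stores. The following Python sketch assumes the `tracing_max_latency` file described in the ftrace documentation, which reports the value in microseconds. The function names and the budget passed to `exceeds_budget()` are hypothetical.

```python
from pathlib import Path

# Tracing directory used in the commands in this section.
TRACING_DIR = Path("/sys/kernel/tracing")


def read_max_latency_us(tracing_dir: Path = TRACING_DIR) -> int:
    """Return the maximum latency recorded by the current tracer, in microseconds."""
    return int((tracing_dir / "tracing_max_latency").read_text().strip())


def exceeds_budget(latency_us: int, budget_us: int) -> bool:
    """Return True if the recorded latency is above the budget you chose."""
    return latency_us > budget_us
```

For example, `exceeds_budget(read_max_latency_us(), 1000)` flags a run in which pre-emption stayed disabled for more than 1 millisecond.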

### **Debug frame drops**

Frame drops can occur due to delays in various subsystems, such as the display or camera. For example, if the display refresh rate is 60 Hz, each frame must be completed within approximately 16.6 milliseconds (1000 milliseconds divided by 60).

The following figure shows a trace where `Weston` and `SDM_EventThread` run every 16.6 milliseconds. An application must render periodically and complete each frame within this 16.6-millisecond window. If rendering isn't complete before the window expires, the frame is dropped.

<div className="flex flex-col items-center gap-1">
  <img src="https://mintcdn.com/qualcomm-prod/OKFyShYzKWv2bmj8/System/Performance/media/k2l-performance/fig-6-23.jpg?fit=max&auto=format&n=OKFyShYzKWv2bmj8&q=85&s=20e53323eccbff9b95e1bfa0f7538d75" width="896" height="174" data-path="System/Performance/media/k2l-performance/fig-6-23.jpg" />
</div>
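
The frame budget arithmetic can be sketched in Python as follows. The render durations in the example are hypothetical values, standing in for per-frame times that you measure from a trace, and the function names are illustrative.

```python
def frame_budget_ms(refresh_hz: float) -> float:
    """Return the time available per frame, in milliseconds."""
    if refresh_hz <= 0:
        raise ValueError("refresh rate must be positive")
    return 1000.0 / refresh_hz


def late_frames(render_ms, refresh_hz: float = 60.0):
    """Return the indexes of frames whose render time exceeds the budget."""
    budget = frame_budget_ms(refresh_hz)
    return [index for index, duration in enumerate(render_ms) if duration > budget]


if __name__ == "__main__":
    durations = [10.0, 16.0, 18.5, 12.0, 33.4]  # hypothetical measurements in ms
    print(round(frame_budget_ms(60), 2))  # 16.67
    print(late_frames(durations))         # [2, 4]
```

Frames at the reported indexes miss the refresh window and are candidates for closer inspection in the trace.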

### **Identify memory thrashing**

Memory thrashing occurs when the system spends a significant amount of time reclaiming memory from RAM and then reloading the same content back into RAM.

This can occur on file cache pages from disk and anonymous pages from ZRAM, leading to substantial performance degradation.

Memory thrashing typically occurs when the available memory is insufficient for the current use case (referred to as the working set). This causes the system to struggle to find memory that can be reclaimed.

You can identify memory thrashing from the following information in `/proc/vmstat`:

|                    **vmstat nodes**                   |                                                                                                                                                                              **Description**                                                                                                                                                                              |
| :---------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
|  `workingset_refault_anon`/`workingset_refault_file`  |                                                                                                                   These nodes represent the number of reclaimed pages that are immediately requested after reclaim. The lower these numbers, the better.                                                                                                                  |
| `workingset_activate_anon`/`workingset_activate_file` |                                                                                                                   These nodes represent the number of reclaimed pages that are immediately activated after reclaim. The lower these numbers, the better.                                                                                                                  |
| `pgpgin`/`pswpin` | `pgpgin` represents the amount of data paged in from storage, and `pswpin` represents the number of pages swapped back into RAM. |
| `pgpgout`/`pswpout` | `pgpgout` represents the amount of data paged out to storage, and `pswpout` represents the number of pages written to swap as part of reclaim. If the `pgpg*` and `pswp*` counters increase together with the `workingset_refault_*` counters, it indicates a memory thrashing situation. |
|           `pgsteal_kswapd`/`pgsteal_direct`           |                                                                                                                                                     These nodes represent the number of pages reclaimed by the system.                                                                                                                                                    |
|            `pgscan_kswapd`/`pgscan_direct`            | These nodes represent the number of pages that the system has scanned to find reclaimable memory. The ratio of `pgsteal`/`pgscan` indicates the reclaim efficiency of the system. A higher value indicates better system performance while a lower reclaim efficiency indicates that the system is struggling to find reclaimable memory, indicative of memory thrashing. |

To identify memory thrashing, run the following command on the device:

```text theme={null}
cat /proc/vmstat
```

The following sample output shows the relevant vmstat fields:

```text theme={null}
workingset_refault_anon 984111
workingset_refault_file 1838690
workingset_activate_anon 502428
workingset_activate_file 499034
pgpgin 17488312
pgpgout 3398036
pswpin 984141
pswpout 2101230
pgsteal_kswapd 3946686
pgsteal_direct 59226
pgscan_kswapd 4660928
pgscan_direct 73719
```
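
The reclaim efficiency described in the table can be computed from these fields. The following Python sketch uses the `pgsteal` and `pgscan` values from the sample output above and sums the `kswapd` and `direct` counters, which is an assumption about how to combine them. The function names are hypothetical.

```python
# pgsteal and pgscan lines copied from the sample output above.
SAMPLE = """\
pgsteal_kswapd 3946686
pgsteal_direct 59226
pgscan_kswapd 4660928
pgscan_direct 73719
"""


def parse_vmstat(text):
    """Return {field_name: value} for every 'name value' line in the text."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            fields[parts[0]] = int(parts[1])
    return fields


def reclaim_efficiency(fields):
    """Return pgsteal / pgscan, or None if no pages were scanned."""
    stolen = fields["pgsteal_kswapd"] + fields["pgsteal_direct"]
    scanned = fields["pgscan_kswapd"] + fields["pgscan_direct"]
    return stolen / scanned if scanned else None


efficiency = reclaim_efficiency(parse_vmstat(SAMPLE))
print(f"{efficiency:.3f}")  # 0.846
```

In this sample, about 85 percent of the scanned pages are reclaimed.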

These counters are cumulative and only increase over time, so the rate of change is more informative than the absolute values.

To detect patterns in memory thrashing, gather data from these counters at regular intervals. Then, plot this data over a specific time period to visualize the patterns.
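
The periodic sampling can be sketched in Python as follows. This is a minimal example that assumes `/proc/vmstat` is readable on the device. The watched counters, the sample count, and the interval are illustrative choices, and the function names are hypothetical.

```python
import time
from pathlib import Path

# Counters to track; adjust the list to the fields that you want to plot.
WATCHED = ("workingset_refault_anon", "workingset_refault_file", "pswpin", "pswpout")


def read_vmstat(path="/proc/vmstat"):
    """Read a vmstat file into {field_name: value}."""
    fields = {}
    for line in Path(path).read_text().splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            fields[name] = int(value)
    return fields


def counter_deltas(before, after, names=WATCHED):
    """Return how much each watched counter grew between two snapshots."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in names}


def sample_deltas(samples=5, interval_s=1.0, path="/proc/vmstat"):
    """Collect per-interval deltas, one dictionary per sample."""
    rows = []
    previous = read_vmstat(path)
    for _ in range(samples):
        time.sleep(interval_s)
        current = read_vmstat(path)
        rows.append(counter_deltas(previous, current))
        previous = current
    return rows
```

Each row returned by `sample_deltas()` holds the growth of the watched counters during one interval. Sustained large deltas in the refault and swap counters are the pattern to look for when you plot the rows.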

## **Next steps**

* [Performance dashboards](./performance-dashboards)
