Supporting the NVIDIA GPU

How the WhaTap Kubernetes agent collects GPU metrics

The WhaTap Kubernetes node agent collects and monitors NVIDIA GPU performance metrics by using the Data Center GPU Manager (DCGM) Exporter. The collection process is structured around the Sidecar pattern.

  • Sidecar pattern

    The DCGM Exporter runs as a secondary container within the same Pod as the main application container. This Sidecar pattern allows the DCGM Exporter to collect GPU status information efficiently.

  • DCGM Exporter

    The dcgm-exporter container collects GPU status and performance metrics via NVIDIA's Data Center GPU Manager (DCGM).

  • Metric collection and transmission

    The whatap-node-agent container requests and collects GPU metrics via the HTTP endpoint of dcgm-exporter, as shown in the sketch after this list.

    Note

    The HTTP endpoint of dcgm-exporter usually uses port 9400.
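
The following is a minimal sketch of how such a scrape could look from inside the Pod, assuming the dcgm-exporter endpoint is reachable at localhost:9400. The URL and parsing logic are illustrative only and do not reflect the node agent's actual implementation.

    # Minimal sketch: read one DCGM gauge from the dcgm-exporter HTTP endpoint.
    # Assumes the exporter is reachable at localhost:9400 inside the same Pod.
    import urllib.request

    METRICS_URL = "http://localhost:9400/metrics"  # default dcgm-exporter port

    def scrape_gpu_util(url: str = METRICS_URL) -> list[float]:
        """Return every DCGM_FI_DEV_GPU_UTIL sample found in the exposition text."""
        with urllib.request.urlopen(url, timeout=5) as resp:
            text = resp.read().decode("utf-8")
        values = []
        for line in text.splitlines():
            # HELP/TYPE comment lines start with '#'; sample lines end with the value.
            if line.startswith("DCGM_FI_DEV_GPU_UTIL"):
                values.append(float(line.rsplit(" ", 1)[-1]))
        return values

    if __name__ == "__main__":
        print(scrape_gpu_util())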

Collection scope

The WhaTap Kubernetes agent focuses on efficiently monitoring and managing the GPU usage of each container deployed on the node.

  • Containers

    The WhaTap node agent collects GPU metrics for each container, giving you a clear view of how much of the GPU resources each container uses (see the sketch after this list). This data is useful for optimizing resource allocation.

  • Nodes

    You can monitor GPU usage across the entire node to assess node-wide GPU resource utilization. This information is useful for analyzing the overall performance of the cluster, but it does not provide details at the individual process level.
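
As a rough illustration of the per-container view, the sketch below groups DCGM_FI_DEV_GPU_UTIL samples by the container label in the exporter's Prometheus text output. The label names (container, pod, namespace) assume that dcgm-exporter's Kubernetes pod mapping is enabled, and the naive label parsing is for illustration only.

    # Rough sketch: group GPU utilization samples by their "container" label.
    # Assumes dcgm-exporter attaches Kubernetes labels (container, pod, namespace).
    import re
    from collections import defaultdict

    SAMPLE_RE = re.compile(r'^DCGM_FI_DEV_GPU_UTIL\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')

    def util_by_container(exposition_text: str) -> dict[str, list[float]]:
        grouped: dict[str, list[float]] = defaultdict(list)
        for line in exposition_text.splitlines():
            m = SAMPLE_RE.match(line)
            if not m:
                continue
            # Naive label parsing: splits on commas, so quoted commas are not handled.
            labels = dict(kv.split("=", 1) for kv in m.group("labels").split(",") if "=" in kv)
            container = labels.get("container", '"unknown"').strip('"')
            grouped[container].append(float(m.group("value")))
        return dict(grouped)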

Collection metrics

The following lists the key GPU metrics collected by the DCGM Exporter:

Node level metrics

These metrics focus on the GPU status of all Kubernetes nodes and are useful for assessing the overall GPU utilization of the cluster.

  • DCGM_FI_DEV_GPU_UTIL Gauge

    • This metric represents the GPU utilization, displaying the current GPU usage as a percentage.

    • It monitors the GPU usage for all nodes in real time to help ensure proper resource allocation and load balancing.

  • DCGM_FI_DEV_MEM_COPY_UTIL Gauge

    • This metric represents memory bandwidth utilization, reporting GPU memory bandwidth usage as a percentage.

    • It monitors the GPU memory resource usage to help detect and resolve memory bandwidth bottlenecks.

  • DCGM_FI_DEV_POWER_USAGE Gauge

    • This metric represents the current power consumption of the GPU in watts (W).

    • It monitors the GPU power usage to improve power efficiency and optimize operating costs.

  • DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Counter

    • It measures the total accumulated GPU energy consumption since system boot, in millijoules (mJ).

    • Tracking energy usage across all nodes helps improve long-term energy efficiency and establish operational strategies; see the sketch after this list.
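
Because DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is a cumulative counter in millijoules, average power over an interval can be derived from two samples. The helper below is a minimal sketch of that arithmetic and assumes the counter did not reset between the two readings.

    # Sketch: derive average power (W) from two readings of the cumulative
    # energy counter, which is reported in millijoules (mJ).
    def average_power_watts(energy_start_mj: float, energy_end_mj: float,
                            interval_seconds: float) -> float:
        """1 W = 1 J/s and 1 J = 1000 mJ; assumes no counter reset in between."""
        delta_joules = (energy_end_mj - energy_start_mj) / 1000.0
        return delta_joules / interval_seconds

    # Example: 3,000,000 mJ consumed over 60 s -> 50 W on average.
    print(average_power_watts(1_000_000_000, 1_003_000_000, 60))  # 50.0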

Container level metrics

These metrics focus on GPU resource usage within individual containers and help you evaluate whether each container uses its allocated resources effectively.

  • DCGM_FI_DEV_FB_FREE and DCGM_FI_DEV_FB_USED Gauge

    • These metrics display the amount of free frame buffer memory and the amount of frame buffer memory in use, in MiB.

    • By monitoring the GPU memory usage of each container, you can optimize memory resource allocation and prevent performance degradation caused by resource shortages; see the sketch after this list.

  • DCGM_FI_DEV_SM_CLOCK and DCGM_FI_DEV_MEM_CLOCK Gauge

    • These metrics display the streaming multiprocessor (SM) clock frequency and the memory clock frequency of each GPU in MHz.

    • Monitoring the GPU clock speeds used by each container helps optimize performance and set appropriate frequencies, allowing you to deliver performance tailored to the needs of each application.

  • DCGM_FI_DEV_GPU_TEMP and DCGM_FI_DEV_MEMORY_TEMP Gauge

    • These metrics measure the GPU temperature and the memory temperature in degrees Celsius (°C).

    • You can monitor the GPU temperatures of each container to prevent overheating and ensure stable operation. This extends the life of the system and reduces downtime.
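
As a simple worked example, per-container frame buffer utilization can be expressed as used / (used + free). The helper below is illustrative only and assumes both gauges are reported in MiB, as listed above.

    # Sketch: compute frame buffer utilization (%) from the two MiB gauges.
    def fb_utilization_percent(fb_used_mib: float, fb_free_mib: float) -> float:
        total = fb_used_mib + fb_free_mib
        if total == 0:
            return 0.0
        return 100.0 * fb_used_mib / total

    # Example: 12,288 MiB used and 4,096 MiB free -> 75.0 % utilization.
    print(fb_utilization_percent(12_288, 4_096))  # 75.0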