WhaTap GPU Monitoring

WhaTap GPU Monitoring provides integrated analysis of GPU resource status and utilization across server and Kubernetes environments. Although the GPU is only one of many resources in a server, it directly determines the performance of AI/ML, LLM, and HPC workloads, and it costs significantly more than CPU or memory.

Beyond simply checking whether a GPU is installed and running, you need visibility into how much it is being utilized, which workloads are occupying it, whether any anomalies exist, and whether resource allocation is appropriate.

WhaTap GPU Monitoring Scope

Note

WhaTap GPU Monitoring supported environments

  • GPU monitoring in Server environments

  • GPU monitoring in Kubernetes environments

Server Environment

  • GPU server monitoring periodically collects and stores key metrics such as GPU utilization, memory usage, temperature, power, clock speed, errors, and PCIe/NVLink communication status, enabling comprehensive assessment of GPU device health from a server infrastructure perspective.

  • Process-level GPU occupancy data is also provided, so you can see which workloads are using which GPUs and how much. This goes beyond simple device health checks to help you analyze how GPU resources are actually being used within the server.

  • Monitoring enables early detection of signs such as overheating, throttling, power limits, abnormal errors, and PCIe/NVLink communication issues — helping you prevent failures and respond quickly. This directly contributes to service stability and improved operational efficiency.
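As a rough illustration of what periodic per-GPU collection can look like, the sketch below parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`. This is not WhaTap's agent, whose internals this page does not describe. The sample text and its values are invented for the example, and the function name `parse_gpu_metrics` is hypothetical.

```python
import csv
import io

# Fields requested from nvidia-smi, in order. The command would be:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,\
#              temperature.gpu,power.draw,clocks.sm --format=csv,noheader,nounits
FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total",
          "temperature.gpu", "power.draw", "clocks.sm"]

# Illustrative sample output (made-up values), one line per GPU.
SAMPLE = """\
0, 87, 30210, 40960, 71, 285.40, 1410
1, 3, 512, 40960, 38, 61.20, 210
"""

def parse_gpu_metrics(text):
    """Turn nvidia-smi CSV text into a list of {field: float or None} dicts."""
    rows = []
    for raw in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not raw:
            continue  # skip blank lines
        row = {}
        for name, value in zip(FIELDS, raw):
            try:
                row[name] = float(value)
            except ValueError:
                row[name] = None  # non-numeric placeholder for an unsupported field
        rows.append(row)
    return rows

metrics = parse_gpu_metrics(SAMPLE)
for m in metrics:
    mem_pct = 100.0 * m["memory.used"] / m["memory.total"]
    print(f"GPU {int(m['index'])}: util={m['utilization.gpu']:.0f}% mem={mem_pct:.1f}%")
```

Storing each sample with a timestamp is what makes the trend-based checks described later in this document possible.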

Kubernetes Environment

  • In Kubernetes environments, GPUs are primarily allocated to AI training/inference workloads, batch jobs, and high-performance computing applications. Viewing only node-level GPU status is often insufficient for understanding real-world operational conditions.

  • WhaTap Kubernetes GPU monitoring provides visibility into GPU resource usage at the cluster, node, and pod/container levels. This helps you understand which workloads are occupying GPUs, whether usage is concentrated on specific nodes, and how efficiently GPUs are being used relative to requests and allocations.

  • GPU monitoring from a Kubernetes perspective is also useful for analyzing scheduling efficiency, resource imbalance, and over/under-allocation. Ultimately, it improves the stability of GPU-based workload operations and enhances overall cluster resource utilization.
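To make "efficiency relative to requests and allocations" concrete, here is a minimal sketch that assumes each pod sample carries the number of GPUs it requested and the average utilization of those GPUs. The record layout, the pod and node names, and the 20% underuse threshold are all hypothetical and do not reflect WhaTap's data model.

```python
from collections import defaultdict

# Hypothetical samples: (node, pod, GPUs requested, average utilization % of those GPUs).
PODS = [
    ("node-a", "train-llm-0", 4, 92.0),
    ("node-a", "notebook-7", 1, 4.0),
    ("node-b", "infer-api-2", 2, 35.0),
]

def node_efficiency(pods):
    """Per node: total requested GPUs and the request-weighted mean utilization."""
    requested = defaultdict(int)
    weighted = defaultdict(float)
    for node, _pod, gpus, util in pods:
        if gpus <= 0:
            continue  # pods without a GPU request do not affect the ratio
        requested[node] += gpus
        weighted[node] += gpus * util
    return {n: (requested[n], weighted[n] / requested[n]) for n in requested}

def underused_pods(pods, threshold_pct=20.0):
    """Pods that hold GPUs but use them below the (assumed) threshold."""
    return [pod for _node, pod, gpus, util in pods if gpus > 0 and util < threshold_pct]

summary = node_efficiency(PODS)
idle = underused_pods(PODS)
print(summary)
print(idle)
```

Weighting by requested GPUs keeps a small idle pod from hiding behind a large busy one on the same node, while the per-pod list still surfaces it as a reallocation candidate.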

Enterprise Environment

  • Enterprise environments typically run many GPU servers and Kubernetes clusters together. Monitoring individual servers or Kubernetes environments in isolation makes it difficult to get a comprehensive view of overall resource status, so an integrated management approach is required.

  • Organizations operating hundreds to thousands of GPUs want continuous visibility into how effectively GPU resources allocated by team or workload are being used. This data is essential for making informed decisions about idle resource reallocation, capacity expansion, and procurement planning.

  • By bringing multiple Kubernetes-based GPU environments and bare-metal GPU server environments under a single monitoring framework, you can manage distributed GPU resources from a single perspective. The result is a consolidated, organization-wide view of GPU resources that feeds directly into capacity planning.

Value of WhaTap GPU Monitoring

GPU Resource Visibility

Gain a multi-dimensional view of GPU usage down to the server, cluster, node, pod, and process level.

Early Anomaly Detection and Rapid Response

Quickly identify signs such as temperature spikes, power limits, clock degradation, errors, and communication issues before they escalate into failures.

GPU Utilization Optimization

Identify idle GPUs, uneven usage, and underutilized resources to enable reallocation and operational optimization.

Improved Operational Efficiency

Manage server and Kubernetes environments in an integrated manner — rather than separately — to reduce operational complexity.

Informed Expansion and Investment Decisions

Use real usage data to evaluate the need for GPU expansion, and leverage it as supporting evidence for budget and procurement planning.

Capacity Planning Support

Connect short-term incident response to medium- and long-term resource demand forecasting and operational strategy planning.

GPU Device Anomaly Detection

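This section describes three complementary methods for detecting anomalies in GPU devices and workloads: Xid events, status metrics, and process/workload information.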

1. Xid-Based Anomaly Detection

The NVIDIA driver writes Xid events to the system log when a GPU failure or abnormal condition occurs. Scanning system or driver logs for the Xid keyword and its numeric code enables relatively fast identification of GPU device anomalies.

For example, specific Xid events can indicate GPU computation errors, memory access issues, driver/hardware faults, or reset events — making Xid events a valuable failure signal in production environments.
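Log scanning of this kind can be sketched as follows. The regular expression assumes the common kernel-log shape `NVRM: Xid (PCI:<bus id>): <code>, ...`; the `PCI:` prefix is treated as optional to tolerate format variation between driver versions. The sample lines are simplified and invented for the example, and `find_xid_events` is a hypothetical helper, not part of any NVIDIA or WhaTap API.

```python
import re
from collections import Counter

# Assumed line shape: "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ..."
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?(?P<bus>[0-9A-Fa-f:.]+)\): (?P<code>\d+)")

def find_xid_events(lines):
    """Return (bus_id, xid_code) for every log line that reports an Xid."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group("bus"), int(match.group("code"))))
    return events

# Simplified, invented log excerpt; real messages carry more detail.
SAMPLE_LOG = [
    "kernel: NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.",
    "kernel: usb 1-1: new high-speed USB device number 2",
    "kernel: NVRM: Xid (PCI:0000:af:00): 48, pid=5678",
    "kernel: NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.",
]

events = find_xid_events(SAMPLE_LOG)
per_gpu = Counter(events)  # how often each (bus, code) pair appeared
for (bus, code), count in per_gpu.items():
    print(f"GPU {bus}: Xid {code} seen {count} time(s)")
```

Counting by (bus, code) pair separates a one-off event from a code that keeps recurring on the same device, which is usually the stronger failure signal.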

2. Status Metric-Based Anomaly Detection

GPU anomalies are not always reflected in error logs alone. In production environments, monitoring the change patterns of key status metrics — such as GPU utilization, memory usage, temperature, power, clock speed, and PCIe/NVLink communication status — enables earlier anomaly detection.

However, it is important to interpret these metrics not solely by absolute values, but in conjunction with workload characteristics, task type, time-of-day patterns, and typical baselines. For example, training, inference, batch processing, and data preprocessing workloads each have different normal GPU usage patterns, so the same value can mean different things depending on the operational context.

From this perspective, the following pattern changes are key areas to monitor:

  • GPU utilization is excessively high compared to typical workload patterns, or allocated resources remain persistently underutilized

  • Memory occupancy remains abnormally high after a task completes, or memory usage spikes sharply from a certain point

  • Temperature continuously rises beyond the normal operating range, or a sustained high-temperature condition persists for an extended period

  • Power consumption hovers near the limit for a prolonged period, suggesting performance throttling due to power capping

  • Clock speed remains lower than expected relative to the load level, suggesting throttling or abnormal control behavior

  • PCIe/NVLink traffic rises or falls abnormally relative to workload characteristics, or error indicators appear on the InfiniBand interconnect
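Several of the patterns above matter only when they persist, so a check on a single sample is not enough. The sketch below assumes samples taken at a fixed interval and flags a condition only when it holds for a minimum number of consecutive samples. The record keys and every threshold (83 C, 95% of the power limit, 80% of maximum clock) are illustrative assumptions; appropriate values depend on the GPU model and workload.

```python
def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

def sustained(samples, predicate, min_samples):
    """True if predicate holds for at least min_samples consecutive samples."""
    return longest_run(predicate(s) for s in samples) >= min_samples

# Hypothetical samples for one GPU: temperature (C), power draw and limit (W),
# SM clock and maximum clock (MHz), utilization (%).
SAMPLES = [
    {"temp": 70, "power": 250, "power_limit": 300, "clock": 1400, "clock_max": 1410, "util": 95},
    {"temp": 84, "power": 296, "power_limit": 300, "clock": 1100, "clock_max": 1410, "util": 96},
    {"temp": 86, "power": 298, "power_limit": 300, "clock": 1050, "clock_max": 1410, "util": 97},
    {"temp": 87, "power": 299, "power_limit": 300, "clock": 1020, "clock_max": 1410, "util": 97},
]

# Sustained high temperature.
hot = sustained(SAMPLES, lambda s: s["temp"] >= 83, min_samples=3)
# Power draw hovering near the configured limit.
power_capped = sustained(
    SAMPLES, lambda s: s["power"] >= 0.95 * s["power_limit"], min_samples=3)
# Clock well below maximum even though the GPU is busy.
clock_low_under_load = sustained(
    SAMPLES, lambda s: s["util"] >= 90 and s["clock"] < 0.8 * s["clock_max"], min_samples=3)

print(hot, power_capped, clock_low_under_load)
```

When all three flags fire together, as in this made-up series, the combination points toward thermal or power throttling rather than a workload change.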

Interpreting status metrics as change patterns within operational context — rather than as individual values — allows you to more accurately identify issues such as performance degradation, overheating, throttling, power limits, communication bottlenecks, and abnormal behavior.
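One simple way to express "the same value can mean different things" is to score each reading against the baseline of its own workload type. The sketch below uses a standard-deviation distance for this; the baseline numbers, the workload labels, and the threshold of 3.0 are assumptions chosen for illustration, and a production system would likely also account for time-of-day patterns.

```python
from statistics import mean, stdev

# Hypothetical historical GPU utilization (%) per workload type.
BASELINES = {
    "training": [92, 95, 90, 94, 93, 91],
    "inference": [30, 35, 28, 33, 31, 29],
}

def deviation_score(workload_type, value, baselines=BASELINES):
    """Distance between value and the workload's baseline mean, in standard deviations."""
    history = baselines[workload_type]
    center = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any different value is infinitely far away.
        return 0.0 if value == center else float("inf")
    return abs(value - center) / sigma

def is_anomalous(workload_type, value, threshold=3.0):
    """Flag values more than threshold standard deviations from the baseline."""
    return deviation_score(workload_type, value) > threshold

# The same 32% utilization is ordinary for inference but far off for training.
print(is_anomalous("inference", 32))  # prints False
print(is_anomalous("training", 32))   # prints True
```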

3. Process/Workload-Based Anomaly Detection

Viewing workload information such as processes, pods, and containers occupying the GPU alongside device metrics helps clarify anomalies that would be difficult to identify from device metrics alone. However, this also requires considering workload type, execution stage, repetition cycle, and typical occupancy patterns — not just whether something is occupying the GPU.

Since training, inference, batch processing, and data preprocessing workloads use GPUs differently, the criteria for normal versus abnormal can vary depending on the type of task — even when the occupancy state appears the same.

From this perspective, the following patterns are key areas to monitor:

  • A specific process or workload occupies GPU memory in excess of what its task characteristics would suggest, without releasing it within the expected timeframe

  • Memory occupancy is sustained for a long time while GPU utilization is low, suggesting task stagnation, abnormal waiting, or potential memory leaks

  • A specific task repeatedly restarts or fails while inefficiently occupying GPU resources

  • Workloads are persistently concentrated on a subset of GPUs, causing usage imbalance or localized overheating

  • Compared to similar workloads, a specific process shows excessively long execution time, high memory occupancy, or abnormal occupancy patterns

  • After a pod or container is rescheduled, GPU occupancy patterns change drastically, suggesting scheduling inefficiency or resource imbalance
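The "memory held while utilization stays low" pattern can be sketched with per-process samples. The `ProcSample` layout, the helper name `stalled_holders`, and the thresholds (8192 MiB, 5% utilization, 3 consecutive samples) are hypothetical and would need tuning per workload type, since a model server waiting for requests can legitimately look idle.

```python
from dataclasses import dataclass

@dataclass
class ProcSample:
    """One observation of a process on a GPU (hypothetical record layout)."""
    pid: int
    gpu_mem_mib: float   # GPU memory held by the process
    gpu_util_pct: float  # GPU utilization attributed to the process

def stalled_holders(history, mem_mib=8192, util_pct=5.0, min_samples=3):
    """PIDs whose most recent min_samples observations all hold much memory while idle.

    history maps pid -> list of ProcSample in time order.
    """
    flagged = []
    for pid, samples in history.items():
        recent = samples[-min_samples:]
        if len(recent) < min_samples:
            continue  # not enough data to judge
        if all(s.gpu_mem_mib >= mem_mib and s.gpu_util_pct <= util_pct for s in recent):
            flagged.append(pid)
    return sorted(flagged)

HISTORY = {
    # Was busy, then went idle while still holding about 30 GiB.
    101: [ProcSample(101, 30000, 90), ProcSample(101, 30000, 2),
          ProcSample(101, 30000, 1), ProcSample(101, 30000, 0)],
    # Holds memory and keeps using the GPU: normal.
    202: [ProcSample(202, 12000, 80), ProcSample(202, 12000, 85),
          ProcSample(202, 12000, 83)],
    # Only one observation so far: too early to flag.
    303: [ProcSample(303, 20000, 0)],
}

suspects = stalled_holders(HISTORY)
print(suspects)  # prints [101]
```

A flagged PID is a candidate for investigation, not proof of a leak; pairing it with the workload type and execution stage, as described above, decides whether the occupancy is expected.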

These patterns can help identify not only device-level issues but also application or workload operational problems.