Using GPU infrastructure monitoring
GPUs are a core resource: they cost far more than other infrastructure (CPU, memory) and directly affect AI/ML, LLM, and HPC workload performance. To protect your investment, you need to look beyond "is the GPU alive" and track utilization, which workloads occupy each GPU, anomalies, and placement suitability. WhaTap GPU monitoring provides this in a unified way across both server and Kubernetes environments.
Four questions GPU monitoring must answer
| Question | What happens if unanswered |
|---|---|
| How heavily is it used? | Expensive GPUs sit idle undetected → over-investment |
| Who (which Pod/workload) occupies it? | A single Pod monopolizes the GPU → other teams' workloads are delayed |
| Are there any anomalies? | Temperature or power issues start degrading performance → later escalate into hardware lifetime problems |
| Is the placement appropriate? | A mix of overloaded and idle GPUs → reallocation decisions are delayed |
Supported environments
WhaTap provides two views of GPUs. Choose the one that matches where your workloads are deployed.
GPU monitoring in server environments
Track GPUs installed directly on bare metal or VM servers. Details: Server GPU monitoring
- GPU inventory — Asset management: model, quantity, and location of installed GPUs
- GPU performance summary — Summary of utilization, temperature, power, and memory
- GPU metrics — Detailed time-series metrics
GPU monitoring in Kubernetes environments
Track Node ↔ GPU (MIG) ↔ Pod mapping across the Kubernetes cluster. Details: K8s GPU monitoring
- GPU dashboard — Visualize the Node-GPU-Pod mapping
- GPU trends — Long-term usage pattern analysis
- GPU metrics — Detailed metrics
MIG (Multi-Instance GPU) support — Monitor NVIDIA GPUs at both the physical (P) and MIG instance (M) level. Essential for environments that split GPUs within a cluster.
Prerequisites
- Kubernetes GPU dashboard: Kubernetes agent 1.8.7 or later + OpenAgent installed
- Server GPU: GPU module enabled in the server agent (agent-gpu)
Usage scenarios
① Get GPU asset status at a glance
When adopting or migrating infrastructure, you first need to answer "where are our GPUs and how many?" Use the Server GPU inventory to check models, quantities, and allocation status in one place.
② Track workload bottlenecks
When LLM or ML inference slows down:
- Check the Top 5 by utilization, temperature, and memory on the GPU dashboard.
- When an overloaded GPU is found, trace its node-Pod mapping.
- In a MIG environment, drill down to identify which instance is saturated.
In an LLM context, cross-reference with LLM Observability metrics to distinguish between "GPU utilization saturation" and "model choice or prompt length" as the root cause.
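The drill-down steps above can be sketched offline as a ranking followed by a lookup. This is a minimal illustration, not WhaTap's implementation: the `GpuSample` record, the `top_n` and `occupants` helpers, and the `gpu-0/mig-1` device naming are all hypothetical, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuSample:
    node: str
    device: str            # hypothetical ID: "gpu-0" is physical, "gpu-0/mig-1" is a MIG instance
    pod: Optional[str]     # Pod bound to the device, or None if unallocated
    utilization: float     # percent
    temperature: float     # degrees Celsius
    memory: float          # percent of device memory in use


def top_n(samples, metric, n=5):
    """Return the n samples with the highest value for the given metric."""
    return sorted(samples, key=lambda s: getattr(s, metric), reverse=True)[:n]


def occupants(samples, node, physical):
    """List (device, pod, utilization) for a physical GPU and its MIG instances,
    busiest first, restricted to one node."""
    hits = [
        (s.device, s.pod, s.utilization)
        for s in samples
        if s.node == node
        and (s.device == physical or s.device.startswith(physical + "/"))
    ]
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Made-up samples for illustration only.
samples = [
    GpuSample("node-a", "gpu-0/mig-0", "llm-infer-1", 97.0, 81.0, 92.0),
    GpuSample("node-a", "gpu-0/mig-1", "notebook-3", 12.0, 81.0, 20.0),
    GpuSample("node-a", "gpu-1", None, 3.0, 41.0, 1.0),
    GpuSample("node-b", "gpu-0", "batch-7", 64.0, 70.0, 55.0),
]

worst = top_n(samples, "utilization", n=1)[0]
physical = worst.device.split("/")[0]
for device, pod, util in occupants(samples, worst.node, physical):
    print(device, pod, util)
```

The lookup filters on the node as well as the device, because device identifiers such as `gpu-0` typically repeat across nodes.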
③ Anomaly detection
- Temperature or power anomalies: Early warning on hardware issues → prevent outages
- GPU in Pending state: Detect missing allocations
- Unused GPUs: Catch budget waste early
- Utilization skew: One node saturated while others are idle → reallocation signal
Link to alert rules: Add GPU metrics to event rules so that exceeding a threshold triggers an automatic notification. See Attach your first alert for setup details.
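To make the anomaly signals concrete, the sketch below applies three of them (temperature, unused GPUs, utilization skew) to a list of exported samples. Everything here is an assumption for illustration: the threshold values, the dictionary keys, and the `find_anomalies` helper are hypothetical and are not WhaTap defaults or APIs. Pending-state detection is omitted because it depends on scheduler data rather than GPU metrics.

```python
from statistics import mean

# Assumed thresholds; tune them to your hardware and workloads.
TEMP_LIMIT_C = 85.0    # flag GPUs at or above this temperature
IDLE_UTIL_PCT = 5.0    # flag GPUs at or below this utilization as unused
SKEW_GAP_PCT = 60.0    # flag when per-node averages differ by at least this much


def find_anomalies(gpus):
    """gpus: list of dicts with keys node, gpu, util (percent), temp_c."""
    findings = []
    per_node = {}
    for g in gpus:
        if g["temp_c"] >= TEMP_LIMIT_C:
            findings.append(("temperature", g["node"], g["gpu"]))
        if g["util"] <= IDLE_UTIL_PCT:
            findings.append(("unused", g["node"], g["gpu"]))
        per_node.setdefault(g["node"], []).append(g["util"])

    # Utilization skew: compare the busiest and the idlest node averages.
    averages = {node: mean(vals) for node, vals in per_node.items()}
    if averages:
        busiest = max(averages, key=averages.get)
        idlest = min(averages, key=averages.get)
        if averages[busiest] - averages[idlest] >= SKEW_GAP_PCT:
            findings.append(("skew", busiest, idlest))
    return findings


# Made-up samples for illustration only.
gpus = [
    {"node": "node-a", "gpu": "gpu-0", "util": 98.0, "temp_c": 88.0},
    {"node": "node-a", "gpu": "gpu-1", "util": 95.0, "temp_c": 70.0},
    {"node": "node-b", "gpu": "gpu-0", "util": 2.0, "temp_c": 40.0},
]
for finding in find_anomalies(gpus):
    print(finding)
```

In practice the same thresholds would live in event rules so that notifications fire automatically; the script form is only useful for reasoning about which thresholds make sense before configuring them.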
④ Optimize resource placement
- Check long-term usage patterns with GPU trends.
- If only specific time windows are saturated, adjust scheduling or placement.
- Per-team usage → grounds for internal chargeback or quota policy
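As a rough sketch of the last two points, the following computes which hours of the day are saturated and how GPU-hours split across teams. The input shapes, the 90% saturation threshold, and the helper names `saturated_hours`, `gpu_hours_by_team`, and `usage_share` are hypothetical; real inputs would come from whatever export of trend data you have available.

```python
from collections import defaultdict
from statistics import mean


def saturated_hours(samples, threshold=90.0):
    """samples: iterable of (hour_of_day, utilization_percent).
    Returns the hours whose average utilization is at or above the threshold."""
    by_hour = defaultdict(list)
    for hour, util in samples:
        by_hour[hour].append(util)
    return sorted(h for h, vals in by_hour.items() if mean(vals) >= threshold)


def gpu_hours_by_team(records):
    """records: iterable of (team, gpu_count, hours). Returns GPU-hours per team."""
    totals = defaultdict(float)
    for team, gpu_count, hours in records:
        totals[team] += gpu_count * hours
    return dict(totals)


def usage_share(totals):
    """Convert GPU-hours per team into a fraction of the total."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {team: value / grand_total for team, value in totals.items()}


# Made-up data for illustration only.
hourly = [(9, 95.0), (9, 91.0), (10, 97.0), (10, 70.0), (14, 99.0)]
records = [("research", 4, 10.0), ("serving", 2, 5.0), ("research", 1, 10.0)]

print(saturated_hours(hourly))
totals = gpu_hours_by_team(records)
print(totals, usage_share(totals))
```

GPU-hours is only one possible basis for chargeback; weighting by GPU model or by MIG slice size may be fairer in mixed fleets.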
⑤ Capacity planning
Monthly and quarterly GPU usage trends provide grounds for scale-up or scale-down decisions. Include them in the quarterly retrospective of the Performance reporting scenario.
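One simple way to turn a usage trend into a scale-up signal is a straight-line fit over monthly averages. The sketch below is an assumption-laden illustration: the 85% ceiling, the `linear_trend` and `months_until` helpers, and the sample values are hypothetical, and a linear fit ignores seasonality and step changes such as new model launches.

```python
import math


def linear_trend(values):
    """Least-squares fit of values against their index (0, 1, 2, ...).
    Returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def months_until(values, ceiling=85.0):
    """Estimate how many months remain until the fitted line reaches the ceiling.
    Returns None when the trend is flat or falling."""
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    current = intercept + slope * (len(values) - 1)  # fitted value for the latest month
    if current >= ceiling:
        return 0
    return math.ceil((ceiling - current) / slope)


# Made-up monthly average utilization (percent) for illustration only.
monthly_avg = [50.0, 55.0, 60.0, 65.0]
print(linear_trend(monthly_avg), months_until(monthly_avg))
```

A falling or flat trend returns `None`, which is itself useful input for a scale-down discussion.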
Dashboard structure highlights
Kubernetes GPU dashboard
- GPU resource status summary (top four widgets): Counts of allocated nodes, Pods, and GPUs by status
- GPU Map: Device map chart (P = physical, M = MIG)
  - Grouped by node or physical device
  - Color-coded by status and utilization
- Top 5 trends: Time series of GPUs ranked by utilization, temperature, and memory
Details: GPU dashboard
Server GPU performance summary
- Real-time utilization, temperature, power, and memory per installed GPU
- Per-node summary with drill-down to individual GPUs
Details: GPU performance summary
Next steps
- Server GPU installation and configuration → Server GPU agent configuration
- K8s GPU map and drill-down → K8s GPU dashboard
- Long-term usage pattern analysis → GPU trends
- Combine with the LLM perspective → Adopting LLM Observability
- Kubernetes-wide observability → Kubernetes Observability
- Add GPU event alert rules → Attach your first alert