Incident response scenario
When an alert fires and "where do I start?" becomes an ad hoc decision every time, response slows down. This guide walks the incident responder through a single, consistent flow — detection, mitigation, postmortem — and maps each step to the WhaTap menus to use.
Who this is for:
- Engineers currently on-call or preparing to take on-call duties
- Leads or SREs who want to establish an incident response playbook for the team
- Teammates who "get the alerts but aren't sure what to look at next"
Prerequisites
Complete these three Quick Wins before relying on this flow in production:
- Attach your first alert — default alert rules
- Share a dashboard with your team — shared dashboard + permissions
- Set up team-based alert routing — tag-based alert routing
The five steps
① Detect → ② Scope → ③ Root cause
↓
④ Mitigate → ⑤ Postmortem
① Detect — Validate the alert
Goal: within 1 minute, decide whether the alert is a real anomaly.
- From the Slack/Email/SMS/mobile notification, check:
  - Level (Critical / Warning)
  - Project and agent name
  - Which event rule triggered (e.g., "CPU usage threshold exceeded")
- Click the deep link in the notification to jump straight into the WhaTap screen.
If short spikes cause frequent false positives, lengthen the event rule's duration condition (e.g., 1 min → 5 min). Repeated false positives that go unaddressed erode the team's trust in alerts.
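To see why a longer duration condition suppresses short spikes, here is a minimal sketch of the idea. It is not WhaTap's actual rule evaluation: the `fires` function, the one-sample-per-minute interval, and the CPU values are all assumptions made for illustration.

```python
from typing import Sequence


def fires(samples: Sequence[float], threshold: float, duration: int) -> bool:
    """Return True if `samples` contains at least `duration` consecutive
    values above `threshold`. One sample per minute is assumed."""
    run = 0  # length of the current streak above the threshold
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= duration:
            return True
    return False


# Hypothetical CPU usage (%), one sample per minute, with a 2-minute spike.
cpu = [40, 42, 95, 97, 45, 41, 43]

print(fires(cpu, threshold=90, duration=1))  # True: the short spike alerts
print(fires(cpu, threshold=90, duration=5))  # False: the spike is ignored
```

The trade-off is detection latency: with a 5-minute duration, a real sustained problem is reported roughly 5 minutes after it starts rather than after 1.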
② Scope — How far did it spread?
Goal: within 2 minutes, identify how many agents, services, and transactions are affected.
Menus used: Application dashboard or the team's Flexboard.
- On the Application dashboard, check at a glance:
  - Did Apdex drop sharply?
  - Is TPS abnormally low or high?
  - Did average response time spike?
  - Any increase in error rate?
- If multiple agents are affected together, it's more likely an infrastructure or dependency issue (DB, network, etc.).
- If only one agent is affected, it's more likely a node/instance-level issue.
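The multi-agent versus single-agent rule of thumb can be written down as a small triage helper. This is only a sketch: `classify_scope` and the agent names are hypothetical, and the per-agent anomaly flags are assumed to be read by hand from the dashboard rather than fetched from any WhaTap API.

```python
from typing import Mapping


def classify_scope(anomalous: Mapping[str, bool]) -> str:
    """Map per-agent anomaly flags to a first guess about the blast radius."""
    affected = [name for name, is_bad in anomalous.items() if is_bad]
    if not affected:
        return "no agents affected"
    if len(affected) == 1:
        return f"node/instance-level issue likely ({affected[0]})"
    return (
        "infrastructure or dependency issue likely "
        f"({len(affected)} of {len(anomalous)} agents affected)"
    )


# Hypothetical agents and flags noted down while scanning the dashboard.
print(classify_scope({"order-api-1": True, "order-api-2": True, "order-api-3": False}))
print(classify_scope({"order-api-1": False, "order-api-2": True, "order-api-3": False}))
```

The result is a starting hypothesis for step ③, not a conclusion: a shared dependency can also degrade one agent earlier than the others.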
Check whether a recent deploy happened. If the anomaly started right after a deploy, a rollback is often the fastest mitigation.
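"Right after a deploy" can be made concrete with a simple time-window check. The sketch below assumes a 30-minute window and hypothetical timestamps; the function name and the window length are illustrative choices, not WhaTap settings.

```python
from datetime import datetime, timedelta


def started_after_deploy(
    anomaly_start: datetime,
    deploy_time: datetime,
    window: timedelta = timedelta(minutes=30),  # assumed window, tune per team
) -> bool:
    """Return True if the anomaly began within `window` after the deploy."""
    delta = anomaly_start - deploy_time
    return timedelta(0) <= delta <= window


# Hypothetical times: deploy at 14:00, anomaly first visible at 14:07.
deploy = datetime(2024, 1, 15, 14, 0)
anomaly = datetime(2024, 1, 15, 14, 7)

print(started_after_deploy(anomaly, deploy))  # True: rollback is a candidate
```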
③ Root cause — Where did it slow down?
Goal: identify which layer (code / SQL / external call / resource) is the culprit.
Menus used: Hitmap transactions → Transaction trace → (if needed) Active stack.
- In Hitmap transactions, drag-select clusters of points in the slow window.
  - Read patterns from color and position first: slow only at a specific time, or slow overall?
- Click the slowest of the selected transactions → open Transaction trace.
  - Call relations: which method/SQL/HTTP call consumes time?
  - Stack samples: where in the code did it wait?
- If DB is suspected → check DB connection state and Slow SQL together.
- If GC/memory is suspected → check heap memory and active stack.
- If thread deadlock/blocking is suspected → open Thread list & dump in Instance Performance Management (see Using Instance Performance Management).
Shorten with AI analysis
When interpreting traces and stacks manually takes too long, WhaTap's AI features can get you to a first hypothesis quickly.
- AI active stack analysis — on the Active stack screen, select a suspect stack and AI summarizes the bottleneck and wait cause in natural language. Even without reading thread dumps, you learn "which thread is stuck where." Especially useful for distinguishing GC / lock / I/O waits.
- AI browser error stack analysis — in frontend error tracking, AI interprets error stacks and suggests code-level causes and fix directions.
- WhaTap AI Chatbot / MCP — ask in natural language ("Is this DB connection pool value within normal range?", "How do I respond when this metric spikes?") and get answers grounded in docs, guides, and similar cases (beta, Korean-supported).
Treat AI output as a first hypothesis. Final judgment must cross-check actual traces, stacks, and logs. In particular, AI doesn't know the business context (recent deploys, traffic events).
A single trace is a "case." To confirm a pattern, compare multiple traces from the same time window. If one trace's characteristics show up in others, it's a common cause; otherwise it's an individual issue.
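The comparison can be sketched as follows: find the step that dominates each trace, then check how many traces share it. The trace structure (step name mapped to elapsed milliseconds), the step names, and the 60% cutoff are assumptions for illustration; WhaTap does not export traces in this shape as far as this guide states.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Trace = Dict[str, float]  # step name -> elapsed milliseconds


def dominant_step(trace: Trace) -> str:
    """Return the step that consumed the most time in one trace."""
    return max(trace, key=lambda step: trace[step])


def common_cause(
    traces: List[Trace], min_share: float = 0.6
) -> Tuple[Optional[str], float]:
    """Return (step, share) if one step dominates at least `min_share` of the
    traces, otherwise (None, share). Assumes `traces` is non-empty."""
    counts = Counter(dominant_step(t) for t in traces)
    step, hits = counts.most_common(1)[0]
    share = hits / len(traces)
    return (step, share) if share >= min_share else (None, share)


# Hypothetical traces taken from the same slow window.
traces = [
    {"sql:select_orders": 2400, "http:payment-api": 120, "method:render": 80},
    {"sql:select_orders": 1900, "http:payment-api": 150, "method:render": 60},
    {"sql:select_orders": 2100, "http:payment-api": 90, "method:render": 70},
    {"sql:select_orders": 40, "http:payment-api": 3100, "method:render": 75},
]

print(common_cause(traces))  # ('sql:select_orders', 0.75)
```

Here three of four traces are dominated by the same SQL step, which points to a common cause; the fourth trace would be followed up as an individual issue.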
④ Mitigate — Emergency response
Goal: take temporary action to minimize user impact.
Default actions by cause:
| Cause type | Quick mitigation |
|---|---|
| Recent deploy issue | Roll back to the previous version |
| DB connection pool exhaustion | Increase pool size or kill slow queries |
| Single-node failure | Restart the node or remove it from the load balancer |
| Memory leak | Restart the process (full fix during postmortem) |
| External API outage | Temporarily disable the feature / fail over |
| Traffic spike | Verify autoscaling, review rate limits |
Mitigation is about reducing user impact now; root cause analysis is about understanding why it happened. Spending time on investigation while users are still affected only increases damage. Mitigate first; complete the root cause analysis in ⑤ Postmortem.
⑤ Postmortem — Prevent recurrence
Goal: define prevention actions and document them.
Menus used: Event history (alert firing log) plus trace history.
- In Event history, capture the time window and the events that fired for this incident.
- From trace/log screens, collect evidence screenshots (hard to reproduce later).
- Organize the postmortem document:
  - Timeline: alert fired → detected → mitigated
  - Cause: what you found in step ③
  - Impact: how many minutes, users, or transactions were affected
  - Prevention: event rule, alert policy, code/infra changes
- Apply event rule improvements immediately:
  - Adjust thresholds for faster detection
  - Add new alert rules
  - Reassign alert reception tags
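The Timeline and Impact entries of the postmortem document reduce to three timestamps and the gaps between them. The sketch below uses hypothetical times and field names; it also assumes the alert firing time is a reasonable stand-in for when user impact began, which may understate the real impact window.

```python
from datetime import datetime
from typing import Dict


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed minutes from `start` to `end`."""
    return (end - start).total_seconds() / 60


def incident_durations(
    alert_fired: datetime, detected: datetime, mitigated: datetime
) -> Dict[str, float]:
    """Compute the durations for the postmortem timeline."""
    if not alert_fired <= detected <= mitigated:
        raise ValueError("expected order: alert_fired <= detected <= mitigated")
    return {
        "time_to_detect_min": minutes_between(alert_fired, detected),
        "time_to_mitigate_min": minutes_between(detected, mitigated),
        "alert_to_mitigation_min": minutes_between(alert_fired, mitigated),
    }


# Hypothetical incident: times copied from Event history and the team chat.
durations = incident_durations(
    alert_fired=datetime(2024, 1, 15, 9, 0),
    detected=datetime(2024, 1, 15, 9, 1),
    mitigated=datetime(2024, 1, 15, 9, 25),
)
for name, minutes in durations.items():
    print(f"{name}: {minutes:.0f}")
```

Tracking these numbers across incidents shows whether threshold adjustments and new alert rules are actually shortening detection time.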
Make it stick in the team
Incident response is less about one skilled engineer and more about the whole team following the same procedure.
- Pin this guide in the team wiki (next to the shared dashboard from Quick Win 2)
- During on-call handover, share the traces and event history used in the past week's incidents
- Run postmortem retros quarterly — promote recurring patterns into event rules
Next steps
- Share via reporting → Performance reporting scenario
- Reduce deploy-time incidents → Release verification scenario
- Proactive AI anomaly detection → use Anomaly Detection (advanced)