
Incident response scenario

When an alert fires and "where do I start?" becomes an ad hoc decision every time, response slows down. This guide walks the incident responder through a single, consistent flow — detection, mitigation, postmortem — and maps each step to the WhaTap menus to use.

Who this guide is for
  • Engineers currently on-call or preparing to take on-call duties
  • Leads or SREs who want to establish an incident response playbook for the team
  • Teammates who "get the alerts but aren't sure what to look at next"

Prerequisites

These three Quick Wins must be in place for real operations:

The five steps

① Detect → ② Scope → ③ Root cause → ④ Mitigate → ⑤ Postmortem

① Detect — Validate the alert

Goal: within 1 minute, decide whether the alert is a real anomaly.

  1. From the Slack/Email/SMS/mobile notification, check:
    • Level (Critical / Warning)
    • Project and agent name
    • Which event rule triggered (e.g., "CPU usage threshold exceeded")
  2. Click the deep link in the notification to jump straight into the WhaTap screen.

Filtering out false positives

For frequent false positives such as short spikes, lengthen the event rule's duration condition (e.g., 1 min → 5 min) so the rule fires only when the condition persists. Leaving repeated false positives unattended erodes the team's trust in alerts.
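
To see why a longer duration condition filters out short spikes, here is a minimal sketch of the idea. It is not WhaTap's rule engine: the function name `sustained_breach`, the one-sample-per-minute data, and the 90% threshold are all assumptions made for illustration.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """Return True only if `threshold` is exceeded for `min_consecutive` samples in a row."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical CPU usage (%), one sample per minute, with a two-minute spike.
cpu = [40, 95, 96, 42, 41, 43, 40]

print(sustained_breach(cpu, threshold=90, min_consecutive=1))  # a 1-minute condition fires on the spike
print(sustained_breach(cpu, threshold=90, min_consecutive=5))  # a 5-minute condition ignores it
```

The trade-off is detection delay: a longer duration suppresses noise, but a real incident is also reported that much later.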

② Scope — How far did it spread?

Goal: within 2 minutes, identify how many agents, services, and transactions are affected.

Menus used: Application dashboard or the team's Flexboard.

  1. On the Application dashboard, check at a glance:
    • Did Apdex drop sharply?
    • Is TPS abnormally low or high?
    • Did average response time spike?
    • Any increase in error rate?
  2. If multiple agents are affected together, it's more likely an infrastructure or dependency issue (DB, network, etc.).
  3. If only one agent is affected, it's more likely a node/instance-level issue.

Outage vs. deploy

Check whether a recent deploy happened. If the anomaly started right after a deploy, a rollback is often the fastest mitigation.
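
The scoping heuristics in this step can be written down as a small decision helper. This is a sketch of the reasoning only. The function names and the 30-minute "right after a deploy" window are assumptions, and the inputs would come from whatever the dashboard shows you, not from a WhaTap API.

```python
from datetime import datetime, timedelta
from typing import Optional


def classify_scope(affected_agents: int) -> str:
    """Map the number of affected agents to the more likely failure domain."""
    if affected_agents <= 0:
        return "no agents affected"
    if affected_agents == 1:
        return "likely node/instance-level issue"
    return "likely infrastructure or dependency issue"


def started_after_deploy(
    anomaly_start: datetime,
    last_deploy: Optional[datetime],
    window: timedelta = timedelta(minutes=30),  # assumed definition of "right after"
) -> bool:
    """True if the anomaly began within `window` after the most recent deploy."""
    if last_deploy is None:
        return False
    return timedelta(0) <= anomaly_start - last_deploy <= window


print(classify_scope(4))
print(started_after_deploy(datetime(2024, 1, 1, 10, 10), datetime(2024, 1, 1, 10, 0)))
```

Both results are hints, not verdicts: a single affected agent can still be the first victim of a wider problem.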

③ Root cause — Where did it slow down?

Goal: identify which layer (code / SQL / external call / resource) is the culprit.

Menus used: Hitmap transactions → Transaction trace → (if needed) Active stack.

  1. In Hitmap transactions, drag-select clusters of points in the slow window.
    • Read patterns from color and position first: slow only at a specific time, or slow overall?
  2. Click the slowest of the selected transactions → open Transaction trace.
    • Call relations: which method/SQL/HTTP call consumes time?
    • Stack samples: where in the code did it wait?
  3. If DB is suspected → check DB connection state and Slow SQL together.
  4. If GC/memory is suspected → check heap memory and active stack.
  5. If thread deadlock/blocking is suspected → open Thread list & dump in Instance Performance Management (see Using Instance Performance Management).
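
Reading a trace to find the culprit layer amounts to summing elapsed time per layer and picking the largest. The sketch below assumes a trace has been copied out by hand as `(layer, elapsed_ms)` pairs. That format and the layer labels are hypothetical, not a WhaTap export format.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, int]  # (layer, elapsed milliseconds), a hypothetical hand-made format


def time_by_layer(steps: List[Step]) -> Dict[str, int]:
    """Sum elapsed time per layer (e.g. code / sql / http)."""
    totals: Dict[str, int] = defaultdict(int)
    for layer, elapsed_ms in steps:
        totals[layer] += elapsed_ms
    return dict(totals)


def dominant_layer(steps: List[Step]) -> Optional[str]:
    """Return the layer that consumed the most time, or None for an empty trace."""
    totals = time_by_layer(steps)
    if not totals:
        return None
    return max(totals, key=lambda layer: totals[layer])


# Illustrative numbers only.
trace = [("code", 120), ("sql", 1800), ("http", 300), ("sql", 900)]
print(time_by_layer(trace))
print(dominant_layer(trace))
```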

Shorten with AI analysis

When interpreting traces and stacks manually takes too long, WhaTap's AI features can get you to a first hypothesis quickly.

  • AI active stack analysis — on the Active stack screen, select a suspect stack and AI summarizes the bottleneck and wait cause in natural language. Even without reading thread dumps, you learn "which thread is stuck where." Especially useful for distinguishing GC / lock / I/O waits.
  • AI browser error stack analysis — in frontend error tracking, AI interprets error stacks and suggests code-level causes and fix directions.
  • WhaTap AI Chatbot / MCP — ask in natural language ("Is this DB connection pool value within normal range?", "How do I respond when this metric spikes?") and get answers grounded in docs, guides, and similar cases (beta, Korean-supported).

Note

Treat AI output as a first hypothesis. Final judgment must cross-check actual traces, stacks, and logs. In particular, AI doesn't know the business context (recent deploys, traffic events).

When one trace isn't enough

A single trace is a "case." To confirm a pattern, compare multiple traces from the same time window. If one trace's characteristics show up in others, it's a common cause; otherwise it's an individual issue.
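
The comparison above can be made concrete by noting the slowest step of each trace and checking whether one step dominates across them. This is a rough sketch: the step names are made up, and the 60% cutoff (`min_share`) is an arbitrary assumption you would tune for your own service.

```python
from collections import Counter
from typing import List, Optional


def common_cause(dominant_steps: List[str], min_share: float = 0.6) -> Optional[str]:
    """Return the step that is slowest in at least `min_share` of the traces, else None.

    `dominant_steps` holds one entry per trace: the name of its slowest step.
    """
    if not dominant_steps:
        return None
    step, count = Counter(dominant_steps).most_common(1)[0]
    return step if count / len(dominant_steps) >= min_share else None


# Slowest step noted from five traces in the same time window (hypothetical names).
observed = ["SELECT orders", "SELECT orders", "GET /pay", "SELECT orders", "SELECT orders"]
print(common_cause(observed))         # one step dominates: likely a common cause
print(common_cause(["a", "b", "c"]))  # no shared pattern: treat as individual issues
```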

④ Mitigate — Emergency response

Goal: take temporary action to minimize user impact.

Default actions by cause:

Table | Quick mitigation by root cause
Cause type | Quick mitigation
Recent deploy issue | Roll back to the previous version
DB connection pool exhaustion | Increase pool size or kill slow queries
Single-node failure | Restart the node or remove it from the load balancer
Memory leak | Restart the process (full fix during postmortem)
External API outage | Temporarily disable the feature / fail over
Traffic spike | Verify autoscaling, review rate limits

Separate mitigation from root cause analysis

Mitigation is about reducing user impact now; root cause analysis is about understanding why it happened. Spending time on investigation while users are still affected only increases damage. Mitigate first; complete the root cause in ⑤ Postmortem.
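
If the team keeps its playbook in code (for a chat bot or runbook tool, say), the mitigation table in this step maps naturally to a lookup. The dictionary keys and the fallback message below are assumptions for illustration. Only the actions themselves come from the table.

```python
# Keys are hypothetical identifiers; the actions mirror the quick mitigation table.
QUICK_MITIGATION = {
    "recent_deploy": "Roll back to the previous version",
    "db_pool_exhaustion": "Increase pool size or kill slow queries",
    "single_node_failure": "Restart the node or remove it from the load balancer",
    "memory_leak": "Restart the process (full fix during postmortem)",
    "external_api_outage": "Temporarily disable the feature / fail over",
    "traffic_spike": "Verify autoscaling, review rate limits",
}


def quick_mitigation(cause: str) -> str:
    """Return the default action for a cause type, or an assumed fallback message."""
    return QUICK_MITIGATION.get(cause, "No default action listed: escalate")


print(quick_mitigation("memory_leak"))
print(quick_mitigation("unknown_cause"))
```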

⑤ Postmortem — Prevent recurrence

Goal: define prevention actions and document them.

Menus used: Event history (alert firing log) plus trace history.

  1. In Event history, capture the time window and the events that fired for this incident.
  2. From trace/log screens, collect evidence screenshots (hard to reproduce later).
  3. Organize the postmortem document:
    • Timeline: alert fired → detected → mitigated
    • Cause: what you found in step ③
    • Impact: how many minutes, users, or transactions were affected
    • Prevention: event rule, alert policy, code/infra changes
  4. Apply event rule improvements immediately:
    • Adjust thresholds for faster detection
    • Add new alert rules
    • Reassign alert reception tags
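
For the timeline and impact entries of the postmortem document, the durations can be derived from three timestamps. The sketch below assumes impact starts when the alert fires, which understates it if detection lagged the real onset. The function and key names are illustrative.

```python
from datetime import datetime
from typing import Dict


def timeline_minutes(fired: datetime, detected: datetime, mitigated: datetime) -> Dict[str, float]:
    """Compute postmortem durations in minutes from the three timeline timestamps."""
    def minutes(start: datetime, end: datetime) -> float:
        return (end - start).total_seconds() / 60

    return {
        "time_to_detect_min": minutes(fired, detected),
        "time_to_mitigate_min": minutes(detected, mitigated),
        # Assumption: impact is counted from the moment the alert fired.
        "total_impact_min": minutes(fired, mitigated),
    }


# Hypothetical incident timestamps.
result = timeline_minutes(
    fired=datetime(2024, 1, 1, 10, 0),
    detected=datetime(2024, 1, 1, 10, 3),
    mitigated=datetime(2024, 1, 1, 10, 20),
)
print(result)
```

Tracking these numbers across incidents shows whether threshold adjustments actually shorten detection time.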

Make it stick in the team

Incident response is less about one skilled engineer and more about the whole team following the same procedure.

  • Pin this guide in the team wiki (next to the shared dashboard from Quick Win 2)
  • During on-call handover, share "traces and event history used last week"
  • Run postmortem retros quarterly — promote recurring patterns into event rules

Next steps