Incident root cause analysis

Investigate an incident by reconciling telemetry, changes, and human reports into an evidence-backed explanation of what failed and why.

Metadata

  • Pattern id: incident-root-cause-analysis
  • Pattern family: Investigate / Reconcile / Verify
  • Problem structure: Discrepancy investigation (discrepancy-investigation)
  • Domains: Engineering (engineering), Operations (operations), Support (support)

Workflow goal

Produce a defensible root-cause narrative, reconciled timeline, and prioritized next checks or remediations for an incident.

Inputs

Incident trigger

  • Description: A declared incident, anomaly, or escalated case that requires explanation beyond initial triage.
  • Kind: case
  • Required: Yes
  • Examples:
  • Sev-2 service degradation
  • Escalated customer-impacting support case

Operational evidence

  • Description: Logs, metrics, traces, alerts, and system state snapshots relevant to the incident window.
  • Kind: telemetry
  • Required: Yes
  • Examples:
  • Request latency traces
  • Queue depth metrics
  • Database failover logs

Change and context history

  • Description: Deployments, configuration updates, tickets, maintenance actions, and recent environment changes.
  • Kind: change-history
  • Required: Yes
  • Examples:
  • Deployment records
  • Feature flag changes
  • Runbook interventions

Human reports

  • Description: Operator notes, support escalations, and stakeholder observations that may explain or constrain hypotheses.
  • Kind: narrative
  • Required: No
  • Examples:
  • Incident channel notes
  • Customer symptom reports

Outputs

Root-cause hypothesis set

  • Description: Ranked causal explanations with supporting and disconfirming evidence plus confidence notes.
  • Kind: analysis
  • Required: Yes
  • Examples:
  • Primary cause with contributing conditions
  • Competing hypotheses awaiting one confirming check
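
One plausible shape for a hypothesis record, as a minimal Python sketch; the field names are illustrative assumptions, not part of the pattern:

```python
from dataclasses import dataclass, field
from enum import Enum


class HypothesisStatus(Enum):
    OPEN = "open"            # still plausible, checks pending
    CONFIRMED = "confirmed"  # corroborated by independent evidence
    REJECTED = "rejected"    # disconfirmed, retained for the audit trail


@dataclass
class RootCauseHypothesis:
    """One ranked causal explanation with its evidence links."""
    summary: str                                                  # e.g. "connection pool exhausted after deploy"
    supporting_evidence: list[str] = field(default_factory=list)  # evidence IDs
    disconfirming_evidence: list[str] = field(default_factory=list)
    pending_checks: list[str] = field(default_factory=list)       # checks that would confirm or reject
    confidence: float = 0.0                                       # updated as evidence arrives
    status: HypothesisStatus = HypothesisStatus.OPEN
```

Keeping disconfirming evidence and pending checks on the record is what makes "competing hypotheses awaiting one confirming check" representable rather than implicit.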

Reconciled incident timeline

  • Description: A normalized sequence of relevant events across systems and actors.
  • Kind: timeline
  • Required: Yes
  • Examples:
  • Timeline linking deploy, alert, mitigation, and customer reports
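
A matching sketch for one normalized timeline entry; the fields are assumptions for illustration:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEvent:
    """One entry in the reconciled, normalized incident timeline."""
    event_id: str                  # e.g. "evt-031"
    occurred_at: datetime          # normalized to a single reference clock (UTC)
    source: str                    # system or actor: "deploy-pipeline", "pager", "operator"
    kind: str                      # "deploy", "alert", "mitigation", "customer-report", ...
    description: str               # what was observed, stated as fact rather than inference
    evidence_refs: list[str] = field(default_factory=list)  # links back to raw evidence
```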

Prioritized next checks or remediations

  • Description: Proposed investigations, corrective actions, or monitoring changes derived from the analysis.
  • Kind: recommendation
  • Required: Yes
  • Examples:
  • Validate rollback path before declaring closure
  • Add detection around the failing dependency

Environment

Operates in high-pressure technical and service environments where evidence is fragmented, timelines are noisy, and premature conclusions can worsen the incident.

Systems

  • Observability platforms
  • Incident management systems
  • Change management records
  • Support case systems

Actors

  • Incident commander
  • Responding engineers or operators
  • Service owners
  • Support leads

Constraints

  • Preserve evidence integrity and timestamps as collected.
  • Distinguish observed facts from inferred causes.
  • Avoid declaring closure while material uncertainty remains.
  • Keep the narrative inspectable for postmortem and audit review.

Assumptions

  • Relevant systems retain logs and state long enough for investigation.
  • Time sources can be normalized sufficiently to build a coherent timeline.
  • Human responders are available to validate or reject recommended conclusions.

Capability requirements

  • Retrieval (retrieval): Investigators must gather evidence from multiple systems and records before causes can be narrowed.
  • Discrepancy analysis (discrepancy-analysis): The workflow centers on explaining mismatches between expected and observed behavior.
  • Verification (verification): Candidate causes need corroboration against independent evidence rather than plausible storytelling alone.
  • Memory and state tracking (memory-and-state-tracking): The workflow must preserve evolving hypotheses, evidence links, and timeline state across multiple steps; a minimal case-memory sketch follows this list.
  • Coordination (coordination): Multiple responders and systems contribute evidence, so handoffs and ownership need structured coordination.
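
A minimal case-memory sketch under those requirements; it could hold the RootCauseHypothesis records from the earlier sketch, but uses plain dicts here to stay self-contained:

```python
from dataclasses import dataclass, field


@dataclass
class CaseMemory:
    """Shared investigation state, so handoffs do not lose context."""
    evidence: dict[str, dict] = field(default_factory=dict)    # evidence ID -> metadata (source, timestamp, link)
    hypotheses: dict[str, dict] = field(default_factory=dict)  # hypothesis ID -> record, rejected ones included
    open_checks: list[str] = field(default_factory=list)       # planned checks not yet run
    decisions: list[str] = field(default_factory=list)         # append-only analyst decisions and rationale

    def reject_hypothesis(self, hyp_id: str, rationale: str) -> None:
        """Mark a hypothesis rejected but keep it visible, so it is not re-investigated after a handoff."""
        self.hypotheses[hyp_id]["status"] = "rejected"
        self.decisions.append(f"rejected {hyp_id}: {rationale}")
```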

Execution architecture

  • Orchestrated multi-agent (orchestrated-multi-agent): Specialized retrieval, timeline-building, and verification roles are common when evidence volume and responder concurrency are high.
  • Human in the loop (human-in-the-loop): Human responders routinely validate hypotheses, weigh operational context, and approve incident conclusions.

Autonomy profile

  • Level: Recommendation only (recommendation-only)
  • Reversibility: Analytical conclusions are editable, but accepting the wrong root cause can drive hard-to-reverse remediation and delay real recovery.
  • Escalation: Escalate when evidence is incomplete, multiple plausible causes remain after planned checks, or the proposed remediation could materially increase user or system risk.

Human checkpoints

  • Confirm incident scope and critical systems before narrowing hypotheses.
  • Review ranked root-cause hypotheses before declaring the primary cause.
  • Approve remediation or closure actions derived from the analysis.

Risk and governance

  • Risk level: High (high)
  • Failure impact: Incorrect diagnosis can prolong outages, trigger harmful remediations, misinform postmortems, and create false confidence about system reliability.
  • Auditability: Preserve normalized timelines, evidence links, rejected hypotheses, human overrides, and final causal rationale for postmortem review.

Approval requirements

  • The incident commander must approve the declared root cause before final closure or external reporting.
  • Service owners must approve corrective actions that materially alter production systems.

Privacy

  • Minimize exposure of personal data in logs, tickets, and customer reports during investigation.
  • Restrict copied evidence to incident workspaces with appropriate retention controls.

Security

  • Use read-only access where possible when collecting production evidence.
  • Protect credentials and privileged diagnostics from leaking into analysis artifacts.

Notes: High-risk governance is justified because the pattern shapes consequential remediation and formal incident narratives.

Why agentic

  • The workflow must iteratively form, test, and narrow causal hypotheses as new evidence arrives.
  • Evidence comes from heterogeneous systems and people, making stateful reconciliation and coordination essential.
  • Static dashboards rarely preserve enough structured reasoning to explain why one cause is more credible than another.

Failure modes

Premature fixation on a plausible but wrong root cause

  • Impact: Teams pursue the wrong remediation and leave the real fault unresolved.
  • Severity: high
  • Detectability: medium
  • Mitigations:
  • Keep multiple hypotheses visible until disconfirming checks are complete.
  • Require evidence that explicitly links cause to observed impact.
  • Review the final narrative with an incident lead before closure.

Incident timeline is incomplete or incorrectly ordered

  • Impact: Causal reasoning becomes distorted and contributing factors are missed.
  • Severity: high
  • Detectability: medium
  • Mitigations:
  • Normalize time sources and highlight gaps in the event sequence, as sketched after this list.
  • Reconcile machine telemetry with human reports before finalizing the timeline.
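
A minimal sketch of that normalization and gap check, assuming timezone-aware timestamps and per-source clock offsets measured elsewhere:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical clock offsets per source (e.g. measured against NTP).
CLOCK_OFFSETS = {"app-logs": timedelta(seconds=-2), "lb-metrics": timedelta(0)}


def normalize(source: str, ts: datetime) -> datetime:
    """Shift a source timestamp onto one UTC reference timeline."""
    return ts.astimezone(timezone.utc) + CLOCK_OFFSETS.get(source, timedelta(0))


def find_gaps(events: list[tuple[str, datetime]], max_gap: timedelta) -> list[tuple[datetime, datetime]]:
    """Return spans of the ordered sequence with no evidence at all, so they are highlighted rather than smoothed over."""
    times = sorted(normalize(src, ts) for src, ts in events)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > max_gap]
```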

State is lost across responders or investigation sessions

  • Impact: Previously rejected hypotheses reappear and evidence handoffs become inconsistent.
  • Severity: medium
  • Detectability: high
  • Mitigations:
  • Maintain shared case memory for evidence, conclusions, and open checks.
  • Log analyst decisions and rationale as part of the investigation record.

Support or operator observations are ignored because they are less structured

  • Impact: The analysis misses symptoms that would have ruled out or strengthened a hypothesis.
  • Severity: medium
  • Detectability: medium
  • Mitigations:
  • Capture human observations as first-class evidence linked to the timeline, as in the snippet after this list.
  • Require explicit notes when reports are excluded from the final narrative.
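
For the first mitigation, a usage snippet reusing the hypothetical CaseMemory sketch from the capability requirements section; all IDs and values are illustrative:

```python
memory = CaseMemory()  # from the case-memory sketch above

# Record an operator observation with the same structure as machine
# telemetry, so adjudication weighs it alongside metrics instead of
# dropping it because it arrived as free text.
memory.evidence["ev-0192"] = {
    "kind": "narrative",                    # vs. "telemetry" or "change-history"
    "source": "incident channel note",
    "observed_at": "2024-05-01T12:04:00Z",  # hypothetical timestamp
    "summary": "Retries spiked several minutes before the first alert fired",
    "linked_timeline_event": "evt-031",     # hypothetical timeline entry ID
}
```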

Evaluation

Success metrics

  • Agreement rate between initial analysis and final adjudicated root cause.
  • Time to first defensible hypothesis with cited supporting evidence.
  • Percentage of incidents with a complete reconciled timeline before closure.
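
The first and third metrics reduce to simple ratios once outcomes are recorded per incident; a sketch, with the record fields assumed for illustration:

```python
def agreement_rate(incidents: list[dict]) -> float:
    """Fraction of adjudicated incidents where the initial analysis matched the final root cause."""
    judged = [i for i in incidents if i.get("final_root_cause")]
    hits = sum(1 for i in judged if i["initial_root_cause"] == i["final_root_cause"])
    return hits / len(judged) if judged else 0.0


def timeline_completeness_pct(incidents: list[dict]) -> float:
    """Percentage of closed incidents that had a complete reconciled timeline before closure."""
    closed = [i for i in incidents if i.get("closed")]
    done = sum(1 for i in closed if i.get("timeline_complete"))
    return 100.0 * done / len(closed) if closed else 0.0
```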

Quality criteria

  • The analysis separates observations, inferences, and recommended actions.
  • Competing explanations are documented until they are disproved or deprioritized.
  • Evidence supporting the declared root cause is inspectable and durable.

Robustness checks

  • Test against incidents with overlapping concurrent failures to ensure hypotheses stay distinguishable.
  • Test with missing telemetry and verify the workflow degrades into explicit uncertainty rather than false certainty.
  • Test with conflicting customer reports and system metrics to ensure both remain visible for adjudication.

Benchmark notes: Strong evaluation measures both diagnostic accuracy and the discipline of evidence preservation under operational pressure.

Implementation notes

Orchestration notes

  • Keep retrieval, timeline reconciliation, and causal verification as explicit stages with shared case memory; a skeletal pipeline is sketched below.
  • Preserve rejected hypotheses to reduce repeated investigation loops during handoffs.
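
A skeletal version of that staging, with stage bodies elided; the function names are assumptions, and each stage passes the shared memory forward so rejected hypotheses survive handoffs:

```python
def retrieval(memory: dict) -> dict:
    # Gather telemetry, change records, and human reports as evidence.
    return memory


def timeline_reconciliation(memory: dict) -> dict:
    # Normalize timestamps and order events into a single timeline.
    return memory


def causal_verification(memory: dict) -> dict:
    # Test ranked hypotheses against independent evidence.
    return memory


def run_investigation(memory: dict) -> dict:
    for stage in (retrieval, timeline_reconciliation, causal_verification):
        memory = stage(memory)
    return memory
```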

Integration notes

  • Typical integrations include observability stacks, incident systems, change logs, and support tooling.
  • Architecture should remain vendor-neutral so the pattern does not collapse into a specific monitoring platform.

Deployment notes

  • Prefer read-only evidence collection paths in production.
  • Align retention of investigation artifacts with postmortem and compliance obligations.

References

Example domains

  • Engineering (engineering): Explain a production outage by reconciling deployments, traces, and dependency failures.
  • Operations (operations): Analyze a process breakdown by aligning handoff records, queue metrics, and operator notes.
  • Support (support): Investigate an escalated customer issue by combining case history, service signals, and responder observations.

Grounded instances

Canonical source

  • data/patterns/investigate-reconcile-verify/incident-root-cause-analysis.yaml