Incident root cause analysis¶
Investigate an incident by reconciling telemetry, changes, and human reports into an evidence-backed explanation of what failed and why.
Metadata¶
- Pattern id: incident-root-cause-analysis
- Pattern family: Investigate / Reconcile / Verify
- Problem structure: Discrepancy investigation (discrepancy-investigation)
- Domains: Engineering (engineering), Operations (operations), Support (support)
Workflow goal¶
Produce a defensible root-cause narrative, reconciled timeline, and prioritized next checks or remediations for an incident.
Inputs¶
Incident trigger¶
- Description: A declared incident, anomaly, or escalated case that requires explanation beyond initial triage.
- Kind: case
- Required: Yes
- Examples:
- Sev-2 service degradation
- Escalated customer-impacting support case
Operational evidence¶
- Description: Logs, metrics, traces, alerts, and system state snapshots relevant to the incident window.
- Kind: telemetry
- Required: Yes
- Examples:
- Request latency traces
- Queue depth metrics
- Database failover logs
Change and context history¶
- Description: Deployments, configuration updates, tickets, maintenance actions, and recent environment changes.
- Kind: change-history
- Required: Yes
- Examples:
- Deployment records
- Feature flag changes
- Runbook interventions
Human reports¶
- Description: Operator notes, support escalations, and stakeholder observations that may explain or constrain hypotheses.
- Kind: narrative
- Required: No
- Examples:
- Incident channel notes
- Customer symptom reports
Outputs¶
Root-cause hypothesis set¶
- Description: Ranked causal explanations with supporting and disconfirming evidence plus confidence notes.
- Kind: analysis
- Required: Yes
- Examples:
- Primary cause with contributing conditions
- Competing hypotheses awaiting one confirming check
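Concretely, a hypothesis set like this can be kept as structured records rather than free text. The sketch below is illustrative, not part of the pattern: field names, the 0–1 confidence scale, and the status values are assumptions. It keeps supporting and disconfirming evidence attached to each candidate cause and leaves rejected hypotheses visible rather than deleting them.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """A candidate root cause with linked evidence and a confidence note."""
    cause: str
    supporting: list = field(default_factory=list)     # evidence ids that corroborate
    disconfirming: list = field(default_factory=list)  # evidence ids that contradict
    confidence: float = 0.0                            # analyst-assigned, 0..1
    status: str = "open"                               # open | confirmed | rejected

def rank(hypotheses):
    """Order non-rejected hypotheses by confidence, keeping rejected ones
    visible at the end of the list instead of dropping them."""
    live = [h for h in hypotheses if h.status != "rejected"]
    rejected = [h for h in hypotheses if h.status == "rejected"]
    return sorted(live, key=lambda h: h.confidence, reverse=True) + rejected
```

Keeping rejected hypotheses in the ranked output supports the constraint that the narrative stays inspectable for postmortem review.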
Reconciled incident timeline¶
- Description: A normalized sequence of relevant events across systems and actors.
- Kind: timeline
- Required: Yes
- Examples:
- Timeline linking deploy, alert, mitigation, and customer reports
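A reconciled timeline of this kind can be sketched as a merge of per-system event feeds normalized to UTC before ordering. The `ts`/`source`/`event` field names below are assumptions for illustration; the sketch expects offset-aware ISO-8601 timestamps, since naive timestamps are exactly what makes cross-system ordering unreliable.

```python
from datetime import datetime, timezone

def reconcile(events):
    """Normalize event timestamps to UTC and return one ordered timeline.

    `events` is an iterable of dicts with offset-aware ISO-8601 `ts`,
    `source`, and `event` keys (an illustrative schema, not a fixed one).
    """
    normalized = []
    for e in events:
        ts = datetime.fromisoformat(e["ts"]).astimezone(timezone.utc)
        normalized.append({**e, "ts": ts})
    # Sort across all sources so deploys, alerts, and reports interleave.
    return sorted(normalized, key=lambda e: e["ts"])
```

A deploy logged at `12:00:00+02:00` and an alert at `09:30:00+00:00` then order correctly as alert before deploy, which a naive string sort would get wrong.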
Recommended follow-up actions¶
- Description: Proposed investigations, corrective actions, or monitoring changes derived from the analysis.
- Kind: recommendation
- Required: Yes
- Examples:
- Validate rollback path before declaring closure
- Add detection around the failing dependency
Environment¶
Operates in high-pressure technical and service environments where evidence is fragmented, timelines are noisy, and premature conclusions can worsen the incident.
Systems¶
- Observability platforms
- Incident management systems
- Change management records
- Support case systems
Actors¶
- Incident commander
- Responding engineers or operators
- Service owners
- Support leads
Constraints¶
- Preserve evidence integrity and timestamps as collected.
- Distinguish observed facts from inferred causes.
- Avoid declaring closure while material uncertainty remains.
- Keep the narrative inspectable for postmortem and audit review.
Assumptions¶
- Relevant systems retain logs and state long enough for investigation.
- Time sources can be normalized sufficiently to build a coherent timeline.
- Human responders are available to validate or reject recommended conclusions.
Capability requirements¶
- Retrieval (retrieval): Investigators must gather evidence from multiple systems and records before causes can be narrowed.
- Discrepancy analysis (discrepancy-analysis): The workflow centers on explaining mismatches between expected and observed behavior.
- Verification (verification): Candidate causes need corroboration against independent evidence rather than plausible storytelling alone.
- Memory and state tracking (memory-and-state-tracking): The workflow must preserve evolving hypotheses, evidence links, and timeline state across multiple steps.
- Coordination (coordination): Multiple responders and systems contribute evidence, so handoffs and ownership need structured coordination.
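As a minimal sketch of the memory-and-state-tracking capability (class and method names are illustrative; a real deployment would persist this in the incident system rather than in-process), shared case memory can hold evidence as collected, an append-only decision log, and the set of open checks:

```python
class CaseMemory:
    """Minimal shared investigation state: evidence, decisions, open checks."""

    def __init__(self):
        self.evidence = {}     # evidence_id -> record, as collected
        self.decisions = []    # (actor, decision, rationale) audit trail
        self.open_checks = set()

    def add_evidence(self, evidence_id, record):
        # Never overwrite as-collected evidence; integrity must be preserved.
        self.evidence.setdefault(evidence_id, record)

    def record_decision(self, actor, decision, rationale):
        # Append-only, so rejected hypotheses stay on the record.
        self.decisions.append((actor, decision, rationale))

    def close_check(self, check):
        self.open_checks.discard(check)
```

The `setdefault` guard and the append-only decision list are deliberate: they mirror the constraints on evidence integrity and inspectable narratives above.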
Execution architecture¶
- Orchestrated multi-agent (orchestrated-multi-agent): Specialized retrieval, timeline-building, and verification roles are common when evidence volume and responder concurrency are high.
- Human in the loop (human-in-the-loop): Human responders routinely validate hypotheses, weigh operational context, and approve incident conclusions.
Autonomy profile¶
- Level: Recommendation only (recommendation-only)
- Reversibility: Analytical conclusions are editable, but accepting the wrong root cause can drive hard-to-reverse remediation and delay real recovery.
- Escalation: Escalate when evidence is incomplete, multiple plausible causes remain after planned checks, or the proposed remediation could materially increase user or system risk.
Human checkpoints¶
- Confirm incident scope and critical systems before narrowing hypotheses.
- Review ranked root-cause hypotheses before declaring the primary cause.
- Approve remediation or closure actions derived from the analysis.
Risk and governance¶
- Risk level: High (high)
- Failure impact: Incorrect diagnosis can prolong outages, trigger harmful remediations, misinform postmortems, and create false confidence about system reliability.
- Auditability: Preserve normalized timelines, evidence links, rejected hypotheses, human overrides, and final causal rationale for postmortem review.
Approval requirements¶
- The incident commander must approve the declared root cause before final closure or external reporting.
- Service owners must approve corrective actions that materially alter production systems.
Privacy¶
- Minimize exposure of personal data in logs, tickets, and customer reports during investigation.
- Restrict copied evidence to incident workspaces with appropriate retention controls.
Security¶
- Use read-only access where possible when collecting production evidence.
- Protect credentials and privileged diagnostics from leaking into analysis artifacts.
Notes: High-risk governance is justified because the pattern shapes consequential remediation and formal incident narratives.
Why agentic¶
- The workflow must iteratively form, test, and narrow causal hypotheses as new evidence arrives.
- Evidence comes from heterogeneous systems and people, making stateful reconciliation and coordination essential.
- Static dashboards rarely preserve enough structured reasoning to explain why one cause is more credible than another.
Failure modes¶
Premature fixation on a plausible but wrong root cause¶
- Impact: Teams pursue the wrong remediation and leave the real fault unresolved.
- Severity: high
- Detectability: medium
- Mitigations:
- Keep multiple hypotheses visible until disconfirming checks are complete.
- Require evidence that explicitly links cause to observed impact.
- Review the final narrative with an incident lead before closure.
Incident timeline is incomplete or incorrectly ordered¶
- Impact: Causal reasoning becomes distorted and contributing factors are missed.
- Severity: high
- Detectability: medium
- Mitigations:
- Normalize time sources and highlight gaps in the event sequence.
- Reconcile machine telemetry with human reports before finalizing the timeline.
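The gap-highlighting mitigation above can be sketched as a simple scan over an ordered timeline; the five-minute silence threshold is an illustrative assumption, not a recommended default:

```python
from datetime import datetime, timedelta

def find_gaps(timeline, max_silence=timedelta(minutes=5)):
    """Flag spans in a time-ordered timeline where no event was recorded
    for longer than `max_silence`, so missing evidence is made explicit."""
    gaps = []
    for prev, cur in zip(timeline, timeline[1:]):
        if cur["ts"] - prev["ts"] > max_silence:
            gaps.append((prev["ts"], cur["ts"]))
    return gaps
```

Surfacing the gaps, rather than silently interpolating across them, keeps the workflow degrading into explicit uncertainty instead of false certainty.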
State is lost across responders or investigation sessions¶
- Impact: Previously rejected hypotheses reappear and evidence handoffs become inconsistent.
- Severity: medium
- Detectability: high
- Mitigations:
- Maintain shared case memory for evidence, conclusions, and open checks.
- Log analyst decisions and rationale as part of the investigation record.
Support or operator observations are ignored because they are less structured¶
- Impact: The analysis misses symptoms that would have ruled out or strengthened a hypothesis.
- Severity: medium
- Detectability: medium
- Mitigations:
- Capture human observations as first-class evidence linked to the timeline.
- Require explicit notes when reports are excluded from the final narrative.
Evaluation¶
Success metrics¶
- Agreement rate between initial analysis and final adjudicated root cause.
- Time to first defensible hypothesis with cited supporting evidence.
- Percentage of incidents with a complete reconciled timeline before closure.
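The agreement-rate metric above reduces to comparing the initial hypothesis against the final adjudicated cause per incident; the pairwise-label representation below is an assumption, and the labeling scheme is whatever the postmortem process uses:

```python
def agreement_rate(incidents):
    """Fraction of incidents whose initial top hypothesis matched the final
    adjudicated root cause. `incidents` is a list of (initial, final) cause
    labels; returns 0.0 for an empty sample rather than dividing by zero."""
    if not incidents:
        return 0.0
    matches = sum(1 for initial, final in incidents if initial == final)
    return matches / len(incidents)
```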
Quality criteria¶
- The analysis separates observations, inferences, and recommended actions.
- Competing explanations are documented until they are disproved or deprioritized.
- Evidence supporting the declared root cause is inspectable and durable.
Robustness checks¶
- Test against incidents with overlapping concurrent failures to ensure hypotheses stay distinguishable.
- Test with missing telemetry and verify the workflow degrades into explicit uncertainty rather than false certainty.
- Test with conflicting customer reports and system metrics to ensure both remain visible for adjudication.
Benchmark notes: Strong evaluation compares both diagnostic accuracy and the discipline of evidence preservation under operational pressure.
Implementation notes¶
Orchestration notes¶
- Keep retrieval, timeline reconciliation, and causal verification as explicit stages with shared case memory.
- Preserve rejected hypotheses to reduce repeated investigation loops during handoffs.
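One way to keep these stages explicit is to chain them as callables over a shared memory object; the stage names and signature below are illustrative, not a prescribed interface:

```python
def run_investigation(trigger, stages, memory):
    """Run explicit investigation stages over shared case memory.

    `stages` is an ordered list of callables (e.g. retrieve, build_timeline,
    verify); each receives the previous stage's artifact plus the shared
    `memory` dict, so state survives handoffs between stages.
    """
    artifact = trigger
    for stage in stages:
        artifact = stage(artifact, memory)
    return artifact
```

Because every stage sees the same `memory`, rejected hypotheses written by a verification stage remain visible to later passes, which is what prevents repeated investigation loops.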
Integration notes¶
- Typical integrations include observability stacks, incident systems, change logs, and support tooling.
- Architecture should remain vendor-neutral so the pattern does not collapse into a specific monitoring platform.
Deployment notes¶
- Prefer read-only evidence collection paths in production.
- Align retention of investigation artifacts with postmortem and compliance obligations.
References¶
Example domains¶
- Engineering (engineering): Explain a production outage by reconciling deployments, traces, and dependency failures.
- Operations (operations): Analyze a process breakdown by aligning handoff records, queue metrics, and operator notes.
- Support (support): Investigate an escalated customer issue by combining case history, service signals, and responder observations.
Related patterns¶
- Risk alert triage (follows-from)
- Alert triage often hands the highest-severity or ambiguous cases into root-cause analysis.
- Research synthesis with citation verification (complements)
- Evidence-grounded synthesis techniques strengthen the explainability and provenance of the final incident narrative.
Grounded instances¶
- Fixed-income voice-capture retention gap root-cause investigation
- Sanctions screening gap root-cause investigation
- Payments API latency incident investigation
- Restricted production crash-dump redaction exposure root-cause investigation
- Intercompany netting settlement mismatch investigation
- Treasury cash position discrepancy investigation
- Protected leave return-to-work status drift root-cause investigation
- Distribution sorter misroute root-cause investigation
- Cross-lab benchmark replication discrepancy investigation
- Enterprise admin entitlement drift root-cause investigation
- Severity-one sovereign support case evidence-loss and routing-state root-cause investigation
Canonical source¶
data/patterns/investigate-reconcile-verify/incident-root-cause-analysis.yaml