Incident root cause analysis¶
Investigate an incident by reconciling telemetry, changes, and human reports into an evidence-backed explanation of what failed and why.
Metadata¶
- Pattern id: incident-root-cause-analysis
- Pattern family: Investigate / Reconcile / Verify
- Problem structure: Discrepancy investigation (discrepancy-investigation)
- Domains: Engineering (engineering), Operations (operations), Support (support)
Workflow goal¶
Produce a defensible root-cause narrative, reconciled timeline, and prioritized next checks or remediations for an incident.
Inputs¶
Incident trigger¶
- Description: A declared incident, anomaly, or escalated case that requires explanation beyond initial triage.
- Kind: case
- Required: Yes
- Examples:
- Sev-2 service degradation
- Escalated customer-impacting support case
Operational evidence¶
- Description: Logs, metrics, traces, alerts, and system state snapshots relevant to the incident window.
- Kind: telemetry
- Required: Yes
- Examples:
- Request latency traces
- Queue depth metrics
- Database failover logs
Change and context history¶
- Description: Deployments, configuration updates, tickets, maintenance actions, and recent environment changes.
- Kind: change-history
- Required: Yes
- Examples:
- Deployment records
- Feature flag changes
- Runbook interventions
Human reports¶
- Description: Operator notes, support escalations, and stakeholder observations that may explain or constrain hypotheses.
- Kind: narrative
- Required: No
- Examples:
- Incident channel notes
- Customer symptom reports
Outputs¶
Root-cause hypothesis set¶
- Description: Ranked causal explanations with supporting and disconfirming evidence plus confidence notes.
- Kind: analysis
- Required: Yes
- Examples:
- Primary cause with contributing conditions
- Competing hypotheses awaiting one confirming check
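Concretely, a hypothesis set like this can be kept as structured records rather than free text. The sketch below is illustrative, not part of the pattern: field names, the 0–1 confidence scale, and the status values are assumptions. It keeps supporting and disconfirming evidence attached to each candidate cause and leaves rejected hypotheses visible rather than deleting them.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """A candidate root cause with linked evidence and a confidence note."""
    cause: str
    supporting: list = field(default_factory=list)     # evidence ids that corroborate
    disconfirming: list = field(default_factory=list)  # evidence ids that contradict
    confidence: float = 0.0                            # analyst-assigned, 0..1
    status: str = "open"                               # open | confirmed | rejected

def rank(hypotheses):
    """Order non-rejected hypotheses by confidence, keeping rejected ones
    visible at the end of the list instead of dropping them."""
    live = [h for h in hypotheses if h.status != "rejected"]
    rejected = [h for h in hypotheses if h.status == "rejected"]
    return sorted(live, key=lambda h: h.confidence, reverse=True) + rejected
```

Keeping rejected hypotheses in the ranked output supports the constraint that the narrative stays inspectable for postmortem review.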
Reconciled incident timeline¶
- Description: A normalized sequence of relevant events across systems and actors.
- Kind: timeline
- Required: Yes
- Examples:
- Timeline linking deploy, alert, mitigation, and customer reports
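A reconciled timeline of this kind can be sketched as a merge of per-system event feeds normalized to UTC before ordering. The `ts`/`source`/`event` field names below are assumptions for illustration; the sketch expects offset-aware ISO-8601 timestamps, since naive timestamps are exactly what makes cross-system ordering unreliable.

```python
from datetime import datetime, timezone

def reconcile(events):
    """Normalize event timestamps to UTC and return one ordered timeline.

    `events` is an iterable of dicts with offset-aware ISO-8601 `ts`,
    `source`, and `event` keys (an illustrative schema, not a fixed one).
    """
    normalized = []
    for e in events:
        ts = datetime.fromisoformat(e["ts"]).astimezone(timezone.utc)
        normalized.append({**e, "ts": ts})
    # Sort across all sources so deploys, alerts, and reports interleave.
    return sorted(normalized, key=lambda e: e["ts"])
```

A deploy logged at `12:00:00+02:00` and an alert at `09:30:00+00:00` then order correctly as alert before deploy, which a naive string sort would get wrong.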
Recommended follow-up actions¶
- Description: Proposed investigations, corrective actions, or monitoring changes derived from the analysis.
- Kind: recommendation
- Required: Yes
- Examples:
- Validate rollback path before declaring closure
- Add detection around the failing dependency
Environment¶
Operates in high-pressure technical and service environments where evidence is fragmented, timelines are noisy, and premature conclusions can worsen the incident.
Systems¶
- Observability platforms
- Incident management systems
- Change management records
- Support case systems
Actors¶
- Incident commander
- Responding engineers or operators
- Service owners
- Support leads
Constraints¶
- Preserve evidence integrity and timestamps as collected.
- Distinguish observed facts from inferred causes.
- Avoid declaring closure while material uncertainty remains.
- Keep the narrative inspectable for postmortem and audit review.
Assumptions¶
- Relevant systems retain logs and state long enough for investigation.
- Time sources can be normalized sufficiently to build a coherent timeline.
- Human responders are available to validate or reject recommended conclusions.
Capability requirements¶
- Retrieval (retrieval): Investigators must gather evidence from multiple systems and records before causes can be narrowed.
- Discrepancy analysis (discrepancy-analysis): The workflow centers on explaining mismatches between expected and observed behavior.
- Verification (verification): Candidate causes need corroboration against independent evidence rather than plausible storytelling alone.
- Memory and state tracking (memory-and-state-tracking): The workflow must preserve evolving hypotheses, evidence links, and timeline state across multiple steps.
- Coordination (coordination): Multiple responders and systems contribute evidence, so handoffs and ownership need structured coordination.
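As a minimal sketch of the memory-and-state-tracking capability (class and method names are illustrative; a real deployment would persist this in the incident system rather than in-process), shared case memory can hold evidence as collected, an append-only decision log, and the set of open checks:

```python
class CaseMemory:
    """Minimal shared investigation state: evidence, decisions, open checks."""

    def __init__(self):
        self.evidence = {}     # evidence_id -> record, as collected
        self.decisions = []    # (actor, decision, rationale) audit trail
        self.open_checks = set()

    def add_evidence(self, evidence_id, record):
        # Never overwrite as-collected evidence; integrity must be preserved.
        self.evidence.setdefault(evidence_id, record)

    def record_decision(self, actor, decision, rationale):
        # Append-only, so rejected hypotheses stay on the record.
        self.decisions.append((actor, decision, rationale))

    def close_check(self, check):
        self.open_checks.discard(check)
```

The `setdefault` guard and the append-only decision list are deliberate: they mirror the constraints on evidence integrity and inspectable narratives above.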
Execution architecture¶
- Orchestrated multi-agent (orchestrated-multi-agent): Specialized retrieval, timeline-building, and verification roles are common when evidence volume and responder concurrency are high.
- Human in the loop (human-in-the-loop): Human responders routinely validate hypotheses, weigh operational context, and approve incident conclusions.
Autonomy profile¶
- Level: Recommendation only (recommendation-only)
- Reversibility: Analytical conclusions are editable, but accepting the wrong root cause can drive hard-to-reverse remediation and delay real recovery.
- Escalation: Escalate when evidence is incomplete, multiple plausible causes remain after planned checks, or the proposed remediation could materially increase user or system risk.
Human checkpoints¶
- Confirm incident scope and critical systems before narrowing hypotheses.
- Review ranked root-cause hypotheses before declaring the primary cause.
- Approve remediation or closure actions derived from the analysis.
Risk and governance¶
- Risk level: High (high)
- Failure impact: Incorrect diagnosis can prolong outages, trigger harmful remediations, misinform postmortems, and create false confidence about system reliability.
- Auditability: Preserve normalized timelines, evidence links, rejected hypotheses, human overrides, and final causal rationale for postmortem review.
Approval requirements¶
- The incident commander must approve the declared root cause before final closure or external reporting.
- Service owners must approve corrective actions that materially alter production systems.
Privacy¶
- Minimize exposure of personal data in logs, tickets, and customer reports during investigation.
- Restrict copied evidence to incident workspaces with appropriate retention controls.
Security¶
- Use read-only access where possible when collecting production evidence.
- Protect credentials and privileged diagnostics from leaking into analysis artifacts.
Notes: High-risk governance is justified because the pattern shapes consequential remediation and formal incident narratives.
Why agentic¶
- The workflow must iteratively form, test, and narrow causal hypotheses as new evidence arrives.
- Evidence comes from heterogeneous systems and people, making stateful reconciliation and coordination essential.
- Static dashboards rarely preserve enough structured reasoning to explain why one cause is more credible than another.
Failure modes¶
Premature fixation on a plausible but wrong root cause¶
- Impact: Teams pursue the wrong remediation and leave the real fault unresolved.
- Severity: high
- Detectability: medium
- Mitigations:
- Keep multiple hypotheses visible until disconfirming checks are complete.
- Require evidence that explicitly links cause to observed impact.
- Review the final narrative with an incident lead before closure.
Incident timeline is incomplete or incorrectly ordered¶
- Impact: Causal reasoning becomes distorted and contributing factors are missed.
- Severity: high
- Detectability: medium
- Mitigations:
- Normalize time sources and highlight gaps in the event sequence.
- Reconcile machine telemetry with human reports before finalizing the timeline.
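The gap-highlighting mitigation above can be sketched as a simple scan over an ordered timeline; the five-minute silence threshold is an illustrative assumption, not a recommended default:

```python
from datetime import datetime, timedelta

def find_gaps(timeline, max_silence=timedelta(minutes=5)):
    """Flag spans in a time-ordered timeline where no event was recorded
    for longer than `max_silence`, so missing evidence is made explicit."""
    gaps = []
    for prev, cur in zip(timeline, timeline[1:]):
        if cur["ts"] - prev["ts"] > max_silence:
            gaps.append((prev["ts"], cur["ts"]))
    return gaps
```

Surfacing the gaps, rather than silently interpolating across them, keeps the workflow degrading into explicit uncertainty instead of false certainty.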
State is lost across responders or investigation sessions¶
- Impact: Previously rejected hypotheses reappear and evidence handoffs become inconsistent.
- Severity: medium
- Detectability: high
- Mitigations:
- Maintain shared case memory for evidence, conclusions, and open checks.
- Log analyst decisions and rationale as part of the investigation record.
Support or operator observations are ignored because they are less structured¶
- Impact: The analysis misses symptoms that would have ruled out or strengthened a hypothesis.
- Severity: medium
- Detectability: medium
- Mitigations:
- Capture human observations as first-class evidence linked to the timeline.
- Require explicit notes when reports are excluded from the final narrative.
Evaluation¶
Success metrics¶
- Agreement rate between initial analysis and final adjudicated root cause.
- Time to first defensible hypothesis with cited supporting evidence.
- Percentage of incidents with a complete reconciled timeline before closure.
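The agreement-rate metric above reduces to comparing the initial hypothesis against the final adjudicated cause per incident; the pairwise-label representation below is an assumption, and the labeling scheme is whatever the postmortem process uses:

```python
def agreement_rate(incidents):
    """Fraction of incidents whose initial top hypothesis matched the final
    adjudicated root cause. `incidents` is a list of (initial, final) cause
    labels; returns 0.0 for an empty sample rather than dividing by zero."""
    if not incidents:
        return 0.0
    matches = sum(1 for initial, final in incidents if initial == final)
    return matches / len(incidents)
```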
Quality criteria¶
- The analysis separates observations, inferences, and recommended actions.
- Competing explanations are documented until they are disproved or deprioritized.
- Evidence supporting the declared root cause is inspectable and durable.
Robustness checks¶
- Test against incidents with overlapping concurrent failures to ensure hypotheses stay distinguishable.
- Test with missing telemetry and verify the workflow degrades into explicit uncertainty rather than false certainty.
- Test with conflicting customer reports and system metrics to ensure both remain visible for adjudication.
Benchmark notes: Strong evaluation compares both diagnostic accuracy and the discipline of evidence preservation under operational pressure.
Implementation notes¶
Orchestration notes¶
- Keep retrieval, timeline reconciliation, and causal verification as explicit stages with shared case memory.
- Preserve rejected hypotheses to reduce repeated investigation loops during handoffs.
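One way to keep these stages explicit is to chain them as callables over a shared memory object; the stage names and signature below are illustrative, not a prescribed interface:

```python
def run_investigation(trigger, stages, memory):
    """Run explicit investigation stages over shared case memory.

    `stages` is an ordered list of callables (e.g. retrieve, build_timeline,
    verify); each receives the previous stage's artifact plus the shared
    `memory` dict, so state survives handoffs between stages.
    """
    artifact = trigger
    for stage in stages:
        artifact = stage(artifact, memory)
    return artifact
```

Because every stage sees the same `memory`, rejected hypotheses written by a verification stage remain visible to later passes, which is what prevents repeated investigation loops.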
Integration notes¶
- Typical integrations include observability stacks, incident systems, change logs, and support tooling.
- Architecture should remain vendor-neutral so the pattern does not collapse into a specific monitoring platform.
Deployment notes¶
- Prefer read-only evidence collection paths in production.
- Align retention of investigation artifacts with postmortem and compliance obligations.
References¶
Example domains¶
- Engineering (engineering): Explain a production outage by reconciling deployments, traces, and dependency failures.
- Operations (operations): Analyze a process breakdown by aligning handoff records, queue metrics, and operator notes.
- Support (support): Investigate an escalated customer issue by combining case history, service signals, and responder observations.
Related patterns¶
- Risk alert triage (follows-from)
- Alert triage often hands the highest-severity or ambiguous cases into root-cause analysis.
- Research synthesis with citation verification (complements)
- Evidence-grounded synthesis techniques strengthen the explainability and provenance of the final incident narrative.
Grounded instances¶
- Fixed-income voice-capture retention gap root-cause investigation
- Sanctions screening gap root-cause investigation
- Payments API latency incident investigation
- Restricted production crash-dump redaction exposure root-cause investigation
- Intercompany netting settlement mismatch investigation
- Treasury cash position discrepancy investigation
- Protected leave return-to-work status drift root-cause investigation
- Distribution sorter misroute root-cause investigation
- Cross-lab benchmark replication discrepancy investigation
- Enterprise admin entitlement drift root-cause investigation
- Severity-one sovereign support case evidence-loss and routing-state root-cause investigation
Canonical source¶
data/patterns/investigate-reconcile-verify/incident-root-cause-analysis.yaml