Respond to an Incident¶
Something has gone wrong with your AI system. This journey helps you respond effectively and learn from what happened.
- Phase 1: Contain
- Phase 2: Investigate
- Phase 3: Remediate
- Phase 4: Learn
Phase 1: Contain the Incident¶
Priority: Stop the bleeding.
Immediate Actions (First 30 minutes)¶
- Assess severity - Who is affected and how badly?
- Decide containment - Can you isolate the problem or do you need full shutdown?
- Notify key stakeholders - Don't let them find out from someone else
- Preserve evidence - Logs, outputs, inputs that led to the incident
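Evidence preservation can be as simple as copying the relevant files aside and recording a checksum for each, so that later analysis can show nothing changed after capture. The following is a minimal sketch, not a prescribed procedure: `preserve_evidence`, the `evidence_dir` layout, and the `manifest.json` name are all hypothetical, and it assumes the logs are ordinary files with distinct names.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def preserve_evidence(sources, evidence_dir):
    """Copy each source file into evidence_dir and write a SHA-256 manifest.

    Assumption: source file names are unique; a name collision would overwrite.
    """
    evidence_dir = Path(evidence_dir)
    evidence_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    for src in map(Path, sources):
        dest = evidence_dir / src.name
        shutil.copy2(src, dest)  # copy2 also preserves file timestamps
        manifest.append({
            "file": src.name,
            "sha256": hashlib.sha256(dest.read_bytes()).hexdigest(),
            "captured_at": datetime.now(timezone.utc).isoformat(),
        })
    (evidence_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Capturing the copy before any rollback or restart matters, because containment actions often rotate or overwrite the very logs the investigation needs.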
Containment Options¶
| Option | When to Use | Trade-off |
|---|---|---|
| Full shutdown | Safety risk, legal exposure | Complete service loss |
| Disable AI component | AI is the problem, fallback exists | Degraded service |
| Rate limit | Volume-related issue | Slower service |
| Human review gate | Quality issue, need oversight | Increased workload |
| User segment isolation | Affects specific group | Partial availability |
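The table above can be read as a rough decision order, with the most severe condition checked first. The sketch below encodes one possible ordering; the `Situation` fields, the `choose_containment` name, and the precedence between options are assumptions made for illustration, and a real team would set its own.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical yes/no answers gathered during the first assessment."""
    safety_or_legal_risk: bool = False
    ai_is_the_cause: bool = False
    fallback_exists: bool = False
    limited_to_segment: bool = False
    volume_related: bool = False
    quality_issue: bool = False


def choose_containment(s):
    """Return the first containment option whose condition matches."""
    if s.safety_or_legal_risk:
        return "full shutdown"
    if s.ai_is_the_cause and s.fallback_exists:
        return "disable AI component"
    if s.limited_to_segment:
        return "user segment isolation"
    if s.volume_related:
        return "rate limit"
    if s.quality_issue:
        return "human review gate"
    # Nothing matched: do not guess, hand the decision to a person.
    return "escalate for manual decision"
```

For example, `choose_containment(Situation(quality_issue=True))` returns `"human review gate"`, while any safety or legal risk overrides everything else.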
Common containment actions for traditional ML systems:
- Revert to previous model version
- Switch to rule-based fallback
- Route to human decision-makers
- Disable affected feature
Common containment actions for GenAI systems:
- Disable the GenAI feature
- Tighten content filters
- Reduce model capabilities (e.g., no code generation)
- Add mandatory human review
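Several of the actions above (switching to a fallback, disabling a feature, adding mandatory human review) are easiest to apply quickly if the request path already checks runtime flags. The following is a minimal sketch of that idea, assuming a hypothetical `ContainmentFlags` object and callables for the model and the fallback; a production system would typically read such flags from a feature-flag service or configuration store.

```python
from dataclasses import dataclass


@dataclass
class ContainmentFlags:
    """Hypothetical runtime flags an on-call engineer can flip during an incident."""
    ai_enabled: bool = True        # False: bypass the model entirely
    review_required: bool = False  # True: hold AI outputs for a human


def handle_request(request, model, fallback, flags, review_queue):
    """Route one request according to the current containment flags."""
    if not flags.ai_enabled:
        # Degraded service: the fallback answers instead of the model.
        return fallback(request)
    output = model(request)
    if flags.review_required:
        # Human review gate: nothing reaches the user until someone approves it.
        review_queue.append((request, output))
        return None
    return output
```

Returning `None` for held outputs is a simplification; a real service would tell the user that the response is pending review.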
Phase 2: Investigate¶
Priority: Understand what happened.
Key Questions¶
- What happened? - Factual description of the incident
- When did it start? - Timeline and first occurrence
- Who was affected? - Scope and impact
- What caused it? - Root cause (may take time)
- Why wasn't it caught? - Gaps in monitoring/testing
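The "when" and "who" questions above can often be answered directly from logs once there is a way to recognise a bad record. The sketch below assumes each log record is a dict with hypothetical `timestamp` and `user_id` keys, and that the caller supplies an `is_bad` predicate; real log schemas will differ.

```python
from datetime import datetime


def summarize_incident(records, is_bad):
    """Return first/last occurrence and affected users, or None if nothing matches."""
    bad = sorted((r for r in records if is_bad(r)), key=lambda r: r["timestamp"])
    if not bad:
        return None
    return {
        "first_seen": bad[0]["timestamp"],
        "last_seen": bad[-1]["timestamp"],
        "affected_users": sorted({r["user_id"] for r in bad}),
        "bad_record_count": len(bad),
    }


# Usage with made-up records:
records = [
    {"timestamp": datetime(2024, 1, 1, 9, 0), "user_id": "u1", "flagged": False},
    {"timestamp": datetime(2024, 1, 1, 9, 5), "user_id": "u2", "flagged": True},
    {"timestamp": datetime(2024, 1, 1, 9, 7), "user_id": "u1", "flagged": True},
]
summary = summarize_incident(records, lambda r: r["flagged"])
```

The first flagged record is only the first *observed* occurrence; if logging or flagging started late, the true start may be earlier.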
Investigation Sources¶
For traditional ML systems:
- Model prediction logs
- Input data characteristics
- Training data issues
- Feature engineering problems
- Infrastructure failures
- Integration errors
For GenAI systems:
- Prompt/response logs
- User inputs that triggered the issue
- Guardrail/filter logs
- API error logs
- Vendor status pages
- Content moderation flags
Document Everything¶
Template: AI Incident Response Plan
Record:
- Timeline of events
- Actions taken
- People involved
- Decisions made and rationale
- Evidence collected
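Keeping the record in a structured form makes it easier to reconstruct the timeline later. The following is one possible shape, offered as a sketch rather than the format of the linked template: `IncidentRecord`, `TimelineEntry`, and their fields are hypothetical names chosen to mirror the items listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    at: datetime
    actor: str           # person who acted or decided
    action: str          # what was done or decided
    rationale: str = ""  # why, captured while it is still fresh


@dataclass
class IncidentRecord:
    title: str
    entries: list = field(default_factory=list)
    evidence: list = field(default_factory=list)  # paths or links to collected evidence

    def log(self, actor, action, rationale="", at=None):
        """Append an entry, timestamped now (UTC) unless a time is given."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), actor, action, rationale)
        self.entries.append(entry)
        return entry

    def timeline(self):
        """Entries in chronological order, regardless of when they were logged."""
        return sorted(self.entries, key=lambda e: e.at)

    def people_involved(self):
        return sorted({e.actor for e in self.entries})
```

Allowing an explicit `at` value lets responders backfill events they only learned about later without distorting the order.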
Phase 3: Remediate¶
Priority: Fix the problem properly.
Short-term Fix vs Long-term Fix¶
| Timeframe | Focus | Examples |
|---|---|---|
| Immediate | Stop harm | Shutdown, rollback, human override |
| Short-term | Restore service safely | Patches, tighter controls, monitoring |
| Long-term | Prevent recurrence | Retrain, redesign, process changes |
Before Restoring Service¶
- Root cause identified (or bounded)
- Fix implemented and tested
- Additional monitoring in place
- Stakeholders briefed on restoration plan
- Rollback plan still ready
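The checklist above can be enforced as a simple gate so that service is not restored while any item is still open. This is a minimal sketch; the item keys and the `ready_to_restore` function are hypothetical, and the status dict is assumed to be filled in by whoever owns the restoration decision.

```python
RESTORE_CHECKLIST = (
    "root_cause_identified_or_bounded",
    "fix_implemented_and_tested",
    "additional_monitoring_in_place",
    "stakeholders_briefed",
    "rollback_plan_ready",
)


def ready_to_restore(status):
    """Return (ok, missing): ok is True only when every checklist item is True."""
    missing = [item for item in RESTORE_CHECKLIST if not status.get(item, False)]
    return (not missing, missing)


# Usage: everything done except the stakeholder briefing.
status = {item: True for item in RESTORE_CHECKLIST}
status["stakeholders_briefed"] = False
ok, missing = ready_to_restore(status)
```

Items absent from `status` count as not done, so a forgotten item blocks restoration rather than silently passing.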
Communication¶
Who needs to know what happened:
| Stakeholder | What They Need | When |
|---|---|---|
| Affected users | What happened, what you're doing | ASAP |
| Executives | Impact, cause, remediation | Same day |
| Legal/Compliance | Incident facts, where there are regulatory implications | Immediately if required |
| Media/Public | Public statement, where the incident is public-facing | Per comms policy |
Phase 4: Learn¶
Priority: Make sure this doesn't happen again.
Post-Incident Review¶
Hold a blameless retrospective within 1-2 weeks:
- What happened - Factual timeline
- What went well - Response effectiveness
- What didn't go well - Gaps and failures
- What we'll change - Concrete actions
Update Your Systems¶
- Update risk register with new risks identified
- Enhance monitoring for early detection
- Update testing to catch similar issues
- Revise incident response procedures if needed
- Share learnings (appropriately) with broader team
Questions to Ask¶
Read: Forbidden Questions: About the Model
- Why didn't we catch this in testing?
- What assumptions were wrong?
- Who raised concerns we didn't act on?
- What would have prevented this?
Common AI Incident Types¶
| Type | Signs | Typical Causes |
|---|---|---|
| Biased outputs | Disparate outcomes by group | Training data, feature selection |
| Hallucination | Confident but wrong answers | GenAI limitations, poor grounding |
| Privacy breach | PII in outputs | Training data contamination |
| Offensive content | Harmful or inappropriate outputs | Inadequate guardrails |
| Performance degradation | Accuracy drop over time | Data drift, model decay |
| Availability failure | System down | Infrastructure, vendor issues |
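Two of the signs in the table, disparate outcomes by group and an accuracy drop over time, lend themselves to simple automated checks. The sketch below shows one basic version of each. The function names, the `(group, positive)` row format, and the `tolerance` default of 0.05 are assumptions for illustration; appropriate metrics and thresholds depend on the system and its fairness requirements.

```python
from collections import defaultdict


def outcome_rate_by_group(rows):
    """rows: iterable of (group, positive) pairs. Returns positive rate per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, positive in rows:
        totals[group] += 1
        positives[group] += 1 if positive else 0
    return {g: positives[g] / totals[g] for g in totals}


def accuracy_dropped(baseline_accuracy, recent_correct, tolerance=0.05):
    """recent_correct: list of booleans, one per recent prediction.

    Returns (dropped, recent_accuracy); dropped is True when recent accuracy
    falls more than `tolerance` below the baseline.
    """
    if not recent_correct:
        raise ValueError("need at least one recent outcome")
    recent = sum(recent_correct) / len(recent_correct)
    return recent < baseline_accuracy - tolerance, recent
```

A large gap between group rates, or a flagged drop, is a prompt to investigate rather than proof of bias or drift on its own.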
Related Journeys¶
- Worried About a Project - if concerns about a project led you here
- Improve a Struggling Model - if performance issues are behind the incident
- Prepare for an Audit - if the incident triggers a review