Respond to an Incident

Something has gone wrong with your AI system. This journey helps you respond effectively and learn from what happened.

Phase 1: Contain
Phase 2: Investigate
Phase 3: Remediate
Phase 4: Learn

Phase 1: Contain the Incident

Priority: Stop the bleeding.

Immediate Actions (First 30 minutes)

Assess severity - Who is affected and how badly?
Decide containment - Can you isolate the problem or do you need full shutdown?
Notify key stakeholders - Don't let them find out from someone else
Preserve evidence - Logs, outputs, inputs that led to the incident
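
Preserving evidence can be as simple as copying the relevant files somewhere safe and recording a checksum for each, so later analysis can show they were not altered. The following is a minimal sketch, assuming the evidence is available as local files. The function name `preserve_evidence`, the flat directory layout, and the `manifest.json` file are illustrative choices, not part of any standard tool.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def preserve_evidence(sources, evidence_dir):
    """Copy each source file into evidence_dir and write a SHA-256 manifest.

    Hypothetical helper: files that share a name would overwrite each other,
    so a real version would need a naming scheme.
    """
    evidence_dir = Path(evidence_dir)
    evidence_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    for src in map(Path, sources):
        dest = evidence_dir / src.name
        shutil.copy2(src, dest)  # copy2 also preserves file timestamps
        manifest.append({
            "file": src.name,
            "sha256": hashlib.sha256(dest.read_bytes()).hexdigest(),
            "captured_at": datetime.now(timezone.utc).isoformat(),
        })
    (evidence_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Taking the copy before any rollback or restart matters, because containment actions often rotate or overwrite the logs you will need in Phase 2.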

Containment Options

Option	When to Use	Trade-off
Full shutdown	Safety risk, legal exposure	Complete service loss
Disable AI component	AI is the problem, fallback exists	Degraded service
Rate limit	Volume-related issue	Slower service
Human review gate	Quality issue, need oversight	Increased workload
User segment isolation	Affects specific group	Partial availability
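
The table above can be turned into a small decision helper so the on-call person does not have to reason from scratch under pressure. This is only a sketch: the `IncidentSignals` fields and the order in which the rules are checked (most disruptive first) are assumptions made for illustration, and your own policy may rank the options differently.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Hypothetical yes/no answers gathered during severity assessment."""
    safety_or_legal_risk: bool = False
    ai_is_cause_and_fallback_exists: bool = False
    limited_to_segment: bool = False
    volume_related: bool = False
    quality_issue: bool = False


def suggest_containment(s: IncidentSignals) -> str:
    """Map signals to one option from the containment table."""
    # Assumed ordering: the most serious condition wins when several apply.
    if s.safety_or_legal_risk:
        return "Full shutdown"
    if s.ai_is_cause_and_fallback_exists:
        return "Disable AI component"
    if s.limited_to_segment:
        return "User segment isolation"
    if s.volume_related:
        return "Rate limit"
    if s.quality_issue:
        return "Human review gate"
    return "No rule matched: escalate to the incident lead"
```

For example, `suggest_containment(IncidentSignals(volume_related=True))` returns `"Rate limit"`, while adding `safety_or_legal_risk=True` changes the answer to `"Full shutdown"`.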

Traditional ML/AI

Common containment actions:

Revert to previous model version
Switch to rule-based fallback
Route to human decision-makers
Disable affected feature

GenAI Systems

Common containment actions:

Disable the GenAI feature
Tighten content filters
Reduce model capabilities (e.g., no code generation)
Add mandatory human review
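
Several of these actions (switching to a fallback, disabling the AI feature, adding mandatory human review) are easier to carry out in minutes if the switch already exists in code. The sketch below shows one possible shape for such a switch. The class `ContainmentSwitch`, its attributes, and the in-memory review queue are all hypothetical; a production system would more likely use a feature-flag service and a persistent queue.

```python
from typing import Callable


class ContainmentSwitch:
    """Routes each request to the model, a fallback, or a human review queue."""

    def __init__(self, model: Callable[[str], str], fallback: Callable[[str], str]):
        self.model = model
        self.fallback = fallback
        self.ai_enabled = True      # set to False to disable the AI component
        self.human_review = False   # set to True to hold outputs for review
        self.review_queue = []      # (request, draft) pairs awaiting a reviewer

    def handle(self, request: str) -> str:
        if not self.ai_enabled:
            # AI is disabled: serve the rule-based fallback instead.
            return self.fallback(request)
        draft = self.model(request)
        if self.human_review:
            # Hold the model output until a person approves it.
            self.review_queue.append((request, draft))
            return "Pending human review"
        return draft
```

Flipping `ai_enabled` or `human_review` at runtime changes behavior for the next request without a redeploy, which is the point of preparing the switch before an incident.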

Phase 2: Investigate

Priority: Understand what happened.

Key Questions

What happened? - Factual description of the incident
When did it start? - Timeline and first occurrence
Who was affected? - Scope and impact
What caused it? - Root cause (may take time)
Why wasn't it caught? - Gaps in monitoring/testing
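
The "when" and "who" questions can often be answered directly from logs once you can tell a bad record from a good one. The following sketch assumes each log record is a dict with an ISO 8601 `timestamp` string and a `user_id`; those field names and the `is_bad` predicate are assumptions, so adapt them to your own log schema.

```python
from datetime import datetime


def summarize_incident(records, is_bad):
    """Return first/last occurrence and affected users, or None if nothing matches."""
    bad = [r for r in records if is_bad(r)]
    if not bad:
        return None
    times = [datetime.fromisoformat(r["timestamp"]) for r in bad]
    return {
        "first_seen": min(times),   # answers "When did it start?"
        "last_seen": max(times),
        "affected_users": {r["user_id"] for r in bad},  # answers "Who was affected?"
        "bad_count": len(bad),
    }
```

A call such as `summarize_incident(records, lambda r: r["flagged"])` would work for logs that carry a hypothetical `flagged` field. The earliest bad record found this way is a lower bound on the start time only if the logs cover the whole period.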

Investigation Sources

Traditional ML/AI

Model prediction logs
Input data characteristics
Training data issues
Feature engineering problems
Infrastructure failures
Integration errors

GenAI Systems

Prompt/response logs
User inputs that triggered issue
Guardrail/filter logs
API error logs
Vendor status pages
Content moderation flags

Document Everything

Template: AI Incident Response Plan

Record:

Timeline of events
Actions taken
People involved
Decisions made and rationale
Evidence collected
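
If you do not yet have a template in place, even a small structured log kept during the incident covers the items above. This is a sketch only: `IncidentLog`, `LogEntry`, and the four `kind` values are made-up names for illustration and are not taken from the AI Incident Response Plan template.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogEntry:
    kind: str            # assumed values: "event", "action", "decision", "evidence"
    summary: str
    who: str
    rationale: str = ""  # mainly useful for decisions
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class IncidentLog:
    title: str
    entries: list = field(default_factory=list)

    def record(self, kind, summary, who, rationale=""):
        """Append a timestamped entry; entries are never edited in place."""
        entry = LogEntry(kind, summary, who, rationale)
        self.entries.append(entry)
        return entry

    def people_involved(self):
        """Sorted list of everyone who appears in the log."""
        return sorted({e.who for e in self.entries})
```

Recording the rationale at the moment of the decision is the part most often lost, and it is what the retrospective in Phase 4 depends on.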

Phase 3: Remediate

Priority: Fix the problem properly.

Short-term Fix vs Long-term Fix

Timeframe	Focus	Examples
Immediate	Stop harm	Shutdown, rollback, human override
Short-term	Restore service safely	Patches, tighter controls, monitoring
Long-term	Prevent recurrence	Retrain, redesign, process changes

Before Restoring Service

Root cause identified (or bounded)
Fix implemented and tested
Additional monitoring in place
Stakeholders briefed on restoration plan
Rollback plan still ready
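
The checklist above can be enforced as an explicit gate rather than left to memory. A minimal sketch follows, assuming someone records each item as true or false; the names `RESTORE_CHECKLIST` and `blocking_items` are hypothetical.

```python
RESTORE_CHECKLIST = [
    "Root cause identified (or bounded)",
    "Fix implemented and tested",
    "Additional monitoring in place",
    "Stakeholders briefed on restoration plan",
    "Rollback plan still ready",
]


def blocking_items(status):
    """Return checklist items that are not confirmed.

    `status` maps an item to True/False; a missing item counts as not done.
    """
    return [item for item in RESTORE_CHECKLIST if not status.get(item, False)]
```

Service is restored only when `blocking_items(status)` returns an empty list. Treating a missing answer as "not done" keeps the gate conservative.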

Communication

Who needs to know what happened:

Stakeholder	What They Need	When
Affected users	What happened, what you're doing	ASAP
Executives	Impact, cause, remediation	Same day
Legal/Compliance	Regulatory implications, if any	Immediately if required
Media/Public	Public statement, if the incident is public-facing	Per comms policy

Phase 4: Learn

Priority: Make sure this doesn't happen again.

Post-Incident Review

Hold a blameless retrospective within 1-2 weeks:

What happened - Factual timeline
What went well - Response effectiveness
What didn't go well - Gaps and failures
What we'll change - Concrete actions

Update Your Systems

Update risk register with new risks identified
Enhance monitoring for early detection
Update testing to catch similar issues
Revise incident response procedures if needed
Share learnings (appropriately) with broader team

Questions to Ask

Read: Forbidden Questions: About the Model

Why didn't we catch this in testing?
What assumptions were wrong?
Who raised concerns we didn't act on?
What would have prevented this?

Common AI Incident Types

Type	Signs	Typical Causes
Biased outputs	Disparate outcomes by group	Training data, feature selection
Hallucination	Confident but wrong answers	GenAI limitations, poor grounding
Privacy breach	PII in outputs	Training data contamination
Offensive content	Harmful or inappropriate outputs	Inadequate guardrails
Performance degradation	Accuracy drop over time	Data drift, model decay
Availability failure	System down	Infrastructure, vendor issues
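
For the performance degradation row, data drift can be checked by comparing a feature's recent values against a reference sample. The sketch below uses the Population Stability Index (PSI), which is one common choice among several drift measures and is not prescribed by this guide. The equal-width binning, the default of 10 bins, and the small floor for empty bins are assumptions made for illustration.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    `expected` is the reference sample (e.g. training data) and
    `actual` is the recent sample (e.g. last week's inputs).
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples give a PSI of 0, and the value grows as the two distributions diverge. The value that should trigger an investigation is a decision for your team and depends on the feature and the model.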

Worried About a Project - if concerns led you here
Improve a Struggling Model - if performance issues
Prepare for an Audit - if incident triggers review