AI Incident Response Plan
- When to prepare: Before going live with any AI system
- Key roles: Incident Commander, Technical Lead, Communications Lead
- Response phases: Detect → Contain → Remediate → Recover → Learn
- Related journey: Respond to an Incident
AI Incidents Are Different
AI incidents can escalate rapidly (media attention, political scrutiny) and may involve hard-to-explain model behaviour. Prepare before you need this plan.
Document Control
| Field | Value |
|---|---|
| AI System Name | |
| Version | 1.0 |
| Author | |
| Date | |
| Status | Draft / Approved / Active |
| Next Review Date | |
| Incident Response Lead | |
1. Scope and Objectives
1.1 Scope
This plan covers incidents involving:
- [ ] AI/ML model failures
- [ ] Biased or unfair outcomes
- [ ] Data quality issues affecting AI
- [ ] AI security breaches
- [ ] Privacy violations from AI
- [ ] AI-generated misinformation
- [ ] Adversarial attacks on AI
- [ ] Unintended harmful AI behaviors
AI Systems Covered:
| System ID | System Name | Criticality | Owner |
|---|---|---|---|
| | | High/Med/Low | |
1.2 Objectives
- Detect - Rapidly identify AI incidents
- Contain - Limit impact and prevent escalation
- Communicate - Notify stakeholders appropriately
- Remediate - Fix the root cause
- Recover - Restore normal operations
- Learn - Improve systems and processes
2. AI Incident Classification
2.1 Incident Categories
| Category | Description | Examples |
|---|---|---|
| Model Performance | AI not performing as expected | Accuracy degradation, prediction errors |
| Bias/Fairness | AI producing unfair outcomes | Discrimination against protected groups |
| Data Incident | Data issues affecting AI | Data breach, poisoning, quality failure |
| Security | Security threats to AI | Adversarial attacks, model theft |
| Privacy | Privacy violations from AI | PII exposure, re-identification |
| Operational | AI system availability issues | Outages, latency, scaling failures |
| Safety | AI causing or risking harm | Dangerous recommendations, safety failures |
| Ethics | AI ethical principle violations | Transparency failures, consent issues |
2.2 Severity Levels
| Level | Name | Description | Response Time | Examples |
|---|---|---|---|---|
| 1 | Critical | Significant harm occurring or imminent | 15 minutes | Data breach, safety incident, widespread bias |
| 2 | High | Serious impact on services or individuals | 1 hour | Major accuracy failure, privacy violation |
| 3 | Medium | Moderate impact, workaround available | 4 hours | Performance degradation, limited bias |
| 4 | Low | Minor impact, limited scope | 24 hours | Minor bugs, localized issues |
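Where incidents are tracked in a ticketing or monitoring tool, the response-time targets above can be encoded so that overdue incidents are flagged automatically. A minimal Python sketch follows; it assumes the response clock starts at detection time, and the function names are illustrative rather than part of any particular tool.

```python
from datetime import datetime, timedelta

# Response-time targets from the Severity Levels table (level -> time allowed).
RESPONSE_TIME = {
    1: timedelta(minutes=15),  # Critical
    2: timedelta(hours=1),     # High
    3: timedelta(hours=4),     # Medium
    4: timedelta(hours=24),    # Low
}


def response_deadline(severity: int, detected_at: datetime) -> datetime:
    """Latest time a response should start, assuming the clock runs from detection."""
    if severity not in RESPONSE_TIME:
        raise ValueError(f"unknown severity level: {severity}")
    return detected_at + RESPONSE_TIME[severity]


def is_overdue(severity: int, detected_at: datetime, now: datetime) -> bool:
    """True once the response window for this severity has elapsed."""
    return now > response_deadline(severity, detected_at)


detected = datetime(2024, 5, 1, 9, 0)
print(response_deadline(2, detected))  # 2024-05-01 10:00:00
```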
2.3 Severity Assessment Matrix
| Factor | Critical (1) | High (2) | Medium (3) | Low (4) |
|---|---|---|---|---|
| Affected users | >1000 or vulnerable | 100-1000 | 10-100 | <10 |
| Harm potential | Serious harm likely | Harm possible | Inconvenience | Minimal |
| Legal exposure | Breach notification | Compliance risk | Minor issue | None |
| Media risk | National coverage | Local coverage | Possible interest | None |
| Recovery time | >24 hours | 4-24 hours | 1-4 hours | <1 hour |
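The matrix does not say how to combine factors that point to different levels. One common convention, assumed here rather than taken from this plan, is that the most severe factor sets the overall level. The sketch below applies that rule; the numeric boundaries (for example, treating exactly 100 affected users as High) are also assumptions, because the ranges in the table overlap at their edges.

```python
# Qualitative columns of the Severity Assessment Matrix (1 = Critical ... 4 = Low).
HARM = {"serious harm likely": 1, "harm possible": 2, "inconvenience": 3, "minimal": 4}
LEGAL = {"breach notification": 1, "compliance risk": 2, "minor issue": 3, "none": 4}
MEDIA = {"national coverage": 1, "local coverage": 2, "possible interest": 3, "none": 4}


def users_level(affected: int, vulnerable: bool = False) -> int:
    """Affected users row; overlapping boundary values go to the more severe level."""
    if affected > 1000 or vulnerable:
        return 1
    if affected >= 100:
        return 2
    if affected >= 10:
        return 3
    return 4


def recovery_level(hours: float) -> int:
    """Recovery time row, in hours; overlapping boundary values go to the more severe level."""
    if hours > 24:
        return 1
    if hours >= 4:
        return 2
    if hours >= 1:
        return 3
    return 4


def assess_severity(affected: int, vulnerable: bool, harm: str, legal: str,
                    media: str, recovery_hours: float) -> int:
    """Overall severity = the most severe (lowest-numbered) level across all factors."""
    levels = [
        users_level(affected, vulnerable),
        HARM[harm.lower()],
        LEGAL[legal.lower()],
        MEDIA[media.lower()],
        recovery_level(recovery_hours),
    ]
    return min(levels)


# 50 users, inconvenience only, 2-hour recovery -> Medium
print(assess_severity(50, False, "Inconvenience", "None", "None", 2))  # 3
```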
3. Incident Response Team
3.1 Core Team Roles
| Role | Primary | Backup | Contact |
|---|---|---|---|
| Incident Commander | |||
| Technical Lead | |||
| AI/ML Lead | |||
| Communications Lead | |||
| Legal/Privacy Lead | |||
| Business Owner | | | |
3.2 Extended Team (As Needed)
| Role | When Engaged | Contact |
|---|---|---|
| Executive Sponsor | Severity 1-2 | |
| Security Team | Security incidents | |
| Ethics Lead | Bias/fairness incidents | |
| External Comms | Media-related incidents | |
| Vendor Contact | Third-party AI issues | |
| OAIC Liaison | Notifiable breaches | |
3.3 RACI Matrix
Key: R = Responsible, A = Accountable, C = Consulted, I = Informed.
| Activity | Commander | Tech Lead | AI/ML Lead | Comms | Legal | Business |
|---|---|---|---|---|---|---|
| Incident declaration | A | R | C | I | I | I |
| Technical triage | I | A | R | I | I | C |
| Containment decision | A | R | R | I | C | C |
| Stakeholder comms | A | C | C | R | C | C |
| Legal/compliance review | I | C | C | I | A | I |
| Recovery decision | A | R | R | I | C | R |
| Post-incident review | A | R | R | C | C | R |
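A common RACI convention is that each activity has exactly one Accountable role and at least one Responsible role. The short Python check below encodes the matrix above as strings (one letter per column, in the column order shown) and reports rows that break either rule; the string encoding and the two rules are assumptions for illustration.

```python
ROLES = ["Commander", "Tech Lead", "AI/ML Lead", "Comms", "Legal", "Business"]

# One letter per role, in ROLES order, copied from the RACI matrix above.
RACI = {
    "Incident declaration":    "A R C I I I",
    "Technical triage":        "I A R I I C",
    "Containment decision":    "A R R I C C",
    "Stakeholder comms":       "A C C R C C",
    "Legal/compliance review": "I C C I A I",
    "Recovery decision":       "A R R I C R",
    "Post-incident review":    "A R R C C R",
}


def check_raci(matrix: dict[str, str]) -> list[str]:
    """Report activities lacking exactly one A or at least one R."""
    problems = []
    for activity, row in matrix.items():
        codes = row.split()
        if len(codes) != len(ROLES):
            problems.append(f"{activity}: expected {len(ROLES)} entries, got {len(codes)}")
            continue
        accountable = codes.count("A")
        if accountable != 1:
            problems.append(f"{activity}: {accountable} Accountable roles (need exactly 1)")
        if "R" not in codes:
            problems.append(f"{activity}: no Responsible role")
    return problems


for problem in check_raci(RACI):
    print(problem)  # prints: Legal/compliance review: no Responsible role
```

Run against the matrix as written, the check reports one finding: Legal/compliance review has an Accountable role but no Responsible role. If the Legal/Privacy Lead is expected to perform that review as well as own it, the matrix could say so explicitly.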
4. Detection and Reporting
4.1 Detection Sources
| Source | Type | Monitoring | Escalation Path |
|---|---|---|---|
| Automated monitoring | System | Real-time alerts | On-call → Tech Lead |
| User complaints | Human | Service desk tickets | Desk → Incident Commander |
| Staff observation | Human | Direct report | Team → AI/ML Lead |
| External report | Human | Public channels | Comms → Incident Commander |
| Audit findings | Process | Periodic audits | Auditor → Business Owner |
| Bias detection | Automated | Regular scans | Alert → Ethics Lead |
4.2 Reporting Procedure
Anyone identifying a potential AI incident should:
- STOP - If safe, stop the AI system from causing further harm
- DOCUMENT - Note what happened, when, and what was affected
- REPORT - Use the incident reporting form or contact:
  - Phone: [Number]
  - Email: [Email]
  - Portal: [Link]
4.3 Initial Report Information
| Field | Required Information |
|---|---|
| Date/time | When incident occurred/was detected |
| Reporter | Name and contact details |
| AI system | System name and identifier |
| Description | What happened |
| Impact | Who/what is affected |
| Actions taken | Immediate steps taken |
| Ongoing | Is incident still occurring? |
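Reports arriving by phone, email, or portal can be checked for completeness before triage starts. A minimal sketch, assuming reports are captured as Python objects; the class and field names are hypothetical and simply mirror the table above, and the sample values are invented for illustration.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Optional


@dataclass
class InitialReport:
    """One field per row of the Initial Report Information table (hypothetical names)."""
    date_time: Optional[datetime]  # When the incident occurred or was detected
    reporter: str                  # Name and contact details
    ai_system: str                 # System name and identifier
    description: str               # What happened
    impact: str                    # Who/what is affected
    actions_taken: str             # Immediate steps taken ("None" if nothing yet)
    ongoing: Optional[bool]        # Is the incident still occurring?

    def missing_fields(self) -> list[str]:
        """Return the names of fields left unset (None) or blank."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None or (isinstance(value, str) and not value.strip()):
                missing.append(f.name)
        return missing


report = InitialReport(
    date_time=datetime(2024, 5, 1, 9, 30),
    reporter="Example reporter, ext. 0000",      # hypothetical sample data
    ai_system="Example triage model (AI-000)",
    description="",
    impact="Applicants routed to the wrong queue",
    actions_taken="Paused automated routing",
    ongoing=None,
)
print(report.missing_fields())  # ['description', 'ongoing']
```

Note that `ongoing=False` counts as an answer; only an unset value is reported as missing.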
5. Response Procedures
5.1 Response Workflow
```mermaid
flowchart LR
    D[Detection] --> T[Triage] --> DC[Declaration] --> C[Containment] --> I[Investigation]
    I --> RC[Root Cause]
    RC --> R[Remediation] --> REC[Recovery] --> CL[Closure] --> L[Learning]
    style D fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style C fill:#ffcc80,stroke:#ef6c00,stroke-width:2px
    style RC fill:#ef9a9a,stroke:#c62828,stroke-width:2px
    style L fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
```
5.2 Phase 1: Triage (First 15 minutes)
Objectives: Assess severity, activate response team
Actions:
- [ ] Review incident report
- [ ] Verify the incident is genuine (not a false positive)
- [ ] Assess initial severity level
- [ ] Determine incident category
- [ ] Identify affected systems and users
- [ ] Notify Incident Commander
- [ ] Document triage findings
Decision Point: Declare incident and severity level
5.3 Phase 2: Containment (Severity-dependent)
Objectives: Stop harm, prevent escalation
Containment Options:
| Option | Description | When to Use | Impact |
|---|---|---|---|
| Do nothing | Continue monitoring | Minor issues, false positives | Minimal |
| Reduce scope | Limit AI to subset of users | Performance issues | Moderate |
| Increase oversight | Add human review | Bias concerns | Moderate |
| Fallback mode | Switch to non-AI process | Serious errors | High |
| Full shutdown | Completely disable AI | Safety/critical issues | Severe |
Containment Actions:
- [ ] Select containment strategy
- [ ] Implement containment measures
- [ ] Verify containment effectiveness
- [ ] Document containment actions
- [ ] Communicate containment status

5.4 Phase 3: Investigation
Objectives: Understand what happened and why
Investigation Activities:
- [ ] Collect and preserve evidence
- [ ] Review system logs
- [ ] Analyze model behavior
- [ ] Review recent changes
- [ ] Interview relevant staff
- [ ] Examine data inputs
- [ ] Check for external factors
Evidence Collection:
| Evidence Type | Source | Collection Method | Retention |
|---|---|---|---|
| System logs | AI platform | Export logs | 90 days |
| Model inputs | Data pipeline | Snapshot data | Per policy |
| Model outputs | Prediction service | Log extraction | 90 days |
| Configuration | Model registry | Version history | Indefinite |
| User reports | Service desk | Export tickets | Per policy |
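To show that preserved evidence has not changed between collection and review, a common practice is to record a cryptographic hash of each exported file at collection time. The sketch below builds such a manifest with SHA-256 using only the Python standard library; the function names and manifest layout are illustrative assumptions, not a prescribed format.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large log exports need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(files: list[Path], collected_by: str) -> dict:
    """Record who collected the files, when (UTC), and each file's size and SHA-256 hash."""
    return {
        "collected_by": collected_by,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "items": [
            {"file": str(p), "bytes": p.stat().st_size, "sha256": sha256_of(p)}
            for p in files
        ],
    }


def save_manifest(manifest: dict, path: Path) -> None:
    """Write the manifest as JSON alongside the evidence it describes."""
    path.write_text(json.dumps(manifest, indent=2))


def verify_manifest(manifest: dict) -> list[str]:
    """Return files whose current hash no longer matches the recorded one."""
    return [
        item["file"]
        for item in manifest["items"]
        if sha256_of(Path(item["file"])) != item["sha256"]
    ]
```

In practice the manifest would be stored where responders cannot edit it (for example, with the incident record), and `verify_manifest` re-run before the evidence is relied on in the post-incident report.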
5.5 Phase 4: Root Cause Analysis
Objectives: Identify underlying cause
Root Cause Categories for AI:
| Category | Examples | Investigation Focus |
|---|---|---|
| Data | Quality issues, drift, poisoning | Data pipeline, sources |
| Model | Underfitting, overfitting, concept drift | Model performance, training |
| Code | Bugs, configuration errors | Code changes, deployments |
| Infrastructure | Capacity, latency, failures | Platform metrics |
| Human | Errors, misuse, insufficient training | Process, training |
| External | Adversarial attack, vendor issues | Security, third parties |
| Design | Inadequate requirements, testing | Design documentation |
5 Whys Analysis:
| Why | Finding |
|---|---|
| 1. Why did the incident occur? | |
| 2. Why did that happen? | |
| 3. Why did that happen? | |
| 4. Why did that happen? | |
| 5. Why did that happen? (Root cause) | |
5.6 Phase 5: Remediation
Objectives: Fix the root cause
Remediation Options:
| Issue Type | Remediation Approach |
|---|---|
| Data quality | Clean data, update pipeline |
| Model performance | Retrain, tune, or replace model |
| Bias detected | Adjust training, add constraints |
| Security vulnerability | Patch, update controls |
| Configuration error | Correct configuration |
| Design flaw | Re-engineer solution |
Remediation Planning:
- [ ] Define remediation actions
- [ ] Assign owners and deadlines
- [ ] Assess remediation risks
- [ ] Plan testing and validation
- [ ] Document remediation plan

5.7 Phase 6: Recovery
Objectives: Restore normal operations
Recovery Steps:
- [ ] Verify remediation complete
- [ ] Test fixed system
- [ ] Plan phased restoration
- [ ] Monitor closely during recovery
- [ ] Confirm normal operations
- [ ] Update stakeholders
Restoration Sequence:
| Stage | Action | Validation | Duration |
|---|---|---|---|
| 1 | Deploy fix to non-prod | Testing passed | |
| 2 | Limited production release | Monitoring clean | |
| 3 | Gradual rollout | Performance normal | |
| 4 | Full restoration | All metrics green | |
5.8 Phase 7: Closure
Objectives: Formally close the incident
Closure Checklist:
- [ ] All remediation actions complete
- [ ] System operating normally
- [ ] Stakeholders informed
- [ ] Documentation complete
- [ ] Lessons learned captured
- [ ] Incident report finalized
- [ ] Closure approved by Incident Commander

6. Communication Protocols
6.1 Internal Communication
Communication Timeline:
| Timeframe | Audience | Message Type |
|---|---|---|
| Immediate | Response team | Incident activation |
| 30 minutes | Business owner | Initial briefing |
| 1 hour (Sev 1-2) | Executive sponsor | Status update |
| Every 2 hours | All stakeholders | Progress update |
| At closure | All stakeholders | Resolution notice |
Communication Channels:
| Audience | Channel | Owner |
|---|---|---|
| Response team | [Team channel/bridge] | Tech Lead |
| Leadership | Email + phone | Incident Commander |
| Affected staff | | Comms Lead |
| All staff | Intranet update | Comms Lead |
6.2 External Communication
External Notification Matrix:
| Stakeholder | Trigger | Timeframe | Owner | Approval |
|---|---|---|---|---|
| Affected individuals | Privacy breach | As soon as practicable | Comms Lead | Privacy Lead |
| OAIC | Notifiable data breach | As soon as practicable; assess a suspected breach within 30 days | Privacy Lead | Executive |
| Minister's office | High-profile incident | Same day | Executive | SES |
| Media | Media inquiry | As needed | Media team | Executive |
| Vendors | Vendor-related issue | As needed | Tech Lead | Business Owner |
6.3 Communication Templates
Initial Notification (Internal):
Status Update:
7. Specific Incident Playbooks
7.1 Bias/Fairness Incident
Indicators:
- Disparate outcomes across demographic groups
- User complaints about unfair treatment
- Bias monitoring alerts
- Audit findings

Response Actions:
1. Immediately add human review to affected decisions
2. Preserve model and data for analysis
3. Engage Ethics Lead
4. Analyze outcomes by protected attributes
5. Quantify scope and impact
6. Consider whether affected decisions need review
7. Plan remediation (retraining, adjustment, replacement)
8. Notify affected individuals if significant harm has occurred

Escalation Triggers:
- Multiple protected groups affected
- Decisions involved significant consequences (benefits, enforcement)
- Media or political interest
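Step 4 (analyze outcomes by protected attributes) often starts with comparing favourable-outcome rates across groups. The sketch below computes each group's rate and its ratio to the best-off group's rate. The 0.8 cut-off is the "four-fifths" rule of thumb from US employment-selection guidance, used here only as an assumed screening threshold; the Ethics Lead would choose the metric and threshold appropriate to the system. The sample data is hypothetical.

```python
from collections import defaultdict


def outcome_rates(records):
    """records: iterable of (group, favourable) pairs -> favourable-outcome rate per group."""
    totals = defaultdict(int)
    favourable = defaultdict(int)
    for group, is_favourable in records:
        totals[group] += 1
        favourable[group] += int(is_favourable)
    return {g: favourable[g] / totals[g] for g in totals}


def disparity_report(records, threshold=0.8):
    """Ratio of each group's rate to the highest group's rate; flag ratios below threshold.

    threshold=0.8 is the 'four-fifths' rule of thumb, an assumed screening value only.
    """
    rates = outcome_rates(records)
    best = max(rates.values())
    report = {}
    for group, rate in rates.items():
        ratio = rate / best if best > 0 else 1.0
        report[group] = {"rate": rate, "ratio": ratio, "flag": ratio < threshold}
    return report


# Hypothetical sample: group A approved 80/100, group B approved 50/100.
sample = ([("A", True)] * 80 + [("A", False)] * 20
          + [("B", True)] * 50 + [("B", False)] * 50)
report = disparity_report(sample)
print([g for g, row in report.items() if row["flag"]])  # ['B']
```

Here group B's rate is 0.625 of group A's, so it is flagged for closer review. A ratio test like this is a screening signal, not a finding of unfairness on its own.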
7.2 Data Breach Involving AI
Indicators:
- Unauthorized access to training data
- Model inversion attack detected
- PII exposed in model outputs
- Data exfiltration alerts

Response Actions:
1. Isolate affected systems
2. Engage Security Team and Privacy Lead
3. Assess data exposed (type, volume, sensitivity)
4. Determine whether it is a notifiable data breach
5. Preserve evidence for investigation
6. Notify OAIC if required (as soon as practicable; complete the assessment of a suspected breach within 30 days)
7. Notify affected individuals
8. Implement additional security controls

Notifiable Data Breach Assessment:
| Question | Answer |
|---|---|
| Personal information involved? | Yes/No |
| Unauthorized access or disclosure? | Yes/No |
| Serious harm likely? | Yes/No |
| Can remedial action prevent harm? | Yes/No |
| Notification required? | Yes/No |
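The assessment table can be read as a single rule: notification is required only when personal information is involved, there was unauthorized access or disclosure, serious harm is likely, and remedial action cannot prevent that harm. A minimal encoding of that reading is sketched below as a triage aid; the Privacy Lead still owns the determination under the External Notification Matrix, and the class name is illustrative.

```python
from dataclasses import dataclass


@dataclass
class BreachAssessment:
    """Answers to the Notifiable Data Breach Assessment questions (True = Yes)."""
    personal_information: bool
    unauthorized_access_or_disclosure: bool
    serious_harm_likely: bool
    remedial_action_prevents_harm: bool

    def notification_required(self) -> bool:
        """Yes only if every answer points to a notifiable breach."""
        return (
            self.personal_information
            and self.unauthorized_access_or_disclosure
            and self.serious_harm_likely
            and not self.remedial_action_prevents_harm
        )


# PII exposed in model outputs, serious harm likely, no remedial action available:
print(BreachAssessment(True, True, True, False).notification_required())  # True
```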
7.3 Model Performance Degradation
Indicators:
- Accuracy metrics below threshold
- Increased prediction errors
- User complaints about quality
- Business outcome deterioration

Response Actions:
1. Assess current performance vs baseline
2. Check for data drift or quality issues
3. Review recent deployments or changes
4. Consider increasing human review
5. Evaluate fallback to previous model version
6. Plan model retraining if needed
7. Confirm performance is back within normal thresholds before resuming full operation
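For step 2 (check for data drift), one widely used summary statistic is the Population Stability Index (PSI), which compares the distribution of a feature or model score in a baseline period with a recent period. The Python sketch below uses equal-width bins taken from the baseline's range. The bin count and the usual rule-of-thumb reading (below 0.1 stable, 0.1 to 0.25 moderate shift, above 0.25 significant shift) are industry conventions assumed here, not thresholds set by this plan; the sample scores are hypothetical.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline sample and a recent sample.

    Bins are equal-width over the baseline's range; values outside it are clamped
    into the edge bins, and eps stops empty bins from producing log(0).
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]      # hypothetical baseline scores
recent = [0.5 + i / 200 for i in range(100)]  # hypothetical recent scores, shifted up
print(round(psi(baseline, recent), 2))
```

PSI only says that a distribution has moved; whether the movement explains the degradation still needs the performance comparison in step 1.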
7.4 Adversarial Attack
Indicators:
- Unusual input patterns
- Attempts to probe model behavior
- Model extraction attempts
- Poisoned data detected

Response Actions:
1. Engage Security Team immediately
2. Block suspicious sources if identifiable
3. Preserve attack evidence
4. Assess model compromise
5. Consider model replacement
6. Implement additional defenses
7. Report to the Australian Cyber Security Centre (ACSC) if significant
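Indicators such as unusual input patterns and model extraction attempts often show up first as one source sending far more queries than normal. The sketch below flags any source that exceeds a query limit within a sliding time window. The class, the default 60-second window, and the 100-query limit are assumptions to tune against the system's normal traffic, and a flagged source is a lead for step 2, not proof of an attack.

```python
from collections import defaultdict, deque


class QueryRateMonitor:
    """Flag sources whose query count inside a sliding time window exceeds a limit."""

    def __init__(self, window_seconds: float = 60.0, max_queries: int = 100):
        self.window = window_seconds
        self.max_queries = max_queries
        self._events: dict[str, deque] = defaultdict(deque)

    def record(self, source: str, timestamp: float) -> bool:
        """Log one query (timestamps in seconds, non-decreasing per source).

        Returns True if the source is now over the limit for the current window.
        """
        events = self._events[source]
        events.append(timestamp)
        # Drop queries that have fallen out of the window.
        while events and events[0] <= timestamp - self.window:
            events.popleft()
        return len(events) > self.max_queries


monitor = QueryRateMonitor(window_seconds=10, max_queries=3)  # small values for illustration
for t in [0, 1, 2, 3]:
    if monitor.record("203.0.113.7", t):
        print(f"t={t}: 203.0.113.7 exceeded 3 queries in 10 s")
```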
7.5 AI Safety Incident
Indicators:
- AI recommendation could cause harm
- Dangerous content generated
- Safety guardrails bypassed
- Unintended real-world consequences

Response Actions:
1. IMMEDIATE: Disable AI system
2. Prevent further harmful actions
3. Engage Executive Sponsor
4. Assess actual harm caused
5. Support affected individuals
6. Conduct thorough safety review before restart
7. Implement enhanced safeguards

8. Documentation Requirements
8.1 Incident Documentation
Incident Record Template:
| Field | Content |
|---|---|
| Incident ID | [Auto-generated] |
| AI System | |
| Category | |
| Severity | |
| Status | |
| Detection Time | |
| Declaration Time | |
| Containment Time | |
| Resolution Time | |
| Closure Time | |
| Incident Commander | |
| Summary | |
| Root Cause | |
| Remediation Actions | |
| Lessons Learned | |
| Related Incidents | |
8.2 Timeline Log
| Time | Action | Actor | Notes |
|---|---|---|---|
8.3 Post-Incident Report
Required Sections:
1. Executive Summary
2. Incident Timeline
3. Impact Assessment
4. Root Cause Analysis
5. Response Evaluation
6. Remediation Actions
7. Lessons Learned
8. Recommendations
9. Appendices (evidence, logs)

9. Post-Incident Review
9.1 Review Process
Timing:
- Severity 1-2: Within 5 business days
- Severity 3-4: Within 10 business days

Participants:
- Incident Response Team
- System owners
- Subject matter experts
- Executive sponsor (Severity 1-2)
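The review deadlines are counted in business days. The sketch below computes a due date by skipping weekends and any public holidays passed in. It assumes the count starts from the day the incident is closed, which this plan does not specify, and holidays must be supplied for the relevant state or territory; the names are illustrative.

```python
from datetime import date, timedelta

# Post-incident review window in business days, by severity (section 9.1).
REVIEW_WINDOW = {1: 5, 2: 5, 3: 10, 4: 10}


def add_business_days(start: date, days: int, holidays: frozenset = frozenset()) -> date:
    """Step forward one calendar day at a time, counting only Mon-Fri non-holidays."""
    current, remaining = start, days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5 and current not in holidays:  # 0-4 = Mon-Fri
            remaining -= 1
    return current


def review_due(severity: int, closed_on: date, holidays: frozenset = frozenset()) -> date:
    """Due date for the post-incident review, counted from the closure date (an assumption)."""
    return add_business_days(closed_on, REVIEW_WINDOW[severity], holidays)


print(review_due(1, date(2024, 5, 3)))  # Friday closure -> 2024-05-10
```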
9.2 Review Agenda
- Incident timeline review
- Response effectiveness assessment
- What went well
- What could be improved
- Root cause confirmation
- Remediation status
- Recommendations development
- Action item assignment
9.3 Improvement Actions
| ID | Improvement | Type | Owner | Due Date | Status |
|---|---|---|---|---|---|
| | | Process/Technical/Training | | | |
10. Testing and Maintenance
10.1 Testing Schedule
| Test Type | Frequency | Last Test | Next Test | Owner |
|---|---|---|---|---|
| Tabletop exercise | Quarterly | |||
| Technical drill | Semi-annual | |||
| Full simulation | Annual | |||
| Contact list verification | Monthly | | | |
10.2 Plan Maintenance
| Activity | Frequency | Owner |
|---|---|---|
| Review contact details | Monthly | Incident Commander |
| Update procedures | Quarterly | Tech Lead |
| Incorporate lessons learned | After each incident | Incident Commander |
| Full plan review | Annual | All stakeholders |
| Training refresh | Annual | Training team |
11. Training Requirements
11.1 Training Matrix
| Role | Training | Frequency | Status |
|---|---|---|---|
| All response team | Incident response basics | Annual | |
| Technical staff | AI incident investigation | Annual | |
| Incident Commander | Incident command | Annual | |
| Communications | Crisis communication | Annual | |
11.2 Exercise Participation
| Name | Role | Last Exercise | Next Required |
|---|---|---|---|
12. Metrics and Reporting
12.1 Incident Metrics
| Metric | Definition | Target | Current |
|---|---|---|---|
| Mean time to detect | Time from occurrence to detection | <30 mins | |
| Mean time to contain | Time from detection to containment | <2 hours | |
| Mean time to resolve | Time from detection to resolution | <24 hours | |
| Incidents per month | Number of AI incidents | <5 | |
| Repeat incident rate | Same root cause within 90 days | <5% | |
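The first three metrics above are averages over incident records, using timestamps like those in the Incident Record Template (section 8.1). A minimal computation is sketched below. It assumes each record also carries an occurrence time (often estimated after the fact, since the template records only detection onward), and it interprets the repeat incident rate as the share of incidents whose root cause matches an incident detected in the preceding 90 days; both are assumptions, and the class, field names, and sample records are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    occurred: datetime   # estimated start of the incident
    detected: datetime
    contained: datetime
    resolved: datetime
    root_cause: str      # a normalised root-cause category, e.g. "data drift"


def _mean_minutes(deltas) -> float:
    return mean(d.total_seconds() / 60 for d in deltas)


def incident_metrics(incidents: list[Incident],
                     repeat_window: timedelta = timedelta(days=90)) -> dict[str, float]:
    """Mean time to detect/contain/resolve (minutes) and repeat rate; needs >= 1 incident."""
    ordered = sorted(incidents, key=lambda i: i.detected)
    repeats = sum(
        1
        for idx, inc in enumerate(ordered)
        if any(prev.root_cause == inc.root_cause
               and inc.detected - prev.detected <= repeat_window
               for prev in ordered[:idx])
    )
    return {
        "mttd_minutes": _mean_minutes(i.detected - i.occurred for i in ordered),
        "mttc_minutes": _mean_minutes(i.contained - i.detected for i in ordered),
        "mttr_minutes": _mean_minutes(i.resolved - i.detected for i in ordered),
        "repeat_rate": repeats / len(ordered),
    }


records = [  # hypothetical sample records
    Incident(datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 20),
             datetime(2024, 5, 1, 10, 20), datetime(2024, 5, 1, 21, 20), "data drift"),
    Incident(datetime(2024, 6, 1, 8, 0), datetime(2024, 6, 1, 8, 40),
             datetime(2024, 6, 1, 10, 40), datetime(2024, 6, 2, 8, 40), "data drift"),
]
print(incident_metrics(records))
# {'mttd_minutes': 30.0, 'mttc_minutes': 90.0, 'mttr_minutes': 1080.0, 'repeat_rate': 0.5}
```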
12.2 Reporting
| Report | Audience | Frequency | Owner |
|---|---|---|---|
| Incident summary | Leadership | Monthly | Incident Commander |
| Trend analysis | Executive | Quarterly | AI/ML Lead |
| Annual review | Board/Audit | Annual | Executive Sponsor |
13. Appendices
Appendix A: Contact List
| Role | Name | Phone | Backup | Email |
|---|---|---|---|---|
| Incident Commander | | | | |
| Technical Lead | | | | |
| AI/ML Lead | | | | |
| Communications Lead | | | | |
| Privacy Lead | | | | |
| Security Team | | | | |
| Executive Sponsor | | | | |
| OAIC | | 1300 363 992 | | enquiries@oaic.gov.au |
Appendix B: Escalation Flowchart
```mermaid
flowchart TB
    DET[Incident Detected] --> TRI[Initial Triage]
    TRI --> S12[Severity 1-2]
    TRI --> S3[Severity 3]
    TRI --> S4[Severity 4]
    S12 --> EXEC[Exec + Full<br/>Response Team]
    S3 --> STD[Standard<br/>Response]
    S4 --> NRM[Normal<br/>Process]
    EXEC --> IMM[Immediate<br/>Containment]
    STD --> HR4[4-hour<br/>Response]
    NRM --> HR24[24-hour<br/>Response]
    style DET fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style S12 fill:#ef9a9a,stroke:#c62828,stroke-width:2px
    style S3 fill:#ffcc80,stroke:#ef6c00,stroke-width:2px
    style S4 fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style IMM fill:#ef9a9a,stroke:#c62828,stroke-width:2px
```
Appendix C: Quick Reference Card
AI INCIDENT QUICK REFERENCE
| Step | Action |
|---|---|
| 1 | STOP harm if safe to do so |
| 2 | REPORT to [contact] |
| 3 | PRESERVE evidence |
| 4 | DOCUMENT what happened |
| 5 | AWAIT instructions |
Key Contacts:
- Emergency: [Number]
- Incident line: [Number]
- After hours: [Number]

Appendix D: Glossary
| Term | Definition |
|---|---|
| Adversarial attack | Deliberately crafted inputs to fool AI systems |
| Concept drift | Change in the relationship between inputs and outputs over time |
| Data drift | Change in input data distribution over time |
| Model inversion | Attack that reconstructs training data, or sensitive attributes of it, from a model's outputs |
| Notifiable data breach | Breach that must be notified to the OAIC and affected individuals under the Privacy Act 1988 (Cth) |
14. Sign-Off
| Role | Name | Signature | Date |
|---|---|---|---|
| AI/ML Lead | |||
| Security Officer | |||
| Privacy Officer | |||
| Business Owner | |||
| Executive Sponsor | | | |