
AI Incident Response Plan


Purpose: Establish a structured framework for responding to incidents involving AI systems. Covers AI-specific incident types, response procedures, communication protocols, and recovery processes.
At a Glance
  • When to prepare: Before going live with any AI system
  • Key roles: Incident Lead, Technical Lead, Communications Lead
  • Response phases: Detect → Contain → Remediate → Recover → Learn
  • Related journey: Respond to an Incident

AI Incidents Are Different

AI incidents can escalate rapidly (media attention, political scrutiny) and may involve hard-to-explain model behaviour. Prepare before you need this plan.


Document Control

| Field | Value |
|-------|-------|
| AI System Name | |
| Version | 1.0 |
| Author | |
| Date | |
| Status | Draft / Approved / Active |
| Next Review Date | |
| Incident Response Lead | |

1. Scope and Objectives

1.1 Scope

This plan covers incidents involving:

- [ ] AI/ML model failures
- [ ] Biased or unfair outcomes
- [ ] Data quality issues affecting AI
- [ ] AI security breaches
- [ ] Privacy violations from AI
- [ ] AI-generated misinformation
- [ ] Adversarial attacks on AI
- [ ] Unintended harmful AI behaviours

AI Systems Covered:

| System ID | System Name | Criticality | Owner |
|-----------|-------------|--------------|-------|
| | | High/Med/Low | |

1.2 Objectives

  1. Detect - Rapidly identify AI incidents
  2. Contain - Limit impact and prevent escalation
  3. Communicate - Notify stakeholders appropriately
  4. Remediate - Fix the root cause
  5. Recover - Restore normal operations
  6. Learn - Improve systems and processes

2. AI Incident Classification

2.1 Incident Categories

| Category | Description | Examples |
|----------|-------------|----------|
| Model Performance | AI not performing as expected | Accuracy degradation, prediction errors |
| Bias/Fairness | AI producing unfair outcomes | Discrimination against protected groups |
| Data Incident | Data issues affecting AI | Data breach, poisoning, quality failure |
| Security | Security threats to AI | Adversarial attacks, model theft |
| Privacy | Privacy violations from AI | PII exposure, re-identification |
| Operational | AI system availability issues | Outages, latency, scaling failures |
| Safety | AI causing or risking harm | Dangerous recommendations, safety failures |
| Ethics | AI ethical principle violations | Transparency failures, consent issues |

2.2 Severity Levels

| Level | Name | Description | Response Time | Examples |
|-------|------|-------------|---------------|----------|
| 1 | Critical | Significant harm occurring or imminent | 15 minutes | Data breach, safety incident, widespread bias |
| 2 | High | Serious impact on services or individuals | 1 hour | Major accuracy failure, privacy violation |
| 3 | Medium | Moderate impact, workaround available | 4 hours | Performance degradation, limited bias |
| 4 | Low | Minor impact, limited scope | 24 hours | Minor bugs, localised issues |

2.3 Severity Assessment Matrix

| Factor | Critical (1) | High (2) | Medium (3) | Low (4) |
|--------|--------------|----------|------------|---------|
| Affected users | >1000 or vulnerable | 100-1000 | 10-100 | <10 |
| Harm potential | Serious harm likely | Harm possible | Inconvenience | Minimal |
| Legal exposure | Breach notification | Compliance risk | Minor issue | None |
| Media risk | National coverage | Local coverage | Possible interest | None |
| Recovery time | >24 hours | 4-24 hours | 1-4 hours | <1 hour |
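
Teams that want to automate triage can encode the matrix directly. A minimal sketch in Python, assuming a "worst factor wins" rule (the factor names mirror the matrix; the rule itself is an illustrative choice, not a requirement of this plan):

```python
# Minimal severity scorer based on the assessment matrix above.
# Assumption: overall severity is the worst (lowest-numbered) factor rating.

FACTORS = ["affected_users", "harm_potential", "legal_exposure",
           "media_risk", "recovery_time"]

def rate_affected_users(count: int, vulnerable: bool) -> int:
    """Map affected-user count to a 1-4 rating per the matrix."""
    if vulnerable or count > 1000:
        return 1
    if count > 100:
        return 2
    if count > 10:
        return 3
    return 4

def overall_severity(ratings: dict[str, int]) -> int:
    """Severity 1 (Critical) .. 4 (Low); the worst factor wins."""
    return min(ratings.get(f, 4) for f in FACTORS)

ratings = {"affected_users": rate_affected_users(250, vulnerable=False),
           "harm_potential": 3, "legal_exposure": 4,
           "media_risk": 3, "recovery_time": 3}
print(overall_severity(ratings))  # -> 2 (High): 100-1000 users affected
```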

3. Incident Response Team

3.1 Core Team Roles

| Role | Primary | Backup | Contact |
|------|---------|--------|---------|
| Incident Commander | | | |
| Technical Lead | | | |
| AI/ML Lead | | | |
| Communications Lead | | | |
| Legal/Privacy Lead | | | |
| Business Owner | | | |

3.2 Extended Team (As Needed)

| Role | When Engaged | Contact |
|------|--------------|---------|
| Executive Sponsor | Severity 1-2 | |
| Security Team | Security incidents | |
| Ethics Lead | Bias/fairness incidents | |
| External Comms | Media-related incidents | |
| Vendor Contact | Third-party AI issues | |
| OAIC Liaison | Notifiable breaches | |

3.3 RACI Matrix

| Activity | Commander | Tech Lead | AI/ML Lead | Comms | Legal | Business |
|----------|-----------|-----------|------------|-------|-------|----------|
| Incident declaration | A | R | C | I | I | I |
| Technical triage | I | A | R | I | I | C |
| Containment decision | A | R | R | I | C | C |
| Stakeholder comms | A | C | C | R | C | C |
| Legal/compliance review | I | C | C | I | A | I |
| Recovery decision | A | R | R | I | C | R |
| Post-incident review | A | R | R | C | C | R |

4. Detection and Reporting

4.1 Detection Sources

| Source | Type | Monitoring | Escalation Path |
|--------|------|------------|-----------------|
| Automated monitoring | System | Real-time alerts | On-call → Tech Lead |
| User complaints | Human | Service desk tickets | Desk → Incident Commander |
| Staff observation | Human | Direct report | Team → AI/ML Lead |
| External report | Human | Public channels | Comms → Incident Commander |
| Audit findings | Process | Periodic audits | Auditor → Business Owner |
| Bias detection | Automated | Regular scans | Alert → Ethics Lead |
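
Automated monitoring is usually the earliest detection source. A minimal sketch of a rolling-accuracy alert (the window size, threshold, and alert message are illustrative placeholders to be set from your own baseline and service levels):

```python
from collections import deque

# Rolling accuracy monitor: raises an alert when the window mean
# drops below the agreed threshold. Window size and threshold are
# illustrative; derive them from your model's baseline and SLOs.
WINDOW, THRESHOLD = 500, 0.90

window: deque[int] = deque(maxlen=WINDOW)

def record_prediction(correct: bool) -> str | None:
    """Record one labelled outcome; return an alert string if breached."""
    window.append(int(correct))
    if len(window) == WINDOW:
        accuracy = sum(window) / WINDOW
        if accuracy < THRESHOLD:
            return (f"ALERT: rolling accuracy {accuracy:.3f} < {THRESHOLD}"
                    " - escalate to on-call")
    return None
```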

4.2 Reporting Procedure

Anyone identifying a potential AI incident should:

  1. STOP - If safe, stop the AI system from causing further harm
  2. DOCUMENT - Note what happened, when, and what was affected
  3. REPORT - Use the incident reporting form or contact:
     - Phone: [Number]
     - Email: [Email]
     - Portal: [Link]

4.3 Initial Report Information

| Field | Required Information |
|-------|----------------------|
| Date/time | When incident occurred/was detected |
| Reporter | Name and contact details |
| AI system | System name and identifier |
| Description | What happened |
| Impact | Who/what is affected |
| Actions taken | Immediate steps taken |
| Ongoing | Is incident still occurring? |

5. Response Procedures

5.1 Response Workflow

```mermaid
flowchart LR
    D[Detection] --> T[Triage] --> DC[Declaration] --> C[Containment] --> I[Investigation]
    I --> RC[Root Cause]
    RC --> R[Remediation] --> REC[Recovery] --> CL[Closure] --> L[Learning]

    style D fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style C fill:#ffcc80,stroke:#ef6c00,stroke-width:2px
    style RC fill:#ef9a9a,stroke:#c62828,stroke-width:2px
    style L fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
```

5.2 Phase 1: Triage (First 15 minutes)

Objectives: Assess severity, activate response team

Actions:

- [ ] Review incident report
- [ ] Verify incident is genuine (not a false positive)
- [ ] Assess initial severity level
- [ ] Determine incident category
- [ ] Identify affected systems and users
- [ ] Notify Incident Commander
- [ ] Document triage findings

Decision Point: Declare incident and severity level

5.3 Phase 2: Containment (Severity-dependent)

Objectives: Stop harm, prevent escalation

Containment Options:

| Option | Description | When to Use | Impact |
|--------|-------------|-------------|--------|
| Do nothing | Continue monitoring | Minor issues, false positives | Minimal |
| Reduce scope | Limit AI to subset of users | Performance issues | Moderate |
| Increase oversight | Add human review | Bias concerns | Moderate |
| Fallback mode | Switch to non-AI process | Serious errors | High |
| Full shutdown | Completely disable AI | Safety/critical issues | Severe |

Containment Actions:

- [ ] Select containment strategy
- [ ] Implement containment measures
- [ ] Verify containment effectiveness
- [ ] Document containment actions
- [ ] Communicate containment status
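
Containment is fastest when these options are pre-wired as a runtime switch rather than an emergency deployment. A minimal sketch, assuming a mode flag that the serving path checks on every request (mode names mirror the table above; the flag store and routing rules are illustrative):

```python
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"
    REDUCED_SCOPE = "reduced_scope"   # limit AI to a subset of users
    HUMAN_REVIEW = "human_review"     # AI suggests, a human decides
    FALLBACK = "fallback"             # non-AI process
    SHUTDOWN = "shutdown"             # AI fully disabled

CURRENT_MODE = Mode.NORMAL  # in practice, read from a shared flag store

def handle_request(user_id: str, ai_predict, manual_process):
    """Route one request according to the active containment mode."""
    if CURRENT_MODE in (Mode.SHUTDOWN, Mode.FALLBACK):
        return manual_process(user_id)
    if CURRENT_MODE is Mode.REDUCED_SCOPE and hash(user_id) % 10 != 0:
        return manual_process(user_id)  # only ~10% of users stay on AI
    result = ai_predict(user_id)
    if CURRENT_MODE is Mode.HUMAN_REVIEW:
        result = {"suggestion": result, "requires_human_approval": True}
    return result
```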

5.4 Phase 3: Investigation

Objectives: Understand what happened and why

Investigation Activities:

- [ ] Collect and preserve evidence
- [ ] Review system logs
- [ ] Analyse model behaviour
- [ ] Review recent changes
- [ ] Interview relevant staff
- [ ] Examine data inputs
- [ ] Check for external factors

Evidence Collection:

| Evidence Type | Source | Collection Method | Retention |
|---------------|--------|-------------------|-----------|
| System logs | AI platform | Export logs | 90 days |
| Model inputs | Data pipeline | Snapshot data | Per policy |
| Model outputs | Prediction service | Log extraction | 90 days |
| Configuration | Model registry | Version history | Indefinite |
| User reports | Service desk | Export tickets | Per policy |
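
Preserved evidence is only useful if its integrity can later be demonstrated. A minimal sketch of writing a hash manifest for collected artefacts (file paths and manifest format are illustrative):

```python
import hashlib
import json
import pathlib
from datetime import datetime, timezone

def preserve_evidence(paths: list[str],
                      manifest_path: str = "evidence_manifest.json") -> None:
    """Record a SHA-256 hash and collection timestamp for each evidence
    file so integrity can be verified during investigation and review."""
    manifest = []
    for p in paths:
        data = pathlib.Path(p).read_bytes()
        manifest.append({
            "file": p,
            "sha256": hashlib.sha256(data).hexdigest(),
            "collected_at": datetime.now(timezone.utc).isoformat(),
        })
    pathlib.Path(manifest_path).write_text(json.dumps(manifest, indent=2))

# Example (hypothetical paths):
# preserve_evidence(["logs/predictions.log", "config/model_v3.yaml"])
```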

5.5 Phase 4: Root Cause Analysis

Objectives: Identify underlying cause

Root Cause Categories for AI:

| Category | Examples | Investigation Focus |
|----------|----------|---------------------|
| Data | Quality issues, drift, poisoning | Data pipeline, sources |
| Model | Underfitting, overfitting, concept drift | Model performance, training |
| Code | Bugs, configuration errors | Code changes, deployments |
| Infrastructure | Capacity, latency, failures | Platform metrics |
| Human | Errors, misuse, insufficient training | Process, training |
| External | Adversarial attack, vendor issues | Security, third parties |
| Design | Inadequate requirements, testing | Design documentation |

5 Whys Analysis:

| Why | Finding |
|-----|---------|
| 1. Why did the incident occur? | |
| 2. Why did that happen? | |
| 3. Why did that happen? | |
| 4. Why did that happen? | |
| 5. Why did that happen? (Root cause) | |

5.6 Phase 5: Remediation

Objectives: Fix the root cause

Remediation Options:

| Issue Type | Remediation Approach |
|------------|----------------------|
| Data quality | Clean data, update pipeline |
| Model performance | Retrain, tune, or replace model |
| Bias detected | Adjust training, add constraints |
| Security vulnerability | Patch, update controls |
| Configuration error | Correct configuration |
| Design flaw | Re-engineer solution |

Remediation Planning:

- [ ] Define remediation actions
- [ ] Assign owners and deadlines
- [ ] Assess remediation risks
- [ ] Plan testing and validation
- [ ] Document remediation plan

5.7 Phase 6: Recovery

Objectives: Restore normal operations

Recovery Steps:

- [ ] Verify remediation complete
- [ ] Test fixed system
- [ ] Plan phased restoration
- [ ] Monitor closely during recovery
- [ ] Confirm normal operations
- [ ] Update stakeholders

Restoration Sequence:

| Stage | Action | Validation | Duration |
|-------|--------|------------|----------|
| 1 | Deploy fix to non-prod | Testing passed | |
| 2 | Limited production release | Monitoring clean | |
| 3 | Gradual rollout | Performance normal | |
| 4 | Full restoration | All metrics green | |
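
Where restoration is automated, the sequence above can be expressed as gated stages. A minimal sketch, assuming a `validate` callback that wraps your monitoring checks (stage names and traffic shares are illustrative):

```python
# Staged restoration: each stage must pass its validation gate before
# the traffic share increases. Shares and gate logic are illustrative.

STAGES = [
    ("non-prod deploy", 0.0),
    ("limited release", 0.05),
    ("gradual rollout", 0.50),
    ("full restoration", 1.00),
]

def restore(validate) -> bool:
    """Advance through stages; stop (and hold traffic) on a failed gate.
    `validate(stage_name) -> bool` wraps your monitoring checks."""
    for name, traffic_share in STAGES:
        print(f"Stage '{name}': routing {traffic_share:.0%} of traffic")
        if not validate(name):
            print(f"Gate failed at '{name}' - halting restoration")
            return False
    return True

# Example with a stub gate that always passes:
restore(lambda stage: True)
```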

5.8 Phase 7: Closure

Objectives: Formally close the incident

Closure Checklist:

- [ ] All remediation actions complete
- [ ] System operating normally
- [ ] Stakeholders informed
- [ ] Documentation complete
- [ ] Lessons learned captured
- [ ] Incident report finalised
- [ ] Closure approved by Incident Commander


6. Communication Protocols

6.1 Internal Communication

Communication Timeline:

| Timeframe | Audience | Message Type |
|-----------|----------|--------------|
| Immediate | Response team | Incident activation |
| 30 minutes | Business owner | Initial briefing |
| 1 hour (Sev 1-2) | Executive sponsor | Status update |
| Every 2 hours | All stakeholders | Progress update |
| At closure | All stakeholders | Resolution notice |

Communication Channels:

| Audience | Channel | Owner |
|----------|---------|-------|
| Response team | [Team channel/bridge] | Tech Lead |
| Leadership | Email + phone | Incident Commander |
| Affected staff | Email | Comms Lead |
| All staff | Intranet update | Comms Lead |

6.2 External Communication

External Notification Matrix:

| Stakeholder | Trigger | Timeframe | Owner | Approval |
|-------------|---------|-----------|-------|----------|
| Affected individuals | Privacy breach | As soon as practicable | Comms Lead | Privacy Lead |
| OAIC | Notifiable data breach | As soon as practicable (assessment within 30 days) | Privacy Lead | Executive |
| Minister's office | High-profile incident | Same day | Executive | SES |
| Media | Media inquiry | As needed | Media team | Executive |
| Vendors | Vendor-related issue | As needed | Tech Lead | Business Owner |

6.3 Communication Templates

Initial Notification (Internal):

**AI INCIDENT ALERT - [SEVERITY LEVEL]**

| Field | Value |
|-------|-------|
| **System** | [AI System Name] |
| **Time** | [Detection Time] |
| **Status** | [Active/Contained/Resolved] |
| **Summary** | [Brief description] |
| **Impact** | [Who/what is affected] |
| **Actions** | [Current response actions] |
| **Next Update** | [Time] |
| **Contact** | [Incident Commander contact] |

Status Update:

**AI INCIDENT UPDATE - [System] - [Update #]**

| Field | Value |
|-------|-------|
| **Current Status** | [Active/Contained/In Recovery] |

**Since Last Update:**
- [Action taken]
- [Finding]
- [Progress]

**Current Focus:**
- [Current priority]

**Expected Next Steps:**
- [Planned action]
- [Estimated timeline]

**Next Update:** [Time]

7. Specific Incident Playbooks

7.1 Bias/Fairness Incident

Indicators:

- Disparate outcomes across demographic groups
- User complaints about unfair treatment
- Bias monitoring alerts
- Audit findings

Response Actions:

1. Immediately add human review to affected decisions
2. Preserve model and data for analysis
3. Engage Ethics Lead
4. Analyse outcomes by protected attributes (a disparity-check sketch follows this playbook)
5. Quantify scope and impact
6. Consider whether affected decisions need review
7. Plan remediation (retraining, adjustment, replacement)
8. Notify affected individuals if significant harm

Escalation Triggers:

- Multiple protected groups affected
- Decisions involved significant consequences (benefits, enforcement)
- Media or political interest
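
To quantify disparate outcomes (step 4 above), a common first check is the selection-rate ratio between the worst- and best-treated groups, often tested against the four-fifths heuristic. A minimal sketch; the group labels, data shape, and 0.8 threshold are illustrative conventions, not requirements of this plan:

```python
# Selection-rate disparity check across demographic groups.
# outcomes: list of (group, favourable_decision) pairs.

def disparity_ratio(outcomes: list[tuple[str, bool]]) -> dict:
    """Return per-group selection rates and the min/max rate ratio."""
    totals: dict[str, int] = {}
    favourable: dict[str, int] = {}
    for group, selected in outcomes:
        totals[group] = totals.get(group, 0) + 1
        favourable[group] = favourable.get(group, 0) + int(selected)
    rates = {g: favourable[g] / totals[g] for g in totals}
    ratio = min(rates.values()) / max(rates.values())
    # Four-fifths heuristic: flag if the worst group's rate is
    # below 80% of the best group's rate.
    return {"rates": rates, "ratio": ratio, "flag": ratio < 0.8}

sample = ([("A", True)] * 60 + [("A", False)] * 40 +
          [("B", True)] * 40 + [("B", False)] * 60)
print(disparity_ratio(sample))  # ratio 0.4/0.6 ≈ 0.67 -> flagged
```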

7.2 Data Breach Involving AI

Indicators:

- Unauthorised access to training data
- Model inversion attack detected
- PII exposed in model outputs
- Data exfiltration alerts

Response Actions:

1. Isolate affected systems
2. Engage Security Team and Privacy Lead
3. Assess data exposed (type, volume, sensitivity)
4. Determine if it is a notifiable data breach (assessment within 30 days)
5. Preserve evidence for investigation
6. Notify OAIC as soon as practicable if required
7. Notify affected individuals
8. Implement additional security controls

Notifiable Data Breach Assessment:

| Question | Answer |
|----------|--------|
| Personal information involved? | Yes/No |
| Unauthorised access or disclosure? | Yes/No |
| Serious harm likely? | Yes/No |
| Can remedial action prevent harm? | Yes/No |
| Notification required? | Yes/No |

7.3 Model Performance Degradation

Indicators:

- Accuracy metrics below threshold
- Increased prediction errors
- User complaints about quality
- Business outcome deterioration

Response Actions:

1. Assess current performance vs baseline
2. Check for data drift or quality issues (a drift-check sketch follows below)
3. Review recent deployments or changes
4. Consider increasing human review
5. Evaluate fallback to previous model version
6. Plan model retraining if needed
7. Restore normal thresholds before full operation
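
For step 2, data drift can be quantified with the Population Stability Index (PSI) between a baseline and a current sample of a feature. A minimal sketch; the bin count and the common 0.2 alert threshold are conventions, not mandated here:

```python
import math

def psi(baseline: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature.
    Rule of thumb: <0.1 stable, 0.1-0.2 moderate shift, >0.2 investigate."""
    lo, hi = min(baseline), max(baseline)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]

    def frac(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            i = 0
            # out-of-range values are clamped into the first/last bin
            while i < bins - 1 and x >= edges[i + 1]:
                i += 1
            counts[i] += 1
        # floor empty bins at a tiny value to avoid log(0)
        return [max(c / len(sample), 1e-6) for c in counts]

    b, c = frac(baseline), frac(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

# Example: a clearly shifted current sample produces PSI well above 0.2.
base = [i / 100 for i in range(100)]
curr = [0.5 + i / 200 for i in range(100)]
print(f"PSI = {psi(base, curr):.2f}")
```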

7.4 Adversarial Attack

Indicators:

- Unusual input patterns
- Attempts to probe model behaviour
- Model extraction attempts
- Poisoned data detected

Response Actions:

1. Engage Security Team immediately
2. Block suspicious sources if identifiable
3. Preserve attack evidence
4. Assess model compromise
5. Consider model replacement
6. Implement additional defences
7. Report to the Australian Cyber Security Centre if significant

7.5 AI Safety Incident

Indicators:

- AI recommendation could cause harm
- Dangerous content generated
- Safety guardrails bypassed
- Unintended real-world consequences

Response Actions:

1. IMMEDIATE: Disable AI system
2. Prevent further harmful actions
3. Engage executive sponsor
4. Assess actual harm caused
5. Support affected individuals
6. Conduct thorough safety review before restart
7. Implement enhanced safeguards


8. Documentation Requirements

8.1 Incident Documentation

Incident Record Template:

| Field | Content |
|-------|---------|
| Incident ID | [Auto-generated] |
| AI System | |
| Category | |
| Severity | |
| Status | |
| Detection Time | |
| Declaration Time | |
| Containment Time | |
| Resolution Time | |
| Closure Time | |
| Incident Commander | |
| Summary | |
| Root Cause | |
| Remediation Actions | |
| Lessons Learned | |
| Related Incidents | |
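
Where incidents are tracked in code or via an API rather than a register, the template maps naturally onto a structured record. A minimal sketch (field names follow the template; the class itself is an illustrative assumption):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class IncidentRecord:
    """Structured incident record mirroring the template fields."""
    incident_id: str
    ai_system: str
    category: str          # e.g. "Bias/Fairness" (see section 2.1)
    severity: int          # 1 (Critical) .. 4 (Low)
    status: str = "Open"
    detection_time: datetime | None = None
    declaration_time: datetime | None = None
    containment_time: datetime | None = None
    resolution_time: datetime | None = None
    closure_time: datetime | None = None
    incident_commander: str = ""
    summary: str = ""
    root_cause: str = ""
    remediation_actions: list[str] = field(default_factory=list)
    lessons_learned: list[str] = field(default_factory=list)
    related_incidents: list[str] = field(default_factory=list)
```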

8.2 Timeline Log

| Time | Action | Actor | Notes |
|------|--------|-------|-------|
| | | | |

8.3 Post-Incident Report

Required Sections:

1. Executive Summary
2. Incident Timeline
3. Impact Assessment
4. Root Cause Analysis
5. Response Evaluation
6. Remediation Actions
7. Lessons Learned
8. Recommendations
9. Appendices (evidence, logs)


9. Post-Incident Review

9.1 Review Process

Timing:

- Severity 1-2: Within 5 business days
- Severity 3-4: Within 10 business days

Participants:

- Incident Response Team
- System owners
- Subject matter experts
- Executive sponsor (Severity 1-2)

9.2 Review Agenda

  1. Incident timeline review
  2. Response effectiveness assessment
  3. What went well
  4. What could be improved
  5. Root cause confirmation
  6. Remediation status
  7. Recommendations development
  8. Action item assignment

9.3 Improvement Actions

| ID | Improvement | Type | Owner | Due Date | Status |
|----|-------------|------|-------|----------|--------|
| | | Process/Technical/Training | | | |

10. Testing and Maintenance

10.1 Testing Schedule

| Test Type | Frequency | Last Test | Next Test | Owner |
|-----------|-----------|-----------|-----------|-------|
| Tabletop exercise | Quarterly | | | |
| Technical drill | Semi-annual | | | |
| Full simulation | Annual | | | |
| Contact list verification | Monthly | | | |

10.2 Plan Maintenance

| Activity | Frequency | Owner |
|----------|-----------|-------|
| Review contact details | Monthly | Incident Commander |
| Update procedures | Quarterly | Tech Lead |
| Incorporate lessons learned | After each incident | Incident Commander |
| Full plan review | Annual | All stakeholders |
| Training refresh | Annual | Training team |

11. Training Requirements

11.1 Training Matrix

| Role | Training | Frequency | Status |
|------|----------|-----------|--------|
| All response team | Incident response basics | Annual | |
| Technical staff | AI incident investigation | Annual | |
| Incident Commander | Incident command | Annual | |
| Communications | Crisis communication | Annual | |

11.2 Exercise Participation

| Name | Role | Last Exercise | Next Required |
|------|------|---------------|---------------|
| | | | |

12. Metrics and Reporting

12.1 Incident Metrics

| Metric | Definition | Target | Current |
|--------|------------|--------|---------|
| Mean time to detect | Time from occurrence to detection | <30 mins | |
| Mean time to contain | Time from detection to containment | <2 hours | |
| Mean time to resolve | Time from detection to resolution | <24 hours | |
| Incidents per month | Number of AI incidents | <5 | |
| Repeat incident rate | Same root cause within 90 days | <5% | |
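
These metrics fall directly out of the timestamps captured in the incident record (section 8.1). A minimal sketch of computing them across closed incidents (the tuple layout is an illustrative assumption):

```python
from datetime import datetime, timedelta

# Each closed incident as (occurred, detected, contained, resolved).
Incident = tuple[datetime, datetime, datetime, datetime]

def mean_times(incidents: list[Incident]) -> dict[str, timedelta]:
    """Mean time to detect / contain / resolve, per the definitions above:
    detect = occurrence to detection; contain and resolve run from detection."""
    n = len(incidents)
    mttd = sum(((d - o) for o, d, _, _ in incidents), timedelta()) / n
    mttc = sum(((c - d) for _, d, c, _ in incidents), timedelta()) / n
    mttr = sum(((r - d) for _, d, _, r in incidents), timedelta()) / n
    return {"detect": mttd, "contain": mttc, "resolve": mttr}

inc = (datetime(2025, 1, 1, 9, 0), datetime(2025, 1, 1, 9, 20),
       datetime(2025, 1, 1, 10, 30), datetime(2025, 1, 1, 15, 0))
print(mean_times([inc]))  # detect 0:20, contain 1:10, resolve 5:40
```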

12.2 Reporting

| Report | Audience | Frequency | Owner |
|--------|----------|-----------|-------|
| Incident summary | Leadership | Monthly | Incident Commander |
| Trend analysis | Executive | Quarterly | AI Lead |
| Annual review | Board/Audit | Annual | Executive Sponsor |

13. Appendices

Appendix A: Contact List

| Role | Name | Phone | Email | Backup |
|------|------|-------|-------|--------|
| Incident Commander | | | | |
| Technical Lead | | | | |
| AI/ML Lead | | | | |
| Communications Lead | | | | |
| Privacy Lead | | | | |
| Security Team | | | | |
| Executive Sponsor | | | | |
| OAIC | | 1300 363 992 | enquiries@oaic.gov.au | |

Appendix B: Escalation Flowchart

```mermaid
flowchart TB
    DET[Incident Detected] --> TRI[Initial Triage]
    TRI --> S12[Severity 1-2]
    TRI --> S3[Severity 3]
    TRI --> S4[Severity 4]

    S12 --> EXEC[Exec + Full<br/>Response Team]
    S3 --> STD[Standard<br/>Response]
    S4 --> NRM[Normal<br/>Process]

    EXEC --> IMM[Immediate<br/>Containment]
    STD --> HR4[4-hour<br/>Response]
    NRM --> HR24[24-hour<br/>Response]

    style DET fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style S12 fill:#ef9a9a,stroke:#c62828,stroke-width:2px
    style S3 fill:#ffcc80,stroke:#ef6c00,stroke-width:2px
    style S4 fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style IMM fill:#ef9a9a,stroke:#c62828,stroke-width:2px
```

Appendix C: Quick Reference Card

AI INCIDENT QUICK REFERENCE

| Step | Action |
|------|--------|
| 1 | STOP harm if safe to do so |
| 2 | REPORT to [contact] |
| 3 | PRESERVE evidence |
| 4 | DOCUMENT what happened |
| 5 | AWAIT instructions |

Key Contacts:

- Emergency: [Number]
- Incident line: [Number]
- After hours: [Number]

Appendix D: Glossary

| Term | Definition |
|------|------------|
| Adversarial attack | Deliberately crafted inputs designed to fool AI systems |
| Concept drift | Change in the relationship between inputs and outputs over time |
| Data drift | Change in the input data distribution over time |
| Model inversion | Attack that extracts training data from a model |
| Notifiable data breach | Breach requiring OAIC notification under the Privacy Act |

14. Sign-Off

| Role | Name | Signature | Date |
|------|------|-----------|------|
| AI/ML Lead | | | |
| Security Officer | | | |
| Privacy Officer | | | |
| Business Owner | | | |
| Executive Sponsor | | | |