How to Set Up AI System Monitoring¶
Ready to Use
Quick Reference
- Three pillars: System health, model performance, business impact
- Key difference: AI accuracy degrades without code changes
- Watch for: Data drift, concept drift, fairness regression
- Frequency: Real-time for system, daily/weekly for model metrics
Purpose¶
This guide provides practical steps for implementing comprehensive monitoring of AI systems in production, ensuring reliability, fairness, and compliance.
Why AI Monitoring is Different¶
Traditional software monitoring focuses on uptime and response times. AI systems require additional monitoring for:
| Aspect | Why It Matters |
|---|---|
| Model Performance | Accuracy can degrade without code changes |
| Data Drift | Input data patterns change over time |
| Concept Drift | Relationships between inputs and outcomes change |
| Fairness | Bias can emerge or worsen in production |
| Explainability | Explanations should remain valid |
Monitoring Framework¶
The Three Pillars of AI Monitoring¶
flowchart TB
MON["<strong>AI MONITORING</strong>"] --> SYS
MON --> PERF
MON --> BIZ
subgraph SYS["<strong>SYSTEM HEALTH</strong>"]
S1[Uptime]
S2[Latency]
S3[Errors]
S4[Resources]
end
subgraph PERF["<strong>MODEL PERFORMANCE</strong>"]
P1[Accuracy]
P2[Drift]
P3[Fairness]
P4[Explanations]
end
subgraph BIZ["<strong>BUSINESS IMPACT</strong>"]
B1[User satisfaction]
B2[Decision quality]
B3[Outcomes]
B4[Compliance]
end
style MON fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
style SYS fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
style PERF fill:#fff3e0,stroke:#f57c00,stroke-width:2px
style BIZ fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
Step 1: System Health Monitoring¶
Key Metrics¶
| Metric | Target | Alert Threshold |
|---|---|---|
| Availability | 99.9% | < 99.5% |
| Response time (p50) | < 200ms | > 500ms |
| Response time (p99) | < 1s | > 2s |
| Error rate | < 0.1% | > 1% |
| CPU utilization | < 70% | > 85% |
| Memory utilization | < 80% | > 90% |
| GPU utilization (if used) | < 80% | > 95% |
Implementation¶
import time
from prometheus_client import Counter, Histogram, Gauge
import logging
# Define metrics
PREDICTION_COUNT = Counter(
'ai_predictions_total',
'Total number of predictions',
['model_name', 'model_version']
)
PREDICTION_LATENCY = Histogram(
'ai_prediction_latency_seconds',
'Prediction latency in seconds',
['model_name'],
buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)
PREDICTION_ERRORS = Counter(
'ai_prediction_errors_total',
'Total prediction errors',
['model_name', 'error_type']
)
MODEL_LOADED = Gauge(
'ai_model_loaded',
'Whether model is loaded and ready',
['model_name', 'model_version']
)
def predict_with_monitoring(model, input_data):
"""Wrapper that adds monitoring to predictions."""
start_time = time.time()
try:
prediction = model.predict(input_data)
# Record success metrics
PREDICTION_COUNT.labels(
model_name=model.name,
model_version=model.version
).inc()
latency = time.time() - start_time
PREDICTION_LATENCY.labels(model_name=model.name).observe(latency)
return prediction
except Exception as e:
PREDICTION_ERRORS.labels(
model_name=model.name,
error_type=type(e).__name__
).inc()
logging.error(f"Prediction error: {e}")
raise
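The wrapper above records metrics but does not expose them. One way to make them scrapeable is `prometheus_client.start_http_server`; the sketch below assumes the same hypothetical `model` object (with `.name`, `.version`, and `.predict()`) used in the wrapper and also sets the `ai_model_loaded` gauge defined above.

```python
# Minimal wiring sketch; the `model` object and port are illustrative.
from prometheus_client import start_http_server

def serve_model(model, port=8000):
    """Expose /metrics for Prometheus and mark the model as loaded."""
    start_http_server(port)  # Prometheus scrapes http://<host>:<port>/metrics
    MODEL_LOADED.labels(
        model_name=model.name,
        model_version=model.version
    ).set(1)

# Request path, once serving is set up:
# serve_model(model)
# prediction = predict_with_monitoring(model, input_data)
```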
Step 2: Model Performance Monitoring¶
Accuracy Monitoring¶
Monitor model accuracy continuously using:
1. Ground truth feedback: When actual outcomes become known
2. Human review samples: Regular expert review of predictions
3. Proxy metrics: Indicators that correlate with accuracy
import pandas as pd
from datetime import datetime, timedelta
class AccuracyMonitor:
"""Monitor model accuracy over time."""
def __init__(self, model_name, alert_threshold=0.05):
self.model_name = model_name
self.baseline_accuracy = None
self.alert_threshold = alert_threshold
self.predictions = []
def log_prediction(self, prediction_id, predicted, probability):
"""Log a prediction for later evaluation."""
self.predictions.append({
'prediction_id': prediction_id,
'timestamp': datetime.now(),
'predicted': predicted,
'probability': probability,
'actual': None
})
def record_actual(self, prediction_id, actual):
"""Record the actual outcome when known."""
for pred in self.predictions:
if pred['prediction_id'] == prediction_id:
pred['actual'] = actual
break
def calculate_accuracy(self, days=7):
"""Calculate accuracy over recent period."""
cutoff = datetime.now() - timedelta(days=days)
recent = [p for p in self.predictions
if p['timestamp'] > cutoff and p['actual'] is not None]
if not recent:
return None
correct = sum(1 for p in recent if p['predicted'] == p['actual'])
accuracy = correct / len(recent)
return accuracy
def check_degradation(self):
"""Check if accuracy has degraded significantly."""
current = self.calculate_accuracy(days=7)
baseline = self.baseline_accuracy
if current is None or baseline is None:
return None
degradation = baseline - current
if degradation > self.alert_threshold:
return {
'alert': True,
'message': f'Accuracy degraded by {degradation:.2%}',
'baseline': baseline,
'current': current
}
return {'alert': False, 'current': current}
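A short usage sketch for `AccuracyMonitor` (the model name, prediction IDs, labels, and baseline value are illustrative): set the baseline from offline evaluation, log predictions as they are served, record actuals when ground truth arrives, and check for degradation on a schedule.

```python
# Illustrative usage; IDs, labels, and the baseline value are hypothetical.
monitor = AccuracyMonitor(model_name="credit_risk_model")
monitor.baseline_accuracy = 0.91  # from offline/validation evaluation

# At serving time
monitor.log_prediction(prediction_id="req-001", predicted=1, probability=0.83)

# Later, when the true outcome becomes known
monitor.record_actual(prediction_id="req-001", actual=1)

# On a daily/weekly schedule
status = monitor.check_degradation()
if status and status["alert"]:
    print(status["message"])  # e.g. "Accuracy degraded by 6.00%"
```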
Drift Monitoring¶
from scipy import stats
import numpy as np
class DriftMonitor:
"""Monitor for data and concept drift."""
def __init__(self, reference_data, feature_names):
self.reference_data = reference_data
self.feature_names = feature_names
self.reference_stats = self._compute_stats(reference_data)
def _compute_stats(self, data):
"""Compute distribution statistics."""
stats_dict = {}
for i, feature in enumerate(self.feature_names):
if isinstance(data, np.ndarray):
col = data[:, i]
else:
col = data[feature]
stats_dict[feature] = {
'mean': np.mean(col),
'std': np.std(col),
'min': np.min(col),
'max': np.max(col),
'distribution': col
}
return stats_dict
def detect_drift(self, current_data, threshold=0.05):
"""Detect drift using statistical tests."""
current_stats = self._compute_stats(current_data)
drift_results = {}
for feature in self.feature_names:
ref_dist = self.reference_stats[feature]['distribution']
cur_dist = current_stats[feature]['distribution']
# Kolmogorov-Smirnov test
ks_stat, p_value = stats.ks_2samp(ref_dist, cur_dist)
drift_detected = p_value < threshold
drift_results[feature] = {
'drift_detected': drift_detected,
'ks_statistic': ks_stat,
'p_value': p_value,
'mean_shift': (current_stats[feature]['mean'] -
self.reference_stats[feature]['mean'])
}
return drift_results
def get_drift_summary(self, drift_results):
"""Summarize drift detection results."""
drifted_features = [f for f, r in drift_results.items()
if r['drift_detected']]
return {
'total_features': len(self.feature_names),
'drifted_features': len(drifted_features),
'drift_percentage': len(drifted_features) / len(self.feature_names),
'features_with_drift': drifted_features
}
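A usage sketch with synthetic NumPy data (feature names are illustrative): the second feature's mean is shifted in the "current" sample, so the KS test should flag it.

```python
# Synthetic example: the second feature drifts upward in production data.
rng = np.random.default_rng(42)
reference = rng.normal(loc=[0.0, 5.0], scale=1.0, size=(1000, 2))
current = rng.normal(loc=[0.0, 6.5], scale=1.0, size=(1000, 2))

drift_monitor = DriftMonitor(reference, feature_names=["age", "income_score"])
results = drift_monitor.detect_drift(current)
print(drift_monitor.get_drift_summary(results))
# "income_score" should be flagged as drifted; "age" usually should not be.
```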
Step 3: Fairness Monitoring¶
Continuous Fairness Tracking¶
import pandas as pd
from datetime import datetime, timedelta
class FairnessMonitor:
"""Monitor fairness metrics in production."""
def __init__(self, protected_attributes, alert_threshold=0.8):
self.protected_attributes = protected_attributes
self.alert_threshold = alert_threshold # 80% rule
self.predictions_log = []
def log_prediction(self, prediction, protected_values):
"""Log prediction with protected attribute values."""
entry = {
'timestamp': datetime.now(),
'prediction': prediction,
**{attr: val for attr, val in
zip(self.protected_attributes, protected_values)}
}
self.predictions_log.append(entry)
def calculate_disparities(self, days=7):
"""Calculate fairness disparities over recent period."""
cutoff = datetime.now() - timedelta(days=days)
recent = [p for p in self.predictions_log if p['timestamp'] > cutoff]
if len(recent) < 100:
return None # Not enough data
df = pd.DataFrame(recent)
results = {}
for attr in self.protected_attributes:
groups = df.groupby(attr)['prediction'].mean()
if len(groups) < 2:
continue
min_rate = groups.min()
max_rate = groups.max()
disparity = min_rate / max_rate if max_rate > 0 else 1.0
results[attr] = {
'disparity_ratio': disparity,
'alert': disparity < self.alert_threshold,
'group_rates': groups.to_dict()
}
return results
def generate_fairness_report(self):
"""Generate fairness monitoring report."""
disparities = self.calculate_disparities()
if disparities is None:
return "Insufficient data for fairness analysis"
report = ["Fairness Monitoring Report", "=" * 40]
for attr, results in disparities.items():
status = "ALERT" if results['alert'] else "OK"
report.append(f"\n{attr}: {status}")
report.append(f" Disparity ratio: {results['disparity_ratio']:.3f}")
report.append(" Group rates:")
for group, rate in results['group_rates'].items():
report.append(f" {group}: {rate:.3f}")
return "\n".join(report)
Step 4: Alerting Configuration¶
Alert Hierarchy¶
flowchart TB
subgraph CRIT["<strong>CRITICAL</strong> - Immediate response (page on-call)"]
C1[Model failing >10% error rate]
C2[System down or unresponsive]
C3[Major fairness violation]
end
subgraph HIGH["<strong>HIGH</strong> - Response within 1 hour (Slack + email)"]
H1[Accuracy degradation >5%]
H2[Sustained high latency >2s]
H3[Drift in critical features]
end
subgraph MED["<strong>MEDIUM</strong> - Response within 1 day (email)"]
M1[Minor accuracy drop 2-5%]
M2[Elevated error rate 1-5%]
M3[Drift in non-critical features]
end
subgraph LOW["<strong>LOW</strong> - Weekly review (dashboard)"]
L1[Performance trending down]
L2[Resource utilization rising]
L3[Minor fairness changes]
end
CRIT --> HIGH --> MED --> LOW
style CRIT fill:#ef9a9a,stroke:#c62828,stroke-width:2px
style HIGH fill:#ffcc80,stroke:#ef6c00,stroke-width:2px
style MED fill:#fff9c4,stroke:#f9a825,stroke-width:2px
style LOW fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
Alert Configuration Example¶
# alerting_config.py
ALERT_RULES = [
{
'name': 'model_error_rate_critical',
'metric': 'ai_prediction_errors_total / ai_predictions_total',
'threshold': 0.10,
'operator': '>',
'severity': 'critical',
'window': '5m',
'message': 'Model error rate exceeds 10%'
},
{
'name': 'accuracy_degradation',
'metric': 'baseline_accuracy - current_accuracy',
'threshold': 0.05,
'operator': '>',
'severity': 'high',
'window': '24h',
'message': 'Model accuracy has dropped by more than 5%'
},
{
'name': 'fairness_violation',
'metric': 'min(group_rates) / max(group_rates)',
'threshold': 0.80,
'operator': '<',
'severity': 'high',
'window': '7d',
'message': 'Fairness disparity exceeds threshold'
},
{
'name': 'latency_high',
'metric': 'ai_prediction_latency_seconds_p99',
'threshold': 2.0,
'operator': '>',
'severity': 'high',
'window': '10m',
'message': 'P99 latency exceeds 2 seconds'
},
{
'name': 'data_drift_detected',
'metric': 'drift_features_percentage',
'threshold': 0.20,
'operator': '>',
'severity': 'medium',
'window': '24h',
'message': 'Data drift detected in >20% of features'
}
]
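`ALERT_RULES` is plain configuration data; something still has to evaluate it against live metric values and route notifications. A minimal evaluator sketch, assuming hypothetical `get_metric_value()` and `notify()` functions standing in for the metrics backend and paging tool:

```python
# Minimal sketch; get_metric_value() and notify() are hypothetical stand-ins
# for your metrics backend (e.g. Prometheus queries) and alerting tool.
import operator

OPERATORS = {'>': operator.gt, '<': operator.lt}

def evaluate_alert_rules(rules, get_metric_value, notify):
    """Fire a notification for every rule whose condition currently holds."""
    for rule in rules:
        value = get_metric_value(rule['metric'], window=rule['window'])
        if value is None:
            continue  # metric not available yet
        if OPERATORS[rule['operator']](value, rule['threshold']):
            notify(
                severity=rule['severity'],
                name=rule['name'],
                message=f"{rule['message']} (value={value:.3f})"
            )

# Run on a schedule, e.g. every minute:
# evaluate_alert_rules(ALERT_RULES, get_metric_value, notify)
```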
Step 5: Dashboard Design¶
Key Dashboard Panels¶
The dashboard should include these key panels:
| Panel | Metrics | Visual Type |
|---|---|---|
| Health Status | Model status, Error rate, Latency, Fairness | Status indicators |
| Accuracy Over Time | Daily/weekly accuracy trends | Line chart |
| Predictions/Hour | Throughput and volume | Bar chart |
| Fairness by Group | Outcome rates by demographic | Horizontal bar |
| Feature Drift Status | Drift detection per feature | Status list |
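The Health Status panel can be populated directly from the monitors defined in earlier steps. A minimal sketch follows (function and field names are illustrative; error rate and latency would come from the Prometheus metrics in Step 1), and the layout mockup below shows how the panels fit together.

```python
# Illustrative only: assembles Health Status panel values from earlier monitors.
def build_health_status(model_loaded, error_rate, p50_latency_ms, fairness_monitor):
    """Return the values shown in the dashboard's Health Status panel."""
    disparities = fairness_monitor.calculate_disparities(days=7) or {}
    # Note: insufficient fairness data is treated as PASS in this sketch.
    fairness_ok = all(not r['alert'] for r in disparities.values())
    return {
        'model_status': 'UP' if model_loaded else 'DOWN',
        'error_rate': f"{error_rate:.2%}",
        'latency': f"{p50_latency_ms:.0f}ms",
        'fairness': 'PASS' if fairness_ok else 'ALERT',
    }
```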
flowchart TB
subgraph DASH["<strong>AI MODEL DASHBOARD</strong>"]
direction TB
subgraph HEALTH["Health Status"]
direction LR
H1["✓ Model UP"]
H2["Error: 0.02%"]
H3["Latency: 180ms"]
H4["Fairness: PASS"]
end
subgraph METRICS["Performance Metrics"]
direction LR
M1["Accuracy Trend"]
M2["Predictions/Hour"]
end
subgraph FAIR["Fairness & Drift"]
direction LR
F1["Group Rates"]
F2["Drift Status"]
end
end
style DASH fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
style HEALTH fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
style METRICS fill:#fff3e0,stroke:#f57c00,stroke-width:2px
style FAIR fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
Step 6: Incident Response¶
AI-Specific Incident Playbook¶
## AI Incident Response Playbook
### 1. Detection
- Automated alert triggered
- User reports unexpected behavior
- Quality review identifies issues
### 2. Initial Assessment (within 15 minutes)
- [ ] Confirm the issue is real (not false positive)
- [ ] Determine impact scope (users affected, predictions affected)
- [ ] Assess severity level
- [ ] Notify stakeholders per severity
### 3. Containment (within 1 hour for critical)
Options:
a) **Disable model**: Fall back to rules-based or human processing (see the fallback sketch after this playbook)
b) **Rollback**: Revert to previous model version
c) **Adjust thresholds**: Modify decision boundaries
d) **Add human review**: Route predictions for manual review
### 4. Investigation
- [ ] Review recent predictions for patterns
- [ ] Check for data drift
- [ ] Review recent changes (data, model, infrastructure)
- [ ] Identify root cause
### 5. Resolution
- [ ] Implement fix
- [ ] Test fix in staging
- [ ] Deploy fix with monitoring
- [ ] Verify resolution
### 6. Post-Incident
- [ ] Complete incident report
- [ ] Conduct blameless post-mortem
- [ ] Update monitoring and alerts
- [ ] Update documentation
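Containment option (a) is easiest when a fallback path is already wired into the serving code. A minimal sketch, assuming a hypothetical `rules_based_decision()` fallback and a simple `MODEL_DISABLED` flag, and reusing `predict_with_monitoring` from Step 1:

```python
# Minimal containment sketch; MODEL_DISABLED and rules_based_decision()
# are hypothetical stand-ins for a feature flag and a rules-based fallback.
MODEL_DISABLED = False  # flip via config/feature flag during an incident

def predict_with_fallback(model, input_data):
    """Route to the rules-based fallback when the model is disabled or failing."""
    if MODEL_DISABLED:
        return rules_based_decision(input_data)
    try:
        return predict_with_monitoring(model, input_data)
    except Exception:
        # Errors are already counted and logged by the monitoring wrapper;
        # degrade gracefully instead of failing the request.
        return rules_based_decision(input_data)
```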
Monitoring Checklist¶
Setup¶
- Instrument model serving code with metrics
- Configure metric collection (Prometheus, CloudWatch, etc.)
- Create monitoring dashboards
- Set up alerting rules
- Establish baseline metrics
- Configure log aggregation
Ongoing Operations¶
- Daily: Check dashboard for anomalies
- Weekly: Review accuracy trends
- Monthly: Full fairness audit
- Quarterly: Drift analysis and model review
Documentation¶
- Monitoring architecture documented
- Alert runbooks created
- Incident response playbook ready
- Escalation paths defined
Tools and Technologies¶
Metrics & Monitoring¶
- Prometheus: Metric collection
- Grafana: Dashboards
- CloudWatch: AWS monitoring
- Datadog: Full observability platform
ML-Specific Monitoring¶
- Evidently AI: Data and model monitoring
- WhyLabs: ML observability
- Arize: ML observability platform
- MLflow: Model tracking and monitoring
Alerting¶
- PagerDuty: Incident management
- Opsgenie: Alert management
- Slack: Team notifications