How to Set Up AI System Monitoring

Quick Reference
  • Three pillars: System health, model performance, business impact
  • Key difference: AI accuracy degrades without code changes
  • Watch for: Data drift, concept drift, fairness regression
  • Frequency: Real-time for system, daily/weekly for model metrics

Purpose

This guide provides practical steps for implementing comprehensive monitoring of AI systems in production, ensuring reliability, fairness, and compliance.


Why AI Monitoring Is Different

Traditional software monitoring focuses on uptime and response times. AI systems require additional monitoring for:

| Aspect | Why It Matters |
| --- | --- |
| Model Performance | Accuracy can degrade without code changes |
| Data Drift | Input data patterns change over time |
| Concept Drift | Relationships between inputs and outcomes change |
| Fairness | Bias can emerge or worsen in production |
| Explainability | Explanations should remain valid |

Monitoring Framework

The Three Pillars of AI Monitoring

flowchart TB
    MON["<strong>AI MONITORING</strong>"] --> SYS
    MON --> PERF
    MON --> BIZ

    subgraph SYS["<strong>SYSTEM HEALTH</strong>"]
        S1[Uptime]
        S2[Latency]
        S3[Errors]
        S4[Resources]
    end

    subgraph PERF["<strong>MODEL PERFORMANCE</strong>"]
        P1[Accuracy]
        P2[Drift]
        P3[Fairness]
        P4[Explanations]
    end

    subgraph BIZ["<strong>BUSINESS IMPACT</strong>"]
        B1[User satisfaction]
        B2[Decision quality]
        B3[Outcomes]
        B4[Compliance]
    end

    style MON fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style SYS fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style PERF fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style BIZ fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px

Step 1: System Health Monitoring

Key Metrics

| Metric | Target | Alert Threshold |
| --- | --- | --- |
| Availability | 99.9% | < 99.5% |
| Response time (p50) | < 200ms | > 500ms |
| Response time (p99) | < 1s | > 2s |
| Error rate | < 0.1% | > 1% |
| CPU utilization | < 70% | > 85% |
| Memory utilization | < 80% | > 90% |
| GPU utilization (if used) | < 80% | > 95% |

Implementation

import time
from prometheus_client import Counter, Histogram, Gauge
import logging

# Define metrics
PREDICTION_COUNT = Counter(
    'ai_predictions_total',
    'Total number of predictions',
    ['model_name', 'model_version']
)

PREDICTION_LATENCY = Histogram(
    'ai_prediction_latency_seconds',
    'Prediction latency in seconds',
    ['model_name'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

PREDICTION_ERRORS = Counter(
    'ai_prediction_errors_total',
    'Total prediction errors',
    ['model_name', 'error_type']
)

MODEL_LOADED = Gauge(
    'ai_model_loaded',
    'Whether model is loaded and ready',
    ['model_name', 'model_version']
)
# Set to 1 once the model is loaded (and back to 0 on unload), e.g.:
# MODEL_LOADED.labels(model_name=model.name, model_version=model.version).set(1)

def predict_with_monitoring(model, input_data):
    """Wrapper that adds monitoring to predictions."""
    start_time = time.time()

    try:
        prediction = model.predict(input_data)

        # Record success metrics
        PREDICTION_COUNT.labels(
            model_name=model.name,
            model_version=model.version
        ).inc()

        latency = time.time() - start_time
        PREDICTION_LATENCY.labels(model_name=model.name).observe(latency)

        return prediction

    except Exception as e:
        PREDICTION_ERRORS.labels(
            model_name=model.name,
            error_type=type(e).__name__
        ).inc()
        logging.error(f"Prediction error: {e}")
        raise
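
To make these metrics available to Prometheus, the serving process has to expose them. The sketch below uses prometheus_client's built-in HTTP endpoint; the port and the load_model() helper are illustrative assumptions, not part of a prescribed serving stack.

from prometheus_client import start_http_server

def start_model_service():
    """Illustrative bootstrap: load the model and expose /metrics for scraping."""
    model = load_model()  # hypothetical loader returning an object with .name, .version, .predict
    MODEL_LOADED.labels(model_name=model.name, model_version=model.version).set(1)

    # Prometheus scrapes http://<host>:8000/metrics
    start_http_server(8000)
    return model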

Step 2: Model Performance Monitoring

Accuracy Monitoring

Monitor model accuracy continuously using:

1. Ground truth feedback: When actual outcomes become known
2. Human review samples: Regular expert review of predictions
3. Proxy metrics: Indicators that correlate with accuracy

import pandas as pd
from datetime import datetime, timedelta

class AccuracyMonitor:
    """Monitor model accuracy over time."""

    def __init__(self, model_name, alert_threshold=0.05):
        self.model_name = model_name
        # Baseline accuracy from offline validation; set this before calling
        # check_degradation(), e.g. monitor.baseline_accuracy = 0.92
        self.baseline_accuracy = None
        self.alert_threshold = alert_threshold
        self.predictions = []

    def log_prediction(self, prediction_id, predicted, probability):
        """Log a prediction for later evaluation."""
        self.predictions.append({
            'prediction_id': prediction_id,
            'timestamp': datetime.now(),
            'predicted': predicted,
            'probability': probability,
            'actual': None
        })

    def record_actual(self, prediction_id, actual):
        """Record the actual outcome when known."""
        for pred in self.predictions:
            if pred['prediction_id'] == prediction_id:
                pred['actual'] = actual
                break

    def calculate_accuracy(self, days=7):
        """Calculate accuracy over recent period."""
        cutoff = datetime.now() - timedelta(days=days)

        recent = [p for p in self.predictions
                  if p['timestamp'] > cutoff and p['actual'] is not None]

        if not recent:
            return None

        correct = sum(1 for p in recent if p['predicted'] == p['actual'])
        accuracy = correct / len(recent)

        return accuracy

    def check_degradation(self):
        """Check if accuracy has degraded significantly."""
        current = self.calculate_accuracy(days=7)
        baseline = self.baseline_accuracy

        if current is None or baseline is None:
            return None

        degradation = baseline - current

        if degradation > self.alert_threshold:
            return {
                'alert': True,
                'message': f'Accuracy degraded by {degradation:.2%}',
                'baseline': baseline,
                'current': current
            }

        return {'alert': False, 'current': current}
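
A usage sketch of the monitor above; the model name, baseline value, and prediction IDs are illustrative.

monitor = AccuracyMonitor(model_name='credit_risk_model', alert_threshold=0.05)
monitor.baseline_accuracy = 0.91  # illustrative value from offline validation

# At prediction time
monitor.log_prediction(prediction_id='req-001', predicted=1, probability=0.83)

# Later, once the true outcome is known
monitor.record_actual(prediction_id='req-001', actual=1)

# On a daily or weekly schedule
result = monitor.check_degradation()
if result and result['alert']:
    print(result['message'])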

Drift Monitoring

from scipy import stats
import numpy as np

class DriftMonitor:
    """Monitor for data and concept drift."""

    def __init__(self, reference_data, feature_names):
        self.reference_data = reference_data
        self.feature_names = feature_names
        self.reference_stats = self._compute_stats(reference_data)

    def _compute_stats(self, data):
        """Compute distribution statistics."""
        stats_dict = {}
        for i, feature in enumerate(self.feature_names):
            if isinstance(data, np.ndarray):
                col = data[:, i]
            else:
                col = data[feature]

            stats_dict[feature] = {
                'mean': np.mean(col),
                'std': np.std(col),
                'min': np.min(col),
                'max': np.max(col),
                'distribution': col
            }
        return stats_dict

    def detect_drift(self, current_data, threshold=0.05):
        """Detect drift using statistical tests."""
        current_stats = self._compute_stats(current_data)
        drift_results = {}

        for feature in self.feature_names:
            ref_dist = self.reference_stats[feature]['distribution']
            cur_dist = current_stats[feature]['distribution']

            # Kolmogorov-Smirnov test
            ks_stat, p_value = stats.ks_2samp(ref_dist, cur_dist)

            drift_detected = p_value < threshold

            drift_results[feature] = {
                'drift_detected': drift_detected,
                'ks_statistic': ks_stat,
                'p_value': p_value,
                'mean_shift': (current_stats[feature]['mean'] -
                              self.reference_stats[feature]['mean'])
            }

        return drift_results

    def get_drift_summary(self, drift_results):
        """Summarize drift detection results."""
        drifted_features = [f for f, r in drift_results.items()
                          if r['drift_detected']]

        return {
            'total_features': len(self.feature_names),
            'drifted_features': len(drifted_features),
            'drift_percentage': len(drifted_features) / len(self.feature_names),
            'features_with_drift': drifted_features
        }
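
A usage sketch, assuming a reference (training-time) sample and a current (production) sample of the same features; the feature names and synthetic data are illustrative.

import numpy as np

rng = np.random.default_rng(42)
feature_names = ['age', 'income', 'tenure_months']  # illustrative features

# Reference sample from training time vs. a recent production window
reference = rng.normal(loc=[40, 55000, 36], scale=[10, 12000, 18], size=(5000, 3))
current = rng.normal(loc=[43, 51000, 36], scale=[10, 12000, 18], size=(1000, 3))

drift_monitor = DriftMonitor(reference, feature_names)
drift_results = drift_monitor.detect_drift(current, threshold=0.05)
summary = drift_monitor.get_drift_summary(drift_results)

print(f"Drift in {summary['drifted_features']} of {summary['total_features']} features")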

Step 3: Fairness Monitoring

Continuous Fairness Tracking

import pandas as pd
from datetime import datetime, timedelta

class FairnessMonitor:
    """Monitor fairness metrics in production."""

    def __init__(self, protected_attributes, alert_threshold=0.8):
        self.protected_attributes = protected_attributes
        self.alert_threshold = alert_threshold  # 80% rule
        self.predictions_log = []

    def log_prediction(self, prediction, protected_values):
        """Log prediction with protected attribute values."""
        entry = {
            'timestamp': datetime.now(),
            'prediction': prediction,
            **{attr: val for attr, val in
               zip(self.protected_attributes, protected_values)}
        }
        self.predictions_log.append(entry)

    def calculate_disparities(self, days=7):
        """Calculate fairness disparities over recent period."""
        cutoff = datetime.now() - timedelta(days=days)
        recent = [p for p in self.predictions_log if p['timestamp'] > cutoff]

        if len(recent) < 100:
            return None  # Not enough data

        df = pd.DataFrame(recent)
        results = {}

        for attr in self.protected_attributes:
            groups = df.groupby(attr)['prediction'].mean()

            if len(groups) < 2:
                continue

            min_rate = groups.min()
            max_rate = groups.max()
            disparity = min_rate / max_rate if max_rate > 0 else 1.0

            results[attr] = {
                'disparity_ratio': disparity,
                'alert': disparity < self.alert_threshold,
                'group_rates': groups.to_dict()
            }

        return results

    def generate_fairness_report(self):
        """Generate fairness monitoring report."""
        disparities = self.calculate_disparities()

        if disparities is None:
            return "Insufficient data for fairness analysis"

        report = ["Fairness Monitoring Report", "=" * 40]

        for attr, results in disparities.items():
            status = "ALERT" if results['alert'] else "OK"
            report.append(f"\n{attr}: {status}")
            report.append(f"  Disparity ratio: {results['disparity_ratio']:.3f}")
            report.append("  Group rates:")
            for group, rate in results['group_rates'].items():
                report.append(f"    {group}: {rate:.3f}")

        return "\n".join(report)

Step 4: Alerting Configuration

Alert Hierarchy

flowchart TB
    subgraph CRIT["<strong>CRITICAL</strong> - Immediate response (page on-call)"]
        C1[Model failing >10% error rate]
        C2[System down or unresponsive]
        C3[Major fairness violation]
    end

    subgraph HIGH["<strong>HIGH</strong> - Response within 1 hour (Slack + email)"]
        H1[Accuracy degradation >5%]
        H2[Sustained high latency >2s]
        H3[Drift in critical features]
    end

    subgraph MED["<strong>MEDIUM</strong> - Response within 1 day (email)"]
        M1[Minor accuracy drop 2-5%]
        M2[Elevated error rate 1-5%]
        M3[Drift in non-critical features]
    end

    subgraph LOW["<strong>LOW</strong> - Weekly review (dashboard)"]
        L1[Performance trending down]
        L2[Resource utilization rising]
        L3[Minor fairness changes]
    end

    CRIT --> HIGH --> MED --> LOW

    style CRIT fill:#ef9a9a,stroke:#c62828,stroke-width:2px
    style HIGH fill:#ffcc80,stroke:#ef6c00,stroke-width:2px
    style MED fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style LOW fill:#c8e6c9,stroke:#388e3c,stroke-width:2px

Alert Configuration Example

# alerting_config.py

ALERT_RULES = [
    {
        'name': 'model_error_rate_critical',
        'metric': 'ai_prediction_errors_total / ai_predictions_total',
        'threshold': 0.10,
        'operator': '>',
        'severity': 'critical',
        'window': '5m',
        'message': 'Model error rate exceeds 10%'
    },
    {
        'name': 'accuracy_degradation',
        'metric': 'baseline_accuracy - current_accuracy',
        'threshold': 0.05,
        'operator': '>',
        'severity': 'high',
        'window': '24h',
        'message': 'Model accuracy has dropped by more than 5%'
    },
    {
        'name': 'fairness_violation',
        'metric': 'min(group_rates) / max(group_rates)',
        'threshold': 0.80,
        'operator': '<',
        'severity': 'high',
        'window': '7d',
        'message': 'Fairness disparity exceeds threshold'
    },
    {
        'name': 'latency_high',
        'metric': 'ai_prediction_latency_seconds_p99',
        'threshold': 2.0,
        'operator': '>',
        'severity': 'high',
        'window': '10m',
        'message': 'P99 latency exceeds 2 seconds'
    },
    {
        'name': 'data_drift_detected',
        'metric': 'drift_features_percentage',
        'threshold': 0.20,
        'operator': '>',
        'severity': 'medium',
        'window': '24h',
        'message': 'Data drift detected in >20% of features'
    }
]
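
These rules are declarative; how they are evaluated depends on the alerting backend (Prometheus Alertmanager, Datadog monitors, etc.). Below is a minimal in-process evaluator sketch, assuming current metric values arrive as a dict keyed by each rule's metric expression.

import operator

_OPERATORS = {'>': operator.gt, '<': operator.lt}

def evaluate_rules(metric_values, rules=ALERT_RULES):
    """Return the rules that fire for the given metric values (illustrative evaluator)."""
    fired = []
    for rule in rules:
        value = metric_values.get(rule['metric'])
        if value is None:
            continue  # metric not available in this evaluation window
        if _OPERATORS[rule['operator']](value, rule['threshold']):
            fired.append({
                'name': rule['name'],
                'severity': rule['severity'],
                'message': rule['message'],
                'value': value
            })
    return fired

# Example with an illustrative latency reading
for alert in evaluate_rules({'ai_prediction_latency_seconds_p99': 2.4}):
    print(f"[{alert['severity'].upper()}] {alert['message']} (value={alert['value']})")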

Step 5: Dashboard Design

Key Dashboard Panels

The dashboard should include these key panels:

| Panel | Metrics | Visual Type |
| --- | --- | --- |
| Health Status | Model status, Error rate, Latency, Fairness | Status indicators |
| Accuracy Over Time | Daily/weekly accuracy trends | Line chart |
| Predictions/Hour | Throughput and volume | Bar chart |
| Fairness by Group | Outcome rates by demographic | Horizontal bar |
| Feature Drift Status | Drift detection per feature | Status list |

flowchart TB
    subgraph DASH["<strong>AI MODEL DASHBOARD</strong>"]
        direction TB
        subgraph HEALTH["Health Status"]
            direction LR
            H1["✓ Model UP"]
            H2["Error: 0.02%"]
            H3["Latency: 180ms"]
            H4["Fairness: PASS"]
        end

        subgraph METRICS["Performance Metrics"]
            direction LR
            M1["Accuracy Trend"]
            M2["Predictions/Hour"]
        end

        subgraph FAIR["Fairness & Drift"]
            direction LR
            F1["Group Rates"]
            F2["Drift Status"]
        end
    end

    style DASH fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style HEALTH fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style METRICS fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style FAIR fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
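
A sketch of how the Health Status panel values could be assembled from the monitors above. In practice a dashboard tool such as Grafana would read these directly from the metrics backend, so the function and thresholds here are illustrative only.

def health_status_panel(model_loaded, error_rate, p50_latency_ms, fairness_results):
    """Assemble the values shown in the Health Status panel (illustrative)."""
    if fairness_results is None:
        fairness = 'UNKNOWN'  # not enough logged predictions yet
    else:
        fairness = 'PASS' if all(not r['alert'] for r in fairness_results.values()) else 'ALERT'

    return {
        'model': 'UP' if model_loaded else 'DOWN',
        'error_rate': f"{error_rate:.2%}",
        'latency': f"{p50_latency_ms:.0f}ms",
        'fairness': fairness
    }

# Example with illustrative values
print(health_status_panel(model_loaded=True, error_rate=0.0002,
                          p50_latency_ms=180, fairness_results=None))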

Step 6: Incident Response

AI-Specific Incident Playbook

## AI Incident Response Playbook

### 1. Detection
- Automated alert triggered
- User reports unexpected behavior
- Quality review identifies issues

### 2. Initial Assessment (within 15 minutes)
- [ ] Confirm the issue is real (not false positive)
- [ ] Determine impact scope (users affected, predictions affected)
- [ ] Assess severity level
- [ ] Notify stakeholders per severity

### 3. Containment (within 1 hour for critical)
Options (see the routing sketch after this playbook):
a) **Disable model**: Fall back to rules-based or human processing
b) **Rollback**: Revert to previous model version
c) **Adjust thresholds**: Modify decision boundaries
d) **Add human review**: Route predictions for manual review

### 4. Investigation
- [ ] Review recent predictions for patterns
- [ ] Check for data drift
- [ ] Review recent changes (data, model, infrastructure)
- [ ] Identify root cause

### 5. Resolution
- [ ] Implement fix
- [ ] Test fix in staging
- [ ] Deploy fix with monitoring
- [ ] Verify resolution

### 6. Post-Incident
- [ ] Complete incident report
- [ ] Conduct blameless post-mortem
- [ ] Update monitoring and alerts
- [ ] Update documentation
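
The containment options above usually reduce to a routing decision in the serving path. A minimal kill-switch/fallback sketch follows; the flag variable and the rules_based_decision() fallback are illustrative assumptions, not a prescribed implementation.

MODEL_ENABLED = True  # in practice, read from a feature flag or config service

def rules_based_decision(input_data):
    """Illustrative conservative fallback used while the model is disabled."""
    return {'decision': 'needs_human_review', 'source': 'fallback_rules'}

def predict_with_fallback(model, input_data):
    """Containment option (a): route around the model when it is disabled or failing."""
    if not MODEL_ENABLED:
        return rules_based_decision(input_data)
    try:
        return {'decision': model.predict(input_data), 'source': 'model'}
    except Exception:
        # Degrade gracefully rather than surfacing the error to the caller
        return rules_based_decision(input_data)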

Monitoring Checklist

Setup

  • Instrument model serving code with metrics
  • Configure metric collection (Prometheus, CloudWatch, etc.)
  • Create monitoring dashboards
  • Set up alerting rules
  • Establish baseline metrics
  • Configure log aggregation

Ongoing Operations

  • Daily: Check dashboard for anomalies
  • Weekly: Review accuracy trends
  • Monthly: Full fairness audit
  • Quarterly: Drift analysis and model review

Documentation

  • Monitoring architecture documented
  • Alert runbooks created
  • Incident response playbook ready
  • Escalation paths defined

Tools and Technologies

Metrics & Monitoring

  • Prometheus: Metric collection
  • Grafana: Dashboards
  • CloudWatch: AWS monitoring
  • Datadog: Full observability platform

ML-Specific Monitoring

  • Evidently AI: Data and model monitoring
  • WhyLabs: ML observability
  • Arize: ML observability platform
  • MLflow: Model tracking and monitoring

Alerting

  • PagerDuty: Incident management
  • Opsgenie: Alert management
  • Slack: Team notifications