How to Set Up AI System Monitoring

Quick Reference
  • Three pillars: System health, model performance, business impact
  • Key difference: AI accuracy degrades without code changes
  • Watch for: Data drift, concept drift, fairness regression
  • Frequency: Real-time for system, daily/weekly for model metrics

Purpose

This guide provides practical steps for implementing comprehensive monitoring of AI systems in production, ensuring reliability, fairness, and compliance.


Why AI Monitoring Is Different

Traditional software monitoring focuses on uptime and response times. AI systems require additional monitoring for:

| Aspect | Why It Matters |
| --- | --- |
| Model Performance | Accuracy can degrade without code changes |
| Data Drift | Input data patterns change over time |
| Concept Drift | Relationships between inputs and outcomes change |
| Fairness | Bias can emerge or worsen in production |
| Explainability | Explanations should remain valid |

Monitoring Framework

The Three Pillars of AI Monitoring

flowchart TB
    MON["<strong>AI MONITORING</strong>"] --> SYS
    MON --> PERF
    MON --> BIZ

    subgraph SYS["<strong>SYSTEM HEALTH</strong>"]
        S1[Uptime]
        S2[Latency]
        S3[Errors]
        S4[Resources]
    end

    subgraph PERF["<strong>MODEL PERFORMANCE</strong>"]
        P1[Accuracy]
        P2[Drift]
        P3[Fairness]
        P4[Explanations]
    end

    subgraph BIZ["<strong>BUSINESS IMPACT</strong>"]
        B1[User satisfaction]
        B2[Decision quality]
        B3[Outcomes]
        B4[Compliance]
    end

    style MON fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style SYS fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style PERF fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style BIZ fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px

Step 1: System Health Monitoring

Key Metrics

| Metric | Target | Alert Threshold |
| --- | --- | --- |
| Availability | 99.9% | < 99.5% |
| Response time (p50) | < 200ms | > 500ms |
| Response time (p99) | < 1s | > 2s |
| Error rate | < 0.1% | > 1% |
| CPU utilization | < 70% | > 85% |
| Memory utilization | < 80% | > 90% |
| GPU utilization (if used) | < 80% | > 95% |

Implementation

import time
from prometheus_client import Counter, Histogram, Gauge
import logging

# Define metrics
PREDICTION_COUNT = Counter(
    'ai_predictions_total',
    'Total number of predictions',
    ['model_name', 'model_version']
)

PREDICTION_LATENCY = Histogram(
    'ai_prediction_latency_seconds',
    'Prediction latency in seconds',
    ['model_name'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

PREDICTION_ERRORS = Counter(
    'ai_prediction_errors_total',
    'Total prediction errors',
    ['model_name', 'error_type']
)

MODEL_LOADED = Gauge(
    'ai_model_loaded',
    'Whether model is loaded and ready',
    ['model_name', 'model_version']
)
# Set to 1 once the model is loaded (and back to 0 on unload), e.g.:
# MODEL_LOADED.labels(model_name=model.name, model_version=model.version).set(1)

def predict_with_monitoring(model, input_data):
    """Wrapper that adds monitoring to predictions."""
    start_time = time.time()

    try:
        prediction = model.predict(input_data)

        # Record success metrics
        PREDICTION_COUNT.labels(
            model_name=model.name,
            model_version=model.version
        ).inc()

        latency = time.time() - start_time
        PREDICTION_LATENCY.labels(model_name=model.name).observe(latency)

        return prediction

    except Exception as e:
        PREDICTION_ERRORS.labels(
            model_name=model.name,
            error_type=type(e).__name__
        ).inc()
        logging.error(f"Prediction error: {e}")
        raise
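
To make these metrics available to Prometheus, the serving process has to expose them. The sketch below uses prometheus_client's built-in HTTP endpoint; the port and the load_model() helper are illustrative assumptions, not part of a prescribed serving stack.

from prometheus_client import start_http_server

def start_model_service():
    """Illustrative bootstrap: load the model and expose /metrics for scraping."""
    model = load_model()  # hypothetical loader returning an object with .name, .version, .predict
    MODEL_LOADED.labels(model_name=model.name, model_version=model.version).set(1)

    # Prometheus scrapes http://<host>:8000/metrics
    start_http_server(8000)
    return model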

Step 2: Model Performance Monitoring

Accuracy Monitoring

Monitor model accuracy continuously using:

1. Ground truth feedback: When actual outcomes become known
2. Human review samples: Regular expert review of predictions
3. Proxy metrics: Indicators that correlate with accuracy

import pandas as pd
from datetime import datetime, timedelta

class AccuracyMonitor:
    """Monitor model accuracy over time."""

    def __init__(self, model_name, alert_threshold=0.05):
        self.model_name = model_name
        # Baseline accuracy from offline validation; set this before calling
        # check_degradation(), e.g. monitor.baseline_accuracy = 0.92
        self.baseline_accuracy = None
        self.alert_threshold = alert_threshold
        self.predictions = []

    def log_prediction(self, prediction_id, predicted, probability):
        """Log a prediction for later evaluation."""
        self.predictions.append({
            'prediction_id': prediction_id,
            'timestamp': datetime.now(),
            'predicted': predicted,
            'probability': probability,
            'actual': None
        })

    def record_actual(self, prediction_id, actual):
        """Record the actual outcome when known."""
        for pred in self.predictions:
            if pred['prediction_id'] == prediction_id:
                pred['actual'] = actual
                break

    def calculate_accuracy(self, days=7):
        """Calculate accuracy over recent period."""
        cutoff = datetime.now() - timedelta(days=days)

        recent = [p for p in self.predictions
                  if p['timestamp'] > cutoff and p['actual'] is not None]

        if not recent:
            return None

        correct = sum(1 for p in recent if p['predicted'] == p['actual'])
        accuracy = correct / len(recent)

        return accuracy

    def check_degradation(self):
        """Check if accuracy has degraded significantly."""
        current = self.calculate_accuracy(days=7)
        baseline = self.baseline_accuracy

        if current is None or baseline is None:
            return None

        degradation = baseline - current

        if degradation > self.alert_threshold:
            return {
                'alert': True,
                'message': f'Accuracy degraded by {degradation:.2%}',
                'baseline': baseline,
                'current': current
            }

        return {'alert': False, 'current': current}
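
A usage sketch of the monitor above; the model name, baseline value, and prediction IDs are illustrative.

monitor = AccuracyMonitor(model_name='credit_risk_model', alert_threshold=0.05)
monitor.baseline_accuracy = 0.91  # illustrative value from offline validation

# At prediction time
monitor.log_prediction(prediction_id='req-001', predicted=1, probability=0.83)

# Later, once the true outcome is known
monitor.record_actual(prediction_id='req-001', actual=1)

# On a daily or weekly schedule
result = monitor.check_degradation()
if result and result['alert']:
    print(result['message'])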

Drift Monitoring

from scipy import stats
import numpy as np

class DriftMonitor:
    """Monitor for data and concept drift."""

    def __init__(self, reference_data, feature_names):
        self.reference_data = reference_data
        self.feature_names = feature_names
        self.reference_stats = self._compute_stats(reference_data)

    def _compute_stats(self, data):
        """Compute distribution statistics."""
        stats_dict = {}
        for i, feature in enumerate(self.feature_names):
            if isinstance(data, np.ndarray):
                col = data[:, i]
            else:
                col = data[feature]

            stats_dict[feature] = {
                'mean': np.mean(col),
                'std': np.std(col),
                'min': np.min(col),
                'max': np.max(col),
                'distribution': col
            }
        return stats_dict

    def detect_drift(self, current_data, threshold=0.05):
        """Detect drift using statistical tests."""
        current_stats = self._compute_stats(current_data)
        drift_results = {}

        for feature in self.feature_names:
            ref_dist = self.reference_stats[feature]['distribution']
            cur_dist = current_stats[feature]['distribution']

            # Kolmogorov-Smirnov test
            ks_stat, p_value = stats.ks_2samp(ref_dist, cur_dist)

            drift_detected = p_value < threshold

            drift_results[feature] = {
                'drift_detected': drift_detected,
                'ks_statistic': ks_stat,
                'p_value': p_value,
                'mean_shift': (current_stats[feature]['mean'] -
                              self.reference_stats[feature]['mean'])
            }

        return drift_results

    def get_drift_summary(self, drift_results):
        """Summarize drift detection results."""
        drifted_features = [f for f, r in drift_results.items()
                          if r['drift_detected']]

        return {
            'total_features': len(self.feature_names),
            'drifted_features': len(drifted_features),
            'drift_percentage': len(drifted_features) / len(self.feature_names),
            'features_with_drift': drifted_features
        }
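
A usage sketch, assuming a reference (training-time) sample and a current (production) sample of the same features; the feature names and synthetic data are illustrative.

import numpy as np

rng = np.random.default_rng(42)
feature_names = ['age', 'income', 'tenure_months']  # illustrative features

# Reference sample from training time vs. a recent production window
reference = rng.normal(loc=[40, 55000, 36], scale=[10, 12000, 18], size=(5000, 3))
current = rng.normal(loc=[43, 51000, 36], scale=[10, 12000, 18], size=(1000, 3))

drift_monitor = DriftMonitor(reference, feature_names)
drift_results = drift_monitor.detect_drift(current, threshold=0.05)
summary = drift_monitor.get_drift_summary(drift_results)

print(f"Drift in {summary['drifted_features']} of {summary['total_features']} features")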

Step 3: Fairness Monitoring

Continuous Fairness Tracking

import pandas as pd
from datetime import datetime, timedelta

class FairnessMonitor:
    """Monitor fairness metrics in production."""

    def __init__(self, protected_attributes, alert_threshold=0.8):
        self.protected_attributes = protected_attributes
        self.alert_threshold = alert_threshold  # 80% rule
        self.predictions_log = []

    def log_prediction(self, prediction, protected_values):
        """Log prediction with protected attribute values."""
        entry = {
            'timestamp': datetime.now(),
            'prediction': prediction,
            **{attr: val for attr, val in
               zip(self.protected_attributes, protected_values)}
        }
        self.predictions_log.append(entry)

    def calculate_disparities(self, days=7):
        """Calculate fairness disparities over recent period."""
        cutoff = datetime.now() - timedelta(days=days)
        recent = [p for p in self.predictions_log if p['timestamp'] > cutoff]

        if len(recent) < 100:
            return None  # Not enough data

        df = pd.DataFrame(recent)
        results = {}

        for attr in self.protected_attributes:
            groups = df.groupby(attr)['prediction'].mean()

            if len(groups) < 2:
                continue

            min_rate = groups.min()
            max_rate = groups.max()
            disparity = min_rate / max_rate if max_rate > 0 else 1.0

            results[attr] = {
                'disparity_ratio': disparity,
                'alert': disparity < self.alert_threshold,
                'group_rates': groups.to_dict()
            }

        return results

    def generate_fairness_report(self):
        """Generate fairness monitoring report."""
        disparities = self.calculate_disparities()

        if disparities is None:
            return "Insufficient data for fairness analysis"

        report = ["Fairness Monitoring Report", "=" * 40]

        for attr, results in disparities.items():
            status = "ALERT" if results['alert'] else "OK"
            report.append(f"\n{attr}: {status}")
            report.append(f"  Disparity ratio: {results['disparity_ratio']:.3f}")
            report.append("  Group rates:")
            for group, rate in results['group_rates'].items():
                report.append(f"    {group}: {rate:.3f}")

        return "\n".join(report)

Step 4: Alerting Configuration

Alert Hierarchy

flowchart TB
    subgraph CRIT["<strong>CRITICAL</strong> - Immediate response (page on-call)"]
        C1[Model failing >10% error rate]
        C2[System down or unresponsive]
        C3[Major fairness violation]
    end

    subgraph HIGH["<strong>HIGH</strong> - Response within 1 hour (Slack + email)"]
        H1[Accuracy degradation >5%]
        H2[Sustained high latency >2s]
        H3[Drift in critical features]
    end

    subgraph MED["<strong>MEDIUM</strong> - Response within 1 day (email)"]
        M1[Minor accuracy drop 2-5%]
        M2[Elevated error rate 1-5%]
        M3[Drift in non-critical features]
    end

    subgraph LOW["<strong>LOW</strong> - Weekly review (dashboard)"]
        L1[Performance trending down]
        L2[Resource utilization rising]
        L3[Minor fairness changes]
    end

    CRIT --> HIGH --> MED --> LOW

    style CRIT fill:#ef9a9a,stroke:#c62828,stroke-width:2px
    style HIGH fill:#ffcc80,stroke:#ef6c00,stroke-width:2px
    style MED fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style LOW fill:#c8e6c9,stroke:#388e3c,stroke-width:2px

Alert Configuration Example

# alerting_config.py

ALERT_RULES = [
    {
        'name': 'model_error_rate_critical',
        'metric': 'ai_prediction_errors_total / ai_predictions_total',
        'threshold': 0.10,
        'operator': '>',
        'severity': 'critical',
        'window': '5m',
        'message': 'Model error rate exceeds 10%'
    },
    {
        'name': 'accuracy_degradation',
        'metric': 'baseline_accuracy - current_accuracy',
        'threshold': 0.05,
        'operator': '>',
        'severity': 'high',
        'window': '24h',
        'message': 'Model accuracy has dropped by more than 5%'
    },
    {
        'name': 'fairness_violation',
        'metric': 'min(group_rates) / max(group_rates)',
        'threshold': 0.80,
        'operator': '<',
        'severity': 'high',
        'window': '7d',
        'message': 'Fairness disparity exceeds threshold'
    },
    {
        'name': 'latency_high',
        'metric': 'ai_prediction_latency_seconds_p99',
        'threshold': 2.0,
        'operator': '>',
        'severity': 'high',
        'window': '10m',
        'message': 'P99 latency exceeds 2 seconds'
    },
    {
        'name': 'data_drift_detected',
        'metric': 'drift_features_percentage',
        'threshold': 0.20,
        'operator': '>',
        'severity': 'medium',
        'window': '24h',
        'message': 'Data drift detected in >20% of features'
    }
]
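
These rules are declarative; how they are evaluated depends on the alerting backend (Prometheus Alertmanager, Datadog monitors, etc.). Below is a minimal in-process evaluator sketch, assuming current metric values arrive as a dict keyed by each rule's metric expression.

import operator

_OPERATORS = {'>': operator.gt, '<': operator.lt}

def evaluate_rules(metric_values, rules=ALERT_RULES):
    """Return the rules that fire for the given metric values (illustrative evaluator)."""
    fired = []
    for rule in rules:
        value = metric_values.get(rule['metric'])
        if value is None:
            continue  # metric not available in this evaluation window
        if _OPERATORS[rule['operator']](value, rule['threshold']):
            fired.append({
                'name': rule['name'],
                'severity': rule['severity'],
                'message': rule['message'],
                'value': value
            })
    return fired

# Example with an illustrative latency reading
for alert in evaluate_rules({'ai_prediction_latency_seconds_p99': 2.4}):
    print(f"[{alert['severity'].upper()}] {alert['message']} (value={alert['value']})")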

Step 5: Dashboard Design

Key Dashboard Panels

The dashboard should include these key panels:

| Panel | Metrics | Visual Type |
| --- | --- | --- |
| Health Status | Model status, Error rate, Latency, Fairness | Status indicators |
| Accuracy Over Time | Daily/weekly accuracy trends | Line chart |
| Predictions/Hour | Throughput and volume | Bar chart |
| Fairness by Group | Outcome rates by demographic | Horizontal bar |
| Feature Drift Status | Drift detection per feature | Status list |

flowchart TB
    subgraph DASH["<strong>AI MODEL DASHBOARD</strong>"]
        direction TB
        subgraph HEALTH["Health Status"]
            direction LR
            H1["✓ Model UP"]
            H2["Error: 0.02%"]
            H3["Latency: 180ms"]
            H4["Fairness: PASS"]
        end

        subgraph METRICS["Performance Metrics"]
            direction LR
            M1["Accuracy Trend"]
            M2["Predictions/Hour"]
        end

        subgraph FAIR["Fairness & Drift"]
            direction LR
            F1["Group Rates"]
            F2["Drift Status"]
        end
    end

    style DASH fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style HEALTH fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style METRICS fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style FAIR fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
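
A sketch of how the Health Status panel values could be assembled from the monitors above. In practice a dashboard tool such as Grafana would read these directly from the metrics backend, so the function and thresholds here are illustrative only.

def health_status_panel(model_loaded, error_rate, p50_latency_ms, fairness_results):
    """Assemble the values shown in the Health Status panel (illustrative)."""
    if fairness_results is None:
        fairness = 'UNKNOWN'  # not enough logged predictions yet
    else:
        fairness = 'PASS' if all(not r['alert'] for r in fairness_results.values()) else 'ALERT'

    return {
        'model': 'UP' if model_loaded else 'DOWN',
        'error_rate': f"{error_rate:.2%}",
        'latency': f"{p50_latency_ms:.0f}ms",
        'fairness': fairness
    }

# Example with illustrative values
print(health_status_panel(model_loaded=True, error_rate=0.0002,
                          p50_latency_ms=180, fairness_results=None))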

Step 6: Incident Response

AI-Specific Incident Playbook

## AI Incident Response Playbook

### 1. Detection
- Automated alert triggered
- User reports unexpected behavior
- Quality review identifies issues

### 2. Initial Assessment (within 15 minutes)
- [ ] Confirm the issue is real (not false positive)
- [ ] Determine impact scope (users affected, predictions affected)
- [ ] Assess severity level
- [ ] Notify stakeholders per severity

### 3. Containment (within 1 hour for critical)
Options (see the routing sketch after this playbook):
a) **Disable model**: Fall back to rules-based or human processing
b) **Rollback**: Revert to previous model version
c) **Adjust thresholds**: Modify decision boundaries
d) **Add human review**: Route predictions for manual review

### 4. Investigation
- [ ] Review recent predictions for patterns
- [ ] Check for data drift
- [ ] Review recent changes (data, model, infrastructure)
- [ ] Identify root cause

### 5. Resolution
- [ ] Implement fix
- [ ] Test fix in staging
- [ ] Deploy fix with monitoring
- [ ] Verify resolution

### 6. Post-Incident
- [ ] Complete incident report
- [ ] Conduct blameless post-mortem
- [ ] Update monitoring and alerts
- [ ] Update documentation
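
The containment options above usually reduce to a routing decision in the serving path. A minimal kill-switch/fallback sketch follows; the flag variable and the rules_based_decision() fallback are illustrative assumptions, not a prescribed implementation.

MODEL_ENABLED = True  # in practice, read from a feature flag or config service

def rules_based_decision(input_data):
    """Illustrative conservative fallback used while the model is disabled."""
    return {'decision': 'needs_human_review', 'source': 'fallback_rules'}

def predict_with_fallback(model, input_data):
    """Containment option (a): route around the model when it is disabled or failing."""
    if not MODEL_ENABLED:
        return rules_based_decision(input_data)
    try:
        return {'decision': model.predict(input_data), 'source': 'model'}
    except Exception:
        # Degrade gracefully rather than surfacing the error to the caller
        return rules_based_decision(input_data)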

Monitoring Checklist

Setup

  • Instrument model serving code with metrics
  • Configure metric collection (Prometheus, CloudWatch, etc.)
  • Create monitoring dashboards
  • Set up alerting rules
  • Establish baseline metrics
  • Configure log aggregation

Ongoing Operations

  • Daily: Check dashboard for anomalies
  • Weekly: Review accuracy trends
  • Monthly: Full fairness audit
  • Quarterly: Drift analysis and model review

Documentation

  • Monitoring architecture documented
  • Alert runbooks created
  • Incident response playbook ready
  • Escalation paths defined

Tools and Technologies

Metrics & Monitoring

  • Prometheus: Metric collection
  • Grafana: Dashboards
  • CloudWatch: AWS monitoring
  • Datadog: Full observability platform

ML-Specific Monitoring

  • Evidently AI: Data and model monitoring
  • WhyLabs: ML observability
  • Arize: ML observability platform
  • MLflow: Model tracking and monitoring

Alerting

  • PagerDuty: Incident management
  • Opsgenie: Alert management
  • Slack: Team notifications