
How to Write a Model Card

Ready to Use

Quick Reference
  • What: Standardized documentation for ML models ("nutrition labels for AI")
  • Why: Transparency, accountability, risk management, compliance
  • When: Create during development, update at deployment and reviews
  • Who: Data scientists write, business owners approve

Purpose

Model cards are standardized documentation for machine learning models, providing transparency about a model's intended use, performance, and limitations. This guide explains how to create effective model cards for government AI systems.


What is a Model Card?

A model card is a short document that accompanies a trained ML model, providing:

- What the model does
- How it was built
- How well it performs
- Known limitations
- Ethical considerations

Think of it as a "nutrition label" for AI models.


Why Model Cards Matter for Government

| Benefit | Description |
|---------|-------------|
| Accountability | Documents who built the model and why |
| Transparency | Explains model behavior to stakeholders |
| Risk Management | Highlights limitations and failure modes |
| Compliance | Supports audit and regulatory requirements |
| Knowledge Transfer | Enables others to understand and maintain the model |
| Responsible AI | Demonstrates ethical consideration |

Model Card Template

Section 1: Model Details

## Model Details

### Basic Information
| Field | Value |
|-------|-------|
| Model Name | |
| Version | |
| Date | |
| Model Type | (e.g., Classification, Regression, NLP) |
| Organization | |
| Contact | |

### Model Architecture
- Type: (e.g., Random Forest, Neural Network, Linear Regression)
- Framework: (e.g., scikit-learn, TensorFlow, PyTorch)
- Size: (e.g., number of parameters, tree depth)

### Training Information
- Training data period: [dates]
- Training compute: [hardware used]
- Training time: [duration]
- Hyperparameters: [key parameters]

Section 2: Intended Use

## Intended Use

### Primary Use Case
[Describe the specific task the model is designed for]

### Intended Users
- [User type 1]
- [User type 2]

### Out-of-Scope Uses
The model should NOT be used for:
- [Inappropriate use 1]
- [Inappropriate use 2]

### Human Oversight
[Describe the role of human reviewers]

Section 3: Training Data

## Training Data

### Data Sources
| Source | Description | Records | Date Range |
|--------|-------------|---------|------------|
| | | | |

### Data Preprocessing
- [Preprocessing step 1]
- [Preprocessing step 2]

### Data Limitations
- [Known limitation 1]
- [Known limitation 2]

### Privacy and Consent
[Describe PII handling and consent basis]

Section 4: Performance

## Performance

### Overall Metrics
| Metric | Value |
|--------|-------|
| Accuracy | |
| Precision | |
| Recall | |
| F1 Score | |
| AUC-ROC | |

### Performance by Subgroup
| Group | Accuracy | Precision | Recall | N |
|-------|----------|-----------|--------|---|
| | | | | |

### Comparison to Baseline
| Approach | Accuracy | Notes |
|----------|----------|-------|
| This model | | |
| Previous model | | |
| Human baseline | | |
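
If the subgroup table is being filled from test-set predictions, the rows can be computed rather than typed by hand. This is a minimal sketch assuming a pandas DataFrame (hypothetical name `test_df`) with columns `group`, `y_true`, and `y_pred` holding binary labels and predictions:

```python
# Sketch: filling the "Performance by Subgroup" table from test-set predictions.
# Assumes binary 0/1 labels, matching scikit-learn's default behaviour.
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

def subgroup_metrics(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for group, part in df.groupby("group"):
        rows.append({
            "Group": group,
            "Accuracy": accuracy_score(part["y_true"], part["y_pred"]),
            "Precision": precision_score(part["y_true"], part["y_pred"]),
            "Recall": recall_score(part["y_true"], part["y_pred"]),
            "N": len(part),
        })
    return pd.DataFrame(rows)

# print(subgroup_metrics(test_df).to_markdown(index=False))  # paste into the table above
```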

Section 5: Limitations

## Limitations

### Technical Limitations
- [Limitation 1]
- [Limitation 2]

### Known Failure Modes
| Scenario | Model Behavior | Mitigation |
|----------|---------------|------------|
| | | |

### Data Drift Sensitivity
[Describe how model might degrade with changing data]

### Not Suitable For
- [Scenario 1]
- [Scenario 2]

Section 6: Ethical Considerations

## Ethical Considerations

### Fairness Assessment
| Protected Attribute | Tested | Disparity | Status |
|--------------------|--------|-----------|--------|
| | Yes/No | | Pass/Fail |

### Potential Harms
| Harm Type | Risk Level | Mitigation |
|-----------|------------|------------|
| | Low/Med/High | |

### Human Rights Considerations
[Describe any human rights implications]

### Environmental Impact
- Training compute: [CO2 estimate]
- Inference compute: [CO2 estimate]
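
One rough way to produce these estimates is compute time multiplied by average hardware power, data-centre overhead (PUE), and grid carbon intensity. The sketch below uses illustrative placeholder numbers only, not measured values:

```python
# Sketch: a rough CO2e estimate for the Environmental Impact fields.
# All inputs below are illustrative placeholders; substitute real measurements where available.

def co2_kg(hours: float, avg_power_kw: float, pue: float, grid_kg_per_kwh: float) -> float:
    """Energy use (kWh) scaled by data-centre overhead (PUE) and grid carbon intensity."""
    return hours * avg_power_kw * pue * grid_kg_per_kwh

training = co2_kg(hours=40, avg_power_kw=0.3, pue=1.5, grid_kg_per_kwh=0.7)        # one GPU, ~40 h
inference = co2_kg(hours=24 * 365, avg_power_kw=0.05, pue=1.5, grid_kg_per_kwh=0.7)  # always-on endpoint, per year

print(f"Training: ~{training:.0f} kg CO2e; Inference: ~{inference:.0f} kg CO2e per year")
```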

Section 7: Governance

## Governance

### Approvals
| Review Type | Reviewer | Date | Status |
|-------------|----------|------|--------|
| Technical Review | | | |
| Ethics Review | | | |
| Privacy Review | | | |
| Business Sign-off | | | |

### Monitoring
- Performance monitoring: [frequency]
- Bias monitoring: [frequency]
- Retraining schedule: [plan]

### Incident Response
[Link to incident response procedure]

### Version History
| Version | Date | Changes |
|---------|------|---------|
| | | |

Step-by-Step Guide

Step 1: Gather Information

Before writing, collect:

- [ ] Model architecture details
- [ ] Training data documentation
- [ ] Performance metrics from testing
- [ ] Fairness testing results
- [ ] Known issues from development
- [ ] Stakeholder feedback

Step 2: Write the Summary (for Executives)

Start with a 2-3 sentence summary:

This model [predicts/classifies/recommends] [what] for [whom].
It achieves [key metric] accuracy and is used to [business purpose].
Key limitations include [main limitation].

Example:

This model predicts the likelihood of grant application success to help
assessors prioritize their review queue. It achieves 85% accuracy in
identifying applications likely to be approved. The model should not
be used for final grant decisions, which require human review.

Step 3: Document Technical Details

Be specific and quantitative:

Bad: "The model uses machine learning to make predictions."

Good: "The model is a gradient boosted decision tree (XGBoost) with 500 estimators, max depth of 6, and learning rate of 0.1. It was trained on 150,000 historical applications from 2018-2023."
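
One way to keep these details accurate is to read them off the trained model object rather than transcribing them. A minimal sketch, assuming an XGBoost classifier trained through its scikit-learn interface and stored in a variable named `model` (hypothetical):

```python
# Sketch: pulling model details from the trained object so the card matches reality.
# Assumes an xgboost.XGBClassifier instance named `model` (hypothetical).
import json

params = model.get_params()
details = {
    "architecture": type(model).__name__,      # e.g. "XGBClassifier"
    "n_estimators": params["n_estimators"],
    "max_depth": params["max_depth"],
    "learning_rate": params["learning_rate"],
}
print(json.dumps(details, indent=2))  # copy the values into the Model Details section
```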

Step 4: Be Honest About Limitations

Don't hide problems; document them:

### Known Limitations

1. **Temporal bias**: Model was trained on 2018-2023 data and may not
   reflect post-2023 policy changes. Performance should be monitored
   after any significant policy updates.

2. **Geographic coverage**: Training data underrepresents rural
   applications (8% of training vs 15% of actual applications).
   Accuracy is 5% lower for rural applications.

3. **Edge cases**: Model performs poorly (< 70% accuracy) on applications
   from newly established organizations (< 2 years old).

Step 5: Explain Performance Clearly

Use tables and visualizations:

### Performance Summary

| Dataset | Accuracy | Precision | Recall | F1 |
|---------|----------|-----------|--------|-----|
| Training | 92% | 0.91 | 0.88 | 0.89 |
| Validation | 87% | 0.85 | 0.82 | 0.83 |
| Test | 85% | 0.83 | 0.80 | 0.81 |
| Production (Month 1) | 84% | 0.82 | 0.79 | 0.80 |

Note: The 1-point accuracy drop from test (85%) to production (84%) is within the expected range.
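
The table values can be computed directly from predictions so the card and the evaluation code cannot drift apart. A minimal sketch using scikit-learn, assuming binary labels and hypothetical array names `y_true` and `y_pred` for one split:

```python
# Sketch: computing one row of the Performance Summary table.
# Hypothetical names: y_true holds the split's labels, y_pred the model's predictions
# (binary 0/1 labels assumed, matching scikit-learn's defaults).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def summary_row(split: str, y_true, y_pred) -> str:
    return "| {} | {:.0%} | {:.2f} | {:.2f} | {:.2f} |".format(
        split,
        accuracy_score(y_true, y_pred),
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
        f1_score(y_true, y_pred),
    )

# Example: print(summary_row("Test", y_true_test, y_pred_test)) and paste the row into the card.
```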

Step 6: Address Fairness

Document fairness testing even if results are good:

### Fairness Assessment

Tested for disparities across: age groups, geographic location, organization type.

| Attribute | Positive Rate Disparity | Equal Opportunity Disparity | Status |
|-----------|------------------------|----------------------------|--------|
| Age (<30 vs 30+) | 0.92 | 0.89 | Pass |
| Location (Urban/Rural) | 0.85 | 0.82 | Pass |
| Org Type (For-profit/Non-profit) | 0.88 | 0.86 | Pass |

Threshold: Disparity ratio >= 0.80 required for pass.
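
This guide does not fix exact formulas for the disparities, but a common reading is the ratio of the lower group's rate to the higher group's rate (selection rate for positive rate disparity, true positive rate for equal opportunity). A sketch under that assumption, with hypothetical array names:

```python
# Sketch: disparity ratios under one common interpretation (lower rate / higher rate).
# Hypothetical inputs: per-group arrays of true labels and binary predictions.
import numpy as np

def positive_rate(y_pred) -> float:
    """Share of cases the model flags positive (selection rate)."""
    return float(np.mean(np.asarray(y_pred) == 1))

def true_positive_rate(y_true, y_pred) -> float:
    """Recall within a group, used for the equal opportunity comparison."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    positives = y_true == 1
    return float(np.mean(y_pred[positives] == 1)) if positives.any() else float("nan")

def disparity_ratio(rate_a: float, rate_b: float) -> float:
    """Lower rate over higher rate; >= 0.80 meets the pass threshold used above."""
    low, high = sorted([rate_a, rate_b])
    return low / high if high > 0 else float("nan")

# Example (hypothetical arrays split by location):
# positive_rate_disparity = disparity_ratio(positive_rate(pred_urban), positive_rate(pred_rural))
# equal_opp_disparity = disparity_ratio(true_positive_rate(true_urban, pred_urban),
#                                       true_positive_rate(true_rural, pred_rural))
```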

Step 7: Define Governance

Make accountability clear:

### Governance and Oversight

**Model Owner:** Data Science Team, [Department]
**Business Owner:** Grants Management Branch
**Escalation Contact:** [Name], [Email]

**Review Schedule:**
- Monthly: Performance metrics review
- Quarterly: Fairness audit
- Annually: Full model review and revalidation

**Change Control:**
Any changes to model or thresholds require approval from:
- Technical Lead (for technical changes)
- Business Owner (for threshold changes)
- Ethics Committee (for significant changes)

Model Card Examples

Example 1: Document Classification Model

# Model Card: Document Classification Model

## Summary
Classifies incoming correspondence into 15 categories to support
automatic routing. Achieves 91% accuracy on test data.

## Model Details
| Field | Value |
|-------|-------|
| Model Name | DocClassifier-v2.3 |
| Type | Multi-class classification |
| Architecture | Fine-tuned DistilBERT |
| Version | 2.3.1 |
| Date | 2024-01-15 |

## Intended Use
- **Primary use**: Automatically categorize incoming emails and letters
- **Users**: Mailroom staff, correspondence officers
- **NOT for**: Legal document classification, FOI requests

## Performance
| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|-----|---------|
| Complaint | 0.94 | 0.91 | 0.92 | 1,250 |
| Enquiry | 0.89 | 0.92 | 0.90 | 3,400 |
| Feedback | 0.88 | 0.85 | 0.86 | 890 |
| ... | ... | ... | ... | ... |

## Limitations
- Accuracy drops to 75% for documents < 50 words
- Does not handle scanned/image documents
- May misclassify multi-topic correspondence

## Governance
- Owner: IT Service Delivery
- Review: Quarterly accuracy assessment
- Human oversight: All classifications reviewable by staff
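
A per-class table like the one in this example can be generated from test-set predictions with scikit-learn's `classification_report`; the sketch below assumes hypothetical `y_true` and `y_pred` arrays of category names:

```python
# Sketch: generating the per-class Performance table from test-set predictions.
from sklearn.metrics import classification_report

# output_dict=True returns precision / recall / f1-score / support per class
report = classification_report(y_true, y_pred, output_dict=True)
for label, scores in report.items():
    if label in ("accuracy", "macro avg", "weighted avg"):
        continue  # keep only the per-class rows
    print(f"| {label} | {scores['precision']:.2f} | {scores['recall']:.2f} "
          f"| {scores['f1-score']:.2f} | {int(scores['support']):,} |")
```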

Example 2: Risk Scoring Model

# Model Card: Compliance Risk Scoring

## Summary
Scores business entities for compliance audit prioritization.
Higher scores indicate higher likelihood of compliance issues.

## Model Details
| Field | Value |
|-------|-------|
| Model Name | ComplianceRisk-v1.0 |
| Type | Regression (0-100 score) |
| Architecture | Gradient Boosted Trees |
| Features | 47 input features |

## Intended Use
- **Primary use**: Prioritize compliance audits
- **Users**: Compliance officers, audit planners
- **NOT for**: Automatic penalties, public disclosure

## Training Data
- Source: Historical audit outcomes 2019-2023
- Records: 25,000 audit results
- Labels: Binary (issue found / no issue)

## Performance
- AUC-ROC: 0.82
- Top 10% captures 45% of all issues
- Top 20% captures 68% of all issues

## Fairness Testing
| Business Size | Mean Score | Issue Rate | Calibration |
|--------------|------------|------------|-------------|
| Small (<20 emp) | 42 | 12% | Good |
| Medium | 45 | 14% | Good |
| Large (>200 emp) | 48 | 16% | Good |

## Limitations
- New businesses (< 1 year) have less predictive data
- Score does not indicate severity, only likelihood
- Cannot detect novel compliance issues

## Human Oversight
- Scores are advisory only
- Audit selection requires manager approval
- Appeals process available for affected businesses
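
The "top X% captures Y% of all issues" figures in this example are capture rates: the share of all confirmed issues that fall inside the highest-scoring fraction of entities. A minimal sketch with hypothetical array names:

```python
# Sketch: capture rate for a risk score ("top X% of scores captures Y% of issues").
# Hypothetical inputs: risk scores and a boolean flag per entity for whether an issue was found.
import numpy as np

def capture_rate(scores, issue_found, top_fraction: float) -> float:
    scores = np.asarray(scores)
    issue_found = np.asarray(issue_found, dtype=bool)
    k = max(1, int(len(scores) * top_fraction))
    top_idx = np.argsort(scores)[::-1][:k]      # indices of the highest-risk entities
    return issue_found[top_idx].sum() / issue_found.sum()

# capture_rate(risk_scores, audit_issue_flags, 0.10)  -> ~0.45 would match Example 2
```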

Checklist: Model Card Quality

Completeness

  • All sections filled out
  • Quantitative metrics included
  • Specific examples of limitations
  • Clear intended use and non-use cases

Clarity

  • Non-technical summary provided
  • Jargon explained or avoided
  • Tables used for complex information
  • Contact information clear

Honesty

  • Limitations honestly documented
  • Performance reported on realistic test data
  • Failure modes described
  • Uncertainty acknowledged

Governance

  • Ownership documented
  • Review schedule defined
  • Approval chain clear
  • Version history maintained

Resources

Templates

  • Google Model Cards: https://modelcards.withgoogle.com
  • Hugging Face Model Cards: https://huggingface.co/docs/hub/model-cards

Further Reading

  • Mitchell et al. (2019) "Model Cards for Model Reporting"
  • Partnership on AI: Model Card Guidelines