How to Write a Model Card
Ready to Use
- What: Standardized documentation for ML models ("nutrition labels for AI")
- Why: Transparency, accountability, risk management, compliance
- When: Create during development, update at deployment and reviews
- Who: Data scientists write, business owners approve
Purpose
Model cards are standardized documentation for machine learning models, providing transparency about a model's intended use, performance, and limitations. This guide explains how to create effective model cards for government AI systems.
What is a Model Card?
A model card is a short document that accompanies a trained ML model, providing:
- What the model does
- How it was built
- How well it performs
- Known limitations
- Ethical considerations
Think of it as a "nutrition label" for AI models.
Why Model Cards Matter for Government
| Benefit | Description |
|---|---|
| Accountability | Documents who built the model and why |
| Transparency | Explains model behavior to stakeholders |
| Risk Management | Highlights limitations and failure modes |
| Compliance | Supports audit and regulatory requirements |
| Knowledge Transfer | Enables others to understand and maintain the model |
| Responsible AI | Demonstrates ethical consideration |
Model Card Template
Section 1: Model Details
## Model Details
### Basic Information
| Field | Value |
|-------|-------|
| Model Name | |
| Version | |
| Date | |
| Model Type | (e.g., Classification, Regression, NLP) |
| Organization | |
| Contact | |
### Model Architecture
- Type: (e.g., Random Forest, Neural Network, Linear Regression)
- Framework: (e.g., scikit-learn, TensorFlow, PyTorch)
- Size: (e.g., number of parameters, tree depth)
### Training Information
- Training data period: [dates]
- Training compute: [hardware used]
- Training time: [duration]
- Hyperparameters: [key parameters]
Section 2: Intended Use
## Intended Use
### Primary Use Case
[Describe the specific task the model is designed for]
### Intended Users
- [User type 1]
- [User type 2]
### Out-of-Scope Uses
The model should NOT be used for:
- [Inappropriate use 1]
- [Inappropriate use 2]
### Human Oversight
[Describe the role of human reviewers]
Section 3: Training Data
## Training Data
### Data Sources
| Source | Description | Records | Date Range |
|--------|-------------|---------|------------|
| | | | |
### Data Preprocessing
- [Preprocessing step 1]
- [Preprocessing step 2]
### Data Limitations
- [Known limitation 1]
- [Known limitation 2]
### Privacy and Consent
[Describe PII handling and consent basis]
Section 4: Performance
## Performance
### Overall Metrics
| Metric | Value |
|--------|-------|
| Accuracy | |
| Precision | |
| Recall | |
| F1 Score | |
| AUC-ROC | |
### Performance by Subgroup
| Group | Accuracy | Precision | Recall | N |
|-------|----------|-----------|--------|---|
| | | | | |
### Comparison to Baseline
| Approach | Accuracy | Notes |
|----------|----------|-------|
| This model | | |
| Previous model | | |
| Human baseline | | |
Section 5: Limitations
## Limitations
### Technical Limitations
- [Limitation 1]
- [Limitation 2]
### Known Failure Modes
| Scenario | Model Behavior | Mitigation |
|----------|---------------|------------|
| | | |
### Data Drift Sensitivity
[Describe how model might degrade with changing data]
### Not Suitable For
- [Scenario 1]
- [Scenario 2]
Section 6: Ethical Considerations
## Ethical Considerations
### Fairness Assessment
| Protected Attribute | Tested | Disparity | Status |
|--------------------|--------|-----------|--------|
| | Yes/No | | Pass/Fail |
### Potential Harms
| Harm Type | Risk Level | Mitigation |
|-----------|------------|------------|
| | Low/Med/High | |
### Human Rights Considerations
[Describe any human rights implications]
### Environmental Impact
- Training compute: [CO2 estimate]
- Inference compute: [CO2 estimate]
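If you have no measured figures for the CO2 placeholders above, a back-of-envelope estimate from compute time and grid carbon intensity is usually acceptable for a model card. The sketch below uses entirely assumed numbers (GPU hours, power draw, PUE, grid intensity); substitute your own values.

```python
# Rough CO2 estimate for the "Environmental Impact" fields.
# All figures below are placeholder assumptions -- replace them with
# your own hardware specs and regional grid carbon intensity.

gpu_hours = 120            # assumed total GPU hours for training
gpu_power_kw = 0.3         # assumed average draw per GPU (kW)
pue = 1.5                  # assumed data center power usage effectiveness
grid_kg_co2_per_kwh = 0.7  # assumed grid carbon intensity (kg CO2e/kWh)

energy_kwh = gpu_hours * gpu_power_kw * pue
training_co2_kg = energy_kwh * grid_kg_co2_per_kwh

print(f"Estimated training emissions: {training_co2_kg:.1f} kg CO2e")
# -> Estimated training emissions: 37.8 kg CO2e
```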
Section 7: Governance
## Governance
### Approvals
| Review Type | Reviewer | Date | Status |
|-------------|----------|------|--------|
| Technical Review | | | |
| Ethics Review | | | |
| Privacy Review | | | |
| Business Sign-off | | | |
### Monitoring
- Performance monitoring: [frequency]
- Bias monitoring: [frequency]
- Retraining schedule: [plan]
### Incident Response
[Link to incident response procedure]
### Version History
| Version | Date | Changes |
|---------|------|---------|
| | | |
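If you maintain several models, it can help to generate this skeleton from structured metadata so every card has the same sections. Below is a minimal sketch using the standard library's string.Template; the field names and example values are illustrative, not a fixed schema.

```python
# Minimal sketch: generate a model card skeleton from structured metadata.
# Field names and example values are illustrative, not a standard schema.
from string import Template

CARD_TEMPLATE = Template("""\
# Model Card: $model_name

## Model Details
| Field | Value |
|-------|-------|
| Model Name | $model_name |
| Version | $version |
| Date | $date |
| Model Type | $model_type |
| Contact | $contact |

## Intended Use
$intended_use

## Limitations
$limitations
""")

metadata = {
    "model_name": "DocClassifier",
    "version": "2.3.1",
    "date": "2024-01-15",
    "model_type": "Multi-class classification",
    "contact": "data.science@example.gov",  # placeholder contact
    "intended_use": "Automatically categorize incoming correspondence.",
    "limitations": "- Accuracy drops for very short documents.",
}

with open("MODEL_CARD.md", "w") as f:
    f.write(CARD_TEMPLATE.substitute(metadata))
```

The same idea scales to YAML metadata plus a templating engine if your team already uses those tools.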
Step-by-Step Guide
Step 1: Gather Information
Before writing, collect:
- [ ] Model architecture details
- [ ] Training data documentation
- [ ] Performance metrics from testing
- [ ] Fairness testing results
- [ ] Known issues from development
- [ ] Stakeholder feedback
Step 2: Write the Summary (for Executives)
Start with a 2-3 sentence summary:
This model [predicts/classifies/recommends] [what] for [whom].
It achieves [key metric] accuracy and is used to [business purpose].
Key limitations include [main limitation].
Example:
This model predicts the likelihood of grant application success to help
assessors prioritize their review queue. It achieves 85% accuracy in
identifying applications likely to be approved. The model should not
be used for final grant decisions, which require human review.
Step 3: Document Technical Details
Be specific and quantitative:
Bad: "The model uses machine learning to make predictions."
Good: "The model is a gradient-boosted decision tree ensemble (XGBoost) with 500 estimators, a maximum depth of 6, and a learning rate of 0.1. It was trained on 150,000 historical applications from 2018-2023."
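For reference, the configuration in the "good" description maps directly onto training code. Here is a sketch on synthetic data, assuming the xgboost and scikit-learn libraries (the real application features are not reproduced here):

```python
# Sketch only: reproduces the configuration described above on synthetic
# data, not the real grant application dataset.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = XGBClassifier(
    n_estimators=500,   # "500 estimators"
    max_depth=6,        # "max depth of 6"
    learning_rate=0.1,  # "learning rate of 0.1"
)
model.fit(X, y)
```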
Step 4: Be Honest About Limitations
Don't hide problems - document them:
### Known Limitations
1. **Temporal bias**: Model was trained on 2018-2023 data and may not
reflect post-2023 policy changes. Performance should be monitored
after any significant policy updates.
2. **Geographic coverage**: Training data underrepresents rural
applications (8% of training vs 15% of actual applications).
Accuracy is 5% lower for rural applications.
3. **Edge cases**: Model performs poorly (< 70% accuracy) on applications
from newly established organizations (< 2 years old).
Step 5: Explain Performance Clearly
Use tables and visualizations:
### Performance Summary
| Dataset | Accuracy | Precision | Recall | F1 |
|---------|----------|-----------|--------|-----|
| Training | 92% | 0.91 | 0.88 | 0.89 |
| Validation | 87% | 0.85 | 0.82 | 0.83 |
| Test | 85% | 0.83 | 0.80 | 0.81 |
| Production (Month 1) | 84% | 0.82 | 0.79 | 0.80 |
Note: The one-point accuracy drop from test (85%) to production (84%) is within the expected range.
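Numbers like these come straight from standard metric functions, so report them exactly as computed. Below is a minimal sketch using scikit-learn; the labels and predictions are dummy stand-ins for one evaluation split.

```python
# Sketch: compute the metrics reported in the table for one data split.
# y_true / y_pred are dummy stand-ins for a real split's labels and
# model predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# Print one row in the markdown table format used above.
print(f"| Test | {accuracy_score(y_true, y_pred):.0%} "
      f"| {precision_score(y_true, y_pred):.2f} "
      f"| {recall_score(y_true, y_pred):.2f} "
      f"| {f1_score(y_true, y_pred):.2f} |")
```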
Step 6: Address Fairness
Document fairness testing even if results are good:
### Fairness Assessment
Tested for disparities across: age groups, geographic location, organization type.
| Attribute | Positive Rate Disparity | Equal Opportunity Disparity | Status |
|-----------|------------------------|----------------------------|--------|
| Age (<30 vs 30+) | 0.92 | 0.89 | Pass |
| Location (Urban/Rural) | 0.85 | 0.82 | Pass |
| Org Type (For-profit/Non-profit) | 0.88 | 0.86 | Pass |
Threshold: Disparity ratio >= 0.80 required for pass.
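The disparity ratios above are simple ratios of per-group rates, and the 0.80 cut-off mirrors the widely used four-fifths rule of thumb: positive rate disparity compares selection rates between groups, while equal opportunity disparity compares true positive rates. A minimal sketch on toy data follows; the labels, predictions, and group assignments are placeholders.

```python
# Sketch: disparity ratios across a binary attribute (e.g. urban vs rural).
# The toy arrays below stand in for real labels, predictions and groups.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1, 0, 0])
group = np.array(["urban", "urban", "urban", "rural", "rural",
                  "rural", "urban", "rural", "rural", "urban"])

def positive_rate(mask):
    # Share of the group that received a positive prediction.
    return y_pred[mask].mean()

def true_positive_rate(mask):
    # Share of the group's actual positives that were predicted positive.
    positives = mask & (y_true == 1)
    return y_pred[positives].mean()

rates = {g: positive_rate(group == g) for g in np.unique(group)}
tprs = {g: true_positive_rate(group == g) for g in np.unique(group)}

pr_disparity = min(rates.values()) / max(rates.values())
eo_disparity = min(tprs.values()) / max(tprs.values())

print(f"Positive rate disparity:     {pr_disparity:.2f}")
print(f"Equal opportunity disparity: {eo_disparity:.2f}")
print("Status:", "Pass" if min(pr_disparity, eo_disparity) >= 0.80 else "Fail")
```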
Step 7: Define Governance
Make accountability clear:
### Governance and Oversight
**Model Owner:** Data Science Team, [Department]
**Business Owner:** Grants Management Branch
**Escalation Contact:** [Name], [Email]
**Review Schedule:**
- Monthly: Performance metrics review
- Quarterly: Fairness audit
- Annually: Full model review and revalidation
**Change Control:**
Any changes to model or thresholds require approval from:
- Technical Lead (for technical changes)
- Business Owner (for threshold changes)
- Ethics Committee (for significant changes)
Model Card Examples
Example 1: Document Classification Model
# Model Card: Document Classification Model
## Summary
Classifies incoming correspondence into 15 categories to support
automatic routing. Achieves 91% accuracy on test data.
## Model Details
| Field | Value |
|-------|-------|
| Model Name | DocClassifier-v2.3 |
| Type | Multi-class classification |
| Architecture | Fine-tuned DistilBERT |
| Version | 2.3.1 |
| Date | 2024-01-15 |
## Intended Use
- **Primary use**: Automatically categorize incoming emails and letters
- **Users**: Mailroom staff, correspondence officers
- **NOT for**: Legal document classification, FOI requests
## Performance
| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|-----|---------|
| Complaint | 0.94 | 0.91 | 0.92 | 1,250 |
| Enquiry | 0.89 | 0.92 | 0.90 | 3,400 |
| Feedback | 0.88 | 0.85 | 0.86 | 890 |
| ... | ... | ... | ... | ... |
## Limitations
- Accuracy drops to 75% for documents < 50 words
- Does not handle scanned/image documents
- May misclassify multi-topic correspondence
## Governance
- Owner: IT Service Delivery
- Review: Quarterly accuracy assessment
- Human oversight: All classifications reviewable by staff
Example 2: Risk Scoring Model
# Model Card: Compliance Risk Scoring
## Summary
Scores business entities for compliance audit prioritization.
Higher scores indicate higher likelihood of compliance issues.
## Model Details
| Field | Value |
|-------|-------|
| Model Name | ComplianceRisk-v1.0 |
| Type | Binary classification, reported as a 0-100 risk score |
| Architecture | Gradient Boosted Trees |
| Features | 47 input features |
## Intended Use
- **Primary use**: Prioritize compliance audits
- **Users**: Compliance officers, audit planners
- **NOT for**: Automatic penalties, public disclosure
## Training Data
- Source: Historical audit outcomes 2019-2023
- Records: 25,000 audit results
- Labels: Binary (issue found / no issue)
## Performance
- AUC-ROC: 0.82
- Top 10% captures 45% of all issues
- Top 20% captures 68% of all issues
## Fairness Testing
| Business Size | Mean Score | Issue Rate | Calibration |
|--------------|------------|------------|-------------|
| Small (<20 emp) | 42 | 12% | Good |
| Medium | 45 | 14% | Good |
| Large (>200 emp) | 48 | 16% | Good |
## Limitations
- New businesses (< 1 year old) have little historical data, so their scores are less reliable
- Score does not indicate severity, only likelihood
- Cannot detect novel compliance issues
## Human Oversight
- Scores are advisory only
- Audit selection requires manager approval
- Appeals process available for affected businesses
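A claim such as "top 10% captures 45% of all issues" in the example above is a recall-at-k figure: rank entities by score, take the highest-scoring decile, and measure what share of known issues it contains. Here is a minimal sketch on simulated scores and outcomes (not the real audit data):

```python
# Sketch: "top k% captures X% of issues" from scores and known outcomes.
# Scores and outcomes here are simulated placeholders.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(0, 100, size=5000)            # model risk scores
issues = rng.random(5000) < (scores / 100) * 0.3   # simulated audit outcomes

def capture_rate(scores, issues, top_fraction):
    # Fraction of all known issues that fall in the top-scoring slice.
    cutoff = np.quantile(scores, 1 - top_fraction)
    return issues[scores >= cutoff].sum() / issues.sum()

for frac in (0.10, 0.20):
    print(f"Top {frac:.0%} captures {capture_rate(scores, issues, frac):.0%} of issues")
```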
Checklist: Model Card Quality
Completeness
- [ ] All sections filled out
- [ ] Quantitative metrics included
- [ ] Specific examples of limitations
- [ ] Clear intended use and non-use cases
Clarity
- [ ] Non-technical summary provided
- [ ] Jargon explained or avoided
- [ ] Tables used for complex information
- [ ] Contact information clear
Honesty
- [ ] Limitations honestly documented
- [ ] Performance reported on realistic test data
- [ ] Failure modes described
- [ ] Uncertainty acknowledged
Governance
- [ ] Ownership documented
- [ ] Review schedule defined
- [ ] Approval chain clear
- [ ] Version history maintained
Resources
Templates
- Google Model Cards: https://modelcards.withgoogle.com
- Hugging Face Model Cards: https://huggingface.co/docs/hub/model-cards
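If your models are published to a Hugging Face hub, the huggingface_hub library also exposes helpers for building cards programmatically. The sketch below is based on its documented ModelCard and ModelCardData classes; all values are placeholders, and you should check the library's current documentation before relying on the exact call signatures.

```python
# Sketch using huggingface_hub's model card helpers; all values are
# placeholders. Verify against the library's current documentation.
from huggingface_hub import ModelCard, ModelCardData

card_data = ModelCardData(
    language="en",
    license="apache-2.0",
    library_name="transformers",
)
card = ModelCard.from_template(
    card_data,
    model_id="DocClassifier-v2.3",  # placeholder model id
    model_description="Classifies incoming correspondence into 15 categories.",
)
card.save("MODEL_CARD.md")
```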
Further Reading
- Mitchell et al. (2019) "Model Cards for Model Reporting"
- Partnership on AI: Model Card Guidelines