How to Write a Model Card
Ready to Use
- What: Standardized documentation for ML models ("nutrition labels for AI")
- Why: Transparency, accountability, risk management, compliance
- When: Create during development, update at deployment and reviews
- Who: Data scientists write, business owners approve
Purpose
Model cards are standardized documentation for machine learning models, providing transparency about a model's intended use, performance, and limitations. This guide explains how to create effective model cards for government AI systems.
What is a Model Card?
A model card is a short document that accompanies a trained ML model, providing:
- What the model does
- How it was built
- How well it performs
- Known limitations
- Ethical considerations
Think of it as a "nutrition label" for AI models.
Why Model Cards Matter for Government
| Benefit | Description |
|---|---|
| Accountability | Documents who built the model and why |
| Transparency | Explains model behavior to stakeholders |
| Risk Management | Highlights limitations and failure modes |
| Compliance | Supports audit and regulatory requirements |
| Knowledge Transfer | Enables others to understand and maintain the model |
| Responsible AI | Demonstrates ethical consideration |
Model Card Template
Section 1: Model Details
## Model Details
### Basic Information
| Field | Value |
|-------|-------|
| Model Name | |
| Version | |
| Date | |
| Model Type | (e.g., Classification, Regression, NLP) |
| Organization | |
| Contact | |
### Model Architecture
- Type: (e.g., Random Forest, Neural Network, Linear Regression)
- Framework: (e.g., scikit-learn, TensorFlow, PyTorch)
- Size: (e.g., number of parameters, tree depth)
### Training Information
- Training data period: [dates]
- Training compute: [hardware used]
- Training time: [duration]
- Hyperparameters: [key parameters]
Section 2: Intended Use
## Intended Use
### Primary Use Case
[Describe the specific task the model is designed for]
### Intended Users
- [User type 1]
- [User type 2]
### Out-of-Scope Uses
The model should NOT be used for:
- [Inappropriate use 1]
- [Inappropriate use 2]
### Human Oversight
[Describe the role of human reviewers]
Section 3: Training Data
## Training Data
### Data Sources
| Source | Description | Records | Date Range |
|--------|-------------|---------|------------|
| | | | |
### Data Preprocessing
- [Preprocessing step 1]
- [Preprocessing step 2]
### Data Limitations
- [Known limitation 1]
- [Known limitation 2]
### Privacy and Consent
[Describe PII handling and consent basis]
Section 4: Performance
## Performance
### Overall Metrics
| Metric | Value |
|--------|-------|
| Accuracy | |
| Precision | |
| Recall | |
| F1 Score | |
| AUC-ROC | |
### Performance by Subgroup
| Group | Accuracy | Precision | Recall | N |
|-------|----------|-----------|--------|---|
| | | | | |
### Comparison to Baseline
| Approach | Accuracy | Notes |
|----------|----------|-------|
| This model | | |
| Previous model | | |
| Human baseline | | |
Section 5: Limitations
## Limitations
### Technical Limitations
- [Limitation 1]
- [Limitation 2]
### Known Failure Modes
| Scenario | Model Behavior | Mitigation |
|----------|---------------|------------|
| | | |
### Data Drift Sensitivity
[Describe how model might degrade with changing data]
### Not Suitable For
- [Scenario 1]
- [Scenario 2]
Section 6: Ethical Considerations
## Ethical Considerations
### Fairness Assessment
| Protected Attribute | Tested | Disparity | Status |
|--------------------|--------|-----------|--------|
| | Yes/No | | Pass/Fail |
### Potential Harms
| Harm Type | Risk Level | Mitigation |
|-----------|------------|------------|
| | Low/Med/High | |
### Human Rights Considerations
[Describe any human rights implications]
### Environmental Impact
- Training compute: [CO2 estimate]
- Inference compute: [CO2 estimate]
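If you have no measured figures for the CO2 placeholders above, a back-of-envelope estimate from compute time and grid carbon intensity is usually acceptable for a model card. The sketch below uses entirely assumed numbers (GPU hours, power draw, PUE, grid intensity); substitute your own values.

```python
# Rough CO2 estimate for the "Environmental Impact" fields.
# All figures below are placeholder assumptions -- replace them with
# your own hardware specs and regional grid carbon intensity.

gpu_hours = 120            # assumed total GPU hours for training
gpu_power_kw = 0.3         # assumed average draw per GPU (kW)
pue = 1.5                  # assumed data center power usage effectiveness
grid_kg_co2_per_kwh = 0.7  # assumed grid carbon intensity (kg CO2e/kWh)

energy_kwh = gpu_hours * gpu_power_kw * pue
training_co2_kg = energy_kwh * grid_kg_co2_per_kwh

print(f"Estimated training emissions: {training_co2_kg:.1f} kg CO2e")
# -> Estimated training emissions: 37.8 kg CO2e
```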
Section 7: Governance
## Governance
### Approvals
| Review Type | Reviewer | Date | Status |
|-------------|----------|------|--------|
| Technical Review | | | |
| Ethics Review | | | |
| Privacy Review | | | |
| Business Sign-off | | | |
### Monitoring
- Performance monitoring: [frequency]
- Bias monitoring: [frequency]
- Retraining schedule: [plan]
### Incident Response
[Link to incident response procedure]
### Version History
| Version | Date | Changes |
|---------|------|---------|
| | | |
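If you maintain several models, it can help to generate this skeleton from structured metadata so every card has the same sections. Below is a minimal sketch using the standard library's string.Template; the field names and example values are illustrative, not a fixed schema.

```python
# Minimal sketch: generate a model card skeleton from structured metadata.
# Field names and example values are illustrative, not a standard schema.
from string import Template

CARD_TEMPLATE = Template("""\
# Model Card: $model_name

## Model Details
| Field | Value |
|-------|-------|
| Model Name | $model_name |
| Version | $version |
| Date | $date |
| Model Type | $model_type |
| Contact | $contact |

## Intended Use
$intended_use

## Limitations
$limitations
""")

metadata = {
    "model_name": "DocClassifier",
    "version": "2.3.1",
    "date": "2024-01-15",
    "model_type": "Multi-class classification",
    "contact": "data.science@example.gov",  # placeholder contact
    "intended_use": "Automatically categorize incoming correspondence.",
    "limitations": "- Accuracy drops for very short documents.",
}

with open("MODEL_CARD.md", "w") as f:
    f.write(CARD_TEMPLATE.substitute(metadata))
```

The same idea scales to YAML metadata plus a templating engine if your team already uses those tools.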
Step-by-Step Guide
Step 1: Gather Information
Before writing, collect:
- [ ] Model architecture details
- [ ] Training data documentation
- [ ] Performance metrics from testing
- [ ] Fairness testing results
- [ ] Known issues from development
- [ ] Stakeholder feedback
Step 2: Write the Summary (for Executives)
Start with a 2-3 sentence summary:
This model [predicts/classifies/recommends] [what] for [whom].
It achieves [key metric] accuracy and is used to [business purpose].
Key limitations include [main limitation].
Example:
This model predicts the likelihood of grant application success to help
assessors prioritize their review queue. It achieves 85% accuracy in
identifying applications likely to be approved. The model should not
be used for final grant decisions, which require human review.
Step 3: Document Technical Details
Be specific and quantitative:
Bad: "The model uses machine learning to make predictions."
Good: "The model is a gradient-boosted decision tree ensemble (XGBoost) with 500 estimators, a maximum depth of 6, and a learning rate of 0.1. It was trained on 150,000 historical applications from 2018-2023."
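For reference, the configuration in the "good" description maps directly onto training code. Here is a sketch on synthetic data, assuming the xgboost and scikit-learn libraries (the real application features are not reproduced here):

```python
# Sketch only: reproduces the configuration described above on synthetic
# data, not the real grant application dataset.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = XGBClassifier(
    n_estimators=500,   # "500 estimators"
    max_depth=6,        # "max depth of 6"
    learning_rate=0.1,  # "learning rate of 0.1"
)
model.fit(X, y)
```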
Step 4: Be Honest About Limitations
Don't hide problems - document them:
### Known Limitations
1. **Temporal bias**: Model was trained on 2018-2023 data and may not
reflect post-2023 policy changes. Performance should be monitored
after any significant policy updates.
2. **Geographic coverage**: Training data underrepresents rural
applications (8% of training vs 15% of actual applications).
Accuracy is 5% lower for rural applications.
3. **Edge cases**: Model performs poorly (< 70% accuracy) on applications
from newly established organizations (< 2 years old).
Step 5: Explain Performance Clearly
Use tables and visualizations:
### Performance Summary
| Dataset | Accuracy | Precision | Recall | F1 |
|---------|----------|-----------|--------|-----|
| Training | 92% | 0.91 | 0.88 | 0.89 |
| Validation | 87% | 0.85 | 0.82 | 0.83 |
| Test | 85% | 0.83 | 0.80 | 0.81 |
| Production (Month 1) | 84% | 0.82 | 0.79 | 0.80 |
Note: The one-point accuracy drop from test (85%) to production (84%) is within the expected range.
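Numbers like these come straight from standard metric functions, so report them exactly as computed. Below is a minimal sketch using scikit-learn; the labels and predictions are dummy stand-ins for one evaluation split.

```python
# Sketch: compute the metrics reported in the table for one data split.
# y_true / y_pred are dummy stand-ins for a real split's labels and
# model predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# Print one row in the markdown table format used above.
print(f"| Test | {accuracy_score(y_true, y_pred):.0%} "
      f"| {precision_score(y_true, y_pred):.2f} "
      f"| {recall_score(y_true, y_pred):.2f} "
      f"| {f1_score(y_true, y_pred):.2f} |")
```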
Step 6: Address Fairness
Document fairness testing even if results are good:
### Fairness Assessment
Tested for disparities across: age groups, geographic location, organization type.
| Attribute | Positive Rate Disparity | Equal Opportunity Disparity | Status |
|-----------|------------------------|----------------------------|--------|
| Age (<30 vs 30+) | 0.92 | 0.89 | Pass |
| Location (Urban/Rural) | 0.85 | 0.82 | Pass |
| Org Type (For-profit/Non-profit) | 0.88 | 0.86 | Pass |
Threshold: Disparity ratio >= 0.80 required for pass.
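The disparity ratios above are simple ratios of per-group rates, and the 0.80 cut-off mirrors the widely used four-fifths rule of thumb: positive rate disparity compares selection rates between groups, while equal opportunity disparity compares true positive rates. A minimal sketch on toy data follows; the labels, predictions, and group assignments are placeholders.

```python
# Sketch: disparity ratios across a binary attribute (e.g. urban vs rural).
# The toy arrays below stand in for real labels, predictions and groups.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1, 0, 0])
group = np.array(["urban", "urban", "urban", "rural", "rural",
                  "rural", "urban", "rural", "rural", "urban"])

def positive_rate(mask):
    # Share of the group that received a positive prediction.
    return y_pred[mask].mean()

def true_positive_rate(mask):
    # Share of the group's actual positives that were predicted positive.
    positives = mask & (y_true == 1)
    return y_pred[positives].mean()

rates = {g: positive_rate(group == g) for g in np.unique(group)}
tprs = {g: true_positive_rate(group == g) for g in np.unique(group)}

pr_disparity = min(rates.values()) / max(rates.values())
eo_disparity = min(tprs.values()) / max(tprs.values())

print(f"Positive rate disparity:     {pr_disparity:.2f}")
print(f"Equal opportunity disparity: {eo_disparity:.2f}")
print("Status:", "Pass" if min(pr_disparity, eo_disparity) >= 0.80 else "Fail")
```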
Step 7: Define Governance
Make accountability clear:
### Governance and Oversight
**Model Owner:** Data Science Team, [Department]
**Business Owner:** Grants Management Branch
**Escalation Contact:** [Name], [Email]
**Review Schedule:**
- Monthly: Performance metrics review
- Quarterly: Fairness audit
- Annually: Full model review and revalidation
**Change Control:**
Any changes to model or thresholds require approval from:
- Technical Lead (for technical changes)
- Business Owner (for threshold changes)
- Ethics Committee (for significant changes)
Model Card Examples
Example 1: Document Classification Model
# Model Card: Document Classification Model
## Summary
Classifies incoming correspondence into 15 categories to support
automatic routing. Achieves 91% accuracy on test data.
## Model Details
| Field | Value |
|-------|-------|
| Model Name | DocClassifier-v2.3 |
| Type | Multi-class classification |
| Architecture | Fine-tuned DistilBERT |
| Version | 2.3.1 |
| Date | 2024-01-15 |
## Intended Use
- **Primary use**: Automatically categorize incoming emails and letters
- **Users**: Mailroom staff, correspondence officers
- **NOT for**: Legal document classification, FOI requests
## Performance
| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|-----|---------|
| Complaint | 0.94 | 0.91 | 0.92 | 1,250 |
| Enquiry | 0.89 | 0.92 | 0.90 | 3,400 |
| Feedback | 0.88 | 0.85 | 0.86 | 890 |
| ... | ... | ... | ... | ... |
## Limitations
- Accuracy drops to 75% for documents < 50 words
- Does not handle scanned/image documents
- May misclassify multi-topic correspondence
## Governance
- Owner: IT Service Delivery
- Review: Quarterly accuracy assessment
- Human oversight: All classifications reviewable by staff
Example 2: Risk Scoring Model
# Model Card: Compliance Risk Scoring
## Summary
Scores business entities for compliance audit prioritization.
Higher scores indicate higher likelihood of compliance issues.
## Model Details
| Field | Value |
|-------|-------|
| Model Name | ComplianceRisk-v1.0 |
| Type | Binary classification, reported as a 0-100 risk score |
| Architecture | Gradient Boosted Trees |
| Features | 47 input features |
## Intended Use
- **Primary use**: Prioritize compliance audits
- **Users**: Compliance officers, audit planners
- **NOT for**: Automatic penalties, public disclosure
## Training Data
- Source: Historical audit outcomes 2019-2023
- Records: 25,000 audit results
- Labels: Binary (issue found / no issue)
## Performance
- AUC-ROC: 0.82
- Top 10% captures 45% of all issues
- Top 20% captures 68% of all issues
## Fairness Testing
| Business Size | Mean Score | Issue Rate | Calibration |
|--------------|------------|------------|-------------|
| Small (<20 emp) | 42 | 12% | Good |
| Medium | 45 | 14% | Good |
| Large (>200 emp) | 48 | 16% | Good |
## Limitations
- New businesses (< 1 year old) have little historical data, so their scores are less reliable
- Score does not indicate severity, only likelihood
- Cannot detect novel compliance issues
## Human Oversight
- Scores are advisory only
- Audit selection requires manager approval
- Appeals process available for affected businesses
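A claim such as "top 10% captures 45% of all issues" in the example above is a recall-at-k figure: rank entities by score, take the highest-scoring decile, and measure what share of known issues it contains. Here is a minimal sketch on simulated scores and outcomes (not the real audit data):

```python
# Sketch: "top k% captures X% of issues" from scores and known outcomes.
# Scores and outcomes here are simulated placeholders.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(0, 100, size=5000)            # model risk scores
issues = rng.random(5000) < (scores / 100) * 0.3   # simulated audit outcomes

def capture_rate(scores, issues, top_fraction):
    # Fraction of all known issues that fall in the top-scoring slice.
    cutoff = np.quantile(scores, 1 - top_fraction)
    return issues[scores >= cutoff].sum() / issues.sum()

for frac in (0.10, 0.20):
    print(f"Top {frac:.0%} captures {capture_rate(scores, issues, frac):.0%} of issues")
```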
Checklist: Model Card Quality
Completeness
- [ ] All sections filled out
- [ ] Quantitative metrics included
- [ ] Specific examples of limitations
- [ ] Clear intended use and non-use cases
Clarity
- [ ] Non-technical summary provided
- [ ] Jargon explained or avoided
- [ ] Tables used for complex information
- [ ] Contact information clear
Honesty
- [ ] Limitations honestly documented
- [ ] Performance reported on realistic test data
- [ ] Failure modes described
- [ ] Uncertainty acknowledged
Governance
- [ ] Ownership documented
- [ ] Review schedule defined
- [ ] Approval chain clear
- [ ] Version history maintained
Resources
Templates
- Google Model Cards: https://modelcards.withgoogle.com
- Hugging Face Model Cards: https://huggingface.co/docs/hub/model-cards
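If your models are published to a Hugging Face hub, the huggingface_hub library also exposes helpers for building cards programmatically. The sketch below is based on its documented ModelCard and ModelCardData classes; all values are placeholders, and you should check the library's current documentation before relying on the exact call signatures.

```python
# Sketch using huggingface_hub's model card helpers; all values are
# placeholders. Verify against the library's current documentation.
from huggingface_hub import ModelCard, ModelCardData

card_data = ModelCardData(
    language="en",
    license="apache-2.0",
    library_name="transformers",
)
card = ModelCard.from_template(
    card_data,
    model_id="DocClassifier-v2.3",  # placeholder model id
    model_description="Classifies incoming correspondence into 15 categories.",
)
card.save("MODEL_CARD.md")
```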
Further Reading
- Mitchell et al. (2019) "Model Cards for Model Reporting"
- Partnership on AI: Model Card Guidelines