Data Quality Assessment
- Time to complete: 2-4 hours per dataset
- Who should complete: Data engineers, Data scientists, Domain experts
- Key dimensions: Completeness, Accuracy, Consistency, Timeliness, Representativeness
- Related tool: Data Quality Analyzer
Data Quality is Critical for AI
Poor data quality is a leading cause of AI project failure. Models trained on incomplete, biased, or inaccurate data produce unreliable predictions regardless of algorithm sophistication.
Assessment Information
| Field | Details |
|---|---|
| Dataset Name | |
| Data Source | |
| Assessment Date | |
| Assessor(s) | |
| Intended AI Use Case | |
| Assessment Type | Initial / Periodic / Pre-Production |
Dataset Overview
Basic Information
| Attribute | Value |
|---|---|
| Total Records | |
| Total Features/Columns | |
| Date Range | |
| Data Format | CSV / Parquet / Database / API |
| Storage Location | |
| Update Frequency | Real-time / Daily / Weekly / Monthly / Static |
| Data Owner | |
Data Schema Summary
| Column Name | Data Type | Description | Required | Sensitive |
|---|---|---|---|---|
| | | | Yes/No | Yes/No |
Quality Dimension Assessments
1. Completeness
Definition: Extent to which required data is present
| Metric | Target | Actual | Status |
|---|---|---|---|
| Overall completeness rate | > 95% | | Pass/Fail |
| Critical field completeness | 100% | | Pass/Fail |
| Record completeness | > 90% | | Pass/Fail |
Field-Level Completeness:
| Field Name | Critical | % Complete | Null Count | Action Required |
|---|---|---|---|---|
| | Yes/No | | | |
Issues Identified:
- [ ] Missing mandatory fields
- [ ] Excessive null values
- [ ] Incomplete records
- [ ] Missing time periods
Completeness Score: ___/5
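The three completeness metrics can be computed directly from the records. The following is a minimal sketch, not a prescribed method: it assumes rows are dictionaries sharing the same keys, that `None` and blank strings count as missing, and that "critical field completeness" is the lowest completeness rate among the critical fields. The function names and sample data are hypothetical.

```python
def is_missing(value):
    # Assumption: None and blank/whitespace-only strings count as missing.
    return value is None or (isinstance(value, str) and value.strip() == "")


def completeness_report(rows, critical_fields):
    """Return overall, critical-field, and record-level completeness as fractions."""
    fields = list(rows[0].keys())
    null_counts = {f: sum(is_missing(r.get(f)) for r in rows) for f in fields}
    field_rate = {f: 1 - null_counts[f] / len(rows) for f in fields}
    complete_records = sum(
        all(not is_missing(r.get(f)) for f in fields) for r in rows
    )
    return {
        # Share of non-missing cells across the whole dataset.
        "overall": 1 - sum(null_counts.values()) / (len(rows) * len(fields)),
        # Assumption: the weakest critical field determines this metric.
        "critical": min(field_rate[f] for f in critical_fields),
        # Share of records with no missing field at all.
        "record": complete_records / len(rows),
        "null_counts": null_counts,
    }


# Hypothetical sample data.
rows = [
    {"id": 1, "age": 34, "city": "Lyon"},
    {"id": 2, "age": None, "city": "Oslo"},
    {"id": 3, "age": 51, "city": ""},
    {"id": 4, "age": 28, "city": "Lima"},
]
report = completeness_report(rows, critical_fields=["id"])
print(report)
```

The `null_counts` entry maps onto the Null Count column of the field-level table; the fractions can be multiplied by 100 to compare against the percentage targets.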
2. Accuracy
Definition: Extent to which data correctly represents real-world values
| Metric | Target | Actual | Status |
|---|---|---|---|
| Known error rate | < 2% | | Pass/Fail |
| Validation rule compliance | > 98% | | Pass/Fail |
| Cross-reference accuracy | > 99% | | Pass/Fail |
Accuracy Checks:
| Check Type | Field(s) | Method | Result | Issues Found |
|---|---|---|---|---|
| Range validation | | | Pass/Fail | |
| Format validation | | | Pass/Fail | |
| Reference data match | | | Pass/Fail | |
| Cross-field validation | | | Pass/Fail | |
| Sample verification | | | Pass/Fail | |
Issues Identified:
- [ ] Out-of-range values
- [ ] Invalid formats
- [ ] Logical inconsistencies
- [ ] Known incorrect values
Accuracy Score: ___/5
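Range, format, and cross-field validation can be expressed as small per-record checks. The sketch below is illustrative only: the field names (`age`, `email`, `start_date`, `end_date`), the 0 to 120 age range, and the deliberately loose email pattern are assumptions, and the resulting error rate covers only errors these checks can detect.

```python
import re

# Deliberately loose pattern for illustration; not a full email validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_record(rec):
    """Return the names of the checks that one record fails."""
    failures = []
    if not (0 <= rec["age"] <= 120):  # range validation (assumed bounds)
        failures.append("range:age")
    if not EMAIL_RE.match(rec["email"]):  # format validation
        failures.append("format:email")
    # Cross-field validation; ISO 8601 date strings compare chronologically.
    if rec["end_date"] < rec["start_date"]:
        failures.append("cross_field:dates")
    return failures


# Hypothetical sample records.
records = [
    {"age": 34, "email": "a@example.com",
     "start_date": "2024-01-01", "end_date": "2024-02-01"},
    {"age": 140, "email": "b@example.com",
     "start_date": "2024-01-01", "end_date": "2024-02-01"},
    {"age": 29, "email": "not-an-email",
     "start_date": "2024-03-01", "end_date": "2024-02-01"},
]
results = [check_record(r) for r in records]
# Share of records failing at least one check.
error_rate = sum(bool(f) for f in results) / len(records)
print(results, error_rate)
```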
3. Consistency
Definition: Extent to which data is uniform and coherent
| Metric | Target | Actual | Status |
|---|---|---|---|
| Format consistency | 100% | | Pass/Fail |
| Cross-source consistency | > 98% | | Pass/Fail |
| Temporal consistency | > 99% | | Pass/Fail |
Consistency Checks:
| Check Type | Description | Result | Issues |
|---|---|---|---|
| Date formats | Consistent date formatting | Pass/Fail | |
| Naming conventions | Consistent naming | Pass/Fail | |
| Unit consistency | Same units throughout | Pass/Fail | |
| Encoding consistency | Character encoding | Pass/Fail | |
| Cross-system alignment | Data matches across sources | Pass/Fail | |
Issues Identified:
- [ ] Mixed formats
- [ ] Inconsistent naming/spelling
- [ ] Different units
- [ ] Cross-source conflicts
Consistency Score: ___/5
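One way to quantify format consistency for a date column is to classify each value by pattern and report the share that follows the dominant format. This is a sketch under assumptions: the three patterns listed are examples rather than an exhaustive set, and the check looks only at shape, not at whether the date is valid.

```python
import re
from collections import Counter

# Assumed candidate formats; extend to match the dataset under review.
DATE_PATTERNS = {
    "YYYY-MM-DD": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "NN/NN/YYYY": re.compile(r"^\d{2}/\d{2}/\d{4}$"),
    "YYYYMMDD": re.compile(r"^\d{8}$"),
}


def classify_date_format(value):
    """Return the name of the first matching pattern, or 'unrecognized'."""
    for name, pattern in DATE_PATTERNS.items():
        if pattern.match(value):
            return name
    return "unrecognized"


def format_consistency(values):
    """Return the dominant format, its share of values, and all format counts."""
    counts = Counter(classify_date_format(v) for v in values)
    dominant, dominant_count = counts.most_common(1)[0]
    return dominant, dominant_count / len(values), counts


# Hypothetical column values with mixed formats.
values = ["2024-01-05", "2024-01-06", "07/01/2024", "2024-01-08", "20240109"]
dominant, rate, counts = format_consistency(values)
print(dominant, rate, dict(counts))
```

A rate below the 100% target points to the "Mixed formats" issue; the per-format counts show which variants need normalizing.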
4. Timeliness
Definition: Extent to which data is sufficiently current
| Metric | Target | Actual | Status |
|---|---|---|---|
| Data currency | Within [X] days | | Pass/Fail |
| Update adherence | > 99% | | Pass/Fail |
| Lag time | < [X] hours | | Pass/Fail |
Timeliness Assessment:
| Data Element | Required Currency | Actual Age | Status |
|---|---|---|---|
| | | | Current/Stale |
Issues Identified:
- [ ] Stale data
- [ ] Missed updates
- [ ] Excessive processing lag
- [ ] Outdated reference data
Timeliness Score: ___/5
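The Current/Stale status in the timeliness table amounts to comparing each element's age against its required currency. A minimal sketch follows, assuming each data element has a known last-update timestamp in UTC; the element names and limits are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def currency_status(last_updated, max_age, now=None):
    """Return (age, 'Current' or 'Stale') for one data element."""
    now = now or datetime.now(timezone.utc)
    age = now - last_updated
    return age, ("Current" if age <= max_age else "Stale")


# Fixed reference time so the example is reproducible.
now = datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc)

# Hypothetical elements: (last update, required currency).
elements = {
    "transactions": (
        datetime(2024, 6, 10, 9, 0, tzinfo=timezone.utc), timedelta(hours=6)),
    "customer_master": (
        datetime(2024, 5, 1, 0, 0, tzinfo=timezone.utc), timedelta(days=30)),
}
statuses = {
    name: currency_status(ts, limit, now)[1]
    for name, (ts, limit) in elements.items()
}
print(statuses)
```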
5. Uniqueness
Definition: Extent to which records are free from duplication
| Metric | Target | Actual | Status |
|---|---|---|---|
| Duplicate records | < 1% | | Pass/Fail |
| Key uniqueness | 100% | | Pass/Fail |
Uniqueness Analysis:
| Check | Count/Rate | Threshold | Status |
|---|---|---|---|
| Exact duplicates | | < 0.1% | Pass/Fail |
| Near-duplicates | | < 1% | Pass/Fail |
| Key violations | | 0 | Pass/Fail |
Duplicate Patterns Found:
| Pattern | Count | Impact | Recommended Action |
|---|---|---|---|
| | | | |
Uniqueness Score: ___/5
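The three uniqueness checks can be approximated with simple counting. In this sketch, "near-duplicate" is assumed to mean records that become identical after lowercasing and collapsing whitespace; real near-duplicate detection often needs fuzzy matching, which is out of scope here. The field names and data are hypothetical.

```python
from collections import Counter


def normalize(rec, fields):
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return tuple(" ".join(str(rec[f]).lower().split()) for f in fields)


def uniqueness_report(rows, key, compare_fields):
    """Count exact duplicates, near-duplicates, and key violations."""
    n = len(rows)
    exact = Counter(tuple(r[f] for f in compare_fields) for r in rows)
    near = Counter(normalize(r, compare_fields) for r in rows)
    keys = Counter(r[key] for r in rows)
    # Each group of c identical records contributes c - 1 surplus copies.
    exact_dupes = sum(c - 1 for c in exact.values())
    # Near-duplicates: surplus copies found only after normalization.
    near_dupes = sum(c - 1 for c in near.values()) - exact_dupes
    return {
        "exact_rate": exact_dupes / n,
        "near_rate": near_dupes / n,
        "key_violations": sum(c - 1 for c in keys.values()),
    }


# Hypothetical sample data.
rows = [
    {"id": 1, "name": "Ada Lovelace", "city": "London"},
    {"id": 2, "name": "Ada Lovelace", "city": "London"},
    {"id": 3, "name": "ada  lovelace", "city": "london"},
    {"id": 3, "name": "Alan Turing", "city": "Wilmslow"},
]
report = uniqueness_report(rows, key="id", compare_fields=["name", "city"])
print(report)
```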
6. Validity
Definition: Extent to which data conforms to defined rules and constraints
| Metric | Target | Actual | Status |
|---|---|---|---|
| Schema compliance | 100% | | Pass/Fail |
| Business rule compliance | > 99% | | Pass/Fail |
| Domain validity | > 99% | | Pass/Fail |
Validation Rules Assessment:
| Rule ID | Rule Description | Field(s) | Pass Rate | Violations |
|---|---|---|---|---|
| V001 | | | | |
| V002 | | | | |
| V003 | | | | |
Issues Identified:
- [ ] Schema violations
- [ ] Invalid domain values
- [ ] Business rule failures
- [ ] Constraint violations
Validity Score: ___/5
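The validation rules table can be filled in by keeping each rule as an ID, a description, and a predicate, then computing pass rates per rule. The rules below are invented examples that reuse the template's V001 to V003 IDs; the field names and allowed values are assumptions.

```python
# Hypothetical rules keyed by the template's rule IDs.
RULES = {
    "V001": ("status is in the allowed domain",
             lambda r: r["status"] in {"active", "closed"}),
    "V002": ("amount is a non-negative number",
             lambda r: isinstance(r["amount"], (int, float)) and r["amount"] >= 0),
    "V003": ("closed records carry a close date",
             lambda r: r["status"] != "closed" or r["closed_on"] is not None),
}


def evaluate_rules(rows, rules):
    """Return pass rate and violating row indexes for each rule."""
    report = {}
    for rule_id, (description, predicate) in rules.items():
        violations = [i for i, row in enumerate(rows) if not predicate(row)]
        report[rule_id] = {
            "description": description,
            "pass_rate": 1 - len(violations) / len(rows),
            "violations": violations,
        }
    return report


# Hypothetical sample data.
rows = [
    {"status": "active", "amount": 10.0, "closed_on": None},
    {"status": "closed", "amount": 5, "closed_on": "2024-02-01"},
    {"status": "pending", "amount": -3, "closed_on": None},
    {"status": "closed", "amount": 7, "closed_on": None},
]
report = evaluate_rules(rows, RULES)
for rule_id, result in report.items():
    print(rule_id, result["pass_rate"], result["violations"])
```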
AI/ML-Specific Quality Checks
7. Representativeness
Definition: Extent to which data represents the target population
| Check | Assessment | Notes |
|---|---|---|
| Population coverage | Adequate / Partial / Poor | |
| Time period coverage | Adequate / Partial / Poor | |
| Edge case representation | Adequate / Partial / Poor | |
| Geographic coverage | Adequate / Partial / Poor | |
| Demographic coverage | Adequate / Partial / Poor | |
Distribution Analysis:
| Feature | Expected Distribution | Actual Distribution | Concern Level |
|---|---|---|---|
| | | | Low/Medium/High |
Representativeness Score: ___/5
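For a categorical feature, the gap between expected and actual distribution can be summarized with total variation distance (half the sum of absolute differences in category shares). This is one possible measure, chosen here for simplicity; the 0.05 and 0.15 cut-offs for the concern level are arbitrary assumptions to be tuned per use case, and the region data is hypothetical.

```python
from collections import Counter


def distribution_gap(values, expected_shares):
    """Return actual category shares and total variation distance to expected."""
    counts = Counter(values)
    n = len(values)
    categories = set(expected_shares) | set(counts)
    actual = {c: counts.get(c, 0) / n for c in categories}
    tvd = 0.5 * sum(
        abs(actual[c] - expected_shares.get(c, 0.0)) for c in categories
    )
    return actual, tvd


def concern_level(tvd, medium=0.05, high=0.15):
    # Assumed thresholds; not part of the template.
    if tvd >= high:
        return "High"
    return "Medium" if tvd >= medium else "Low"


# Hypothetical: the target population is evenly split across four regions.
expected = {"north": 0.25, "south": 0.25, "east": 0.25, "west": 0.25}
sample = ["north"] * 50 + ["south"] * 30 + ["east"] * 15 + ["west"] * 5
actual, tvd = distribution_gap(sample, expected)
print(actual, round(tvd, 3), concern_level(tvd))
```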
8. Bias Assessment
Definition: Extent to which data may introduce unfair bias into models
| Protected Attribute | Present | Distribution | Potential Bias Risk |
|---|---|---|---|
| Age | Yes/No | | Low/Medium/High |
| Gender | Yes/No | | Low/Medium/High |
| Location | Yes/No | | Low/Medium/High |
| Socioeconomic | Yes/No | | Low/Medium/High |
| Other: | Yes/No | | Low/Medium/High |
Bias Checks:
| Check | Finding | Risk Level |
|---|---|---|
| Class imbalance | | Low/Medium/High |
| Historical bias indicators | | Low/Medium/High |
| Proxy discrimination risk | | Low/Medium/High |
| Sampling bias | | Low/Medium/High |
Bias Assessment Score: ___/5
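Two of the bias checks lend themselves to quick numeric indicators: class imbalance (ratio of the largest to the smallest class) and the positive-label rate per protected group. The sketch below is illustrative; a gap in positive rates is a prompt for investigation, not proof of unfair bias. The `gender` and `approved` fields and the data are hypothetical.

```python
from collections import Counter, defaultdict


def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def positive_rate_by_group(rows, group_field, label_field, positive=1):
    """Share of positive labels within each value of a protected attribute."""
    totals, positives = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r[group_field]] += 1
        positives[r[group_field]] += int(r[label_field] == positive)
    return {g: positives[g] / totals[g] for g in totals}


# Hypothetical labeled records.
rows = (
    [{"gender": "F", "approved": 1}] * 2
    + [{"gender": "F", "approved": 0}] * 8
    + [{"gender": "M", "approved": 1}] * 6
    + [{"gender": "M", "approved": 0}] * 4
)
ratio = imbalance_ratio([r["approved"] for r in rows])
rates = positive_rate_by_group(rows, "gender", "approved")
print(ratio, rates)
```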
9. Label Quality (for supervised learning)
Definition: Quality of target variable/labels for training
| Metric | Target | Actual | Status |
|---|---|---|---|
| Label completeness | 100% | | Pass/Fail |
| Label accuracy | > 98% | | Pass/Fail |
| Inter-rater reliability | > 90% | | Pass/Fail |
Label Assessment:
| Check | Result | Notes |
|---|---|---|
| Labels present for all records | Yes/No | |
| Labeling methodology documented | Yes/No | |
| Label definitions clear | Yes/No | |
| Labels verified/validated | Yes/No | |
| Label distribution reasonable | Yes/No | |
Label Quality Score: ___/5
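Inter-rater reliability for two annotators can be reported as raw percent agreement, which matches the percentage target above, or as Cohen's kappa, which corrects for agreement expected by chance. The template does not say which is intended, so this sketch computes both; the rater data is hypothetical, and the kappa function assumes the two raters do not agree purely by construction (expected agreement below 1).

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which both raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if each rater labeled independently at their own rates.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two annotators on the same ten items.
rater_a = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog", "cat", "dog"]
rater_b = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "cat", "cat", "dog"]
agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(agreement, round(kappa, 3))
```

Here the raters agree on 9 of 10 items, yet kappa is lower than the raw agreement because half of that agreement would be expected by chance.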
10. Feature Quality
Definition: Suitability of features for ML model development
| Feature | Type | Cardinality | Missing % | Variance | ML Suitability |
|---|---|---|---|---|---|
| | Numeric/Cat | | | | High/Med/Low |
Feature Issues:
| Issue Type | Features Affected | Severity | Recommendation |
|---|---|---|---|
| High cardinality | | | |
| Low variance | | | |
| High correlation | | | |
| Leakage risk | | | |
Feature Quality Score: ___/5
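The per-feature columns (type, cardinality, missing %, variance) and the high-correlation check can be profiled with the standard library. This is a sketch under assumptions: columns are plain lists, `None` marks a missing value, variance is the population variance, and the correlation helper expects two complete numeric columns with non-zero spread. The column names and values are hypothetical.

```python
from statistics import pvariance


def profile_feature(values):
    """Summarize one column for the feature quality table."""
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    return {
        "type": "Numeric" if numeric else "Cat",
        "cardinality": len(set(present)),
        "missing_pct": 100 * (len(values) - len(present)) / len(values),
        "variance": pvariance(present) if numeric else None,
    }


def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


# Hypothetical columns.
columns = {
    "height_cm": [170, 180, 160, 175],
    "height_in": [66.9, 70.9, 63.0, 68.9],  # same quantity in other units
    "country": ["FR", "FR", "FR", "FR"],    # a single value carries no signal
    "segment": ["a", None, "b", "a"],
}
profiles = {name: profile_feature(vals) for name, vals in columns.items()}
corr = pearson(columns["height_cm"], columns["height_in"])
print(profiles, round(corr, 4))
```

In this sample, `country` would be flagged for low variance (cardinality of 1) and the two height columns for high correlation.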
Quality Score Summary
| Dimension | Score | Weight | Weighted Score |
|---|---|---|---|
| Completeness | /5 | 15% | |
| Accuracy | /5 | 20% | |
| Consistency | /5 | 10% | |
| Timeliness | /5 | 10% | |
| Uniqueness | /5 | 10% | |
| Validity | /5 | 10% | |
| Representativeness | /5 | 10% | |
| Bias Assessment | /5 | 5% | |
| Label Quality | /5 | 5% | |
| Feature Quality | /5 | 5% | |
| Overall Score | | 100% | /5 |
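The overall score is the weighted sum of the ten dimension scores, using the weights in the table above. The sketch below takes the weights from the table; the individual dimension scores are hypothetical and only there to show the arithmetic.

```python
# Percent weights copied from the Quality Score Summary table.
WEIGHTS = {
    "Completeness": 15, "Accuracy": 20, "Consistency": 10, "Timeliness": 10,
    "Uniqueness": 10, "Validity": 10, "Representativeness": 10,
    "Bias Assessment": 5, "Label Quality": 5, "Feature Quality": 5,
}


def overall_score(scores, weights=WEIGHTS):
    """Weighted average of dimension scores, on the same 0-5 scale."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100%")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(scores[d] * weights[d] for d in weights) / 100


# Hypothetical dimension scores out of 5.
scores = {
    "Completeness": 4, "Accuracy": 3, "Consistency": 5, "Timeliness": 4,
    "Uniqueness": 5, "Validity": 4, "Representativeness": 3,
    "Bias Assessment": 2, "Label Quality": 4, "Feature Quality": 3,
}
total = overall_score(scores)
print(total)  # 3.75
```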
Quality Level Interpretation
| Score | Level | Recommendation |
|---|---|---|
| 4.5 to 5.0 | Excellent | Data ready for production AI |
| 3.5 to < 4.5 | Good | Minor remediation; proceed with caution |
| 2.5 to < 3.5 | Fair | Significant remediation required |
| 1.5 to < 2.5 | Poor | Major data quality work needed |
| < 1.5 | Inadequate | Data not suitable for AI |
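Mapping an overall score to a quality level is a threshold lookup. This sketch assumes each band's lower bound is inclusive, so a weighted score that lands between two listed values (for example 4.45) falls into the lower band; the function name is hypothetical.

```python
# Lower bounds of each band, highest first (assumed inclusive).
LEVELS = [(4.5, "Excellent"), (3.5, "Good"), (2.5, "Fair"), (1.5, "Poor")]


def quality_level(score):
    """Return the quality level for an overall score on the 0-5 scale."""
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    for lower_bound, level in LEVELS:
        if score >= lower_bound:
            return level
    return "Inadequate"


for example in (4.5, 4.45, 3.75, 2.0, 1.49):
    print(example, quality_level(example))
```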
Issue Summary & Remediation Plan
Critical Issues
| ID | Dimension | Issue | Impact | Remediation | Owner | Due |
|---|---|---|---|---|---|---|
| DQ001 | | | High | | | |
| DQ002 | | | High | | | |
Moderate Issues
| ID | Dimension | Issue | Impact | Remediation | Owner | Due |
|---|---|---|---|---|---|---|
| DQ003 | | | Medium | | | |
| DQ004 | | | Medium | | | |
Minor Issues
| ID | Dimension | Issue | Impact | Remediation | Owner | Due |
|---|---|---|---|---|---|---|
| DQ005 | | | Low | | | |
| DQ006 | | | Low | | | |
Recommendations
Data Quality Improvements
1.
2.
3.
Process Improvements
1.
2.
3.
Monitoring Requirements
1.
2.
3.
Sign-Off
| Role | Name | Date | Decision |
|---|---|---|---|
| Data Quality Assessor | | | |
| Data Owner | | | Approved / Conditional / Rejected |
| AI/ML Lead | | | Suitable / Not Suitable |
Appendices
Appendix A: Data Profiling Results
Attach detailed profiling outputs
Appendix B: Sample Data Review
Document sample records reviewed
Appendix C: Tools Used
| Tool | Purpose | Version |
|---|---|---|
| | Data profiling | |
| | Statistical analysis | |
| | Visualization | |