Data Quality Assessment

Purpose: Evaluate the quality of data intended for AI/ML model training and operation, and identify issues that could degrade model performance, introduce bias, or create compliance risks.
At a Glance
  • Time to complete: 2-4 hours per dataset
  • Who should complete: Data engineers, Data scientists, Domain experts
  • Key dimensions: Completeness, Accuracy, Consistency, Timeliness, Representativeness
  • Related tool: Data Quality Analyzer

Data Quality is Critical for AI

Poor data quality is a leading cause of AI project failure. Models trained on incomplete, biased, or inaccurate data produce unreliable predictions, however sophisticated the algorithm.


Assessment Information

| Field | Details |
| --- | --- |
| Dataset Name | |
| Data Source | |
| Assessment Date | |
| Assessor(s) | |
| Intended AI Use Case | |
| Assessment Type | Initial / Periodic / Pre-Production |

Dataset Overview

Basic Information

| Attribute | Value |
| --- | --- |
| Total Records | |
| Total Features/Columns | |
| Date Range | |
| Data Format | CSV / Parquet / Database / API |
| Storage Location | |
| Update Frequency | Real-time / Daily / Weekly / Monthly / Static |
| Data Owner | |

Data Schema Summary

| Column Name | Data Type | Description | Required | Sensitive |
| --- | --- | --- | --- | --- |
| | | | Yes/No | Yes/No |

Quality Dimension Assessments

1. Completeness

Definition: Extent to which required data is present

| Metric | Target | Actual | Status |
| --- | --- | --- | --- |
| Overall completeness rate | > 95% | | Pass/Fail |
| Critical field completeness | 100% | | Pass/Fail |
| Record completeness | > 90% | | Pass/Fail |

Field-Level Completeness:

| Field Name | Critical | % Complete | Null Count | Action Required |
| --- | --- | --- | --- | --- |
| | Yes/No | | | |

Issues Identified:

- [ ] Missing mandatory fields
- [ ] Excessive null values
- [ ] Incomplete records
- [ ] Missing time periods

Completeness Score: ___/5
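
The three completeness metrics in this section can be computed in a few lines. The sketch below is one possible reading, not a prescribed method: it assumes records are dictionaries, that `None` and blank strings count as missing, that a record is complete only when every field is populated, and that the hypothetical `CRITICAL_FIELDS` set names the critical fields. The sample data is invented for illustration.

```python
# Hypothetical sample data; field names are illustrative only.
records = [
    {"id": 1, "age": 34, "city": "Lyon"},
    {"id": 2, "age": None, "city": "Oslo"},
    {"id": 3, "age": 51, "city": ""},
    {"id": 4, "age": 29, "city": "Lima"},
]
CRITICAL_FIELDS = {"id"}  # assumption: which fields are critical (must be non-empty)


def is_missing(value):
    """Treat None and blank strings as missing (an assumption)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def completeness_report(rows, critical_fields):
    """Compute overall, critical-field, and record completeness rates."""
    fields = sorted({name for row in rows for name in row})
    total_cells = len(rows) * len(fields)
    null_counts = {f: sum(is_missing(r.get(f)) for r in rows) for f in fields}
    filled_cells = total_cells - sum(null_counts.values())
    complete_rows = sum(
        all(not is_missing(r.get(f)) for f in fields) for r in rows
    )
    critical_cells = len(rows) * len(critical_fields)
    critical_missing = sum(null_counts[f] for f in critical_fields)
    return {
        "overall": filled_cells / total_cells,
        "critical": (critical_cells - critical_missing) / critical_cells,
        "record": complete_rows / len(rows),
        "null_counts": null_counts,
    }


report = completeness_report(records, CRITICAL_FIELDS)
for metric in ("overall", "critical", "record"):
    print(f"{metric}: {report[metric]:.1%}")
```

The `null_counts` entry maps directly onto the Null Count column of the field-level table.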


2. Accuracy

Definition: Extent to which data correctly represents real-world values

| Metric | Target | Actual | Status |
| --- | --- | --- | --- |
| Known error rate | < 2% | | Pass/Fail |
| Validation rule compliance | > 98% | | Pass/Fail |
| Cross-reference accuracy | > 99% | | Pass/Fail |

Accuracy Checks:

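| Check Type | Field(s) | Method | Result | Issues Found |
| --- | --- | --- | --- | --- |
| Range validation | | | Pass/Fail | |
| Format validation | | | Pass/Fail | |
| Reference data match | | | Pass/Fail | |
| Cross-field validation | | | Pass/Fail | |
| Sample verification | | | Pass/Fail | |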

Issues Identified:

- [ ] Out-of-range values
- [ ] Invalid formats
- [ ] Logical inconsistencies
- [ ] Known incorrect values

Accuracy Score: ___/5
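
Range, format, and reference-data checks can each be expressed as a predicate applied to every row. The sketch below shows one way to do that; the field names, the 0 to 120 age bound, the `VALID_COUNTRIES` reference set, and the deliberately simple email pattern are all assumptions made for illustration.

```python
import re

# Hypothetical records and rules; field names and bounds are illustrative.
rows = [
    {"age": 34, "email": "ana@example.com", "country": "FR"},
    {"age": 212, "email": "bob@example.com", "country": "NO"},
    {"age": 51, "email": "not-an-email", "country": "XX"},
]
VALID_COUNTRIES = {"FR", "NO", "PE"}  # assumed reference data
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simple on purpose

checks = {
    "range: age in 0..120": lambda r: 0 <= r["age"] <= 120,
    "format: email": lambda r: EMAIL_PATTERN.match(r["email"]) is not None,
    "reference: country": lambda r: r["country"] in VALID_COUNTRIES,
}


def run_checks(data, rules):
    """Return {check name: (pass rate, indexes of failing rows)}."""
    results = {}
    for name, rule in rules.items():
        failures = [i for i, row in enumerate(data) if not rule(row)]
        results[name] = (1 - len(failures) / len(data), failures)
    return results


accuracy_results = run_checks(rows, checks)
for name, (rate, failures) in accuracy_results.items():
    print(f"{name}: {rate:.0%} pass, failing rows {failures}")
```

Keeping the failing row indexes alongside the pass rate makes it easier to fill in the Issues Found column.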


3. Consistency

Definition: Extent to which data is uniform and coherent

| Metric | Target | Actual | Status |
| --- | --- | --- | --- |
| Format consistency | 100% | | Pass/Fail |
| Cross-source consistency | > 98% | | Pass/Fail |
| Temporal consistency | > 99% | | Pass/Fail |

Consistency Checks:

| Check Type | Description | Result | Issues |
| --- | --- | --- | --- |
| Date formats | Consistent date formatting | Pass/Fail | |
| Naming conventions | Consistent naming | Pass/Fail | |
| Unit consistency | Same units throughout | Pass/Fail | |
| Encoding consistency | Character encoding | Pass/Fail | |
| Cross-system alignment | Data matches across sources | Pass/Fail | |

Issues Identified:

- [ ] Mixed formats
- [ ] Inconsistent naming/spelling
- [ ] Different units
- [ ] Cross-source conflicts

Consistency Score: ___/5
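
The date-format check lends itself to automation. The sketch below tallies which of a few candidate formats each value parses under and reports the share that uses the most common one. The `CANDIDATE_FORMATS` list and the sample values are assumptions; extend the list to match the formats you expect in your own data.

```python
from collections import Counter
from datetime import datetime

# Candidate formats to test against; an assumption, extend as needed.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"]


def detect_format(value):
    """Return the first candidate format that parses the value, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None


def format_consistency(values):
    """Share of values in the most common parseable format, plus the tally."""
    tally = Counter(detect_format(v) for v in values)
    parsed = {fmt: n for fmt, n in tally.items() if fmt is not None}
    dominant_count = max(parsed.values(), default=0)
    return dominant_count / len(values), tally


dates = ["2024-01-15", "2024-02-01", "15/03/2024", "2024-04-30", "April 5"]
rate, tally = format_consistency(dates)
print(f"format consistency: {rate:.0%}")
print(dict(tally))
```

Values that match no candidate format are counted under `None`, so they lower the rate rather than being silently dropped.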


4. Timeliness

Definition: Extent to which data is sufficiently current

| Metric | Target | Actual | Status |
| --- | --- | --- | --- |
| Data currency | Within [X] days | | Pass/Fail |
| Update adherence | > 99% | | Pass/Fail |
| Lag time | < [X] hours | | Pass/Fail |

Timeliness Assessment:

| Data Element | Required Currency | Actual Age | Status |
| --- | --- | --- | --- |
| | | | Current/Stale |

Issues Identified:

- [ ] Stale data
- [ ] Missed updates
- [ ] Excessive processing lag
- [ ] Outdated reference data

Timeliness Score: ___/5
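
The Current/Stale status in the timeliness table is a comparison between a data element's age and its required currency. A minimal sketch, assuming each element records a timezone-aware last-update timestamp; the element names, timestamps, and limits below are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def currency_status(last_updated, max_age, now=None):
    """Return (age, 'Current' or 'Stale') for one data element."""
    now = now or datetime.now(timezone.utc)
    age = now - last_updated
    return age, "Current" if age <= max_age else "Stale"


# Hypothetical elements: name -> (last update, required currency).
as_of = datetime(2024, 6, 10, tzinfo=timezone.utc)
elements = {
    "transactions": (
        datetime(2024, 6, 9, 18, tzinfo=timezone.utc),
        timedelta(days=1),
    ),
    "fx_rates": (
        datetime(2024, 5, 1, tzinfo=timezone.utc),
        timedelta(days=7),
    ),
}

statuses = {
    name: currency_status(updated, limit, now=as_of)
    for name, (updated, limit) in elements.items()
}
for name, (age, status) in statuses.items():
    print(f"{name}: age {age}, {status}")
```

Passing `now` explicitly keeps the assessment reproducible for a given assessment date.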


5. Uniqueness

Definition: Extent to which records are free from duplication

| Metric | Target | Actual | Status |
| --- | --- | --- | --- |
| Duplicate records | < 1% | | Pass/Fail |
| Key uniqueness | 100% | | Pass/Fail |

Uniqueness Analysis:

| Check | Count/Rate | Threshold | Status |
| --- | --- | --- | --- |
| Exact duplicates | | < 0.1% | Pass/Fail |
| Near-duplicates | | < 1% | Pass/Fail |
| Key violations | | 0 | Pass/Fail |

Duplicate Patterns Found:

| Pattern | Count | Impact | Recommended Action |
| --- | --- | --- | --- |
| | | | |

Uniqueness Score: ___/5
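
Exact duplicates, near-duplicates, and key violations can all be counted the same way: group rows by some key and count every row beyond the first in each group. The sketch below assumes a hypothetical `customer_id` key and one possible near-duplicate rule (same name and email, ignoring case and spacing); real near-duplicate detection usually needs a rule tailored to the dataset.

```python
from collections import Counter

# Hypothetical records; field names are illustrative only.
rows = [
    {"customer_id": 1, "name": "Ana Silva", "email": "ana@example.com"},
    {"customer_id": 2, "name": "Bo Chen", "email": "bo@example.com"},
    {"customer_id": 2, "name": "Bo Chen", "email": "bo@example.com"},
    {"customer_id": 3, "name": " ana  silva", "email": "ANA@example.com"},
]


def surplus(keys):
    """Count rows beyond the first occurrence of each key."""
    return sum(n - 1 for n in Counter(keys).values())


def normalize(row):
    """Assumed near-duplicate rule: name and email, case and spacing ignored."""
    return (" ".join(row["name"].lower().split()), row["email"].lower())


exact = surplus(tuple(sorted(r.items())) for r in rows)
# Exact duplicates also match the normalized rule, so subtract them.
near = surplus(normalize(r) for r in rows) - exact
key_violations = surplus(r["customer_id"] for r in rows)

print(f"exact duplicates: {exact} ({exact / len(rows):.1%})")
print(f"near-duplicates: {near} ({near / len(rows):.1%})")
print(f"key violations: {key_violations}")
```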


6. Validity

Definition: Extent to which data conforms to defined rules and constraints

| Metric | Target | Actual | Status |
| --- | --- | --- | --- |
| Schema compliance | 100% | | Pass/Fail |
| Business rule compliance | > 99% | | Pass/Fail |
| Domain validity | > 99% | | Pass/Fail |

Validation Rules Assessment:

| Rule ID | Rule Description | Field(s) | Pass Rate | Violations |
| --- | --- | --- | --- | --- |
| V001 | | | | |
| V002 | | | | |
| V003 | | | | |

Issues Identified:

- [ ] Schema violations
- [ ] Invalid domain values
- [ ] Business rule failures
- [ ] Constraint violations

Validity Score: ___/5
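
The Pass Rate and Violations columns of the rules table can be filled in by running each rule over the dataset. The sketch below reuses the V001 to V003 placeholder IDs with one invented rule per category (schema, domain, business); the order fields, the `SCHEMA` mapping, and `ALLOWED_STATUS` are assumptions for illustration.

```python
# Hypothetical schema and rules; IDs mirror the V001-V003 placeholders.
SCHEMA = {"order_id": str, "quantity": int, "status": str}
ALLOWED_STATUS = {"open", "shipped", "cancelled"}

RULES = {
    "V001": (
        "Schema: fields have the expected types",
        lambda r: all(isinstance(r.get(f), t) for f, t in SCHEMA.items()),
    ),
    "V002": (
        "Domain: status is an allowed value",
        lambda r: r.get("status") in ALLOWED_STATUS,
    ),
    "V003": (
        "Business: shipped orders ship on or after the order date",
        # ISO 8601 date strings compare correctly as plain strings.
        lambda r: r.get("status") != "shipped"
        or (r.get("ship_date") is not None and r["ship_date"] >= r["order_date"]),
    ),
}

rows = [
    {"order_id": "A1", "quantity": 2, "status": "shipped",
     "order_date": "2024-03-01", "ship_date": "2024-03-02"},
    {"order_id": "A2", "quantity": "3", "status": "open",
     "order_date": "2024-03-05", "ship_date": None},
    {"order_id": "A3", "quantity": 1, "status": "lost",
     "order_date": "2024-03-07", "ship_date": None},
    {"order_id": "A4", "quantity": 5, "status": "shipped",
     "order_date": "2024-03-10", "ship_date": "2024-03-08"},
]


def assess(data, rules):
    """Return {rule id: description, pass rate, and violation count}."""
    summary = {}
    for rule_id, (description, check) in rules.items():
        violations = sum(not check(row) for row in data)
        summary[rule_id] = {
            "description": description,
            "pass_rate": 1 - violations / len(data),
            "violations": violations,
        }
    return summary


validity = assess(rows, RULES)
for rule_id, result in validity.items():
    print(rule_id, f"{result['pass_rate']:.0%}", result["violations"])
```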


AI/ML-Specific Quality Checks

7. Representativeness

Definition: Extent to which data represents the target population

| Check | Assessment | Notes |
| --- | --- | --- |
| Population coverage | Adequate / Partial / Poor | |
| Time period coverage | Adequate / Partial / Poor | |
| Edge case representation | Adequate / Partial / Poor | |
| Geographic coverage | Adequate / Partial / Poor | |
| Demographic coverage | Adequate / Partial / Poor | |

Distribution Analysis:

| Feature | Expected Distribution | Actual Distribution | Concern Level |
| --- | --- | --- | --- |
| | | | Low/Medium/High |

Representativeness Score: ___/5
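
For categorical features, the gap between the expected and actual distribution can be summarized in a single number. The sketch below uses total variation distance (half the sum of absolute differences in category shares), which is one of several reasonable choices and not something this template mandates. The expected shares, the sample, and the concern-level thresholds are assumptions.

```python
from collections import Counter


def proportions(values):
    """Share of each category in a list of observed values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(expected, actual):
    """0.0 means identical distributions, 1.0 means no overlap."""
    categories = set(expected) | set(actual)
    return 0.5 * sum(
        abs(expected.get(c, 0.0) - actual.get(c, 0.0)) for c in categories
    )


def concern_level(distance, medium=0.10, high=0.25):
    """Map a distance to Low/Medium/High; thresholds are assumptions."""
    if distance >= high:
        return "High"
    if distance >= medium:
        return "Medium"
    return "Low"


# Hypothetical target population shares and observed sample.
expected_regions = {"north": 0.25, "south": 0.25, "east": 0.25, "west": 0.25}
sample = ["north"] * 50 + ["south"] * 30 + ["east"] * 20
actual_regions = proportions(sample)

distance = total_variation_distance(expected_regions, actual_regions)
print(f"distance: {distance:.2f}, concern: {concern_level(distance)}")
```

A category that is expected but entirely absent from the sample, such as `west` here, contributes its full expected share to the distance.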


8. Bias Assessment

Definition: Extent to which data may introduce unfair bias into models

| Protected Attribute | Present | Distribution | Potential Bias Risk |
| --- | --- | --- | --- |
| Age | Yes/No | | Low/Medium/High |
| Gender | Yes/No | | Low/Medium/High |
| Location | Yes/No | | Low/Medium/High |
| Socioeconomic | Yes/No | | Low/Medium/High |
| Other: | Yes/No | | Low/Medium/High |

Bias Checks:

| Check | Finding | Risk Level |
| --- | --- | --- |
| Class imbalance | | Low/Medium/High |
| Historical bias indicators | | Low/Medium/High |
| Proxy discrimination risk | | Low/Medium/High |
| Sampling bias | | Low/Medium/High |

Bias Assessment Score: ___/5
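
Two of the bias checks have simple numeric starting points: class imbalance (ratio of the largest to the smallest label class) and a per-group positive-label rate, which can hint at historical bias. The sketch below computes both on invented data. The `gender` and `approved` field names are hypothetical, and neither number is a verdict on its own; what counts as Low, Medium, or High risk is left to the assessor.

```python
from collections import Counter, defaultdict

# Hypothetical labelled records: 10 per group, different approval rates.
rows = (
    [{"gender": "F", "approved": 1}] * 3
    + [{"gender": "F", "approved": 0}] * 7
    + [{"gender": "M", "approved": 1}] * 6
    + [{"gender": "M", "approved": 0}] * 4
)


def imbalance_ratio(data, label_field):
    """Size of the largest label class divided by the smallest."""
    counts = Counter(row[label_field] for row in data)
    return max(counts.values()) / min(counts.values())


def positive_rates(data, group_field, label_field):
    """Share of positive (1) labels within each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for row in data:
        totals[row[group_field]] += 1
        positives[row[group_field]] += row[label_field]
    return {group: positives[group] / totals[group] for group in totals}


ratio = imbalance_ratio(rows, "approved")
rates = positive_rates(rows, "gender", "approved")
parity_ratio = min(rates.values()) / max(rates.values())

print(f"class imbalance ratio: {ratio:.2f}")
print(f"positive rate by group: {rates}")
print(f"lowest / highest group rate: {parity_ratio:.2f}")
```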


9. Label Quality (for supervised learning)

Definition: Quality of target variable/labels for training

| Metric | Target | Actual | Status |
| --- | --- | --- | --- |
| Label completeness | 100% | | Pass/Fail |
| Label accuracy | > 98% | | Pass/Fail |
| Inter-rater reliability | > 90% | | Pass/Fail |

Label Assessment:

| Check | Result | Notes |
| --- | --- | --- |
| Labels present for all records | Yes/No | |
| Labeling methodology documented | Yes/No | |
| Label definitions clear | Yes/No | |
| Labels verified/validated | Yes/No | |
| Label distribution reasonable | Yes/No | |

Label Quality Score: ___/5
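
The template does not say how inter-rater reliability should be measured. Two common options are raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. The sketch below computes both for two raters on invented labels; which measure the target applies to is a decision for the assessment team.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of items on which both raters chose the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters.

    Undefined (division by zero) when chance agreement is exactly 1,
    for example when both raters use a single label throughout.
    """
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same ten items.
rater_1 = ["spam", "spam", "ham", "ham", "spam",
           "ham", "ham", "ham", "spam", "ham"]
rater_2 = ["spam", "ham", "ham", "ham", "spam",
           "ham", "spam", "ham", "spam", "ham"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(f"percent agreement: {agreement:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

On this sample the raters agree on 8 of 10 items, yet kappa is noticeably lower, which is why the choice of measure should be recorded next to the result.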


10. Feature Quality

Definition: Suitability of features for ML model development

| Feature | Type | Cardinality | Missing % | Variance | ML Suitability |
| --- | --- | --- | --- | --- | --- |
| | Numeric / Categorical | | | | High / Medium / Low |

Feature Issues:

| Issue Type | Features Affected | Severity | Recommendation |
| --- | --- | --- | --- |
| High cardinality | | | |
| Low variance | | | |
| High correlation | | | |
| Leakage risk | | | |

Feature Quality Score: ___/5
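
Most columns of the feature table can be profiled automatically. The sketch below computes type, cardinality, missing percentage, and population variance per feature, plus a Pearson correlation between two numeric features to flag redundancy. The feature names and values are invented, and leakage risk is not covered here because it depends on how the target was produced.

```python
from statistics import pvariance


def profile_feature(values):
    """Summarize one feature: type, cardinality, missing %, variance."""
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    return {
        "type": "Numeric" if numeric else "Categorical",
        "cardinality": len(set(present)),
        "missing_pct": 100 * (len(values) - len(present)) / len(values),
        "variance": pvariance(present) if numeric and present else None,
    }


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


# Hypothetical features chosen to trigger each issue type.
features = {
    "height_cm": [170, 182, 165, 177],
    "height_in": [66.9, 71.7, 65.0, 69.7],   # near-copy of height_cm
    "constant_flag": [1, 1, 1, 1],           # zero variance
    "user_id": ["u1", "u2", "u3", "u4"],     # one value per row
    "income": [52000, None, 61000, 48000],   # has a missing value
}

profiles = {name: profile_feature(vals) for name, vals in features.items()}
correlation = pearson(features["height_cm"], features["height_in"])

for name, profile in profiles.items():
    print(name, profile)
print(f"height_cm vs height_in correlation: {correlation:.3f}")
```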


Quality Score Summary

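| Dimension | Score | Weight | Weighted Score |
| --- | --- | --- | --- |
| Completeness | ___/5 | 15% | |
| Accuracy | ___/5 | 20% | |
| Consistency | ___/5 | 10% | |
| Timeliness | ___/5 | 10% | |
| Uniqueness | ___/5 | 10% | |
| Validity | ___/5 | 10% | |
| Representativeness | ___/5 | 10% | |
| Bias Assessment | ___/5 | 5% | |
| Label Quality | ___/5 | 5% | |
| Feature Quality | ___/5 | 5% | |
| Overall Score | | 100% | ___/5 |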

Quality Level Interpretation

| Score | Level | Recommendation |
| --- | --- | --- |
| 4.5 to 5.0 | Excellent | Data ready for production AI |
| 3.5 to < 4.5 | Good | Minor remediation, proceed with caution |
| 2.5 to < 3.5 | Fair | Significant remediation required |
| 1.5 to < 2.5 | Poor | Major data quality work needed |
| < 1.5 | Inadequate | Data not suitable for AI |
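
The overall score is the weighted sum of the ten dimension scores, using the weights from the summary table. The sketch below computes it and maps the result to a quality level. It assumes each level's lower bound is inclusive, so a score that falls between two printed ranges (for example 4.45) lands in the lower level; the example scores are hypothetical.

```python
# Weights copied from the Quality Score Summary table.
WEIGHTS = {
    "Completeness": 0.15,
    "Accuracy": 0.20,
    "Consistency": 0.10,
    "Timeliness": 0.10,
    "Uniqueness": 0.10,
    "Validity": 0.10,
    "Representativeness": 0.10,
    "Bias Assessment": 0.05,
    "Label Quality": 0.05,
    "Feature Quality": 0.05,
}
# (inclusive lower bound, level); inclusive bounds are an assumption.
LEVELS = [(4.5, "Excellent"), (3.5, "Good"), (2.5, "Fair"), (1.5, "Poor")]


def overall_score(scores):
    """Weighted sum of per-dimension scores, each on a 0-5 scale."""
    if abs(sum(WEIGHTS.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 100%")
    return sum(scores[dimension] * weight for dimension, weight in WEIGHTS.items())


def quality_level(score):
    """Map an overall score to its quality level."""
    for lower_bound, level in LEVELS:
        if score >= lower_bound:
            return level
    return "Inadequate"


# Hypothetical scores for one dataset.
example_scores = {
    "Completeness": 4, "Accuracy": 5, "Consistency": 4, "Timeliness": 3,
    "Uniqueness": 5, "Validity": 4, "Representativeness": 3,
    "Bias Assessment": 2, "Label Quality": 4, "Feature Quality": 3,
}
total = overall_score(example_scores)
print(f"overall score: {total:.2f}/5 ({quality_level(total)})")
```

Note that a low score on a lightly weighted dimension, such as Bias Assessment in this example, barely moves the total, so the per-dimension scores should be reviewed alongside the overall figure.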

Issue Summary & Remediation Plan

Critical Issues

| ID | Dimension | Issue | Impact | Remediation | Owner | Due |
| --- | --- | --- | --- | --- | --- | --- |
| DQ001 | | | High | | | |
| DQ002 | | | High | | | |

Moderate Issues

| ID | Dimension | Issue | Impact | Remediation | Owner | Due |
| --- | --- | --- | --- | --- | --- | --- |
| DQ003 | | | Medium | | | |
| DQ004 | | | Medium | | | |

Minor Issues

| ID | Dimension | Issue | Impact | Remediation | Owner | Due |
| --- | --- | --- | --- | --- | --- | --- |
| DQ005 | | | Low | | | |
| DQ006 | | | Low | | | |

Recommendations

Data Quality Improvements

1.
2.
3.

Process Improvements

1.
2.
3.

Monitoring Requirements

1.
2.
3.


Sign-Off

| Role | Name | Date | Decision |
| --- | --- | --- | --- |
| Data Quality Assessor | | | |
| Data Owner | | | Approved / Conditional / Rejected |
| AI/ML Lead | | | Suitable / Not Suitable |

Appendices

Appendix A: Data Profiling Results

Attach detailed profiling outputs

Appendix B: Sample Data Review

Document sample records reviewed

Appendix C: Tools Used

| Tool | Purpose | Version |
| --- | --- | --- |
| | Data profiling | |
| | Statistical analysis | |
| | Visualization | |