About the Data

Forbidden Questions

Questions About Data That Get Glossed Over
"Data is ready" is one of the most common lies in AI projects. Here are the questions that expose it.

The Foundational Questions

1. Has anyone actually looked at the data?

The forbidden version: "Not the schema. Not the documentation. The actual data."

What you'll hear instead: - "The data dictionary is available" - "Data owners have confirmed availability" - "We'll assess quality during design"

What to probe: - Has a data scientist or analyst opened the data and explored it? - What did they find when they looked—not what did someone tell them? - What's the gap between documentation and reality? - When was the data last actually examined?

Why it matters: Documentation describes what data should be. Reality is different. Unknown unknowns hide in unexplored data.
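A first look doesn't need to be elaborate. A minimal sketch like the one below, assuming a tabular extract loadable with pandas (the file name and every column it touches are hypothetical placeholders), is usually enough to surface the gap between the data dictionary and the data:

```python
# Minimal first-look sketch; "extract.csv" is a hypothetical file name.
import pandas as pd

df = pd.read_csv("extract.csv", low_memory=False)

print(df.shape)                       # how many rows and columns actually arrived
print(df.dtypes)                      # do the types match the data dictionary?
print(df.head(20))                    # real records, not the schema
print(df.isna().mean().sort_values(ascending=False).head(10))  # most-missing fields
print(df.describe(include="all").T)   # ranges, cardinality, suspicious constants
```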


2. Where did this data actually come from?

The forbidden version: "What's the true lineage, including the parts that make us uncomfortable?"

What you'll hear instead: - "The data comes from [System X]" - "It's from our authoritative source" - "Data governance has approved it"

What to probe: - How was this data originally collected? - Who collected it, and what were their incentives? - What decisions were made about what to collect and not collect? - What context has been lost in aggregation/transformation? - Who was systematically excluded from collection?

Why it matters: Data inherits the biases, blind spots, and limitations of its collection process. Historical data reflects historical conditions—including historical discrimination.


3. What's actually missing?

The forbidden version: "Not what percentage is 'complete'—what is systematically absent?"

What you'll hear instead: - "Data completeness is 94%" - "We can impute missing values" - "The gaps are random"

What to probe: - Is missingness actually random, or does it pattern? - Who is systematically underrepresented? - What information was never collected? - What would you need that doesn't exist? - Are the missing cases the ones that matter most?

Why it matters: Missing data isn't just holes—it's silence. Often, the people most affected by AI decisions are least represented in training data.
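One way to probe whether the gaps are random is to compare missing rates across groups. A rough sketch, assuming a pandas DataFrame with a hypothetical "income" field and "region" grouping column:

```python
# Sketch: does missingness pattern by group? File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("extract.csv")

overall = df["income"].isna().mean()
by_group = df["income"].isna().groupby(df["region"]).mean()

print(f"overall missing rate: {overall:.1%}")
print(by_group.sort_values(ascending=False))
# Large gaps between groups suggest systematic absence, not random holes.
```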


4. What does this data actually mean?

The forbidden version: "Can anyone here explain what each field actually represents, including the edge cases?"

What you'll hear instead: - "It's defined in the data dictionary" - "The field names are self-explanatory" - "The source team can explain"

What to probe: - What does "income" mean? (Gross? Net? Annual? At what point? Including what?) - What does "address" mean? (Current? Last known? Verified when?) - What does "status" mean? (Which status? As of when? Per whose judgment?) - What do null/blank/zero values mean? (Unknown? N/A? Never asked? Chose not to answer?)

Why it matters: Semantic ambiguity is where errors hide. When a model learns "income," it learns whatever happened to be in that field—which may not be what you think.
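Before trusting a field, look at what actually lives in it. A small sketch along these lines (pandas; the "income" column and file name are hypothetical stand-ins) exposes sentinel values and ambiguous zeros:

```python
# Sketch: what hides in one "self-explanatory" field? Names are hypothetical.
import pandas as pd

df = pd.read_csv("extract.csv")

print(df["income"].isna().sum(), "true nulls")
print((df["income"] == 0).sum(), "zeros (unknown? N/A? genuinely zero?)")
print(df["income"].value_counts(dropna=False).head(20))  # sentinels like -1 or 999999
```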


The Bias Questions

5. Whose reality does this data reflect?

The forbidden version: "Which groups shaped this data, and which groups were subjects of it?"

What you'll hear instead: - "The data covers our full population" - "We have representative samples" - "All relevant sources are included"

What to probe: - Who was deciding what to record and how? - Whose behaviors/decisions generated this data? - Who had power in the data-generating process? - Who was being observed/measured/judged? - Would a different collector have recorded different things?

Why it matters: Administrative data reflects the priorities and perspectives of administrators. Data about historical decisions reflects the biases of historical decision-makers. The data is not neutral.


6. What historical discrimination is encoded here?

The forbidden version: "If historical decisions were biased, and this data reflects historical decisions, isn't our data biased?"

What you'll hear instead: - "We're not using demographic variables" - "The model is objective" - "We've removed protected characteristics"

What to probe: - Does historical data reflect historical discrimination in the domain? - Are there proxies for protected characteristics in the data? - If past human decisions were biased, and we're training on past decisions...? - What would fair data look like, and how different is ours?

Why it matters: Removing demographic variables doesn't remove discrimination. Discrimination encodes itself into proxies, patterns, and outcomes. Training on a biased history perpetuates a biased future.
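One rough way to check for proxies is to see whether the "neutral" features can reconstruct the removed characteristic. The sketch below (pandas and scikit-learn; every file and column name is a hypothetical placeholder) is a probe, not a fairness audit:

```python
# Sketch: can the remaining fields predict a removed protected characteristic?
# Column names are hypothetical; requires pandas and scikit-learn.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("extract.csv").dropna()
X = pd.get_dummies(df.drop(columns=["protected_attribute", "outcome"]))
y = df["protected_attribute"]

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"protected attribute recoverable with accuracy {scores.mean():.0%}")
# Accuracy well above the base rate means the "removed" characteristic is still encoded.
```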


7. What would be different if the data were fair?

The forbidden version: "If we corrected for historical bias, would this data look the same?"

What you'll hear instead: - "We work with the data we have" - "We can't change history" - "The data is what it is"

What to probe: - If past decisions in this domain were fair, would outcome distributions be different? - Which groups are over/underrepresented compared to fair baseline? - What patterns in the data are artifacts of discrimination vs. real differences? - Are we training on outcomes we'd want to perpetuate?

Why it matters: "The data is what it is" is true but incomplete. The data is also what it was made to be, by processes that may have been discriminatory.


The Quality Questions

8. How wrong is this data?

The forbidden version: "Not what's the error rate in ideal conditions—what's the error rate where it matters?"

What you'll hear instead: - "Data quality is within acceptable parameters" - "We have validation processes" - "Errors are rare"

What to probe: - What's the actual error rate, measured against ground truth? - Are errors random or systematic? - Where are errors concentrated? (Populations? Fields? Time periods?) - What kind of errors exist—wrong values, wrong labels, wrong meaning? - Who is most likely to have erroneous data?

Why it matters: A 5% error rate sounds small until you realize it's concentrated in specific populations. Random errors average out; systematic errors compound.
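Measuring the error rate requires a ground-truth sample, typically a manually audited subset of records. A sketch of the comparison, assuming hypothetical files and columns ("audit_sample.csv", "record_id", "income", "region"):

```python
# Sketch: error rate against an audited sample, and where errors concentrate.
# File and column names are hypothetical placeholders.
import pandas as pd

records = pd.read_csv("extract.csv")
audit = pd.read_csv("audit_sample.csv")   # same IDs, values verified by hand

merged = records.merge(audit, on="record_id", suffixes=("", "_true"))
merged["is_error"] = merged["income"] != merged["income_true"]

print(f"overall error rate: {merged['is_error'].mean():.1%}")
print(merged.groupby("region")["is_error"].mean().sort_values(ascending=False))
# A "small" overall rate can hide a much larger rate in specific populations.
```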


9. How stale is this data?

The forbidden version: "At the moment the model makes a decision, how old is the information it's using?"

What you'll hear instead: - "Data is updated [frequency]" - "We use the most recent available" - "Refresh processes are in place"

What to probe: - What's the actual latency from reality to data? - How much can change in that time? - Are decisions time-sensitive in ways that stale data can't serve? - Who is most affected by decisions made on stale data?

Why it matters: A model might be "right" according to the data it has while being wrong about current reality. Staleness matters most for people whose situations are changing.
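The question is answerable with two timestamps per record: when the value was last updated and when a decision used it. A sketch, assuming hypothetical "last_updated" and "decision_time" columns:

```python
# Sketch: how old is the data at the moment of decision? Names are hypothetical.
import pandas as pd

df = pd.read_csv("decisions.csv", parse_dates=["last_updated", "decision_time"])

age_days = (df["decision_time"] - df["last_updated"]).dt.days
print(age_days.describe())   # median and worst-case staleness
print(f"{(age_days > 90).mean():.0%} of decisions used data more than 90 days old")
```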


10. What happens at the edges?

The forbidden version: "We optimized for the middle—what's happening at the tails of the distribution?"

What you'll hear instead: - "Performance is strong on average" - "Metrics meet requirements" - "Edge cases are handled"

What to probe: - What's performance like for rare cases? - What's data quality like for unusual situations? - Who falls into categories with sparse data? - What happens when the data contains something never seen before?

Why it matters: Edge cases are where AI systems fail badly. They're also often where the most vulnerable people are. Optimizing for average means sacrificing edges.
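A quick way to see the tails is to break the headline metric down by segment and sort by segment size. A sketch with hypothetical "segment", "prediction", and "actual" columns:

```python
# Sketch: per-segment performance, smallest segments first. Names are hypothetical.
import pandas as pd

df = pd.read_csv("scored.csv")
df["correct"] = df["prediction"] == df["actual"]

by_segment = df.groupby("segment").agg(n=("correct", "size"),
                                       accuracy=("correct", "mean"))
print(by_segment.sort_values("n").head(10))
# Strong averages can coexist with poor results exactly where data is sparsest.
```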


The Access Questions

11. Can we actually use this data for this purpose?

The forbidden version: "Not 'is it technically possible'—is it legally, ethically, and socially legitimate?"

What you'll hear instead: - "Data has been approved for use" - "We have appropriate agreements" - "Privacy has been considered"

What to probe: - Did the people whose data this is consent to this use? - Does secondary use for AI match the original collection purpose? - What would data subjects think if they knew? - What's the reputational risk if data use becomes public?

Why it matters: Legal permission is not the same as ethical legitimacy. "We can" is not the same as "we should." Public discovery of data use can create crisis regardless of technical legality.


12. Who controls this data?

The forbidden version: "If the data owner changes their mind, what happens to our AI system?"

What you'll hear instead: - "Data sharing agreements are in place" - "We have established partnerships" - "Data flows are documented"

What to probe: - Who can cut off data access, and under what circumstances? - What happens to the model if data supply is interrupted? - Are there political/organizational risks to data access? - Do we have leverage, or are we dependent?

Why it matters: AI systems depend on data. If you don't control the data, you don't control the system. Dependencies can become vulnerabilities.


The Honest Assessment Questions

13. If we're honest, is this data good enough?

The forbidden version: "If we weren't under pressure to proceed, would we use this data?"

What you'll hear instead: - "It's the best available" - "We can work with it" - "Perfect is the enemy of good"

What to probe: - In an ideal world, what data would we want? - How far is our data from that ideal? - What are we sacrificing by using imperfect data? - Are we rationalizing "good enough" or genuinely assessing?

Why it matters: There's a difference between "this data has limitations we've considered" and "we're proceeding despite inadequate data because we're committed." Know which one you're doing.


14. What would a skeptic say about this data?

The forbidden version: "If someone wanted to attack our AI system, what would they say about our data?"

What you'll hear instead: - "Our methodology is sound" - "We've followed best practices" - "The approach is defensible"

What to probe: - What would a journalist write about our data? - What would an academic find if they analyzed it? - What would an advocate discover if they investigated? - What would a Senate committee ask?

Why it matters: External scrutiny will come. Better to apply it yourself first. The weaknesses you don't find, others will.


15. What are we pretending not to know?

The forbidden version: "What data problems are we aware of but choosing to ignore?"

What you'll hear instead:
  • [Silence]

What to probe:
  • What concerns have been raised and dismissed?
  • What would the thorough review we haven't done uncover?
  • What do the data team's private concerns look like?
  • What are we hoping won't become an issue?

Why it matters: Often, data problems are known but not acknowledged because acknowledging them would slow the project. This creates liability and risk. What you know but don't say can still be discovered.


The Data Truth Table

What They Say             What It Often Means      What to Ask
"Data is available"       Data exists somewhere    "Can we access it today?"
"Data is high quality"    Someone said it was      "What's the measured error rate?"
"Data is representative"  We sampled something     "Representative of whom, exactly?"
"Data is clean"           Someone ran a script     "What does 'clean' mean here?"
"Data is compliant"       Legal said it's okay     "What would data subjects think?"
"Data is ready"           It exists in some form   "Ready for what, specifically?"
"Data is complete"        Most fields have values  "What's systematically missing?"
"Data is current"         It was updated once      "How old is it at decision time?"

Before You Trust the Data

Ask yourself:

  1. Have I seen it with my own eyes (or trusted someone who has)?
  2. Do I understand where it came from and what shaped it?
  3. Do I know what's missing and why?
  4. Have I examined it for systematic bias?
  5. Do I know the actual error rate, not the claimed one?
  6. Can I defend this data use to someone affected by it?
  7. Am I genuinely comfortable, or am I rationalizing?

If you can't answer "yes" confidently to all of these, you don't trust the data—you're hoping it's trustworthy.

That's not the same thing.


"The data doesn't lie—but it doesn't tell the whole truth either. And sometimes what it leaves out is more important than what it contains."