Synthetic Data for AI Projects - Fact Sheet

In brief: Synthetic data is artificially generated data that mimics real data patterns without containing actual personal information. It enables AI development while minimizing privacy and security risks.
Key Benefits
  • Privacy: No real personal information at risk
  • Security: Safe for dev/test environments
  • Availability: Generate unlimited data on demand
  • Compliance: Simpler Privacy Act compliance

What is Synthetic Data?

Synthetic data is artificially generated data that mimics the statistical properties and patterns of real data without containing actual personal information or sensitive records.

Key characteristics:

  • Generated by algorithms or AI models
  • Statistically similar to real data
  • Contains no actual personal information
  • Can be used for training, testing, and development


Why Use Synthetic Data?

Privacy Protection

  • No real personal information at risk
  • Reduces privacy impact assessment complexity
  • Minimizes data breach consequences
  • Enables data sharing without privacy concerns

Security

  • Safe to use in development and test environments
  • Can be shared with vendors and partners
  • Reduces need for production data access
  • Limits exposure of sensitive information

Compliance

  • Easier compliance with the Privacy Act 1988
  • Reduces PSPF security requirements
  • Simplifies cross-border data considerations
  • Supports privacy by design principles

Availability

  • Generate unlimited data on demand
  • Create edge cases and rare scenarios
  • Balance datasets to reduce bias
  • Fewer restrictions on data access

Cost & Efficiency

  • Avoid complex data access approvals
  • Reduce data anonymization effort
  • Enable parallel development teams
  • Faster testing and iteration

Common Use Cases in APS

1. AI Model Training

  • Train machine learning models without real data
  • Particularly useful for initial development
  • Test model architectures before production data access

2. Software Testing

  • Test application functionality
  • Validate data processing pipelines
  • Load and performance testing
  • User acceptance testing

3. Demonstrations and Proofs of Concept

  • Show system functionality to stakeholders
  • Create realistic demos without privacy concerns
  • Prototype AI solutions before approval

4. Data Augmentation

  • Supplement limited real data
  • Create balanced training datasets
  • Generate rare but important scenarios
  • Address class imbalance in training data
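As one illustration of addressing class imbalance, minority-class records can be oversampled (with small perturbations) until the classes balance. This is a minimal sketch using only the standard library; the labels, fields, and class sizes are hypothetical:

```python
import random

random.seed(42)  # reproducible output

# Hypothetical imbalanced dataset: 95 'standard' vs 5 'complex' applications
data = [{'label': 'standard', 'days': random.uniform(5, 20)} for _ in range(95)]
data += [{'label': 'complex', 'days': random.uniform(30, 90)} for _ in range(5)]

minority = [d for d in data if d['label'] == 'complex']

# Oversample: resample minority records with small jitter until classes balance
augmented = list(data)
while sum(d['label'] == 'complex' for d in augmented) < 95:
    base = random.choice(minority)
    augmented.append({'label': 'complex', 'days': base['days'] + random.gauss(0, 1)})

counts = {lbl: sum(d['label'] == lbl for d in augmented) for lbl in ('standard', 'complex')}
print(counts)
```

The jitter keeps duplicated records from being exact copies; production pipelines typically use purpose-built techniques (e.g. SMOTE) rather than this hand-rolled resampling.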

5. External Collaboration

  • Share data with vendors safely
  • Enable partner development
  • Support research collaborations
  • Facilitate training and education

Types of Synthetic Data

1. Rule-Based Synthetic Data

How it works: Data is generated according to predefined rules and patterns

Example: Generate random names, addresses, dates within constraints

Best for:

  • Simple structured data
  • Quick prototyping
  • Known data patterns

Limitations:

  • May not capture complex relationships
  • Less realistic for complex datasets
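The rule-based approach can be sketched with the Python standard library alone; the field names, value lists, and date range below are illustrative assumptions rather than any real schema:

```python
import random
import string
from datetime import date, timedelta

random.seed(0)  # reproducible output

def rule_based_record():
    """Generate one record from fixed rules: known value sets and date ranges."""
    start = date(2024, 1, 1)
    return {
        # Illustrative fields only: values are picked from predefined lists/ranges
        'client_id': ''.join(random.choices(string.digits, k=8)),
        'state': random.choice(['NSW', 'VIC', 'QLD', 'WA', 'SA', 'TAS', 'ACT', 'NT']),
        'lodged': (start + timedelta(days=random.randrange(365))).isoformat(),
    }

records = [rule_based_record() for _ in range(5)]
for r in records:
    print(r)
```

Every rule is explicit, which makes the output easy to reason about, but no relationships between fields are modelled.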

2. Statistical Synthetic Data

How it works: Generated data matches the statistical distributions of the real data

Example: Preserve means, variances, correlations between variables

Best for:

  • Maintaining data relationships
  • Statistical analysis
  • Structured tabular data

Limitations:

  • May not preserve complex patterns
  • Requires access to real data statistics
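A minimal statistical sketch: fit summary statistics from a real sample, then draw synthetic values from a distribution with the same parameters. The "real" sample and the normality assumption below are purely illustrative:

```python
import random
import statistics

random.seed(1)  # reproducible output

# Stand-in for real data, e.g. processing times in days (illustrative values)
real = [12.0, 15.5, 9.8, 22.1, 14.3, 18.7, 11.2, 16.9]

# Fit simple summary statistics from the real sample
mu = statistics.mean(real)
sigma = statistics.stdev(real)

# Draw synthetic values from a normal distribution with the same mean/stdev
synthetic = [random.gauss(mu, sigma) for _ in range(1000)]

print(round(statistics.mean(synthetic), 1), round(statistics.stdev(synthetic), 1))
```

Libraries such as SDV generalise this idea to whole tables, preserving correlations between columns rather than one distribution at a time.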

3. AI-Generated Synthetic Data

How it works: AI models (GANs, VAEs) learn from real data and generate synthetic data

Example: Train a generative model on real customer records to create synthetic records

Best for:

  • Complex, high-dimensional data
  • Realistic data with subtle patterns
  • Text, images, and unstructured data

Limitations:

  • Requires significant real data for training
  • Computationally intensive
  • Risk of memorization (reproducing real data)


Methods and Tools

Open Source Tools

Python and R libraries:

  • SDV (Synthetic Data Vault): Comprehensive Python library for tabular data
  • Faker: Generates fake but realistic data (names, addresses, etc.)
  • CTGAN: Generative adversarial network for tabular data
  • Gretel Synthetics: Time series and text data
  • Synthpop (R): Statistical synthetic data generation

Example (using Faker):

from faker import Faker
fake = Faker('en_AU')  # Australian locale

# Generate synthetic Australian data
name = fake.name()
address = fake.address()
email = fake.email()
phone = fake.phone_number()
medicare = fake.bothify(text='#### ##### #')  # Medicare-style digit pattern only; random digits, not a valid number

Commercial Solutions

  • MOSTLY AI: Enterprise synthetic data platform
  • Gretel.ai: Cloud-based synthetic data generation
  • Hazy: Privacy-preserving synthetic data
  • Tonic.ai: Test data management with synthesis

Cloud Provider Services

  • AWS: Amazon SageMaker Data Wrangler
  • Azure: Azure Machine Learning synthetic data features
  • Google Cloud: Vertex AI synthetic data capabilities

Best Practices

1. Define Your Requirements

Before generating synthetic data, document:

  • What will the data be used for?
  • What data fields are needed?
  • What relationships must be preserved?
  • What statistical properties matter?
  • How much data is needed?

2. Start Simple

  • Begin with rule-based generation for simple use cases
  • Progress to statistical methods for complex relationships
  • Use AI generation only when necessary
  • Test with small datasets before scaling

3. Validate Synthetic Data Quality

Check that synthetic data:

  • Matches statistical distributions of real data
  • Preserves correlations between variables
  • Contains realistic values and combinations
  • Includes appropriate variety and edge cases
  • Doesn't replicate real records (privacy check)
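The distribution checks can be sketched as a simple column-wise comparison; the tolerance and the toy age columns below are arbitrary assumptions:

```python
import statistics

def validate_column(real, synth, tolerance=0.15):
    """Flag a synthetic column whose mean or stdev drifts more than `tolerance`
    (as a fraction of the real value) from the real column."""
    checks = {
        'mean': (statistics.mean(real), statistics.mean(synth)),
        'stdev': (statistics.stdev(real), statistics.stdev(synth)),
    }
    return {name: abs(s - r) / abs(r) <= tolerance for name, (r, s) in checks.items()}

# Toy example: synthetic ages drawn to resemble real ages
real_ages = [23, 35, 41, 52, 29, 60, 38, 47]
synth_ages = [25, 33, 44, 50, 31, 58, 36, 45]
print(validate_column(real_ages, synth_ages))
```

A fuller validation would also compare correlations, value ranges, and category frequencies, as the checklist above describes.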

4. Test for Privacy Leakage

Ensure synthetic data doesn't expose real data:

  • Check for exact matches to real records
  • Test distance to nearest real record
  • Assess membership inference risk
  • Review for unique identifiable combinations

Tools:

  • Privacy metrics in the SDV library
  • Differential privacy validation
  • Manual review of samples
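The exact-match and nearest-record checks can be sketched with the standard library; a real assessment would use dedicated privacy metrics, and the (age, income) records here are toy values:

```python
import math

real = [(34, 72000.0), (51, 95000.0), (28, 61000.0)]       # toy (age, income) records
synthetic = [(33, 70500.0), (51, 95000.0), (45, 88000.0)]  # one record matches exactly

# 1. Exact-match check: synthetic records that replicate a real record verbatim
exact_matches = [s for s in synthetic if s in real]

# 2. Nearest-record distance: suspiciously small distances suggest memorization
def nearest_distance(record, reference):
    return min(math.dist(record, row) for row in reference)

distances = [nearest_distance(s, real) for s in synthetic]
print(len(exact_matches), [round(d, 1) for d in distances])
```

In practice, columns should be normalised before computing distances so that one large-scale field (here, income) does not dominate the metric.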

5. Document Generation Process

Document:

  • Method used to generate data
  • Source data characteristics (without exposing it)
  • Generation parameters and settings
  • Quality validation results
  • Privacy assessment
  • Intended use and limitations

6. Manage Expectations

Synthetic data is NOT:

  • A perfect replacement for real data
  • Guaranteed to train production-quality models
  • Completely without privacy risk (if poorly generated)
  • Suitable for all use cases

Synthetic data IS:

  • Useful for development and testing
  • Helpful for initial model development
  • Good for demonstrations and training
  • A privacy-enhancing tool when used correctly


Risks and Limitations

1. Privacy Risks

Potential issues:

  • Memorization: AI models might memorize and reproduce real records
  • Inference attacks: Attackers might infer real data from synthetic data
  • Re-identification: Combinations of synthetic attributes might match real individuals

Mitigations:

  • Use differential privacy in generation
  • Test for distance to real records
  • Limit access to generation models
  • Regular privacy assessments

2. Quality Limitations

Challenges:

  • May not capture all real-world patterns
  • Edge cases might be unrealistic
  • Complex relationships might be lost
  • Temporal patterns may be inconsistent

Mitigations:

  • Validate against real data statistics
  • Human review of synthetic samples
  • Use synthetic data alongside small amounts of real data
  • Continuous quality monitoring

3. Model Performance

Considerations:

  • Models trained only on synthetic data may underperform
  • Synthetic data might not represent real-world distribution shifts
  • May not generalize well to production

Best practices:

  • Use synthetic data for initial development
  • Validate with real data before production
  • Consider hybrid approaches (synthetic + small real dataset)
  • Benchmark performance against real-data-trained models


Compliance and Governance

Privacy Act Considerations

Is synthetic data "personal information"?

Generally NO, if properly generated:

  • Contains no actual personal information
  • Cannot reasonably identify real individuals
  • Generated through appropriate methods

However:

  • Poorly generated synthetic data might still identify individuals
  • A privacy assessment is recommended for AI-generated synthetic data
  • Document the generation method and privacy validation

Security Classification

Typical classification:

  • Well-generated synthetic data: OFFICIAL or unclassified
  • Synthetic data derived from sensitive data: case-by-case assessment

Consider:

  • Original data classification
  • Method of generation
  • Quality of privacy protection
  • Intended use

Approval and Governance

Recommended governance:

  • Document the business case for synthetic data use
  • Privacy officer review for AI-generated synthetic data
  • Security assessment if derived from classified data
  • Data governance board approval for large-scale use
  • Regular quality and privacy audits


Getting Started: Quick Guide

Step 1: Assess Your Need

  • Identify use case for synthetic data
  • Determine if synthetic data is appropriate
  • Define requirements (fields, volume, quality)

Step 2: Choose Generation Method

  • Select method based on use case and complexity
  • Identify tools or services to use
  • Consider privacy and security requirements

Step 3: Generate Synthetic Data

  • Set up generation environment
  • Configure generation parameters
  • Generate initial sample dataset
  • Review and validate quality

Step 4: Validate and Approve

  • Test statistical similarity to real data
  • Check for privacy leakage
  • Document generation process
  • Obtain necessary approvals

Step 5: Use and Monitor

  • Deploy synthetic data for intended use
  • Monitor quality and fitness for purpose
  • Gather feedback from users
  • Iterate and improve as needed

Example: Generating Synthetic Citizen Data

Scenario: Need test data for AI service that processes citizen applications

Requirements:

  • 10,000 synthetic citizen records
  • Fields: Name, DOB, Address, Email, Phone, Application Type
  • Australian-realistic data
  • No real personal information

Solution using Python (Faker + Pandas):

from faker import Faker
import pandas as pd
import random

fake = Faker('en_AU')
Faker.seed(12345)  # Reproducible results

def generate_synthetic_record():
    return {
        'name': fake.name(),
        'date_of_birth': fake.date_of_birth(minimum_age=18, maximum_age=90),
        'address': fake.address().replace('\n', ', '),
        'email': fake.email(),
        'phone': fake.phone_number(),
        'application_type': random.choice(['New', 'Renewal', 'Update', 'Cancel']),
        'application_date': fake.date_between(start_date='-1y', end_date='today'),
        'postcode': fake.postcode()
    }

# Generate 10,000 records
synthetic_data = [generate_synthetic_record() for _ in range(10000)]
df = pd.DataFrame(synthetic_data)

# Save to file
df.to_csv('synthetic_citizen_data.csv', index=False)

print(f"Generated {len(df)} synthetic records")
print(df.head())

Result: 10,000 realistic but completely synthetic Australian citizen records, ready for testing.


Resources and Further Reading

Academic Research

  • "Synthetic Data: Opening the data floodgates to enable faster, more directed development" (MIT Technology Review)
  • "The Synthetic Data Vault" (IEEE Conference Paper)

GovSafeAI Toolkit Resources

  • Privacy Impact Assessment FAQ
  • PII Masking Utility (03-tools/utilities/pii_masking.py)
  • Data Quality Assessment Template