
Go From Idea to LLM Demo in a Day with Synthetic Data

Ever heard this before:

“We love the concept, but can you show us a demo?”

If you’re building LLM-powered products, this question can stop you in your tracks. Not because of technical challenges, but because you’re waiting on data. The good news? There’s a faster way:

Generate synthetic data using LLMs.

In this guide, I’ll show you a quick and simple demo of how to do this with a Python pipeline that generates synthetic test data in minutes using OpenAI and Pydantic. Whether you’re validating a concept, building a prototype, or testing edge cases, this approach can compress weeks of preparation into hours.

The Case for Synthetic Data

Every product manager, stakeholder, or freelance client wants to see a working demo before moving forward. But here’s what typically happens:

  • You spend weeks collecting and cleaning real user data
  • Legal teams take forever to approve data usage
  • Privacy concerns create endless back-and-forth
  • The project loses momentum while waiting for backend systems

Naturally, ML apps, being data-driven, are particularly vulnerable to these delays. Even with LLM products, which tend to be more demo-friendly, you still need relevant data to showcase unique value. As a result, good ideas often die before they can prove themselves.

This is where synthetic data becomes your secret weapon. Instead of waiting weeks or months, you can:

  • Build Faster: Create working prototypes in hours instead of weeks
  • Test Earlier: Validate your ideas before investing in data collection
  • Iterate Quickly: Customize your data on the fly as requirements change
  • Demo Confidently: Show stakeholders real functionality without privacy concerns

Ultimately, you want to validate the need for any further data investment while accelerating development. Remember, you don’t need perfect data to prove an idea — you need data that’s realistic enough to demonstrate value.

Generating Synthetic Data

The problem with synthetic data is that people tend not to take it seriously: few bother to build a reliable, maintainable pipeline for generating it, one with a repeatable, customizable process that follows good software practices.

Below I’ve written a class to help you get started. It’s by no means the be-all and end-all of synthetic data generation pipelines; it’s just one Python class demonstrating the task, but hopefully it will motivate you to build your own pipeline for your use case. Here’s what this one does:

  • Takes a Pydantic schema to ensure data consistency
  • Uses custom prompts to guide the LLM’s output
  • Validates everything automatically
  • Saves data in .jsonl format for easy pipeline integration
  • Is customizable for various types of structured data

Let’s dive into the code and see how it works.

from typing import List, Type, Dict
from pydantic import BaseModel, ValidationError
from openai import OpenAI
import time
import json
import os
from dotenv import load_dotenv

load_dotenv()

class SyntheticDataGenerator:
    def __init__(
            self,
            schema: Type[BaseModel],
            prompt_template: str,
            output_file: str = "synthetic_data.jsonl",
            n_samples: int = 100,
            model: str = "gpt-3.5-turbo-1106",  # Updated to a model that supports JSON mode
            max_retries: int = 3,
            delay: float = 1.0,
            temperature: float = 0.0
    ):
        self.schema = schema
        self.prompt_template = prompt_template
        self.output_file = output_file
        self.n_samples = n_samples
        self.model = model
        self.max_retries = max_retries
        self.delay = delay
        self.temperature = temperature
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    def _call_openai(self, prompt: str) -> str:
        for attempt in range(self.max_retries):
            try:
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "system", "content": "You are a helpful assistant that generates support ticket data in JSON format."},
                        {"role": "user", "content": prompt}
                    ],
                    temperature=self.temperature,
                    response_format={"type": "json_object"}  # Force JSON output
                )
                content = response.choices[0].message.content
                print(f"Raw API response: {content}")  # Debug print
                return content
            except Exception as e:
                print(f"OpenAI API error (attempt {attempt + 1}/{self.max_retries}): {e}")
                if attempt < self.max_retries - 1:
                    time.sleep(self.delay)
        return ""

    def _generate_sample(self) -> Dict:
        raw_output = self._call_openai(self.prompt_template)
        try:
            data = json.loads(raw_output)
            sample = self.schema(**data)
            return sample.model_dump()  # use sample.dict() if you're on Pydantic v1
        except (json.JSONDecodeError, ValidationError) as e:
            print(f"Parsing error: {e}")
            return {}

    def _write_to_file(self, data: List[Dict]) -> None:
        with open(self.output_file, "w") as f:
            for item in data:
                if item:
                    f.write(json.dumps(item) + "\n")

    def run(self) -> None:
        print(f"Generating {self.n_samples} samples...")
        data = []
        for _ in range(self.n_samples):
            sample = self._generate_sample()
            if sample:
                data.append(sample)
            time.sleep(self.delay)
        self._write_to_file(data)
        print(f"Saved {len(data)} samples to {self.output_file}")

Example

Here’s an example to show how this works. You simply need to define your schema and prompt, then pass them into the class initializer.

class SupportTicket(BaseModel):
    customer_name: str
    issue_summary: str
    issue_detail: str
    priority: str

support_ticket_prompt = """
Generate a synthetic customer support ticket in JSON format with the following fields:
- customer_name
- issue_summary
- issue_detail
- priority (low, medium, high)

Example:
{
  "customer_name": "Alice Johnson",
  "issue_summary": "App crash on launch",
  "issue_detail": "The mobile app crashes immediately after tapping the icon on Android 14.",
  "priority": "high"
}

The response must be a valid JSON object with exactly these fields and no additional content.
Now generate one new ticket:
"""
generator = SyntheticDataGenerator(
    schema=SupportTicket,
    prompt_template=support_ticket_prompt,
    output_file="support_tickets.jsonl",
    n_samples=50
)

generator.run()

You can pass in your own output file name too, but to show you explicitly, running the code block above will give you a .jsonl file with samples that look like this:

{"customer_name": "Bob Smith", "issue_summary": "Unable to login", "issue_detail": "I am unable to login to my account using my username and password.", "priority": "medium"}
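Once the file is written, reading it back for downstream use is straightforward. Here’s a minimal sketch of a loader (the file name and `priority` field are just the ones from the example above):

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per line, skipping any blank lines."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# e.g. filter the generated tickets:
# tickets = load_jsonl("support_tickets.jsonl")
# high_priority = [t for t in tickets if t["priority"] == "high"]
```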

Real-World Applications

Here are some more considerations on using this in practice:

1. Rapid Prototyping

The main point of using synthetic data is probably getting your own custom datasets for whatever you’re building. This can be especially useful if you want to test the feasibility of an idea where you know quality real-world data exists, but it’s too expensive, has nasty rate limits, or is private data you can’t get immediate access to.

2. Minority Classes

Another place where this could be worth using is when training a model on a dataset with a minority class, where you don’t see model performance improving without more data for that class specifically. Classification problems are usually made interesting by their minority classes, which naturally tend to be harder to obtain data for, so augmenting your dataset with synthetic data here could prove really handy.
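As a sketch of what that balancing step might look like, here’s a small helper that tops up the minority class with synthetic records until it matches the largest class. The `label` field names are hypothetical; the synthetic records would come from a generator like the one above:

```python
from collections import Counter

def augment_minority(real: list[dict], synthetic: list[dict],
                     label_key: str, minority_label: str) -> list[dict]:
    """Add synthetic minority-class records until that class matches
    the size of the largest class (or synthetic samples run out)."""
    counts = Counter(r[label_key] for r in real)
    target = max(counts.values())
    deficit = target - counts[minority_label]
    extras = [s for s in synthetic if s[label_key] == minority_label]
    return real + extras[:max(deficit, 0)]
```

This keeps every real record and only uses as many synthetic ones as needed, so the real data still dominates wherever it can.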

3. Fine-tuning an SLM

This is probably one of the more popular use cases: fine-tuning a smaller language model (SLM) via knowledge distillation. If you’re using an LLM extensively but have a specific task that is fairly simple, it might be worth fine-tuning an SLM to save costs and inference time.
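As one hedged sketch of the prep step, here’s how you might convert the generated tickets into the chat-style training format that OpenAI’s fine-tuning API expects, treating priority prediction as the distilled task. The system instruction and the way the fields are split into turns are assumptions for illustration:

```python
def to_finetune_format(records: list[dict]) -> list[dict]:
    """Turn each synthetic ticket into one chat-format training
    example: the issue text is the user turn, priority is the target."""
    examples = []
    for r in records:
        examples.append({
            "messages": [
                {"role": "system",
                 "content": "Classify the support ticket priority as low, medium, or high."},
                {"role": "user",
                 "content": f"{r['issue_summary']}: {r['issue_detail']}"},
                {"role": "assistant", "content": r["priority"]},
            ]
        })
    return examples
```

Each example would then be written out one-per-line as .jsonl, just like the generator output, before being uploaded as a training file.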

Best Practices and Tips

When using synthetic data, keep these points in mind:

  1. Start Simple: Begin with a basic schema and expand as needed
  2. Validate Assumptions: Cross-check synthetic patterns with domain experts
  3. Mix and Match: Combine synthetic data with small samples of real data when possible
  4. Document Limitations: Be clear about what the synthetic data can and cannot prove
  5. Avoid Over-perfection: Don’t spend too much time making synthetic data perfect
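On the “Validate Assumptions” point, one cheap sanity check catches a common failure mode: at temperature 0, the LLM can return near-identical samples. A minimal sketch of a diversity check (the field name is just the one from the ticket example):

```python
def uniqueness_ratio(records: list[dict], key: str) -> float:
    """Fraction of distinct values for a given field across records.
    Values near 0 mean the generator is repeating itself."""
    if not records:
        return 0.0
    values = [r.get(key, "") for r in records]
    return len(set(values)) / len(values)

# e.g. uniqueness_ratio(tickets, "issue_summary") < 0.5 suggests
# bumping the temperature or varying the prompt.
```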

Final Thoughts

Synthetic data isn’t just a stopgap — it’s a powerful tool that can transform your development workflow. While it shouldn’t permanently replace real data, it can:

  • Get your project moving when real data isn’t available
  • Help stakeholders visualize the end product
  • Validate technical approaches early
  • Speed up the prototype-to-production pipeline

The next time you’re stuck waiting for data, remember: you can probably generate what you need in the next hour.
