
Go From Idea to LLM Demo in a Day with Synthetic Data

Ever heard this before:

“We love the concept, but can you show us a demo?”

If you’re building LLM-powered products, this question can stop you in your tracks. Not because of technical challenges, but because you’re waiting on data. The good news? There’s a faster way:

Generate synthetic data using LLMs.

In this guide, I’ll show you a quick and simple demo of how to do this with a Python pipeline that generates synthetic test data in minutes using OpenAI and Pydantic. Whether you’re validating a concept, building a prototype, or testing edge cases, this approach can compress weeks of preparation into hours.

The Case for Synthetic Data

Every product manager, stakeholder, or freelance client wants to see a working demo before moving forward. But here’s what typically happens:

  • You spend weeks collecting and cleaning real user data
  • Legal teams take forever to approve data usage
  • Privacy concerns create endless back-and-forth
  • The project loses momentum while waiting for backend systems

Naturally, ML apps, being data-driven, are particularly vulnerable to these delays. Even with LLM products, which tend to be more demo-friendly, you still need relevant data to showcase unique value. As a result, good ideas often die before they can prove themselves.

This is where synthetic data becomes your secret weapon. Instead of waiting weeks or months, you can:

  • Build Faster: Create working prototypes in hours instead of weeks
  • Test Earlier: Validate your ideas before investing in data collection
  • Iterate Quickly: Customize your data on the fly as requirements change
  • Demo Confidently: Show stakeholders real functionality without privacy concerns

Ultimately, you want to validate the need for any further data investment while accelerating development. Remember, you don’t need perfect data to prove an idea — you need data that’s realistic enough to demonstrate value.

Generating Synthetic Data

The problem with synthetic data is that people tend not to take it seriously: few bother to build a reliable, maintainable pipeline for generating it, one with a repeatable, customizable process that follows good software practices.

Below I’ve written a class to help you get started. It’s by no means the be-all and end-all of synthetic data generation pipelines; it’s just one Python class demonstrating the task, but hopefully it will motivate you to build your own pipeline for your use case. Here’s what this one does:

  • Takes a Pydantic schema to ensure data consistency
  • Uses custom prompts to guide the LLM’s output
  • Validates everything automatically
  • Saves data in .jsonl format for easy pipeline integration
  • Is customizable for various types of structured data

Let’s dive into the code and see how it works.

from typing import List, Type, Dict
from pydantic import BaseModel, ValidationError
from openai import OpenAI
import time
import json
import os
from dotenv import load_dotenv

load_dotenv()

class SyntheticDataGenerator:
    def __init__(
            self,
            schema: Type[BaseModel],
            prompt_template: str,
            output_file: str = "synthetic_data.jsonl",
            n_samples: int = 100,
            model: str = "gpt-3.5-turbo-1106",  # Updated to a model that supports JSON mode
            max_retries: int = 3,
            delay: float = 1.0,
            temperature: float = 0.0
    ):
        self.schema = schema
        self.prompt_template = prompt_template
        self.output_file = output_file
        self.n_samples = n_samples
        self.model = model
        self.max_retries = max_retries
        self.delay = delay
        self.temperature = temperature
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    def _call_openai(self, prompt: str) -> str:
        for attempt in range(self.max_retries):
            try:
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "system", "content": "You are a helpful assistant that generates support ticket data in JSON format."},
                        {"role": "user", "content": prompt}
                    ],
                    temperature=self.temperature,
                    response_format={"type": "json_object"}  # Force JSON output
                )
                content = response.choices[0].message.content
                print(f"Raw API response: {content}")  # Debug print
                return content
            except Exception as e:
                print(f"OpenAI API error (attempt {attempt + 1}/{self.max_retries}): {e}")
                if attempt < self.max_retries - 1:
                    time.sleep(self.delay)
        return ""

    def _generate_sample(self) -> Dict:
        raw_output = self._call_openai(self.prompt_template)
        try:
            data = json.loads(raw_output)
            sample = self.schema(**data)
            return sample.model_dump()  # use sample.dict() if you're on Pydantic v1
        except (json.JSONDecodeError, ValidationError) as e:
            print(f"Parsing error: {e}")
            return {}

    def _write_to_file(self, data: List[Dict]) -> None:
        with open(self.output_file, "w") as f:
            for item in data:
                if item:
                    f.write(json.dumps(item) + "\n")

    def run(self) -> None:
        print(f"Generating {self.n_samples} samples...")
        data = []
        for _ in range(self.n_samples):
            sample = self._generate_sample()
            if sample:
                data.append(sample)
            time.sleep(self.delay)
        self._write_to_file(data)
        print(f"Saved {len(data)} samples to {self.output_file}")

Example

Here’s an example to show how this works. You simply need to define your schema and prompt, then pass them into the class initializer.

class SupportTicket(BaseModel):
    customer_name: str
    issue_summary: str
    issue_detail: str
    priority: str

support_ticket_prompt = """
Generate a synthetic customer support ticket in JSON format with the following fields:
- customer_name
- issue_summary
- issue_detail
- priority (low, medium, high)

Example:
{
  "customer_name": "Alice Johnson",
  "issue_summary": "App crash on launch",
  "issue_detail": "The mobile app crashes immediately after tapping the icon on Android 14.",
  "priority": "high"
}

The response must be a valid JSON object with exactly these fields and no additional content.
Now generate one new ticket:
"""
generator = SyntheticDataGenerator(
    schema=SupportTicket,
    prompt_template=support_ticket_prompt,
    output_file="support_tickets.jsonl",
    n_samples=50
)

generator.run()

You can pass in your own output file name too, but to show you explicitly, running the code block above will give you a .jsonl file with samples that look like this:

{"customer_name": "Bob Smith", "issue_summary": "Unable to login", "issue_detail": "I am unable to login to my account using my username and password.", "priority": "medium"}
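Once the file is written, reading it back for downstream use is straightforward. Here’s a minimal sketch of a loader (the file name and `priority` field are just the ones from the example above):

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per line, skipping any blank lines."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# e.g. filter the generated tickets:
# tickets = load_jsonl("support_tickets.jsonl")
# high_priority = [t for t in tickets if t["priority"] == "high"]
```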

Real-World Applications

Here are some more considerations on using this in practice:

1. Rapid Prototyping

The main point of using synthetic data is probably getting your own custom datasets for whatever you’re building. This can be especially useful if you want to test the feasibility of an idea where you know quality real-world data exists, but it’s too expensive, has nasty rate limits, or is private data you can’t get immediate access to.

2. Minority Classes

Another place where this could be worth using is when training a model on a dataset with a minority class, where you don’t see model performance improving without more data for that class specifically. Classification problems are usually made interesting by their minority classes, which naturally tend to be harder to obtain data for, so augmenting your dataset with synthetic data here could prove really handy.
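As a sketch of what that balancing step might look like, here’s a small helper that tops up the minority class with synthetic records until it matches the largest class. The `label` field names are hypothetical; the synthetic records would come from a generator like the one above:

```python
from collections import Counter

def augment_minority(real: list[dict], synthetic: list[dict],
                     label_key: str, minority_label: str) -> list[dict]:
    """Add synthetic minority-class records until that class matches
    the size of the largest class (or synthetic samples run out)."""
    counts = Counter(r[label_key] for r in real)
    target = max(counts.values())
    deficit = target - counts[minority_label]
    extras = [s for s in synthetic if s[label_key] == minority_label]
    return real + extras[:max(deficit, 0)]
```

This keeps every real record and only uses as many synthetic ones as needed, so the real data still dominates wherever it can.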

3. Fine-tuning an SLM

This is probably one of the more popular use cases: fine-tuning a smaller language model (SLM) via knowledge distillation. If you’re using an LLM extensively but have a specific task that is fairly simple, it might be worth fine-tuning an SLM to save costs and inference time.
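As one hedged sketch of the prep step, here’s how you might convert the generated tickets into the chat-style training format that OpenAI’s fine-tuning API expects, treating priority prediction as the distilled task. The system instruction and the way the fields are split into turns are assumptions for illustration:

```python
def to_finetune_format(records: list[dict]) -> list[dict]:
    """Turn each synthetic ticket into one chat-format training
    example: the issue text is the user turn, priority is the target."""
    examples = []
    for r in records:
        examples.append({
            "messages": [
                {"role": "system",
                 "content": "Classify the support ticket priority as low, medium, or high."},
                {"role": "user",
                 "content": f"{r['issue_summary']}: {r['issue_detail']}"},
                {"role": "assistant", "content": r["priority"]},
            ]
        })
    return examples
```

Each example would then be written out one-per-line as .jsonl, just like the generator output, before being uploaded as a training file.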

Best Practices and Tips

When using synthetic data, keep these points in mind:

  1. Start Simple: Begin with a basic schema and expand as needed
  2. Validate Assumptions: Cross-check synthetic patterns with domain experts
  3. Mix and Match: Combine synthetic data with small samples of real data when possible
  4. Document Limitations: Be clear about what the synthetic data can and cannot prove
  5. Avoid Over-perfection: Don’t spend too much time making synthetic data perfect
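On the “Validate Assumptions” point, one cheap sanity check catches a common failure mode: at temperature 0, the LLM can return near-identical samples. A minimal sketch of a diversity check (the field name is just the one from the ticket example):

```python
def uniqueness_ratio(records: list[dict], key: str) -> float:
    """Fraction of distinct values for a given field across records.
    Values near 0 mean the generator is repeating itself."""
    if not records:
        return 0.0
    values = [r.get(key, "") for r in records]
    return len(set(values)) / len(values)

# e.g. uniqueness_ratio(tickets, "issue_summary") < 0.5 suggests
# bumping the temperature or varying the prompt.
```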

Final Thoughts

Synthetic data isn’t just a stopgap — it’s a powerful tool that can transform your development workflow. While it shouldn’t permanently replace real data, it can:

  • Get your project moving when real data isn’t available
  • Help stakeholders visualize the end product
  • Validate technical approaches early
  • Speed up the prototype-to-production pipeline

The next time you’re stuck waiting for data, remember: you can probably generate what you need in the next hour.
