Generate Synthetic Data¶
In this cookbook recipe, we go over how to generate synthetic data for LLMs, in this case, OpenAI’s gpt-4o-mini
. When using LLMs to synthetically generate data, it is most useful to generate non-numerical data which isn’t strictly dependent on a defined probability distribution - in those cases, it will be far easier to define a distribution and generate these points directly from the distribution.
However, for:
- data that needs general intelligence to be realistic
- data that lists many items within a broad category
- data which is language related
and more, LLMs are far easier to use and yield better (or the only feasible) results.
Mirascope Concepts Used
Background
Large Language Models (LLMs) have emerged as powerful tools for generating synthetic data, particularly for text-based applications. Compared to traditional synthetic data generation methods, LLMs can produce more diverse, contextually rich, and human-like textual data, often with less need for domain-specific rules or statistical modeling.
Generate Data as CSV¶
To generate realistic, synthetic data as a csv, you can accomplish this in a single call by requesting a csv format in the prompt and describing the kind of data you would like generated.
from mirascope.core import openai, prompt_template
@openai.call(model="gpt-4o-mini")
@prompt_template(
"""
Generate {num_datapoints} random but realistic datapoints of items which could
be in a home appliances warehouse. Output the datapoints as a csv, and your
response should only be the CSV.
Format:
Name, Price, Inventory
Name - the name of the home appliance product
Price - the price of an individual product, in dollars (include cents)
Inventory - how many are left in stock
"""
)
def generate_csv_data(num_datapoints: int): ...
print(generate_csv_data(5))
# > ```csv
# Name,Price,Inventory
# Stainless Steel Microwave,149.99,25
# Cordless Vacuum Cleaner,199.99,15
# 4-Slice Toaster,39.99,50
# Energy Star Dishwasher,699.00,10
# Smart Air Purifier,249.50,8
# ```
Note that the prices and inventory of each item are somewhat realistic for their corresponding item, something which would be otherwise difficult to accomplish.
Generate Data with response_model
¶
Sometimes, it will be easier to integrate your datapoints into your code if they are defined as some schema, namely a Pydantic BaseModel
. In this case, describe each column as the description
of a Field
in the BaseModel
instead of the prompt, and set response_model
to your defined schema:
from pydantic import BaseModel, Field
class HomeAppliance(BaseModel):
name: str = Field(description="The name of the home appliance product")
price: float = Field(
description="The price of an individual product, in dollars (include cents)"
)
inventory: int = Field(description="How many of the items are left in stock")
@openai.call(model="gpt-4o-mini", response_model=list[HomeAppliance])
@prompt_template(
"""
Generate {num_datapoints} random but realistic datapoints of items which could
be in a home appliances warehouse. Output the datapoints as a list of instances
of HomeAppliance.
"""
)
def generate_home_appliance_data(num_datapoints: int): ...
print(generate_home_appliance_data(5))
# > name='Refrigerator' price=799.99 inventory=15
# name='Microwave Oven' price=129.99 inventory=30
# name='Washing Machine' price=499.99 inventory=10
# name='Air Conditioner' price=349.99 inventory=5
# name='Dishwasher' price=599.99 inventory=8
Generate Data into a pandas DataFrame
¶
Since pandas is a common library for working with data, it’s also worth knowing how to directly create and append to a dataframe with LLMs.
Create a New DataFrame
¶
To create a new DataFrame
, we define a BaseModel
schema with a simple function to generate DataFrame
via a list of list of data and the column names:
from typing import Any
import pandas as pd
from pydantic import BaseModel, Field
from mirascope.core import openai
class DataFrameGenerator(BaseModel):
data: list[list[Any]] = Field(
description="the data to be inserted into the dataframe"
)
column_names: list[str] = Field(description="The names of the columns in data")
def generate_dataframe(self) -> pd.DataFrame:
return pd.DataFrame(
{name: column for name, column in zip(self.column_names, self.data)}
)
@openai.call(model="gpt-4o-mini", response_model=DataFrameGenerator)
@prompt_template(
"""
Generate {num_datapoints} random but realistic datapoints of items which could
be in a home appliances warehouse. Generate your response as `data` and
`column_names`, so that a pandas DataFrame may be generated with:
`pd.DataFrame(data, columns=column_names)`.
Format:
Name, Price, Inventory
Name - the name of the home appliance product
Price - the price of an individual product, in dollars (include cents)
Inventory - how many are left in stock
"""
)
def generate_df_data(num_datapoints: int): ...
df_data = generate_df_data(5)
df = df_data.generate_dataframe()
print(df)
# > Name Price Inventory
# 0 Microwave Oven 79.99 25
# 1 Blender 49.95 18
# 2 Air Fryer 129.99 30
# 3 Toaster 29.99 50
# 4 Coffee Maker 59.99 15
Appending to a DataFrame
¶
To append to a DataFrame
, we can modify the prompt so that instead of describing the data we want to generate, we ask the LLM to match the type of data it already sees. Furthermore, we add a append_dataframe()
function to append to an existing DataFrame
. Finally, note that we use the generated df
from above as the DataFrame
to append to in the following example:
class DataFrameGenerator(BaseModel):
...
def append_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
return pd.concat([df, self.generate_dataframe()], ignore_index=True)
@openai.call(model="gpt-4o-mini", response_model=DataFrameGenerator)
@prompt_template(
"""
Generate {num_datapoints} random but realistic datapoints of items which would
make sense to the following dataset:
{df}
Generate your response as `data` and
`column_names`, so that a pandas DataFrame may be generated with:
`pd.DataFrame(data, columns=column_names)` then appended to the existing data.
"""
)
def generate_additional_df_data(num_datapoints: int, df: pd.DataFrame): ...
df_data = generate_additional_df_data(5, df)
df = df_data.append_dataframe(df)
print(df)
# > Name Price Inventory
# 0 Microwave Oven 79.99 25
# 1 Blender 49.95 18
# 2 Air Fryer 129.99 30
# 3 Toaster 29.99 50
# 4 Coffee Maker 59.99 15
# 5 Electric Kettle 39.99 20
# 6 Food Processor 89.99 12
# 7 Rice Cooker 59.99 27
# 8 Slow Cooker 49.99 22
# 9 Electric Griddle 69.99 15
Adding Constraints¶
While you cannot successfully add complex mathematical constraints to generated data (think statistics, such as distributions and covariances), asking LLMs to abide by basic constraints will (generally) prove successful, especially with newer models. Let’s look at an example where we generate TVs where the TV price should roughly linearly correlate with TV size, and QLEDs are 2-3x more expensive than OLEDs of the same size:
from typing import Literal
class TV(BaseModel):
size: int = Field(description="The size of the TV")
price: float = Field(description="The price of the TV in dollars (include cents)")
tv_type: Literal["OLED", "QLED"]
@openai.call(model="gpt-4o-mini", response_model=list[TV])
@prompt_template(
"""
Generate {num_datapoints} random but realistic datapoints of TVs.
Output the datapoints as a list of instances of TV.
Make sure to abide by the following constraints:
QLEDS should be roughly (not exactly) 2x the price of an OLED of the same size
for both OLEDs and QLEDS, price should increase roughly proportionately to size
"""
)
def generate_tv_data(num_datapoints: int): ...
for tv in generate_tv_data(10):
print(tv)
# > size=32 price=399.99 tv_type='OLED'
# size=32 price=799.99 tv_type='QLED'
# size=42 price=599.99 tv_type='OLED'
# size=42 price=1199.99 tv_type='QLED'
# size=50 price=699.99 tv_type='OLED'
# size=50 price=1399.99 tv_type='QLED'
# size=55 price=899.99 tv_type='OLED'
# size=55 price=1799.99 tv_type='QLED'
# size=65 price=1099.99 tv_type='OLED'
# size=65 price=2199.99 tv_type='QLED'
To demonstrate the constraints’ being followed, you can graph the data using matplotlib, which shows the linear relationships between size and price, and QLEDs costing roughly twice as much as OLED:
Additional Real-World Examples
- Healthcare and Medical Research: Generating synthetic patient records for training machine learning models without compromising patient privacy
- Environmental Science: Generating synthetic climate data for modeling long-term environmental changes
- Fraud Detection Systems: Generating synthetic data of fraudulent and legitimate transactions for training fraud detection models.
When adapting this recipe to your specific use-case, consider the following:
- Add Pydantic AfterValidators
to constrain your synthetic data generation
- Verify that the synthetic data generated actually matches real-world data.
- Make sure no biases are present in the generated data, this can be prompt engineered.
- Experiment with different model providers and versions for quality.