Extracting structured information with LLMs¶

Large Language Models (LLMs) are powerful at generating human-like text, but their outputs are inherently unstructured. Many real-world applications require structured data to function properly, such as extracting due dates, priorities, and task descriptions from user inputs for a task management application, or extracting tabular data from unstructured text sources for data analysis pipelines.

Mirascope provides tools and techniques to address this challenge, allowing you to extract structured information from LLM outputs reliably.

Challenges in Extracting Structured Information¶

The key challenges in extracting structured information from LLMs include:

Unstructured Outputs: LLMs are trained on vast amounts of unstructured text data, causing their outputs to be unstructured as well.
Hallucinations and Inaccuracies: LLMs can sometimes generate factually incorrect information, complicating the extraction of accurate structured data.

Defining and Extracting Schemas with Mirascope¶

Mirascope's extraction functionality is built on top of Pydantic. We will walk through the high-level concepts you need to know to get started extracting structured information with LLMs, but we recommend reading their docs for more detailed explanations of everything that you can do with Pydantic.

Mirascope offers a convenient extract method on extractor classes to extract structured information from LLM outputs. This method leverages tools (function calling) to reliably extract the required structured data.

First, let's take a look at a simple example where we want to extrac task details like due date, priority, and description from a user's natural language input:

from typing import Literal

from mirascope.openai import OpenAIExtractor
from pydantic import BaseModel


class TaskDetails(BaseModel):
    due_date: str
    priority: Literal["low", "normal", "high"]
    description: str


class TaskExtractor(OpenAIExtractor[TaskDetails]):
    extract_schema: Type[TaskDetails] = TaskDetails
    prompt_template = """
    Extract the task details from the following task:
    {task}
    """

    task: str


task = "Submit quarterly report by next Friday. Task is high priority."
task_details = TaskExtractor(task=task).extract()
assert isinstance(task_details, TaskDetails)
print(TaskDetails)
#> due_date='next Friday' priority='high' description='Submit quarterly report'

Let's dive a little deeper into what we're doing here.

Model¶

Defining the schema for extraction is done via models, which are classes that inherit from pydantic.BaseModel. We can then define an extractor dependent on this schema and use it to extract the schema:

from typing import Type

from mirascope.openai import OpenAIExtractor
from pydantic import BaseModel


class Book(BaseModel):
    title: str
    author: str


class BookExtractor(OpenAIExtractor[Book]):
    extract_schema: Type[Book] = Book
    prompt_template = "The Name of the Wind by Patrick Rothfuss."


book = BookExtractor().extract()
assert isinstance(book, Book)
print(book)
#> title='The Name of the Wind' author='Patrick Rothfuss'

You can use tool classes like OpenAITool directly if you want to extract a single tool instead of just a schema (which is useful for calling attached functions).

Field¶

You can also use pydantic.Fields to add additional information for each field in your schema. Again, this information will be included in the prompt, and we can take advantage of that:

from typing import Type

from mirascope.openai import OpenAIPrompt
from pydantic import BaseModel, Field


class Book(BaseModel):
    title: str
    author: str = Field(..., description="Last, First")


class BookExtractor(OpenAIExtractor[Book]):
    extract_schema: Type[Book] = Book
    prompt_template = "The Name of the Wind by Patrick Rothfuss."


book = BookExtractor().extract()
assert isinstance(book, Book)
print(book)
#> title='The Name of the Wind' author='Rothfuss, Patrick'

Notice how instead of “Patrick Rothfuss” the extracted author is “Rothfuss, Patrick” as desired.

Extracting base types¶

Mirascope also makes it possible to extract base types without defining a pydantic.BaseModel with the same exact format for extraction:

from mirascope.openai import OpenAIExtractor


class BookRecommender(OpenAIExtractor[list[str]]):
    extract_schema: Type[list[str]] = list[str]
    prompt_template = "Please recommend some science fiction books."


books = BookRecommendation().extract()
print(books)
#> ['Dune', 'Neuromancer', "Ender's Game", "The Hitchhiker's Guide to the Galaxy", 'Foundation', 'Snow Crash']

We currently support: str, int, float, bool, list, set, tuple, and Enum.

We also support using Union, Literal, and Annotated

Note

If you’re using mypy you’ll need to add # type: ignore due to how these types are handled differently by Python.

Using `Enum` or `Literal` for classification¶

One nice feature of extracting base types is that we can easily use Enum or Literal to define a set of labels that the model should use to classify the prompt. For example, let’s classify whether or not some email text is spam:

from enum import Enum
# from typing import Literal

from mirascope.openai import OpenAIExtractor

# Label = Literal["is spam", "is not spam"]


class Label(Enum):
    NOT_SPAM = "not_spam"
    SPAM = "spam"


class NotSpam(OpenAIExtractor[Label]):
    extract_schema: Type[Label] = Label
    prompt_template = "Your car insurance payment has been processed. Thank you for your business."


class Spam(OpenAIExtractor[Label]):
    extract_schema: Type[Label] = Label
    prompt_template = "I can make you $1000 in just an hour. Interested?"


# assert NotSpam().extract() == "is not spam"
# assert Spam().extract() == "is spam"
assert NotSpam().extract() == Label.NOT_SPAM
assert Spam().extract() == Label.SPAM

Validation¶

When extracting structured information from LLMs, it’s important that we validate the extracted information, especially the types. We want to make sure that if we’re looking for an integer that we actual get an int back. One of the primary benefits of building on top of Pydantic is that it makes validation easy — in fact, we get validation on types out-of-the-box.

We recommend you check out their thorough documentation for detailed information on everything you can do with their validators. This document will be brief and specifically related to LLM extraction to avoid duplication.

Validating Types¶

When we extract information — for base types, BaseModel, or any of our tools — everything is powered by Pydantic. This means that we automatically get type validation and can handle it gracefully:

from mirascope.openai import OpenAIExtractor
from pydantic import BaseModel, ValidationError


class Book(BaseModel):
    title: str
    price: float


class BookRecommender(OpenAIExtractor[Book]):
    extract_schema: type[Book] = Book
    prompt_template = "Please recommend a book."


try:
    book = BookRecommender().extract()
    assert isinstance(book, Book)
    print(book)
    #> title='The Alchemist' price=12.99
except ValidationError as e:
    print(e)
  #> 1 validation error for Book
  #  price
  #    Input should be a valid number, unable to parse string as a number [type=float_parsing, input_value='standard', input_type=str]
  #      For further information visit https://errors.pydantic.dev/2.6/v/float_parsing

Now we can proceed with our extracted information knowing that it will behave as the expected type.

Custom Validation¶

It’s often useful to write custom validation when working with LLMs so that we can automatically handle things that are difficult to hard-code. For example, consider determining whether the generated content adheres to your company’s guidelines. It’s a difficult task to determine this, but an LLM is well-suited to do the task well.

We can use an LLM to make the determination by adding an AfterValidator to our extracted output:

from enum import Enum
from typing import Annotated

from mirascope.openai import OpenAIExtractor
from pydantic import AfterValidator, BaseModel, ValidationError


class Label(Enum):
    HAPPY = "happy story"
    SAD = "sad story"


class Sentiment(OpenAIExtractor[Label]):
    extract_schema: type[Label] = Label
    prompt_template = "Is the following happy or sad? {text}."

    text: str


def validate_happy(story: str) -> str:
    """Check if the content follows the guidelines."""
    label = Sentiment(text=story).extract()
    assert label == Label.HAPPY, "Story wasn't happy."
    return story


class HappyStory(BaseModel):
    story: Annotated[str, AfterValidator(validate_happy)]


class StoryTeller(OpenAIExtractor[HappyStory]):
    extract_schema: type[HappyStory] = HappyStory
    prompt_template = "Please tell me a story that's really sad."


try:
    story = StoryTeller().extract()
except ValidationError as e:
    print(e)
    # > 1 validation error for HappyStoryTool
    #   story
    #     Assertion failed, Story wasn't happy. [type=assertion_error, input_value="Once upon a time, there ...er every waking moment.", input_type=str]
    #       For further information visit https://errors.pydantic.dev/2.6/v/assertion_error

Streaming Tools and Structured Outputs¶

When using tools (function calling) or extracting structured information, there are many instances in which you will want to stream the results. For example, consider making a call to an LLM that responds with multiple tool calls. Your system can have more real-time behavior if you can call each tool as it's returned instead of having to wait for all of them to be generated at once. Another example would be when returning structured information to a UI. Streaming the information enables real-time generative UI that can be generated as the fields are streamed.

Note

Currently streaming tools is only supported for OpenAI and Anthropic. We will aim to add support for other model providers when available in their APIs.

Streaming Tools (Function Calling)¶

To stream tools, first call stream instead of call for an LLM call with tools. Then use the matching provider's tool stream class to stream the tools:

import os

from mirascope.openai import OpenAICall, OpenAICallParams, OpenAIToolStream

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"


def print_book(title: str, author: str, description: str):
    """Prints the title and author of a book."""
    return f"Title: {title}\nAuthor: {author}\nDescription: {description}"


class BookRecommender(OpenAICall):
    prompt_template = "Please recommend some books to read."

    call_params = OpenAICallParams(tools=[print_book])


stream = BookRecommender().stream()
tool_stream = OpenAIToolStream.from_stream(stream)
for tool in tool_stream:
    tool.fn(**tool.args)
#> Title: The Name of the Wind\nAuthor: Patrick Rothfuss\nDescription: ...
#> Title: Dune\nAuthor: Frank Herbert\nDescription: ...
#> ...

Streaming Partial Tools¶

Sometimes you may want to stream partial tools as well (i.e. the unfinished tool call with None for arguments that haven't yet been streamed). This can be useful for example when observing an agent's flow in real-time. You can simple set allow_partial=True to access this feature. In the following code example, we stream each partial tool and update a live console, printing each full tool call before moving on to the next:

import os
import time

from rich.live import Live

from mirascope.openai import OpenAICall, OpenAICallParams, OpenAIToolStream

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"


def print_book(title: str, author: str, description: str):
    """Prints the title and author of a book."""
    return f"Title: {title}\nAuthor: {author}\nDescription: {description}"


class BookRecommender(OpenAICall):
    prompt_template = "Please recommend some books to read."

    call_params = OpenAICallParams(tools=[print_book])


stream = BookRecommender().stream()
tool_stream = OpenAIToolStream.from_stream(stream, allow_partial=True)

with Live("", refresh_per_second=15) as live:
    partial_tools, index = [None], 0
    previous_tool = None
    for partial_tool in tool_stream:
        if partial_tool is None:
            index += 1
            partial_tools.append(None)
            continue
        partial_tools[index] = partial_tool
        live.update(
            "\n-----------------------------\n".join(
                [pt.fn(**pt.args) for pt in partial_tools]
            )
        )
        time.sleep(0.1)

Streaming Pydantic Models¶

You can also stream structured outputs when using an extractor. Simply call the stream function to stream partial outputs:

import os
from typing import Literal

from mirascope.openai import OpenAIExtractor
from pydantic import BaseModel

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"


class TaskDetails(BaseModel):
    title: str
    priority: Literal["low", "normal", "high"]
    due_date: str


class TaskExtractor(OpenAIExtractor[TaskDetails]):
    extract_schema: type[TaskDetails] = TaskDetails

    prompt_template = """
    Please extract the task details:
    {task}
    """

    task: str


task_description = "Submit quarterly report by next Friday. Task is high priority."
stream = TaskExtractor(task=task_description).stream()
for partial_model in stream:
    print(partial_model)
#> title='Submit quarterly report' priority=None due_date=None
#> title='Submit quarterly report' priority='high' due_date=None
#> title='Submit quarterly report' priority='high' due_date='next Friday'

Generating Synthetic Data¶

In the above examples, we’re extracting information present in the prompt text into structured form. We can also use extract to generate structured information from a prompt:

from mirascope.openai import OpenAIPrompt
from pydantic import BaseModel


class Book(BaseModel):
    """A science fiction book."""

    title: str
    author: str


class BookRecommender(OpenAIPrompt[Book]):
    extract_schema: type[Book] = Book
    prompt_template = "Please recommend a book."

book = BookRecommender().extract()
assert isinstance(book, Book)
print(book)
#> title='Dune' author='Frank Herbert'

Notice that the docstring for the Book schema specified a science fiction book, which resulted in the model recommending a science fiction book. The docstring gets included with the prompt as part of the schema definition, and you can use this to your advantage for better prompting.