Structured extraction is one of the most valuable capabilities that foundation models unlock, especially now that modern LLMs can reliably produce JSON-formatted output, which was not true until recently.
Extraction is a decision problem...
But if models can reliably produce JSON outputs, what's left to solve? Why even discuss extraction models in the context of analytical work?
Well, if you've ever tried to use LLMs for structured extraction, you may have noticed that it's often more of a decision problem for the model than a pure information retrieval problem.
Let's use an example. Say we want to extract the company name from the following article title.
Apple to partner with OpenAI on new ChatGPT integration, per Bloomberg.
Not so simple: the title names three companies (Apple, OpenAI, and Bloomberg), and our instruction doesn't say which one we want. We'll probably need to be more specific: "Extract the company name from the article. If there are multiple companies listed, exclude any companies that are not the primary subject. If there are still multiple companies, choose the one that is the grammatical subject of the title."
Now consider this article title:
Google, Apple, and Nvidia partner on new self-driving initiative.
Now we've found a new case that breaks our latest policy: all three companies are equally the subject of the title, so the tie-breaker cannot pick one, and we need to refine again. We could go on and on with these examples, but the main point is that each refinement exposes yet another decision problem.
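To make the failure concrete, here is a minimal sketch of the two-rule policy written as plain Python. It makes no LLM calls: the `Mention` class, the `apply_policy` function, and the hand-assigned labels (`is_primary`, `is_grammatical_subject`) are all hypothetical and exist only for illustration. In a real extractor, the model would be making these judgments implicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A company named in a title, with hand-assigned (assumed) labels."""
    name: str
    is_primary: bool              # is the article mainly about this company?
    is_grammatical_subject: bool  # is it the grammatical subject of the title?


def apply_policy(mentions):
    """Apply the two-rule policy and return the names that survive.

    The policy has made a decision only if exactly one name is returned.
    """
    # Rule 1: exclude companies that are not the primary subject.
    candidates = [m for m in mentions if m.is_primary]
    # Rule 2: if several remain, keep only the grammatical subject.
    if len(candidates) > 1:
        candidates = [m for m in candidates if m.is_grammatical_subject]
    return [m.name for m in candidates]


# "Apple to partner with OpenAI on new ChatGPT integration, per Bloomberg."
title_1 = [
    Mention("Apple", is_primary=True, is_grammatical_subject=True),
    Mention("OpenAI", is_primary=True, is_grammatical_subject=False),
    Mention("Bloomberg", is_primary=False, is_grammatical_subject=False),
]

# "Google, Apple, and Nvidia partner on new self-driving initiative."
title_2 = [
    Mention("Google", is_primary=True, is_grammatical_subject=True),
    Mention("Apple", is_primary=True, is_grammatical_subject=True),
    Mention("Nvidia", is_primary=True, is_grammatical_subject=True),
]

print(apply_policy(title_1))  # ['Apple']
print(apply_policy(title_2))  # ['Google', 'Apple', 'Nvidia']
```

The first title resolves to a single company, but the second leaves three candidates. The policy has run out of rules, and someone has to decide what the extractor should do next.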
...and a parsing problem
Many companies handle PDF parsing and similar problems that arise primarily upstream of the decision problem. In the examples above, we feed simple plain text into the LLM, yet the extraction problem is still ambiguous. Good extractor design is mostly about constraining this ambiguity.
For the purposes of this guide, we assume that the input data can at least be understood by the decision model you're creating. We won't cover parsing infrastructure here, but we may be able to recommend vendors or models that handle parsing reliably.