This is good. It covers the two easiest dominant methods people use. It even touches on my main complaint for the one they seem to recommend.
That said:
- Constrained generation yields a different distribution from what a raw LLM would provide. This can be pathologically bad. My go-to example is LLMs having a preference for including ellipses in long, structured objects. Constrained generation forces closing quotes or whatever it takes to recover from that error according to the schema, so the output parses but is still wrong: the data the ellipsis stood for is missing. Resampling instead repeats until the LLM fully generates the data in question, so the result you accept both adheres to the schema and is complete. It can get much worse than that.
- The unconstrained "method" has a few possible implementations. Increasing context length by complaining about schema errors is almost always worse from an end quality perspective than just retrying till the schema passes. Effective context windows are precious, and current models bias heavily toward earlier data which has been fed into them. In a low-error regime you might get away with a "try it again" response in a single chat, but in a high-error regime you'll get better results at a lower cost by literally re-sending the same prompt till the model doesn't cause errors.
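The second point (re-sending the identical prompt until the output validates, rather than growing the conversation with error complaints) can be sketched roughly as follows. This is a minimal sketch under assumptions: `generate` is a hypothetical stand-in for whatever function calls your model, and the check (parse as JSON, require some keys) is a placeholder for a real schema validator.

```python
import json
from typing import Callable, Optional


def resample_until_valid(
    generate: Callable[[str], str],  # hypothetical: sends one prompt, returns raw text
    prompt: str,
    required_keys: set,
    max_attempts: int = 5,
) -> Optional[dict]:
    """Re-send the identical prompt until the reply parses and has the required keys."""
    for _ in range(max_attempts):
        raw = generate(prompt)  # same prompt every time; no error feedback is appended
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # discard the bad sample and draw a fresh one
        if isinstance(obj, dict) and required_keys.issubset(obj):
            return obj
    return None  # caller decides what to do after max_attempts failures
```

Because every attempt starts from the same short context, the cost per retry stays constant and a failed sample never influences the next one.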
This is a seriously beautiful guide. I really appreciate you putting this together! I especially love the tab-through animations on the various pages, and this is one of the best explanations that I've seen. I generally feel I understand grammar-constrained generation pretty well (I've merged a handful of contributions to the llama.cpp grammar implementation), and yet I still learned some insights from your illustrations -- thank you!
I'm also really glad that you're helping more people understand this feature, how it works, and how to use it effectively. I strongly believe that structured outputs are one of the most underrated features in LLM engines, and people should be using this feature more.
Constrained non-determinism means that we can reliably use LLMs as part of a larger pipeline or process (such as an agent with tool-calling) and we won't have failures due to syntax errors or erroneous "Sure! Here's your output formatted as JSON with no other text or preamble" messages thrown in.
Your LLM output might not be correct. But grammars ensure that your LLM output is at least _syntactically_ correct. It's not everything, but it's not nothing.
And especially if we want to get away from cloud deployments and run effective local models, grammars are an incredibly valuable piece of this. For practical examples, I often think of Jart's example in her simple LLM-based spam-filter running on a Raspberry Pi [0]:
> llamafile -m TinyLlama-1.1B-Chat-v1.0.f16.gguf \
> --grammar 'root ::= "yes" | "no"' --temp 0 -c 0 \
> --no-display-prompt --log-disable -p "<|user|>
> Can you say for certain that the following email is spam? ...
Even though it's a super-tiny piece of hardware, including a grammar that constrains the output to only ever be "yes" or "no" (it's impossible for the system to produce a different result) means she can use a super-small model on super-limited hardware, and it is still useful. It might not correctly identify spam, but it's never going to break for syntactic reasons, which gives a great boost to the usefulness of small, local models.
* [0]: https://justine.lol/matmul/
What does it do when the model wants to return something else, and what's better/worse about doing it in llamafile vs whatever wrapper that's calling it? How do I set retries? What if I want JSON and a range instead?
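There are no retries. The grammar is enforced during sampling inside llama.cpp: at each step, tokens the grammar doesn't accept are excluded, so the model can only ever emit output that matches the grammar.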
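To make "no retries" concrete, here is a toy sketch of one masked decoding step. It is not llama.cpp's implementation: the token scores are made up, and it assumes each answer is a single token, which real tokenizers do not guarantee. Greedy selection stands in for `--temp 0`.

```python
import math


def constrained_greedy_step(logits: dict, allowed: set) -> str:
    """Pick the highest-scoring token among those the grammar currently allows."""
    masked = {
        tok: (score if tok in allowed else -math.inf)  # disallowed tokens can never win
        for tok, score in logits.items()
    }
    return max(masked, key=masked.get)


# Made-up scores: the model "wants" to start with a chatty preamble.
logits = {"Sure": 4.2, "maybe": 2.0, "yes": 1.3, "no": 0.9}
print(constrained_greedy_step(logits, allowed={"yes", "no"}))  # yes
```

The model's preferred token ("Sure") is simply never eligible, so there is nothing to retry; the answer is whichever allowed token scores highest.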
This is a fantastic guide! I did a lot of work on structured generation for my PhD. Here are a few other pointers for people who might be interested:
Some libraries:
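- Outlines, a nice library for structured generation
  - https://github.com/dottxt-ai/outlines
- Guidance (already covered by FlyingLawnmower in this thread), another nice library
  - https://github.com/guidance-ai/guidance
- XGrammar, a less-featureful but really well optimized constrained generation library
  - https://github.com/mlc-ai/xgrammar
  - This one has a lot of cool technical aspects that make it an interesting project
Some papers:
- Efficient Guided Generation for Large Language Models
  - By the outlines authors, probably the first real LLM constrained generation paper
  - https://arxiv.org/abs/2307.09702
- Automata-based constraints for language model decoding
  - A much more technical paper about constrained generation and implementation
  - https://arxiv.org/abs/2407.08103
- Pitfalls, Subtleties, and Techniques in Automata-Based Subword-Level Constrained Generation
  - A bit of self-promotion. We show where constrained generation can go wrong and discuss some techniques for the practitioner
  - https://openreview.net/pdf?id=DFybOGeGDS
Some blog posts:
- Fast, High-Fidelity LLM Decoding with Regex Constraints
  - Discusses adhering to the canonical tokenization (i.e., not just the constraint, but also what would be produced by the tokenizer)
  - https://vivien000.github.io/blog/journal/llm-decoding-with-regex-constraints.html
- Coalescence: making LLM inference 5x faster
  - Also from the outlines team
  - This is about skipping inference during constrained generation if you know there is only one valid token (common in the canonical tokenization setting)
  - https://blog.dottxt.ai/coalescence.html
What a gold mine!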
Automata-based constraints is fun.
Very nicely written guide!
If the authors or readers are interested in some of the more technical details of how we optimized guidance & llguidance, we wrote up a little paper about it here: https://guidance-ai.github.io/llguidance/llg-go-brrr
These are cool tricks but this seems like an impedance mismatch: why would you use an LLM (a probabilistic source of plausible text) in a situation where you want a deterministic source of text where plausibility is not enough?
You... don't. That's exactly what structured outputs are for! You're offloading any formally defined generation to a tool that better serves the case, leaving the ambiguous part of the task to the model.
Code is an example of a mixed case. Getting any mechanistically parsable output from a model is another. Sure, you can format it after the generation, but you still need the generation to be parsable for that. In many cases, using the required format right away will also provide the context for better replies.
This is a nice guide. I especially like the masked decoding diagrams on this page https://nanonets.com/cookbooks/structured-llm-outputs/basic-....
edit: Somehow that link doesn't work... It's the diagram on the "constrained method" page
One of the authors here, will check out the diagram link.
Every commercial model provider is adding structured outputs so will keep updating the guide.
I agree that building agents is basically impossible if you cannot trust the model to output valid json every time. This seems like a decent collection of the current techniques we have to force deterministic structure for production systems.
Are there output formats that are more reliable (better adherence to the schema, easier to get parse-able output) or cheaper (fewer tokens) than JSON? YAML has its own problems and TOML isn't widely adopted, but they both seem like they would be easier to generate.
What have folks tried?
Yes, that's the purpose of TOON.
https://github.com/toon-format/toon
Nice, it would be a good idea to develop a CFG for this as well, so it can be embedded into all these constrained decoding libraries.
Is there evidence that LLMs adhere to this format better than to JSON? I doubt that.
It is 100% guaranteed that they DON'T. Toon is 3 months old, it's not used by anyone, and it's therefore not in the training set of any model.
You should do your own evals specific to your case. In my evals XML outperforms JSON on every model for out of distribution tasks (i.e. not for JSON that was in the data).
Just brainstorming. Human beings have trouble writing json, cause it is too annoying. Too strict. In my experience, for humans writing typescript is a lot better than writing json directly, even when the file is just a json object. It allows comments, it allows things like trailing commas which are better for readability.
So maybe an interesting file to have the LLM generate is, instead of the final file, a program that creates the final file? Now there is the problem of security of course: the program the LLM generates would need to be sandboxed properly, and time constrained to prevent DoS attacks or explosive output sizes, not to mention the CPU usage of the final result. But quality wise, would it be better?
I use regex to force an XML schema and then use a normal XML parser to decode.
XML is better for code, and for code parts in particular I enforce a CDATA section (`<![CDATA[ ... ]]>`) so the LLM is pretty free to do anything there without escaping.
OpenAI API lets you do regex structured output and it's much better than JSON for code.
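A rough sketch of that "regex for the shape, normal XML parser for decoding" idea, using only the Python standard library. The element names (`answer`, `explanation`, `code`) and the sample reply are made up for illustration, and the regex is used here only to check the shape; whether a given constrained-decoding backend accepts this exact regex syntax is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical output shape: plain-text explanation, then code wrapped in CDATA.
SHAPE = re.compile(
    r"<answer><explanation>[^<&]*</explanation>"
    r"<code><!\[CDATA\[.*?\]\]></code></answer>",
    re.DOTALL,
)


def parse_reply(reply: str) -> tuple:
    """Check the reply against the expected shape, then decode it with a real XML parser."""
    if not SHAPE.fullmatch(reply):
        raise ValueError("reply does not match the expected XML shape")
    root = ET.fromstring(reply)
    return root.findtext("explanation"), root.findtext("code")


reply = (
    "<answer><explanation>Checks ordering.</explanation>"
    '<code><![CDATA[if (a < b && b < c) { return "ok"; }]]></code></answer>'
)
explanation, code = parse_reply(reply)
print(code)  # if (a < b && b < c) { return "ok"; }
```

Inside the CDATA section the `<`, `&&`, and double quotes need no escaping, which is the point of the approach: the same snippet inside a JSON string would need its quotes escaped.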
> We use a lenient parser like ast.literal_eval instead of the standard json.loads(). It will handle outputs that deviate from strict JSON format. (single quotes, trailing commas, etc.)
A nitpick: that's probably a good idea and I've used it before, but that's not really a lenient json parser, it's a Python literal parser and they happen to be close enough that it's useful.
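The difference is easy to see in a few lines. This is a small sketch, not the guide's code: `parse_lenient` is a hypothetical helper that tries strict JSON first and only then falls back to Python literal syntax.

```python
import ast
import json


def parse_lenient(text: str):
    """Try strict JSON first, then fall back to Python literal syntax."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return ast.literal_eval(text)  # may raise ValueError or SyntaxError


# Single quotes and trailing commas are not JSON, but they are valid Python literals.
print(parse_lenient("{'a': 1, 'b': [1, 2,],}"))  # {'a': 1, 'b': [1, 2]}

# JSON keywords are not Python literals, so literal_eval alone rejects valid JSON.
try:
    ast.literal_eval('{"ok": true, "value": null}')
except ValueError as exc:
    print("not a Python literal:", type(exc).__name__)  # not a Python literal: ValueError
```

Trying `json.loads` first matters: valid JSON containing `true`, `false`, or `null` would otherwise fail in `ast.literal_eval`.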
This information is really presented well. I subscribed to your newsletter. Thanks!
I like structured outputs as much as the next guy but be careful not to try to structure natural language.
What would be the point of outputting unconstrained json if the output is consumed by a human?
Huge fan of BAML, nice coverage.
BAML
https://nanonets.com/cookbooks/structured-llm-outputs/uncons...