In May 2023, a federal judge in the Southern District of New York issued an order to show cause over a brief that cited judicial opinions the court could not locate. Six of the cited cases did not exist. They had been generated by ChatGPT, in convincing detail, complete with quoted holdings and Westlaw-format citations. The case is Mata v. Avianca. Read it before you use an LLM in a brief.

What an LLM actually is

Strip away the hype and an LLM is doing something narrowly defined: given a sequence of text, it predicts what the next chunk of text is most likely to be. It learned this by reading a substantial fraction of the public internet, plus enormous quantities of books and other text, and adjusting its internal numbers until it could complete sequences with high accuracy.
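To make "predict the next chunk" concrete, here is a deliberately tiny sketch: a word-pair counter trained on four invented sentences. It is not how a real LLM is built (those use neural networks over billions of documents), and the corpus and function names are made up for illustration, but the core move is the same: pick the continuation that was most common in the training text.

```python
from collections import Counter, defaultdict

# Invented training text. A real model sees vastly more than this.
CORPUS = (
    "the court granted the motion . "
    "the court denied the motion . "
    "the court granted the petition . "
    "the court granted the motion to dismiss ."
)

def train(text):
    """For every word, count which words follow it and how often."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if never seen."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]

model = train(CORPUS)
print(predict_next(model, "court"))    # "granted" (seen 3 times vs. 1 for "denied")
print(predict_next(model, "statute"))  # None: this toy has never seen the word
```

Note that the toy says "granted" because that is what usually followed "court" in its training text, not because any motion was actually granted. Unlike the toy, a real LLM almost never returns "nothing"; it produces a plausible continuation regardless.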

That's it. There's no rule book inside it, no list of facts, no logic engine. The model is a giant pattern-completer that has internalized so many patterns that it can carry on a conversation, summarize a document, draft a contract clause, or write working code — by completing the patterns it has seen of those things being done before.

The strange thing — the thing that surprised everyone, including the people building these systems — is that pure pattern-completion at sufficient scale produces something that looks an awful lot like reasoning. You can ask an LLM to think through a problem step by step and it will produce something genuinely useful. But it is not reasoning the way a person reasons. It is reasoning the way a person who has read every reasoning chain ever written down would imitate reasoning. Sometimes those are the same thing. Sometimes they aren't.

Why hallucinations happen

A hallucination is an LLM producing a confident statement of fact that is not true. The Avianca case is the canonical example: the model generated case names, citations, and quoted holdings that were structurally perfect — they looked exactly like real legal citations — but the underlying cases did not exist.

The model does this because it isn't checking against a list of real cases. It's completing the pattern of "what would a brief in this situation look like?" A real brief in this situation would cite four or five cases on point. The model produces four or five cases on point, with names that sound like real names and citation formats that look like real formats, because that's what the pattern requires. The model has no "is this a real case?" check. It has a "does this look like a real case?" check, and it passes that check easily.
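The gap between "looks like a real case" and "is a real case" can be sketched in a few lines. The pattern below is a simplified assumption covering only a handful of reporter formats, and VERIFIED is a stand-in for a lookup against an authoritative source such as a citator; neither reflects how any particular product works.

```python
import re

# Simplified shape of a reporter citation, e.g. "347 U.S. 483".
# Real citation formats are far more varied; this is illustrative only.
CITATION_SHAPE = re.compile(
    r"\d{1,4} (U\.S\.|F\.(2d|3d|4th)|F\. Supp\.( 2d| 3d)?) \d{1,5}"
)

# Stand-in for an authoritative lookup. In practice this would be a
# citator or court database, not a hard-coded set.
VERIFIED = {"347 U.S. 483", "5 U.S. 137"}

def looks_like_citation(citation):
    """Format check only: does the string have the shape of a citation?"""
    return CITATION_SHAPE.fullmatch(citation) is not None

def is_verified(citation):
    """Existence check: is the citation in the authoritative source?"""
    return citation in VERIFIED

invented = "999 F.3d 9999"  # made up for this example
print(looks_like_citation(invented), is_verified(invented))  # True False
```

The model's output is effectively optimized for the first function. Only your review process supplies the second.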

This is not a bug that will be fully patched away. It is a fundamental property of how these systems work. Modern LLMs hallucinate dramatically less than they did two years ago, but they still hallucinate, and the way they hallucinate is harder to spot now precisely because the rest of the output is more accurate. You should plan for hallucinations the way you plan for typos: they're rare, they're inevitable, and your review process has to catch them.

What LLMs are great at, and what they aren't

Great at

  • Fluent prose. First drafts of letters, summaries, explanations.
  • Structure. Reorganizing a messy interview into clean intake fields. Outlining a complex argument.
  • Translation. Plain English to legalese, legalese to plain English. Across human languages too.
  • Pattern recognition in text. Finding all the dates in a document. Spotting inconsistencies between two documents.
  • Code that follows established patterns — exactly the kind of code most legal-operations tools require.

Bad at

  • Citations and specific facts. Verify every quote, every case name, every statute number, every date — every time.
  • Anything time-sensitive that postdates its training. Don't ask it about last week's bar opinion.
  • Doing arithmetic reliably. Sometimes it gets the right answer by pattern, sometimes by luck, and sometimes it is embarrassingly wrong.
  • Knowing what it doesn't know. The model will cheerfully invent something rather than say "I'm not sure."
  • Holding genuinely novel reasoning chains together over many steps without drift.
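Part of the "verify every quote" rule can be mechanized. The sketch below, with invented sample text and hypothetical function names, pulls every double-quoted passage out of a draft and flags any that does not appear verbatim in the source you supplied. It handles straight quotation marks only and ignores differences in case and whitespace; it supplements reading the source, it does not replace it.

```python
import re

def normalize(text):
    """Collapse whitespace and lowercase so line breaks don't hide a match."""
    return " ".join(text.split()).lower()

def unverified_quotes(draft, source):
    """Return quoted passages in `draft` that are not found in `source`."""
    quotes = re.findall(r'"([^"]+)"', draft)
    haystack = normalize(source)
    return [q for q in quotes if normalize(q) not in haystack]

# Invented sample text, for illustration only.
source_text = (
    "The carrier owed a duty of reasonable care to the passenger.\n"
    "That duty was not met."
)
draft_text = (
    'The opinion states that the carrier "owed a duty of reasonable care" '
    'and that liability was "strict and absolute" in such cases.'
)

print(unverified_quotes(draft_text, source_text))  # ['strict and absolute']
```

An empty result means only that the quoted words exist in the source, not that they are characterized fairly.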

How BuildLegal handles this

Tools you build here use LLMs in tightly scoped ways: extracting fields from intake submissions, drafting letters from templates you've defined, summarizing documents you provide. The tool isn't asking the LLM "what's the law?" It's asking the LLM to "reformat this user's response into our intake schema." That's a use case where hallucination risk is much lower because the model isn't being asked for facts it would have to invent.
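One way to keep such a task tightly scoped is to refuse anything that does not fit the schema. The following is a minimal sketch, not BuildLegal's actual implementation: the field names, the INTAKE_SCHEMA dictionary, and parse_intake are all hypothetical, and it assumes the model was asked to reply with a JSON object.

```python
import json

# Hypothetical intake schema: field name -> expected Python type.
INTAKE_SCHEMA = {
    "client_name": str,
    "incident_date": str,
    "description": str,
}

def parse_intake(model_output):
    """Parse the model's reply and reject anything outside the schema."""
    data = json.loads(model_output)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    extra = set(data) - set(INTAKE_SCHEMA)
    missing = set(INTAKE_SCHEMA) - set(data)
    if extra or missing:
        raise ValueError(f"unexpected: {sorted(extra)}, missing: {sorted(missing)}")
    for field, expected_type in INTAKE_SCHEMA.items():
        if not isinstance(data[field], expected_type):
            raise ValueError(f"{field} should be {expected_type.__name__}")
    return data

# Invented reply, for illustration only.
reply = '{"client_name": "A. Client", "incident_date": "2024-01-15", "description": "Slip and fall."}'
print(parse_intake(reply)["client_name"])
```

A check like this catches invented or missing fields. It cannot tell whether the values match what the client actually wrote, so a human still compares the result against the original submission.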

When a task would require the model to invent something, such as "draft me a complaint citing three cases on negligent infliction of emotional distress," the right pattern is to provide the cases. Don't ask the model to know them. Provide them, and ask the model to weave them in. Supplying the source material in the prompt is the core of what people who like long names call retrieval-augmented generation, and it is dramatically more reliable than asking the model to recall.
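The "provide the cases" pattern can be sketched as two small steps: build a prompt that contains the sources under short labels, then check that the draft cites only those labels. Everything named here (the S1/S2 labels, build_prompt, unknown_labels, the placeholder source text) is an assumption made for illustration, and no model is actually called.

```python
import re

def build_prompt(task, sources):
    """Assemble a prompt that embeds the supplied sources under short labels."""
    blocks = [f"[{label}] {text}" for label, text in sources.items()]
    return (
        "Use ONLY the sources below. Cite them by label, e.g. [S1]. "
        "If the sources do not support a point, say so.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nTask: {task}"
    )

def unknown_labels(draft, sources):
    """Return labels cited in the draft that were never supplied."""
    cited = set(re.findall(r"\[(S\d+)\]", draft))
    return cited - set(sources)

# Placeholder text standing in for opinions you have pulled and read yourself.
sources = {
    "S1": "(full text of the first opinion you supplied)",
    "S2": "(full text of the second opinion you supplied)",
}
prompt = build_prompt("Draft the duty-of-care section of the complaint.", sources)

# Invented model reply, for illustration only.
draft = "The duty is well established [S1]. Damages follow [S3]."
print(unknown_labels(draft, sources))  # {'S3'}
```

A non-empty result means the draft cited something you never gave it, which is exactly the failure in Avianca. An empty result still leaves the job of confirming that each cited source says what the draft claims.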