In April of this year, a bankruptcy motion filed by Sullivan & Cromwell in the matter of Prince Global Holdings contained fabricated case citations, misquoted authorities, and at least one case that didn't exist. The firm had two required modules of AI training with tracked completions, and an office manual instructing lawyers to independently check all answers, case citations and other information or work product received from an AI program.[1] None of it worked, which prompted the predictable reaction: that there was a governance gap, and that AI needs human oversight. Both are, at least in part, true. But if we assume the policy itself was sound, the question is why the oversight failed: why the actual cognitive act of a trained, experienced lawyer reading a document did not catch errors that, in retrospect, were there to be found. The answer has less to do with negligence than with something Daniel Kahneman spent a career documenting: the conditions under which fast, automatic thinking ("System 1") runs unchecked when slow, deliberate thinking ("System 2") should have taken over.[2]

Human cognition relies on heuristics, rules of thumb applied to signals that the brain has calibrated through experience, to decide which mode of thinking a task calls for. In document review those signals vary in weight, and the identity of the person producing the document is perhaps the first and therefore the strongest. Are they two years qualified or fifteen? If two, has the partner worked with them before? If so, the partner will carry a mental model of exactly where the associate's reasoning tends to wobble, which arguments they overstate, where their research is reliable and where it needs checking.

Beyond the producer, there are signals within the document. We've all noticed a spelling or grammar mistake early in a document and found that it throws doubt on the care and attention paid to the substance. A first-year associate who isn't sure about a holding may write around the uncertainty; the paragraph structure gives it away, and the confidence doesn't match the weight being placed on it. When a paragraph has been lifted from another document, the seam usually shows in a shift of register, a phrase that sits uncomfortably in the prose around it, a Frankenstein effect: competent pieces that don't quite belong to each other. Then there is the construction itself. Humans deploy the rhetorical architecture of expertise (the rule of three, the decisive pivot, the structured callback), but inconsistently, and only when the content lends itself to that organization. AI uses these devices consistently because they pattern-match to authoritative prose in the training data. The reader's brain “hears” the triple and concludes that the writer has mastered the material. The construction sounds confident and confidence sounds correct, yet the reasoning underneath may be neither.

AI produces assertions of logic and confidence without the intermediate reasoning that would warrant them. The output is driven by probability of fit, not by a validated chain of inference, and so what appears on the page is a perfect facsimile of logic, which is not the same thing as logic. Logic is a process of valid inference from true premises to warranted conclusions, conducted by a mind that, at its best, knows what it doesn't know. The appearance of logic (sequential structure, cited support, confident conclusion) can be reproduced without any of that process occurring underneath. What you see is the surface, and a brain running on thirty years of learned heuristics, conditioned to take the fast path where it can, will read that surface and conclude the substance is sound.

AI removes most of the traditional tells: the prose is fluent, the register holds, the confidence reads as expertise, and the citations are formatted correctly whether or not they exist. There is no stumble or hedge, and no Frankenstein seam, because AI doesn't assemble from pieces but generates from pattern; the joins are invisible because there are no joins. So System 1 sees fluency and concludes there is nothing to examine, System 2 never engages, and the review happens in the wrong cognitive mode, not because the reviewer is careless, but because every signal that would have prompted a shift of gears was absent.

For as long as lawyers have reviewed each other's work, fluency has been a legitimate proxy for quality. A well-constructed argument, cleanly written, with properly formatted citations, was almost always produced by someone who had also taken care with the substance; the surface and the substance were made by the same person, and they correlated. Humans go on responding to a signal long after the underlying correlation has changed, and what AI has done is break the correlation while preserving the signal. The output is invariably fluent and confident, formatted as if produced by someone who knew exactly what they were doing, but the polish that experienced reviewers learned to use as a proxy now carries no information about underlying quality. The psychological response it triggers remains unchanged: the brain sees polish and concludes that this requires less scrutiny.

The S&C story is not necessarily, or not only, a policy failure; it is a failure of cognitive architecture. The policy was right, but it sat in tension with human psychology. We will need to learn a new set of heuristics tailored to identifying and efficiently reviewing AI-generated content. To do that we must first set aside the current set, which, at least initially, means a partner would be well advised to review all such content through System 2, as though it were written by a first-year associate.

Why does that matter (other than avoiding embarrassing filings)? It changes the productivity case for AI, which assumes the review burden stays roughly constant when it doesn't. A partner reviewing a lawyer's work applies targeted scrutiny calibrated to a known producer and concentrated where experience says it's needed, and that review is efficient precisely because the heuristics work. Remove them and the review changes character entirely. Many AI outputs require scrutiny closer to a first-year review than firms currently assume: comprehensive and undifferentiated, without the shortcuts that make senior review fast.

If a partner must spend an hour reviewing work that AI produced in minutes, the question worth asking is how much of the drafting saving that review hour consumes once it is priced at partner rates rather than associate rates. The vendors don't publish that number and the firms don't volunteer it, but it sits in every matter where a partner read an AI draft more carefully than they expected to and billed less than the time warranted. The firm in question bills its restructuring partners at a reported $2,000 an hour. An hour of that, spent re-reading what the machine drafted in minutes, is what the productivity case leaves out.
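For readers who want to put their own numbers on that question, the arithmetic can be sketched in a few lines of Python. This is an illustration only. The $2,000 partner rate is the reported figure above; the function name, the four associate hours saved and the $800 associate rate are hypothetical placeholders, not data from any firm or matter.

```python
REPORTED_PARTNER_RATE = 2000  # USD per hour, the reported figure cited above


def share_of_saving_consumed(drafting_hours_saved, drafter_rate,
                             review_hours_added,
                             reviewer_rate=REPORTED_PARTNER_RATE):
    """Fraction of the drafting saving used up by the added review time."""
    drafting_saving = drafting_hours_saved * drafter_rate
    if drafting_saving <= 0:
        raise ValueError("drafting saving must be positive")
    review_cost = review_hours_added * reviewer_rate
    return review_cost / drafting_saving


# Hypothetical inputs, chosen only to show the shape of the calculation:
# 4 associate hours saved at an assumed $800 an hour, 1 extra partner hour.
share = share_of_saving_consumed(drafting_hours_saved=4,
                                 drafter_rate=800,
                                 review_hours_added=1)
print(f"{share:.1%} of the drafting saving goes to review")
# prints: 62.5% of the drafting saving goes to review
```

On those assumed inputs, a single extra hour of partner review absorbs well over half of the saving. A result above 100% would mean the review cost more than the drafting it replaced.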

This raises a deeper question, not about review process or governance policy but about what the heuristics sought to detect. They weren’t detecting fluency for its own sake. They were using fluency as a proxy for something underneath it: context and judgment, the accumulated consequence of having been wrong and understood why. The difficulty is that much of that context was never fully captured in the first place. The legal industry has spent thirty years trying to solve that problem. It has largely failed. AI has inherited the consequences. That’s the subject of the next article.

If your organization has adopted GenAI, what exactly has changed in the review process? Not the policy. Not the training module. The actual review. Have partners been taught how to review AI-generated work differently from associate-generated work, and if so, how? If the answer is “we trust but verify,” who has defined what verification now means?