Why You’re Thinking About “Reasoning” All Wrong


The Illusion and Appeal of LLM Reasoning

Words like reasoning, thinking, and writing are the working tools of the legal profession. But with the rise of large language models, like OpenAI’s GPT, Anthropic’s Claude, and Google’s Gemini, these words are now used in a different way. If we don’t confront their false familiarity, we risk misunderstanding the capabilities of these tools and misplacing our trust in them.

This article is about the mismatch between how “reasoning” works in law and how it’s being described in connection with LLMs. We’ll explore:

  • What LLMs are actually doing when they generate reasoning-style responses

  • Why those responses often feel persuasive even when they’re not grounded

  • What makes human reasoning special

  • Why lawyers may be more vulnerable to over-trusting this kind of output

If we’re clear about the limits and strengths of these tools, we can use them more effectively. That starts with understanding what reasoning really means—and what it doesn’t.

Why We Hope for LLM Reasoning

When lawyers hear reasoning, our interpretation is influenced by the fact that legal reasoning is foundational to the practice of law. It’s how we interpret rules, apply precedent, make analogies, develop arguments, and reach conclusions. Our advice is only as good as the reasoning behind it, which we convey through precise language.

Much of the recent interest in GenAI reasoning seems driven by two hopes. First, we hope reasoning might help LLMs avoid hallucinations: if we can get LLMs to explain their steps, they should be more accurate. Second, we hope reasoning is the last missing piece in building a truly useful GenAI coworker: an assistant that can draft, evaluate, prioritize, and even identify and escalate important issues—like a junior associate. Both hopes are understandable, but they’re leading us to misunderstand what LLMs with “reasoning capabilities” can actually do. Even with newer reasoning models, LLMs are still just generating plausible strings of text based on statistics.

You Keep Using That Word…

The word reasoning is doing a lot of work right now. In conversations about GenAI, it’s become a kind of shape-shifting shorthand—sometimes for formal logic, sometimes for problem-solving, sometimes for research, and sometimes just for output that better aligns with our experience of the world.

When product announcements refer to an LLM’s reasoning ability, they’re often describing its ability to follow task instructions or generate multi-step answers that resemble human explanations and appear logically organized.[1] It might also mean the model performed well on a benchmark that involves solving puzzles or answering questions that require intermediate steps. Sometimes it just means the output looks like what we expect from someone who’s thinking through a problem. Whatever the intended sense, that LLM-specific definition conflicts with what we instinctively believe reasoning is, yet it’s not so far off that we immediately know to ask clarifying questions.

To see how easily this kind of misunderstanding occurs, consider a word lawyers know well: consideration. In everyday English, consideration means careful thought. “I gave it serious consideration” implies attention and deliberation. But in contracts, consideration has a legal-specific meaning: the bargained-for exchange of value. It has nothing to do with how much someone thought about the deal. We know the elements of a contract are not offer, acceptance, and thinking about it.

Just like a pro se litigant would have an ugly surprise if they discovered in court that consideration was a legal term of art, lawyers will be similarly surprised if they don’t realize that reasoning has a different meaning when it comes to GenAI.

Since discussions of reasoning often include other recycled terms of art, here’s a quick guide to how I break down the words that have taken on specialized meaning in the GenAI era:[2]

 

  • Reasoning. Human meaning: purposeful, structured thinking involving logic, judgment, and understanding. LLM meaning: generating text that resembles logical patterns.

  • Thinking. Human meaning: mental activity, including memory, planning, imagination, doubt, and making and revising conclusions. LLM meaning: no internal state, just statistical prediction.

  • Inference. Human meaning: drawing conclusions from evidence; can be deductive, inductive, abductive, or evaluative. LLM meaning: token (small unit of data) prediction during output generation; statistical computations.

  • Certainty. Human meaning: expression of actual belief or confidence. LLM meaning: probabilistic token-level confidence in generated output.

  • Writing. Human meaning: intentional communication with structure, audience awareness, and applied experience. LLM meaning: word-by-word surface-form text generation, with no intent, no knowledge of the world, and no ability to comprehend the generated text.
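
To make the right-hand column concrete, here is a minimal sketch of what “inference” and “certainty” mean at the token level. The candidate words and scores are invented and no real model is involved; the point is only that an LLM’s “confidence” is a probability attached to a word, not a judgment about the facts or the law.

import math

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Invented scores a model might assign to candidate next tokens after the
# prefix "The court found the defendant". Toy numbers, not real output.
candidates = ["liable", "negligent", "not", "guilty"]
scores = [2.1, 1.3, 0.4, 1.9]

for token, p in sorted(zip(candidates, softmax(scores)), key=lambda pair: -pair[1]):
    print(f"{token:10s} {p:.2f}")

# "Certainty" here is just the probability of the chosen token: a statement
# about word statistics, not about the case.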

 

What LLMs Are Actually Doing

LLMs with reasoning capabilities now “think” for longer and “explain” how they reach conclusions. They can produce both what looks like a well-reasoned response and what looks like a thoughtful explanation of it. But those two things are not necessarily connected. In some cases, the model generates a solid response for reasons it can’t explain. In others, it offers a plausible explanation for a response it got wrong. And sometimes it gets both wrong, yet sounds so convincing that the errors are hard to spot.

When we use chain-of-thought prompting[3] and tell an LLM to “explain your reasoning,” it will appear to do so. We’ll get similar step-by-step explanations when we use deep research features. The LLM might start with, “First, we consider the relevant statute.” It may say “that doesn’t sound right” or “let me double-check that” or “let’s revise that conclusion” to give the impression of critical thinking. It may even include rhetorical pauses like “but that may not apply here.” The veneer of self-awareness is misleading. The model appears to be thinking, weighing options, and revising its conclusions. But these aren’t signs of introspection, and there is no internal process to share. The LLM is mimicking the outward appearance of human reasoning by predicting the words most likely to appear when a human reasons.

These thoughtful phrases are traces of human writing. The LLM is drawing on examples where someone expressed doubt or revised their answer, and repeating the kind of language that typically follows a prompt like yours. If a reasonably close string of introspective words is represented somewhere in the training data, you are likely to get it back when the LLM “reasons.”

Those explanations don’t match the LLM’s internal behavior at all—it’s calculating probabilities, not thinking through the problem. Stranger still, in many cases the LLM generates the response first, then constructs an explanation that sounds right.[4] There’s a big disconnect between what an LLM claims it did and what it actually did during generation.[5] This “transparent” insight into how an LLM works is a rhetorical performance, a kind of after-the-fact storytelling. I suggest you treat this feature as entertainment while you wait.

These intermediate explanatory statements are particularly alluring because they’re exactly what we expect from a human reasoning through a problem using the think-aloud protocols promoted by composition theorists. The pedagogy appears to be there, so we want to believe. But, to borrow a concept from classical philosophy, which I learned about from instructional technologist James Faulkner, what LLMs do is closer to mīmēsis—imitation. They reproduce the structure of something without having the essence of it.

What Human Reasoning Really Is

If we want to understand what’s missing from LLM “reasoning,” we need to be precise about what human reasoning entails. Human reasoning is a cognitive process of consciously working through problems: goal-directed, deliberative, contextual, and reflective. As we work, there’s an internal narrative and awareness of why we conclude A from B. This process is key in legal work.

Cognitive psychologist Philip Johnson-Laird defines human reasoning as the manipulation of mental models, which are internal representations of how the world works. When we reason, we don’t just match patterns; we imagine different outcomes and weigh their implications. We test arguments against standards, facts, and goals. And when we find contradictions, we revise our conclusions.

Human reasoning typically involves:

  • Goals and Intentions: Human thought is shaped by intention, and guided by an objective or desired outcome. We usually reason with a purpose in mind—to solve a problem, advise the client, persuade the court, anticipate objections, or make decisions.

  • Logic and Judgment: Humans apply logic, draw on knowledge and experience, and weigh evidence. We can interpret nuances, recognize context, and deliberately avoid contradictions. If something doesn’t make sense, we notice and adjust our approach.

  • Understanding and Meaning: We attach meaning to words and symbols. We don’t just see patterns in sentences; we know what they mean. We form mental models of how one event leads to another, test explanations, and ask why. Our reasoning is grounded in an understanding of cause and effect and the real world, and we use that to imagine timelines, motives, obligations, and outcomes.

  • Reflection and Adaptation: We can reflect on our own thought process and consciously self-correct if we realize we’re on the wrong track. We have a sense of what we know and don’t know, and we will pause to double-check or flag uncertainty.

  • Learning from Experience: We form general principles that we carry into new situations. We can interpret novel scenarios by analogizing to things we’ve seen before, even if we’ve never encountered the exact situation.

 


How LLM and Human Reasoning Differ

LLMs don’t think, reason, or understand the way humans do. Here are seven ways that LLM outputs differ from actual human reasoning:

  • No Goal-Directed Thinking: LLMs have no outside goals; they exist to complete requested responses. They simply continue text in the direction the prompt suggests. They don’t decide to be cautious, persuasive, or complete.

  • No Causal Understanding:[6] LLMs don’t understand cause and effect; they only know which words tend to appear together, not why they relate. For example, an LLM might say “rain causes traffic” because those words frequently co-occur in writing. But it doesn’t understand the underlying relationships: visibility, road conditions, driving behavior. It may even conclude that traffic causes rain simply because the association exists. LLMs have no model of the world against which they can check their outputs.

  • No Deductive Logic: LLMs can only imitate logical processes when the examples are familiar and there are no linguistic miscues. Slightly alter the wording of a math problem, and performance often drops dramatically.[7] The model wasn’t solving; it was matching (see the sketch after this list).

  • No Internal Consistency: LLMs don’t build internal representations of what they’ve said and don’t remember their outputs. If the LLM must revisit the same math problem from above, whether in the same output or in a later one, you may get a different answer each time. For LLMs, contradiction and lack of internal consistency are acceptable so long as the output is fluent.

  • No Mental Models or Semantic Grounding: LLMs are not grounded in a physical experience of the world, so they can’t base their interpretations or logic on that. They aren’t connecting words to the world. The probabilities that LLMs calculate are token-level probabilities. Each output is just another prediction, which is why LLMs sometimes make factually or legally absurd claims.

  • No Metacognition, Self-Awareness, or Epistemic Limits: LLMs don’t know what they don’t know, and they can’t assess or express how well they know something. There’s no internal model of truth or consequence, so making things up is acceptable as long as the output sounds good.

  • No Awareness of a Word’s Meaning and Value: LLMs frequently misinterpret negated statements, sometimes flipping the meaning entirely while maintaining confident word choice. They process negation as syntax rather than meaning, which leads to misinterpretations of logical implications.
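
To make the matching and consistency problems concrete, here is a minimal sketch of the kind of check a skeptical reader can run. The ask_llm function is a hypothetical stand-in for whatever model client you use, and its canned, slightly unstable answers only mimic the behavior described above; the filing-fee question and its numbers are invented.

import random

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; swap in your provider's
    client. The canned answers imitate the instability described above and
    are not real model output."""
    return random.choice(["$42", "$42", "$46"])

# The same underlying question, lightly reworded. (Base fee $35 plus $7 for
# each page after the first, so a two-page filing costs $42.)
variants = [
    "A filing fee is $35 plus $7 for each page after the first. "
    "What is the fee for a two-page filing?",
    "If the base filing fee is $35 and every page after the first costs $7, "
    "how much does a two-page filing cost?",
]

for prompt in variants:
    answers = {ask_llm(prompt) for _ in range(5)}
    print(prompt)
    print("  answers seen:", answers)

# A system that was actually reasoning would return $42 every time, under
# every phrasing; a pattern-matcher may not.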

LLMs Don’t Understand Word Value and Meaning

Because LLMs treat words as mere data points, they generally treat all short words as if they have the same, low value. But the tiny words can be the most important ones, and lawyers are trained to look for them. Lawyers will even back up, re-read a sentence, and diagram it to make sure they understand what parts of the sentence those small words are affecting.

But LLMs routinely mishandle these basic components of language. Several research papers[8] on negation found that LLMs frequently ignored or misinterpreted negation, misapplied modal expressions, misinterpreted conditional words, conflated quantifiers, assumed false converse relationships, and accepted invalid inferences because of overlapping statistical usage in the training data.[9] The more complex the language and the more clauses present, the more likely the LLM is to get it wrong.

It’s hard to imagine a program can reason if it can’t understand the word “not.” But that’s the case with LLMs. A phrase like “The court did not find liability” is obviously different from “The court found liability.” LLMs often miss this distinction. For lawyers, these words matter. A misread not, a swapped if, or a confused quantifier can change everything. A lawyer who routinely missed these words would be incompetent.
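
Here is a deliberately crude illustration, using a close variant of the sentences above, of why representations built from word statistics can underweight a word like no or not: measured by surface word overlap, a sentence and its negation look nearly identical. Real models use far richer representations than this toy similarity score, but the research cited above suggests they still blur exactly these distinctions.

def bag_of_words(sentence: str) -> set[str]:
    """Reduce a sentence to its set of lowercase words."""
    return set(sentence.lower().rstrip(".").split())

a = "The court found liability."
b = "The court found no liability."

words_a, words_b = bag_of_words(a), bag_of_words(b)
overlap = len(words_a & words_b) / len(words_a | words_b)  # Jaccard similarity

print("Shared words:", sorted(words_a & words_b))
print(f"Surface similarity: {overlap:.0%}")  # 80%, though the meanings are opposite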

Lawyers Should Be Skeptical of “Careful” Thought

Lawyers care about words more than most other professionals. We’re trained to read closely and find meaning. We don’t just analyze what’s said; we pay close attention to how it’s said. We rely on subtle language cues to convey meaning, and we expect others to do the same. In legal writing, word choice is part of the argument. Our training primes us to extract intention from word choice. That interpretive instinct, so essential to legal practice, becomes a liability when applied to machine-generated text. And the more closely we read, the more likely we are to see intentionality where none exists.

Because we put so much weight on words, it’s hard to believe that all text isn’t created based on those same standards. (This was a fairly safe assumption about text traded between lawyers before 2023, but that’s changed now that GenAI is so prevalent.)

For example, we’re taught that words like likely, arguably, or clearly are ways to calibrate legal advice and manage expectations. As Joe Fore argued in his 2019 article, A Court Would Likely (60–75%) Find…, these linguistic cues shape how legal readers interpret probability, risk, confidence, exposure, and judgment. So when an LLM uses words like likely, arguably, possibly, or clearly, we assume that it’s evaluating risk or signaling confidence when it’s not. It’s just predicting what comes next based on statistical patterns in its training data.

Further, an LLM cannot reliably tell you when it’s unsure, even though it often uses the language of uncertainty. In a 2024 paper by Gal Yona, Roee Aharoni, and Mor Geva, Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?, researchers at Google explored whether models could express confidence or uncertainty in ways that correlated with how likely the model actually was to be correct. They found that LLMs often used decisive language when their internal confidence was low, and hedging language when their confidence was high. There was no reliable correlation between how sure the model sounded and how sure it actually was. The researchers called the property they were testing for “faithful response uncertainty,” and most models failed the test.

So we must be more skeptical of AI-generated writing and stop interpreting it as if it came from a human. If we treat this text as though it were human-generated, we will read more meaning into the word choice than there is. This is more than a linguistic curiosity—it’s a mismatch that could have serious consequences.

Longer Responses Aren’t Better Ones

Legal professionals are trained to value thoroughness, and we expect that to be delivered in a detailed written explanation. Most people have come to believe that length signals depth and shows the amount of care that went into a draft. So when a large language model produces a long, step-by-step response to a legal question, it’s natural to feel reassured. But that familiarity is a trap. Because we mistake the performance of reasoning for an actual process of reasoning, we assume the depth is real. We assume the LLM is working through the problem the way a lawyer would. It’s not.

The length of an LLM’s response is not a measure of quality. It’s not even a sign of greater accuracy. In many cases, longer answers are less reliable, because each additional reasoning step or sentence creates another opportunity for something to go wrong: more risk of semantic drift, and more risk of including something that isn’t true, isn’t relevant, or doesn’t follow.

In a 2025 study, Concise Reasoning via Reinforcement Learning, researchers Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, and Kartik Talamadupula found that LLMs trained to produce long, step-by-step answers often performed worse than those trained to be brief and direct. During reinforcement learning, LLMs are rewarded for producing helpful outputs and penalized when the reviewer doesn’t like the answer. But when a long response contains a mistake and is penalized for it, the penalty is spread across many tokens, softening its effect on any one of them. As a result, verbosity is unintentionally rewarded, even when it contributes to inaccuracy.
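
The dilution is easy to see with toy numbers. This is only a sketch of the mechanism as described above, with invented values, not the paper’s actual training setup.

# Toy illustration: the same penalty for one wrong answer, averaged over the
# tokens that produced it. All numbers are invented.
penalty = -1.0

for tokens in (20, 400):
    print(f"{tokens:3d}-token answer: {penalty / tokens:+.4f} per token")

#  20-token answer: -0.0500 per token
# 400-token answer: -0.0025 per token
# Spread over 400 tokens, the same mistake barely registers on each one, so the
# training pressure against long, wrong answers is weaker.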

To test this theory, the researchers retrained models to prefer shorter answers. The result: performance stayed the same or improved. Accuracy didn’t depend on how much the model said. In their experiments, correct answers tended to be short and incorrect answers tended to be long.

“Trustworthy” and Helpful by Design

Humans—including lawyers—tend to attribute understanding or intention to a system that uses human language fluently. Named after a 1960s experiment, the ELIZA effect explains why even minimal linguistic cues can trigger a sense of thoughtfulness or empathy in readers. The original ELIZA system was a simple chatbot that echoed users’ statements with basic question formats. Despite the simplicity of the system, people reported feeling heard and understood.
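
The mechanism behind the original ELIZA really was about that simple. Here is a minimal sketch in the same spirit; it is a toy homage with made-up rules, not Weizenbaum’s actual script.

import re

# A few ELIZA-style rules: match a pattern, echo it back as a question.
RULES = [
    (r"i am (.*)", "Why do you say you are {0}?"),
    (r"i feel (.*)", "Why do you feel {0}?"),
    (r"my (.*)", "Tell me more about your {0}."),
]

def respond(text: str) -> str:
    text = text.lower().strip(".!? ")
    for pattern, template in RULES:
        match = re.match(pattern, text)
        if match:
            return template.format(*match.groups())
    return "Please go on."

print(respond("I feel uneasy about this contract."))
# -> Why do you feel uneasy about this contract?

A reader who did not know how short the rule list was could easily mistake the echo for attention.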

Today’s LLMs operate at a far larger scale with far more sophisticated language, but the psychological trap is the same. It seems that today’s tools may be leaning into this phenomenon by design: most LLMs are engineered not just to be helpful, but to feel helpful, and their interfaces encourage spending more time with the LLM.

Putting the ELIZA effect and the legal profession’s reliance on calibrated language together, we get a perfect storm for lawyers to believe that LLM reasoning is more capable and trustworthy than it is. Lawyers see a model hedge, or express confidence, or walk through a problem step-by-step, and they believe a reasoning process must be behind it. When the LLM says, “that may not be correct,” or “let me reconsider,” we instinctively infer a reflective process, even when we intellectually understand that no such process exists.

How Diligent Prompting Makes Outputs Worse

Knowing that LLMs can be wrong, a well-meaning lawyer may try to fix the problem with diligent prompting.

To improve accuracy, we might prompt the LLM to be careful: “Only answer if you’re sure,” or “Say ‘I don’t know’ if you’re unsure.” And the model will comply—linguistically. While this changes the tone of the output, it has no effect on accuracy.[10] This creates a false sense of safety. We feel like we’ve mitigated the risk by steering the language, but we’ve really just added another variable that may push the model further from what it “knows” statistically. Because the prompt suggests caution, the LLM will perform caution. The response may sound more cautious, but the connection to statistical certainty has not improved. In some cases, it may have gotten worse. In trying to correct the behavior, the user may simply make the illusion of thinking more persuasive.

To improve depth, we might prompt the LLM to work step by step in a method called chain-of-thought (CoT) prompting. The idea is that breaking a problem into smaller parts will lead to better responses. While this method can sometimes help, it often just wastes tokens and computing power. The model isn’t actually solving a problem in parts. It isn’t pausing, checking assumptions, or reasoning through a process. It’s responding to the form of the prompt. When it produces a multi-step response, it’s drawing on examples of similar explanations from its training data. It’s giving you something that sounds like step-by-step reasoning. In many cases, the LLM generates the conclusion first, then backfills a plausible-sounding rationale to match it. Because the prompt asks for steps, the LLM will perform steps. Still, this performance of careful thought triggers our trust. Anthropic’s interpretability research has confirmed this pattern: when models like Claude are prompted to explain their reasoning, the step-by-step output may not reflect how the response was generated.
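
For a sense of how these prompt styles differ only on the surface, here is a sketch of the three framings discussed above. The legal question and the wording are invented, and call_llm is a hypothetical placeholder for whatever client you use; only the text sent to the model changes.

QUESTION = (
    "Does the limitation period in the attached clause bar a claim "
    "filed in March 2026?"
)

# Three framings a diligent lawyer might try. Only the wording changes;
# the model's underlying statistics do not.
prompts = {
    "plain": QUESTION,
    "cautious": "Only answer if you are sure. Say 'I don't know' if you are unsure.\n\n" + QUESTION,
    "chain-of-thought": "Think step by step and explain your reasoning before giving a final answer.\n\n" + QUESTION,
}

for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
    # answer = call_llm(prompt)  # hypothetical call; swap in your provider's client.
    # Expect the style of the answer to change more reliably than its accuracy.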

Conclusion

If you’re struggling to start a difficult task and you’re using GenAI just to get something to respond to, none of this may matter. But if you believe that LLMs can reason, it’s time to rethink that position. Reevaluate where you’ve put LLMs in your workflows and introduce additional verification steps to protect your work and your clients.

It’s laudable to want to be an efficient, tech-savvy lawyer, but sometimes that means taking a different path or preparing before starting down the path. There are many opportunities between client intake and court filing where you can cut costs and save time. For now, reasoning is not one of them.

About the Author

Ivy B. Grey is the Chief Strategy & Growth Officer for WordRake. Prior to joining the team, she practiced bankruptcy law for ten years. In 2020, Ivy was recognized as an Influential Woman in Legal Tech by ILTA. She has also been recognized as a Fastcase 50 Honoree and included in the Women of Legal Tech list by the ABA Legal Technology Resource Center. Follow Ivy on Twitter @IvyBGrey or connect with her on LinkedIn.

 

[1] At the time of writing in May 2025, LLM reasoning is limited to what is described here. But the paper Towards Large Reasoning Models (2025) outlines a path toward reasoning that comes closer to meeting our expectations. The researchers propose a range of techniques that would help LLMs move beyond next-word prediction, including architectural changes and training strategies that support explicit reasoning, such as test-time planning, process supervision, and reinforcement learning with reasoning-aware reward models, all aimed at scaffolding deliberate reasoning within LLMs.

[2] This breakdown was inspired by James Faulkner’s post on LinkedIn reflecting on a conversation between Emily Bender and Sébastien Bubeck and the ways we have co-opted the existing language of human experience to describe what LLMs do, as well as Matt White’s I Think Therefore I am: No, LLMs Cannot Reason (2025) on Medium.

[3] Chain-of-thought (CoT) prompting is a technique that tells LLMs to solve a problem using a series of steps before giving a final answer. Before deep research features became available, it was the primary way to simulate reasoning. The method is promoted in Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022). While impressive, it doesn’t live up to the expectations that lawyers have when they expect reasoning to reflect what humans do.

[4] Anthropic’s On the Biology of a Large Language Model (2025) showed that when an LLM “explains” how it solved a math problem, the explanation may not reflect the model’s actual internal process—it simply outputs what a plausible explanation would sound like, often after the fact. See also Reasoning Models Don’t Always Say What They Think (2025).

[5] Anthropic’s Tracing the thoughts of a large language model (2025) showed that when an LLM “thinks aloud,” the explanations do not reflect the steps actually occurring, though there is some internal logic to how the problem is solved—just not how a human would do it. See also What large language models know and what people think they know (2025).

[6] I Think Therefore I am: No, LLMs Cannot Reason (2025).

[7] In GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models (2024), six AI researchers at Apple found that LLMs cannot perform genuine logical reasoning and that their performance on math problems breaks down with even slight changes to wording.

[8] Revisiting subword tokenization: A case study on affixal negation in large language models (2024); Negation: A Pink Elephant in the Large Language Models' Room? (2025); and Language models are not naysayers: An analysis of language models on negation benchmarks (2023).

[9] In Can Transformers Reason Logically? A Study in SAT Solving (2024), researchers found that LLMs could handle tasks involving formal logical inference (syllogisms, set theory, and nested conditionals) until the problems involved subtle shifts in quantifiers (like “some” vs. “all”) and negation. This suggests that LLMs lack semantic awareness of logic.

[10] After finding that “With standard decoding, virtually all the models answer decisively, even in the presence of significant intrinsic uncertainty,” Yona et al. (2024) prompted the LLMs to hedge. The models complied, but the hedging language wasn’t a trustworthy indicator of uncertainty.

 
