LLMs crush coding and math but choke on casual questions, and that’s not a contradiction

Large language models can now refactor sprawling codebases and solve graduate-level math, yet they still stumble over simple, everyday prompts. That whiplash is real, argues Andrej Karpathy—and it isn’t a contradiction. It’s a consequence of where today’s AI gets the strongest, cleanest feedback.

Two AI realities, talking past each other

Karpathy describes two camps shaping the public narrative. One group has tried a free chatbot or a flashy voice demo and walked away unimpressed after silly mistakes and hallucinations. Those experiences, often with outdated or lightly maintained models and interfaces, color their entire view of AI’s limits.

The other group lives inside modern, pro-grade setups—agentic coding environments, math solvers, research assistants—powered by the latest frontier models. There, the progress has been staggering: models can autonomously restructure code, write and pass unit tests, and even surface security vulnerabilities with minimal human nudging.

Both groups are correct about what they see. They’re just looking at different slices of the stack—and different feedback loops.

Why code and math are racing ahead

The key difference is verifiability. In domains like programming and mathematics, correctness can be checked automatically: the code compiles or it doesn’t; the test suite passes or fails; the proof checker accepts the derivation or rejects it. That binary feedback is gold for training and tuning. It enables reinforcement learning with clear reward signals, synthetic data generation at scale, and rapid iteration with measurable gains.
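To make the "binary feedback" idea concrete, here is a minimal sketch of a verifier that grades a candidate solution against a test suite and emits a pass/fail reward. The names (`check_candidate`, `TESTS`) are illustrative, not any real training API:

```python
def check_candidate(candidate, tests):
    """Return 1.0 if the candidate passes every test, else 0.0.

    This is the kind of crisp, automatable signal reinforcement
    learning can optimize against: no human judgment required.
    """
    try:
        return 1.0 if all(candidate(x) == want for x, want in tests) else 0.0
    except Exception:
        # A crash is just another failure -- still a clean signal.
        return 0.0

TESTS = [(2, 4), (3, 9), (-1, 1)]   # (input, expected square)

good = lambda x: x * x               # correct solution
bad = lambda x: x + x                # plausible-looking mistake

print(check_candidate(good, TESTS))  # 1.0
print(check_candidate(bad, TESTS))   # 0.0
```

Contrast this with "write good advice": there is no `tests` list you could hand to a grader, which is exactly the asymmetry the next paragraph describes.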

By contrast, fuzzy domains—open-ended writing, broad consulting, general chit-chat—lack crisp ground truth. There’s no universal pass/fail for “good advice” or “compelling prose.” Without a clean metric, optimization is noisier and progress is harder to measure. Models can still be helpful, but they’re more prone to inconsistency and overconfident errors, especially in casual, under-specified prompts.

The Software 2.0 lens: don’t specify, verify

Karpathy’s earlier “Software 2.0” framing puts it bluntly: what matters isn’t whether you can describe the task—it’s whether you can verify the output. If a system can receive fast, automated feedback, it can be trained efficiently. If not, improvement stalls behind human-in-the-loop bottlenecks and subjective judgments.

That’s why AI shines in environments rich with automatic checks: compilers, linters, type systems, formal verifiers, SAT/SMT solvers, theorem provers, and exhaustive test harnesses. Each acts like a teaching signal. The more verifiable the task, the more automatable it becomes in this new programming paradigm.
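Even a language's own compiler is one of these teaching signals. Python's built-in `compile()` either succeeds or raises a `SyntaxError`, giving an unambiguous verdict with zero human review; a toy checker might look like this:

```python
def compiles(src: str) -> bool:
    """Automatic check: does this source at least parse as Python?"""
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles("def f(x): return x + 1"))   # True
print(compiles("def f(x) return x + 1"))    # False -- missing colon
```

Real pipelines stack many such checks (linters, type checkers, full test runs), each one sharpening the signal a bit further.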

So, can general intelligence emerge from LLMs?

This split raises a live question: are language models inching toward general intelligence, or are we merely sculpting domain specialists wherever verification is strong? Karpathy’s argument doesn’t close the door on generality—it reframes what’s required to get there. Scale helps, but closing the gap likely hinges on improving the feedback infrastructure: better evaluators, more robust simulators, richer tool use, and agents that turn vague goals into verifiable subproblems.

In practice, that means more systems where models propose solutions and external tools judge them. Think of it as building “scaffolding” that converts open-ended tasks into sequences of checkable steps, tightening the learning loop one verification at a time.
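The propose-then-judge loop above can be sketched in a few lines. Here the "proposer" is a stand-in (random search over small integer coefficients), not an LLM, and the goal is stated only as verifiable input/output examples; every name in the snippet is hypothetical:

```python
import random

def propose(rng):
    """Stand-in for a model: guess coefficients for f(x) = a*x + b."""
    a, b = rng.randint(-5, 5), rng.randint(-5, 5)
    return (lambda x, a=a, b=b: a * x + b), (a, b)

def verify(f, examples):
    """External judge: exact match against known input/output pairs."""
    return all(f(x) == y for x, y in examples)

def solve(examples, tries=10_000, seed=0):
    """Tight loop: propose a candidate, let the verifier grade it."""
    rng = random.Random(seed)
    for _ in range(tries):
        f, params = propose(rng)
        if verify(f, examples):   # the checkable subproblem: fit the data
            return params
    return None

# Target f(x) = 3x - 2, expressed only as verifiable examples.
print(solve([(0, -2), (1, 1), (4, 10)]))  # (3, -2)
```

Swap the random proposer for a model and the exact-match judge for unit tests or a theorem prover, and you have the scaffolding pattern in miniature: the open-ended task never reaches the learner directly, only its checkable pieces do.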

The elusive universal verifier

Industry watchers have whispered about a “universal verifier” that could make reinforcement learning work across nearly any domain. If you could reliably grade arbitrary outputs, you could optimize toward quality everywhere—writing, strategy, design, even conversation. So far, that silver bullet hasn’t appeared. The hard truth is that many human tasks resist objective scoring because context, taste, and trade-offs matter.

Meanwhile, debates about the limits of current methods simmer on. Some researchers have even quipped that “deep learning research is done”—a provocation reflecting frustration with diminishing returns in poorly verifiable domains rather than a consensus conclusion. What’s clear is that the frontier is shifting from raw modeling to better feedback, evaluation, and tool integration.

How to reconcile the contradiction

  • Both impressions are valid: casual chat still trips models; structured, checkable work is surging.
  • Verifiability is the engine: domains with automated feedback accelerate; fuzzy tasks lag.
  • Progress now depends as much on scaffolding as on model size—tests, tools, and evaluators are the new compilers of AI.

The takeaway for users and teams: calibrate expectations by task. If you can build or borrow a verifier—unit tests, solvers, checkers, benchmarks—you can push AI surprisingly far. If you can’t, don’t expect miracles from a single prompt. Until we invent broadly reliable evaluators, LLMs will keep looking superhuman in code and math while feeling strangely mortal in small talk—and that’s exactly what the science predicts.
