At this year’s ILTACON, between the open bars and the marketing bingo cards, I picked up on a murmur running through the legal tech crowd. While OpenAI and Anthropic continue begging for more and more investor cash in the face of consistently lackluster earnings, some vendors delivering advanced AI to the legal industry dropped hints about growing interest in small models. It’s not that large language models don’t work (though they often don’t), but that they’re bloated science experiments which, as Goldman Sachs observed, require exponentially increased resources to achieve tiny linear gains. Practical applications, at least in legal, don’t need models that require the human battery array from The Matrix just to say, “here’s a haiku about ERISA.”
This week, a number of developments from the greater tech world tend to confirm that the future is small.
Small models — for the purpose of this discussion — are “small” only as compared to the labyrinthine architectures behind products like GPT-5. That said, these smaller models deliver cheaper results without much drop-off in quality. Some could be light enough to run on institutional hardware, meaning law firms and corporate clients can keep their data in-house instead of shipping it off to Silicon Valley narcs. For an industry that still treats the cloud like it’s a Soviet spy balloon (an overreaction, but a persistent one), the pitch for small models is obvious: more control, less spend, nearly the same output.
This week, Meta announced its small reasoning model, confirming that the race toward small might be on. Its new model is designed to be hosted locally and will be specialized, as small models are by necessity, to math and coding applications, but the announcement bucks what had been a runaway train toward building bigger and bigger models. Going small might also be in Meta’s best interest, since its general AI offering imploded on stage during a live demo this week:
I’ll bet Zuckerberg never thought he’d find himself on stage thinking back fondly to the Metaverse announcement. Wi-Fi problems? Sure, bud.
For some time now, I’ve been saying that whoever delivers the “American DeepSeek” wins the long-term AI crown. China-based DeepSeek is still a large model by technical standards, but much smaller than the competition, and it burst onto the scene this year claiming to do basically everything the behemoth American models can for a fraction of the price. Except tell you what happened in Tiananmen Square in 1989, of course. Investors in up to their necks in the big American foundation models tried to downplay DeepSeek’s cheapness claims, arguing that the Chinese government must have contributed more money under the table to bring the product to life. But even the most aggressive theories of Chinese government involvement still ended with a product that cost a tiny fraction of what the Americans spent while outperforming American models on some tasks. Anyone able to replicate that without the lingering concern that the product is scraping corporate secrets into a PRC database should dominate the space.
This week, in a peer-reviewed paper, DeepSeek disclosed that the cost of training its R1 model was… $294,000. That’s cheaper than a second-year associate once you include the bonus and the cost of every midnight Uber Eats order and 2 a.m. black car voucher. With cheaper training comes cheaper operation. DeepSeek charges something like $0.0011 per thousand tokens, which is a whopping 27 times cheaper than OpenAI’s comparable rates.
But are smaller models ready for the “agentic” revolution? The answer is yes. And not just because “agentic” is an empty buzzword that should be purged from legal tech conversations. According to VentureBeat, “agentic” is, charitably, “a largely nebulous term still to this day in the AI industry.” Less charitably, tech commentator Ed Zitron describes it as “one of the most egregious acts of fraud I’ve seen in my entire career writing about this crap, and that includes the metaverse.” Fundamentally, an agent is a batch file of chatbot prompts, which is not necessarily a dig, since curated and vetted prompts make for better results. In action, an agent takes a short, general prompt from the user, builds a workflow from it (something a chatbot can already do), and then uses that workflow to generate results, often by pinging outside resources. That can save some time over repeatedly prompting a bot, but it’s not a robot lawyer run amok like the “agent” branding might suggest.
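The “batch file of chatbot prompts” pattern can be sketched in a few lines. Everything below is hypothetical scaffolding: a real agent would call a language model to plan and real APIs as tools, where here the planner returns a canned workflow and the tools are stubs.

```python
# A toy sketch of the agent loop described above: a short user request gets
# expanded into a multi-step workflow, and each step "pings" a tool. All
# names here (plan, TOOLS, run_agent) are illustrative, not any real API.

def plan(request: str) -> list[str]:
    """Stand-in for the model call that turns a request into a workflow."""
    # A real system would prompt a model; here we return a canned plan.
    return ["search_docket", "summarize", "draft_email"]

# Each "tool" takes the running context and appends its (stubbed) result.
TOOLS = {
    "search_docket": lambda ctx: ctx + ["3 filings found"],
    "summarize": lambda ctx: ctx + ["summary of filings"],
    "draft_email": lambda ctx: ctx + ["draft email to client"],
}

def run_agent(request: str) -> list[str]:
    """Execute the planned steps in order, threading context through."""
    context: list[str] = [request]
    for step in plan(request):
        context = TOOLS[step](context)
    return context

print(run_agent("Update the client on the Smith docket"))
```

Note that nothing in the loop requires a frontier-scale model; the planner just has to be good enough to pick the right steps, which is exactly the kind of narrow job a small model can handle.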
They also fail a lot. According to Salesforce, the company putting more eggs in agentic AI than anyone, agents “achieve around a 58 percent success rate on tasks that can be completed in a single step without needing follow-up actions or more information” and this falls “to 35 percent when a task requires multiple steps.” This is their own research!
However, in the right hands, these systems can produce better and faster results than a user working alone. But, again, do they need large models to pull this off?
Also this week, Alibaba’s AI research team dropped Tongyi DeepResearch, “on par with OpenAI’s DeepResearch across a comprehensive suite of benchmarks.” Per VentureBeat:
The new Tongyi DeepResearch Agent is setting off a furor among AI power users and experts around the globe for its high performance marks: according to its makers, it’s “the first fully open-source Web Agent to achieve performance on par with OpenAI’s Deep Research with only 30B (Activated 3B) parameters.”
That is… small. By way of comparison, GPT-4 supposedly ran on 2 trillion parameters. Compared to an activated 3 billion, that’s an ominous 666x difference.
Look, large models played their part. Without them, we probably wouldn’t have these workable smaller models. The real trick is that it’s nearly impossible to weight a model properly for the most efficient results from the start. But once a model is massive, it develops smaller subnetworks that do the real work on various queries. The premise of the Lottery Ticket Hypothesis is that once you have a big enough model, you can start paring it down to find the well-weighted subnetwork that wouldn’t have been uncovered but for the original massive investment. At that point, you can, as the joke goes, build the whole plane out of the black box: market a smaller model that does everything an application actually needs and nothing more.
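The paring-down step at the heart of the Lottery Ticket Hypothesis is usually magnitude pruning: keep only the largest weights and zero out the rest. A minimal sketch, using a random matrix as a stand-in for a trained layer (the specific sizes and the 10% keep rate are illustrative, not from any real model):

```python
# Magnitude pruning: the mechanical step behind finding a "winning ticket."
# A real pipeline would prune a trained network, rewind the survivors to
# their original initialization, and retrain; this just shows the pruning.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 8))  # stand-in for one trained layer

def magnitude_prune(w: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Zero out all but the top keep_fraction of weights by magnitude."""
    k = int(w.size * keep_fraction)
    # Smallest magnitude that still survives the cut.
    threshold = np.sort(np.abs(w).ravel())[-k]
    mask = np.abs(w) >= threshold
    return w * mask

pruned = magnitude_prune(weights, keep_fraction=0.1)
print(f"{np.count_nonzero(pruned)} of {pruned.size} weights survive")
```

The hypothesis’s claim is that the sparse survivor, not the full matrix, was doing the work all along; you just couldn’t have found it without training the big version first.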
As an industry, AI can start cashing in those winning tickets instead of doubling down on lotto scratchers.
This is especially true in legal, where our applications don’t require paving over the Mojave with server farms; we just need something smart enough to speed up the job. When you’re summarizing depositions, you’re not going to find yourself hurting because the underlying model wasn’t trained on a 10-year-old TypePad blog post about birdwatching. For our profession, small is both beautiful and indispensable.
And cheaper. Did we mention cheaper yet? Because it’s cheaper.
The AI landscape isn’t going to shift overnight, but as this week suggests, the tide might be turning. It’s hard to imagine OpenAI going belly up in a few months (unless you actually look at their revenues and expenditures).
But it was also hard to imagine a world without Napster or MySpace.
Joe Patrice is a senior editor at Above the Law and co-host of Thinking Like A Lawyer. Feel free to email any tips, questions, or comments. Follow him on Twitter or Bluesky if you’re interested in law, politics, and a healthy dose of college sports news. Joe also serves as a Managing Director at RPN Executive Search.