AI companies are copying each other's homework to make cheap models

Cheaply built AI abounds as developers riff off Big Tech's costly offerings. But as at a dollar store, selection and quality may vary.

Mar 7, 2025 - 12:17
Sam Altman
  • The price of building AI is falling to new lows.
  • New, cheaper AI development techniques have developers rejoicing, but it's not all upside.
  • As costs hit rock bottom, Big Tech foundation model builders must justify expensive offerings.

How much does it cost to start an AI company?

The answer is less and less each day as large language models are being created for smaller and smaller sums.

The cost of AI computing is falling. On top of that, a technique called distillation, which produces decent LLMs at discount prices, is spreading. The combination has sent a spark through parts of the AI ecosystem and a chill through others.

Distillation is an old concept gaining new significance. For most, that's good news. For a select few, it's complicated. And for the future of AI, it's important.

Distillation defined

AI developers and experts say distillation is, at its core, using one model to improve another. A larger "teacher" model is prompted to generate responses and paths of reasoning, and a smaller "student" model is trained to mimic its behavior.
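In its classic form, that mimicry is a training loss: the student is penalized for diverging from the teacher's softened output probabilities as well as from the ground-truth labels. A minimal PyTorch sketch, with the temperature T and mixing weight alpha chosen purely for illustration:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target matching and ordinary cross-entropy.

    The student is pushed toward the teacher's softened output
    distribution (KL divergence at temperature T) while still
    learning from the ground-truth labels.
    """
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # Scaling by T^2 keeps the soft-target gradients comparable in
    # magnitude as the temperature changes (per the 2015 formulation).
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```

LLM distillation in the wild often skips logit matching entirely and simply fine-tunes the student on text the teacher generates, the pattern described next.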

Chinese firm DeepSeek caused a stir with OpenAI-competitive models reportedly trained for around $5 million. The news sent the stock market into a panic, wiping roughly $600 billion off Nvidia's market capitalization on fears of a downshift in chip demand. (Such a decline has yet to materialize.)

A team of UC Berkeley researchers, flying further under the radar, trained two new models for under $1,000 in computing costs each, according to research released in January.

In early February, researchers from Stanford University, the University of Washington, and the Allen Institute for AI trained a serviceable reasoning model for a fraction of that.

Distillation was an unlock for all of these developments.

Distillation is one tool in developers' toolboxes, alongside fine-tuning, for improving models during the training phase at a much lower cost than other methods. Both techniques are used to give models specific expertise or skills.

This could mean taking a generic foundation model like Meta's Llama and using another model to distill it into an expert on US tax law, for example.

It could also mean using DeepSeek's R1 reasoning model to distill more reasoning capability into Llama. Here, "reasoning" means the model takes longer to generate an answer so that it can question its own logic and lay out its path to an answer step by step.
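That sequence-level style of distillation often amounts to ordinary supervised fine-tuning on teacher-generated text. Below is a minimal sketch using Hugging Face's transformers library; the model IDs, prompts, and training settings are illustrative placeholders, not DeepSeek's actual recipe:

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

TEACHER = "deepseek-ai/DeepSeek-R1"   # reasoning "teacher"; far too large for
                                      # most hardware, usually queried via an API
STUDENT = "meta-llama/Llama-3.2-1B"   # small "student"; both IDs are examples

# Step 1: prompt the teacher and collect its step-by-step answers.
# In practice this is done offline, at scale.
teacher_tok = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER, device_map="auto")
prompts = [
    "A train leaves at 3 p.m. traveling 60 mph ... When do the trains meet?",
    "What is 17% of 240? Show your reasoning.",
]
traces = []
for p in prompts:
    inputs = teacher_tok(p, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=512)
    traces.append(teacher_tok.decode(out[0], skip_special_tokens=True))

# Step 2: ordinary supervised fine-tuning of the student on those traces.
student_tok = AutoTokenizer.from_pretrained(STUDENT)
student_tok.pad_token = student_tok.eos_token
student = AutoModelForCausalLM.from_pretrained(STUDENT)

data = Dataset.from_dict({"text": traces}).map(
    lambda batch: student_tok(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"])

Trainer(
    model=student,
    args=TrainingArguments(output_dir="llama-distilled-r1"),
    train_dataset=data,
    # causal-LM collator pads batches and copies input_ids into labels
    data_collator=DataCollatorForLanguageModeling(student_tok, mlm=False),
).train()
```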

"Perhaps the most interesting part of the R1 paper was being able to turn non-reasoning smaller models into reasoning ones via fine-tuning them with outputs from a reasoning model," wrote the analysts at Semianalysis in January.

In addition to the bargain price tag, at least by AI standards, DeepSeek released distilled versions of other open-source models, using its R1 reasoning model as the teacher. DeepSeek's full-sized models, along with the largest versions of Llama, are so large that only certain hardware can run them. Distillation helps with that too.

"That distilled model has a smaller footprint, fewer parameters, less memory," explained Samir Kumar, a general partner at Touring Capital. "You can run it on your phone. You could run it on edge devices," he said.

DeepSeek's breakthrough was that its distilled models didn't get worse as they got smaller, as had been expected. In fact, they got better.

Distillation isn't new but it has changed

The distillation technique first surfaced in a 2015 paper authored by Google's chief scientist Jeff Dean, AI pioneer Geoffrey Hinton, and current Google DeepMind research VP Oriol Vinyals.

Vinyals recently said the paper was rejected from the prestigious NeurIPS conference because it was not deemed to have much impact on the field. A decade later, distillation is suddenly at the forefront of AI discussion.

What makes distillation so much more powerful now than it was then is the number and quality of open-source models available to serve as teachers.

"I think by releasing a very capable model — the most capable model to date — in the open source with a permissible MIT license, DeepSeek is essentially eroding that competitive moat that all the big model providers have had to date, keeping their biggest models behind closed doors," Kate Soule, director of technical management for IBM's LLM Granite, said on the company's Mixture of Experts podcast in January.

How far distillation can go

Soule said Hugging Face, the internet repository for LLMs, is full of distilled versions of Meta's Llama and Alibaba's Qwen, both conventional open-source models.

Indeed, of the 1.5 million models available on Hugging Face, some 30,000 contain the word "distill" in their names, which conventionally indicates a distilled model. But none of the distilled models has made the site's leaderboard.
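That rough count is easy to check yourself (the figure drifts as new models are uploaded daily); a quick sketch using the huggingface_hub client library:

```python
from huggingface_hub import HfApi

# Stream model listings matching "distill" and count them.
# The total changes daily as new models are uploaded.
api = HfApi()
count = sum(1 for _ in api.list_models(search="distill"))
print(f"models matching 'distill': {count}")
```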

Much like the dollar store in the physical world, distillation offers some of the lowest cost-to-performance ratios on the market, but the selection is somewhat limited and there are drawbacks.

Making a model particularly good at one type of task through distillation can erode its performance in other areas.

Apple researchers attempted to create a "distillation scaling law" that can predict the performance of a distilled AI model based on factors including the size of the model being built, the size of the "teacher" model, and the amount of computing power used.

They concluded that distillation can work better than traditional supervised learning in some cases, but only when a high-quality "teacher" model is used. The teacher also needs to be larger than the model being trained, but only up to a point: improvement stops once teacher models grow too big.
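In schematic terms (Apple's paper pins down the exact functional form; this only sketches the shape of the claim), such a law predicts the student's loss from the student's size, the distillation data budget, and the teacher's own loss:

$$
L_{\text{student}} \approx f\big(N_{\text{student}},\ D_{\text{distill}},\ L_{\text{teacher}}\big)
$$

with the student improving as the teacher's loss falls, but only while the capability gap between the two stays bounded.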

Still, the technique can, for instance, close the distance between idea and prototype for founders and generally lower the barrier to entry for building AI.

Finding a shortcut to smarter, smaller models doesn't necessarily negate the need for big, expensive foundation models, said multiple AI experts. But it does call into question the financial prospects of the companies that build those big models.

Are foundation models doomed?

"Just about every AI developer in the world today," is using DeepSeek's R-1 to distill new models, Nvidia CEO Jensen Huang said on CNBC following the company's latest quarterly earnings.

Distillation has brought opportunity, but it is poised to meet opposition due to the threat it poses to massive, expensive, proprietary models like those made by OpenAI and Anthropic.

"I think the foundation models will become more and more commoditized. There's a limit that pre-trained models can achieve, and we're getting closer and closer to that wall," Jasper Zhang, cofounder of cloud platform Hyperbolic said.

Zhang said the answer for the big names of LLMs is to create beloved products, rather than beloved models — perhaps lending credence to Meta's decision to make its Llama models somewhat open.

There are also more aggressive tactics foundation model companies can take, according to a Google DeepMind researcher who asked to remain anonymous to discuss other companies.

Companies with reasoning models could remove or reduce the reasoning steps or "traces" shown to the user so that they can't be used for distillation. OpenAI hides the full reasoning path in its large o1 reasoning model but has since released a smaller version, o3-mini, which does show this information.

"One of the things you're going to see over the next few months is our leading AI companies trying to prevent distillation," David Sacks, President Donald Trump's adviser for cryptocurrency and artificial intelligence policy told Fox News in January.

Still, it may be difficult to put the genie back in the bottle by tamping down on distillation in the Wild West of open-source AI.

"Anyone can go to Hugging Face and find tons of data sets that were generated from GPT models, that are formatted and designed for training and likely taken without the rights to do so. This is like a secret that's not a secret that's been going on forever," Soule said on the same podcast.

Anthropic and OpenAI did not respond to requests for comment.
