The artificial intelligence (AI) market — and the entire stock market — was rocked last month by the sudden popularity of DeepSeek, the open-source large language model (LLM) developed by a China-based hedge fund that has bested OpenAI's best on some tasks while costing far less.
Also: Cerebras CEO on DeepSeek: Every time computing gets cheaper, the market gets bigger
As ZDNET's Radhika Rajkumar details, the success of DeepSeek's R1 model highlights a sea change in AI that could empower smaller labs and researchers to create competitive models and diversify available options.
Why does DeepSeek work so well?
Its success is due to a broad approach within deep-learning AI: squeezing more out of computer chips by exploiting a phenomenon known as "sparsity".
Sparsity comes in many forms. Sometimes, it involves eliminating parts of the data that AI uses when that data doesn’t materially affect the model’s output.
Also: I put DeepSeek AI's coding skills to the test – here's where it fell apart
At other times, sparsity involves cutting away whole parts of a neural network if doing so doesn’t affect the result.
DeepSeek is an example of the latter: parsimonious use of neural nets.
The main advance most people have identified in DeepSeek is that it can turn large sections of neural network “weights” or “parameters” on and off. Parameters shape how a neural network can transform input — the prompt you type — into generated text or images. Parameters have a direct impact on how long it takes to perform computations. More parameters typically mean more computing effort.
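To give a rough sense of why parameter count drives computing effort, a commonly cited rule of thumb estimates a dense model's forward pass at roughly two floating-point operations per parameter per generated token. The sketch below uses that estimate; the model sizes and the rule itself are illustrative assumptions, not figures from DeepSeek.

```python
# Rough sketch: why more parameters mean more compute per token.
# Assumption (rule of thumb, not a DeepSeek figure): a dense forward pass
# costs roughly 2 floating-point operations per parameter per token.

def forward_flops_per_token(num_parameters: float) -> float:
    """Approximate FLOPs to generate one token with a dense model."""
    return 2.0 * num_parameters

for params in (7e9, 70e9, 670e9):  # hypothetical model sizes
    print(f"{params/1e9:>6.0f}B params -> ~{forward_flops_per_token(params):.1e} FLOPs/token")
```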
Sparsity and its role in AI
The ability to use only some of the total parameters of an LLM and shut off the rest is an example of sparsity. That sparsity can have a major impact on how big or small the computing budget is for an AI model.
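A mixture-of-experts layer is one common way to realize this kind of sparsity: the network holds many "expert" weight blocks, but a router activates only a few of them for each input, so most parameters stay switched off for any given token. The sketch below is a minimal, hypothetical illustration in NumPy, not DeepSeek's actual routing code.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, num_experts, top_k = 16, 8, 2   # hypothetical sizes
experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector through only top_k of num_experts experts."""
    scores = x @ router                      # one routing score per expert
    active = np.argsort(scores)[-top_k:]     # indices of experts switched on
    weights = np.exp(scores[active])
    weights /= weights.sum()                 # normalize over active experts only
    # Only top_k expert weight matrices are touched; the rest stay "off".
    return sum(w * (x @ experts[i]) for w, i in zip(weights, active))

token = rng.standard_normal(d_model)
out = moe_layer(token)
print(f"active experts per token: {top_k}/{num_experts}, output shape: {out.shape}")
```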
Apple AI researchers, in a report published Jan. 21, explained how DeepSeek and similar approaches use sparsity to get better results for a given amount of computing power.
Apple has no connection to DeepSeek, but the tech giant does its own AI research, so developments from outside companies such as DeepSeek are, broadly speaking, part of the research landscape Apple engages with.
Also: DeepSeek's AI model proves easy to jailbreak – and worse
In the paper, titled "Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models", Samir Abnar and other Apple researchers, together with MIT collaborator Harshay Shah, studied how performance varies as they turn off parts of the neural network. Abnar and his team conducted their studies using MegaBlocks, a code library released in 2023 by Microsoft, Google, and Stanford researchers, and they make clear that their work applies to DeepSeek as well as other recent innovations. The team asks whether there is an "optimal" level of sparsity for DeepSeek and similar models: for a given amount of computing power, what is the optimal number of neural weights to turn on or off?
According to the research, sparsity can be quantified as the percentage of neural weights that are turned off. That percentage can approach, but never reach, 100% of the neural network being "inactive".
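In code, that measure is simply the share of weights left inactive for a given input. The snippet below is a hedged illustration of the bookkeeping, using made-up parameter counts rather than figures from the Apple paper.

```python
def sparsity_level(total_params: int, active_params: int) -> float:
    """Fraction of the network that is 'inactive' (turned off)."""
    return 1.0 - active_params / total_params

# Hypothetical mixture-of-experts configuration: large total, small active subset.
total, active = 100_000_000_000, 25_000_000_000   # made-up numbers
print(f"sparsity: {sparsity_level(total, active):.1%} of weights inactive")
```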
Graphs show that for a given neural net, on a given computing budget, there's an optimal amount of the neural net that can be turned off to reach a level of accuracy.
For a neural network of a given size in total parameters, with a given amount of computing, you need fewer and fewer parameters to achieve the same or better accuracy on a given AI benchmark test, such as math or question answering.
Put another way, whatever your computing power, you can increasingly turn off parts of the neural net and get the same or better results.
Optimizing AI with fewer parameters
As Abnar and team stated in technical terms: "Increasing sparsity while proportionally expanding the total number of parameters consistently leads to a lower pretraining loss, even when constrained by a fixed training compute budget." "Pretraining loss" is the AI term for how accurate a neural net is; lower pretraining loss means more accurate results.
That finding explains how DeepSeek could have less computing power but reach the same or better results simply by shutting off more network parts.
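To make the trade-off concrete, the sketch below works through the arithmetic under one simplifying assumption (mine, not the paper's): that the per-token compute cost is governed by the number of active parameters, so raising sparsity lets the total parameter count grow while the compute bill stays fixed.

```python
# Hedged illustration: hold active (compute-driving) parameters fixed,
# and see how much total capacity each sparsity level buys.
active_params = 10_000_000_000          # hypothetical fixed compute-budget proxy

for sparsity in (0.0, 0.5, 0.9, 0.95):
    total_params = active_params / (1.0 - sparsity)
    print(f"sparsity {sparsity:>4.0%}: total capacity ~{total_params/1e9:,.0f}B parameters")
```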
Also: The best AI for coding in 2025 (and what not to use)
Sparsity is like a magic dial that finds the best match for your AI model and available compute.
The same economic rule of thumb has been true for every new generation of personal computers: either a better result for the same money or the same result for less money.
Also: Security firm discovers DeepSeek has 'direct links' to Chinese government servers
There are some other details to consider about DeepSeek. For example, another DeepSeek innovation, as explained by Ege Erdil of Epoch AI, is a mathematical trick called "multi-head latent attention". Without getting too deep into the weeds, multi-head latent attention is used to compress one of the largest consumers of memory: the cache that holds the most recently input text of a prompt.
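The gist of that trick, as described above, is to store a compressed, lower-dimensional "latent" version of the cache rather than the full keys and values for every past token. The sketch below is a loose, hypothetical illustration of caching a down-projected latent and re-expanding it at attention time; it is not DeepSeek's actual multi-head latent attention implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent, seq_len = 64, 8, 128      # hypothetical sizes; d_latent << d_model

W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)   # compress
W_up_k = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)  # rebuild keys
W_up_v = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)  # rebuild values

tokens = rng.standard_normal((seq_len, d_model))

# Cache only the small latent vectors instead of full keys and values.
latent_cache = tokens @ W_down               # shape (seq_len, d_latent)

def attend(query: np.ndarray) -> np.ndarray:
    keys = latent_cache @ W_up_k             # re-expand on the fly
    values = latent_cache @ W_up_v
    scores = keys @ query / np.sqrt(d_model)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ values

out = attend(rng.standard_normal(d_model))
full_cache = seq_len * d_model * 2           # entries for full keys + values
print(f"cached entries: {latent_cache.size} latent vs {full_cache} for a full KV cache")
```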
The future of sparsity research
Details notwithstanding, the most important point about this work is that sparsity as a phenomenon is not new in AI research, nor is it a new approach in engineering.
AI research has shown for many decades that removing parts of a neural net can achieve similar or even better accuracy with less effort.
Also: xAI's Grok 3 performs better than expected. How to try it free (before you pay)
Intel, a competitor of Nvidia, has been using sparsity for years to advance the state of the art in the field, and startups that build their products on sparsity have also scored high on industry benchmarks in recent years.
The magic dial of sparsity doesn't only shave computing costs, as in the case of DeepSeek. Sparsity also works in the other direction: it can make bigger and bigger AI computers more efficient.
The magic dial of sparsity is profound because it not only improves economics for a small budget, as in the case of DeepSeek, but it also works in the other direction: spend more, and you’ll get even better benefits via sparsity. As you turn up your computing power, the accuracy of the AI model improves, Abnar and the team found.
Also: Are we losing our critical thinking skills to AI? New Microsoft study raises red flags
They suggested: “As sparsity increases, the validation loss decreases for all compute budgets, with larger budgets achieving lower losses at each sparsity level.”
In theory, then, you can make bigger and bigger models, on bigger and bigger computers, and get better bang for your buck.
All that sparsity work means that DeepSeek is only one example of a broad area of research that many labs are already following — and many more will now jump on to replicate DeepSeek’s success.