DeepSeek AI claims to have found a pragmatic path to running large language models far more cheaply than rivals such as DeepMind, OpenAI, and Facebook’s AI division have managed.
In the technical paper announcing DeepSeek-V3.2-Exp, the company claims a 75 percent reduction in prediction, or inference, cost, from approximately $1.68 down to roughly $0.42 per million tokens, achieved by leaning heavily on sparsity and an efficient indexing module that trims what the model has to compute.

The pitch is straightforward: fewer redundant operations, less GPU time, and lower bills, with no noticeable cut in quality. The approach relies on a companion “lightning indexer” and a sparsity-based attention scheme to reduce the number of token comparisons the model has to make. The claims are drawn from DeepSeek’s technical paper and from internal benchmarks shared on GitHub alongside the research.
The Claim and the Math Behind DeepSeek’s Cost Savings
For chatbots and code assistants, inference costs scale with tokens. At $1.68 per million tokens, a 10,000-token session costs about 1.68 cents. Going down to $0.42 per million tokens reduces that to roughly 0.42 cents per session. Multiplied over millions of daily interactions, the savings become material: a service processing one million 10,000-token sessions a day can save more than $12,000 per day at list rates.
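As a sanity check on those figures, the short calculation below reproduces the per-session and daily numbers using only the prices and session size quoted above; nothing here comes from DeepSeek beyond those list rates.

```python
# Back-of-the-envelope check of the inference-cost figures quoted above.
OLD_PRICE = 1.68          # USD per million tokens (baseline)
NEW_PRICE = 0.42          # USD per million tokens (claimed for DeepSeek-V3.2-Exp)
SESSION_TOKENS = 10_000
SESSIONS_PER_DAY = 1_000_000

def session_cost(price_per_million: float, tokens: int) -> float:
    """Cost in USD of one session at a given per-million-token price."""
    return price_per_million * tokens / 1_000_000

old_cost = session_cost(OLD_PRICE, SESSION_TOKENS)    # $0.0168, i.e. 1.68 cents
new_cost = session_cost(NEW_PRICE, SESSION_TOKENS)    # $0.0042, i.e. 0.42 cents
daily_savings = (old_cost - new_cost) * SESSIONS_PER_DAY

print(f"per session: {old_cost * 100:.2f} cents -> {new_cost * 100:.2f} cents")
print(f"daily savings at list rates: ${daily_savings:,.0f}")   # $12,600
```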
Alternatively, a 75 percent discount buys four times as many tokens per dollar. That makes all the difference in long-context workloads, such as summarizing large documents, conducting multi-turn conversations, and backing retrieval-augmented systems, where attention costs balloon as sequences stretch into the tens of thousands of tokens.
How Sparsity Drives the Savings in AI Inference
Sparsity is the idea that a model does not need every weight firing, or every token compared against every other token, at each step. Previous work has exploited sparsity by pruning parameters, gating experts (as in mixture-of-experts models), and skipping activations that contribute little to the output. Apple’s machine learning researchers and others have flagged sparsity as a critical lever for shrinking compute without cratering accuracy.
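To make the idea concrete, here is a generic illustration (not DeepSeek’s implementation) of two classic forms of sparsity mentioned above: magnitude pruning of a weight matrix and top-k expert gating. All sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameter sparsity: zero out the smallest-magnitude weights (magnitude pruning).
W = rng.normal(size=(512, 512))
threshold = np.quantile(np.abs(W), 0.90)            # keep only the largest 10% of weights
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

# Routing sparsity: a mixture-of-experts gate runs each token through only its top-2 experts.
gate_logits = rng.normal(size=8)                    # one relevance score per expert for a token
active_experts = np.argsort(gate_logits)[-2:]       # the two experts that actually execute

print(f"non-zero weights: {np.count_nonzero(W_pruned) / W.size:.0%}; "
      f"active experts: {sorted(active_experts.tolist())}")
```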
DeepSeek’s newest turn concentrates on attention, the most computationally expensive operation in modern transformers. Standard attention has every query attend to every key, a cost that grows quadratically with context length. By safely ignoring most of those comparisons, you cut the heaviest line item in the inference budget by a wide margin.
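The scaling gap is easy to quantify. The toy comparison below counts query–key comparisons under dense attention and under a hypothetical top-k sparse scheme; the context lengths and k are illustrative choices, not figures from DeepSeek’s paper.

```python
def attention_comparisons(seq_len: int, k: int) -> tuple[int, int]:
    """Query-key comparisons under dense attention vs. top-k sparse attention."""
    dense = seq_len * seq_len              # every query scores every key
    sparse = seq_len * min(k, seq_len)     # every query scores only k preselected keys
    return dense, sparse

for n in (8_000, 32_000, 128_000):
    dense, sparse = attention_comparisons(n, k=2_048)
    print(f"context {n:>7,}: dense {dense:.1e} vs. sparse {sparse:.1e} "
          f"comparisons ({dense // sparse}x fewer)")
```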
Inside the Lightning Indexer That Guides Attention
DeepSeek trains the main model and a small “lightning indexer” separately. The core model retains its traditional attention mechanisms (as well as the company’s multi-head latent attention from the previous generation), but the indexer is now responsible for learning to select a small subset of highly relevant tokens to attend to at inference.

Operationally, the indexer narrows the search space before the full attention runs, akin to a librarian picking out the right shelves before a reader starts reading. This reduces the number of query–key comparisons, and hence matrix multiplications. DeepSeek reports “significant” end-to-end speedups on long-context tasks with little to no loss of accuracy relative to its base model.
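A minimal sketch of that pattern is below, assuming a tiny scoring projection and a fixed top-k cutoff, both invented here for illustration; DeepSeek describes its indexer only at a high level, so read this as the general shape of the idea rather than the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_index, seq_len, top_k = 64, 16, 4_096, 256   # illustrative sizes only

# Hypothetical cheap indexer: low-dimensional projections that score how relevant
# each earlier token is to the current query position.
W_q_idx = rng.normal(scale=d_model ** -0.5, size=(d_model, d_index))
W_k_idx = rng.normal(scale=d_model ** -0.5, size=(d_model, d_index))

hidden = rng.normal(size=(seq_len, d_model))   # token representations in the context
query = hidden[-1]                             # current decoding position

scores = (query @ W_q_idx) @ (hidden @ W_k_idx).T   # one cheap relevance score per token
selected = np.argsort(scores)[-top_k:]              # indices the full attention will actually see

# The expensive attention pass now runs only over `selected`,
# shrinking a 4,096-wide comparison to a 256-wide one.
print(f"attending to {selected.size} of {seq_len} tokens")
```

The reason a sketch like this can pay off is that scoring every token cheaply in a low-dimensional space costs far less than attending to every token fully, so the indexer’s overhead is repaid many times over by the comparisons it removes.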
In addition to the sparsity mechanism, DeepSeek says it trained the model on domain-specific data for mathematics and coding, where efficient reasoning and precise pattern matching are essential. That specialization could further raise practical throughput by sparing the model unnecessary detours during generation.
How It Compares to Other Model Optimization Methods
The field has been steadily whittling away at the cost of attention. Multi-query and grouped-query attention (employed by several leading labs) share key–value projections across heads, saving memory and speeding up generation. FlashAttention, introduced by Tri Dao and collaborators, reorganizes the computation to cut memory reads on GPUs and delivers significant throughput gains. In mixture-of-experts models such as Google’s Switch Transformer, tokens are routed to a small number of experts, making the computation sparse over parameters.
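For contrast, here is a rough sketch of what grouped-query attention saves: the number of key–value vectors cached per token. The head counts are illustrative, not taken from any specific model.

```python
QUERY_HEADS = 32   # illustrative head count, not tied to any particular model

def kv_vectors_per_token(num_kv_heads: int) -> int:
    """Key and value vectors cached per token; sharing KV heads across query heads shrinks this."""
    return 2 * num_kv_heads

for name, kv_heads in [("multi-head attention", QUERY_HEADS),
                       ("grouped-query attention", 8),
                       ("multi-query attention", 1)]:
    sharing = QUERY_HEADS // kv_heads
    print(f"{name:24s}: {kv_vectors_per_token(kv_heads):3d} KV vectors per token "
          f"({sharing} query head(s) share each KV head)")
```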
DeepSeek’s approach is in the same family but targets a different choke point: rather than making full attention run faster, it preselects which tokens deserve attention in the first place. It looks evolutionary rather than revolutionary, an engineering cocktail that combines a dedicated indexer with latent attention and the efficiency improvements already baked into the model.
Real-World Impact and the Open Questions That Remain
For businesses, the relevant questions are predictable: Can you get more throughput per GPU dollar under high-concurrency conditions? When the indexer and generator run on the same machine, what does the queueing behavior look like? How does this approach interact with factored KV caching, batching policies, and retrieval-augmented architectures that already do some form of context pruning?
Independent validation will matter. Benchmarks such as MLCommons’ inference suite and evaluations from academic teams like the one behind Stanford’s HELM framework can help distinguish durable gains from benchmark artifacts. Public side-by-side comparisons against the baseline architectures at different context lengths would further build confidence.
If the numbers hold, the payoff is fairly uncontroversial: cheaper per-interaction (“per-turn,” in chatbot jargon) costs, longer contexts without ruinous latency, and more room to specialize models for particular domains. At that scale, even small savings add up. The lesson is less about a single trick and more about the direction today’s cutting-edge organizations are heading: toward building models that do the least work necessary, and nothing more.