Nvidia has also turned its focus to Rubin CPX, a new GPU designed for AI models that need to reason over million‑plus‑token contexts. The system, which will be presented at the AI Infrastructure Summit, and the approach behind it are aimed at scaling long‑context inference (think entire codebases, hour‑long videos, or huge research archives) without collapsing under memory pressure or latency.
Why long-context inference matters
Context length determines how much an AI system can “see” at any given time. Most production context windows today sit in the range of 100K–200K tokens, while Google’s Gemini 1.5 Pro accepts million‑token prompts. Stretching beyond that is more than a party trick: it removes brittle retrieval hops, helps maintain narrative continuity, and enables tasks like multi‑file code reasoning, complex legal analysis, and long‑form video comprehension.
The catch is memory. Transformer inference stores key‑value (KV) caches for every layer and attention head, and their size grows linearly with sequence length. At million‑token windows, the KV caches of frontier models swell into the terabytes, and simply scaling out the same hardware is no longer feasible. That is the problem Rubin CPX was designed to solve.
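To see where the terabytes come from, here is a back‑of‑the‑envelope sketch in Python; the layer, head, and dimension counts are illustrative stand‑ins, not the specs of any particular model:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Rough KV-cache size for one sequence: K and V tensors (hence the 2x)
    per layer, each of shape [n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_value

# Hypothetical frontier-scale shape: 80 layers, 64 KV heads, head_dim 128, fp16 values.
size = kv_cache_bytes(seq_len=1_000_000, n_layers=80, n_kv_heads=64, head_dim=128)
print(f"{size / 1e12:.2f} TB per sequence")  # ~2.62 TB for this hypothetical shape
```

Grouped‑query attention and quantization shrink that figure, but even an order‑of‑magnitude reduction leaves hundreds of gigabytes per concurrent long‑context request.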
Disaggregated inference, explained
Nvidia frames CPX as part of a “disaggregated inference” architecture: decoupling compute from memory and networking so each can be scaled independently. The concept is to pool high‑bandwidth memory across accelerators, tier it against system RAM, and move KV caches intelligently between those levels without sacrificing throughput.
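A minimal sketch of the tiering idea, assuming a block‑structured KV cache; the class and eviction policy below are illustrative, not Nvidia’s actual software:

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative two-tier KV cache: a small 'hot' tier (stand-in for HBM)
    backed by a larger 'cold' tier (stand-in for system RAM). Least recently
    used blocks are evicted from hot to cold."""

    def __init__(self, hot_capacity_blocks):
        self.hot = OrderedDict()   # block_id -> KV block kept on the accelerator
        self.cold = {}             # block_id -> KV block parked in host memory
        self.hot_capacity = hot_capacity_blocks

    def put(self, block_id, block):
        self.hot[block_id] = block
        self.hot.move_to_end(block_id)
        while len(self.hot) > self.hot_capacity:
            evicted_id, evicted = self.hot.popitem(last=False)  # LRU eviction
            self.cold[evicted_id] = evicted  # in practice: async copy to host

    def get(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        # Promote from the cold tier on access (in practice: prefetched ahead of time).
        block = self.cold.pop(block_id)
        self.put(block_id, block)
        return block
```

In a real system the transfers would be asynchronous and overlapped with compute, with the scheduler prefetching the blocks it expects upcoming decode steps to need.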
Expect it to lean on high‑speed interconnects and software that knows where and when to place state. Nvidia’s inference stack already offers KV‑cache paging, in‑flight batching, and quantization in TensorRT‑LLM, plus ecosystem tools like vLLM. Combining those with a GPU optimized for very long sequences suggests CPX will target steady throughput even as contexts stretch well past a million tokens.
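Quantizing the cache itself is one of the levers mentioned above. A simple symmetric INT8 quantize/dequantize of a KV block looks roughly like this; it is a schematic illustration of the idea, not the specific scheme TensorRT‑LLM implements:

```python
import numpy as np

def quantize_kv(block_fp16):
    """Symmetric per-tensor INT8 quantization of a KV block: int8 values
    plus a single fp32 scale, roughly halving memory versus fp16."""
    scale = max(float(np.abs(block_fp16).max()) / 127.0, 1e-8)
    q = np.clip(np.round(block_fp16 / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize_kv(q, scale):
    return (q.astype(np.float32) * scale).astype(np.float16)

# Toy block: [kv_heads, tokens, head_dim] in fp16.
block = np.random.randn(8, 4096, 128).astype(np.float16)
q, scale = quantize_kv(block)
print(block.nbytes // 1024, q.nbytes // 1024)  # int8 block is half the fp16 size
```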
Algorithmic advances complement the hardware. Work such as FlashAttention and other memory‑efficient attention variants has cut per‑token overhead, and newer attention‑routing schemes avoid touching the entire cache at every step. CPX’s value will come from turning those ideas into reliable, production‑grade throughput across large fleets.
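The routing idea can be sketched in a few lines: summarize each cached block cheaply, score the summaries against the current query, and attend only over the top‑k blocks. The routine below is a schematic of that idea, not any specific published method:

```python
import numpy as np

def select_kv_blocks(query, block_summaries, k=4):
    """Pick the k cached KV blocks most relevant to the current query.
    block_summaries: [n_blocks, head_dim], e.g. the mean key vector of each
    block, a cheap proxy for its content. Returns block indices."""
    scores = block_summaries @ query           # one dot product per block
    return np.argsort(scores)[-k:][::-1]       # top-k, best first

# Toy usage: a 1M-token context in 1,024-token blocks gives ~1,000 summaries.
rng = np.random.default_rng(0)
summaries = rng.standard_normal((1000, 128))
query = rng.standard_normal(128)
print(select_kv_blocks(query, summaries))      # attend only over these blocks
```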
Positioning on Nvidia’s roadmap
Rubin CPX belongs to Nvidia’s upcoming Rubin series and is expected to be available by the end of 2026. It is the next step in the company’s rapid cadence of AI accelerators: Hopper and H200 for training and inference, followed by Blackwell for higher-density compute, with Rubin pushing memory-centric inference further still. That momentum is underlined by the company’s data center business, which most recently pulled in $41.1 billion in quarterly revenue, according to its filings.
The competitive landscape is simmering. AMD’s Instinct MI300X emphasizes large HBM capacity for memory-bound workloads, and hyperscalers are weighing custom silicon to drive down cost per inference token. CPX is Nvidia’s answer: a defense of the high‑context tier built around not just FLOPs but end‑to‑end memory movement and orchestration.
Who needs million‑token windows
Developers increasingly want models that can ingest an entire repository so an assistant can reason across modules, tests, and docs without chunking. Video teams want accurate understanding over long timelines for creation and editing. Financial services and healthcare need to analyze years’ worth of records with fewer retrieval hops and better traceability. Enterprise interest is shifting from demos to operational SLAs: consistent latency, predictable cost per million tokens, and security for sensitive context.
Model providers are preparing too. Anthropic and OpenAI have shipped 100K–200K‑token contexts in production tiers, and research previews have featured even longer windows. As prompts and intermediate state grow, handling KV‑heavy inference efficiently is no longer a luxury but a requirement.
What to watch next
CPX will be judged on three numbers: tokens per second on million‑token contexts, energy per token, and effective capacity (how much context can be served at what cost while holding tight latency targets). Independent results from groups like MLCommons will matter once silicon is sampling.
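The first two of those numbers fall straight out of raw measurements; the figures in the sketch below are hypothetical, purely to show the arithmetic:

```python
def inference_metrics(tokens_generated, wall_clock_s, energy_joules):
    """Derive the headline per-token numbers from raw measurements."""
    return {
        "tokens_per_second": tokens_generated / wall_clock_s,
        "joules_per_token": energy_joules / tokens_generated,
    }

# Hypothetical run: 50k output tokens against a 1M-token context,
# 120 s wall clock at an average draw of 700 W.
print(inference_metrics(50_000, 120.0, 700 * 120.0))
# -> tokens_per_second ~417, joules_per_token ~1.68
```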
Equally important is software readiness. How quickly cloud providers and enterprises can deploy will depend on the depth of support in TensorRT‑LLM, Triton Inference Server, and popular runtimes. If Nvidia ships those components alongside the hardware, Rubin CPX could become the default target for long‑context inference, just as previous generations became the go‑to for model training.
Bottom line: getting past a million tokens genuinely expands what AI can do in the real world. CPX is Nvidia’s attempt to make that leap practical at scale, and to keep long‑context AI in the mainstream rather than confined to research demos.