Nvidia has introduced Rubin CPX, a new GPU tuned specifically for AI models that need to reason over million‑plus token contexts. Announced at the AI Infrastructure Summit, CPX is designed to accelerate long-context inference—think entire codebases, hour‑long videos, or vast research archives—without collapsing under memory pressure or ballooning latency.
Why long-context inference matters
Context length governs how much an AI system can “see” at once. Most production models today hover around 100K–200K tokens, with Google’s Gemini 1.5 Pro showcasing million‑token prompts. Stretching beyond that is more than a party trick: it reduces brittle retrieval hops, preserves narrative continuity, and improves tasks such as multi‑file code reasoning, complex legal analysis, and long‑form video understanding.

The catch is memory. Transformer inference stores key‑value (KV) caches for every layer and attention head, growing linearly with sequence length. At million‑token windows, KV caches balloon to sizes measured in terabytes for frontier models, making naive scaling impractical. That is the pain point Rubin CPX is built to address.
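A quick back-of-the-envelope calculation shows why. The sketch below estimates KV cache size at a one-million-token context; the model dimensions are illustrative assumptions, not a specific product's published specs.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for one sequence: keys + values,
    stored per layer and per KV head, at the given precision (fp16 here)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

# Illustrative frontier-scale config (assumed): 96 layers, head_dim 128.
# With grouped-query attention (8 KV heads) vs. full multi-head (96 KV heads).
gqa = kv_cache_bytes(1_000_000, n_layers=96, n_kv_heads=8, head_dim=128)
mha = kv_cache_bytes(1_000_000, n_layers=96, n_kv_heads=96, head_dim=128)

print(f"1M-token KV cache, GQA: {gqa / 1e9:.0f} GB")   # ~393 GB
print(f"1M-token KV cache, MHA: {mha / 1e12:.2f} TB")  # ~4.7 TB
```

Even with grouped-query attention, a single million-token request can outgrow one accelerator's memory, which is why the cache has to be managed, not just stored.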
Disaggregated inference, explained
Nvidia frames CPX as part of a “disaggregated inference” architecture—separating compute from memory and networking so each can scale independently. The idea: pool high‑bandwidth memory across accelerators, tier it with system RAM, and move KV caches intelligently between them without stalling throughput.
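As a minimal sketch of what such placement logic might look like, the toy policy below keeps hot cache segments in HBM and spills colder ones to host RAM. The tier capacities and the two-tier layout are assumptions for illustration; Nvidia has not published CPX's actual orchestration scheme.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryTier:
    name: str
    capacity_gb: float
    used_gb: float = 0.0

    def fits(self, size_gb: float) -> bool:
        return self.used_gb + size_gb <= self.capacity_gb

@dataclass
class KVPlacer:
    """Toy placement policy: hot (recently decoded) KV segments stay in HBM,
    colder prefix segments spill to host RAM and are prefetched before reuse."""
    hbm: MemoryTier = field(default_factory=lambda: MemoryTier("HBM", 128.0))
    host: MemoryTier = field(default_factory=lambda: MemoryTier("host RAM", 1024.0))

    def place(self, segment_gb: float, hot: bool) -> str:
        tier = self.hbm if hot and self.hbm.fits(segment_gb) else self.host
        tier.used_gb += segment_gb
        return tier.name

placer = KVPlacer()
print(placer.place(segment_gb=8.0, hot=True))    # -> "HBM"
print(placer.place(segment_gb=300.0, hot=False)) # -> "host RAM"
```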
Expect the approach to lean on high‑speed interconnects and software that knows when and where to place state. Nvidia’s inference stack already includes techniques such as KV cache paging, in‑flight batching, and quantization in TensorRT‑LLM, plus ecosystem tools like vLLM. Pairing those with a GPU designed for large sequence processing suggests CPX will target steady tokens‑per‑second even as contexts stretch past one million tokens.
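Paging is the piece that makes that oversubscription workable. The sketch below shows the core idea behind paged KV caches, in the spirit of vLLM's PagedAttention: fixed-size pages plus a per-sequence block table, so a long context need not occupy contiguous memory. It illustrates the concept only and is not the actual vLLM or TensorRT‑LLM data structure.

```python
PAGE_SIZE = 16  # tokens per KV page (typical block sizes are on the order of 16-32)

class PagedKVCache:
    """Maps each sequence's token positions onto fixed-size pages drawn
    from a shared free pool, so long contexts need not be contiguous."""
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> list of page ids

    def append_token(self, seq_id: int, position: int) -> tuple[int, int]:
        table = self.block_tables.setdefault(seq_id, [])
        if position // PAGE_SIZE >= len(table):        # sequence crossed into a new page
            table.append(self.free_pages.pop())
        page = table[position // PAGE_SIZE]
        return page, position % PAGE_SIZE              # physical (page, slot) for this token

cache = PagedKVCache(num_pages=1024)
for pos in range(40):
    cache.append_token(seq_id=0, position=pos)
print(cache.block_tables[0])  # three pages cover 40 tokens at PAGE_SIZE=16
```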
Algorithmic advances complement the hardware. Work like FlashAttention and memory‑efficient attention variants cut overhead per token, while sparse and selective attention schemes reduce the need to touch the entire cache at every decoding step. CPX's value will be in turning those ideas into consistent, production‑grade throughput across large fleets.
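As a rough illustration of why these variants help, the sketch below computes attention for one query over a long key/value cache in chunks with an online softmax, so the full score vector never has to materialize at once. This is the core trick behind FlashAttention-style kernels, shown here in plain NumPy rather than as a fused GPU kernel.

```python
import numpy as np

def chunked_attention(q, K, V, chunk=4096):
    """Single-query attention over a long KV cache, processed chunk by chunk
    with a running (online) softmax so working memory stays O(chunk)."""
    d = q.shape[-1]
    m = -np.inf             # running max of scores (numerical stability)
    denom = 0.0             # running softmax denominator
    acc = np.zeros_like(q)  # running weighted sum of values
    for start in range(0, K.shape[0], chunk):
        k, v = K[start:start + chunk], V[start:start + chunk]
        scores = k @ q / np.sqrt(d)
        new_m = max(m, scores.max())
        scale = np.exp(m - new_m)       # rescale previous partial sums
        w = np.exp(scores - new_m)
        denom = denom * scale + w.sum()
        acc = acc * scale + w @ v
        m = new_m
    return acc / denom

rng = np.random.default_rng(0)
q = rng.standard_normal(128)
K = rng.standard_normal((20_000, 128))
V = rng.standard_normal((20_000, 128))
s = K @ q / np.sqrt(128)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(chunked_attention(q, K, V), ref)
```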
Positioning on Nvidia’s roadmap
Rubin CPX is part of Nvidia’s forthcoming Rubin series and is slated for availability at the end of 2026. It follows the company’s rapid cadence of AI accelerators: Hopper (H100 and H200) for training and inference, then Blackwell for higher‑density compute, with Rubin pushing memory‑centric inference further. The company’s momentum is reflected in its data center business, which most recently delivered $41.1 billion in quarterly revenue, according to Nvidia’s filings.
The competitive backdrop is heating up. AMD’s Instinct MI300X emphasizes large HBM capacity for memory‑bound workloads, while hyperscalers are eyeing custom silicon to shave inference cost per token. CPX signals Nvidia’s intent to defend the high‑context tier by optimizing not just flops, but end‑to‑end memory movement and orchestration.
Who needs million‑token windows
Developers increasingly want models that can ingest an entire repository so the assistant can reason across modules, tests, and docs without chunking. Video teams want frame‑accurate understanding over long timelines for generation and editing. Financial services and healthcare want to analyze years of records with fewer retrieval hops and better traceability. Enterprise interest is moving from demos to operational SLAs: stable latency, predictable cost per million tokens, and security for sensitive context.
Model providers are preparing too. Anthropic and OpenAI have pushed 100K–200K contexts in production tiers, and research previews have shown even longer windows. As prompts and intermediate state expand, infrastructure tuned for KV‑heavy inference becomes a necessity rather than a luxury.
What to watch next
Three metrics will determine CPX’s impact: tokens‑per‑second at million‑token contexts, energy per token, and effective capacity (how much context you can serve at a given cost while keeping latency tight). Independent results from groups like MLCommons will matter once silicon is sampling.
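As a simple example of how the cost side of "effective capacity" gets reasoned about, the arithmetic below converts an assumed throughput and an assumed hourly instance price into cost per million tokens. Both inputs are placeholders for illustration, not CPX figures.

```python
def cost_per_million_tokens(tokens_per_second: float, price_per_hour: float) -> float:
    """Dollars per one million served tokens at steady-state throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# Placeholder numbers: a node sustaining 5,000 tok/s at $40/hour serves
# 18M tokens/hour, or roughly $2.22 per million tokens.
print(f"${cost_per_million_tokens(5_000, 40.0):.2f} per 1M tokens")
```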
Equally important is software readiness. Seamless integration with TensorRT‑LLM, Triton Inference Server, and popular runtimes will dictate how quickly cloud providers and enterprises can deploy. If Nvidia delivers those pieces alongside the hardware, Rubin CPX could become the default target for long‑context inference, much as past generations became the standard for model training.
Bottom line: pushing beyond a million tokens changes what AI can do in the real world. Rubin CPX is Nvidia’s bid to make that leap practical at scale—and to keep long‑context AI in the mainstream, not just in research demos.