Nvidia has unveiled Vera Rubin, a next-generation “superchip” system that pairs a new 88-core Vera CPU with two Rubin GPUs on a single board. Aimed squarely at hyperscale AI clusters, the architecture pushes for greater compute density, higher memory bandwidth, and tight CPU-GPU coherence as a way to accelerate both training and inference at enormous scale while reducing carbon output.
The Vera Rubin board carries two flagship Rubin GPUs, the 88-core Vera CPU, and an updated version of NVLink-C2C. Nvidia described the configuration as delivering significant gains in throughput over its current CPU-GPU pairing, a way to cut the data-movement penalties that dominate large-model performance. The module, according to the company, is the foundation for the next wave of AI machines rather than a self-contained accelerator card. Nvidia did not release specific clock rates or other design details.

Nvidia’s claims focused on Vera Rubin’s gains over the current Blackwell generation, citing performance-per-rack improvements and memory-driven speedups for mixture-of-experts models, retrieval-augmented generation, and multi-turn inference.
Memory design with HBM4 seeks to ease data movement
Memory is central to the design. Each Rubin GPU on the module is paired with 288GB of HBM4, feeding reduced-precision tensor math and inter-node communication. The larger on-package pool reduces back-and-forth shuffling between devices while giving each socket more bandwidth for transformer-scale models, a direction consistent with the roadmaps signaled by JEDEC and the memory manufacturers.
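For a sense of what 288GB per GPU buys, here is a rough, illustrative capacity check in Python. The bytes-per-parameter figures are the standard sizes for each precision; the arithmetic deliberately ignores KV cache, activations, optimizer state, and framework overhead.

```python
# Rough check: how many model parameters fit in one Rubin GPU's 288 GB of
# HBM4 at different precisions. Illustrative only; ignores KV cache,
# activations, and runtime overhead.

HBM4_PER_GPU_GB = 288  # per-GPU capacity cited for Rubin

BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "fp8": 1.0,
    "fp4": 0.5,
}

for precision, nbytes in BYTES_PER_PARAM.items():
    params_billion = HBM4_PER_GPU_GB * 1e9 / nbytes / 1e9
    print(f"{precision:>9}: ~{params_billion:.0f}B parameters per 288 GB GPU")
```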
CPU memory via CAMM2 and faster NVLink-C2C coherence
On the CPU side, Nvidia highlighted support for socket-attached memory using CAMM2 modules, with vendors exploring configurations of up to 2TB per superchip for memory-heavy pipelines. CAMM2, a standard developed by JEDEC, lets server manufacturers install dense, low-profile memory next to the processor while avoiding the serviceability limitations of soldered LPDDR and keeping comparable signal integrity. The enhanced NVLink-C2C link between Vera and Rubin is intended to extend a coherent memory domain across the CPU and GPUs. Together, these changes should make pointer chasing, data preprocessing, and host-device transfers more efficient, a bottleneck frequently cited by cloud providers and reflected in MLCommons scaling notes.
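A back-of-envelope sketch of why the link matters: the snippet below compares an idealized host-to-device transfer over PCIe Gen5 x16 against a much faster coherent link. The 900 GB/s figure is borrowed from the earlier Grace-Hopper NVLink-C2C spec as a stand-in assumption, since no bandwidth number for Vera Rubin's link is given here, and the 32GB batch size is hypothetical.

```python
# Idealized host-to-device transfer times; ignores latency and protocol
# overhead. Bandwidth figures are assumptions for illustration.

def transfer_seconds(gigabytes: float, link_gb_per_s: float) -> float:
    """Time to move `gigabytes` over a link with the given bandwidth."""
    return gigabytes / link_gb_per_s

batch_gb = 32  # hypothetical preprocessed batch staged in CPU memory

links = [
    ("PCIe Gen5 x16 (~64 GB/s)", 64),
    ("coherent C2C link (assumed 900 GB/s)", 900),
]

for name, bandwidth in links:
    ms = transfer_seconds(batch_gb, bandwidth) * 1000
    print(f"{name}: ~{ms:.0f} ms per {batch_gb} GB batch")
```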
System scale, NVL144 performance, and network topology
First, system scale and performance. Nvidia previewed a full rack-scale configuration called NVL144, combining 144 Rubin GPUs. The published headline figures were roughly 3.6 exaflops of FP4 for inference and about 1.2 exaflops of FP8 for training, which Nvidia says puts NVL144 at roughly 3.3 times the capability of its current NVL72 class. Since that is how cloud operators deploy hardware today, by node, rack, and pod, the focus on whole-fabric performance makes sense.
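A quick sanity check on those headline numbers, with the NVL72-class baseline inferred from the stated 3.3x ratio rather than quoted directly:

```python
# Derive per-GPU figures and the implied NVL72-class baseline from the
# published rack-scale numbers.

NVL144_FP4_EXAFLOPS = 3.6   # inference headline figure
NVL144_FP8_EXAFLOPS = 1.2   # training headline figure
GPUS = 144

per_gpu_fp4_pflops = NVL144_FP4_EXAFLOPS * 1000 / GPUS
per_gpu_fp8_pflops = NVL144_FP8_EXAFLOPS * 1000 / GPUS
implied_nvl72_fp4 = NVL144_FP4_EXAFLOPS / 3.3  # baseline implied by the 3.3x claim

print(f"FP4 per GPU: ~{per_gpu_fp4_pflops:.0f} PFLOPS")
print(f"FP8 per GPU: ~{per_gpu_fp8_pflops:.1f} PFLOPS")
print(f"Implied NVL72-class FP4 baseline: ~{implied_nvl72_fp4:.1f} exaflops")
```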

Finally, interconnect scope. In MLPerf submissions and cloud deployments, Nvidia’s networking stack – NVLink domains combined with high-radix switches and Ethernet or InfiniBand fabrics – has been a decisive ingredient. If NVLink domains grow larger or are partitioned more densely in Vera Rubin systems, the effective performance available at scale could matter more than the published FLOP comparisons on their own.
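A toy comparison illustrates the point: with hypothetical scaling efficiencies (neither figure comes from Nvidia or MLPerf), a pod with lower peak FLOPs but a stronger fabric can deliver more effective throughput.

```python
# Effective throughput = peak compute * fabric scaling efficiency.
# All numbers below are hypothetical, for illustration only.

def effective_exaflops(peak_exaflops: float, scaling_efficiency: float) -> float:
    return peak_exaflops * scaling_efficiency

pods = {
    "higher peak, weaker fabric": (3.6, 0.70),
    "lower peak, stronger fabric": (3.0, 0.90),
}

for name, (peak, efficiency) in pods.items():
    print(f"{name}: {effective_exaflops(peak, efficiency):.2f} effective exaflops")
```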
Why Vera Rubin matters for next-generation data centers
The industry’s bottleneck has shifted from raw compute to data movement and memory capacity. Larger context windows, longer sequences, expert routing, and more all penalize architectures with narrow memory pipelines or small on-device footprints. By pairing large HBM4 pools with CPU-side CAMM2 capacity and faster C2C links, Vera Rubin aims to reduce the cost of training and serving the newest models.
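One way to see that memory pressure is KV-cache growth with context length. The sketch below uses a hypothetical dense model (80 layers, 8,192 hidden width, FP8 cache, no grouped-query attention), not any specific product; grouped-query attention would shrink these figures considerably.

```python
# KV-cache size for a hypothetical dense transformer. Assumes the K/V width
# equals the hidden size (full multi-head attention, no GQA).

LAYERS = 80
HIDDEN = 8192
BYTES_PER_ELEM = 1  # FP8 cache

def kv_cache_gb(context_tokens: int) -> float:
    per_token_bytes = 2 * LAYERS * HIDDEN * BYTES_PER_ELEM  # K and V per layer
    return per_token_bytes * context_tokens / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>8} tokens: ~{kv_cache_gb(ctx):.0f} GB of KV cache per sequence")
```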
Power and cooling will also be watched closely. Operators are increasingly embracing liquid cooling and new power-delivery designs to keep rack density high. Recent Nvidia systems have pushed direct liquid cooling, so it would be surprising if the Vera Rubin platforms did not demand a tighter thermal envelope per rack unit. Nvidia also teased a larger configuration, Rubin Ultra NVL576, for a subsequent cycle, with more GPUs per pod, HBM4e memory, and roughly four times the performance of the initial tier. The message for cloud buyers is that the upgrade cadence will stay rapid.
The competitive landscape is also heating up. AMD has outlined next-generation accelerators with more HBM and coherent-memory features, and Intel continues to push Gaudi for training and inference. Recent MLPerf rounds show that software maturity, kernel fusion, and interconnect design can move results as much as raw silicon does. Nvidia’s bet is that a tightly integrated CPU-GPU architecture, fully coherent memory, and a mature CUDA and networking stack will keep it ahead.
Key questions to watch for the Vera Rubin system
- Pricing, availability, and software maturity for FP4 and FP8 paths will determine real-world ROI.
- How gracefully existing models and frameworks adopt lower-precision inference, how quickly compilers exploit C2C bandwidth, and how NVLink domains scale across racks will drive utilization gains.
For operators weighing TCO, Vera Rubin reads for now like Nvidia’s most aggressive swing yet at the memory wall: more capacity on package, more bandwidth between chips, more performance per rack. If those promises hold up in the field, the next generation of AI data centers could look a lot less like a collection of separate computers and a lot more like one.