Arm C1 and Mali G1: Inside the next-gen mobile leap

Arm's new C1 CPU family and Mali G1 GPU are not typical incremental annual updates. They signal a sharper focus on sustained performance, efficiency and on-device AI, along with a deliberate shift toward more turnkey platform designs. According to Arm briefings and the company's technical collateral, they appear poised to reshape premium and upper-mid mobile silicon over the coming cycle.

Table of Contents

C1 CPUs: Architecture, tiers, and actual gains
SME2: CPU-side acceleration for on-device AI
Mali G1: Optimization fixes and a smarter RT unit
Platforms, partners and what to watch

C1 CPUs: Architecture, tiers, and actual gains

The C1 series drops the Cortex branding in favor of four core tiers: Ultra, Performance, Pro, and Nano. All implement Armv9.3, which largely ends the mix-and-match approach of last year's Cortex-X and Cortex-A parts. There are three distinct microarchitectures under the hood: C1-Ultra and C1-Performance serve as the big cores, C1-Pro as the middle core, and C1-Nano handles background efficiency work.

Arm's own numbers point to a meaningful, if modest, uplift. At iso-configuration, C1-Ultra is ~12% faster than Cortex-X925. Moving to 3nm, with clock headroom extending from the 3.7-3.9GHz range up to 4.1GHz, brings single-thread gains into the region of 25% relative to last year's 3.6GHz-class parts. Perhaps more importantly, C1-Ultra can match the previous generation's peak performance while consuming 28% less power, a win users will notice in thermals and battery life.
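As a rough sanity check on how those figures fit together, the sketch below compounds the ~12% matched-configuration gain with the clock bump from 3.6GHz to 4.1GHz. It assumes performance scales linearly with frequency, which real workloads rarely manage, so the result is best read as an upper bound. The function and constant names are illustrative, not anything from Arm.

```python
# Back-of-the-envelope check of the single-thread figures quoted above.
# Assumption: performance scales linearly with clock speed. Real workloads
# usually fall short of this because memory latency does not shrink with
# frequency, so treat the result as an upper bound.

ISO_CONFIG_UPLIFT = 0.12  # C1-Ultra vs Cortex-X925 at matched configuration
OLD_CLOCK_GHZ = 3.6       # last year's clock class
NEW_CLOCK_GHZ = 4.1       # top of the quoted clock headroom


def combined_uplift(iso_uplift: float, old_ghz: float, new_ghz: float) -> float:
    """Return the fractional speedup from a same-clock gain plus a clock bump."""
    clock_ratio = new_ghz / old_ghz
    return (1.0 + iso_uplift) * clock_ratio - 1.0


uplift = combined_uplift(ISO_CONFIG_UPLIFT, OLD_CLOCK_GHZ, NEW_CLOCK_GHZ)
print(f"Upper-bound single-thread uplift: {uplift:.1%}")
```

Under these assumptions the compounded figure lands in the high 20s, which is consistent with a quoted gain of around 25% once imperfect frequency scaling is taken into account.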

How is it achieved? Most of the heavy lifting happens in the front end. Arm has increased the out-of-order window to about 2,000 in-flight instructions (up from ~1,500) and has widened L1 instruction bandwidth by a third. The focus is on a higher-throughput design that keeps the existing execution resources better fed. C1-Performance follows Ultra's philosophy but scales it down to an approximately 35% smaller area footprint, aimed at more cost-sensitive flagships.

The C1-Pro mid core is all about smarter prediction and faster access: a larger branch predictor and BTB to reduce mispredicts, higher L1 data bandwidth, and a lower-latency L2 TLB.

SME2: CPU-side acceleration for on-device AI

New this generation is SME2, an AI-centric extension designed to speed up common inference workloads on the CPU. It introduces multi-vector operations, predicates, 2/4-bit weight compression, and support for binary networks. Notably, SME2 is a resource shared across the core complex, and its position in the pipeline, the size of the execution block, and its power-gating behavior can all be tuned by the implementer.

The design has two major upsides: every cluster gets the same AI capability without bloating each individual core, and it scales cleanly from budget to premium layouts.

Crucially, software is lining up. SME2 paths are enabled in Google's XNNPACK for Android and are supported by popular frameworks such as llama.cpp, Alibaba's MNN, and Microsoft's ONNX Runtime. Arm's KleidiAI libraries allow developers to benefit from SME2 as soon as devices ship, minimizing the lag between hardware deployment and useful app-level speedups.
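To make the 2/4-bit weight compression mentioned earlier more concrete, here is a generic sketch of how model weights can be quantized to signed 4-bit values and packed two per byte. This is not SME2's actual storage format or instruction behavior, which the article does not describe; the function names and the single shared scale factor are assumptions chosen for illustration.

```python
# Generic illustration of 4-bit weight storage (NOT the SME2 format).
# Two signed 4-bit values share one byte, halving memory traffic versus int8.


def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to signed 4-bit integers in [-8, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def pack_nibbles(q: list[int]) -> bytes:
    """Pack pairs of signed 4-bit values into bytes, low nibble first."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count
    out = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)


def unpack_nibbles(data: bytes, count: int) -> list[int]:
    """Recover the first `count` signed 4-bit values from packed bytes."""
    vals: list[int] = []
    for b in data:
        for nib in (b & 0x0F, b >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)  # sign-extend
    return vals[:count]


q, scale = quantize_4bit([1.0, -0.4, 0.2, 0.0, -1.0])
packed = pack_nibbles(q)
restored = [v * scale for v in unpack_nibbles(packed, len(q))]
print(len(packed), "bytes for", len(q), "weights")
```

The appeal of this kind of scheme on a phone is mostly bandwidth: fewer bytes per weight means less data pulled through the memory hierarchy for each inference step, at the cost of some precision.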

Mali G1: Optimization fixes and a smarter RT unit

On the graphics front, Mali G1 also rearranges the stack into Ultra, Premium and Pro tiers. At like-for-like core counts (Arm compares a 14-core G1-Ultra to last year's fastest part), Arm claims ~20% better performance for gaming and ML inference, and ~9% less energy per frame. A redesigned on-chip interconnect doubles internal bandwidth, while Image Region Dependencies allow the GPU to skip unnecessary work and reduce memory traffic, a useful trick for keeping the ALUs busy without spiking power.

Ray tracing is the headline feature. G1 introduces hardware BVH traversal and adopts a unified single-ray algorithm optimized for small memory footprints. It merges ray casting and intersection tests, and the ray tracing unit can be completely power-gated when idle. Arm's best-case claims are close to 2x ray tracing throughput, although real scenes differ. In Arm's own Unreal Engine-based test, the uplift appeared closer to ~40%, with frame rates hovering around the mid-30s fps with occasional drops. Translation: you'll get noticeably better ray tracing, but not a miraculous doubling across the board.
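One way to reason about the gap between a 2x best case and a ~40% observed uplift is Amdahl's law: only the part of the frame spent on ray tracing benefits from the faster unit. The sketch below is a simplified model, assuming the ~40% figure is an overall frame-rate gain and that the RT work itself speeds up by the full 2x. Neither assumption is confirmed by Arm, and the function names are illustrative.

```python
# Amdahl's-law sketch: how much of a frame must be ray tracing work for a
# 2x faster RT unit to yield a given overall speedup?
# Assumptions (not from Arm): the ~40% figure is whole-frame speedup and
# the RT portion accelerates by exactly 2x.


def overall_speedup(rt_fraction: float, rt_speedup: float) -> float:
    """Whole-frame speedup when only `rt_fraction` of frame time is accelerated."""
    return 1.0 / ((1.0 - rt_fraction) + rt_fraction / rt_speedup)


def implied_rt_fraction(observed_speedup: float, rt_speedup: float) -> float:
    """Invert Amdahl's law to find the accelerated share of frame time."""
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / rt_speedup)


fraction = implied_rt_fraction(observed_speedup=1.4, rt_speedup=2.0)
print(f"Implied RT share of frame time: {fraction:.0%}")
```

Under this model, a scene would need to spend a little over half its frame time on ray tracing to see a 40% gain from a 2x faster RT unit, which illustrates why scenes dominated by rasterization, shading, or memory stalls see far less than the headline figure.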

Branding maps 1:1 to scale: 10 or more cores, with ray tracing performance scaling to match, is G1-Ultra; 6-9 cores is G1-Premium; and 1-5 cores is G1-Pro, which targets mainstream to entry-level devices.
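The core-count thresholds above amount to a simple lookup, sketched here. The function name is hypothetical; only the tier names and core ranges come from the text.

```python
def g1_tier(shader_cores: int) -> str:
    """Return the Mali G1 tier name for a given shader core count.

    Thresholds follow the article: 10+ is Ultra, 6-9 is Premium, 1-5 is Pro.
    """
    if shader_cores < 1:
        raise ValueError("a GPU needs at least one shader core")
    if shader_cores >= 10:
        return "G1-Ultra"
    if shader_cores >= 6:
        return "G1-Premium"
    return "G1-Pro"


for cores in (14, 8, 4):
    print(cores, "cores ->", g1_tier(cores))
```

For instance, the 14-core configuration Arm uses in its comparisons falls in the Ultra tier, while an 8-core design would be branded Premium.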

Platforms, partners and what to watch

Beyond the individual IP blocks, Arm is pushing its Lumex CSS integrated platform design initiative, which sees the company working more closely with its foundry partners, TSMC among them, to shrink time-to-market. A top-end example layout from Arm's own reference material consists of two 4.1GHz C1-Ultra cores, six 3.5GHz C1-Pro cores, two SME2 units and 16MB of L3, alongside a 14-core Mali G1-Ultra with generous caches, all fabricated on a 3nm process. It's a memory-rich, performance-first template; commercial products will probably trim cache sizes to control cost.

Cost-optimized near-flagships could replace the big cores with C1-Performance and scale down the GPU, while mid-tier designs could use 1× Ultra/Performance + 3× Pro + 4× Nano, or 2× Pro + 6× Nano for mainstream parts.

Expect early adoption from vendors with a track record of shipping products based on Arm's latest CPU and GPU IP on fast cycles.

Bottom line: C1 delivers a solid single-thread uplift along with better efficiency, while SME2 and G1's redesigned RT unit address weaknesses in specific high-impact workloads. How large the real-world wins turn out to be will depend on software support, thermal envelopes, and memory bandwidth in shipping devices, all of which are worth watching when the first phones arrive.