
Carmack Says DGX Spark Hits Half Its Promised Performance

By Gregory Zuckerman
Last updated: October 29, 2025 1:54 pm
Technology · 7 Min Read

John Carmack is firing shots at Nvidia’s latest compact AI box, saying the real-world results don’t live up to the pitch. After hands-on testing, the legendary engineer reported that the DGX Spark delivered roughly half the compute performance he anticipated, while drawing less than half its rated power on non-burst workloads and showing clear signs of thermal constraint under sustained load. The findings have been echoed by other developers who question how Nvidia is defining performance for its small-form-factor AI systems.

Carmack on Power, Throughput, and Sustained Performance Gaps

In Carmack’s workloads, the DGX Spark’s sustained power draw hovered around 100W, well off the ~240W figure Nvidia quotes for full-tilt operation. He also notes that reliable AI throughput came in at about half the headline figure, and that during extended sessions the system ran very hot, a sign of clocks dropping to stay within thermal limits. Coming from an engineer whose reputation rests on optimization work at id Software and Oculus, the criticism resonates with developers for whom sustained performance, not peak specs, is what actually matters.

Table of Contents
  • Carmack on Power, Throughput, and Sustained Performance Gaps
  • What Nvidia Originally Pitched With the Compact DGX Spark
  • The Fine Print on a Petaflop Claim and What It Implies
  • Developers Echo Concerns After Hands-On Tests With DGX Spark
  • Possible Explanations and Next Steps for Clearer Performance Data
Image: DGX Spark hits half its promised performance, as highlighted by Carmack.

What Nvidia Originally Pitched With the Compact DGX Spark

Launched in mid‑October, the DGX Spark pairs a 20-core Arm-based Nvidia Grace CPU with a Blackwell-class GPU in a “GB10” superchip package. Nvidia is pitching it as a desk-friendly AI workstation, with thousands of CUDA cores and 128GB of shared LPDDR5X memory in place of discrete VRAM, along with 4TB of NVMe storage and full access to its development stack, including CUDA, cuDNN, and TensorRT. Nvidia has promised no less than 1 petaflop of AI performance in a compact thermal envelope, with systems generally priced around $4,000 and some partners targeting as little as $3,000.

The Fine Print on a Petaflop Claim and What It Implies

That 1‑petaflop headline almost certainly refers to low‑precision math (usually FP8 or INT8) in optimal cases and with structured sparsity enabled. Since Ampere, Nvidia has supported 2:4 sparsity, which skips zeroed weights in a model and can double effective throughput on paper. The catch: many real-world production workloads are not pruned to those patterns, and the benefit can evaporate with dense layers, attention blocks, or custom kernels. In such cases, the marketing FLOPS and the FLOPS you can actually sustain on real data diverge.
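As a rough illustration of how those qualifiers compound, here is a back-of-envelope sketch. The numbers are assumptions chosen for the arithmetic, not Nvidia’s published specs.

```python
# Back-of-envelope sketch (illustrative numbers, not official specs): how a
# "1 PFLOP" headline shrinks once sparsity and precision assumptions stop
# applying to a dense, real-world workload.

headline_pflops = 1.0       # marketing figure: low precision *with* 2:4 structured sparsity
sparsity_speedup = 2.0      # 2:4 sparsity roughly doubles throughput, but only on pruned layers
dense_fp8_pflops = headline_pflops / sparsity_speedup   # dense FP8: no sparsity benefit
dense_fp16_pflops = dense_fp8_pflops / 2                # roughly halves again at FP16

print(f"Dense FP8:  ~{dense_fp8_pflops:.2f} PFLOPS")
print(f"Dense FP16: ~{dense_fp16_pflops:.2f} PFLOPS")
```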

Memory behavior compounds the gap. Unified LPDDR5X lets large models and experiments fit locally, but its bandwidth and latency profile falls well short of high‑end HBM or GDDR setups. If a model is memory‑bound or spilling activations, the GPU’s tensor cores won’t be saturated; measured throughput will fall below the theoretical ceiling even on short benchmarks, and lag by a wider margin in long, thermally constrained runs.
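A minimal roofline-style sketch makes the point: when a kernel is memory-bound, achievable throughput is capped by bandwidth times arithmetic intensity rather than peak FLOPS. All figures below are assumptions for illustration, not measured DGX Spark specs.

```python
# Roofline-style estimate: achievable throughput is the smaller of the compute
# peak and (memory bandwidth x arithmetic intensity). Illustrative numbers only.

peak_tflops = 500.0       # assumed dense low-precision peak, in TFLOPS
bandwidth_tb_s = 0.273    # assumed unified LPDDR5X bandwidth, in TB/s
flops_per_byte = 300.0    # arithmetic intensity of the kernel (FLOPs per byte moved)

achievable_tflops = min(peak_tflops, bandwidth_tb_s * flops_per_byte)
bound = "memory-bound" if achievable_tflops < peak_tflops else "compute-bound"
print(f"Achievable: ~{achievable_tflops:.0f} TFLOPS ({bound})")
```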

Developers Echo Concerns After Hands-On Tests With DGX Spark

VideoCardz reported unflattering results: developers of Apple’s MLX framework saw around 60 TFLOPS on the DGX Spark where they expected close to four times that. That tracks with Carmack’s experience, and with what engineers who have run small Blackwell‑based systems through training loops and inference-heavy pipelines report. The recurring theme: short bursts of peak-friendly work look fine, but sustained loads expose power caps, conservative firmware, or thermal-headroom ceilings that bottleneck performance.
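For context on the burst-versus-sustained distinction, below is a minimal sketch of the kind of throughput check developers run. It assumes PyTorch with a working CUDA device; the matrix size and durations are arbitrary, and it is not a standardized benchmark.

```python
# Hypothetical sketch: compare short-burst vs sustained FP16 matmul throughput.
# Assumes PyTorch with CUDA support; sizes and durations are illustrative.
import time
import torch

def matmul_tflops(n: int = 8192, seconds: float = 10.0) -> float:
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    start, iters = time.time(), 0
    while time.time() - start < seconds:
        _ = a @ b
        iters += 1
    torch.cuda.synchronize()                      # wait for queued kernels to finish
    elapsed = time.time() - start
    return (2 * n**3 * iters) / elapsed / 1e12    # ~2*n^3 FLOPs per matmul

print(f"Burst (10 s):       {matmul_tflops(seconds=10):.1f} TFLOPS")
print(f"Sustained (10 min): {matmul_tflops(seconds=600):.1f} TFLOPS")
```

If the sustained number sags well below the burst number, thermal or power limits are the usual suspects.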

Image: Two stacked gold-textured compact units, rear ports visible, including USB-C, HDMI, Ethernet, and USB-A.

Let me be clear: none of this makes DGX Spark worthless. A quiet, low‑footprint box with 128GB of unified memory and well-supported CUDA tooling is an attractive developer platform for many teams. The problem is expectations management: if a lab or start-up budgets against “1 PFLOP at 240W,” the plan unravels when dense, sustained jobs deliver roughly half that.

Possible Explanations and Next Steps for Clearer Performance Data

Several potential culprits beyond marketing math could be at work: factory power profiles that cap draw for acoustics, firmware that favors longevity over clocks, early drivers and kernels that leave performance on the table, and small-chassis thermals that sag during marathon sessions. Transparent statements about precision, sparsity, and sustained (not burst) throughput would help buyers set expectations grounded in reality.
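One way to separate these causes is to log power draw, clocks, and temperature while a long job runs. Here is a hedged sketch using nvidia-smi polling; it assumes nvidia-smi is available on the system under test and exposes these standard query fields.

```python
# Sketch: poll nvidia-smi during a sustained run to see whether power, SM
# clocks, or temperature explain a throughput gap. Assumes nvidia-smi is on
# PATH; the query fields and CSV format flags are standard nvidia-smi options.
import subprocess
import time

def sample_gpu(interval_s: float = 5.0, samples: int = 12) -> None:
    fields = "power.draw,clocks.sm,temperature.gpu"
    for _ in range(samples):
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        power_w, sm_mhz, temp_c = (v.strip() for v in out.split(","))
        print(f"{power_w} W | {sm_mhz} MHz | {temp_c} C")
        time.sleep(interval_s)

sample_gpu()  # run alongside the workload to catch throttling over time
```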

Independent testing is the quickest way to cut through the clutter. Community results from standardized suites such as MLPerf Inference and Training, or openly reproducible benchmarks across FP8 and FP16 with sparsity both enabled and disabled, would show whether Carmack’s numbers are an edge case or the norm. Nvidia has not publicly responded to these claims in any meaningful way; a full breakdown of the power modes, thermal targets, and scenarios behind its petaflop figure would help.

Until then, potential buyers should think of DGX Spark as a decent edge AI box with some caveats: impressive on paper and flexible for development, but unlikely to realize its headline numbers across dense sustained workloads without careful tuning.

As Carmack’s criticism emphasizes, the metric that ultimately matters is not peak flops — it is what you can sustain over time, at the wall, with models you actually want to run.

By Gregory Zuckerman
Gregory Zuckerman is a veteran investigative journalist and financial writer with decades of experience covering global markets, investment strategies, and the business personalities shaping them. His writing blends deep reporting with narrative storytelling to uncover the hidden forces behind financial trends and innovations. Over the years, Gregory’s work has earned industry recognition for bringing clarity to complex financial topics, and he continues to focus on long-form journalism that explores hedge funds, private equity, and high-stakes investing.