Thinking Machines Lab is targeting one of the most persistent quirks of modern AI: ask the same model the same query, and you might get different answers. The research group, led by former OpenAI executive Mira Murati, has already started sketching out a technical agenda for making the outputs of large language models repeatable, not just statistically similar but reproducible down to the token.
Why AI answers shift
Model response randomness is often attributed to sampling choices, such as temperature or top-k. That’s not the whole story. As researcher Horace He explains in a new post on the lab’s “Connectionism” blog, even with temperature set to zero and a fixed seed, results can shift because of what is happening under the hood in GPU kernels. The order of floating‑point operations, atomic updates, and parallel reductions can differ between runs, and tiny numerical differences can cascade into divergent tokens.
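A minimal illustration of that non-associativity (mine, not the lab’s code): summing the same million floats under two different groupings yields totals that disagree in the low-order bits, exactly the kind of drift that can tip a near-tied logit.

```python
import torch

# Floating-point addition is not associative, so the grouping a kernel
# chooses for its partial sums can nudge the result in the last bits.
torch.manual_seed(0)
x = torch.randn(1_000_000)

total_a = x.sum()                                    # one reduction order
total_b = sum(chunk.sum() for chunk in x.chunk(97))  # a different grouping

print(f"{total_a.item():.10f} vs {total_b.item():.10f}")
# The totals typically differ by a few ulps; at a near-tied logit, a gap
# that small can flip the argmax and send decoding down another path.
```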
This isn’t hypothetical. Framework documentation (PyTorch’s, for example) notes that certain GPU operations are non‑deterministic purely as a result of performance‑focused algorithm choices. Nvidia’s libraries also provide deterministic modes or algorithm flags for certain ops, but in most cases full determinism across an entire inference pipeline cannot be guaranteed. The problem is exacerbated for models distributed across multiple GPUs or nodes, where scheduling jitter and kernel‑fusion decisions add further variability.
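For reference, and as a sign of how partial today’s guarantees are, these are determinism knobs PyTorch itself documents; a sketch of the opt-in configuration, not the lab’s approach:

```python
import os
import torch

# Opt-in determinism settings documented by PyTorch; they constrain many
# ops but do not guarantee bitwise-identical results end to end.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic cuBLAS ops
torch.use_deterministic_algorithms(True)   # raise an error on known non-deterministic ops
torch.backends.cudnn.deterministic = True  # force deterministic cuDNN algorithms
torch.backends.cudnn.benchmark = False     # skip algorithm auto-tuning between runs
torch.manual_seed(1234)                    # fix sampling randomness on top
```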
The kernel-level solution being tested
The idea behind Thinking Machines Lab’s work is simple: tame the orchestration layer that wrangles kernels into place, and you can bound numerical variance tightly enough to keep the decoding path identical. In practice this involves restricting reduction orders, avoiding non‑deterministic atomics, selecting algorithms carefully, and enforcing consistent precision paths end to end. It also means treating the runtime as a first‑class product, not a black box.
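One illustrative ingredient of what “restricting reduction orders” can mean, sketched under my own naming rather than taken from the lab: a summation whose pairing pattern is fixed by construction instead of left to whatever split the runtime picks.

```python
import torch

def fixed_tree_sum(x: torch.Tensor) -> torch.Tensor:
    """Sum a tensor with a pairwise-tree order that is fixed by construction."""
    x = x.flatten()
    while x.numel() > 1:
        if x.numel() % 2:                       # pad odd lengths with a zero so
            x = torch.cat([x, x.new_zeros(1)])  # the pairing pattern never shifts
        x = x[0::2] + x[1::2]                   # combine the same pairs every run
    return x[0]

# The same input now goes through the same grouping, and hence the same bits,
# regardless of how a performance-tuned reduction might otherwise split it.
print(fixed_tree_sum(torch.randn(1_000_003)).item())
```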
The claim, he says, is that a carefully constructed inference stack, from graph compilation to the order of kernel launches, can make outputs deterministic without retraining the model. The trade-off is that some throughput and latency headroom may be left on the table to achieve that stability. For businesses that need auditability, and researchers who need reproducibility, that is a trade many will gladly make.
Why determinism matters now
Consistency is more than a wonkish academic nicety. For regulated industries, a model that changes its output because a kernel reordered its additions is a governance nightmare. Banks and medical providers increasingly request audit trails that tie an input to a particular output and model state. That is, if you re-run an inference with the same prompt, weights, and seed, you should get the same tokens; anything else adds complexity to compliance reviews and incident response.
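A sketch of what such an audit trail could look like, with hypothetical helper and field names of my own: fingerprint the prompt, model version, seed, and the exact output tokens, then require a re-run to reproduce the digest.

```python
import hashlib
import json

def audit_record(prompt: str, model_version: str, seed: int, tokens: list[int]) -> dict:
    """Hypothetical audit-trail entry: fingerprint the exact inputs and the
    exact output tokens so a later re-run can be checked bit for bit."""
    payload = {"prompt": prompt, "model": model_version, "seed": seed, "tokens": tokens}
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return {**payload, "sha256": digest}

# A compliance re-run passes only if it reproduces the same digest.
record = audit_record("Summarize the Q3 risk report", "model-v1.2", 0, [101, 2023, 2003])
rerun  = audit_record("Summarize the Q3 risk report", "model-v1.2", 0, [101, 2023, 2003])
assert record["sha256"] == rerun["sha256"]
```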
Determinism also smooths training workflows. Reinforcement learning systems reward or punish behavior based on sampled outputs; if output variability is due to runtime rather than policy, the reward signal becomes noisy. He writes that cleaner, reproducible inference could tighten and speed up RL fine‑tuning and offline evaluation. That dovetails with reporting from The Information that the lab wants to deploy RL to customize models for businesses.
Even evaluation benchmarks benefit. Teams running nightly “evals” frequently chase ghost regressions caused by non‑deterministic kernels rather than real model drift. A deterministic stack turns those tests into true canaries, catching data or code changes rather than hardware quirks.
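A rough sketch of such a canary, assuming a hypothetical `generate` callable that maps a prompt to a list of token ids: re-run the failing prompt a few times and only file a regression if the runs actually agree with one another.

```python
def is_runtime_noise(generate, prompt: str, runs: int = 3) -> bool:
    """Before blaming the model for a nightly-eval regression, re-run the same
    prompt several times: if the token ids disagree across runs, the culprit
    is non-deterministic inference, not a data or code change."""
    outputs = [tuple(generate(prompt)) for _ in range(runs)]
    return len(set(outputs)) > 1  # True means the runtime itself is noisy

# Hypothetical usage inside an eval harness:
# if is_runtime_noise(model.generate_ids, failing_prompt):
#     quarantine(failing_prompt)  # flag as flaky instead of filing a regression
```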
How it fits the lab’s broader play
Thinking Machines Lab has drawn outsized attention, and capital, by hiring high‑profile researchers and positioning itself as a builder of foundational infrastructure. Investors have backed a vision that encompasses new models as well as the plumbing to run them reliably at scale. Public commentary from Murati indicates the initial product will focus on researchers and startups building their own systems, a group that currently struggles with reproducibility across clouds and hardware.
The lab says it plans to publish technical notes and code early and often via Connectionism, part of a broader shift toward open research that many would like to see across the field but that few heavy hitters consistently practice. If its determinism work ships as a baked-in inference runtime or kernel library, it could become a de facto reference layer for teams that require verifiable outputs but don’t want to rebuild their stack on a different framework.
What to watch
Determinism is the kind of claim that’s easy to make and hard to support. Concrete proof would be stable pass@1 results across repeated runs, identical token sequences on identical GPUs, and at least reasonable parity across GPU SKUs in a documented deterministic mode. Replication by academic labs or by MLCommons members would add independent credibility.
There is also a market question: can a deterministic stack hold up under real production loads without a painful performance hit? Nvidia has been adding deterministic options to its libraries, and framework maintainers have been steadily offering deterministic alternatives to non‑deterministic ops, but end‑to‑end guarantees remain rare. A vendor that productizes such guarantees, and backs them up with transparent tests, would be filling an obvious hole in the AI stack.
If Thinking Machines Lab can consistently tame non-determinism at the kernel level, the payoff will be more than models that feel more reliable. It could set a new pattern for how inference is built, and it may give enterprises the confidence to place large models at the foundation of systems where repeatability isn’t optional.