
RadixArk Spins Out From SGLang At $400M Valuation

By Gregory Zuckerman
Last updated: January 22, 2026 12:09 am
Business
6 Min Read

RadixArk, a new commercial venture born from the popular open source project SGLang, has spun out at a valuation of about $400 million, according to people familiar with the matter. The move underscores how fast the AI inference market is expanding as enterprises race to squeeze more performance and lower latency out of existing GPU fleets.

SGLang, used by teams at companies such as xAI and Cursor, has built momentum as an engine for running large language models more efficiently. Several core contributors have now shifted to RadixArk, positioning the startup to productize the technology while continuing to support the open source codebase.

Table of Contents
  • From Open Source Roots To Commercial Play
  • Betting On Inference Efficiency To Cut Costs
  • A Crowded But Expanding Market For AI Inference
  • Why A $400M Price Tag Might Add Up For RadixArk
  • What To Watch Next As RadixArk Builds Momentum
[Image: RadixArk logo with the slogan "SHIP AI FOR ALL"]

From Open Source Roots To Commercial Play

Like fellow inference project vLLM, SGLang traces its origins to Ion Stoica’s lab at UC Berkeley, a proving ground that also produced Databricks. The academic-to-startup path has become a pattern in AI infrastructure: build adoption in the open, then layer enterprise-grade operations, support, and SLAs.

Ying Sheng, a key SGLang contributor and former engineer at xAI, is leading RadixArk as co-founder and CEO. Sheng previously worked as a research scientist at Databricks, a background that fits RadixArk’s focus on production-scale systems. The company has raised angel capital, including support from veteran semiconductor and enterprise investors, according to people familiar with the financing.

RadixArk says it will continue to steward SGLang as an open-source AI model engine while developing commercial offerings. Those are expected to include managed hosting, enterprise support, and tooling that simplifies deployment across heterogeneous GPU clusters.

Betting On Inference Efficiency To Cut Costs

Inference—the phase where models generate outputs for real users—now dominates the operational cost curve for AI services. Nvidia has indicated that inference represents a growing share of data center GPU workloads, overtaking training for many customers. Every percentage point of throughput gained or latency reduced translates into immediate savings at scale.

SGLang’s appeal lies in techniques that are now table stakes for high-performance serving: continuous batching to keep GPUs saturated, paged attention to optimize memory, smarter key-value cache management, quantization-aware kernels, and speculative decoding to minimize wasted compute. In practice, these approaches can boost tokens-per-second and cut P95/P99 latency without new hardware—a compelling proposition when H100-class GPUs remain supply constrained.
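
Continuous batching, the first of those techniques, is easiest to see in a toy simulation. The sketch below is illustrative only; it is not SGLang's scheduler, and the batch size and request lengths are made-up numbers. The point it demonstrates is that finished sequences free their batch slots immediately, so queued requests join between decode steps rather than stalling behind the longest sequence in a static batch.

```python
import random
from collections import deque

# Toy continuous-batching simulation (hypothetical numbers, not SGLang's API):
# finished sequences release their slots each step, so waiting requests are
# admitted mid-flight and the GPU batch stays full.

MAX_BATCH = 4  # illustrative batch-slot limit

def continuous_batching(requests, max_steps=100):
    queue = deque(requests)   # (request_id, tokens_to_generate)
    running = {}              # request_id -> tokens remaining
    completed, step = [], 0
    while (queue or running) and step < max_steps:
        # Admit queued requests into any free batch slots.
        while queue and len(running) < MAX_BATCH:
            rid, need = queue.popleft()
            running[rid] = need
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:        # finished; slot freed this step
                del running[rid]
                completed.append((rid, step))
        step += 1
    return completed

reqs = [(i, random.randint(2, 8)) for i in range(10)]
for rid, step in continuous_batching(reqs):
    print(f"request {rid} finished at decode step {step}")
```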

Beyond raw serving speed, RadixArk is building Miles, a framework geared for reinforcement learning. The aim is to let customers adapt models in production using feedback loops—think RLHF refreshes, policy optimization for agents, and domain-specific skill tuning—while keeping inference costs predictable. That pairing of serving and continuous improvement could be a differentiator for enterprises that want models to get better with real-world use rather than periodic offline retrains.
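
Miles has no public API described in this reporting, so the following is purely a hypothetical sketch of the serve-and-learn loop outlined above. Every name in it is a placeholder: a served model generates completions, user feedback is logged as rewards, and a policy update runs once enough samples accumulate.

```python
import random
from dataclasses import dataclass, field

# Hypothetical serve-and-learn loop; NOT Miles' API. StubModel, FeedbackBuffer,
# and all parameters are placeholders used only to illustrate the architecture.

@dataclass
class FeedbackBuffer:
    min_samples: int = 4
    samples: list = field(default_factory=list)  # (prompt, completion, reward)

    def log(self, prompt, completion, reward):
        self.samples.append((prompt, completion, reward))

    def ready(self):
        return len(self.samples) >= self.min_samples

class StubModel:
    """Stands in for a served LLM plus a policy-optimization step."""
    def generate(self, prompt):
        return f"answer to {prompt!r}"

    def policy_update(self, samples):
        # A real system would run RLHF / policy-gradient updates here.
        print(f"updating policy on {len(samples)} feedback samples")

def serve_loop(model, prompts, buffer):
    for prompt in prompts:
        completion = model.generate(prompt)   # low-latency inference path
        reward = random.choice([0.0, 1.0])    # stand-in for user feedback
        buffer.log(prompt, completion, reward)
        if buffer.ready():                    # periodic in-production update
            model.policy_update(buffer.samples)
            buffer.samples.clear()

serve_loop(StubModel(), [f"q{i}" for i in range(10)], FeedbackBuffer())
```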


A Crowded But Expanding Market For AI Inference

RadixArk enters a heated field. Forbes recently reported that the team behind vLLM has discussed raising roughly $160 million at a valuation near $1 billion. The Wall Street Journal reported that Baseten secured $300 million at a $5 billion valuation, while Fireworks AI raised $250 million at a $4 billion valuation late last year. The funding wave reflects investor conviction that the inference layer—model serving, routing, optimization, and observability—will be a durable part of the AI stack.

Hardware trends reinforce the thesis. As LLMs swell in parameter count and context windows stretch into hundreds of thousands of tokens, serving becomes a memory and scheduling problem as much as a raw FLOPS problem. Vendors that deliver GPU-agnostic acceleration across Nvidia’s H100/H200/B200 and AMD’s MI300-class hardware, while handling quantized and mixture-of-experts models, will have an advantage as customers seek portability and cost control.
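
Some back-of-the-envelope arithmetic shows why long contexts turn serving into a memory problem. The sketch below uses a Llama-3-70B-like shape (80 layers, 8 grouped-query KV heads, head dimension 128, fp16 weights); the figures are illustrative, not vendor specifications.

```python
# KV cache sizing: each layer stores one key and one value vector per KV head
# for every token in the context.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    per_token = n_layers * n_kv_heads * head_dim * 2 * dtype_bytes
    return per_token * seq_len

# Llama-3-70B-like shape at a 128k-token context, fp16 (2 bytes per value).
per_request = kv_cache_bytes(80, 8, 128, seq_len=128_000)
print(f"{per_request / 2**30:.1f} GiB of KV cache for one 128k-token request")
# -> ~39 GiB; two such requests nearly fill an 80 GB H100 before weights.
```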

Why A $400M Price Tag Might Add Up For RadixArk

Seed and early-growth valuations for AI infrastructure companies increasingly reflect strategic positioning rather than current revenue. RadixArk holds several levers that investors prize: an active open-source community, demonstrated performance on widely used open models, and a pathway to enterprise contracts that promise immediate ROI through cost-per-token reductions. For buyers spending millions of dollars a month on inference, a 20–30% efficiency gain can be budget-changing.
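
The budget math behind that claim is simple; the spend figure below is hypothetical.

```python
# Illustrative annual savings from inference efficiency gains.
monthly_spend = 3_000_000  # hypothetical $/month on inference
for gain in (0.20, 0.30):
    print(f"{gain:.0%} efficiency -> ${monthly_spend * gain * 12:,.0f}/yr saved")
# 20% -> $7,200,000/yr; 30% -> $10,800,000/yr
```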

If RadixArk can convert SGLang’s adoption into enterprise subscriptions—offering uptime guarantees, privacy-preserving on-prem deployments, robust telemetry, and workload-aware autoscaling—the company could tap the same tailwinds lifting peers. The key will be translating benchmark wins into predictable savings in messy, multi-tenant real-world environments.

What To Watch Next As RadixArk Builds Momentum

Keep an eye on RadixArk’s head-to-head benchmarks on Llama 3, Mixtral, and other MoE models; support for quantization schemes like AWQ and GPTQ; and performance under long-context workloads where KV cache strategies become critical. Enterprise readiness—role-based access, audit logs, data residency controls—and support for hybrid GPU fleets will also signal how quickly RadixArk can move from open source traction to commercial scale.

The inference market is expanding fast, but it rewards execution. If RadixArk can turn SGLang’s technical edge into lower tail latencies, higher throughput, and simpler operations across varied hardware, the $400 million starting line may look conservative in hindsight.

By Gregory Zuckerman
Gregory Zuckerman is a veteran investigative journalist and financial writer with decades of experience covering global markets, investment strategies, and the business personalities shaping them. His writing blends deep reporting with narrative storytelling to uncover the hidden forces behind financial trends and innovations. Over the years, Gregory’s work has earned industry recognition for bringing clarity to complex financial topics, and he continues to focus on long-form journalism that explores hedge funds, private equity, and high-stakes investing.