
Inferact Raises $150M To Commercialize vLLM

By Gregory Zuckerman
Technology
Last updated: January 23, 2026 12:03 am

Inferact, the new company formed by the creators of the open source inference engine vLLM, has raised $150 million in seed financing at an $800 million valuation, signaling a major push to turn one of the most popular LLM serving projects into a full-fledged enterprise platform.

The team’s bet is straightforward: as generative AI shifts from model training to real-world deployment, the winners will be those who make inference faster, cheaper, and easier to operate at scale. vLLM’s lead creator and Inferact CEO Simon Mo has said that vLLM already powers production workloads at major companies, including Amazon’s cloud unit and its retail app, as reported by Bloomberg.

Table of Contents
  • A Big Bet on Inference Efficiency for AI Serving
  • What vLLM Brings To Production Workloads
  • From Open Source to a Full Enterprise Platform
  • A Crowded Field With Clear Benchmarks for Serving
[Image: Inferact raises $150M to commercialize vLLM for enterprise AI]

A Big Bet on Inference Efficiency for AI Serving

The financing lands amid a broader reorientation in AI infrastructure toward serving. Startups building inference runtimes and scheduling layers have become priority targets for investors as organizations evaluate total cost per token, latency, and GPU utilization rather than raw pretraining scale.

Inferact’s debut follows a similar move by the team behind SGLang, which was commercialized as RadixArk and reportedly valued at around $400 million in a round led by Accel. Both projects were incubated in 2023 at a UC Berkeley lab overseen by Databricks co-founder Ion Stoica, underscoring academia’s ongoing role in shipping practical systems for the AI stack.

What vLLM Brings To Production Workloads

vLLM rose quickly by attacking the bottlenecks that make serving large language models expensive. Its scheduling core popularized techniques such as PagedAttention for memory-efficient key–value cache management, enabling long-context responses without exhausting GPU memory, and continuous batching to keep devices saturated while maintaining responsiveness.
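To make the paging idea concrete, here is a toy sketch (illustrative only, not vLLM's actual code): instead of reserving one contiguous maximum-length buffer per request, the KV cache is carved into fixed-size blocks that are handed out on demand, so memory consumption tracks the tokens actually generated.

    BLOCK_SIZE = 16  # tokens per KV-cache block (a small fixed size, as in paged schemes)

    class PagedKVCache:
        """Toy bookkeeping for a paged KV cache; the block contents are elided."""

        def __init__(self, num_blocks: int):
            self.free_blocks = list(range(num_blocks))  # global pool of physical blocks
            self.block_tables = {}                      # request id -> list of block ids

        def append_token(self, request_id: str, token_index: int) -> int:
            """Return the block holding this token, allocating one at block boundaries."""
            table = self.block_tables.setdefault(request_id, [])
            if token_index % BLOCK_SIZE == 0:           # crossed into a new block
                if not self.free_blocks:
                    raise MemoryError("cache exhausted; a real scheduler preempts here")
                table.append(self.free_blocks.pop())
            return table[-1]

        def release(self, request_id: str) -> None:
            """Return a finished request's blocks to the pool for immediate reuse."""
            self.free_blocks.extend(self.block_tables.pop(request_id, []))

Because blocks from finished requests return to the pool immediately, the scheduler can keep admitting new requests into the running batch, which is what makes continuous batching memory-safe in practice.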

The result is higher throughput and steadier tail latency for a wide variety of models, from instruction-tuned LLMs to multi-tenant chat and tool use cases. Developers also value its OpenAI-compatible server and ecosystem connectors, which allow existing applications to swap in vLLM with minimal code changes while benefiting from better GPU utilization.
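In practice the swap can be as small as changing a base URL. A minimal sketch, assuming a locally hosted model (the model name below is just an example): launch vLLM's OpenAI-compatible server, then point the standard OpenAI Python client at it.

    # Shell: vllm serve mistralai/Mistral-7B-Instruct-v0.3
    # The server listens on http://localhost:8000 by default.
    from openai import OpenAI

    # Any non-empty string works as the key unless the server enforces one.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.3",
        messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)

Existing applications built against the OpenAI API can typically be repointed this way without touching the call sites themselves.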

In practice, these optimizations translate into lower cost per 1,000 tokens and improved reliability under bursty, real-time traffic—two constraints that often derail pilots as usage scales. For teams operating across fleets of NVIDIA GPUs, those gains can compound quickly as workloads grow.
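To see why throughput dominates unit economics, consider a back-of-envelope calculation (the GPU price and throughput here are assumptions, not vendor figures): cost per token is simply the hourly GPU price divided by sustained token throughput.

    gpu_hourly_usd = 2.50        # assumed on-demand price for one GPU
    throughput_tok_s = 2_400     # assumed sustained decode throughput with batching

    tokens_per_hour = throughput_tok_s * 3600
    cost_per_1k = gpu_hourly_usd / tokens_per_hour * 1000
    print(f"${cost_per_1k:.5f} per 1K tokens")          # about $0.00029

    # Halve the throughput (e.g., poor batching) and the unit cost doubles:
    print(f"${gpu_hourly_usd / (tokens_per_hour / 2) * 1000:.5f} per 1K tokens")

Under these assumptions, doubling utilization through better batching cuts the cost per 1,000 tokens in half, which is exactly the lever vLLM's scheduler targets.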

[Image: the vLLM logo, the letters "LLM" beside a stylized orange-and-blue "V"]

From Open Source to a Full Enterprise Platform

Commercializing vLLM gives customers a clearer path to supported, production-grade deployments. While details of Inferact’s offering were not disclosed, buyers typically look for SLAs, hardened security, compliance certifications, observability, and hands-on support for model deployment pipelines—especially across multi-cloud and hybrid environments.

Expect Inferact to focus on managed services and performance tooling that help teams squeeze more tokens from each GPU hour: adaptive batching and prioritization, autoscaling for heterogeneous clusters, and configuration presets tuned for common accelerators. For large enterprises, integration with existing MLOps stacks, role-based access controls, and cost attribution by team or application will be table stakes.

Critically, the company will need to maintain the project’s open source velocity while layering commercial features on top—a balance that has defined successful infrastructure companies over the past decade.

A Crowded Field With Clear Benchmarks for Serving

Inferact enters a competitive arena. Open source alternatives such as Hugging Face’s Text Generation Inference and NVIDIA’s TensorRT-LLM, alongside hosted platforms like Fireworks AI and Together AI, are vying to become the default runtime for serving. With SGLang’s commercialization as RadixArk, the race to own the inference layer is accelerating.

For customers, the calculus is pragmatic: lowest cost per token at required latency and reliability, with the simplest developer experience. Vendor-neutrality and data governance are also top of mind as enterprises standardize on private deployments for sensitive workloads.

Inferact’s war chest, the project’s widespread adoption, and the credibility of its Berkeley roots position it well. If the company can convert vLLM’s technical lead into enterprise guarantees and operational simplicity, it could become the default engine behind a growing share of AI applications.

By Gregory Zuckerman
Gregory Zuckerman is a veteran investigative journalist and financial writer with decades of experience covering global markets, investment strategies, and the business personalities shaping them. His writing blends deep reporting with narrative storytelling to uncover the hidden forces behind financial trends and innovations. Over the years, Gregory’s work has earned industry recognition for bringing clarity to complex financial topics, and he continues to focus on long-form journalism that explores hedge funds, private equity, and high-stakes investing.