Silicon Valley is pouring capital and talent into simulated “environments” where AI agents learn by doing, not just by predicting text. The pitch: don’t fine-tune a model on static data sets; instead, teach it to navigate browsers, spreadsheets and enterprise apps, and score how well it accomplishes multi-step tasks. Investors see a new infrastructure layer being built, and one leading lab has, according to industry reporting, been developing plans to invest well over a billion dollars in this approach over the next year.
Environments are this wave’s answer to the labeled data of the chatbot boom: a substrate on which agentic models can learn, fail forward and improve. If the last wave needed Scale AI for annotation, the next one might need a “Scale AI for environments.” The question is whether environments can deliver the reliability and scale that modern agents require.

The importance of environments in agent training
Reinforcement learning (RL) environments are, in essence, sandboxes instrumented to monitor an agent’s actions and their outcomes. Imagine a controllable, instrumented Chrome in which an agent works toward a goal: every click, keystroke and tool call is logged, correct actions earn rewards that update the model, and mistakes produce corrective feedback signals.
That’s easy enough in theory, but the edge cases pile up: dropdowns that get clipped off-screen, elements that depend on earlier inputs, live targets that change over time, CAPTCHAs and login flows. Unlike a static corpus, an environment must stay robust to unpredictable agent behavior and UI drift while still giving clean feedback. That instrumentation, deciding what to log, how to score and when to reset, is where the real intellectual property lies.
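For readers who want to picture the mechanics, here is a minimal, hypothetical sketch of such an instrumented environment in Python, loosely following the reset/step convention popularized by Gym. The class name, observation fields and scoring rules are illustrative assumptions, not any lab’s or vendor’s actual API.

```python
# Minimal, hypothetical sketch of an instrumented browser-task environment.
# It loosely follows the Gym-style reset()/step() convention; the observation
# fields and scoring rules below are illustrative assumptions.

class BrowserTaskEnv:
    def __init__(self, task_goal: str, max_steps: int = 50):
        self.task_goal = task_goal   # natural-language goal, e.g. "submit the expense form"
        self.max_steps = max_steps
        self.steps = 0
        self.log = []                # instrumentation: every action/outcome is recorded

    def reset(self):
        """Restore the app to a known state so every rollout starts identically."""
        self.steps = 0
        self.log.clear()
        return {"dom": "<html>...</html>", "goal": self.task_goal}

    def step(self, action: dict):
        """Apply one agent action (click, type, tool call) and return feedback."""
        self.steps += 1
        outcome = self._execute(action)          # drive the sandboxed UI
        self.log.append((action, outcome))       # what to log
        reward = 1.0 if outcome["task_complete"] else 0.0   # how to score
        done = outcome["task_complete"] or self.steps >= self.max_steps
        obs = {"dom": outcome["dom"], "goal": self.task_goal}
        return obs, reward, done, {"log": list(self.log)}

    def _execute(self, action: dict) -> dict:
        # Placeholder: a real environment would drive a sandboxed browser here.
        return {"dom": "<html>...</html>", "task_complete": False}
```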
There’s precedent. OpenAI’s Gym has been popular for close to a decade, and DeepMind’s AlphaGo demonstrated RL’s potential for mastering complex decision spaces. The novelty now is ambition: researchers want generally capable, computer-using agents that string together tools, browse the web and pursue natural-language goals. Recent progress, including OpenAI’s o1 and Anthropic’s Claude Opus 4, suggests that RL-style optimization can unlock leaps in reasoning once supervised fine-tuning reaches a plateau.
A new startup race for the “environment layer”
The rush to productize environments has set off a crowded race. Startups such as Mechanize Work are chasing depth over breadth, hiring senior engineers at salaries rumored to reach the high six figures and putting them to work on a handful of highly reliable environments rather than hundreds of shallow ones.
Others are betting on distribution. Prime Intellect, backed by investors including Andrej Karpathy, Founders Fund and Menlo Ventures, debuted a hub pitched as a “Hugging Face for environments,” where community-built tasks would live and compute would be sold to run them. Its team says training generally capable agents in interactive settings is far more compute-intensive than previous fine-tuning regimes, opening an adjacent market for GPU providers and cloud platforms.
Incumbent data-operations players are also shifting. Annotation-focused companies like Scale AI, Surge and Mercor are building environment programs to serve labs moving from labels to simulations. Scale AI, the data-labeling startup that has been striking deals with open-source projects in exchange for proprietary training data, now has 16 environment partners whose work it offers to some of its customers, while others build software on top of what Scale supplies. The need is clear: labs want curated workflows, consistent scoring and human-in-the-loop evaluation at scale.

The difficult part: reliability, rewards and cost
Making environments that actually move the needle is brutally hard. Teams must balance determinism and realism: too much randomness makes training noisy, too little and agents overfit to brittle scripts. Reward design is an art of its own: too narrow a target invites degenerate behaviors, too broad a brush teaches nothing. Instrumentation must account for partial credit, tool use, latency and safety, not just binary success.
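To make the reward-design trade-off concrete, here is one hedged sketch of a shaped scoring function that grants partial credit, penalizes latency and safety violations, and reserves a bonus for verified completion. The field names and weights are assumptions for illustration, not anyone’s production recipe.

```python
# Hypothetical shaped-reward function for scoring one agent rollout.
# Field names and weights are illustrative assumptions, not a standard.

def score_rollout(result: dict) -> float:
    reward = 0.0
    # Partial credit: fraction of required subgoals completed (e.g. 3 of 5 form fields).
    reward += 0.5 * result.get("subgoals_completed", 0) / max(result.get("subgoals_total", 1), 1)
    # Full-task bonus only if the end state verifies as correct.
    if result.get("task_verified", False):
        reward += 0.5
    # Mild penalties so the optimum isn't "finish at any cost".
    reward -= 0.001 * result.get("latency_seconds", 0.0)
    reward -= 1.0 * result.get("safety_violations", 0)
    return reward

# Example: 3 of 5 subgoals done, task not verified, 40 seconds elapsed, no violations.
print(score_rollout({"subgoals_completed": 3, "subgoals_total": 5,
                     "latency_seconds": 40.0, "safety_violations": 0}))  # 0.26
```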
Environments are also an operational grind. Web targets change constantly. Enterprise apps bring permissions, rate limits and audit requirements. Security sandboxes must guard against data exfiltration and enforce least-privilege access to tools. Each of these affects training stability and cost.
And cost matters. Interactive RL means long rollouts, heavy CPU for simulation orchestration and heavy GPU time for policy updates. The Prime Intellect team has argued that this work is more expensive than conventional supervised fine-tuning. That dynamic favors well-funded labs and clouds, unless open-source ecosystems standardize shareable environments that make distributed compute easier to tap.
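As a purely illustrative back-of-envelope (every number below is an assumption, not a reported figure), the arithmetic shows why: when each training example is an interactive episode rather than a single forward/backward pass, rollout time alone adds up quickly.

```python
# Purely illustrative back-of-envelope; every number here is an assumption.
steps_per_rollout = 40       # UI actions per episode
seconds_per_step = 2.0       # page loads, tool calls, scoring
rollouts_per_update = 512    # episodes gathered before each policy update

rollout_seconds = steps_per_rollout * seconds_per_step * rollouts_per_update
print(f"~{rollout_seconds / 3600:.1f} browser-hours of rollouts per policy update")
# vs. supervised fine-tuning, where one example is a single forward/backward pass
```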
Skeptics inside labs question the environment push
Not everyone is bullish. A senior OpenAI executive recently told me he’s “short” on environment startups, in part because fast research turnover and lab-specific needs make it hard for third parties to keep up. Karpathy, one of the early voices for agentic interaction, has cautioned that while environments are promising, pure reinforcement learning is unlikely to be a panacea. The takeaway: agents need rich interaction loops, but the wins may come from hybrids that combine search, program synthesis, tool calling and lightweight RL rather than one monolithic RL recipe.
What to watch next as agent training environments evolve
Signals to look for:
- Environments that generalize across UI changes without retraining
- Improvements on realistic benchmarks such as WebArena, MiniWoB++, GAIA-style tool-use tasks and code-agent suites like SWE-bench
- Standardized reward schemas (see the sketch after this list)
- Public repositories of reusable tasks
- Tighter linkage between environment providers and cloud GPUs
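On the standardized reward schemas point above, here is one hedged sketch of what a portable schema for reporting rollout outcomes could look like; the field names are assumptions, not an emerging standard.

```python
# Hypothetical schema for reporting a rollout's outcome in a portable way.
# Field names are illustrative assumptions, not an established standard.
from dataclasses import dataclass, field

@dataclass
class RolloutResult:
    task_id: str                    # which environment task was attempted
    success: bool                   # did the final state verify as correct?
    partial_credit: float           # 0.0-1.0 fraction of subgoals achieved
    steps: int                      # actions taken during the episode
    tool_calls: int                 # external tools invoked
    latency_seconds: float          # wall-clock time for the episode
    safety_flags: list[str] = field(default_factory=list)  # e.g. ["attempted_login_bypass"]
```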
If environments deliver, they could be the missing scaffolding for reliable agents that actually close tickets and manipulate data within enterprise guardrails. If they fall short, expect labs to double down on other strategies: chain-of-thought variants, planning modules, retrieval and programmatic tool graphs. For now, Silicon Valley is betting that the next breakthroughs won’t come simply from more data; they’ll come from better places for agents to learn.