Worried a chatbot is about to replace your contract gig? A new benchmark suggests you can exhale, at least for now. In a large-scale test spanning real remote freelance projects, state-of-the-art AI agents only rarely delivered finished work that a client would accept, with the top system automating just 2.5% of tasks.
Inside the Remote Labor Index end-to-end evaluation
Researchers created the Remote Labor Index, or RLI, to evaluate whether AI can complete complex, economically valuable projects end to end—not just answer a prompt or pass a coding quiz. They sourced assignments that had already been completed by human freelancers in fields such as game development, product design, architecture, data analysis, and video animation. In human hands, the portfolio represented roughly $10,000 of paid work and more than 100 hours of effort.

The RLI emphasizes realistic constraints: ambiguous briefs, multi-step tool use, file management, and quality thresholds that mirror what a paying client would accept. The study evaluated several advanced systems, including Manus, Grok 4, Claude Sonnet 4.5, GPT-5, OpenAI's ChatGPT agent, and Gemini 2.5 Pro, tasking each with delivering completed files and artifacts, not just outlines or drafts.
Automation rates hover near zero in client-ready work
The results were unambiguous.
- Manus: 2.5% automation rate
- Grok 4: 2.1% automation rate
- Claude Sonnet 4.5: 2.1% automation rate
- GPT-5: 1.7% automation rate
- ChatGPT agent: 1.3% automation rate
- Gemini 2.5 Pro: 0.8% automation rate
In other words, even the best agents failed to deliver acceptable, client-ready work more than 97% of the time across this suite of remote projects.
That showing stands in stark contrast to AI’s performance on popular academic benchmarks, where models routinely score at or above human levels on multiple-choice tests, programming puzzles, and summarization tasks. The RLI’s gap highlights a hard truth: excelling at static benchmarks does not guarantee reliable execution of long-horizon, multi-tool, revision-heavy work.
Why AI agents struggled with real client work
One of the researchers, Dan Hendrycks, noted that while modern AIs can be impressively knowledgeable, they lack capabilities critical for remote execution. Long-term memory is thin to nonexistent, so agents cannot learn from earlier missteps or carry context cleanly across lengthy sessions. Visual reasoning—vital for tasks involving design comps, architectural renderings, or timeline-based video edits—remains brittle.

Real-world freelancing also demands robust tool orchestration: version control, asset handoffs, dependency installation, and precise file outputs. Today's agents often stumble on these basics. They can generate promising drafts but falter on last-mile quality, edge-case handling, and the back-and-forth revision loop that clients expect. Non-determinism compounds the problem: identical prompts can yield inconsistent behavior, making deadlines and QA hard to trust.
What it means for remote professionals and freelancers
For remote workers, this is a welcome signal: creative, open-ended, and tool-intensive projects remain meaningfully human. The RLI focused on tasks that require judgment, iterative problem-solving, and visual or spatial reasoning—areas where human freelancers hold an edge. Routine subtasks are still ripe for assistance, from data cleanup and code scaffolding to drafting outlines and generating first-pass visuals, but “press button, ship deliverable” is not where current agents shine.
Broader labor research echoes this nuance. Organizations such as the OECD emphasize that AI is more likely to reshape task mixes than to fully automate roles, especially in occupations with rich interpersonal and creative components. Professional leverage, not wholesale replacement, remains the near-term story: workers who combine domain expertise with AI copilot skills are seeing productivity gains without ceding ownership of outcomes.
The trajectory to watch in AI agent capabilities
The researchers stress that progress is measurable. Gains in long-term memory, multimodal perception, and reliable tool use could push RLI scores upward. Expect rapid iteration on persistent memory stores, retrieval-augmented reasoning, safer autonomous action within sandboxes, and deeper integrations with IDEs, design suites, and analytics platforms. As these pieces mature, the line between “assist” and “automate” will blur in select niches.
The takeaway is balanced. Today’s AI agents underperform on end-to-end remote freelance work, with automation rates clustered near zero. That buys time for professionals to double down on the durable skills that RLI appears to reward: client communication, problem framing, cross-tool fluency, taste and judgment, and rigorous QA. Use AI to clear the underbrush—drafts, boilerplate, data prep—while keeping your hands on the steering wheel. The jobs are still here, and for now, they’re still yours.