A new benchmark intended to mirror real white-collar tasks has delivered a sobering verdict on AI agents’ workplace readiness. The APEX-Agents test, released by training-data firm Mercor, evaluates leading models on multi-step work from consulting, investment banking, and law. The best systems answered only about a quarter of questions correctly in a one-shot setting, raising fresh doubts about whether autonomous AI can replace knowledge workers anytime soon.
It’s a sharp counterpoint to the narrative that agents are poised to take over desk jobs. Despite tremendous gains in reasoning and planning, more than three-quarters of responses in this benchmark, even from the best models, were wrong or missing, underscoring the gap between lab demos and the messy, cross-domain demands of actual professional work.
A Tougher Test of Real Work Drawn from Professionals
Unlike generic multiple-choice exams, APEX-Agents draws tasks from practicing professionals on Mercor’s expert marketplace, who also defined what a successful answer looks like. The scenarios require nuanced interpretation, reference to internal policies, and alignment with regulatory frameworks. They are closer to client deliverables than to trivia questions.
Consider a law prompt assessing whether a company’s emergency export of EU production logs to a U.S. analytics vendor could be treated as permissible under Article 49 of the EU’s General Data Protection Regulation (GDPR). The correct outcome is “yes,” but getting there depends on interpreting the company’s policies and legal derogations in context. According to Mercor’s team, this kind of cross-referencing across domains—policy, law, and operational detail—is where current agents most often falter.
The benchmark also contrasts with earlier efforts like OpenAI’s GDPval, which gauges broad professional knowledge. APEX-Agents zeroes in on sustained, high-value workflows, making it more predictive of whether an agent could handle tasks that carry revenue and compliance implications.
Scores That Temper the Hype on Agent Workplace Readiness
On one-shot accuracy—responding correctly without interactive retries—Gemini 3 Flash led at 24%, with GPT-5.2 close behind at 23%. Opus 4.5, Gemini 3 Pro, and GPT-5 clustered around 18%. In practical terms, that means more than 75% of attempts failed to meet expert-defined standards on the first try.
The results don’t imply the models are useless; they signal that current agent stacks lack the reliability and depth needed for autonomous execution in high-stakes domains. Benchmarks have been saturated before, and the public release of APEX-Agents on Hugging Face will undoubtedly spur optimization. But for now, the ceiling appears well below what’s required for unsupervised deployment in client-facing finance, legal, or strategy work.
Why Agents Still Struggle with Real Enterprise Work
Three recurring bottlenecks stand out. First, grounding across sources: real work often demands stitching together internal policies, contracts, vendor docs, and regulatory text. Retrieval-augmented generation helps, but grounding quality varies, and agents can confidently cite the wrong clause or overlook crucial exceptions.
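To make the grounding problem concrete, here is a minimal Python sketch, not Mercor’s harness or any production RAG stack, of a retrieval step that attaches source IDs to passages and flags citations that were never actually retrieved. The Passage structure, the toy retriever, and the example IDs are illustrative assumptions.

```python
# Minimal sketch of provenance-checked retrieval. The Passage shape, the toy
# retriever, and the source IDs are illustrative assumptions, not any
# particular framework's API.
from dataclasses import dataclass

@dataclass
class Passage:
    source_id: str   # e.g. "policy:data-retention#4.2"
    text: str
    score: float     # retrieval similarity, higher is better

def retrieve(query: str, corpus: list[Passage], k: int = 5) -> list[Passage]:
    # Stand-in for a real retriever: rank a pre-scored corpus (query unused here).
    return sorted(corpus, key=lambda p: p.score, reverse=True)[:k]

def unsupported_citations(cited_ids: list[str], passages: list[Passage]) -> list[str]:
    # Flag citations that point at nothing the agent actually retrieved,
    # the "confidently cite the wrong clause" failure mode described above.
    retrieved_ids = {p.source_id for p in passages}
    return [cid for cid in cited_ids if cid not in retrieved_ids]

corpus = [
    Passage("policy:data-retention#4.2", "Logs may leave the EU only under Art. 49 derogations.", 0.91),
    Passage("contract:vendor-dpa#7", "The vendor processes data solely on documented instructions.", 0.83),
]
passages = retrieve("emergency export of EU production logs", corpus)
print(unsupported_citations(["policy:data-retention#4.2", "gdpr:art-6"], passages))
# ['gdpr:art-6'] -> re-retrieve or escalate instead of answering
```

A check like this doesn’t make the answer right, but it stops an agent from asserting support it never had.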

Second, tool orchestration: complex tasks require multi-step planning, selective use of tools, and iterative checking. Many agent frameworks still stumble on long-horizon coordination and fail to verify intermediate outputs, a major source of subtle errors.
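The orchestration gap can be sketched the same way: the toy plan-execute-verify loop below checks each intermediate output before the next step runs, rather than trusting the chain end to end. The step and verifier functions are hypothetical stand-ins, not any specific agent framework.

```python
# A toy plan-execute-verify loop: every step's output is verified before the
# next step runs. Steps and verifiers here are hypothetical stand-ins.
from typing import Callable

Step = tuple[str, Callable[[dict], dict], Callable[[dict], bool]]

def run_plan(steps: list[Step], state: dict, max_retries: int = 1) -> dict:
    for name, execute, verify in steps:
        for _ in range(max_retries + 1):
            state = execute(state)
            if verify(state):
                break
        else:
            # A failed intermediate check stops the run early rather than
            # letting a subtle error propagate into the final deliverable.
            raise RuntimeError(f"step '{name}' failed verification")
    return state

steps = [
    ("extract_figures", lambda s: {**s, "revenue": 120.0},
                        lambda s: s["revenue"] > 0),
    ("apply_multiple",  lambda s: {**s, "valuation": s["revenue"] * 8},
                        lambda s: s["valuation"] == s["revenue"] * 8),
]
print(run_plan(steps, {})["valuation"])  # 960.0
```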
Third, calibration and risk: enterprise work is asymmetric. A plausible but incorrect legal interpretation or misapplied financial assumption can be costlier than no answer. That pushes systems toward excessive caution or overreach, depending on prompting. The benchmark’s findings reflect this tension—either silence when confidence is low, or authoritative mistakes when it isn’t.
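That asymmetry can be written as a simple decision rule: answer only when the expected cost of being wrong is lower than the cost of abstaining and escalating. The confidence and cost numbers below are purely illustrative.

```python
# Hedged sketch of the asymmetric-cost tradeoff: abstain (and escalate to a
# human) whenever the expected cost of a wrong answer outweighs the cost of
# not answering. All numbers are illustrative.
def decide(confidence: float, cost_wrong: float, cost_abstain: float) -> str:
    expected_cost_answer = (1.0 - confidence) * cost_wrong
    return "answer" if expected_cost_answer < cost_abstain else "abstain"

# A plausible-but-uncertain legal interpretation: a wrong answer is very costly.
print(decide(confidence=0.70, cost_wrong=100.0, cost_abstain=5.0))  # abstain
# A low-stakes formatting task: answering is worth the residual risk.
print(decide(confidence=0.70, cost_wrong=1.0, cost_abstain=5.0))    # answer
```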
What It Means for Employers Deploying AI Agents Now
Near-term value will come from human-in-the-loop patterns rather than autonomous agents. Leaders should prioritize use cases where precision can be measured and verified: drafting client memos with citations, summarizing earnings calls with source links, or generating due diligence checklists that analysts refine. Strong retrieval with provenance, audit trails, and clear escalation to humans are essential controls.
Risk teams can anchor deployments to established guidance like the NIST AI Risk Management Framework, ensuring governance over data access, model monitoring, and incident response. In regulated functions, pair agents with mandatory review gates and standardized templates to limit variability. The ROI story is strongest where throughput gains don’t compromise compliance—think research synthesis, proposal generation, and structured data extraction.
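One way to wire such a review gate, sketched here with assumed data shapes (AgentDraft, ReviewDecision) rather than any real product API, is to hold agent output as a draft, append it to an audit log, and release it only on explicit human approval.

```python
# Minimal sketch of a mandatory review gate with an audit trail. The dataclass
# shapes and the audit.jsonl path are assumptions for illustration.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class AgentDraft:
    task_id: str
    content: str
    sources: list[str]          # provenance for every claim

@dataclass
class ReviewDecision:
    task_id: str
    approved: bool
    reviewer: str
    notes: str = ""

def audit_log(event: str, payload: dict, path: str = "audit.jsonl") -> None:
    # Append-only JSON-lines log: cheap to query, hard to silently rewrite.
    with open(path, "a") as f:
        f.write(json.dumps({"ts": time.time(), "event": event, **payload}) + "\n")

def release(draft: AgentDraft, decision: ReviewDecision) -> str:
    audit_log("review", {**asdict(draft), **asdict(decision)})
    if not decision.approved:
        return "escalated: returned to analyst queue"
    return draft.content

draft = AgentDraft("memo-042", "Draft client memo ...", ["earnings-call-2024Q4#12"])
print(release(draft, ReviewDecision("memo-042", approved=True, reviewer="jdoe")))
```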
The Next Milestones to Watch in Agent Capabilities
Progress will hinge on better retrieval fidelity, richer tool ecosystems, and agents that plan, verify, and cite as they go. Expect vendors to optimize for APEX-Agents with fine-tuned workflows and domain-specific memory. Just as important, evaluations must expand beyond accuracy to include business metrics like time saved, error rates under review, and compliance adherence.
The headline today is caution, not collapse. APEX-Agents shows that autonomous AI remains far from replacing bankers, consultants, or lawyers. But it also provides a sharper target for progress. If labs can lift one-shot accuracy well beyond 24% while preserving verifiability and provenance, the conversation about “agentic” work will move from hype to habit.