AI agents just posted a meaningful jump on a benchmark built to test whether software can perform real legal work, renewing a question the profession has been asking for a year: if not today, how soon can AI do a lawyer’s job?
The latest results come from Mercor’s agent benchmark, which evaluates multistep professional tasks like drafting legal memos, spotting issues in hypotheticals, and analyzing contracts. Anthropic’s new Opus 4.6 model pushed the leaderboard forward, scoring just under 30% in one-shot attempts and roughly 45% when allowed multiple tries—well ahead of earlier models that clustered below 25% only weeks ago.

It’s not a victory lap for machines in court, but it is a clear sign that agentic features—tool use, planning, and coordinated “swarms” of subagents—are now moving the needle on tasks that resemble day-to-day legal practice.
What Changed In The Legal Agent Benchmarks
Mercor’s evaluation stresses end-to-end execution, not just token-by-token prediction. Models must read a prompt, plan a sequence of steps, call tools or external data when appropriate, and deliver a final work product under constraints. In prior rounds, models floundered on long-horizon reasoning and cross-referencing facts across documents.
Opus 4.6 appears to improve each weak link. The model’s agentic stack supports iterative planning and self-critique, and Anthropic’s release included “agent swarms” that coordinate specialized workers. On multistep matters—think issue spotting across a fact pattern, synthesizing caselaw, then proposing edits to a clause—the compounded gains are visible in the scores.
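To make that pattern concrete, here is a minimal sketch, in Python, of how a planner, specialized workers, and a self-critique pass might be wired together. The function names and prompts are hypothetical stand-ins for illustration, not Anthropic's or Mercor's actual implementation.

```python
# Illustrative planner/worker/critic loop in the spirit of the "agent swarm"
# pattern described above. call_model is a hypothetical stand-in for an LLM
# API call; nothing here reflects Anthropic's or Mercor's internals.

def call_model(role: str, prompt: str) -> str:
    """Placeholder for a model call; in practice this would hit a hosted LLM API."""
    raise NotImplementedError

def run_legal_task(task: str, max_revisions: int = 2) -> str:
    # 1. Plan: break the matter into ordered steps (issue spotting, research, drafting).
    plan = call_model("planner", f"List the steps needed to complete:\n{task}")

    # 2. Execute: hand each step to a specialized worker, accumulating working notes.
    notes: list[str] = []
    for step in plan.splitlines():
        if step.strip():
            notes.append(call_model(
                "worker",
                f"Task: {task}\nCurrent step: {step}\nNotes so far:\n" + "\n".join(notes),
            ))

    # 3. Draft, then self-critique and revise within a fixed budget.
    draft = call_model("drafter", f"Write a memo for:\n{task}\nUsing notes:\n" + "\n".join(notes))
    for _ in range(max_revisions):
        critique = call_model("critic", f"List errors or unsupported claims in:\n{draft}")
        if "no issues" in critique.lower():
            break
        draft = call_model("drafter", f"Revise to address:\n{critique}\n\nMemo:\n{draft}")
    return draft
```

The point is less the specific prompts than the division of labor: planning, execution, and review each catch a different class of error, which is how compounded gains show up in the scores.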
Crucially, the uplift was achieved under a limited retry budget, which points to higher baseline reliability. For firms evaluating AI as a workflow tool, fewer do-overs mean faster throughput and lower supervision costs.
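For rough intuition about why the retry budget matters, here is a back-of-the-envelope calculation that assumes each attempt succeeds independently at the one-shot rate (a strong simplification, since real attempts are correlated) and compares the headline pass rate against reviewer workload.

```python
# Back-of-the-envelope: if one attempt succeeds with probability p and attempts
# were independent (a strong simplification), how do the headline pass rate and
# the reviewer's workload change with a retry budget of k?

def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k

def expected_drafts_reviewed(p: float, k: int) -> float:
    """Expected number of drafts a supervising lawyer reads before success or giving up."""
    # Attempt i only happens if the first i-1 attempts all failed.
    return sum((1 - p) ** (i - 1) for i in range(1, k + 1))

for k in (1, 3, 5):
    print(f"k={k}: pass rate ≈ {pass_at_k(0.30, k):.0%}, "
          f"drafts reviewed ≈ {expected_drafts_reviewed(0.30, k):.2f}")
```

Retries raise the headline number, but every extra attempt is another draft a human has to read, which is why the one-shot figure is the one firms watch.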
Why 30% Matters More Than It Sounds In Legal Work
Thirty percent is not courtroom-ready. But in legal operations, partial automation compounds: shave 20–40% off document review, first drafts, or cite checks, and case teams redeploy hours to strategy. Goldman Sachs has estimated that roughly 44% of legal tasks are exposed to automation—largely the repetitive, text-heavy kind that pads billable hours but doesn’t decide outcomes.
Benchmarks also lag deployment realities. A model scoring 30% unguided may cross 60–70% in a workflow instrumented with retrieval, templates, checklists, and structured outputs. The lesson from e-discovery and contract lifecycle management is consistent: orchestrate the task well and average models look exceptional.
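What that instrumentation can look like, as a sketch with hypothetical helper names rather than any vendor's actual pipeline: retrieve the relevant clauses, constrain the model to a fixed output schema, and gate the draft on a checklist before a human ever sees it.

```python
# Sketch of the scaffolding that lifts an "average" model: ground the prompt in
# retrieved sources, constrain the output to a schema, and gate the draft on a
# checklist. All helper names here are hypothetical.
import json

REQUIRED_FIELDS = ("summary", "risks", "proposed_edits", "citations")

def retrieve_clauses(contract_text: str, playbook: list[str]) -> list[str]:
    """Hypothetical retrieval step: excerpts relevant to each playbook rule."""
    raise NotImplementedError

def call_model(prompt: str) -> str:
    """Hypothetical LLM call that returns a JSON string."""
    raise NotImplementedError

def review_contract(contract_text: str, playbook: list[str]) -> dict:
    # Retrieval: the model only sees vetted excerpts, not the open web.
    excerpts = retrieve_clauses(contract_text, playbook)

    # Structured output: ask for a fixed schema instead of free prose.
    prompt = (
        "Using ONLY the excerpts below, return JSON with keys "
        f"{', '.join(REQUIRED_FIELDS)}.\n\nExcerpts:\n" + "\n---\n".join(excerpts)
    )
    result = json.loads(call_model(prompt))

    # Checklist gate: drafts that skip required fields never reach a reviewer.
    missing = [field for field in REQUIRED_FIELDS if not result.get(field)]
    if missing:
        raise ValueError(f"draft rejected, missing fields: {missing}")
    return result
```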
From Paralegal To Coauthor In Everyday Legal Work
Law firms and vendors have already been inching toward agent-like systems. Casetext pioneered GPT-4–powered brief drafting before its acquisition by Thomson Reuters, and Allen & Overy rolled out the Harvey platform to thousands of lawyers to assist with research and drafting. Corporate legal teams are using copilots to summarize NDAs, compare clauses against playbooks, and generate due diligence checklists.

What the new benchmark implies is that these tools won’t just autocomplete text; they will plan, verify, and ask for what they need. An “agent lawyer” doesn’t replace counsel—it drafts alternatives, flags risks tied to fact patterns, runs a quick analogical search over recent cases, and presents a reasoned memo for a human to approve or revise.
The Risk Ledger And The Rulebook For AI In Law
Legal work punishes errors. The Avianca case—where fabricated citations from a chatbot slipped into a filing—remains a cautionary tale. Multiple U.S. judges now require certifications that attorneys verified AI-assisted filings, and several courts have issued standing orders on disclosure and citation checks.
Regulators are circling, too. The EU AI Act treats systems used to assist in administering justice as high-risk, triggering requirements around transparency, data governance, and human oversight. For firms, this translates into audit logs, source grounding, confidentiality controls, and red-teaming models against biased or hallucinated outputs before they touch client matters.
Measuring Real Legal Reasoning In Agent Workflows
Traditional exams only tell part of the story. Research communities have built domain-specific evaluations—such as LegalBench and newer suites that test citation fidelity, statutory interpretation, and contract edits constrained by policy. The next wave of evals will be scenario-based: did the agent find the controlling precedent, properly distinguish adverse authority, and preserve privilege throughout the workflow?
Vendors are already moving in this direction with “grounded generation,” which forces models to cite the passages that support each conclusion. Combine that with tool use (databases, calendaring, entity extraction) and agent reliability can be measured, not assumed.
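A minimal illustration of that grounding check, with invented data shapes rather than any vendor's format: every conclusion must quote a supporting passage, and the quoted text must actually appear in the cited source.

```python
# Minimal grounding check: every conclusion must quote a supporting passage,
# and the quoted text must actually appear in the cited source. The data
# shapes here are invented for illustration.

def verify_grounding(claims: list[dict], sources: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every claim is grounded."""
    problems = []
    for claim in claims:
        doc = sources.get(claim["source_id"])
        if doc is None:
            problems.append(f"unknown source: {claim['source_id']}")
        elif claim["quote"] not in doc:
            problems.append(f"quote not found in {claim['source_id']}: {claim['quote'][:60]!r}")
    return problems

sources = {"opinion_12": "The court held that the limitation period begins at discovery."}
claims = [{
    "source_id": "opinion_12",
    "quote": "the limitation period begins at discovery",
    "conclusion": "Limitations run from discovery, not from breach.",
}]
print(verify_grounding(claims, sources) or "all claims grounded")
```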
What Comes Next For Agent Lawyers In Practice
Expect rapid iteration. If a single model release can lift legal-task benchmark scores by double digits, coordinated systems and domain-tuned policies will push higher. The realistic near-term picture is a paralegal-plus agent that drafts, checks, and explains, with a lawyer in the loop owning judgment calls and ethics.
Can AI agents be lawyers after all? Not in the licensure sense. But as coauthors and tireless analysts, they’re getting uncomfortably good—and the latest benchmarks suggest they’re moving faster than many in the profession predicted just a month ago.
