Date: 2026-03-05

OpenAI’s latest flagship model, GPT-5.4, is being positioned as a turning point for professional-grade AI. In evaluations designed to mirror real jobs, OpenAI says the new “Thinking” model matches or outperforms human experts 83% of the time. The company also reports material quality gains over its predecessor, with 18% fewer errors and 33% fewer false claims than GPT-5.2 on previously flagged factual prompts.

Inside the 83% benchmark for professional task performance

The 83% figure comes from OpenAI’s GDPval evaluation, a suite built to measure how well models execute economically valuable, real-world tasks. The framework spans nine major industries and 44 occupations: roles with high wages, little physical labor, and high potential for knowledge-work automation. Think corporate finance, manufacturing engineering, accounting, and market analysis.

Tasks were sourced and refined with experienced professionals to reflect day-to-day work. One manufacturing engineering prompt asked for the design of a jig or fixture to streamline cable spool handling in underground mining—a complex, highly specific brief that blends domain knowledge and practical constraints. Completed tasks were graded by human professionals who were blinded to whether the output came from a person or a model. OpenAI then trained an automated grader on those human judgments to enable rapid iteration, while acknowledging the need to watch for bias in any AI-assisted scoring.

Performance has accelerated sharply across recent releases. By OpenAI’s account, GPT-5.1 scored 38.8% on GDPval. GPT-5.2 leapt to 70.9%, with Wharton’s Ethan Mollick calling GDPval one of the most economically relevant measures of AI ability. Less than a quarter later, GPT-5.4 pushes that to 83%, meaning that in head-to-head comparisons on hours-long tasks, expert graders rated its output as good as or better than the human professional’s in 83% of cases.
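To make the headline metric concrete, here is a minimal sketch of how a win-or-tie rate could be computed from blinded pairwise verdicts. The label names and the sample verdicts are illustrative assumptions, not OpenAI's data or scoring code.

```python
from collections import Counter

# Hypothetical label set: each verdict says which output the blinded grader preferred.
VALID_LABELS = {"model", "tie", "human"}


def win_or_tie_rate(judgments):
    """Share of comparisons where the model output was rated at least as good
    as the human expert's deliverable (wins plus ties, divided by all verdicts)."""
    if not judgments:
        raise ValueError("no judgments to score")
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return (counts["model"] + counts["tie"]) / len(judgments)


# Made-up verdicts, one per task comparison (not real benchmark data).
sample = ["model", "tie", "human", "model", "model", "human"]
rate = win_or_tie_rate(sample)
print(f"win-or-tie rate: {rate:.1%}")
```

Note that ties count toward the score, so a figure like 83% is not the same as the model being preferred outright in 83% of comparisons.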

What GPT-5.4 changes for professional workflows and teams

GPT-5.4 is the first mainline reasoning model to incorporate the frontier coding capabilities previously introduced in GPT-5.3 Codex, according to OpenAI. Beyond raw reasoning, the company highlights improved factual discipline: on prompts with a history of user-flagged issues, claims are 33% less likely to be false than GPT-5.2’s, and errors drop by 18%. That combination of stronger chain-of-thought planning, higher coding fluency, and fewer factual slips is what appears to drive its gains on long, multi-step tasks.
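The reductions quoted above are relative, so their absolute size depends on a baseline the figures do not state. The sketch below assumes a purely hypothetical baseline false-claim rate to show the arithmetic. The function name and the 12% baseline are illustrative, not reported values.

```python
def apply_relative_reduction(baseline_rate, reduction):
    """Return the new rate after a relative reduction (e.g. 0.33 for '33% fewer')."""
    if not 0.0 <= baseline_rate <= 1.0:
        raise ValueError("baseline_rate must be between 0 and 1")
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return baseline_rate * (1.0 - reduction)


# Hypothetical baseline: 12% of claims false for the older model (assumed, not reported).
baseline = 0.12
new_rate = apply_relative_reduction(baseline, 0.33)

# The same change expressed in percentage points, which is a different unit.
drop_in_points = (baseline - new_rate) * 100

print(f"new false-claim rate: {new_rate:.2%}")
print(f"absolute drop: {drop_in_points:.2f} percentage points")
```

Under that assumed baseline, a 33% relative reduction amounts to only about four percentage points, which is why relative figures and percentage-point figures should not be compared directly.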

Early enterprise anecdotes add color. Daniel Swiecki, who leads Artificial Intelligence Solutions at Walleye Capital, said GPT-5.4 improved accuracy by 30 percentage points on the firm’s most demanding finance and Excel evaluations. That, he noted, widens the scope of automated model updates and scenario analyses for fundamental investing workflows—precisely the sort of repetitive, detail-intensive work many teams aim to streamline.

How the evaluations were built, scored, and bias-checked

OpenAI’s approach tries to anchor evaluations in authentic professional practice. Occupations were selected from industries contributing at least 5% to U.S. GDP, then filtered for roles with limited manual labor and outsized compensation impact. For each occupation, seasoned practitioners developed scenarios that typically take four to eight hours for humans to complete, from audit workpapers and pricing models to engineering design memos and marketing strategy briefs.

Human graders scored outputs against rubrics emphasizing correctness, completeness, compliance with constraints, and practical utility. The addition of an automated grader—trained on those human scores—allows frequent retesting as models evolve. While that speeds iteration, independent replication and transparency around rubrics will be important to bolster trust, particularly when model families are benchmarked against one another or used in high-stakes decisions.
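One common way to check an automated grader against human judgment is to measure chance-corrected agreement on the same items. The sketch below uses Cohen's kappa. This is a swapped-in, standard statistic, not a method OpenAI is reported to use, and the verdict lists are made up for illustration.

```python
from collections import Counter


def cohens_kappa(human, auto):
    """Cohen's kappa between two raters' labels for the same items."""
    if len(human) != len(auto) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == a for h, a in zip(human, auto)) / n
    # Expected agreement if the raters labeled independently at their own base rates.
    human_counts = Counter(human)
    auto_counts = Counter(auto)
    expected = sum(human_counts[label] * auto_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on eight tasks (illustrative only).
human_verdicts = ["model", "model", "human", "tie", "model", "human", "model", "human"]
auto_verdicts = ["model", "model", "human", "model", "model", "human", "human", "human"]

kappa = cohens_kappa(human_verdicts, auto_verdicts)
print(f"kappa: {kappa:.3f}")
```

Raw agreement here is 75%, but kappa is lower because some of that agreement would occur by chance. Tracking a statistic like this over time is one way to notice an automated grader drifting from expert judgment.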

Availability timeline and model lineup across ChatGPT and API

GPT-5.4 is rolling out via the API and across paid ChatGPT tiers, and it will appear in Codex as GPT-5.4 Thinking. OpenAI positions “Thinking” models for complex reasoning and “Instant” models for speed and conversational fluidity. The company indicates the tracks will evolve at different cadences—useful context for teams choosing between latency-sensitive assistants and deeper analytical copilots.

Implications for work and the open questions that remain

For professionals, the headline is not just that GPT-5.4 beats humans frequently—it’s that it does so on tasks representative of real deliverables. The most immediate impact will be augmentation: faster first drafts of analyses, cleaner code, tighter financial models, and more rigorous design notes that experts then refine. But as reliability improves, parts of the workflow will shift from assist to automate, changing how teams staff and sequence projects.

Caveats remain. Performance likely varies by domain and task complexity, and the automated grader’s fidelity to expert judgment deserves continued scrutiny. Moreover, GDPval intentionally excludes roles with substantial physical components, so the 83% figure should not be generalized beyond knowledge-intensive work. Still, taken together, the rising win rates, fewer errors, and stronger coding integration make GPT-5.4 a compelling data point that advanced AI is moving from impressive demos to dependable professional output.

The practical next step for organizations is disciplined piloting: pair GPT-5.4 with clear rubrics, human-in-the-loop QA, and audit trails; start with well-bounded, high-volume tasks; and measure effects on quality, speed, and risk. If OpenAI’s numbers hold up at scale, the gains won’t just save hours—they’ll reshape how expertise is applied across entire industries.