FindArticles

OpenAI Releases GPT-5.4, Beating Human Pros 83% in Tests

By Gregory Zuckerman
Last updated: March 5, 2026 7:03 pm
Technology
6 Min Read

OpenAI’s latest flagship model, GPT-5.4, is being positioned as a turning point for professional-grade AI. In evaluations designed to mirror real jobs, OpenAI says the new “Thinking” model matches or outperforms human experts 83% of the time. The company also reports material quality gains over its predecessor, with 18% fewer errors and 33% fewer false claims than GPT-5.2 on previously flagged factual prompts.

Inside the 83% benchmark for professional task performance

The 83% figure comes from OpenAI’s GDPval evaluation, a suite built to measure how well models execute economically valuable, real-world tasks. The framework spans nine major industries and 44 occupations—roles with high wages, low physical labor, and high potential for knowledge-work automation. Think corporate finance, manufacturing engineering, accounting, and market analysis.

[Image: Tweet from Chubby ♨️♨️ (@kimmonismus), via The Information, listing GPT-5.4 updates: a 1M-token context window, a new Extreme reasoning mode, long-context parity with Gemini and Claude, better long-horizon tasks, improved memory across multi-step workflows, lower error rates on complex tasks, a design focus on agents and automation (e.g., Codex), usefulness for scientific research, and OpenAI’s shift to monthly model updates.]

Tasks were sourced and refined with experienced professionals to reflect day-to-day work. One manufacturing engineering prompt asked for the design of a jig or fixture to streamline cable spool handling in underground mining—a complex, highly specific brief that blends domain knowledge and practical constraints. Completed tasks were graded by human professionals who were blinded to whether the output came from a person or a model. OpenAI then trained an automated grader on those human judgments to enable rapid iteration, while acknowledging the need to watch for bias in any AI-assisted scoring.

Performance has accelerated sharply across recent releases. By OpenAI’s account, GPT-5.1 scored 38.8% on GDPval. GPT-5.2 leapt to 70.9%, with Wharton’s Ethan Mollick calling GDPval one of the most economically relevant measures of AI ability. Less than a quarter later, GPT-5.4 pushes that to 83%, meaning that on head-to-head, hours-long tasks, blinded expert graders rated the model’s output as good as or better than a human professional’s 83% of the time.
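The arithmetic behind a figure like 83% is straightforward: each task yields a blinded verdict (the model’s deliverable wins, ties, or loses against the human expert’s), and the headline number is the share of wins plus ties. A minimal sketch with illustrative grades, not OpenAI’s actual data:

```python
# Compute a blinded win-or-tie rate from pairwise grades.
# The sample grades below are illustrative placeholders, not GDPval data.
from collections import Counter

def win_or_tie_rate(grades):
    """grades: list of 'win', 'tie', or 'loss' (model vs. human expert)."""
    counts = Counter(grades)
    favorable = counts["win"] + counts["tie"]
    return favorable / len(grades)

# Hypothetical sample: 100 tasks graded blind by domain experts.
sample = ["win"] * 58 + ["tie"] * 25 + ["loss"] * 17
print(f"win-or-tie rate: {win_or_tie_rate(sample):.0%}")  # → 83%
```

Note that a win-or-tie rate bundles two different outcomes; OpenAI’s reporting similarly describes the model as matching or outperforming experts, so the split between outright wins and ties matters when interpreting the number.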

What GPT-5.4 changes for professional workflows and teams

GPT-5.4 is the first mainline reasoning model to incorporate the frontier coding capabilities previously introduced in GPT-5.3 Codex, according to OpenAI. Beyond raw reasoning, the company highlights improved factual discipline: on prompts with a history of user-flagged issues, claims are a third less likely to be false than with GPT-5.2, and overall error rates drop by nearly a fifth. That combination—stronger chain-of-thought planning, higher coding fluency, and fewer factual slips—is what appears to drive its gains on long, multi-step tasks.

Early enterprise anecdotes add color. Daniel Swiecki, who leads Artificial Intelligence Solutions at Walleye Capital, said GPT-5.4 improved accuracy by 30 percentage points on the firm’s most demanding finance and Excel evaluations. That, he noted, widens the scope of automated model updates and scenario analyses for fundamental investing workflows—precisely the sort of repetitive, detail-intensive work many teams aim to streamline.

How the evaluations were built, scored, and bias-checked

OpenAI’s approach tries to anchor evaluations in authentic professional practice. Occupations were selected from industries contributing at least 5% to U.S. GDP, then filtered for roles with limited manual labor and outsized compensation impact. For each occupation, seasoned practitioners developed scenarios that typically take four to eight hours for humans to complete, from audit workpapers and pricing models to engineering design memos and marketing strategy briefs.
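The occupation-selection step described above amounts to a pair of filters over labor-market data: keep industries above the GDP cutoff, then keep high-wage, low-physical-labor roles within them. A toy sketch of that logic, in which every record, field name, and threshold other than the 5%-of-GDP cutoff is an invented assumption:

```python
# Toy occupation-selection filter mirroring the described criteria:
# industries contributing >= 5% of U.S. GDP, then knowledge-work roles
# with limited manual labor and high wages. All records and the non-GDP
# thresholds below are illustrative assumptions, not OpenAI's data.

industries = {
    "finance": 0.078,        # share of U.S. GDP (illustrative)
    "manufacturing": 0.101,
    "arts": 0.043,           # below the 5% cutoff, so excluded
}

occupations = [
    {"title": "financial analyst", "industry": "finance",
     "physical_labor": 0.1, "wage_percentile": 0.85},
    {"title": "accountant", "industry": "finance",
     "physical_labor": 0.1, "wage_percentile": 0.75},
    {"title": "set designer", "industry": "arts",
     "physical_labor": 0.6, "wage_percentile": 0.55},
]

eligible_industries = {k for k, share in industries.items() if share >= 0.05}
selected = [
    o["title"] for o in occupations
    if o["industry"] in eligible_industries
    and o["physical_labor"] <= 0.3      # low manual-labor component (assumed cutoff)
    and o["wage_percentile"] >= 0.7     # high-wage filter (assumed cutoff)
]
print(selected)  # → ['financial analyst', 'accountant']
```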

[Image: A dark Codex interface element reading “Ask Codex anything, @ to add file,” with a GPT-5.4 model-selector dropdown.]

Human graders scored outputs against rubrics emphasizing correctness, completeness, compliance with constraints, and practical utility. The addition of an automated grader—trained on those human scores—allows frequent retesting as models evolve. While that speeds iteration, independent replication and transparency around rubrics will be important to bolster trust, particularly when model families are benchmarked against one another or used in high-stakes decisions.
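One simple way to audit an automated grader of this kind is to hold out human-scored tasks and measure how often the model grader lands close to the expert rubric score. A minimal sketch with invented scores; the 0–5 rubric scale, the tolerance, and the data are all assumptions for illustration:

```python
# Agreement check between an automated grader and human rubric scores.
# The scores, the 0-5 scale, and the tolerance are illustrative assumptions.

def agreement_rate(human_scores, auto_scores, tolerance=0.5):
    """Fraction of held-out tasks where the automated grader lands
    within `tolerance` rubric points of the human expert's score."""
    assert len(human_scores) == len(auto_scores)
    agree = sum(
        1 for h, a in zip(human_scores, auto_scores) if abs(h - a) <= tolerance
    )
    return agree / len(human_scores)

# Hypothetical held-out set: rubric scores on a 0-5 scale.
human = [4.0, 3.5, 2.0, 5.0, 4.5, 1.0]
auto = [4.0, 3.0, 2.5, 4.0, 4.5, 1.5]
print(f"agreement: {agreement_rate(human, auto):.0%}")  # → 83%
```

A transparent check along these lines, reported alongside the headline win rates, is exactly the kind of evidence that would address the bias concerns raised above.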

Availability timeline and model lineup across ChatGPT and API

GPT-5.4 is rolling out via the API and across paid ChatGPT tiers, and it will appear in Codex as GPT-5.4 Thinking. OpenAI positions “Thinking” models for complex reasoning and “Instant” models for speed and conversational fluidity. The company indicates the tracks will evolve at different cadences—useful context for teams choosing between latency-sensitive assistants and deeper analytical copilots.

Implications for work and the open questions that remain

For professionals, the headline is not just that GPT-5.4 beats humans frequently—it’s that it does so on tasks representative of real deliverables. The most immediate impact will be augmentation: faster first drafts of analyses, cleaner code, tighter financial models, and more rigorous design notes that experts then refine. But as reliability improves, parts of the workflow will shift from assist to automate, changing how teams staff and sequence projects.

Caveats remain. Performance likely varies by domain and task complexity, and the automated grader’s fidelity to expert judgment deserves continued scrutiny. Moreover, GDPval intentionally excludes roles with substantial physical components, so the 83% figure should not be generalized beyond knowledge-intensive work. Still, taken together—rising win rates, fewer errors, and stronger coding integration—GPT-5.4 is a compelling data point that advanced AI is moving from impressive demos to dependable professional output.

The practical next step for organizations is disciplined piloting: pair GPT-5.4 with clear rubrics, human-in-the-loop QA, and audit trails; start with well-bounded, high-volume tasks; and measure effects on quality, speed, and risk. If OpenAI’s numbers hold up at scale, the gains won’t just save hours—they’ll reshape how expertise is applied across entire industries.

By Gregory Zuckerman
Gregory Zuckerman is a veteran investigative journalist and financial writer with decades of experience covering global markets, investment strategies, and the business personalities shaping them. His writing blends deep reporting with narrative storytelling to uncover the hidden forces behind financial trends and innovations. Over the years, Gregory’s work has earned industry recognition for bringing clarity to complex financial topics, and he continues to focus on long-form journalism that explores hedge funds, private equity, and high-stakes investing.
FindArticles © 2025. All Rights Reserved.