The company is releasing a specialized version of GPT-5, tailored for its AI coding agent, which we will call GPT-5-Codex (not an official name). It is designed to vary its "thinking" time with the demands of real-world software tasks: the model can take anywhere from a few seconds to seven hours to complete a coding task, and initial results suggest gains on agentic coding benchmarks as well as large-scale refactoring tasks, according to the company.
Rather than relying on a router that pre-allocates resources when a query arrives, the model adjusts its compute allocation as the task unfolds, according to Alexander Embiricos, product lead for OpenAI's Codex. As a result, GPT-5-Codex can ramp up its effort mid-task, realizing minutes in that a problem is "worth solving for another hour," the company says, an approach it claims has produced more stable end-to-end completions on complex repos.
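OpenAI has not published how the mid-task adjustment actually works, but the general shape of "escalate effort as you go" rather than "route once up front" can be sketched. Everything below, including the `attempt_solution` and `verify` stand-ins, is a hypothetical illustration, not OpenAI's implementation:

```python
import time

# Stand-ins for a model call and a verification step (e.g. compile + tests).
# In this toy setup, only a high-effort pass succeeds.
def attempt_solution(task, effort):
    return {"task": task, "effort": effort}

def verify(attempt):
    return attempt["effort"] == "high"

def solve_with_dynamic_effort(task, max_seconds=60):
    """Hypothetical sketch: start cheap, and escalate effort mid-task
    instead of pre-allocating compute with an up-front router."""
    effort = "low"
    start = time.monotonic()
    while time.monotonic() - start < max_seconds:
        attempt = attempt_solution(task, effort)
        if verify(attempt):
            return attempt
        # The cheap pass failed, so the problem is "worth solving for
        # another hour": raise the effort level and try again.
        effort = {"low": "medium", "medium": "high"}.get(effort, "high")
    return None  # time budget exhausted
```

The key design point is that the escalation decision happens inside the loop, after a verification signal, rather than once at query onset.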
Why dynamic compute matters for coding agents
Software work is spiky. Some issues can be resolved with a one-line function edit; others require massaging dependencies or integrating dozens of services; still others are buried under long test cycles. Flat compute budgets tend to underserve those long-tail issues, which is where agents most often stall during integration tests or give up on reproducing a bug. Letting the model work the problem for longer addresses that failure mode directly.
When an AI plans, executes, and verifies its own steps, as in agentic workflows, dynamic thinking time helps with iterative refactoring, flaky-test triage, and multi-file changes that span more than one pass. It also maps better to how senior engineers work: they spend more time on outages with ambiguous failure modes and less on routine pings.
Benchmarks and early performance
OpenAI reports that GPT-5-Codex surpasses the baseline GPT-5 on SWE-bench Verified, a widely used benchmark for agentic coding, as well as on refactoring tasks drawn from large, established codebases. The company also trained the model specifically for code review, and it claims experienced engineers rated its comments as containing fewer incorrect notes and a greater share of "high-impact" findings, meaning feedback that changes code quality or architecture decisions.
Longer planning and verification loops, more effective retrieval across large codebases, and more cautious test execution are probably responsible for the uptick. In practice, that means fewer partial fixes and more fully formed patches that compile, pass tests, and conform to the project’s conventions.
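The acceptance criteria described here (compiles, passes tests, conforms to conventions) can be expressed as a simple gate that rejects partial fixes. This is an illustrative sketch; the commands are project-specific placeholders, and the injectable `run` parameter exists only to make the gate easy to demonstrate:

```python
import subprocess

# Placeholder commands; swap in your project's actual build system.
CHECKS = [
    ["make", "build"],   # the patch must compile
    ["make", "test"],    # the patch must pass tests
    ["make", "lint"],    # the patch must follow project conventions
]

def accept_patch(repo_dir, run=subprocess.run, checks=CHECKS):
    """Hypothetical gate: accept an agent's patch only when every
    check exits cleanly, and report the first step that fails."""
    for cmd in checks:
        result = run(cmd, cwd=repo_dir, capture_output=True)
        if result.returncode != 0:
            return False, cmd  # partial fix: reject early
    return True, None
```

A gate like this is what separates "fully formed patches" from a diff that merely looks plausible.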
Rollout and access
GPT-5-Codex is shipping across Codex experiences in the terminal, IDE integrations, GitHub-connected workflows, and ChatGPT. It's available to ChatGPT Plus, Pro, Business, Edu, and Enterprise users (API access is on the roadmap). Teams should also expect larger variance in latency: dynamic runtimes can range from seconds to hours, depending on the difficulty of the task.
Companies will care about governance controls: timeouts, budget limits, and audit logs for long-running jobs. OpenAI did not specify default ceilings, so organizations will likely need to set policies around maximum runtime, artifact retention, and when an agent may trigger expensive test suites or CI pipelines.
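The three controls named above (timeouts, budget limits, audit logs) compose naturally into a policy wrapper around a long-running job. This is a minimal sketch under the assumption that a job can be modeled as a stream of `(cost, description)` steps; none of this reflects an actual OpenAI API:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-audit")

class BudgetExceeded(Exception):
    pass

def run_governed(job, *, max_seconds=3600, max_cost_usd=5.0):
    """Hypothetical policy wrapper: enforce a runtime ceiling and a
    spend ceiling on an agent job, logging every step for audit."""
    start = time.monotonic()
    spent = 0.0
    for cost, desc in job:  # job yields (cost_usd, description) steps
        elapsed = time.monotonic() - start
        if elapsed > max_seconds:
            raise BudgetExceeded(f"runtime ceiling hit after {elapsed:.0f}s")
        if spent + cost > max_cost_usd:
            raise BudgetExceeded(f"spend ceiling hit at ${spent:.2f}")
        spent += cost
        # Structured audit record: what ran, what it cost, running total.
        log.info(json.dumps({"step": desc, "cost": cost, "spent": spent}))
    return spent
```

For example, a job of steps costing $1.00, $2.00, and $1.50 completes under a $10 cap but is cut off mid-run under a $2.50 cap.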
A crowded (AI) market for coding
The upgrade arrives in a highly competitive category alongside GitHub Copilot, Claude Code, and Cursor. Industry coverage has noted Cursor's rapid trajectory, with revenue reportedly rocketing past the half-billion-dollar ARR mark this year, while Windsurf's tumultuous acquisition saga underlined just how competitive the market for AI-first code editors has become.
The customer appetite is real. Research by GitHub has found that developers finish tasks up to 55% faster with AI pair programming, while survey data from GitHub and Stack Overflow has shown a large majority of developers either using or considering AI coding tools. According to McKinsey, AI could increase software engineering productivity significantly and have a substantial effect on time-to-market and defect rates.
Code review and safety concerns
Code review is high-signal but high-friction, so a model that suppresses low-value comments while surfacing legitimate risks could let reviewers move through it faster. Teams that combine protected branches with policy checks can route GPT-5-Codex to propose change sets, annotate diffs, and flag security issues before humans get involved, minimizing noise on pull requests.
That said, automated reviews still require guardrails. Guidance from groups such as the Open Source Security Foundation and NIST recommends secure defaults, dependency hygiene, and secret scanning. Combining GPT-5-Codex with SAST, SBOM generation, and identity-aware approvals helps keep "agentic" changes secure and auditable.
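One concrete guardrail is a triage layer between model-generated review comments and the pull request: drop low-confidence nitpicks, surface confident suggestions, and escalate anything security-tagged so it blocks the merge. The comment schema and tag set below are illustrative assumptions, not any tool's actual output format:

```python
# Hypothetical tag set; real SAST/review tooling defines its own taxonomy.
SECURITY_TAGS = {"injection", "secrets", "auth", "crypto"}

def triage_comments(comments):
    """Sort model review comments into blocking vs. advisory buckets.
    Each comment is assumed to look like:
    {"text": str, "tags": [str, ...], "confidence": float in [0, 1]}."""
    blocking, advisory = [], []
    for c in comments:
        if SECURITY_TAGS & set(c.get("tags", [])):
            blocking.append(c)   # must be resolved before merge
        elif c.get("confidence", 0) >= 0.8:
            advisory.append(c)   # high-confidence, non-blocking suggestion
        # everything else is dropped as probable noise
    return blocking, advisory
```

Note that security findings block regardless of confidence: under-reporting a real risk is worse than one extra human look.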
What to watch next
The big questions now: how API access will expose fine-grained controls over runtime and cost; how the model scales to massive monorepos under CI load; and whether rivals follow with dynamic compute strategies of their own. For engineering leaders, the pragmatic takeaway is clear: long-horizon thinking is moving from research into day-to-day tooling, and the teams that pair it with well-suited control systems will see the rewards first.