I combined two AI tools, one that could diagnose a system-level failure and another that could write the code to fix it, and together they resolved a bug I couldn't crack alone.
Deep Research identified the root of the problem. Codex produced the fix. Neither would have been possible without framing and verification by humans.
- A Subtle Slowdown That Turned Into A Site Breaker
- Codex Was Ineffective at System-Level Diagnosis
- Deep Research Turned Up The Regressions That Mattered
- Human Intervention Turned Insight Into A Stable Fix
- Why Two-Tool Workflows Are More Effective Than One
- Practical Takeaways For WordPress Developers
- The Bottom Line On Using Dual AI Tools And Human Review
A Subtle Slowdown That Turned Into A Site Breaker
It began as a faint morning annoyance: the dashboard would stall for 15–20 seconds after the site had been sitting idle. On my dev machine, this was sporadic and hard to reproduce. When I rolled the update out to a busy production server, it went from a minor irritation to a hard outage. Every click froze the front end and the admin for a minute or more.
With WordPress unresponsive, I used my host's file manager to disable the plugin. The site came back to life instantly, which proved the issue was inside my update. That mattered: the base product runs on more than 20,000 sites, and we can't ship a regression like this.
Codex Was Ineffective at System-Level Diagnosis
Codex is an excellent pair programmer for focused tasks, but what I needed here was a systems autopsy. I asked it to instrument startup and log every hook, delay, and call. The telemetry looked normal, and the bug only reproduced reliably after long idle windows: one attempt per day, so iteration was painfully slow.
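For readers who haven't done this, here is a rough sketch of the kind of instrumentation involved, assuming a standard WordPress plugin and the generic `all` hook; the threshold and log format are illustrative, not what Codex actually generated:

```php
<?php
/*
 * Illustrative instrumentation sketch (debugging only; the 'all' hook fires
 * thousands of times per request, so remove this once you have your answer).
 */
$my_plugin_request_start = microtime( true );

// Log any hook that fires unusually late in the request lifecycle.
add_action( 'all', function () use ( $my_plugin_request_start ) {
	$elapsed = microtime( true ) - $my_plugin_request_start;
	if ( $elapsed > 1.0 ) {
		error_log( sprintf( '[my-plugin] %.3fs  %s', $elapsed, current_filter() ) );
	}
} );
```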
Large, fuzzy prompts fared poorly. Codex does not keep project history across sessions, so it could not reason about when the regression was introduced. I went through roughly 20 prompts. Each produced a lengthy "fix," and none of them stopped the freeze. This was not a code-snippet problem; it was a whole-system problem.
Deep Research Turned Up The Regressions That Mattered
Deep Research is good at seeing the forest for the trees. I gave it repository access and, importantly, the distribution zips: one known-good 3.2 release and the problematic 4.0 build. I asked for a delta analysis restricted to what had changed since 3.2, which turned out to be the most useful constraint I could provide.
The verdict: the plugin was confirming that its robots.txt actually existed, and rechecking that state, on every request. That should be a one-time check that sets capability flags, not something that runs on every page load. On a high-traffic site, those checks piled blocking I/O onto every PHP request, turning normal traffic into a queue of stalled workers and starving the server.
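Reconstructed purely for illustration (the hook, option name, and exact probe are my assumptions; the plugin's real code differs), the regression's shape was roughly this:

```php
<?php
/*
 * Illustrative reconstruction of the regression's shape, not the actual
 * plugin code: a robots.txt probe and a state write running on every request.
 */
add_action( 'init', function () {
	// 'init' fires on every front-end and admin request.
	$has_physical_robots = file_exists( ABSPATH . 'robots.txt' );     // disk I/O per request
	update_option( 'my_plugin_robots_state', $has_physical_robots );  // DB write per request
} );
```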
The clue matched what we saw in the field: dev was "mostly fine," while production, where requests arrive non-stop, started choking almost instantly. The regression didn't have to be slow in isolation; it just had to happen often. Deep Research found the needle by comparing releases, not by rummaging through everything indiscriminately.
Human Intervention Turned Insight Into A Stable Fix
With an accurate diagnosis in hand, I guided Codex toward a safe solution. We cached the robots.txt status once at startup and added an explicit admin control to recheck it on demand, so there was no more per-request probing. Codex produced the patch quickly; I polished the edge cases and tested the admin UI.
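A minimal sketch of that pattern, assuming WordPress's Options API and hypothetical my_plugin_* names rather than the plugin's real identifiers:

```php
<?php
/*
 * Minimal sketch of the fix pattern, assuming the Options API and hypothetical
 * my_plugin_* names; not the plugin's actual implementation.
 */

// Probe robots.txt once and cache the result.
function my_plugin_refresh_robots_state() {
	$has_robots = file_exists( ABSPATH . 'robots.txt' );
	update_option( 'my_plugin_has_robots', $has_robots ? 'yes' : 'no' );
	return $has_robots;
}

// Run the probe at activation instead of on every request.
register_activation_hook( __FILE__, 'my_plugin_refresh_robots_state' );

// Per-request code reads the cached flag; no filesystem access.
function my_plugin_has_robots() {
	return 'yes' === get_option( 'my_plugin_has_robots', 'no' );
}

// Explicit admin control to recheck on demand (e.g. a "Recheck" button posting here).
add_action( 'admin_post_my_plugin_recheck_robots', function () {
	if ( current_user_can( 'manage_options' ) ) {
		my_plugin_refresh_robots_state();
	}
	wp_safe_redirect( wp_get_referer() ? wp_get_referer() : admin_url() );
	exit;
} );
```

Caching in an option keeps the hot path to a single autoloaded database read, and the admin-post handler gives site owners a deliberate way to refresh the flag when their robots.txt setup changes.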
The updated build has been running smoothly for days under real traffic. The workflow that actually worked was thoroughly hybrid: a diagnostic LLM to localize the systemic fault, a coding LLM to apply a tightly scoped change (not painlessly, as noted above), and a human to draw boundaries, verify assumptions, and reject silver-bullet answers.
Why Two-Tool Workflows Are More Effective Than One
Different AI systems have different strengths. Diagnostic models excel at comparing versions, ranking risks, and tracing architecture-level causes. Coding assistants are great at producing tightly scoped code once the problem is crisply stated. Ask one model to do both, and you tend to get either hazy analysis or heavy-handed patches.
The stakes in WordPress land are high: more than 40% of the web runs on WordPress, according to W3Techs. Data from GitHub's Octoverse and Stack Overflow's Developer Survey suggest that developers are adopting AI coding tools widely, but they continue to worry about correctness and maintainability, two areas where human review is critical.
That squares with guidance from the likes of NIST and research from Google: AI can magnify strengths and weaknesses alike. Without explicit requirements, representative test data, and human judgment in the loop, you can break code quickly and at scale.
Practical Takeaways For WordPress Developers
- Reproduce the issue on production-like traffic and data. My dev box masked the blast radius; under real traffic, the failure was obvious almost immediately.
- Use diffs as guardrails. Giving Deep Research both a broken build and a known-good one narrowed its investigation and avoided wild-goose chases.
- Instrument first, then speculate. Lightweight telemetry ruled out cron jobs and plugin conflicts and pointed to per-request work as the culprit.
- Split responsibilities across models. Let a diagnostic LLM diagnose, let a coding LLM apply bounded changes to the root cause, and keep humans in charge of problem framing and acceptance criteria.
The Bottom Line On Using Dual AI Tools And Human Review
AI didn’t “solve it for me.” It solved it with me. Deep Research found the systemic regression, Codex shipped a targeted patch, and human boundaries kept them both on task. That combination turned a show-stopping bug into a manageable fix and preserved the productivity gains that made using AI worth it in the first place.