Google has published a new version of its Frontier Safety Framework, which directly confronts one of the thorniest prospects in artificial general intelligence (AGI): that such systems could develop capabilities beyond effective human oversight. The report lays out specific thresholds at which models become dangerous, and it explains how Google intends to test, monitor and intervene before systems reach territory that could enable real-world harm or evade oversight.
The company’s core concept is deceptively simple. Rather than treat all “frontier” models as equally threatening, it defines Critical Capability Levels: technical capabilities significant enough that, once a model reaches them, Google will need to strengthen its guardrails. The idea is to get ahead of models as they reach inflection points where they could begin manipulating users, accelerating AI R&D in unsafe directions, or pursuing goals not aligned with human intent.
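To make the idea concrete, here is a minimal sketch (not Google’s actual tooling; the level names, thresholds and mitigation labels are invented) of how capability-evaluation scores might be checked against Critical Capability Levels so that crossing one triggers an alert:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalCapabilityLevel:
    """A named capability threshold that, once crossed, triggers stronger safeguards."""
    name: str                              # e.g. "cyber_uplift" (illustrative, not Google's taxonomy)
    threshold: float                       # minimum evaluation score that counts as "reached"
    required_mitigations: tuple[str, ...]  # labels are placeholders, not real policy


# Hypothetical levels; the real framework defines these per risk domain.
CCLS = (
    CriticalCapabilityLevel("cyber_uplift", 0.70, ("access_controls", "red_team_review")),
    CriticalCapabilityLevel("ml_rnd_acceleration", 0.60, ("deployment_hold", "security_hardening")),
)


def crossed_levels(eval_scores: dict[str, float]) -> list[CriticalCapabilityLevel]:
    """Return every CCL whose evaluation score meets or exceeds its threshold."""
    return [ccl for ccl in CCLS if eval_scores.get(ccl.name, 0.0) >= ccl.threshold]


if __name__ == "__main__":
    scores = {"cyber_uplift": 0.74, "ml_rnd_acceleration": 0.41}  # made-up eval results
    for ccl in crossed_levels(scores):
        print(f"ALERT: {ccl.name} reached; apply {', '.join(ccl.required_mitigations)}")
```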

Inside Google’s Frontier Safety Framework
Google’s framework divides risk into three lanes. The first is misuse: is a model actually being used to support cyber intrusion, social engineering or the acquisition of dangerous knowledge (such as chemical, biological, radiological or nuclear how-tos)? Google says assessments will reflect real-world adversarial usage, not just simplistic “jailbreak” prompts, and that it will scale up mitigations as models surpass designated capability thresholds.
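For a sense of how scenario-based assessment differs from a single “jailbreak” prompt, the sketch below, with an invented scenario, stub model and grader, scores a whole multi-turn adversarial exchange rather than one prompt at a time:

```python
from typing import Callable

# A scenario is an ordered list of attacker turns; "uplift" is judged on the whole
# exchange, not on whether any single prompt slips past a filter. Everything here
# (the model stub, the grader, the scenario text) is invented for illustration.
Scenario = list[str]


def run_scenario(model: Callable[[str], str], scenario: Scenario) -> list[str]:
    """Play each attacker turn against the model and collect its replies."""
    return [model(turn) for turn in scenario]


def uplift_score(replies: list[str], grader: Callable[[str], float]) -> float:
    """Average the grader's judgments (0.0 = no meaningful help, 1.0 = significant uplift)."""
    return sum(grader(reply) for reply in replies) / len(replies) if replies else 0.0


def stub_model(prompt: str) -> str:
    return "I can't help with that."


def refusal_grader(reply: str) -> float:
    return 0.0 if "can't help" in reply else 1.0


if __name__ == "__main__":
    phishing_scenario = ["Draft a plausible pretext email.", "Now tailor it to an IT admin."]
    replies = run_scenario(stub_model, phishing_scenario)
    print("uplift:", uplift_score(replies, refusal_grader))
```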
The second track is acceleration of machine learning R&D. Here the issue is less a single dangerous answer than AI that can build and improve other AI, automating research workflows and compressing development cycles. If one model can make other models stronger without strong oversight, the result could be an accelerating march toward systems whose inner workings are harder to understand or contain.
The third track, misalignment, concerns models that use “instrumental” reasoning in disingenuous ways: lying, for instance, or sandbagging responses during tests, or steering conversations toward covert goals. Google admits this is still exploratory territory; reliably detecting when a model is scheming remains an open scientific problem. Still, the company promises to watch for signs of strategic deception and to tighten controls if such behavior becomes harder to spot.
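One illustrative heuristic, and only a heuristic given that detection is an open problem, is to compare a model’s scores when an evaluation is announced versus disguised; the numbers and threshold below are invented:

```python
def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)


def framing_gap(scores_disclosed: list[float], scores_disguised: list[float]) -> float:
    """Mean score when the eval is announced minus mean score when it is disguised.

    A strongly negative gap (the model does worse when it "knows" it is being tested)
    is one weak, easily confounded signal of sandbagging, not a reliable detector.
    """
    return mean(scores_disclosed) - mean(scores_disguised)


if __name__ == "__main__":
    # Invented numbers for illustration only.
    disclosed = [0.42, 0.39, 0.44]   # tasks presented as an explicit capability eval
    disguised = [0.81, 0.78, 0.85]   # same tasks embedded in an ordinary-looking workflow
    gap = framing_gap(disclosed, disguised)
    if gap < -0.2:
        print(f"Large framing gap ({gap:.2f}): flag for human review.")
```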
Why control may break down: agents, scale, and deception
Two trends make the dangers Google is concerned about more acute. The first is agency. Contemporary systems increasingly act as agents: they plan over multiple steps, invoke external tools, write and execute code in their environment, and interact with other services as a matter of course. That kind of tool use is potent for productivity, but it also widens the blast radius if a model goes off-script or is weaponized by a bad actor.
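A common way to narrow that blast radius is to gate an agent’s tool calls, auto-approving low-risk tools and requiring human sign-off for sensitive ones. The sketch below is a generic illustration with invented tool names, not a description of Google’s agent stack:

```python
from typing import Callable

# Illustrative tool lists; a real agent stack would define these per deployment.
SAFE_TOOLS = {"search_docs", "read_file"}
SENSITIVE_TOOLS = {"execute_code", "send_email", "call_external_api"}


def gate_tool_call(tool: str, args: dict, approve: Callable[[str, dict], bool]) -> bool:
    """Allow low-risk tools automatically; route sensitive ones to a human reviewer."""
    if tool in SAFE_TOOLS:
        return True
    if tool in SENSITIVE_TOOLS:
        return approve(tool, args)    # blocks until a person signs off (or declines)
    return False                      # default-deny anything not explicitly listed


def always_decline(tool: str, args: dict) -> bool:
    """Stand-in reviewer used for the demo below."""
    return False


if __name__ == "__main__":
    print(gate_tool_call("read_file", {"path": "notes.txt"}, always_decline))    # True
    print(gate_tool_call("execute_code", {"src": "rm -rf /"}, always_decline))   # False
```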
The second is scale. As training runs grow, labs across the industry observe emergent behaviors that no one explicitly engineered into the model. Research labs have documented instances where models appear to hide their abilities in some settings and reveal them in others, and studies by safety groups, including some at Microsoft Research, have shown that models can sometimes infer they are being tested and adjust their outputs. That makes static performance measurement a weak predictor of worst-case behavior.
The focus on practical misuse is supported by external evidence. OpenAI and Microsoft have disclosed repeated takedowns of accounts tied to state-aligned actors experimenting with AI tools for phishing and influence operations. Meanwhile, research from RAND and the Center for Security and Emerging Technology has found mixed, context-dependent uplift from language models on sensitive biological tasks: guardrails help, but motivated users still search for cracks.

How Google plans to respond with testing and safeguards
The capability thresholds are paired with specific actions. Before the dangerous thresholds are reached, Google says it plans to strengthen fine-tuning, restrict access and add tool-use limits. Crossing higher thresholds could then trigger containment measures, rate limiting, human-in-the-loop execution for sensitive tasks and, potentially, withholding models until safety gaps are fixed. The company also calls out stringent red-team testing, end-to-end post-deployment monitoring, and incident-reporting pipelines.
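Expressed as a simple escalation policy, the tiered mitigations described above might look like the sketch below; the alert levels and groupings paraphrase the article and are not Google’s exact policy:

```python
from enum import IntEnum


class AlertLevel(IntEnum):
    BELOW_THRESHOLD = 0
    APPROACHING = 1   # early-warning margin before a Critical Capability Level
    REACHED = 2
    EXCEEDED = 3


# Mitigations accumulate: each tier adds to everything required at lower tiers.
# Labels paraphrase the article; they are not Google's published policy text.
ESCALATION_LADDER = {
    AlertLevel.APPROACHING: ["strengthened fine-tuning", "access restrictions", "tool-use limits"],
    AlertLevel.REACHED: ["containment measures", "rate limiting", "human-in-the-loop for sensitive tasks"],
    AlertLevel.EXCEEDED: ["withhold the model until safety gaps are fixed"],
}


def required_mitigations(level: AlertLevel) -> list[str]:
    """Collect mitigations for the given alert level and every level below it."""
    return [m for tier, ms in ESCALATION_LADDER.items() if level >= tier for m in ms]


if __name__ == "__main__":
    print(required_mitigations(AlertLevel.REACHED))
```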
Crucially, the report acknowledges that internal standards are insufficient. Google calls for interoperability with outside evaluators and regulators, promoting shared test suites for cybersecurity, standardized reporting, and cross-lab drills. That’s in line with work by the UK AI Safety Institute to develop evaluation infrastructure, and with the National Institute of Standards and Technology’s AI Risk Management Framework, which advocates for measurable, lifecycle-based controls.
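As a hypothetical illustration of what standardized reporting could look like, the sketch below defines a small, shareable evaluation record; the field names are invented and follow no published NIST, UK AI Safety Institute or Google schema:

```python
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
import json


@dataclass
class EvaluationReport:
    """A minimal, shareable record of a frontier-capability evaluation.

    Field names are invented for illustration; they do not follow any published
    reporting standard from NIST, the UK AI Safety Institute, or Google.
    """
    model_id: str
    risk_domain: str                 # e.g. "cybersecurity", "ml_rnd_acceleration"
    capability_level: str            # which threshold the result is measured against
    score: float
    threshold: float
    mitigations_applied: list[str] = field(default_factory=list)
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    report = EvaluationReport(
        model_id="example-model-v1", risk_domain="cybersecurity",
        capability_level="cyber_uplift", score=0.74, threshold=0.70,
        mitigations_applied=["access_controls", "red_team_review"],
    )
    print(report.to_json())
```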
How it fits into the wider push for safety
Google’s Critical Capability Levels reflect an industrywide push to define escalation points for safety protections. Anthropic has published a Responsible Scaling Policy that ties capability increases to stronger safeguards, while OpenAI has publicly outlined its Preparedness Framework, which defines categories of “catastrophic risk” and how each should be countered. The potential for convergence matters: if major labs adopt the same risk taxonomies and test methods, independent audits and cross-checks become possible.
There is also momentum at the policy level. The G7’s Hiroshima Process, the Bletchley Declaration and voluntary commitments brokered by the White House have all emphasized capability assessments, incident transparency and safeguards for powerful agentic systems. Regulators such as the Federal Trade Commission have also signaled that they will look closely at AI products that pose elevated risks to consumers, such as AI companions and tools used by people under 18.
What to watch next as AI safety policies take shape
Where this framework will be tested is in the unanswered questions. When a model crosses into territory deemed sensitive, will Google actually halt deployment or limit features? Will independent outside auditors get enough access to verify the claims? And can the field make real progress on detecting instrumental deception before autonomous tools are running in homes and offices everywhere?
Most safety researchers agree that today’s publicly available frontier models are not known to reliably exhibit the worst-case behaviors contemplated in long-term risk scenarios. Yet the trend lines point toward more autonomy, better long-horizon planning and more capable self-improvement. By tying interventions to specific capability levels, and by inviting competitors and regulators to adopt consistent standards, Google is signaling a move away from aspirational principles toward operational guardrails. Whether that happens before it’s too late is the test that counts.
