AI agents are racing into the enterprise with scant guardrails, according to a new MIT-led analysis that finds widespread gaps in safety testing, transparency, and basic shutdown controls. Reviewing 30 widely used “agentic” systems, the research team concludes today’s agents are fast, loose, and far less governable than their marketing suggests—just as businesses begin wiring them into email, browsers, and core workflows.
Inside the MIT-led survey of deployed agentic AI systems
The report, The 2025 AI Index: Documenting Sociotechnical Features of Deployed Agentic AI Systems, was authored by Leon Staufer of the University of Cambridge with collaborators from MIT, the University of Washington, Harvard University, Stanford University, the University of Pennsylvania, and The Hebrew University of Jerusalem. Rather than running lab tests, the team systematically annotated public documentation, demos, governance papers, and product sites, supplemented by limited hands-on checks, to evaluate how real products describe their capabilities and controls.

The 30 systems span three categories: enhanced chatbots, AI-enabled browsers and extensions, and enterprise platforms. Despite the diversity, most are powered by a small set of closed frontier models—primarily GPT, Claude, and Gemini—raising systemic risk if common failure modes propagate across many agents.
Key findings: transparency and control gaps
Across eight disclosure categories, most vendors provide little or no detail on risks, evaluations, or monitoring. Basic observability is often missing. The authors flag that, for many enterprise agents, it’s unclear whether fine-grained execution traces even exist—making it difficult to reconstruct what an agent did, why it did it, or who is accountable.
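To make the gap concrete, here is a minimal sketch of what a fine-grained execution trace could look like: one structured record per agent action, so operators can later reconstruct what an agent did and why. The field names and agent identifier are hypothetical, not drawn from any vendor's actual logging schema.

```python
# Illustrative sketch of a per-action execution trace (field names are hypothetical).
import json
import datetime

def trace_action(agent_id: str, action: str, target: str, reason: str) -> str:
    """Emit one structured audit record for a single agent action."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent_id": agent_id,  # which agent acted
        "action": action,      # what it did
        "target": target,      # what it acted on
        "reason": reason,      # the agent's stated rationale
    }
    # In production this line would go to an append-only audit log.
    return json.dumps(record)

print(trace_action("crm-agent-7", "send_email", "customer@example.com",
                   "follow up on open ticket"))
```

Even a log this simple answers the three questions the authors raise: what the agent did, why it did it, and which agent is accountable.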
Resource usage is another blind spot. Twelve of 30 systems either offer no usage monitoring or only notify customers when rate limits are hit, undermining the budgeting and capacity planning enterprises need. Identification is also weak: most agents do not reliably disclose their AI nature to end users or third parties, for example via watermarking or by honoring robots.txt—blurring the line between human and automated activity on the web.
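Honoring robots.txt is one of the simpler identification behaviors the authors look for, and Python's standard library already supports it. The sketch below, with a hypothetical agent token, shows an agent checking a site's robots.txt under its own name before fetching a page:

```python
# Sketch of an agent honoring robots.txt under its own declared name.
# The token "example-agent" is hypothetical; a real agent should publish
# a stable, documented User-Agent token that sites can target.
from urllib.robotparser import RobotFileParser

AGENT_NAME = "example-agent"

def is_allowed(robots_txt: str, url: str) -> bool:
    """Parse a site's robots.txt and check whether this agent may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(AGENT_NAME, url)

robots = """
User-agent: example-agent
Disallow: /checkout/
"""
print(is_allowed(robots, "https://example.com/products"))   # allowed
print(is_allowed(robots, "https://example.com/checkout/"))  # disallowed
```

The point is that sites can only set rules like the `Disallow` above if agents identify themselves consistently, which is exactly the disclosure the survey finds missing.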
Perhaps most troubling, several products lack documented ways to stop an autonomous run once it begins. The study cites offerings such as Alibaba’s MobileAgent, HubSpot’s Breeze, IBM’s watsonx, and n8n automations as having no clear per-agent stop mechanism in public docs; in some enterprise platforms the only option appears to be halting all agents at once. In high-stakes environments, the absence of a targeted “off switch” is a risk multiplier.
Real products, real consequences in deployed agents
Agentic tools are not theoretical. OpenClaw, an open-source framework that drew attention for enabling email-sending and other autonomous tasks, has also exposed stark security trade-offs, including the potential to hijack a user’s machine if poorly configured. The ecosystem is moving quickly: OpenAI recently hired OpenClaw’s creator, Peter Steinberger. Yet operational safeguards often lag behind.
The report contrasts product approaches. OpenAI’s ChatGPT Agent, for instance, cryptographically signs browser requests for traceability, a step toward accountable automation. By comparison, the researchers say Perplexity’s Comet AI browser lacks documented agent-specific safety evaluations, third-party testing, or robust sandboxing in public materials. Perplexity has pushed back, saying reported issues were responsibly disclosed and patched, and that a separate dispute with Amazon over bot identification is a contractual matter rather than a safety failure.
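The general idea behind signed agent requests can be sketched with a keyed hash. This is an illustrative HMAC-based scheme under assumed header names, not OpenAI's actual implementation: the agent signs each request with a secret key, and a receiving site recomputes the signature to confirm which agent acted.

```python
# Hedged sketch of request signing for traceability. Header names, key
# handling, and the payload layout are illustrative assumptions, not any
# vendor's real scheme.
import hmac
import hashlib
import time

SIGNING_KEY = b"demo-key"  # in practice, a managed per-agent secret

def sign_request(method: str, url: str, agent_id: str, ts: int) -> dict:
    """Attach identity headers and an HMAC-SHA256 signature to a request."""
    payload = f"{method}\n{url}\n{agent_id}\n{ts}".encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"X-Agent-Id": agent_id, "X-Agent-Timestamp": str(ts),
            "X-Agent-Signature": sig}

def verify_request(method: str, url: str, headers: dict) -> bool:
    """A receiving site recomputes the signature to confirm the agent's identity."""
    payload = (f"{method}\n{url}\n{headers['X-Agent-Id']}\n"
               f"{headers['X-Agent-Timestamp']}").encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, headers["X-Agent-Signature"])

headers = sign_request("GET", "https://example.com/page", "demo-agent",
                       int(time.time()))
print(verify_request("GET", "https://example.com/page", headers))  # True
```

A signature like this gives every automated request a verifiable origin, which is what makes downstream accountability possible.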

Enterprise buyers face mixed signals. HubSpot’s Breeze agents advertise compliance certifications such as SOC 2, GDPR, and HIPAA, yet the study notes limited public detail about security testing methodologies. IBM, for its part, contests the survey’s characterization, asserting it provides extensive documentation on observability, deterministic controls, and evaluation frameworks, and that it is engaging with the researchers to address perceived inaccuracies.
How the industry responded to the survey findings
The research team contacted companies over a four-week period; roughly 25% responded and only 10% offered substantive comments that were incorporated, according to the paper. The authors predict governance challenges will intensify as agents gain capability, pointing to fragmented ecosystems, tensions around web conduct, and the absence of agent-specific benchmarks as unresolved roadblocks.
What it means for enterprises deploying agentic AI
Agentic AI can already triage customer tickets, process purchase orders, and orchestrate multi-step workflows, precisely the automations that drive ROI. But the study’s core message is clear: without disclosure, monitoring, and reliable stop controls, organizations can’t manage risk at scale. Recent guidance from analyst firms such as Gartner, urging caution with AI-enabled browsers, underscores that sentiment.
Practical steps are available now:
- Insist on fine-grained logs and signed requests for actions taken on your behalf
- Require sandboxing and least-privilege access for tools that can read email, browse, or write to internal systems
- Verify vendor-run red teaming and independent evaluations
- Demand per-agent kill switches
- Ensure agents visibly identify themselves online
These aren’t nice-to-haves—they’re table stakes for accountable automation.
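The per-agent kill switch demanded above can be sketched with a standard cancellation pattern: each run carries its own stop signal, checked before every action, so an operator can halt one agent without shutting down the fleet. The class and method names here are illustrative, not taken from any vendor's platform.

```python
# Minimal sketch of a per-agent "off switch" using a per-run stop event.
import threading

class AgentRun:
    """One agent run with its own targeted stop control."""

    def __init__(self, name: str):
        self.name = name
        self._stop = threading.Event()

    def stop(self) -> None:
        self._stop.set()  # halts only this run, not other agents

    def execute(self, actions: list) -> list:
        completed = []
        for action in actions:
            if self._stop.is_set():  # checked before every step
                break
            completed.append(action)
        return completed

run_a = AgentRun("invoice-bot")
run_b = AgentRun("email-bot")
run_b.stop()  # operator halts just this agent
print(run_a.execute(["draft", "review", "send"]))  # runs to completion
print(run_b.execute(["draft", "review", "send"]))  # does nothing
```

Contrast this with the platforms the study describes, where the only documented option is halting all agents at once.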
The takeaway from the MIT-led survey is not that agents are doomed, but that governance must catch up to capability. Vendors building on frontier models, and buyers deploying them into real workflows, will need to close the documentation and control gaps quickly—or expect regulators to do it for them.
