Google DeepMind has released its Computer Use model, an evolution of Gemini 2.5 Pro that browses the web more like a human user. Rather than calling behind-the-scenes APIs, it looks at pages, clicks buttons, types into fields and scrolls through content, narrating what it is doing along the way.
The goal is simple but ambitious: let AI perform real tasks inside real websites with minimal hand-holding, while keeping humans informed and in control. Developers can access it now through the Gemini API and Vertex AI, with a public demo available on Browserbase.
- How Gemini Computer Use Works Across Real Websites
- What Gemini Computer Use Can Accomplish Today
- Performance and benchmarks on real-world browsing tasks
- Safety guardrails and known limits for AI agents
- How It Stacks Up Against Other AI Agents and Tools
- Availability and what to try first with Gemini agents

How Gemini Computer Use Works Across Real Websites
Give the model a natural-language instruction such as "Open Wikipedia, find Atlantis and summarize the history of its myth," and it fetches the page, takes screenshots and analyzes the interface. It reads what you would see on screen and figures out which elements to interact with, from search boxes to dropdowns and pagination controls.
Behind the scenes, it runs an iterative loop: after each action (click, type, scroll), the model re-examines the page state to decide what comes next. That short-term memory of prior actions is critical for UI work, where text changes, modals pop up and elements move. The loop repeats until the goal is satisfied or the model needs an answer from a human.
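To make the observe-decide-act loop concrete, here is a minimal sketch. The `Action`, `Browser` and `model` names are illustrative assumptions for this article, not the actual Gemini API surface.

```python
# Minimal sketch of the agent loop: observe the page, let the model decide,
# act in the browser, then re-observe. Names here are illustrative.
from dataclasses import dataclass
from typing import Callable, Protocol

@dataclass
class Action:
    kind: str      # "click" | "type" | "scroll" | "ask_user" | "done"
    payload: dict  # e.g. {"x": 120, "y": 300} or {"text": "Atlantis"}

class Browser(Protocol):
    def screenshot(self) -> bytes: ...
    def perform(self, action: Action) -> None: ...

def run_agent(goal: str,
              model: Callable[[str, bytes, list], Action],
              browser: Browser,
              max_steps: int = 25) -> str:
    history: list[Action] = []               # short-term memory of past actions
    for _ in range(max_steps):
        shot = browser.screenshot()          # observe the current page state
        action = model(goal, shot, history)  # model proposes the next UI action
        if action.kind == "done":
            return action.payload.get("answer", "")   # goal reached
        if action.kind == "ask_user":
            return input(action.payload["question"])  # hand control to a person
        browser.perform(action)              # click, type or scroll in the page
        history.append(action)               # the loop then re-observes
    raise TimeoutError("agent did not finish within the step budget")
```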
The approach is closer to how people move through sites than traditional integrations. Rather than encoding site-specific scripts, it relies on general visual and structural cues in how pages are laid out. It builds on earlier Google experiments such as Project Mariner and fits the larger trend in AI toward agents that take actions, as opposed to static chat.
What Gemini Computer Use Can Accomplish Today
In Google’s demos, the agent is shown updating a record on a customer relationship management dashboard and reordering content in Jamboard’s interface. These aren’t contrived examples; they are real-world cases involving nested menus, confirmation dialogs and validating changes.
Some common scenarios are:
- Collecting data from multiple tabs
- Filling out multi-step forms
- Controlling e-commerce carts
- Scheduling a doctor’s appointment
- Cleaning up shared documents
If an instruction is sensitive — for example, “purchase this item” — the model can stop and request explicit approval first.
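A client integration might enforce that approval step itself. The sketch below reuses the illustrative `Action` and `Browser` types from the earlier loop; the set of sensitive action kinds is an assumption, not Google's actual list.

```python
# Client-side confirmation gate for sensitive steps such as purchases.
SENSITIVE = {"purchase", "submit_payment", "delete_record", "export_data"}

def maybe_execute(action, browser) -> bool:
    """Run the action only if it is low-risk or a human explicitly approves it."""
    if action.kind in SENSITIVE:
        reply = input(f"Agent wants to '{action.kind}'. Approve? [y/N] ")
        if reply.strip().lower() != "y":
            return False          # declined: the agent must replan or stop
    browser.perform(action)
    return True
```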
Performance and benchmarks on real-world browsing tasks
Google reports that the model outperforms competing offerings from Anthropic and OpenAI across a mix of web and mobile control benchmarks, including Online-Mind2Web, a benchmark for evaluating agents on a variety of real-world browsing tasks.
The announcement also touts improvements in task accuracy and latency.

Success rates on benchmarks matter, but so does responsiveness. Google’s public videos are sped up, so real-world timing will vary with page complexity, network conditions and the number of steps required. For enterprise rollouts, teams will want to test target workflows end-to-end and track success rates, retries and median time-to-completion, as in the sketch below.
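Here is one way a pilot team might record runs and compute those three metrics. The `RunRecord` fields and the example workflow name are assumptions for illustration.

```python
# Log pilot runs and summarize success rate, retries and median time-to-completion.
import statistics
from dataclasses import dataclass

@dataclass
class RunRecord:
    workflow: str      # which task was attempted, e.g. a CRM update
    succeeded: bool    # did the agent reach the goal state?
    retries: int       # how many times a step had to be retried
    seconds: float     # wall-clock time for the whole run

def summarize(runs: list[RunRecord]) -> dict:
    finished = [r for r in runs if r.succeeded]
    return {
        "success_rate": len(finished) / len(runs) if runs else 0.0,
        "avg_retries": statistics.mean(r.retries for r in runs) if runs else 0.0,
        "median_time_s": statistics.median(r.seconds for r in finished) if finished else None,
    }

# Example: three end-to-end runs of the same hypothetical workflow.
print(summarize([
    RunRecord("crm_update", True, 0, 41.2),
    RunRecord("crm_update", True, 1, 63.8),
    RunRecord("crm_update", False, 2, 95.0),
]))
```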
Safety guardrails and known limits for AI agents
Overall, the release emphasizes control. Developers can limit what the agent can do, prevent it from circumventing CAPTCHAs, deny it access to certain pages and require a confirmation step for actions such as making purchases or exporting data. Activity is also logged so it can be audited, a must in regulated industries.
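Conceptually, those controls amount to a policy check plus an audit trail around every step. The sketch below is an assumption-laden illustration of that idea; the field names are not the actual Gemini Computer Use configuration surface.

```python
# Illustrative guardrails: a domain allowlist, always-blocked actions,
# actions requiring confirmation, and an append-only audit log.
import json, time
from dataclasses import dataclass, field
from urllib.parse import urlparse

@dataclass
class Policy:
    allowed_domains: set = field(default_factory=lambda: {"crm.example.com"})
    blocked_actions: set = field(default_factory=lambda: {"solve_captcha"})
    confirm_actions: set = field(default_factory=lambda: {"purchase", "export_data"})

def check(policy: Policy, url: str, action_kind: str) -> str:
    """Return 'deny', 'confirm' or 'allow' for a proposed step."""
    if urlparse(url).hostname not in policy.allowed_domains:
        return "deny"
    if action_kind in policy.blocked_actions:
        return "deny"
    if action_kind in policy.confirm_actions:
        return "confirm"
    return "allow"

def audit(log_path: str, url: str, action_kind: str, decision: str) -> None:
    """Append one auditable JSON record per step."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"ts": time.time(), "url": url,
                            "action": action_kind, "decision": decision}) + "\n")
```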
Google’s system card also lists known frontier-model limits: hallucinations, gaps in causal understanding, and difficulty with complex logical deduction and counterfactual reasoning. In practice, that means the agent may occasionally misread ambiguous interfaces or take suboptimal paths. Human-in-the-loop checkpoints and constraints remain best practice.
How It Stacks Up Against Other AI Agents and Tools
OpenAI and Anthropic have both been building agents that can drive browsers and manipulate desktops. The common theme is generalized UI control: models that learn to use new websites without tailored scripts. Google’s specific claim of leading the industry on benchmarks implies a competitive edge in perception-action loops and latency, though your mileage will vary depending on what you are trying to do.
One distinction is the focus on explicit, visible action traces: users can see what steps were taken and why as the agent operates. That transparency builds trust, aids debugging and gives teams a way to insert review gates at critical points.
Availability and what to try first with Gemini agents
The model is accessible through the Gemini API in Google AI Studio and on Vertex AI, with a demo available on Browserbase.
Although it’s mainly optimized for the web, Google sees substantial potential in mobile use cases and suggests cross-device control is next.
Early adopters should begin with well-scoped tasks that are high-value, low-risk: internal dashboard updates, report generation and structured data entry. Set up a sandbox, add confirmations for risky operations, and test success rates with a handful of representative sites before rolling it out.
The takeaway is bigger than any single demo: AI agents are graduating from answering questions to performing actions in the same interfaces we all use. If Google’s results generalize beyond the lab, browser-native automation may shift from brittle scripts to adaptive, auditable AI, one cautious click at a time.