Google is testing a simple idea with the potential for big impact: on-screen markup that lets you draw over parts of an image on your screen before asking Gemini a question or giving it a command. The feature, spotted in the latest Google app build, could offer a more structured way to guide Gemini’s focus, making visual questions faster to ask and easier to answer accurately.
How the New Markup Workflow Could Work
Strings and UI elements in version 16.42.61 of the Google app’s ARM64 build show a new option to draw on an image selected from your gallery or taken with the camera. You can circle or underline an area, then ask Gemini to analyze only that part: “interpret just this label,” “find the damage on this corner,” or “compare these two logos on the shelf.”
While we can’t say for sure, early evidence points to a color picker and multiple highlight modes, which hints that more than one region could be marked in a single shot. That would allow multistep prompts like “describe the chart in green, then pull a sentence from this box in blue.” Nothing in the strings spells this out, but region cues act as visual signposts that help Gemini work out where to focus.
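To make the idea concrete, here is a minimal sketch of how a region-scoped, multi-part request can already be expressed with the public Gemini API in Python, assuming the marked-up areas are simply cropped out and labeled before being sent. The file name, crop coordinates, and model choice are illustrative assumptions, not details pulled from the leaked feature.

```python
# Illustrative sketch: approximating "describe the chart in green, then pull a
# sentence from the blue box" by cropping the marked regions and labeling them.
# File name, coordinates, and the model name are assumptions, not Google's flow.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

photo = Image.open("shelf.jpg")
green_region = photo.crop((120, 80, 640, 420))   # area circled in green
blue_region = photo.crop((700, 150, 980, 380))   # area boxed in blue

response = model.generate_content([
    "Image 1 is the region marked in green; image 2 is the region marked in blue.",
    green_region,
    blue_region,
    "Describe the chart in the green region, then quote one sentence from the blue region.",
])
print(response.text)
```

In the flow Google appears to be testing, the cropping and labeling would presumably happen on-device as you draw, but the end result handed to the model would be the same kind of region-scoped prompt.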
Why Region Prompts Are Important for Multimodal AI
Multimodal models consistently perform better when ambiguity is low. Rather than relying on vague language like “that thing on the left,” a quick scrawl becomes an unambiguous point of reference. Work on referring expressions and visual grounding from Google and other labs has repeatedly shown that spatial guidance improves object recognition, visual question answering, and captioning accuracy.
This plays to Gemini’s long-context capabilities, too. Gemini 1.5 can take in enormous context across images and documents, on the order of a million tokens or more. Markup is a cheap way to narrow the field of view, which may improve response times and reduce computational load by limiting the regions that need attention.
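A rough way to see that effect is to compare token counts for a full photo against a cropped region using the Gemini API’s count_tokens call. The sketch below assumes hypothetical file names and coordinates, and the exact numbers depend on image size and how the model tiles images.

```python
# Rough sketch: comparing the token cost of a full image against a cropped
# region. Paths, coordinates, and the model name are placeholders; actual
# counts depend on how the model tiles images.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

prompt = "Interpret just this label."
full = Image.open("receipt.jpg")
label_only = full.crop((200, 300, 900, 700))  # the area the user circled

print(model.count_tokens([prompt, full]).total_tokens)        # whole photo
print(model.count_tokens([prompt, label_only]).total_tokens)  # marked region only
```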
Editing Hooks and On-Device Clues in Gemini’s Tools
The interface suggests more than analysis. Internal references to features known by the monikers “Nano” and “Banana” imply hooks into on-device image editing flows — like quickly cutting out an unwanted piece of a screenshot or tidying up a photo background. That fits Google’s larger split between on-device Gemini Nano for privacy-sensitive, light work and cloud models for heavier lifting.
If the markup tool routes tasks efficiently, handling edits locally where possible and reaching for the cloud when it makes sense, users could get both speed and quality. It would be a similar play to Pixel features such as Magic Eraser and Audio Magic Eraser, which pair simple gestures with AI-driven adjustments, except baked into the Gemini prompt flow itself.
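Google hasn’t said how such routing would work, but as a thought experiment, a simple dispatcher might look like the sketch below. The task names and both run_* helpers are hypothetical placeholders, not a real Google or Android API.

```python
# Purely illustrative routing heuristic, not Google's implementation.
# Task names and both run_* helpers are hypothetical placeholders.

LOCAL_TASKS = {"erase_region", "blur_background", "crop_to_markup"}

def run_on_device(task: str, image: bytes) -> str:
    return f"on-device result for {task}"      # stand-in for a local Nano-class edit

def run_in_cloud(task: str, image: bytes) -> str:
    return f"cloud result for {task}"          # stand-in for a cloud model call

def route(task: str, image: bytes, has_nano: bool) -> str:
    """Prefer fast, private local edits; fall back to the cloud for heavy reasoning."""
    if has_nano and task in LOCAL_TASKS:
        return run_on_device(task, image)
    return run_in_cloud(task, image)

print(route("erase_region", b"...", has_nano=True))   # handled locally
print(route("analyze_chart", b"...", has_nano=True))  # needs cloud-scale reasoning
```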
Use Cases That Make Sense in the Real World
Practical scenarios are everywhere. Students can circle a particular axis on a chart and ask for a trend summary along that dimension only. A sticker in a storefront window can be circled for translation without pulling in the reflections behind it. A support rep can circle an error message in a screenshot and ask for a fix, skipping the irrelevant UI clutter around it.
Retail and productivity teams stand to benefit as well. Product catalogers can mark a SKU label and extract structured data reliably, and designers can isolate a logo for brand-compliance checks. In medical and insurance workflows, visual cues could keep sensitive sections out of a query entirely or flag a region for human review, complementing existing privacy safeguards.
How It Compares With Rivals in Visual AI Markup
Competitors have already been moving in this direction. Chat assistants from OpenAI and Microsoft let users tap or draw on images to zero in on a query. Adding first-party markup to Gemini’s core flow would keep Google at parity while leaning on its ecosystem advantages: native Android markup tools, Google Photos’ editing stack, and tight integration with the Google app.
Caveats and What to Watch as Google Tests Markup
As with any pre-release feature, the UI looks provisional and will probably change. It’s not entirely clear, for example, what the multiple colors are for: ordering steps, labeling categories, or pure aesthetics. And availability might be staged, possibly arriving first on devices with enough on-device AI muscle.
Still, the direction is compelling. Visual markup is what turns Gemini from a capable generalist into an assistant that knows exactly what you’re pointing at. If Google ships this and tightens up region-aware prompting, users should get quicker answers with fewer misunderstandings.