Google has unveiled Gemini 3.1 Pro, claiming a major leap in reasoning. The company says the model more than doubled its internal reasoning performance over the prior 3 Pro and posted a 77.1% result on the ARC-AGI-2 benchmark, a test designed to probe “entirely new logic patterns” rather than memorized tasks. For developers and enterprises chasing dependable multi-step problem solving, that’s a headline-grabber.

A Big Jump on Harder Tests and Novel Reasoning Benchmarks

Benchmarks don’t tell the whole story, but they do mark progress. On Humanity’s Last Exam, a composite designed to resist overfitting and better mirror human-level problem-solving, Gemini 3.1 Pro reached 44.4%, up from Gemini 3’s prior high of 38.3%. On ARC-AGI-2, the new 77.1% score is what underpins Google’s “more than double” reasoning claim.

There’s nuance, though. Google’s recently announced Gemini 3 Deep Think upgrade actually outscored 3.1 Pro on both tests, with 84.6% on ARC-AGI-2 and 48.4% on Humanity’s Last Exam (HLE). Google positions 3.1 Pro as the upgraded core intelligence powering those science-heavy gains, suggesting Deep Think is a specialized configuration while 3.1 Pro is the more general-purpose workhorse.
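To put the quoted figures in proportion, here is a minimal sketch of the arithmetic, using only the scores reported in this section. The helper names are illustrative. The "implied prior score" line rests on an assumption: that "more than double" refers to the ratio of ARC-AGI-2 scores, which the announcement does not spell out here.

```python
def point_gain(old: float, new: float) -> float:
    """Absolute improvement in percentage points."""
    return new - old


def relative_gain(old: float, new: float) -> float:
    """Improvement relative to the old score, as a percentage."""
    return (new - old) / old * 100.0


# Scores quoted in the article (percent).
hle_old, hle_new = 38.3, 44.4   # Humanity's Last Exam: Gemini 3 vs. 3.1 Pro
arc_new = 77.1                  # ARC-AGI-2: Gemini 3.1 Pro
arc_deep_think = 84.6           # ARC-AGI-2: Gemini 3 Deep Think

print(f"HLE gain: {point_gain(hle_old, hle_new):.1f} points "
      f"({relative_gain(hle_old, hle_new):.1f}% relative)")

# Assumption: if 77.1% is "more than double" the prior ARC-AGI-2 score,
# that prior score must sit below half of 77.1%.
print(f"Implied prior ARC-AGI-2 score: below {arc_new / 2:.2f}%")

print(f"Deep Think lead on ARC-AGI-2: "
      f"{point_gain(arc_new, arc_deep_think):.1f} points")
```

The point of separating the two helpers is that a gain in percentage points and a relative gain tell different stories: the HLE improvement is a few points, while the ARC-AGI-2 claim is about a ratio.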

What Doubling Reasoning Really Means for Real-World Tasks

ARC-AGI-2 focuses on novelty: can a model solve problems it hasn’t seen before and combine concepts on the fly? A higher score typically correlates with better chain-of-thought style planning, fewer dead ends in multi-step tasks, and more robust generalization under changing instructions. In practical terms, users may see more consistent performance on tasks like complex spreadsheet transformations, multi-constraint itinerary planning, or diagnosing edge-case bugs across large codebases.

But “double” doesn’t mean twice as smart in the wild. Real-world outcomes still hinge on context length, retrieval quality, prompt design, and safety guardrails. As with every frontier model, improvements in logic can expose new failure modes—confident mistakes, subtle reasoning gaps, or sensitivity to ambiguous inputs—especially outside benchmark conditions.

How It Compares to Rivals Across Capability and Safety

On aggregated capability measures maintained by the Center for AI Safety, Anthropic’s Claude Opus 4.6 currently leads for text-based reasoning and general language tasks. CAIS’s risk assessment leaderboard also places Anthropic’s Opus 4.5, Sonnet 4.5, and Opus 4.6 ahead of Gemini 3 on several safety dimensions. In other words, Gemini 3.1 Pro is pushing hard on reasoning benchmarks, but leadership differs by metric and workload.

This competitive picture reflects a broader trend: top labs are trading punches on targeted strengths. Google’s recent emphasis has been on scientific and mathematical reliability—chemistry, physics, coding—where Deep Think’s performance suggests meaningful headroom. Expect rapid responses from rivals as they tune for the same high-novelty tests.

From Lab Scores to Daily Workflows and Developer Access

Google is rolling out access where builders already are. Developers can try Gemini 3.1 Pro in preview through the API in Google AI Studio, Android Studio, Google Antigravity, and the Gemini CLI. Enterprise teams can pilot it via Vertex AI and Gemini Enterprise. For everyday users, it’s available in NotebookLM and the Gemini app.

Practical wins to watch for include:

Multi-step data analysis with fewer re-prompts
Code refactoring that carries logic correctly across modules
Structured planning that respects constraints like budgets and time windows
Scientific drafting that better preserves units, assumptions, and error bounds

If the ARC-AGI-2 gains translate, these workflows should feel less brittle and more repeatable.

Caveats and the Road Ahead for Reliability, Safety, and Use

Benchmark peaks are fleeting. As new models land, relative rankings shuffle, and hard problems migrate to harder tests. The key questions for Gemini 3.1 Pro will be reliability under changing prompts, factual grounding on niche topics, and safety under adversarial use—all areas where independent evaluations, including those tracked by research groups like CAIS, will matter as much as lab numbers.

For now, Gemini 3.1 Pro signals that Google’s reasoning stack is accelerating. The ARC-AGI-2 result behind the doubling claim is a clear step forward; whether it becomes a durable advantage will depend on how consistently those gains show up in real work, across messy datasets, edge cases, and the creative chaos of production-scale use.