FindArticles © 2025. All Rights Reserved.

Google Gemini 3.1 Pro Beats Rivals In AI Benchmarks

By Gregory Zuckerman
Last updated: February 20, 2026 7:14 pm
Technology · 6 Min Read

Google has launched Gemini 3.1 Pro, positioning its newest flagship model as a step forward in complex reasoning and technical problem-solving. Early results on widely watched evaluations suggest it now tops key competitors from OpenAI and Anthropic on several demanding tests, while leaving room to grow in agentic coding tasks.

What the scores say about Gemini 3.1 Pro vs rivals

On ARC-AGI-2, a suite of abstract reasoning puzzles derived from François Chollet’s Abstraction and Reasoning Corpus, Gemini 3.1 Pro recorded a 77.1% score. That outpaces the reported 52.9% for GPT-5.2 and 68.8% for Claude Opus 4.6. Google also says the model leads on 12 of the 19 benchmarks it tracked against those rivals.

Google Gemini 3.1 Pro tops AI benchmarks, beating rival models in tests

In knowledge-intensive testing, Gemini 3.1 Pro posted a 94.3% result on GPQA Diamond, a graduate-level, “Google-proof” science and reasoning benchmark. According to Google’s summary, GPT-5.2 registered 92.4% and Claude Opus 4.6 came in at 91.3% on the same test. While headline numbers don’t capture everything, these margins are notable given the difficulty and diversity of the questions in GPQA Diamond.

The overarching takeaway: on abstract reasoning and advanced scientific QA, Gemini 3.1 Pro is now competitive at the very top of the field and, in many cases, in front.
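As a quick sanity check on those margins, the figures cited above can be tabulated directly. The scores below are the article's reported numbers, not independently verified results:

```python
# Reported benchmark scores (percent), as summarized in this article.
scores = {
    "ARC-AGI-2": {"Gemini 3.1 Pro": 77.1, "GPT-5.2": 52.9, "Claude Opus 4.6": 68.8},
    "GPQA Diamond": {"Gemini 3.1 Pro": 94.3, "GPT-5.2": 92.4, "Claude Opus 4.6": 91.3},
}

for bench, results in scores.items():
    # Rank models by score, highest first, and report the leader's margin
    # over the runner-up.
    ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
    leader, top = ranked[0]
    runner, second = ranked[1]
    print(f"{bench}: {leader} leads {runner} by {top - second:.1f} points")
```

The tabulation makes the asymmetry visible: the ARC-AGI-2 lead over the nearest rival is about 8.3 points, while the GPQA Diamond lead is a much narrower 1.9 points.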

Why these benchmarks matter for real-world reliability

ARC-style tasks are designed to probe generalization and pattern induction rather than memorization. Scoring above 75% on ARC-AGI-2 suggests a model is getting better at discovering latent rules in novel problems—skills that translate to data synthesis, planning, and multi-step instructions. GPQA Diamond, meanwhile, pressure-tests graduate-level understanding in physics, biology, chemistry, and related domains with questions crafted to resist simple web lookup strategies.

Put simply, stronger results here tend to correlate with more reliable multi-hop reasoning and less brittle behavior on unfamiliar prompts—capabilities enterprises look for when moving from demos to production.

Where Gemini 3.1 Pro still trails in agentic coding

The picture isn’t uniformly rosy. Google acknowledges Gemini 3.1 Pro lags in several agentic coding evaluations, including SWE-Bench Verified, a benchmark that measures a model’s ability to fix real-world software issues in repository contexts. This gap aligns with a broader industry challenge: reliable tool use, environment setup, and multi-step code changes in long-running sessions remain fragile even for top-tier models.

For teams focused on automated bug fixing or repository-scale refactoring, that caveat is meaningful. Specialized coding agents and tight IDE integrations may still outperform general-purpose chat models until orchestration and tool reliability improve.

[Image: Gemini 3.1 Pro in Google Antigravity]

What’s new in practice with Gemini 3.1 Pro today

Google is framing 3.1 Pro as the option for “hard prompts” that require multi-step math, code reasoning, structured analysis, and clearer visual or conceptual explanations. In plain terms, users should expect stronger synthesis—turning a pile of inputs into a single, coherent output—and more consistent follow-through on complex instructions.

That positioning lines up with the benchmark gains: ARC-AGI-2 implies better generalization, while GPQA Diamond indicates improved scientific grounding. Together, they hint at fewer dead ends when a task moves beyond surface-level pattern matching.

How to access Gemini 3.1 Pro on web, app, and NotebookLM

Gemini 3.1 Pro is live in Google’s Gemini app and on the web. Free users can try it with usage limits; paid subscribers get higher quotas. To switch models, open the model selector in the prompt box and choose the Pro variant, which is described as optimized for advanced math and code.

The model is also available in NotebookLM for subscribers on AI Pro or AI Ultra plans. That pairing makes sense for research-heavy workflows where long-form synthesis and source-grounded answers can benefit from the model’s reasoning strengths.

Bottom line for buyers evaluating Gemini 3.1 Pro

With 3.1 Pro, Google has narrowed—if not flipped—several of the most scrutinized leaderboards in its favor, particularly on abstract reasoning and scientific QA. The trade-off is still visible on agentic coding, where operational reliability and tool use remain a work in progress across the sector.

If your tasks center on complex analysis, multi-source synthesis, and hard technical explanations, Gemini 3.1 Pro’s scores make it a compelling first pick. If your priority is automated code changes inside real repositories, pair it with specialized agents or keep a close eye on updates to coding toolchains and SWE-Bench performance. Either way, the pace of improvement—and the spread in results across benchmarks—underscores just how fast the top of the AI stack is evolving.

By Gregory Zuckerman
Gregory Zuckerman is a veteran investigative journalist and financial writer with decades of experience covering global markets, investment strategies, and the business personalities shaping them. His writing blends deep reporting with narrative storytelling to uncover the hidden forces behind financial trends and innovations. Over the years, Gregory’s work has earned industry recognition for bringing clarity to complex financial topics, and he continues to focus on long-form journalism that explores hedge funds, private equity, and high-stakes investing.