Gemini 3 Deep Think Hits 3455 Elo on Codeforces and Gold on Science Olympiads, But the Duke Semiconductor Work Is What Matters

Google's upgraded reasoning model posts impressive benchmarks. A materials science lab is already using it to design chips.

By Theo Ballard | February 12, 2026


Google DeepMind dropped a significant upgrade to Gemini 3 Deep Think today, and for once, the benchmark numbers tell a coherent story. The model now holds state-of-the-art on ARC-AGI-2, earned gold medals on the written portions of the 2025 Physics and Chemistry Olympiads, and posts a 3455 Elo rating on Codeforces.

Those are real numbers. Let me break down what they actually mean.

The Benchmark Reality Check

ARC-AGI-2 is François Chollet's follow-up benchmark designed to test genuine abstraction and reasoning, the kind of novel problem-solving that pure pattern matching can't fake. When OpenAI's o1 launched, it moved the needle on ARC-AGI but still struggled with the harder variants. Deep Think claiming SOTA here suggests Google has made genuine progress on the reasoning-vs-memorization problem, though we'll need to see the methodology paper to verify they're not just scaling test-time compute indefinitely.

The Olympiad results are more straightforward to interpret. Gold on the written portions of Physics and Chemistry means the model can work through multi-step derivations, hold complex state, and avoid the catastrophic errors that plague standard LLMs on calculation-heavy problems. The "written portions" caveat matters: these aren't lab practicals. But for theoretical problem-solving, gold is gold.

3455 Elo on Codeforces puts Deep Think in the top tier of competitive programmers globally. For context, that's Legendary Grandmaster territory, the highest rating band Codeforces awards, above all but a handful of humans who have ever competed. OpenAI's o1 was posting numbers in the 2700-2900 range at launch, though they've likely improved since. The gap here is meaningful: Codeforces problems require not just code generation but algorithmic insight, edge case handling, and the ability to debug under time pressure.

But here's my standard caveat on all benchmark claims: we're taking Google's word for it. Independent verification will take weeks. The numbers are impressive if accurate, but "if accurate" is doing a lot of work in that sentence.

What Actually Matters: Duke's Semiconductor Work

Benchmarks are proxies. The Duke University story is the real signal.

Dr. Wei Wang's materials science lab at Duke has been running Deep Think through Vertex AI's early access program for the past six weeks, using it to explore novel semiconductor material configurations. According to the DeepMind announcement, the lab has used Deep Think to identify three candidate materials with theoretical properties that could improve chip efficiency by 12-18%; none of the three appeared in any existing literature or database.

This is the kind of application that separates genuine reasoning capability from sophisticated autocomplete. Materials science discovery requires three things (a toy sketch of the first follows the list):

  1. Constraint satisfaction across multiple domains: electrical properties, thermal stability, manufacturability, cost
  2. Novel recombination: generating candidates that don't exist in training data
  3. Falsification: recognizing why obvious approaches fail before wasting compute on simulation
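
To make the first requirement concrete, here's a toy multi-domain filter in Python. Everything in it is invented for illustration: the property names, thresholds, and candidates are placeholders with no connection to the Duke work or any real process spec.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    name: str
    bandgap_ev: float        # electronic behavior (placeholder property)
    max_temp_c: float        # thermal stability (placeholder property)
    synth_cost_index: float  # rough manufacturability / cost proxy (placeholder)

def passes_constraints(c: Candidate) -> bool:
    """Reject any candidate that violates a single-domain constraint.
    Thresholds are invented for illustration, not real process requirements."""
    return (
        1.0 <= c.bandgap_ev <= 3.5
        and c.max_temp_c >= 400
        and c.synth_cost_index <= 2.0
    )

# Generating novel candidates only matters if the survivors of a filter
# like this are worth the cost of full physics simulation downstream.
candidates: List[Candidate] = [
    Candidate("hypothetical-A", 2.1, 650, 1.4),
    Candidate("hypothetical-B", 0.4, 900, 0.9),  # fails the bandgap bound
]
shortlist = [c for c in candidates if passes_constraints(c)]
print([c.name for c in shortlist])  # ['hypothetical-A']
```

The hard part, and the part only the model can do, is proposing candidates worth feeding into a filter like this in the first place.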

If Deep Think is genuinely accelerating this workflow, that's a stronger signal than any benchmark. Materials discovery is a bottleneck for the roughly $600B semiconductor industry. The models that can actually contribute to R&D pipelines, not just write summaries of existing research, will capture enormous enterprise value.

I'd love to see the Duke paper when it drops. The claim is extraordinary; the evidence needs to match.

Reasoning Models: What's Actually Different

For readers who haven't followed the reasoning model architecture debate, here's the technical context.

Standard LLMs (GPT-4, Claude 3, base Gemini) are essentially very sophisticated next-token predictors. They're trained on vast corpora, develop impressive pattern matching, and can simulate reasoning when the pattern is familiar. But they struggle with genuinely novel problems because they're fundamentally interpolating between training examples.

Reasoning models like o1 and Deep Think add an explicit "thinking" phase. The model generates internal reasoning traces (chains of thought that aren't shown to the user) and can iterate, backtrack, and explore multiple solution paths before committing to an answer. This is closer to how humans actually solve hard problems: we don't just pattern-match; we work through possibilities.

The technical implementation varies. OpenAI's o1 uses reinforcement learning to train the model to generate useful reasoning traces. Google hasn't published Deep Think's architecture details, but the inference-time scaling behavior suggests something similar. The model gets better results when given more compute at inference, consistent with a search-over-reasoning-paths approach.
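
To make "search over reasoning paths" concrete, here's a minimal best-of-N sketch in Python. This is not Deep Think's or o1's actual implementation (Google hasn't published details); `generate` and `score` are hypothetical stand-ins for a model call and a verifier.

```python
import random
from typing import Callable, Tuple

def solve_with_reasoning(
    problem: str,
    generate: Callable[[str], str],      # stand-in for a model call (hypothetical)
    score: Callable[[str, str], float],  # stand-in for a verifier/reward model (hypothetical)
    n_paths: int = 8,                    # more paths = more inference-time compute
) -> Tuple[str, float]:
    """Best-of-N search over reasoning paths: sample several hidden chains
    of thought, score each resulting answer, return the best one."""
    best_answer, best_score = "", float("-inf")
    for _ in range(n_paths):
        trace = generate(f"Think step by step about: {problem}")       # hidden reasoning
        answer = generate(f"Given this reasoning:\n{trace}\nAnswer:")  # user-facing answer
        s = score(problem, answer)
        if s > best_score:
            best_answer, best_score = answer, s
    return best_answer, best_score

# Toy stand-ins so the sketch runs end to end; a real system would call an LLM.
if __name__ == "__main__":
    dummy_generate = lambda prompt: f"candidate-{random.randint(0, 99)}"
    dummy_score = lambda problem, answer: random.random()
    print(solve_with_reasoning("minimize cost subject to constraints", dummy_generate, dummy_score))
```

The only knob here is `n_paths`: more sampled paths means more compute at inference and, empirically, better answers on hard problems, which is exactly the scaling behavior described above. Production systems are far more sophisticated (learned verifiers, backtracking, RL-trained traces), but the compute-for-quality trade is the same.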

This matters because it changes the capability curve. Standard LLMs hit walls on problems that require genuine novelty. Reasoning models can, in principle, solve problems that aren't well-represented in training data, as long as the solution can be found through systematic search. The Duke semiconductor work is a test case for whether that principle holds in practice.

The Rollout Strategy: Google AI Ultra → Vertex

Deep Think is available now for Google AI Ultra subscribers (the consumer-facing $249/month tier) and will hit Vertex AI in "coming weeks" for enterprise and research customers.

This sequencing is interesting. Google is leading with the premium consumer play rather than the enterprise API. Three possible interpretations:

  1. Capacity constraints: reasoning models are compute-intensive, and Google may not have the serving infrastructure to handle enterprise-scale Vertex traffic yet
  2. Pricing discovery: consumer Ultra lets them test willingness-to-pay before committing to enterprise rate cards
  3. Marketing play: consumer availability generates buzz and user testimonials that accelerate enterprise sales cycles

My bet is a combination of 1 and 2. Reasoning models consume 10-50x the compute of standard inference for hard problems (all that thinking isn't free), and Google's serving infrastructure for Gemini has historically lagged OpenAI's. They may simply not be ready for Vertex-scale traffic.

For researchers like the Duke lab, early access through Vertex is the only option that matters. Google AI Ultra doesn't offer the API access, fine-tuning capabilities, or compliance certifications that serious research and enterprise deployments require. The consumer launch is a sideshow.

The Competitive Picture

Let's situate Deep Think in the reasoning model wars.

OpenAI o1 launched in September 2024 and has been iterating since. It's the incumbent, with the largest user base and the most real-world deployment data. The January 2025 refresh (o1-2025-01-30) closed some gaps, but specific benchmark comparisons to today's Deep Think aren't available yet.

Anthropic has been notably quiet on explicit reasoning models. Claude 3.5 Sonnet and Opus show improved reasoning capabilities, but Anthropic hasn't launched a dedicated "thinking mode" product. The inside-baseball read is that they're prioritizing reliability and safety research over raw benchmark performance. Whether that's strategic differentiation or falling behind depends on your priors.

DeepSeek's R1, from the Chinese lab of the same name, made waves in January with strong reasoning benchmarks at a fraction of the cost. It's a nonstarter for most US enterprise customers, but it's a reminder that the frontier isn't exclusively Western.

The pattern that emerges: reasoning capability is the current axis of competition. The labs that can deliver reliable, verifiable reasoning (not just benchmark performance, but actual utility on hard problems) will capture the next wave of enterprise and research adoption.

What I'm Watching

Three things will determine whether today's announcement matters:

  1. Independent benchmark verification: I want to see third-party evals on ARC-AGI-2 and Codeforces within the month
  2. The Duke paper: If the semiconductor materials work holds up to peer review, it's a genuine milestone
  3. Vertex rollout timeline: "Coming weeks" is vague. Enterprise customers need dates.

Deep Think's numbers are impressive. The Duke application is transformative. But I've been in this industry long enough to know that announcements aren't products, and benchmarks aren't deployments.

Show me the reproducible results. Then we'll talk.
