MiniMax M2.5 Becomes First Chinese Model to Match or Beat US Frontier Labs on SWE-Bench Verified
A Shanghai-based lab just shipped a model that outperforms GPT-5.2 and Gemini 3 Pro on the benchmark that actually matters for production coding, at one-fifth the price.
By Lina Xie | February 12, 2026
MiniMax released M2.5 today, scoring 80.2% on SWE-Bench Verified. That number matters. It's the first time a Chinese model has matched or beaten every US frontier lab on the benchmark most predictive of real-world software engineering capability.
GPT-5.2 sits at 78.1%. Gemini 3 Pro at 76.8%. Claude Opus 4.6 stays narrowly ahead at 80.5%, but that gap has collapsed to statistical noise. MiniMax's stock jumped 11% on the announcement.
Let me be clear about what just happened: a Shanghai-based company, subject to US chip export controls, just shipped a model that rivals or beats everything coming out of San Francisco on the task that developers actually care about.
Why SWE-Bench Verified Is the Benchmark That Matters
I've spent years watching labs optimize for benchmarks that don't transfer to production. MMLU measures trivia recall. HumanEval tests toy coding problems a competent junior developer solves in minutes. Most benchmarks reward pattern matching over reasoning.
SWE-Bench Verified is different. It presents models with real GitHub issues from production repositories: actual bugs reported by actual developers in codebases like Django, Flask, and scikit-learn. The model must understand the issue, navigate thousands of files, identify the relevant code, and produce a working patch that passes the repository's test suite.
There's no shortcut. You can't memorize your way to a high score. The "Verified" designation means human experts have confirmed each problem has a valid, reproducible solution, eliminating the data quality issues that plagued the original SWE-Bench.
When a model scores 80% on SWE-Bench Verified, it resolves four out of five real issues that were serious enough for developers to file and that a human maintainer eventually had to fix by hand. That's not a parlor trick. That's economic value.
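If you want to see what the benchmark looks like under the hood, here is a minimal sketch of how a score like 80.2% gets computed. The Hugging Face dataset id and field names below are the ones I believe the Verified split uses; official scores come from the SWE-bench harness, which applies each predicted patch inside a containerized copy of the repository and reruns its tests, not from a snippet like this.

```python
# Minimal sketch of scoring a SWE-Bench Verified run.
# Assumes the Hugging Face dataset id and field names below; the official
# harness does the real work of applying patches and rerunning test suites.
from datasets import load_dataset

def resolution_rate(resolved_instance_ids: set[str]) -> float:
    """Fraction of Verified tasks whose predicted patch made the tests pass."""
    tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
    resolved = sum(1 for t in tasks if t["instance_id"] in resolved_instance_ids)
    return resolved / len(tasks)

# Each task pairs a real GitHub issue with the repository state it was filed
# against (field names as I recall them from the dataset):
#   instance_id        e.g. "django__django-12345" (illustrative id format)
#   repo, base_commit  which codebase and commit to check out
#   problem_statement  the issue text the model sees
#   FAIL_TO_PASS       tests that must flip from failing to passing
#   PASS_TO_PASS       tests that must keep passing (no regressions)
```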
The Technical Achievement
MiniMax describes M2.5 as "the world's first production-level model designed natively for Agent scenarios." Strip away the marketing, and there's substance here.
The model architecture appears optimized for the long-context, multi-step reasoning that software engineering demands. Resolving a SWE-Bench issue isn't a single forward pass. It requires understanding the bug report, exploring the codebase, forming hypotheses, testing them against the code structure, and generating a patch that integrates cleanly.
Most models bolt agentic capabilities onto architectures designed for chat. MiniMax claims they built for agency from the ground up. The benchmark results suggest this isn't empty positioning.
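MiniMax hasn't published its agent internals, so treat the following as a generic sketch of what an issue-resolving loop looks like, not as M2.5's actual design. The `model.next_action` client, the tool names, and the stopping rule are all hypothetical.

```python
# Generic sketch of an issue-resolving agent loop, NOT MiniMax's implementation.
# `model` is a hypothetical tool-calling client; tools and stop rule are illustrative.
import subprocess
from pathlib import Path

TOOLS = {
    "read_file": lambda path: Path(path).read_text(),
    "search": lambda pattern: subprocess.run(
        ["grep", "-rln", pattern, "."], capture_output=True, text=True
    ).stdout,
    "run_tests": lambda: subprocess.run(
        ["pytest", "-x", "-q"], capture_output=True, text=True
    ).stdout,
}

def resolve_issue(model, issue_text: str, max_steps: int = 30) -> str | None:
    """Iteratively explore the repo, propose patches, and rerun tests until green."""
    history = [{"role": "user", "content": issue_text}]
    for _ in range(max_steps):
        action = model.next_action(history, tools=list(TOOLS))  # hypothetical API
        if action.kind == "tool_call":
            history.append({"role": "tool", "content": TOOLS[action.name](*action.args)})
        elif action.kind == "patch":
            subprocess.run(["git", "apply"], input=action.diff, text=True)
            if "failed" not in TOOLS["run_tests"]():  # crude pass/fail heuristic
                return action.diff                    # tests pass: submit the patch
            history.append({"role": "tool", "content": "tests still failing"})
    return None                                       # give up after max_steps
```

The point of the sketch is the shape of the workload: many tool calls, long context carried across steps, and a hard external check at the end, which is exactly what a chat-first architecture isn't built for.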
I'm skeptical of benchmark-first development, where teams optimize for the test rather than the underlying capability. But SWE-Bench Verified is hard to game precisely because it tests the entire pipeline. You can't inflate your score with prompt engineering or cherry-picked examples. Either your patch passes the tests or it doesn't.
The Pricing Play
Performance parity would be noteworthy. But MiniMax didn't stop there.
M2.5 pricing: $0.30 per million input tokens, $1.20 per million output tokens.
For comparison: GPT-5.2 runs approximately $1.50 per million input tokens and $6.00 per million output tokens. Claude Opus 4.6 is comparable. Gemini 3 Pro is slightly cheaper than either, but still 3-4x M2.5's rates.
At these prices, MiniMax isn't competing on performance alone. They're making the economic case that US pricing is rent extraction. A company running a million API calls a day could cut its inference bill by roughly 80% by switching providers.
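The back-of-the-envelope math, using the list prices above and an assumed workload (the per-call token counts are my illustration, not vendor figures):

```python
# Cost comparison at the list prices quoted above. The workload numbers
# (1M calls/day, 3k input + 1k output tokens per call) are assumptions.
CALLS_PER_DAY = 1_000_000
IN_TOK, OUT_TOK = 3_000, 1_000          # tokens per call (assumed)

def daily_cost(in_price: float, out_price: float) -> float:
    """Prices are USD per million tokens."""
    return (CALLS_PER_DAY * IN_TOK / 1e6 * in_price
            + CALLS_PER_DAY * OUT_TOK / 1e6 * out_price)

m25 = daily_cost(0.30, 1.20)            # MiniMax M2.5
gpt = daily_cost(1.50, 6.00)            # GPT-5.2 (approximate list price)
print(f"M2.5:    ${m25:,.0f}/day")      # $2,100/day
print(f"GPT-5.2: ${gpt:,.0f}/day")      # $10,500/day
print(f"savings: {1 - m25 / gpt:.0%}")  # 80%
```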
This is the DeepSeek playbook scaled up. Chinese labs have consistently shipped models at price points that make US executives uncomfortable. The strategy seems to be: match performance, undercut on price, release weights, and let ecosystem adoption do the work.
The Open-Source Gambit
M2.5 weights are public. You can download them today.
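Assuming the release follows the usual Hugging Face pattern, pulling the weights is a few lines. The repo id below is my guess at the naming, not a confirmed identifier.

```python
# Minimal sketch of loading the open weights, assuming a standard Hugging Face
# release. "MiniMaxAI/MiniMax-M2.5" is a hypothetical repo id, not confirmed.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "MiniMaxAI/MiniMax-M2.5"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",        # load in the checkpoint's native precision
    device_map="auto",         # shard across available GPUs
    trust_remote_code=True,    # custom architectures ship their own modeling code
)

prompt = "Fix the off-by-one error in this pagination helper:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=256)[0]))
```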
This is the strategic divergence that matters most. US frontier labs have consolidated around closed weights and API access. OpenAI, Anthropic, and Google treat model weights as crown jewels, competitive moats that justify their valuations.
China's labs are betting in the opposite direction. DeepSeek open-sourced. Qwen open-sourced. Now MiniMax. The theory: open weights accelerate adoption, build ecosystem lock-in, and make your architecture the standard that everyone else builds on.
There's precedent. Android won mobile by being open when iOS stayed closed. Linux became the server default through the same dynamic. If M2.5's weights proliferate, getting fine-tuned, distilled, and integrated into every coding tool, that's a form of dominance that doesn't show up on benchmark leaderboards.
The US export control strategy assumed that choking compute would slow Chinese AI development. The labs would hit a ceiling, unable to train frontier models without access to H100s and their successors.
M2.5 suggests the ceiling is higher than Washington expected, or that Chinese labs have found ways around the restrictions.
What This Means for US Labs
The comfortable narrative was that China trailed by 12-18 months on frontier capabilities. That gap appears to have closed, at least for coding-specific tasks.
US labs now face a strategic trilemma:
Match on price. Slash API costs to compete. This craters revenue and makes the venture math harder. Anthropic and OpenAI have raised at valuations that assume premium pricing. A race to the bottom isn't in their business plans.
Match on openness. Release weights. This is philosophically complicated for labs that have built safety cases around controlled deployment. It's also commercially complicated. Once weights are public, the pricing power evaporates.
Differentiate on capability. Push harder on the frontier. Ship models that justify the premium. This is the default play, but it requires staying ahead of a competitor that just caught up while burning less capital.
None of these options are comfortable. The most likely response is some combination: selective price cuts for high-volume customers, keeping flagship models closed while open-sourcing older generations, and accelerating release cadence.
But the strategic initiative has shifted. US labs are now responding to Chinese moves rather than setting the pace.
What I'm Watching
First, independent replication. MiniMax's numbers are self-reported. I want to see third-party evaluations on held-out tasks before I fully trust the 80.2% figure. Benchmark scores have been gamed before.
Second, production deployment. High benchmark scores don't always translate to smooth production behavior. Latency, reliability, content policy edge cases: these matter for real adoption. Let's see how M2.5 performs when thousands of developers are hitting it with real workloads.
Third, US policy response. Export controls were supposed to prevent this. If Chinese labs can ship frontier models despite chip restrictions, the strategic calculus in Washington changes. Expect hearings.
Fourth, what OpenAI and Anthropic do next. Scheduled release timelines may accelerate. Pricing structures may shift. The comfortable duopoly, really an oligopoly, just got a credible challenger.
The Bottom Line
MiniMax M2.5 is real. The benchmark achievement is real. The pricing disruption is real.
I've covered enough hype cycles to stay skeptical of any single announcement. But the pattern here is clear: Chinese AI labs are not trailing. On the task that matters most for economic value, automated software engineering, they're now at parity or better, at a fraction of the cost, with open weights.
The gap closed faster than anyone in San Francisco expected. The question now is whether it stays closed, or whether MiniMax just signaled that the lead might flip.