
Xiaomi MiMo-V2-Flash vs Kimi K2-Thinking – What’s the Best Open-Weight Model?

The mid-December 2025 launch of Xiaomi’s MiMo-V2-Flash, a 309-billion-parameter model, added a new player to the open-weight model landscape. This article offers a technical overview of MiMo-V2-Flash and the story behind it, then looks at how it stacks up not only against its de facto rival, Kimi K2-Thinking, but also against other major models such as DeepSeek V3.2, Google Gemini 3 Pro, OpenAI’s GPT-5, and Anthropic’s Claude Sonnet 4.5. (Try the latter three models here.) Let’s dig in.

A Brief Background of MiMo-V2

MiMo-V2-Flash represents Xiaomi’s “step 2” on their AGI roadmap, according to Chinese AI prodigy Luo Fuli, who recently joined Xiaomi’s MiMo team after a stint at DeepSeek.

The V2 release builds on lessons from MiMo-7B, incorporating FP8 mixed-precision training over 27 trillion tokens. The model was initially pre-trained with a 32K context window, then extended to 256K through continued training.

A novel Multi-Teacher On-Policy Distillation (MOPD) approach during post-training proved crucial. Rather than learning from a fixed dataset, the student model learns from its own generated responses with dense token-level guidance from domain-specific expert models. This eliminates exposure bias and enables more stable gradient updates.
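
Xiaomi hasn’t published reference code for MOPD in this article, but the core mechanic—the student samples its own response, a teacher scores every token of that response, and the student minimises a token-level divergence—can be sketched in a few lines. The sketch below uses off-the-shelf GPT-2 checkpoints and a single teacher purely for illustration; MiMo’s actual teachers, models, and loss details are not public here.

```python
# Minimal sketch of on-policy distillation with dense token-level guidance,
# in the spirit of MOPD. "gpt2" / "distilgpt2" are illustrative stand-ins for
# the real teacher and student; a single teacher replaces the multi-teacher setup.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("distilgpt2")   # the learner
teacher = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # a domain "expert"
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompt = tok("Prove that the sum of two even numbers is even.", return_tensors="pt")

# 1. On-policy: the student samples its own response to the prompt.
with torch.no_grad():
    rollout = student.generate(**prompt, max_new_tokens=64, do_sample=True)

# 2. Both models score every position of the student's own rollout.
student_logits = student(rollout).logits[:, :-1]
with torch.no_grad():
    teacher_logits = teacher(rollout).logits[:, :-1]

# 3. Dense token-level loss: KL between teacher and student next-token
#    distributions, applied only to the generated part (not the prompt).
gen_start = prompt["input_ids"].shape[1] - 1
s_logp = F.log_softmax(student_logits[:, gen_start:], dim=-1)
t_logp = F.log_softmax(teacher_logits[:, gen_start:], dim=-1)
loss = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")

loss.backward()
optimizer.step()
```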

Xiaomi contributed all inference code to SGLang on day zero of the release and worked closely with the LMSYS team to optimise deployment. The model is available under an MIT license with no restrictions on commercial use.

MiMo-V2-Flash: Speed Demon with Smart Architecture

Xiaomi’s entry into the frontier AI race isn’t just another “me too” release. MiMo-V2-Flash is purpose-built for one thing above all else: raw speed without sacrificing capability.

The model packs 309 billion total parameters but only activates 15 billion during inference—a Mixture-of-Experts design that delivers GPT-5-class performance while running at 150 tokens per second. That’s roughly 3x the speed of most competitors.
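
The article doesn’t detail the expert configuration, but the “huge total, tiny active” economics come from standard top-k expert routing: a small router picks a handful of experts per token, and only those experts run. Here is a toy sketch; the sizes and expert counts are invented, not MiMo’s real layout.

```python
# Toy top-k Mixture-of-Experts layer: many parameters stored, few run per token.
# Sizes and expert counts are illustrative, not MiMo-V2-Flash's real configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=64, top_k=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):         # only the selected experts ever run
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

layer = TopKMoE()
print(layer(torch.randn(8, 512)).shape)        # torch.Size([8, 512])
```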

What makes MiMo-V2-Flash genuinely innovative is its hybrid attention architecture. Instead of using full global attention across all layers (expensive and slow), Xiaomi implemented a 5:1 ratio: five consecutive layers use an aggressive 128-token sliding window, followed by one layer with full global attention.

This cuts memory requirements by nearly 6x while maintaining strong long-context performance. The model supports a massive 256K token context window—enough for hundreds of rounds of agent interactions without losing coherence.
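
A rough sketch of what that 5:1 layout implies, using the 128-token window from the article; the layer count and the mask-building code are illustrative, not Xiaomi’s implementation.

```python
# Sketch of the 5:1 hybrid layout described above: five layers use a 128-token
# sliding window, every sixth layer attends globally. Layer count and the mask
# construction are illustrative, not Xiaomi's implementation.
import torch

N_LAYERS, WINDOW, SEQ_LEN = 48, 128, 1024

def layer_kinds(n_layers: int):
    # "sliding" for five consecutive layers, then one "global" layer.
    return ["global" if (i + 1) % 6 == 0 else "sliding" for i in range(n_layers)]

def causal_mask(seq_len: int, window=None) -> torch.Tensor:
    """True where a query position is allowed to attend to a key position."""
    q = torch.arange(seq_len)[:, None]
    k = torch.arange(seq_len)[None, :]
    mask = k <= q                        # causal: never look ahead
    if window is not None:
        mask &= (q - k) < window         # sliding: only the most recent `window` keys
    return mask

print(layer_kinds(12))                   # five 'sliding', one 'global', repeated

full = causal_mask(SEQ_LEN)              # what a global-attention layer keeps
local = causal_mask(SEQ_LEN, WINDOW)     # what a sliding-window layer keeps
print(full.sum(1).float().mean().item())   # ≈ 512 keys visible per query
print(local.sum(1).float().mean().item())  # ≈ 120 keys visible per query
```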

Multi-Token Prediction (MTP) adds another speed boost. Rather than generating tokens one at a time, MiMo predicts 3-4 tokens ahead and validates them in parallel. This self-speculative decoding approach delivers 2-2.6x speedups without additional memory overhead.
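
The accept/reject control flow behind draft-and-verify decoding is easy to show generically. In the sketch below, draft_fn and verify_fn are hypothetical stand-ins; in MiMo’s self-speculative case both roles would be played by the same model (MTP heads draft, the main forward pass verifies).

```python
# Sketch of the draft-and-verify idea behind MTP-style self-speculative decoding:
# draft a few tokens cheaply, verify them in one parallel pass, keep the longest
# agreeing prefix. draft_fn / verify_fn are hypothetical stand-ins.
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_fn: Callable[[List[int], int], List[int]],
                     verify_fn: Callable[[List[int]], List[int]],
                     k: int = 4) -> List[int]:
    draft = draft_fn(prefix, k)              # k cheap draft tokens
    checked = verify_fn(prefix + draft)      # one parallel pass scores every position
    accepted = []
    for i, tok in enumerate(draft):
        verified = checked[len(prefix) + i - 1]   # model's own pick for this position
        accepted.append(tok if verified == tok else verified)
        if verified != tok:                  # first disagreement: stop accepting drafts
            break
    return prefix + accepted

# Toy demo: the "model" always continues with n+1; the draft gets two tokens right.
verify = lambda seq: [t + 1 for t in seq]
draft = lambda seq, k: ([seq[-1] + 1, seq[-1] + 2] + [99] * k)[:k]
print(speculative_step([5, 6, 7], draft, verify))   # [5, 6, 7, 8, 9, 10]
```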

MiMo-V2-Flash Benchmark Performance

According to Xiaomi’s official benchmarks, MiMo-V2-Flash achieves:

  • 94.1% on AIME 2025 (mathematical reasoning)
  • 73.4% on SWE-Bench Verified (software engineering)
  • 60.6 on LongBench V2 (long-context tasks)

These numbers place it squarely in the company of models like DeepSeek V3.2, Claude Sonnet 4.5, and even GPT-5 on certain tasks. On SWE-Bench Multilingual, it scored 71.7%—claiming the top spot among all open-source models.

The real kicker? Pricing at just $0.10 per million input tokens and $0.30 per million output tokens makes it one of the most cost-effective frontier models available.
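
For a sense of scale, here is the arithmetic at those prices on a hypothetical workload; the request volume and token counts are invented for illustration.

```python
# Back-of-envelope cost at the listed MiMo-V2-Flash prices. The workload below
# (request volume and token counts) is hypothetical, purely for illustration.
PRICE_IN, PRICE_OUT = 0.10, 0.30            # USD per million tokens
requests = 10_000_000
tokens_in, tokens_out = 2_000, 500          # tokens per request

cost = (requests * tokens_in / 1e6) * PRICE_IN + (requests * tokens_out / 1e6) * PRICE_OUT
print(f"${cost:,.0f}")                      # $3,500 for 25 billion tokens processed
```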

The Reality Check

Community testing reveals some gaps between benchmarks and real-world performance. Several developers report that MiMo-V2-Flash struggles with instruction following and can be “all over the place” on complex prompts. One reviewer noted that while the model excels at mathematical and algorithmic problems, it’s less reliable than Claude Sonnet 4.5 for general software engineering work.

The aggressive 128-token attention window, while brilliant for efficiency, may sacrifice some context awareness compared to full-attention models. And the model’s knowledge capacity appears limited compared to larger competitors—a natural trade-off at 15B active parameters.

Kimi K2-Thinking: The Reasoning Powerhouse

Moonshot AI took a different path with Kimi K2-Thinking. This isn’t just a language model—it’s a thinking agent that was natively trained to interleave reasoning with tool use.

Built on a 1-trillion-parameter MoE architecture with 32 billion active parameters (more than double MiMo’s activation), K2-Thinking was designed from the ground up for deep reasoning tasks that require hundreds of sequential steps.

What sets K2-Thinking apart is its ability to maintain coherent, goal-directed behaviour across 200-300 consecutive tool invocations. Most models degrade after 30-50 steps; K2-Thinking just keeps going.

The model was trained with an end-to-end agent methodology, learning when and how to call tools during the reasoning process. It performs dynamic cycles of think → search → browse → code → think, continually generating and refining hypotheses.
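
That loop is easier to see as code. The sketch below is a generic reason-act-observe loop, not Moonshot’s implementation; call_model and the tool names are hypothetical placeholders, and only the control flow is the point.

```python
# Generic sketch of the interleaved reason-and-act loop described above. This is
# not Moonshot's implementation: call_model and the tool registry are hypothetical.
import json

TOOLS = {
    "search": lambda query: f"search results for {query!r}",
    "browse": lambda url: f"page text fetched from {url}",
    "run_code": lambda src: "stdout of the executed snippet",
}

def call_model(messages):
    """Placeholder: a real call returns either a tool request or a final answer."""
    return {"type": "final", "content": "…"}

def agent_loop(task: str, max_steps: int = 300) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                    # K2-style long-horizon step budget
        step = call_model(messages)               # think
        if step["type"] == "tool":                # …then act
            observation = TOOLS[step["name"]](step["arguments"])
            messages.append({"role": "tool", "content": json.dumps(observation)})
            continue                              # feed the observation back, re-think
        return step["content"]                    # the model produced its final answer
    return "step budget exhausted"

print(agent_loop("Summarise the latest open-weight model releases."))
```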

On BrowseComp—a benchmark specifically designed to test web-based reasoning—K2-Thinking scored 60.2%, significantly outperforming the human baseline of 29.2%. This demonstrates genuine capability for autonomous research and information synthesis.

Kimi K2’s Open-Weight Benchmark Dominance

K2-Thinking’s benchmark results are consistently strong across the board:

  • 94.5% on AIME 2025 (slightly ahead of MiMo)
  • 71.3% on SWE-Bench Verified
  • 44.9% on Humanity’s Last Exam with tools
  • 89.4% on HMMT February 2025

The model achieved gold-medal performance on multiple international competitions, including IMO and IOI, showcasing world-class reasoning ability.

Notably, K2-Thinking preserves the distinctive writing quality and style from the original Kimi K2 Instruct model—something that often degrades during extended reinforcement learning training. Users report that it handles diverse tones and formats with natural fluency.

Built for the Long Haul

K2-Thinking supports native INT4 quantisation through Quantisation-Aware Training during post-training. This delivers roughly 2x generation speed improvements while maintaining state-of-the-art performance—all benchmark results are reported under INT4 precision.
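
Moonshot hasn’t released its QAT recipe here, but the core trick in most quantisation-aware training is “fake quantisation”: the forward pass sees weights rounded to 4-bit levels while gradients flow to the full-precision weights via a straight-through estimator. A minimal sketch follows; the group size is illustrative.

```python
# Sketch of the "fake quantisation" trick behind most quantisation-aware training:
# the forward pass sees weights rounded to INT4 levels, while gradients flow to the
# full-precision weights via a straight-through estimator. Group size is illustrative.
import torch

def fake_int4(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    groups = w.reshape(-1, group_size)
    scale = groups.abs().amax(dim=1, keepdim=True) / 7        # INT4 range is -8..7
    q = (groups / scale).round().clamp(-8, 7)                 # what inference will see
    dequant = (q * scale).reshape_as(w)
    return w + (dequant - w).detach()      # forward: quantised value; backward: identity

w = torch.randn(1024, 1024, requires_grad=True)
loss = fake_int4(w).square().sum()
loss.backward()                            # gradients still reach the FP weights
print(w.grad.shape)                        # torch.Size([1024, 1024])
```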

The 256K context window matches MiMo’s capability, but K2-Thinking’s full global attention architecture means it doesn’t sacrifice context awareness for efficiency gains.

Head-to-Head: Xiaomi MiMo-V2 vs Kimi K2

Benchmark Comparison
| Benchmark | MiMo-V2-Flash | Kimi K2-Thinking | DeepSeek-V3.2 Thinking | Gemini-3.0 Pro | Claude Sonnet 4.5 | GPT-5 High |
| --- | --- | --- | --- | --- | --- | --- |
| Reasoning | | | | | | |
| MMLU-Pro | 84.9 | 84.6 | 85.0 | 90.1 | 88.2 | 87.5 |
| GPQA-Diamond | 83.7 | 84.5 | 82.4 | 91.9 | 83.4 | 85.7 |
| HLE | 22.1 | 23.9 | 25.1 | 37.5 | 13.7 | 26.3 |
| AIME 2025 | 94.1 | 94.5 | 93.1 | 95.0 | 87.0 | 94.6 |
| HMMT Feb. 2025 | 84.4 | 89.4 | 92.5 | 97.5 | 79.2 | 88.3 |
| LiveCodeBench-v6 | 80.6 | 83.1 | 83.3 | 90.7 | 64.0 | 84.5 |
| General Writing | | | | | | |
| Arena-Hard (Hard Prompt) | 54.1 | 71.9 | 53.4 | 72.6 | 63.3 | 71.9 |
| Arena-Hard (Creative Writing) | 86.2 | 80.1 | 88.8 | 93.6 | 76.7 | 92.2 |
| Long Context | | | | | | |
| LongBench V2 | 60.6 | 45.1 | 58.4 | 65.6 | 61.8 | – |
| MRCR | 45.7 | 44.2 | 55.5 | 89.7 | 55.4 | – |
| Code Agent | | | | | | |
| SWE-Bench Verified | 73.4 | 71.3 | 73.1 | 76.2 | 77.2 | 74.9 |
| SWE-Bench Multilingual | 71.7 | 61.1 | 70.2 | – | 68.0 | 55.3 |
| Terminal Bench Hard | 30.5 | 30.6 | 35.4 | 39.0 | 33.3 | 30.5 |
| Terminal Bench 2.0 | 38.5 | 35.7 | 46.4 | 54.2 | 42.8 | 35.2 |
| General Agent | | | | | | |
| BrowseComp | 45.4 | 51.4 | – | – | 24.1 | 54.9 |
| BrowseComp (w/ Context Manage) | 58.3 | 60.2 | – | – | 67.6 | 59.2 |
| τ²-Bench | 80.3 | 74.3 | 80.3 | 85.4 | 84.7 | 80.2 |

(– = no score reported.)


Here’s a sharp side-by-side comparison of MiMo-V2 vs Kimi K2.

</> Mathematical Reasoning: K2 edges ahead with 94.5% vs MiMo’s 94.1% on AIME—effectively a tie at this level. Both models deliver world-class mathematical reasoning.

</> Coding Performance: MiMo claims 73.4% on SWE-Bench Verified vs K2’s 71.3%, but community testing suggests K2 is more reliable in real-world coding scenarios. MiMo excels at algorithmic problems; K2 handles broader software engineering tasks with more consistency.

</> Long-Context Processing: Here’s where things get interesting. Despite K2’s larger active parameter count (32B vs 15B), MiMo actually outperforms it on long-context benchmarks—60.6 vs 45.1 on LongBench V2. The hybrid attention architecture proves its worth.

</> Agent Capabilities: K2-Thinking wins decisively. Its native training for tool orchestration and ability to maintain coherence across hundreds of tool calls make it the superior choice for complex agentic workflows.

</> Speed and Cost: MiMo dominates. At 150 tokens/second and roughly 2.5% the inference cost of Claude Sonnet 4.5, it’s built for high-throughput production deployments where every millisecond and dollar counts.

</> Instruction Following: K2-Thinking is more reliable. Multiple community reports indicate MiMo struggles with complex instructions and can be inconsistent, while K2 maintains strong compliance across diverse tasks.

How Do They Stack Up Against Closed Models?

Both models represent the bleeding edge of what’s possible with open-weight architectures, but how do they compare to proprietary alternatives?

vs. Claude Sonnet 4.5: Xiaomi claims MiMo matches or exceeds Claude on multilingual coding benchmarks, but Claude’s superior instruction following and reliable tool calling keep it ahead for production systems. MiMo offers compelling cost savings for teams willing to work around quirks.

vs. GPT-5: Both open models approach GPT-5’s performance on reasoning benchmarks (MiMo: 94.1% AIME vs GPT-5’s 94.6%). K2-Thinking’s Speciale variant would likely close this gap further. The real difference is deployment flexibility—open weights enable on-premise hosting that GPT-5 can’t match.

vs. DeepSeek V3.2: This is the most direct competition. DeepSeek activates approximately 37B parameters compared to MiMo’s 15B, giving it advantages in general reasoning. MiMo edges ahead on math (94.1% vs 93.1% AIME) while matching DeepSeek’s 73.1% on SWE-Bench Verified. K2-Thinking sits between them in most benchmarks but excels at agentic tasks.

vs. Gemini 3.0 Pro: Google’s model still leads on most benchmarks—90.1% MMLU-Pro, 91.9% GPQA-Diamond—demonstrating the advantages of proprietary training at massive scale. But the open models are catching up fast, particularly on specialised tasks like coding and mathematical reasoning.

Which Model Should You Choose?

The answer depends entirely on your use case:

Choose MiMo-V2-Flash if you need:

  • Maximum inference speed for production workloads
  • Cost-effective deployment at scale
  • Strong mathematical reasoning
  • Long-context processing on a budget
  • English-dominant workflows

Choose Kimi K2-Thinking if you need:

  • Reliable multi-step reasoning
  • Complex agentic workflows with extensive tool use
  • Consistent instruction following
  • High-quality creative and practical writing
  • Chinese language support (90.9% CMMLU vs MiMo’s 87.4%)

Look at closed models if you need:

  • Zero deployment headaches (Claude, GPT-5)
  • Absolute frontier performance across all tasks (Gemini 3.0 Pro)
  • Production-ready reliability without technical team oversight

The Bottom Line

What’s remarkable about both MiMo-V2-Flash and Kimi K2-Thinking isn’t just their performance; it’s what they represent. Two Chinese AI labs, both founded within the last few years, are now producing open-weight models that legitimately compete with the best proprietary systems from OpenAI, Google, and Anthropic.

The training costs tell the story: K2-Thinking reportedly cost about $4.6 million to train, while even a massive model like DeepSeek V3 required just $5.6 million. Compare that to the estimated $50-100 million for GPT-4, and the trajectory becomes clear.

Open-weight models are catching up faster than anyone predicted. And with full transparency, MIT licensing, and no API lock-in, they’re changing the calculus for production AI deployments.