Chinese Open-Source AI Model Surpasses OpenAI’s GPT-5 in Key Benchmarks
As OpenAI continues to expand its AI infrastructure and invest heavily in proprietary models, Chinese open-source AI developers are rapidly closing the gap. A newly launched Chinese AI model has now matched or exceeded OpenAI’s flagship GPT-5 in several critical third-party performance tests, despite being freely available to the public.
Introducing Kimi K2 Thinking: A New Leader in Open AI
Released today by the Chinese startup Moonshot AI, Kimi K2 Thinking has outperformed both proprietary and open-weight competitors in benchmarks measuring reasoning, coding, and autonomous tool use. This milestone marks a significant breakthrough for open-source AI, demonstrating that free models can rival, and even surpass, paid industry leaders like OpenAI’s GPT-5, Anthropic’s Claude Sonnet 4.5 (Thinking mode), and xAI’s Grok-4.
Developers can access Kimi K2 Thinking through popular platforms such as Hugging Face and GitHub, where the model’s weights and code are openly hosted. The release also includes APIs supporting chat, complex reasoning, and multi-tool workflows, enabling broad integration possibilities. Users can test the model directly via its dedicated web interface and through partner platforms.
Flexible Licensing with Commercial Rights and Attribution
Moonshot AI has published Kimi K2 Thinking under a modified MIT license on Hugging Face. This license grants full commercial and derivative rights, allowing researchers and enterprises to freely use and adapt the model for commercial purposes. The only stipulation is an attribution requirement for large-scale deployments:
If the software or any derivative product serves over 100 million monthly active users or generates more than $20 million USD in monthly revenue, the deployer must prominently display “Kimi K2” on the product’s user interface.
This condition acts as a light-touch attribution clause, preserving the freedoms typical of MIT-style licenses while ensuring recognition for the model’s creators. As a result, Kimi K2 Thinking stands out as one of the most permissively licensed, cutting-edge AI models currently available.
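The attribution trigger is easy to express in code. The thresholds below come directly from the clause quoted above; the function name and structure are purely illustrative, and the license text itself remains the authoritative source.

```python
def attribution_required(monthly_active_users: int, monthly_revenue_usd: float) -> bool:
    """Return True if the Kimi K2 attribution clause applies.

    Per the clause quoted above: "Kimi K2" must be displayed prominently
    in the product UI once a deployment exceeds 100 million monthly
    active users OR $20 million USD in monthly revenue.
    """
    MAU_THRESHOLD = 100_000_000
    REVENUE_THRESHOLD_USD = 20_000_000
    return (monthly_active_users > MAU_THRESHOLD
            or monthly_revenue_usd > REVENUE_THRESHOLD_USD)

# A small startup: no attribution needed.
print(attribution_required(50_000, 10_000.0))          # False
# A large consumer product: attribution required.
print(attribution_required(150_000_000, 5_000_000.0))  # True
```

Because the condition is disjunctive, crossing either threshold alone is enough to trigger the requirement.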
Benchmark Performance: Leading in Reasoning, Coding, and Tool Use
Kimi K2 Thinking is a trillion-parameter Mixture-of-Experts (MoE) model, activating 32 billion parameters per inference to balance scale and efficiency. It excels in long-horizon reasoning and structured tool invocation, capable of autonomously executing 200-300 sequential tool calls without human input.
According to Moonshot’s published results, K2 Thinking achieved remarkable scores on several benchmarks:
- 44.9% on Humanity’s Last Exam (HLE), a cutting-edge reasoning test;
- 60.2% on BrowseComp, an agentic web search and reasoning challenge;
- 71.3% on SWE-Bench Verified and 83.1% on LiveCodeBench v6, key coding benchmarks;
- 56.3% on Seal-0, a real-world information retrieval evaluation.
These results consistently surpass GPT-5’s corresponding scores and outperform MiniMax AI’s recently released open-source model, MiniMax-M2.
Outperforming Proprietary AI Giants
While GPT-5 and Claude Sonnet 4.5 remain dominant proprietary “thinking” models, Kimi K2 Thinking’s agentic reasoning capabilities exceed both in multiple tests. For example, on BrowseComp, K2 Thinking scored 60.2%, significantly ahead of GPT-5’s 54.9% and Claude 4.5’s 24.1%. It also edges out GPT-5 on the GPQA Diamond benchmark (85.7% vs. 84.5%) and matches it on advanced mathematical reasoning tasks like AIME 2025 and HMMT 2025.
Only in specialized “heavy mode” configurations, where GPT-5 aggregates multiple inference trajectories, does the proprietary model regain parity. The fact that Moonshot’s fully open-weight model can rival or outperform GPT-5 marks a pivotal moment, effectively closing the performance gap between closed-source frontier systems and publicly accessible AI.
Surpassing Previous Open-Source Benchmarks
MiniMax-M2, launched just over a week ago, was previously celebrated as the “new king of open-source large language models,” achieving top scores among open-weight systems:
- τ²-Bench: 77.2
- BrowseComp: 44.0
- FinSearchComp-global: 65.5
- SWE-Bench Verified: 69.4
While MiniMax-M2 approached GPT-5-level agentic tool use, Kimi K2 Thinking now surpasses it by wide margins. For instance, K2’s 60.2% on BrowseComp outperforms M2’s 44.0%, and its 71.3% on SWE-Bench Verified beats M2’s 69.4%. On financial reasoning tasks like FinSearchComp-T3, K2 Thinking delivers comparable results while maintaining superior general reasoning capabilities.
Both models utilize sparse Mixture-of-Experts architectures for computational efficiency, but Moonshot’s design activates more experts and employs advanced INT4 quantization-aware training (QAT). This approach doubles inference speed compared to standard precision without sacrificing accuracy, which is crucial for extended “thinking-token” sessions with context windows up to 256,000 tokens.
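The core idea behind INT4 inference can be illustrated with a toy symmetric weight quantizer. This is a generic sketch of 4-bit quantization, not Moonshot’s actual QAT pipeline, which bakes quantization into training rather than applying it after the fact.

```python
import numpy as np

def quantize_int4(weights: np.ndarray):
    """Symmetric per-tensor quantization to 4-bit codes in [-8, 7]."""
    scale = np.abs(weights).max() / 7.0  # map the largest magnitude to 7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights from 4-bit codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)

# 4-bit codes need a quarter of FP16 storage, and rounding error is
# bounded by half the quantization step.
print(np.abs(w - w_hat).max() <= s / 2 + 1e-6)  # True
```

The memory savings translate into faster inference mainly because weight loading dominates decode-time cost in large MoE models.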
Advanced Agentic Reasoning and Autonomous Tool Integration
K2 Thinking’s standout feature is its transparent reasoning process. The model generates an auxiliary output field, reasoning_content, which reveals intermediate logical steps before producing final answers. This transparency ensures coherence across lengthy multi-turn interactions and complex multi-step tool invocations.
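Consuming that trace is straightforward. The response shape below is an assumption modeled on OpenAI-compatible chat payloads extended with the reasoning_content field described above; field names other than reasoning_content are hypothetical.

```python
# Hypothetical response payload, assuming an OpenAI-compatible chat shape
# extended with the reasoning_content field described above.
response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "reasoning_content": "Step 1: recall the formula. Step 2: substitute.",
            "content": "The answer is 42.",
        }
    }]
}

message = response["choices"][0]["message"]
trace = message.get("reasoning_content", "")  # intermediate reasoning steps
answer = message["content"]                   # final user-facing answer

print(trace.split(". ")[0])  # first intermediate step
print(answer)
```

Exposing the trace separately from the final answer lets applications log or audit the model’s reasoning without showing it to end users.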
Moonshot’s reference implementation showcases the model autonomously executing a “daily news report” workflow: it calls date and web search tools, analyzes retrieved information, and composes structured summaries, all while maintaining an internal reasoning state. This end-to-end autonomy enables K2 Thinking to plan, search, execute, and synthesize evidence over hundreds of steps, exemplifying the emerging class of “agentic AI” systems that operate with minimal human oversight.
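The workflow Moonshot describes reduces to a generic tool-calling loop: the model either requests a tool or emits a final answer, and tool results are appended to its context. Everything below is stubbed for illustration; the tool names and the fake planner are not Moonshot’s reference implementation, and a real deployment would call the model’s API instead.

```python
# Minimal agentic loop: the "model" plans tool calls until it can answer.
def get_date(_args):
    return "2025-11-06"

def web_search(args):
    return f"3 headlines about {args['query']}"

TOOLS = {"get_date": get_date, "web_search": web_search}

def fake_model(history):
    """Stub planner: call get_date, then web_search, then answer."""
    tool_results = [m for m in history if m["role"] == "tool"]
    if len(tool_results) == 0:
        return {"tool": "get_date", "args": {}}
    if len(tool_results) == 1:
        return {"tool": "web_search", "args": {"query": "AI news"}}
    return {"answer": f"Daily report for {tool_results[0]['content']}: "
                      f"{tool_results[1]['content']}"}

def run_agent(model, max_steps=10):
    history = [{"role": "user", "content": "Write today's news report."}]
    for _ in range(max_steps):
        step = model(history)
        if "answer" in step:
            return step["answer"]
        result = TOOLS[step["tool"]](step["args"])
        history.append({"role": "tool", "content": result})
    return "step budget exhausted"

print(run_agent(fake_model))
# Daily report for 2025-11-06: 3 headlines about AI news
```

K2 Thinking’s distinguishing claim is sustaining this loop for 200-300 steps while keeping the accumulated reasoning state coherent.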
Cost-Effective Performance at Scale
Despite its massive scale, K2 Thinking offers competitive runtime costs:
- $0.15 per 1 million tokens (cache hit)
- $0.60 per 1 million tokens (cache miss)
- $2.50 per 1 million tokens output
These prices are notably lower than MiniMax-M2’s $0.30 input / $1.20 output rates and dramatically undercut GPT-5’s $1.25 input / $10 output pricing, making K2 Thinking an economical choice for enterprises and developers.
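Using the per-million-token rates quoted above, the gap is easy to quantify. The workload numbers below are an arbitrary example, and real bills depend on cache-hit ratios, which this sketch ignores by using the cache-miss input rate.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             input_rate: float, output_rate: float) -> float:
    """Cost in USD given per-million-token rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Rates quoted in this article ($ per 1M tokens; K2 input is cache-miss).
K2_IN, K2_OUT = 0.60, 2.50
GPT5_IN, GPT5_OUT = 1.25, 10.00

# Example workload: 10M input tokens, 2M output tokens.
k2 = cost_usd(10_000_000, 2_000_000, K2_IN, K2_OUT)
gpt5 = cost_usd(10_000_000, 2_000_000, GPT5_IN, GPT5_OUT)
print(k2)    # 11.0
print(gpt5)  # 32.5
```

On this example workload, K2 Thinking comes in at roughly a third of GPT-5’s cost, and cache hits would widen the gap further.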
Rapid Progress in Open-Source AI Research
The swift succession of MiniMax-M2 and Kimi K2 Thinking highlights how quickly open-source AI is advancing toward, and now surpassing, frontier proprietary models. Both leverage sparse activation for efficiency, but K2 Thinking’s higher active parameter count (32 billion vs. 10 billion) delivers stronger reasoning accuracy across diverse domains. Its ability to scale inference time by increasing “thinking tokens” and tool calls without retraining offers practical performance improvements not yet seen in MiniMax-M2.
Technical Innovations Behind K2 Thinking
K2 Thinking supports native INT4 inference and handles ultra-long context windows of up to 256,000 tokens with minimal performance loss. Its architecture integrates quantization, parallel trajectory aggregation (“heavy mode”), and Mixture-of-Experts routing optimized for reasoning tasks.
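The “heavy mode” idea of aggregating parallel trajectories can be sketched as majority voting over independent reasoning runs. Neither vendor’s exact aggregation method is public, so the voting scheme below is a generic illustration of the technique, not K2 Thinking’s or GPT-5’s actual mechanism.

```python
from collections import Counter

def aggregate_heavy_mode(trajectories):
    """Pick the most common final answer across parallel reasoning runs.

    Each trajectory is an independent sample from the model; agreement
    among runs is used as a proxy for answer reliability.
    """
    answers = [t["final_answer"] for t in trajectories]
    return Counter(answers).most_common(1)[0][0]

# Four parallel runs: three agree, one dissents.
runs = [
    {"final_answer": "42"},
    {"final_answer": "41"},
    {"final_answer": "42"},
    {"final_answer": "42"},
]
print(aggregate_heavy_mode(runs))  # 42
```

The trade-off is cost: running N trajectories multiplies inference spend by N, which is why heavy-mode results are usually reported separately from single-pass scores.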
These technical advances enable the model to sustain complex iterative workflows, such as code compile-test-fix cycles and multi-step search-analyze-summarize processes, across hundreds of tool calls. This capability underpins its superior performance on benchmarks like BrowseComp and SWE-Bench, where maintaining reasoning continuity is critical.
Strategic Impact on the Global AI Landscape
The convergence of open and closed AI models at the highest performance levels signals a fundamental shift in the industry. Organizations that once depended solely on proprietary APIs can now deploy open-source alternatives matching GPT-5-level reasoning, while retaining full control over model weights, data privacy, and regulatory compliance.
Moonshot’s open publication approach builds on precedents set by models like DeepSeek R1, Qwen3, GLM-4.6, and MiniMax-M2, but extends open-source capabilities into full agentic reasoning. For academic researchers and enterprise developers alike, K2 Thinking offers transparency through inspectable reasoning traces and the flexibility to fine-tune the model for specialized applications.
This development arrives amid growing scrutiny of the financial sustainability of AI’s largest players. Recently, OpenAI’s CFO sparked debate by suggesting that the U.S. government might need to provide financial “backstops” for the company’s massive compute and data center investments-comments interpreted as calls for taxpayer-backed guarantees. Meanwhile, major tech firms like Microsoft, Meta, and Google are aggressively securing chip supply chains, fueling concerns about an unsustainable “AI arms race” driven more by strategic fear than commercial viability.
Against this backdrop, Moonshot AI’s and MiniMax’s open-weight releases intensify pressure on U.S. proprietary AI companies and their investors to justify their enormous expenditures and paths to profitability. If enterprises can obtain equal or superior performance from free, open-source Chinese models compared to paid proprietary solutions like GPT-5, Claude Sonnet 4.5, or Google’s Gemini 2.5 Pro, the rationale for costly subscriptions weakens. Notably, some Silicon Valley companies have already expressed concerns about escalating AI costs.
For investors and businesses, these trends suggest that cutting-edge AI capability no longer requires massive capital outlays. Instead, the future may belong to research teams optimizing model architectures and quantization techniques for efficiency rather than scale alone.
Implications for Enterprises and the AI Community
Within weeks of MiniMax-M2’s rise, Kimi K2 Thinking has overtaken it, along with leading proprietary models like GPT-5 and Claude 4.5, across nearly all reasoning and agentic benchmarks. This demonstrates that open-weight AI systems can now match or exceed the performance and efficiency of closed-source frontier models.
For the AI research community, K2 Thinking represents more than just another open model; it signals a new era of collaborative frontier development. The most advanced reasoning AI available today is no longer confined to closed commercial products but is accessible as an open-source system to anyone.
