If the AI sector had a song of the summer, a hit heard everywhere during the warmer months in the Northern Hemisphere, Alibaba’s Qwen Team would be the clear winner.
In the last week alone, the Chinese ecommerce giant’s frontier model AI research team has released not one, not two, not three, but four (!) new open source generative AI models, offering record-breaking benchmarks that better even some leading proprietary solutions.
Yesterday, Qwen Team capped it off with the release of Qwen3-235B-A22B-Thinking-2507, its updated reasoning large language model (LLM). A reasoning model takes longer to respond than a non-reasoning or “instruct” LLM, engaging in “chains of thought,” self-reflection, and self-checking that hopefully yield more correct and comprehensive responses on difficult tasks.
In fact, the new Qwen3-Thinking-2507 (as we’ll call this model) now leads or closely trails top-performing models on several major benchmarks.
As AI influencer and news aggregator Andrew Curran wrote: “Qwen’s strongest reasoning model has arrived, and it is on the frontier.”
The model shows a commanding performance on LiveCodeBench v6, scoring 74.1, ahead of Google Gemini-2.5 Pro (72.5) and OpenAI o4-mini (71.8), and well above its previous version’s 55.7. It scores 81.1% on GPQA, a benchmark of graduate-level multiple-choice questions, nearly matching DeepSeek-R1-0528’s 81.0% and trailing Gemini-2.5 Pro’s top score of 86.4%. Qwen3-Thinking-2507 also scored 81.1 on Arena-Hard, which evaluates alignment through win rates and subjective preferences.
These results show that the model not only outperforms its predecessors in every major category, but also sets new standards for open-source reasoning-focused models.
A move away from “hybrid reasoning”
Qwen3-Thinking-2507 reflects a broader strategic shift by Alibaba’s Qwen Team: moving away from hybrid reasoning models, which required users to manually toggle between “thinking” and “non-thinking” modes.
The team now trains separate models for reasoning tasks and instruction tasks. This separation allows each model to be optimized for its intended purpose, resulting in improved consistency and clarity. This design philosophy is fully embodied in the new Qwen3-Thinking model.
Alongside it, Qwen launched Qwen3-Coder-480B-A35B-Instruct, a 480B-parameter model built for complex coding workflows. It supports a 1,000,000-token context window and outperforms GPT-4.1 and Gemini 2.5 on SWE-Bench Verified.
The team also announced Qwen3-MT, a multilingual translation model trained on trillions of tokens across 92+ languages. It supports domain adaptation and terminology control, with inference starting at just $0.50 per million tokens.
Earlier in the week, the team released Qwen3-235B-A22B-Instruct-2507, a non-reasoning model that surpassed Claude Opus 4 on several benchmarks, and introduced a lightweight FP8 variant for more efficient inference on constrained hardware.
The models are licensed under Apache 2.0 and are available via Hugging Face, ModelScope, and Qwen’s own platform.
Licensing: Apache 2.0 and its enterprise advantage
Qwen3-235B-A22B-Thinking-2507 is released under the Apache 2.0 license, a highly permissive and commercially friendly license that allows enterprises to download, modify, self-host, fine-tune, and integrate the model into proprietary systems without restriction.
This is in contrast to proprietary or research-only releases, which often gate access behind APIs, impose usage limitations, or prohibit commercial deployment. Apache 2.0 licensing offers full flexibility and ownership for compliance-conscious teams and organizations looking to control costs, latency, and data privacy.
Availability and pricing
Qwen3-235B-A22B-Thinking-2507 is available now for free download on Hugging Face and ModelScope.
For enterprises that don’t want to, or lack the resources to, host model inference on their own hardware or virtual private cloud, paid API access is available through Alibaba Cloud, with self-hosted serving supported by frameworks such as vLLM and SGLang:
- Input price: $0.70 per million tokens
- Output price: $11.40 per million tokens
- Free tier: one million tokens, valid for 180 days
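As a quick illustration of what those list prices imply (a rough sketch; actual Alibaba Cloud billing may differ, and the example token counts are hypothetical):

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_m: float = 0.70,
                      output_price_per_m: float = 11.40) -> float:
    """Estimate request cost from per-million-token list prices."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Reasoning models emit long chains of thought, so output tokens dominate:
# e.g. 2,000 input tokens and 20,000 output tokens.
cost = estimate_cost_usd(2_000, 20_000)
print(f"${cost:.4f}")  # → $0.2294
```

Note how the 16x gap between input and output pricing means long reasoning traces, not prompts, drive the bill.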
The model is compatible with agent workflows via Qwen-Agent. It can be run locally using the Transformers framework, or integrated into development stacks via Node.js tools, CLI interfaces, or structured prompting.
For best performance, recommended sampling settings include temperature=0.6 and top_p=0.95, with a maximum output length of 81,920 tokens for complex tasks.
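Those settings might be wired up as follows (a minimal sketch assuming an OpenAI-compatible chat-completions endpoint, as served by vLLM or SGLang; the endpoint itself and the prompt are illustrative, not official values):

```python
# Request payload using the recommended sampling settings from the article.
# The model identifier matches the release name; the endpoint you POST this
# to depends on your own deployment and is not shown here.
payload = {
    "model": "Qwen3-235B-A22B-Thinking-2507",
    "messages": [
        {"role": "user",
         "content": "Prove that the square root of 2 is irrational."}
    ],
    "temperature": 0.6,   # recommended sampling temperature
    "top_p": 0.95,        # recommended nucleus-sampling cutoff
    "max_tokens": 81920,  # headroom for long chains of thought
}
```

The generous `max_tokens` budget matters for reasoning models: the chain of thought counts against the output limit, so a tight cap can truncate the answer mid-reasoning.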
Enterprise applications and future outlook
Qwen3-Thinking-2507’s strong benchmark performance, long-context capabilities, and permissive licensing make it a good choice for enterprise AI systems that involve reasoning, planning, or decision support. The broader Qwen3 ecosystem, which includes coding, translation, and instruction models, adds appeal for teams and businesses integrating AI in verticals such as engineering, localization, research, and customer support.
Qwen’s decision to release specialized AI models for specific use cases, backed by technical transparency and a community-based support system, signals a deliberate move toward building open, performant, and production-ready AI infrastructure.
As enterprises look for alternatives to API-gated black-box models and seek open-source solutions, Alibaba’s Qwen Series is increasingly positioned as a viable foundation that offers both control and capability.