Following Up on OpenAI: China’s o1-Class Reasoning Models Are Being Released One After Another

Domestic o1-class models saw a wave of intensive releases in the first month of 2025. The publishers include startups such as Moonshot AI and StepFun, as well as DeepSeek, which stands apart from the usual startup landscape.

DeepSeek released DeepSeek-R1 on January 20th. The model weights were also made available.

According to test results released by DeepSeek, R1 performs comparably to OpenAI-o1-1217 on tasks such as mathematics and coding, and is particularly strong on three test sets: AIME 2024 (American Invitational Mathematics Examination), SWE-bench Verified (software engineering), and MATH-500.

To validate R1’s abilities, several smaller versions of the 660B R1 have been distilled onto models from the Qwen and Llama series. The 32B and 70B distilled models are comparable to OpenAI’s o1-mini across various abilities, and the 14B Qwen-series distilled model significantly outperforms QwQ-32B-Preview on various reasoning test sets.

Note that DeepSeek has also open-sourced DeepSeek-R1-Zero, a breakthrough model that applies only reinforcement learning (RL) on top of pre-training, without supervised fine-tuning (SFT).

R1-Zero, which uses no human-supervised data, can suffer from poor readability and mixed languages in its generations, but its performance is still comparable to OpenAI-o1-0912. Its greater significance lies in exploring the technical possibility of training a large language model solely through reinforcement learning, without supervised fine-tuning, which provides a foundation for subsequent research.

DeepSeek’s pricing strategy lives up to its reputation as the “Pinduoduo” of the AI large-model field. The DeepSeek-R1 API is priced at 1 yuan per million input tokens (cache hit), 4 yuan per million input tokens (cache miss), and 16 yuan per million output tokens. At these rates, a cache-hit input token costs less than 2% of OpenAI o1’s price, while the cache-miss input and output prices are only about 3.6% of o1’s.
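As a rough sanity check, the gap can be reproduced with a back-of-the-envelope calculation. The OpenAI o1 prices ($7.50 / $15 / $60 per million cached-input / input / output tokens) and the exchange rate of roughly 7.3 yuan to the dollar used below are assumptions made for illustration, not figures from the article.

```python
# Back-of-the-envelope comparison of DeepSeek-R1 and OpenAI o1 API prices.
# Assumed (not from the article): o1 at $7.50 / $15 / $60 per million
# cached-input / input / output tokens, and an exchange rate of 7.3 RMB per USD.
RMB_PER_USD = 7.3

r1_prices_rmb = {"input (cache hit)": 1, "input (cache miss)": 4, "output": 16}
o1_prices_usd = {"input (cache hit)": 7.5, "input (cache miss)": 15, "output": 60}

for kind, rmb in r1_prices_rmb.items():
    ratio = (rmb / RMB_PER_USD) / o1_prices_usd[kind]
    print(f"{kind}: R1 costs about {ratio:.1%} of o1")  # ~1.8%, ~3.7%, ~3.7%
```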

Moonshot AI released its K1.5 reasoning model on the same day, a model that goes head-to-head with DeepSeek-R1.

SEE ALSO: Kimi K1.5: The First Non-OpenAI Model to Match Full-Powered o1 Performance

Since November last year, Moonshot AI has updated the k0-math math model, the k1 visual thinking model, and other k-series reinforcement learning models. K1.5 is the model that advances this line into multimodal thinking.

Moonshot AI describes K1.5 as having “multimodal o1” capability. In other words, K1.5 has both general multimodal capabilities and reasoning abilities.

According to official data, the mathematical, coding, and visual multimodal capabilities of its short-CoT mode (short consideration mode) are comparable to GPT-4o and Claude 3.5 Sonnet, while the multimodal reasoning, mathematical, and coding abilities of its long-CoT mode (long consideration mode) reach the level of the officially released OpenAI o1.

Both R1 and K1.5 use reinforcement learning, multi-stage processes, chains of thought, and reward models. From publicly available information, it appears that each has its own technical strategy at different stages.

DeepSeek used thousands of cold-start CoT samples to fine-tune the DeepSeek-V3 base model, then ran large-scale RL focused on reasoning, introducing a language-consistency reward to overcome the language-mixing issue. After another round of supervised fine-tuning (SFT), it performed reinforcement learning for all scenarios using different reward rules.

R1 also incorporates the Group Relative Policy Optimization (GRPO) algorithm into its reinforcement learning, which can improve sample efficiency and training stability.
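At its core, GRPO scores each sampled response against the other responses drawn for the same prompt, so no separate value network is needed. The sketch below shows only that group-relative advantage step; the binary reward and group size are illustrative choices, not DeepSeek’s actual settings.

```python
import numpy as np

def group_relative_advantages(rewards):
    """For one prompt, normalize each sampled response's reward against the
    group mean and standard deviation; GRPO uses this in place of the
    advantages a learned value function would provide."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards a zero-variance group

# Toy example: four responses sampled for one prompt, rewarded 1 if the final
# answer is correct and 0 otherwise (the reward design here is illustrative).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```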

On the one hand, k1.5 expands the reinforcement learning context window to 128k; on the other, it uses a variant of online mirror descent for robust policy optimization. This combination let k1.5 build a relatively simple reinforcement learning framework that still delivers performance, without incorporating more complicated techniques such as Monte Carlo Tree Search, value functions, or process reward models.
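For readers unfamiliar with the term, online mirror descent on a probability simplex reduces to a multiplicative, exponentiated-gradient update. The snippet below is a generic textbook illustration of that update, not Moonshot AI’s actual variant, which operates on language-model policies rather than an explicit probability vector.

```python
import numpy as np

def entropic_mirror_descent_step(probs, grad, lr=0.1):
    """One online mirror descent (exponentiated-gradient) ascent step on the
    probability simplex: scale each probability by exp(lr * gradient), then
    re-normalize. A generic illustration, not k1.5's exact update rule."""
    logits = np.log(probs) + lr * grad
    new = np.exp(logits - logits.max())  # subtract max for numerical stability
    return new / new.sum()

# Toy usage: nudge a uniform 3-action policy toward the action whose
# reward signal (gradient) is largest.
policy = np.array([1 / 3, 1 / 3, 1 / 3])
reward_gradient = np.array([0.2, 1.0, -0.5])
print(entropic_mirror_descent_step(policy, reward_gradient))
```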

It is also worth noting that K1.5 introduces a “length penalty” into reinforcement learning, with a reward formula that assigns values based on response length and correctness, and adopts methods such as “shortest rejection sampling” (selecting the shortest correct responses for supervised fine-tuning) to keep responses from growing too long.
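A minimal sketch of length-penalty reward shaping of the kind described: among the responses sampled for one prompt, shorter answers earn a bonus and longer ones a penalty, and an incorrect answer never receives a positive length bonus. The linear form and the 0.5 offset follow the shape given in the public k1.5 report, but the exact coefficients should be read as assumptions.

```python
def length_rewards(lengths, correct):
    """Length-based reward shaping for one prompt's sampled responses:
    the reward shrinks linearly from +0.5 (shortest) to -0.5 (longest),
    and incorrect responses are capped at zero so brevity never excuses
    a wrong answer. Coefficients are illustrative."""
    lo, hi = min(lengths), max(lengths)
    span = max(hi - lo, 1)  # avoid division by zero when all lengths match
    rewards = []
    for length, ok in zip(lengths, correct):
        lam = 0.5 - (length - lo) / span
        rewards.append(lam if ok else min(0.0, lam))
    return rewards

# Toy usage: three sampled answers of increasing length, the longest one wrong.
print(length_rewards([120, 300, 900], [True, True, False]))  # [0.5, ~0.27, -0.5]
```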

K1.5’s multimodal capability comes from joint training on text and visual data. Kimi admits that because some inputs are primarily in text formats, the model’s ability to understand geometry questions that depend on graphics may not yet be strong enough.

StepFun also released an experimental version of its reasoning model Step Reasoner mini (called “Step R-mini”) on January 16, 2025, featuring ultra-long reasoning abilities.

It is not yet a finished product. On test sets it mainly benchmarks against OpenAI o1-preview and o1-mini rather than the full o1, which is likely related to its model size and training method. On domestic benchmarks, its performance is similar to that of QwQ-32B-Preview.

StepFun, however, emphasizes a “balance between art and science”: using an on-policy reinforcement learning algorithm, the model can handle literary creation and daily chat while also ensuring logical reasoning, coding, and mathematical abilities.

Since September last year, when OpenAI transformed the model training paradigm with the launch of o1, it has been an industry expectation that major large-model companies would follow. The result is the growing number of domestic o1-class models.

While everyone was closely watching o1, OpenAI also previewed o3 and o3-mini during its December release season last year. Although o3 has not been officially launched, OpenAI has revealed that its performance is significantly better than o1’s.

On the SWE-bench Verified software engineering test set, for example, o3 scored 72.7% while o1 scored only 48.9%. On the AIME 2024 test set, o3 achieved an accuracy of 96.7% compared with o1’s 83.3%. Some of o3’s results are beginning to show the first signs of AGI (Artificial General Intelligence).

Of course, o3 has its own problems. On the one hand, o-series models perform better on tasks with well-defined boundaries and definitions, but they still fall short on some real-world engineering tasks. On the other hand, o3’s results on the recent mathematical benchmark FrontierMath have been called into doubt, because OpenAI’s funding of the benchmark’s affiliated institution gave it early access to the real questions.

The gap that domestic players face relative to large-model companies in the United States is clear. Technically, it is not known whether DeepSeek-R1 and k1.5 have successfully incorporated more complicated techniques such as Monte Carlo tree search and process reward models.

As OpenAI has pointed out, scaling up the reasoning stage from o1 to o3 took only three months, far faster than the pre-training paradigm of the GPT models, which iterated on a yearly cadence.

OpenAI not only has a clearer technological path but also the resources to verify and advance it quickly. This is the situation domestic large-model companies now face. For China’s large-model industry, breakthrough innovations that improve efficiency are more important than ever.

SEE ALSO: AI Assistant DeepSeek Official App Launched
