
The initial reactions to OpenAI’s landmark open source gpt-oss models are highly varied and mixed


Initial reactions from the AI developer and user community have been mixed, despite the fact that OpenAI's gpt-oss open-source models achieved technical benchmarks comparable to OpenAI's other powerful proprietary AI models. From my observations, if this release were a movie premiering on Rotten Tomatoes, we'd see a near 50/50 split between positive and negative reviews.

To start, a little background: OpenAI released two new text-only (no image analysis or generation) language models under the permissive Apache 2.0 license. This is the first time the company has released a cutting-edge open language model since 2019, before ChatGPT.

For the last 2.7 years, all of ChatGPT's models have been proprietary, closed-source ones that OpenAI alone controlled. Users had to pay for access (or use a limited free tier), and could not run the models on private computing hardware or offline.




But that all changed with yesterday's release of the pair of gpt-oss models: a larger, more powerful one designed to run on a single Nvidia H100 GPU, say in a small or medium-sized enterprise's data center or server farm, and a smaller one that works on a single consumer laptop or desktop PC like the kind in your home office. Within hours of release, the AI community had begun testing and running them on its own benchmarks and tasks.

We're now getting a wave of feedback ranging from optimistic enthusiasm about the potential of these powerful and efficient new models to an undercurrent of dissatisfaction with what some users perceive as significant problems and limitations.

High benchmarks but still behind Chinese leaders in open source

Intelligence benchmarks place the gpt-oss models above most American open-source offerings. Independent third-party evaluator Artificial Analysis says gpt-oss-120B is the "most intelligent American open weights model," but notes it still falls short of Chinese heavyweights such as DeepSeek R1 and Qwen3 235B.

For some critics, that was all the models did. Self-proclaimed DeepSeek stan @teortaxesTex wrote that they had "mogged on benchmarks" but predicted: "No good derivative model will be trained… No usecases will be created… Barren claims to bragging rights." AI researcher Teknium (@Teknium1), co-founder of rival open-source AI model provider Nous Research, called the release a "legitimate nothing burger" on X and predicted that a Chinese model would soon surpass it. "Overall, I am very disappointed. I came to this with an open mind," they wrote.

A focus on math and coding, at the expense of writing skills?

The gpt-oss models' apparently narrow usefulness also drew criticism.

Influencer @scaling01 (https://x.com/scaling01/status/1953047913954791696) noted that the models excelled at math and coding, but "completely lacked taste and common sense."

Some users found that the model injected equations into poetry outputs. "This is what you get when you benchmarkmax," Teknium commented, sharing a screenshot of the model inserting an integral formula into the middle of a poem.

And @kalomaze, a researcher at decentralized AI model training company Prime Intellect, wrote that "gpt oss 120b knows less about the world than a good 32b," speculating that OpenAI "probably pre-trained on majority synth to avoid copyright issues" and calling the result "pretty devastating stuff."

Former Googler and independent AI developer Kyle Corbitt agreed that the gpt-oss models were "extremely spiky": "great for the tasks that it was trained on and really bad for everything else." Corbitt wrote that the models are good at coding and math tasks, but bad at linguistic tasks such as creative writing or report generation.

The charge is that OpenAI deliberately trained the models on more synthetic data than real-world text in order to avoid using copyrighted information scraped from websites and other repositories it doesn't own or have a license to use (a practice many leading generative AI firms have been accused of, and over which they currently face lawsuits). Training primarily on synthetic data could also help OpenAI avoid safety and security issues.

Third-party benchmarking results

For some users, evaluating the models on third-party tests has revealed concerning metrics.

On SpeechMap, which measures how readily LLMs comply with contested or sensitive user prompts, compliance scores for gpt-oss-120B hovered below 40%, near the bottom of peer open models. This indicates a tendency to refuse user requests and default to guardrails, sometimes at the expense of providing accurate information.

On Aider's Polyglot evaluation, a multi-language coding benchmark, gpt-oss-120B scored only 41.8%, far below competitors such as Kimi-K2 (59%) and DeepSeek R1 (56.9%).

Some testers also said their tests showed the model treating topics related to the US and EU differently than those related to China and Russia, raising questions about bias and training data filtering.

Some experts praise the release and what it means for U.S. open-source AI

To be fair, the commentary is not all negative. Software engineer and AI watcher Simon Willison described the release as "really impressive" on X, elaborating in a blog post that praised the models' efficiency and ability to compete with OpenAI's proprietary o3-mini and o4-mini models.

Willison praised the models' performance on reasoning and STEM benchmarks, their support for third-party tool use, and the new "Harmony" prompt template format, which offers developers more structured ways to guide model responses.
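For readers unfamiliar with Harmony, here is a rough sketch of what a Harmony-formatted exchange looks like, based on OpenAI's published gpt-oss documentation. The special tokens, roles, and channel names below come from that spec; the message content is invented purely for illustration:

```
<|start|>system<|message|>You are a helpful assistant.
Reasoning: high
# Valid channels: analysis, commentary, final<|end|>
<|start|>user<|message|>What is the capital of France?<|end|>
<|start|>assistant<|channel|>analysis<|message|>Simple factual question;
no tools needed.<|end|>
<|start|>assistant<|channel|>final<|message|>The capital of France is Paris.<|return|>
```

The "analysis" channel carries the model's chain-of-thought reasoning, which is not intended to be shown to end users, while the "final" channel carries the user-facing answer; this channel structure is what gives developers more explicit control over model responses.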

Clem Delangue, CEO and co-founder of AI code-sharing community Hugging Face, encouraged users not to rush to judgment, pointing out that inference for these models is complex to run, and that early issues could be caused by infrastructure instability and inadequate optimization among hosting providers. The beauty of open source, Delangue wrote, is that no one can cheat: "We'll uncover the strengths and weaknesses… gradually."

Ethan Mollick, a professor at the University of Pennsylvania's Wharton School, was even more cautious, questioning whether the release was a one-off for OpenAI. He noted that "the lead will disappear quickly as others catch up," and that it is unclear what incentive OpenAI has to keep the models up to date.

Nathan Lambert, a leading AI researcher at the rival open-source lab Allen Institute for AI (Ai2) and a commentator, praised the symbolic importance of the release on his blog Interconnects, calling it "a phenomenal move for the open eco-system, especially for the West, and its allies."

On X, however, he warned that gpt-oss is "unlikely to meaningfully slow" Qwen, the AI team of Chinese e-commerce giant Alibaba, citing Qwen's usability and performance.

In his view, the release marks a significant shift in the U.S. toward open models, but OpenAI still has a long way to go to catch up.

A split verdict

For now, the verdict is split. OpenAI's gpt-oss models represent a milestone in terms of licensing, accessibility, and affordability.

While the benchmarks are impressive, the "vibes," as many users call them, are less convincing.

Whether developers can build robust applications and derivatives on gpt-oss will determine if the release is remembered as a breakthrough or a blip.
