ChatGPT 4.1 Early Benchmarks compared to Google Gemini


GPT-4.1 is now rolling out. It's a significant step up from GPT-4o, but it doesn't beat the benchmarks set by Google's Gemini models.

Yesterday, OpenAI confirmed that developers with API access can try three new models: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano. According to benchmarks, the new models are superior to the existing GPT-4o and GPT-4o mini, especially at coding.
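
For developers who want to try the new models, the request shape is the standard OpenAI chat-completions call. Below is a minimal sketch assuming the official openai Python SDK and an OPENAI_API_KEY set in the environment; the model identifiers are the ones OpenAI announced.

```python
# Minimal sketch: calling GPT-4.1 through the official OpenAI Python SDK
# (pip install openai). Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",  # swap in "gpt-4.1-mini" or "gpt-4.1-nano" for the cheaper tiers
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
)
print(response.choices[0].message.content)
```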

GPT-4.1, for example, scored 54.6% on SWE-bench Verified, an improvement of 21.4 percentage points over GPT-4o and 26.6 points over GPT-4.5. OpenAI has shared similar results from other benchmarks, but how do the new models compare to Gemini?

Benchmarks for ChatGPT 4.1

[Image: Benchmarks comparing LLMs.]

Stagehand shared benchmarks showing that Gemini 2.0 Flash had the lowest error rate (6.67%) and the highest exact-match score (90%). It is also cheap and fast.

GPT-4.1, on the other hand, has a higher error rate (16.67%) and costs more than ten times as much as Gemini 2.0 Flash.

The other GPT-4.1 variants ("mini" and "nano") may be cheaper or faster, but they are not as accurate as the full GPT-4.1.

[Chart: LLM performance (vertical axis) plotted against price per million tokens (horizontal axis).]

According to additional data shared by Pierre Bongrand, a Harvard scientist working on RNA, GPT-4.1 offers worse value for money than its rivals, even though it costs less than GPT-4.

Models such as Gemini 2.0 Flash and Gemini 2.5 Pro sit closer to the cost-performance frontier, which suggests they deliver more performance per dollar.

While GPT-4.1 is still an option, there are cheaper or more capable alternatives.

Coding benchmarks show GPT-4.1 lagging behind Gemini 2.5

We are seeing similar results on Aider Polyglot, which lists GPT-4.1 with a score of 52%, while Gemini 2.5 is miles ahead at 73%. GPT-4.1 can be accessed via the API, but you can also try it for free by signing up for Windsurf AI.
