
OpenAI’s new GPT-4.1 AI models focus on coding


OpenAI on Monday launched a new model family called GPT-4.1. Yes, “4.1,” as if the company’s nomenclature weren’t confusing enough already.

OpenAI claims that GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano all “excel” at coding and instruction-following. The multimodal models are available through OpenAI’s API but not ChatGPT. They have a 1-million-token context window, meaning they can take in roughly 750,000 words in one go (longer than “War and Peace”).
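The “roughly 750,000 words” figure follows from a common rule of thumb: English prose averages about three-quarters of a word per token. A minimal sketch of that back-of-the-envelope conversion (the 0.75 ratio is an approximation used here for illustration, not an OpenAI-published constant):

```python
# Rough tokens-to-words estimate for English prose.
# Assumption: ~0.75 words per token on average (varies by text and tokenizer).
def tokens_to_words(tokens: int, words_per_token: float = 0.75) -> int:
    """Estimate how many English words fit in a given token budget."""
    return int(tokens * words_per_token)

# A 1-million-token context window covers roughly 750,000 words.
print(tokens_to_words(1_000_000))  # → 750000
```

For exact counts against a specific model, a tokenizer library would be needed, since the words-per-token ratio varies with the text.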

GPT-4.1 arrives as OpenAI competitors like Google and Anthropic ratchet up their efforts to build sophisticated programming models. Google’s recently released Gemini 2.5 Pro, which also has a 1-million-token context window, ranks highly on popular coding benchmarks, as do Anthropic’s Claude 3.7 Sonnet and Chinese AI startup DeepSeek’s upgraded V3.

Many tech giants, including OpenAI, are aiming to train AI coding models capable of performing complex software-engineering tasks. OpenAI’s goal is to create a “software agent,” as CFO Sarah Friar put it at a tech summit in London last month. The company claims its future models will be able to program entire applications end to end, handling aspects such as bug testing, quality assurance, and documentation. GPT-4.1 represents a step in this direction.

An OpenAI spokesperson told TechCrunch by email that the company has optimized GPT-4.1 for real-world use based on developer feedback. That includes improving frontend coding, making fewer extraneous changes, following formats reliably, adhering consistently to response structure and ordering, and using tools consistently. These improvements, the spokesperson said, let developers build agents that are considerably better at real-world software-engineering tasks.

OpenAI claims that the full GPT-4.1 model outperforms its GPT-4o and GPT-4o mini models on coding benchmarks, including SWE-bench. GPT-4.1 mini is faster and more efficient, per OpenAI, but at the cost of some accuracy, while GPT-4.1 nano is the company’s fastest and cheapest model.

GPT-4.1 costs $2 per million input tokens and $8 per million output tokens. GPT-4.1 mini costs $0.40 per million input tokens and $1.60 per million output tokens, while GPT-4.1 nano costs $0.10 per million input tokens and $0.40 per million output tokens.
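Per-request cost at these rates is simple arithmetic: each token count divided by one million, times the corresponding per-million rate. A minimal sketch, assuming the prices quoted above (the helper function and price table are illustrative, not part of OpenAI’s SDK; check OpenAI’s pricing page before relying on the numbers):

```python
# Illustrative price table mirroring the rates quoted in the article:
# model: (input $/1M tokens, output $/1M tokens)
PRICES_PER_MILLION = {
    "gpt-4.1": (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a single request at the quoted rates."""
    in_rate, out_rate = PRICES_PER_MILLION[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A full-context GPT-4.1 call (1M input tokens, 10K output tokens):
print(round(estimate_cost("gpt-4.1", 1_000_000, 10_000), 2))  # → 2.08
```

Note that at the full 1-million-token context, a single GPT-4.1 request costs at least $2 in input tokens alone, which is one reason the cheaper mini and nano tiers exist.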

In OpenAI’s own testing, GPT-4.1 scored between 52% and 54% on SWE-bench Verified, a human-validated subset of SWE-bench. (OpenAI explained in a blog post that some solutions to SWE-bench Verified problems couldn’t run on its infrastructure, hence the range of scores.) Those figures are slightly below the scores Google and Anthropic report for Gemini 2.5 Pro (63.8%) and Claude 3.7 Sonnet (62.3%), respectively, on the same benchmark.

In a separate evaluation, OpenAI probed GPT-4.1 using Video-MME, a benchmark designed to measure a model’s ability to “understand content” in videos. GPT-4.1 achieved a chart-topping 72% accuracy in the “long, without subtitles” video category, OpenAI claims.

Although GPT-4.1 does well on benchmarks and has a more recent knowledge cutoff (up to June 2024), giving it a better frame of reference for current events, it’s worth bearing in mind that even the best models today can be unreliable. For example, many studies have shown that code-generating models often fail to fix security vulnerabilities and bugs, and can even introduce them.

OpenAI also acknowledges that GPT-4.1 becomes less reliable (i.e., more likely to make mistakes) the more input tokens it has to deal with. On one of the company’s own tests, OpenAI-MRCR, the model’s accuracy dropped from around 84% at 8,000 tokens to 50% at 1 million tokens. GPT-4.1 also tends to be more “literal,” according to the company, sometimes necessitating more specific and explicit prompts.

Kyle Wiggers is TechCrunch’s AI Editor. His writing has appeared in VentureBeat, Digital Trends, and a variety of gadget blogs, including Android Police, Android Authority, Droid-Life, and XDA-Developers. He lives in Manhattan with his partner, a music therapist.
