OpenAI’s study highlights the limitations of LLMs for software engineering




Large language models (LLMs) have transformed software development, but enterprises should think twice before using them to replace human software engineers, despite OpenAI CEO Sam Altman’s claim that models can take over the work of low-level engineers.

That caution comes from a new paper in which OpenAI researchers describe SWE-Lancer, a benchmark they built to test how much money foundation models can earn by completing real-world freelance software engineering tasks. The test found that while the models can fix bugs, they often cannot work out why the bug is there in the first place, and they keep making mistakes.

The researchers assigned three LLMs – OpenAI’s GPT-4o and o1, and Anthropic’s Claude 3.5 Sonnet – 1,488 freelance software engineering tasks from the freelance platform Upwork, worth a combined $1 million in payouts. They divided the tasks into two categories: individual contributor tasks (resolving bugs or implementing new features) and management tasks (where the model acts as a manager choosing the best proposal to resolve an issue).

The researchers state that the “results indicate that real-world freelance work is still challenging for frontier language models.”

The results show that foundation models cannot yet replace human engineers: they can help fix bugs, but they are not at the point where they could earn a living as freelance software engineers.

Benchmarking freelancing

The researchers, working with 100 professional software engineers, identified potential tasks on Upwork and, without altering their wording, loaded them into a Docker container to create the SWE-Lancer dataset. The container has no internet access and cannot reach GitHub, “to prevent models scraping code or pull request details,” the researchers explained.
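The paper’s exact container configuration is not published in the article, but the network isolation it describes can be approximated with Docker’s standard flags. Below is a minimal Python sketch under that assumption; the image name, mount path and entry script are hypothetical.

```python
import subprocess

# Hypothetical image name; the actual SWE-Lancer container setup is not published in the article.
IMAGE = "swelancer-task:latest"

def run_isolated_task(task_dir: str) -> int:
    """Run a single benchmark task in a container with networking disabled,
    so the model under test cannot scrape GitHub or pull request details."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",                 # standard Docker flag: no network access at all
        "-v", f"{task_dir}:/workspace:ro",   # mount the frozen task snapshot read-only
        IMAGE,
        "python", "/workspace/solve.py",     # hypothetical entry point for the task harness
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    print(run_isolated_task("./tasks/example-task"))
```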

According to the team, the 764 individual contributor tasks, ranging from 15-minute bug fixes to week-long feature requests, total $414,775 in payouts. The remaining management tasks, which involve reviewing freelancer proposals submitted in response to job postings, account for the other $585,225.

All of the tasks come from Expensify, an expense management platform with an open-source codebase.

Based on the task title, description, and a snapshot of the codebase, the researchers created prompts. If there were other proposals to resolve the problem, “we also created a management task based on the issue description and the list of proposals,” the researchers explained.
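The article does not reproduce the researchers’ actual prompt format; the sketch below only illustrates the idea of assembling individual contributor and management prompts from a task’s title, description, codebase snapshot and proposal list. The Task fields and the prompt wording are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    title: str
    description: str
    snapshot_path: str                                   # path to the frozen codebase snapshot
    proposals: list[str] = field(default_factory=list)   # competing freelancer proposals, if any

def ic_prompt(task: Task) -> str:
    """Individual contributor prompt: ask the model to fix the issue against the snapshot."""
    return (
        f"Issue: {task.title}\n\n"
        f"{task.description}\n\n"
        f"The repository snapshot is available at {task.snapshot_path}. "
        "Produce a patch that resolves the issue."
    )

def manager_prompt(task: Task) -> str:
    """Management prompt: ask the model to choose the best of the submitted proposals."""
    options = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(task.proposals))
    return (
        f"Issue: {task.title}\n\n{task.description}\n\n"
        f"Candidate proposals:\n{options}\n\n"
        "Select the proposal most likely to resolve the issue and explain why."
    )
```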

The researchers then developed end-to-end tests. They wrote Playwright tests for each task that apply the model-generated patches; the tests were then “triple verified” by professional software engineers.

The paper explains that “Tests simulate actual user flows such as logging in to the application, performing complex tasks (such as making financial transactions), and verifying the model’s solution is working as expected.”
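The benchmark’s actual Expensify tests are not included in the article, but a user-flow test of the kind described might look like the rough Python Playwright sketch below. The URL, selectors and credentials are placeholders, not the benchmark’s real test code.

```python
from playwright.sync_api import sync_playwright, expect

def test_submit_expense_flow():
    """End-to-end check in the spirit described in the paper: log in,
    perform a task in the app, and verify the expected result appears."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Placeholder URL and selectors; the real tests target Expensify's application.
        page.goto("http://localhost:8080")
        page.fill("#email", "test-user@example.com")
        page.fill("#password", "password123")
        page.click("button:has-text('Sign in')")

        # Perform a more complex action, e.g. submitting an expense.
        page.click("text=New expense")
        page.fill("#amount", "42.00")
        page.click("button:has-text('Submit')")

        # Verify that the behavior the model's patch was supposed to produce is present.
        expect(page.locator("text=$42.00")).to_be_visible()

        browser.close()
```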

Test results

The researchers found that none of the models earned the full $1 million in task value. Claude 3.5 Sonnet was the best-performing model, earning only $208,050 and resolving 26.2% of individual contributor issues. The researchers point out that “the majority of the solutions are incorrect and higher reliability is required for trustworthy deployment.”

The models failed to resolve the majority of individual contributor tasks. Claude 3.5 Sonnet was the strongest of the three, ahead of GPT-4o and o1.

The report explains that agents excel at localizing but fail to root-cause, resulting in incomplete or flawed solutions. “Agents pinpoint an issue remarkably fast, using keyword search across the entire repository to locate the relevant file or function — often much faster than a person would. They often have a limited understanding of the issue, which can span multiple components or files. They also fail to address the root causes, leading to incorrect or inadequate solutions. We rarely see cases where the agent fails to find the correct file or location, or tries to reproduce the problem.”
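To make the “keyword search across the entire repository” behavior concrete, here is a minimal sketch of the kind of lookup an agent might perform; it is not the agents’ actual tooling, and the file extensions and example query are assumptions.

```python
import os

def keyword_search(repo_root: str, keywords: list[str],
                   exts: tuple[str, ...] = (".js", ".ts", ".py")) -> list[tuple[str, int, str]]:
    """Scan every source file under repo_root and return (path, line number, line text)
    for lines containing any of the keywords (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for name in filenames:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    for lineno, line in enumerate(f, start=1):
                        if any(k in line.lower() for k in lowered):
                            hits.append((path, lineno, line.strip()))
            except OSError:
                continue
    return hits

# Example: locate code related to an error message quoted in the issue description.
# print(keyword_search("./expensify-app", ["duplicate transaction"]))
```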

Interestingly, the models performed better on the management tasks, which require reasoning to evaluate technical proposals.

These tests show that AI models can solve some “low-level” coding issues but cannot yet replace “low-level” software engineers. The models took time, made mistakes and could not track down the root cause of coding problems. The researchers noted that human low-level engineers still do the job better than the models, though that may not remain true for long.
