Study finds AI tools slow down open source software developers by 19 percent

The time saved on active coding is overwhelmed by the time spent waiting for, prompting, and reviewing AI outputs.


METR’s findings appear to contradict other benchmarks and experiments that demonstrate an increase in coding efficiency with AI tools. Those benchmarks often measure productivity by total lines of code or by the number of discrete tasks, code commits, or pull requests completed. However, these can be poor proxies for actual coding efficiency.

Many existing coding benchmarks focus on synthetic, algorithmically scorable tasks created specifically for the benchmark. That makes it difficult to compare their results with those based on real-world, pre-existing codebases. In surveys, developers in METR’s study said that the complexity of the repositories they work with (which are on average 10 years old and contain over 1,000,000 lines of code) limited the AI’s ability to be helpful. The researchers note that the AI could not draw on “important tacit knowledge or context” about the codebase, while “high developer familiarity with [the] repositories” aided the humans’ coding productivity on these tasks.

These findings led the researchers to conclude that AI coding tools are not well suited for “settings with very high quality standards, or with many implicit requirements (e.g., relating to documentation, testing coverage, or linting/formatting) that take humans substantial time to learn.” While these factors may not apply to “many realistic, economically relevant settings” that involve simpler codebases, they could limit the impact of AI tools in this study and other real-world situations.

The researchers are optimistic that, even for complex coding tasks like those studied, further refinement of AI tools could yield future efficiency gains for programmers. They write that systems with better reliability, lower latency, or more relevant outputs (via techniques like prompt scaffolding or fine-tuning) “could speed up developers in our setting.” They also point to “preliminary evidence” that the recently released Claude 3.7 “can often correctly implement the core functionality of issues on several repositories that are included in our study.”

For now, however, METR’s research provides strong evidence that AI’s much-vaunted usefulness in coding could face significant limitations in complex, real-world situations.

