Ask those who are building generative artificial intelligence what they’re most excited about right now, and many will say: coding.
“That’s something very exciting for developers,” Jared Kaplan, chief scientist at Anthropic, told MIT Technology Review this month. “It’s understanding what’s going wrong with code, debugging.”
Copilot, a tool built on OpenAI’s large language models and launched by Microsoft’s GitHub in 2022, is now used by millions of software developers around the globe. Millions of people also turn to general-purpose chatbots such as Anthropic’s Claude and OpenAI’s ChatGPT for everyday assistance.
“Today, more than one-quarter of all new Google code is generated by AI. This code is then reviewed and approved by engineers,” Alphabet CEO Sundar Pichai claimed on an earnings call in October. “This helps engineers do more and move quicker.”
It’s not only the big companies that are releasing AI coding tools. This buzzy market has also attracted a number of new startups. Zencoder (which was valued at $750m within months of its launch), Merly, Cosine (which was valued at $3bn before it released a product), and Poolside are among the newcomers vying for a piece of this lucrative market.
“It looks like developers will pay for copilots,” says Nathan Benaich, an analyst with Air Street Capital. “And code is one of the easiest ways to monetize AI.”
Most existing tools provide developers with little more than supercharged autocomplete. This next generation can prototype, test, and debug code. The result is that developers may become managers who spend more time reviewing code written by models than writing it themselves.
There’s more. Many of the people building generative coding assistants believe that these tools could be a fast track to artificial general intelligence (AGI), the hypothetical superhuman technology that a number of top firms claim to have in their sights.
“The first time that we will see an economically valuable activity reach human-level abilities will be in software design,” says Eiso Kant, CEO and cofounder of Poolside. (OpenAI has already boasted that its latest o3 model beat the company’s chief scientist in a competitive programming challenge.)
Welcome to the second wave of AI coding.
Correct Code
There are two types of correctness. The first is correctness of a program’s syntax, its grammar: all the words, numbers, and mathematical operators must be in the correct place. This matters far more than correct grammar does in natural language. Make one tiny mistake in thousands of lines of code and the entire program will not run.
In this sense, the first generation of coding assistants is already pretty good. These tools have been trained on billions of pieces of code and have assimilated the surface-level structure of many types of programs.
There’s also a sense in which the function of a program is correct: Sure, it runs, but does it do what you wanted it to? The new generation of generative coding tools is aiming for this second level of correctness, and it is this that will change the way software is created.
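The gap between the two kinds of correctness is easy to see in a few lines. The sketch below is purely illustrative: the `average` function and its expected result are made up for this example. It checks a snippet once for syntax and once for behavior.

```python
import ast

# Hypothetical snippet: the programmer wanted the average of a list.
SOURCE = '''
def average(values):
    return sum(values) / 2   # bug: should divide by len(values)
'''


def is_syntactically_correct(source: str) -> bool:
    """First kind of correctness: does the code parse at all?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def is_functionally_correct(source: str) -> bool:
    """Second kind of correctness: does the code do what we wanted?"""
    namespace = {}
    exec(source, namespace)            # define the function
    average = namespace["average"]
    return average([2, 4, 6]) == 4     # the behavior we actually wanted


syntax_ok = is_syntactically_correct(SOURCE)
function_ok = is_functionally_correct(SOURCE)
print(f"parses: {syntax_ok}, does what we wanted: {function_ok}")
```

The snippet parses without complaint, yet it returns 6.0 rather than 4 for the list `[2, 4, 6]`. That is the kind of error a tool that only knows what code looks like would not catch.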
Alistair Pullen is a Cosine cofounder. “Large language models can write code that compiles, but they may not write the program you wanted,” he says. “To do that, you need to re-create the thought processes that a human coder would have gone through to get that end result.”
The problem is that the data most coding assistants have been trained on (the billions of pieces of code taken from online repositories) doesn’t capture those thought processes. It represents a finished product, not the process that went into its creation. Kant says that there is a lot of code in the world. “But this data does not represent software development.”
Pullen, Kant, and others are discovering that to build a model that can do more than autocomplete, one that can come up with useful programs, test them, and fix bugs, you need to show it a whole lot more than code. You must show it how the code was created.
In other words, companies like Cosine and Poolside are building models that don’t just mimic what good code looks like (whether it works well or not) but also mimic the process that produces such code in the first place. Get it right, and the models will produce far better code and far better bug fixes.
The breadcrumbs
You first need a data set that captures the steps a developer takes when writing code. Think of these steps as a trail of breadcrumbs that a machine can follow to produce a similar piece of code itself.
One part of that is figuring out which materials to draw on: Which sections of an existing codebase are required for a specific programming task? Zencoder’s founder, Andrew Filev, says that context is crucial. The first generation of tools was very bad at determining context: they would only look at the tabs you had open. Zencoder hired search engine veterans to build a tool that can analyze large codebases and determine what is relevant. Filev says that this detailed context reduces hallucinations and improves the quality of the code that large language models can produce.
Cosine also relies on context, but it uses that context to create an entirely new type of data set. The company asked dozens of coders to document what they did as they worked through hundreds of different programming tasks. “We asked them all to write everything down,” says Pullen. “Why did you open this file? Why did you scroll halfway through? Why did you close it?” The researchers also asked coders to annotate finished pieces of code, highlighting sections that would have required knowledge of other pieces of code or documentation to write.
Cosine takes all this information and generates a large synthetic data set that maps the typical steps coders take, and the information sources they draw on, to finished pieces of code. This data set is used to train a model to figure out what breadcrumb trail it needs to follow to produce a specific program, and then how to follow it.
Poolside, based in San Francisco, is also creating a synthetic data set that captures the coding process, but it relies more on a technique known as RLCE: reinforcement learning from code execution. (Cosine uses this too, but to a lesser extent.)
RLCE is analogous to the technique that makes chatbots such as ChatGPT slick conversationalists, known as RLHF: reinforcement learning from human feedback. RLHF trains a model to produce text that is more like the kind human testers say they prefer. RLCE trains a model to produce code that is more like the kind that does what it is supposed to when it is run (or executed).
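Neither company has published its training pipeline, so the following is only a minimal sketch of the core idea behind learning from code execution: run each candidate program against tests and turn the results into a score. The candidate programs, the test cases, and the `execution_reward` function are all invented for illustration. A real system would run untrusted code in a sandbox and feed the scores into a reinforcement-learning update, neither of which is shown here.

```python
# Hypothetical candidates a model might propose for "return the largest item".
CANDIDATES = {
    "a": "def largest(xs):\n    return xs[0]",           # right only by luck
    "b": "def largest(xs):\n    return sorted(xs)[-1]",  # correct
    "c": "def largest(xs)\n    return max(xs)",          # does not even parse
}

# Test cases as (input, expected output) pairs.
TESTS = [([1, 5, 3], 5), ([7, 2], 7), ([4], 4)]


def execution_reward(source: str) -> float:
    """Run a candidate and return the fraction of tests it passes."""
    namespace = {}
    try:
        exec(source, namespace)        # NOTE: sandbox this in real use
    except Exception:
        return 0.0                     # code that cannot run earns nothing
    func = namespace.get("largest")
    if func is None:
        return 0.0
    passed = 0
    for xs, expected in TESTS:
        try:
            if func(list(xs)) == expected:
                passed += 1
        except Exception:
            pass                       # a crash counts as a failed test
    return passed / len(TESTS)


rewards = {name: execution_reward(src) for name, src in CANDIDATES.items()}
best = max(rewards, key=rewards.get)
print(rewards, "best:", best)
```

Where RLHF relies on human testers to say which output they prefer, the signal here comes from running the code itself: candidate "b" passes every test, "a" passes two out of three, and "c" scores zero because it never runs.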
Gaming the system
Cosine and Poolside both say they were inspired by the approach DeepMind took with its game-playing model AlphaZero. AlphaZero was given the steps it could take (the moves in a game) and then left to play against itself over and over again until it figured out which sequences of moves were winning ones and which were not.
“They let it explore every possible move, simulate as many different games as you could throw at it. That’s how they beat Lee Sedol,” says Wang, who now works at Poolside, referring to the Korean Go grandmaster whom AlphaZero’s predecessor, AlphaGo, defeated in 2016. Before Poolside, Wang worked at Google DeepMind on applications of AlphaZero that went beyond board games. These included FunSearch, which was trained to solve complex math problems.
When the AlphaZero approach is applied to coding, the steps involved in producing a piece of code (the breadcrumbs) become the available moves in a game, and a correct program becomes winning that game. Left to play by itself, a model can improve far faster than a human could. Kant says that a human coder tries and fails one attempt at a time. “Models are able to try things 100 times simultaneously.”
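To make the game framing concrete, here is a toy sketch in which each "move" adds one step to a program and the game is won when the program passes every test. It is an assumption-laden illustration, not either company's method: the three moves and the target behavior are invented, and the search is plain brute force, trying sequences shortest first, rather than the learned, self-improving search that AlphaZero uses.

```python
from itertools import product

# Hypothetical "moves": each one extends the program by a single step.
MOVES = {
    "double": lambda x: x * 2,
    "increment": lambda x: x + 1,
    "square": lambda x: x * x,
}

# The game is won when every input maps to its expected output.
# Target behavior (unknown to the search): (2x + 1) ** 2
TESTS = [(1, 9), (2, 25), (3, 49)]


def run(program, x):
    """Apply each move in the program to x, in order."""
    for move in program:
        x = MOVES[move](x)
    return x


def wins(program) -> bool:
    """A program wins if it passes all the tests."""
    return all(run(program, x) == expected for x, expected in TESTS)


def search(max_length: int = 3):
    """Try every sequence of moves, shortest first, until one wins."""
    for length in range(1, max_length + 1):
        for program in product(MOVES, repeat=length):
            if wins(program):
                return list(program)
    return None


solution = search()
print(solution)
```

The search tries 3 one-move programs, 9 two-move programs, and then finds `double`, `increment`, `square` among the three-move ones. Exhaustive search like this stops scaling almost immediately, which is why a real system would need a model to judge which moves are worth trying.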
One key difference between Cosine and Poolside is that Cosine uses a customized version of OpenAI’s GPT-4o, which lets it train on a larger data set than the base model can handle, while Poolside is building its own large language model.
Poolside’s Kant believes that training a new model on code will yield better results than adapting a model that has absorbed not only billions and billions of lines of code, but also most of the internet. “I’m perfectly okay with our model forgetting butterfly anatomy,” he said.
Cosine says that its generative coding assistant, Genie, is the top performer on SWE-Bench, a standard set of tests for coding models. Poolside says it is still building its model, but claims that what it has so far already matches the performance of GitHub’s Copilot.
“I personally believe that large language models will take us to the same level of capability as a software developer,” says Kant.
This view is not shared by everyone.
Illogical Language Models
According to Justin Gottschlich, the CEO and founder of Merly, large language models are not the right tool for the job. He uses his dog as an example: “No training will ever make my dog able to code. It just won’t be possible,” he says. “He’s capable of all sorts of other things, but he is just incapable of this deep level of cognition.” Programming requires a high level of precision in solving logical puzzles. No matter how well large language models learn to mimic human programmers, he says, at their core they are still statistical slot machines. “I cannot train an illogical machine to become logical.”
Merly’s system is not shown any human-written code at all. Gottschlich says that to build a model capable of generating code, you must work at the level of the underlying logic that code represents, not the code itself. Merly’s system is therefore trained on an intermediate representation, something like the machine-readable notation that most programming languages get translated into before they are run.
Gottschlich will not say exactly how this works or what it looks like. Instead he offers an analogy: there is a notion in mathematics that the only numbers that have to exist are the primes, because every other number can be calculated from them. “Take that concept, and apply it to the code,” he says.
This approach not only gets straight to the logic of programming, it is also fast, because millions of lines of code are reduced to a few thousand lines of intermediate language before the system analyzes them.
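Merly has not disclosed its representation, so the sketch below does not show it. It only illustrates what an intermediate representation is, using a familiar stand-in: the bytecode that Python itself translates source code into before running it. The two example functions are invented. They differ in names, spacing, and comments, but not in logic.

```python
import dis

# Two hypothetical snippets with the same logic but different surface text.
VERSION_A = "def add(a, b):\n    return a + b\n"
VERSION_B = (
    "def plus(first,   second):   # same logic, different surface\n"
    "\n"
    "    # add the two numbers\n"
    "    return first + second\n"
)


def opcode_names(source: str, func_name: str) -> list[str]:
    """Compile the source and list the bytecode operations of one function."""
    namespace = {}
    exec(compile(source, "<example>", "exec"), namespace)
    return [ins.opname for ins in dis.get_instructions(namespace[func_name])]


ops_a = opcode_names(VERSION_A, "add")
ops_b = opcode_names(VERSION_B, "plus")
print(ops_a)
print("same operations:", ops_a == ops_b)
```

The exact operation names depend on the Python version, but both snippets compile to the same sequence. The comments, spacing, and naming choices that fill human-written code are gone, and what remains is closer to the logic itself.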
Shifting Mindsets
Your opinion of these rival approaches will depend on how you envision generative coding assistants.
In November, Cosine banned its engineers from using any tools other than its own. Genie is having a significant impact on those engineers, who now often find themselves watching as the tool generates code for them. Yang Li, a Cosine cofounder, says that you can now tell the model what outcome you want, and the tool will take care of the implementation.
Pullen admits it can be confusing and requires a change of mindset. “We have engineers flitting from window to window, doing multiple tasks simultaneously,” he says. “While Genie runs code in one window, they may be instructing it to perform something else in another.”
Imagine you are developing software and need a payment system. Instead of having to code each option one by one, you can have a coding assistant test out several options simultaneously, such as Stripe, Mango, and Checkout.
Genie can fix bugs 24 hours a day. Most software teams have bug-reporting tools where people can upload descriptions of the errors they have encountered. Genie can read these descriptions and suggest fixes. A human then only needs to review the suggested fixes before updating the code.
Li says that no human can understand the trillions of lines of code that make up today’s largest software systems. “And as more software is written by other software, the amount of code will only increase.”
Coding assistants that maintain this code for us will be essential. Li says that the bottleneck will be how quickly humans can review machine-generated code.
What do Cosine’s engineers think about this? According to Pullen, they are fine with it. “If I give you a difficult problem, you will still think about how to describe it to the model,” he says. “Instead of writing the code, you must write it in natural language. It’s not like you’re taking away the fun of engineering. The itch is still there.”
Some may adapt quicker than others. Cosine invites potential hires to spend some time coding with the team. Two months ago, it asked a candidate to build a widget that would let employees share cool software they were working on to social media.
The task was not easy: it required a working knowledge of many sections of Cosine’s millions of lines of code. But the candidate completed it in just a few hours. “A person who hadn’t seen our code base before showed up on Monday, and by Tuesday afternoon had shipped something,” says Li. “We thought he would take all week.” (They hired him.)
But there’s another angle too. Many companies will use this technology to reduce the number of programmers they hire. Li believes that we will soon see tiers of software engineers. At one end, there will be elite developers earning millions who can diagnose issues when the AI fails. At the other end, smaller teams of 10 to 20 people will do work that used to require hundreds of coders. It will be like the way ATMs transformed banking, says Li.
“Anything you want to achieve will be determined by computation and not headcount,” he says. “I believe it’s widely accepted that the days of adding a few thousand engineers to an organization are over.”
The warp drive
For Gottschlich, machines with better coding skills than humans will be essential. He believes that this is the only way to build the vast, complex software systems we will need in the future. He shares a vision with many Silicon Valley entrepreneurs: a future in which humans move to other planets. The only way to get there, he says, is to have AI create the software needed. “Merly’s real goal is getting us to Mars.”
Gottschlich would rather talk about “machine programming” than “coding assistants,” because he believes the latter term frames the issue in the wrong way. “I don’t believe that these systems should assist humans. I think humans should assist them,” he says. “They can move as fast as AI.”
“There’s a cartoon called The Flintstones in which they have cars that only move when their drivers use their feet,” Gottschlich says. “This is how I feel that most people do AI for software systems.”
He adds, “But what Merly is building is, in essence, spaceships.” He is not joking. “I don’t believe that spaceships should run on humans riding bicycles. Spaceships should be powered with a warp motor.”
If that sounds crazy, it is. But there’s an important point to make about what the people who are building this technology believe the end goal is.
Gottschlich’s galaxy-brained view is not unique. These companies are not only focused on creating products that developers want to use now, but they also have a much bigger payoff in mind. Cosine describes itself on its website as a “Human Reasoning Lab” and sees coding only as the first step towards a general-purpose model which can mimic human problem solving in a variety of domains.
Poolside also has similar goals. The company says that it is developing AGI. Kant says that code is a formalization of reasoning.
Wang invokes agents. Imagine a system that can create its own software on the fly to perform any task, he says. “If you reach a point where you can have your agent solve any computational problem you want using software, that is a demonstration of AGI.”
Here on Earth, such systems are likely to remain a pipe dream. And yet software engineering is changing much faster than many thought it would.
“We’re still not at the point where everything is done by machines, but we are definitely stepping out from the traditional role of a software engineer,” says Cosine’s Pullen. “We’re starting to see the sparks of this new workflow, what it means to be an engineer in the future.”