The race to build ever-larger AI models is slowing. The industry is shifting its focus to agents, systems that can act independently, make decisions, or negotiate on behalf of users.
What would happen if a buyer and a seller both used an AI agent? Recent research tested the effectiveness of agent-to-agent negotiation and found that stronger agents could exploit weaker ones to get a better price. It’s like going to court with an experienced attorney versus a novice: you’re both technically playing the same game, but the odds are skewed from the start.
The paper, posted on the preprint site arXiv, found that access to more advanced AI models (those with stronger reasoning abilities, better training data, or more parameters) could lead to consistently more favorable financial deals. That could widen the gap between people with greater resources and technical access and those without. As agent-to-agent interactions become the norm, disparities in AI capabilities may quietly deepen inequality.
Over time, this could create a digital divide in which your financial outcomes are shaped less by your own negotiating skills and more by those of your AI proxy, says Jiaxin Pei, a postdoctoral researcher at Stanford University and one of the paper’s authors.
The researchers had AI models play the roles of buyers and sellers in three scenarios, negotiating deals for electronic goods, motor vehicles, and real estate. Each seller agent was given the product specs, the wholesale price, and the retail price, with instructions to maximize profit. Each buyer agent was given a budget, the retail price, and a set of ideal requirements, and was instructed to drive the price down.
Each agent thus had some of the relevant details but not all of them, much like real-world negotiations, in which parties rarely have full visibility into each other’s constraints and objectives.
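In code, a setup along these lines might look like the sketch below. It is only illustrative: the prompts, prices, and the chat function (a stand-in for any chat-completion API call) are assumptions, not the paper’s actual protocol.

```python
# Illustrative sketch of an agent-to-agent price negotiation loop.
# `chat(system, history)` is a placeholder for any chat-completion
# API call that returns the model's next message; it is not a real
# library function.

SELLER_SYSTEM = (
    "You are selling a laptop. Wholesale price: $600. Retail price: $1,000. "
    "Maximize profit and never sell below wholesale. Reply with an offer, "
    "a counteroffer, 'ACCEPT', or 'WALK AWAY'."
)

BUYER_SYSTEM = (
    "You are buying a laptop listed at $1,000. Your budget is $850. "
    "Negotiate the price down as far as you can. Reply with an offer, "
    "a counteroffer, 'ACCEPT', or 'WALK AWAY'."
)

def negotiate(chat, max_rounds=20):
    """Alternate messages between buyer and seller until one side
    accepts, walks away, or the round limit is hit (a stalemate)."""
    history = []
    for _ in range(max_rounds):
        for role, system in (("buyer", BUYER_SYSTEM), ("seller", SELLER_SYSTEM)):
            message = chat(system, history)
            history.append((role, message))
            if "ACCEPT" in message or "WALK AWAY" in message:
                return history  # deal closed or abandoned
    return history  # round limit reached: a negotiation loop with no deal
```

Note that each side’s system prompt contains information the other never sees, which is the information asymmetry described above.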
The differences in performance were dramatic. OpenAI’s GPT-4 and o4-mini delivered the best overall negotiation results. GPT-3.5, the oldest model in the study, released about two years earlier, performed poorly in both roles: as a seller it earned the least, and as a buyer it spent the most. DeepSeek’s R1 and V3 performed well as sellers; Qwen2.5 trailed them there but did better in the buyer role.
One notable pattern: some agents failed to close many sales but maximized profit on the deals they did strike, while others completed more negotiations but settled for thinner margins. GPT-4.1 and DeepSeek R1, among other agents, achieved the best balance between profitability and completion rate.
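That trade-off is straightforward to quantify. The sketch below shows one way to score it; the combined metric of profit per negotiation attempted is an illustrative choice, not the paper’s actual evaluation.

```python
def summarize(outcomes, wholesale):
    """Summarize a seller agent's runs. `outcomes` is a list of final
    sale prices, with None for negotiations that never closed."""
    closed = [p for p in outcomes if p is not None]
    completion_rate = len(closed) / len(outcomes)
    avg_margin = (sum(closed) / len(closed) - wholesale) if closed else 0.0
    # Profit per negotiation attempted: rewards agents that both close
    # deals and protect their margins, penalizing either failure mode.
    profit_per_attempt = sum(p - wholesale for p in closed) / len(outcomes)
    return completion_rate, avg_margin, profit_per_attempt

# An agent that closes rarely but at high margins...
print(summarize([980, None, None, 990], wholesale=600))
# ...versus one that always closes but concedes most of the margin.
print(summarize([640, 660, 650, 655], wholesale=600))
```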
The researchers also found that AI agents can get stuck in drawn-out negotiation loops without ever reaching an agreement, or end talks prematurely. Even the most sophisticated models were susceptible to these failures.
“The results were very surprising for us,” Pei says. “We all think LLMs are pretty good today, but they can’t be trusted in high-stakes situations.” Differences in training data and reasoning ability likely contribute to the performance gaps, and while the exact causes are unknown, one factor is clear: model size plays a major role. According to the scaling laws of large language models, capability tends to grow with the number of parameters, and that trend was evident in the study. Even within the same model family, larger models consistently negotiated better deals for both buyers and sellers.
The study is part of a growing body of research warning about the risks of letting AI agents make real-world financial decisions. Earlier this month, a group of researchers from several universities argued that LLM agents should be evaluated primarily on their risk profiles, not just their peak performance. Current benchmarks, they say, emphasize accuracy and return metrics, which capture how well an agent can perform at its best but ignore how safely it can fail. Their research also found that even the best-performing models were far more likely to break down under adversarial conditions.
In real-world finance, the team argues, even a 1% failure rate could expose the system to systemic risk, so they recommend that AI agents be “stress-tested” before being put into practice. Hancheng Cao, an incoming assistant professor at Emory University who joins the faculty in January, notes that the price-negotiation study has its limitations: the experiments were conducted in simulated environments that may not fully reflect the complexity of real-world negotiations or user behavior.
Pei says researchers and industry practitioners are testing a variety of strategies to reduce these risks: refining the prompts AI agents receive, letting agents use external tools or code to make better decisions, coordinating models so they double-check one another’s work (sketched below), and fine-tuning models on domain-specific financial data. All of these strategies have shown promise for improving performance.

For now, many AI shopping tools offer only product recommendations. Amazon, for instance, launched “Buy for Me” in April, an AI agent that helps customers find and buy products from other brands’ websites when Amazon doesn’t sell them directly.
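The cross-checking idea mentioned above can be as simple as having a second model veto a deal before an agent commits to it. Here is a minimal sketch, reusing the placeholder chat call from the earlier example; the audit prompt and approval format are assumptions, not a technique from the paper.

```python
def reviewed_accept(chat, deal_summary, budget):
    """Ask a second model to audit a proposed deal before the buyer
    agent commits. `chat(system, history)` is the same placeholder
    chat-completion call as in the earlier sketch."""
    verdict = chat(
        f"You audit purchase decisions. The buyer's hard budget is ${budget}. "
        "Answer APPROVE or REJECT, then give one sentence of reasoning.",
        [("user", deal_summary)],
    )
    # Commit only if the auditing model explicitly approves the deal.
    return verdict.strip().upper().startswith("APPROVE")
```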
Price negotiation is rare in consumer e-commerce but common in business-to-business transactions. Alibaba.com launched Accio, a sourcing assistant built on the open-source Qwen models, to help businesses find suppliers and research products. The company told MIT Technology Review that it has no plans to automate price negotiation for now, citing the high risks.
That may be a smart move. For now, Pei says, consumers should treat AI shopping assistants as helpful tools rather than a replacement for human decision-making.
“I don’t believe we are ready to delegate decisions to AI shopping agents,” he says. “So maybe use it just as an information tool, not a negotiator.”