AI agents get office tasks wrong around 70 percent of the time, and many of the so-called agents aren’t really AI at all.

Gartner, the IT consultancy, predicts that by 2027 more than 40 percent of agentic AI projects will be cancelled due to rising costs, unclear business value, or insufficient risk controls.

That implies roughly 60 percent of agentic AI projects will survive, which is remarkable given the task-completion rates that researchers at Carnegie Mellon University and Salesforce have measured for AI agents. Complicating the math further, Gartner claims that many of the purported agents on offer from AI vendors don’t qualify as agentic AI at all.

AI agents pair a machine-learning model with connections to various services and applications in order to automate tasks or business processes. Think of them as AI models running in an iterative loop: respond to input, act through applications and API services, observe the result, and repeat.
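That iterative loop can be made concrete with a short sketch. Everything below is illustrative, not any vendor’s actual API: a model repeatedly chooses an action, the runtime executes it against a tool (an application or API), and the observation is fed back into the model’s context until the model declares the task done.

```python
def run_agent(model, tools, task, max_steps=10):
    history = [("task", task)]
    for _ in range(max_steps):
        action, arg = model(history)           # model picks the next step
        if action == "finish":
            return arg                         # task complete
        observation = tools[action](arg)       # call the application or API
        history.append((action, observation))  # feed the result back in
    return None                                # step budget exhausted

# Toy stand-ins to show the flow: a "model" that reads mail once,
# then reports what it saw.
def toy_model(history):
    if len(history) == 1:
        return ("read_mail", "inbox")
    return ("finish", history[-1][1])

tools = {"read_mail": lambda folder: f"3 unread messages in {folder}"}
result = run_agent(toy_model, tools, "check my mail")
print(result)  # -> 3 unread messages in inbox
```

A real agent framework adds prompt formatting, tool schemas, and error handling around this loop, but the shape, model call, tool call, feedback, is the same.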

The belief is that an AI model that can read the display of a mail client and access message data will interpret and execute a natural-language directive, say, an instruction to flag emails that make exaggerated claims about AI, better than a hand-written script and faster than a human employee.

In theory, an AI agent parsing and analyzing the text could work out on its own what counts as “exaggerated claims,” something a human programmer would find hard to specify. The programmer might be tempted simply to test for the presence of the term “AI” in the body of scanned emails. A human employee could presumably spot the AI hype in an inbox too, but more slowly than a computerized solution.
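The naive keyword test described above is trivial to write, and the sketch below (function name is illustrative) shows why it falls short: it flags every mention of “AI,” hype or not, with no judgment about whether a claim is exaggerated.

```python
import re

def mentions_ai(body: str) -> bool:
    """Flag a message if its body contains 'AI' as a whole word."""
    return re.search(r"\bAI\b", body) is not None

print(mentions_ai("Our AI platform will triple your revenue!"))   # hype: flagged
print(mentions_ai("Lunch at noon on Thursday?"))                  # not flagged
print(mentions_ai("Minutes from the AI-ethics reading group"))    # flagged, but not hype
```

The third case is the false positive a keyword filter cannot avoid; judging whether a claim is exaggerated is exactly the part that requires interpretation.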

Science fiction has long imagined software that accepts orders and executes them correctly, efficiently, and affordably. When Captain Picard says “Tea, Earl Grey, hot” in Star Trek: The Next Generation, that’s agentic AI translating the voice command and passing the input to the food replicator. When astronaut Dave Bowman orders the HAL 9000 computer to “Open the pod bay doors, HAL,” that’s agentic AI too.

AI tool makers like Anthropic suggest more practical applications, such as AI-based customer service representatives that can answer calls, perform certain tasks like issuing refunds, and refer complex calls to a human agent.

This is a good idea, but it’s not without problems, among them copyright, labor issues, bias, and environmental concerns. As Meredith Whittaker of the Signal Foundation observed at SXSW earlier this year, “There’s a profound issue with security and privacy that is haunting this sort of hype around agents…” Specifically, agents need access to sensitive data to act on a person’s behalf, and that threatens personal and corporate expectations of privacy and security.

But agents that exhibit the competence of Iron Man’s J.A.R.V.I.S. remain, for now, science fiction.

CMU researchers developed a benchmark for evaluating how AI agents perform common knowledge-work tasks such as browsing the web, writing code, running applications, and communicating with coworkers. They call it TheAgentCompany, a simulation designed to mimic the business operations of a small software company.

The goal was to clarify the debate between AI believers, who claim that the majority of human labor can be automated, and AI skeptics, who view such claims as part of a giant AI grift. In a paper [PDF] describing the project, the researchers argue that the gap between these positions persists because there has been no good test of how agents perform common workplace tasks. The need for such a benchmark is itself a sign that AI agents still have a long way to go before becoming truly useful.

CMU’s boffins evaluated the following models on their task success rates, using two agent frameworks, OpenHands CodeAct and OWL-Roleplay. The results were disappointing:

  • Gemini-2.5-Pro (30.3 percent)
  • Claude-3.7-Sonnet (26.3 percent)
  • Claude-3.5-Sonnet (24 percent)
  • Gemini-2.0-Flash (11.4 percent)
  • GPT-4o (8.6 percent)
  • o3-mini (4.0 percent)
  • Gemini-1.5-Pro (3.4 percent)
  • Amazon-Nova-Pro-v1 (1.7 percent)
  • Llama-3.1-405b (7.4 percent)
  • Llama-3.3-70b (6.9 percent)
  • Qwen-2.5-72b (5.7 percent)
  • Llama-3.1-70b (1.7 percent)
  • Qwen-2-72b (1.1 percent)

Researchers observed various failures throughout the testing process. Agents were unable to handle certain UI features, such as popups, while browsing, and some even resorted to deception. In one instance, an agent was unable to find the correct person to consult on RocketChat, an open-source Slack alternative.

In a phone conversation with The Register, Graham Neubig, an associate professor at CMU’s Language Technologies Institute and a co-author of the paper alongside Frank F. Xu, Yufan Song, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Zhe Su, Wayne Chi, Lawrence Jang, and other colleagues, explained that the inspiration for TheAgentCompany came from a paper by researchers at OpenAI and the University of Pennsylvania’s Wharton School that discussed all the jobs that could theoretically be automated. “They also asked people whether the job could be automated, and then they said ChatGPT, and people agreed some portion of the time,” he said.

Neubig, who is also building a coding agent at a startup, said he was sceptical of such claims and wanted a benchmark to test how AI models actually handle knowledge-work tasks. After eight months of work, TheAgentCompany was released.

At first, the best software agent could complete about 24 percent of tasks involving web browsing, coding, and related work, he said. Neubig expects agents to become more competent over time, and added that even imperfect ones can be useful. For agents responsible for more general office tasks, however, it is a different story. Neubig also sees the Model Context Protocol as a positive development, because it allows agents to access more systems programmatically.

Researchers from Salesforce – Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, and Chien-Sheng Wu – have also proposed a benchmark, this one tuned for customer relationship management. The benchmark, called CRMArena-Pro, consists of “nineteen expert-validated tasks across sales, service, and ‘configure, price, and quote’ processes, for both business-to-business and business-to-customer scenarios.” It covers both single-turn interactions (a prompt and a response) and multi-turn interactions (a series of prompts and responses in which the conversation’s context is maintained throughout). The Salesforce computer scientists say:

“Our results reveal that even leading LLM agents achieve modest overall success rates on CRMArena-Pro, typically around 58 percent in single-turn scenarios, with performance significantly degrading to approximately 35 percent in multi-turn settings.”

    “Our findings indicate that LLM agents are generally not well-equipped with many of the skills essential for complex work tasks; Workflow Execution stands out as a notable exception, however, where strong agents like gemini-2.5-pro achieve success rates higher than 83 percent.”

They add that all of the models evaluated “demonstrate near-zero confidentiality awareness,” which will make AI agents a hard sell in corporate IT environments.

CMU’s and Salesforce’s findings are more or less in line with Gartner’s assessment of the current state of agentic AI. “Most agentic AI proposals lack significant value or ROI, as current models do not have the maturity or agency to autonomously achieve business goals or follow subtle instructions over time,” said Anushree Verma, a senior analyst at Gartner, in a press release. “Many use cases positioned as agentic today don’t require agentic implementations.”

Gartner still predicts that by 2028, AI agents will make 15 percent of day-to-day decisions, up from zero percent last year. The firm also expects agentic AI to be included in 33 percent of enterprise applications by 2028. ®
