My 8 ChatGPT tests produced only one near-perfect result, and a lot of alternative facts

ZDNET

Last week, OpenAI unveiled Agent, its new tool that combines the capabilities of Deep Research and Operator. Operator was OpenAI’s first attempt at a computer-using model, a model that actually can open windows and click on user interface elements. ChatGPT Agent can do that and more.

Right now, ChatGPT Agent is only available for $200/mo Pro tier subscribers and provides for 400 agent interactions per month. When the $20/mo Plus tier gains access to Agent, which should be today, those users will get 40 interactions per month.

Also: Is ChatGPT down? You’re not alone. Here’s what OpenAI is saying

(Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

I upgraded my plan from Plus to Pro just so I could test out the new Agent mode and report back to you. In this article, I’ll show you detailed results from eight comprehensive tests.

TL;DR test results

Before we go into the detailed tests, I’ll start with some overall TL;DR observations.

Test count: In the past two days, I used 25 of the available 400 queries, for a total of almost 12 hours of hyper-uber-supercomputer use. No wonder this thing costs $200/month.

Also: I found 5 AI content detectors that can correctly identify AI text 100% of the time

Nearly every query required a follow-on, so when it comes time for Plus users, don’t assume you can give Agent 40 projects. More likely, you’ll be giving it 20-25, and using the rest of your queries to convince the Agent to follow directions.
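To put rough numbers on that estimate, here is a minimal Python sketch. The function name and the `followup_percent` parameter are illustrative assumptions, not anything OpenAI publishes; the model simply assumes that a given share of projects needs exactly one extra follow-on query.

```python
def projects_per_month(allotment: int, followup_percent: int) -> int:
    """Estimate how many projects fit into a monthly query allotment.

    Assumption: each project costs one query, plus one follow-on query
    for `followup_percent` percent of projects. Integer math avoids
    floating-point rounding surprises.
    """
    return allotment * 100 // (100 + followup_percent)


# Plus tier: 40 interactions per month.
print(projects_per_month(40, 100))  # every project needs a follow-on
print(projects_per_month(40, 60))   # 60% of projects need a follow-on
```

Under these assumptions, a 60% to 100% follow-on rate lands in the 20-25 project range for a 40-query allotment.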

Screenshot by David Gewirtz/ZDNET

Result quality: In all my tests, Agent appeared to understand the problem. But it failed to produce useful results for most of the tests. That said, the final test produced results that can only be characterized as amazingly useful.

Project scale: Agent can’t handle big projects, the sort of data analysis projects you really want an AI to be able to handle. It has trouble scrolling through web pages. It can’t visit sites that have AI or robots.txt restrictions in place. And long processing exceeds session time allocations, even with the super top-of-the-line gold-pressed latinum Pro tier.

Presentation quality: One major selling point for Agent is the ability to create spreadsheets, presentations and other documents. The spreadsheets were okay, but the presentation graphics were pretty poor. This will change with time, but I don’t think Agent can produce presentations that you can use without significant cleanup.

Also: Microsoft is saving millions of dollars with AI, and laying off thousands. Where do we go next?

Accuracy: AI hallucinates. The OpenAI team itself warned about the new risks of using Agent. While I received some accurate results, Agent also made unforced errors, returning results that it could easily have tested and found to be inaccurate. That verification or validation never occurred. The final test was accurate, and showed what this tech is capable of when it works.

Agent comes with the capability to use connectors to link to Gmail (via API calls), Google Calendar, Google Drive and more. I didn’t test the connectors because of how often Agent acts irrationally or hallucinates. I didn’t trust Skynet enough to allow it access to my accounts. Not yet, at least.

Screenshot by David Gewirtz/ZDNET

Limits: I was unable to use Agent in the MacOS app. I also found that Agent stalled hard when I tried to run it in multiple Chrome tabs at once. For now, you launch an Agent process and wait. It’s not like Codex, where you can launch a bunch of projects and come back later and harvest all the results. But since that capability exists in Codex, I’m sure it will show up soon in Agent.

Screenshot by David Gewirtz/ZDNET

That should give you a pretty good overview. Let’s get started looking at the eight test results. For each result, I’ve included a link to the session recording, so you can see the prompts I used, the detailed results, and watch Agent reason its way through the problem.

Also, definitely read to the end. Some of the early results are fairly bad, but the last one knocks it out of the park. And with that, here we go.

1. Selecting products on Amazon

  • Understanding of the problem: Solid
  • Execution: Both good and bad
  • Hallucination: Weird church reference, fake Amazon links
  • Processing time: 20 + 12 minutes

When OpenAI introduced ChatGPT Agent, the team demoed how they used the tool to shop for wedding clothes and a wedding gift. That seemed like a fairly uncommon and impractical application for a super-intelligence, especially since gift registries exist and are widely used.

Instead, I gave Agent a purchasing project I had actually extensively researched and completed a few months earlier. I’m running Power-over-Ethernet cables all across my yard to upgrade my security system. As such, I’m creating a lot of custom cables. I already know that doing so requires some key tools: a cutter to slice the cable, a cable end stripper, a crimper to attach the RJ-45 ends, and a tester to confirm that long cable runs work.

Also: How a circuit breaker finder helped me map my home’s wiring (and why that matters)

I gave Agent a prompt asking for three configurations: a budget toolset, a “money-is-no-object” solution, and a sweet spot solution. I asked for links, product descriptions, and product images.

Once you give Agent your prompt, it creates a virtual desktop. You can watch it conducting its activities, jumping between a desktop view, a text view, and code.

Screenshot by David Gewirtz/ZDNET

The budget solution turned out to be a win. Agent found a single $34 kit that includes everything I requested. It provided a link and even explained why it chose this solution. Unfortunately, the picture it provided was not the same as the actual kit.

Screenshot by David Gewirtz/ZDNET

The mid-tier and top-tier solutions were less than perfect. None of the links worked. The mid-tier sweet spot solution did have a product-accurate image, but without a link, it wasn’t really helpful.

Screenshot by David Gewirtz/ZDNET

Unfortunately, the model recommended doesn’t actually exist on Amazon. In fact, none of the mid- or upper-tier products exist on Amazon. It looks like Agent did a pile of web surfing to find the products, disregarding my instructions to search only on Amazon.

Screenshot by David Gewirtz/ZDNET

It also clearly visited other sites, probably gathering model names and descriptions.

Screenshot by David Gewirtz/ZDNET

Then, when it packaged up its final recommendations, it just assigned random Amazon links to the description, even though those products and those links don’t seem to exist on Amazon.

Screenshot by David Gewirtz/ZDNET

I did request it go back and try again. When it did, after 12 minutes, it presented most of the same products, although one of the links that had failed earlier did, in fact, point to a product on Amazon in the second run.

Also: Coding with AI? My top 5 tips for vetting its output – and staying out of trouble

I can’t leave this section without pointing out something just plain weird. As I was watching Agent work, it presented this in its desktop view. I don’t even want to know.

Screenshot by David Gewirtz/ZDNET

You can watch a replay of the entire session here.

2. Comparing egg prices

  • Understanding of the problem: Solid
  • Execution: Did what I asked
  • Hallucination: My fault for imprecise prompting
  • Processing time: 14 minutes

In its presentation about ChatGPT Agent, OpenAI showed Agent shopping online. An Instacart test was a good choice for me because my family uses Instacart regularly. I let Agent loose to see what it could tell me about the egg prices in our local stores.

Agent didn’t have access to my account but I did share my ZIP code in Salem, Oregon. I told it “Please visit all the grocery stores on Instacart and compare egg prices.”

Also: How to use ChatGPT for writing code – and how to debug what it generates

And it did. You’ve heard the phrase “garbage in, garbage out.” This is what happens when an AI is asked to look at “all the grocery stores.” I should have asked it to search only within a 5- or 10-mile radius. But I didn’t.

Screenshot by David Gewirtz/ZDNET

Agent came back with 21 stores, ranging from nearby to up to almost 47 miles away. It did accomplish what I asked, comparing egg prices. Without prompting, it decided to rank the eggs by price. This was good. But when it chose the eggs to rank, it didn’t always choose the least expensive product from each store.

For example, it recommended the Good & Gather eggs from Target at $2.99 a dozen, rather than the $1.99/dozen Market Pantry eggs, also from Target.

Screenshot by David Gewirtz/ZDNET
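The comparison I was hoping for can be sketched in a few lines of Python: limit the stores to a radius, keep only the cheapest dozen at each store, then rank. This is a minimal illustration, not how Agent works internally. Only the two Target prices come from the test; the other stores, the distances, and the function name are hypothetical.

```python
# Each listing: (store, distance in miles, product, price per dozen).
# Only the two Target prices are from the test; the rest is made up.
LISTINGS = [
    ("Target", 3.0, "Good & Gather", 2.99),
    ("Target", 3.0, "Market Pantry", 1.99),
    ("Store B", 6.5, "House brand", 2.49),
    ("Store B", 6.5, "Organic", 4.79),
    ("Store C", 46.8, "House brand", 1.79),
]


def cheapest_per_store(listings, max_miles):
    """Return (store, product, price) tuples, cheapest first.

    Stores farther away than `max_miles` are skipped, and each
    remaining store contributes only its least expensive product.
    """
    best = {}
    for store, miles, product, price in listings:
        if miles > max_miles:
            continue  # outside the search radius
        if store not in best or price < best[store][1]:
            best[store] = (product, price)
    ranked = [(store, product, price) for store, (product, price) in best.items()]
    return sorted(ranked, key=lambda row: row[2])


for store, product, price in cheapest_per_store(LISTINGS, max_miles=10):
    print(f"{store}: {product} at ${price:.2f}/dozen")
```

With a 10-mile limit, the hypothetical store 46.8 miles away drops out, and Target is represented by its $1.99 Market Pantry eggs rather than the $2.99 Good & Gather ones.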

You can watch a replay of the entire session here.

3. Create a PowerPoint slide

  • Understanding of the problem: Solid
  • Execution: Placed the data point correctly
  • Hallucination: Could not reproduce the graphic quality
  • Processing time: 10 minutes

The next project is one that I completed early last week. My editor asked me to update an article I wrote about Bitcoin investments. In it, I track the value of a $100 Bitcoin investment from 2022.

My holdings increased in value, so I had to add a slide. Each slide plots dates on the X-axis and value on the Y-axis. Adding a new point meant moving the graphics over to make room for the new value and, in this instance, adjusting the vertical scale to accommodate the substantial increase in value.

Also: The best free AI classes

I took about 45 minutes to do it. OpenAI had said that ChatGPT Agent excelled at PowerPoint, so I wanted to find out if Agent would save me time in the future.

I uploaded my existing PowerPoint deck, minus the final slide I created for the article. Then I asked Agent to create the slide for me.

The desktop view displayed the terminal interface as it worked. You can see Agent putting together code to create a graphic image.

Screenshot by David Gewirtz/ZDNET

Here’s what that slide should have looked like (note: foreshadowing).

Screenshot by David Gewirtz/ZDNET

Here’s what Agent gave me.

Screenshot by David Gewirtz/ZDNET

To be fair, Agent clearly understood the problem. It moved the existing data points over to the left to make room for the new node. It also placed the new Bitcoin item properly in relation to the existing ones, and added both price and percentage change text blocks.

That means Agent read and understood the context of my PowerPoint deck’s layout. That, in and of itself, is very impressive.

Also: The best AI for coding in 2025 (and what not to use)

But it failed on adding more scale lines and new Y-axis values. It failed on reproducing the fonts. It failed on properly placing the text blocks. And it pushed the entire graphic up and to the left of the slide.

I’m guessing the graphics library that Agent uses isn’t really up to the task of making fine graphic changes. That will undoubtedly improve over time.
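The scale-line part of the job, at least, is simple arithmetic. Here is a minimal Python sketch of extending a Y-axis so a new, larger value fits. The function name and the sample numbers are hypothetical, chosen only for illustration; they are not the values from my chart, and this is not the code Agent generated.

```python
import math


def extend_axis(axis_max: int, step: int, new_value: float) -> list[int]:
    """Return gridline values from 0 up to a top line that covers new_value.

    The existing top line (`axis_max`) is kept if it is already high
    enough; otherwise the axis grows in whole multiples of `step`.
    """
    needed_top = math.ceil(new_value / step) * step
    top = max(axis_max, needed_top)
    return list(range(0, top + step, step))


# Hypothetical chart: gridlines every 50 up to 150, new data point at 310.
print(extend_axis(150, 50, 310))
```

Each added gridline also needs a matching Y-axis label, which is the step Agent skipped.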

You can watch a replay of the entire session here.

4. Categorizing articles

  • Understanding of the problem: Solid
  • Execution: Failed because it exceeded the session time limit
  • Hallucination: Gave back partial results
  • Processing time: 8 + 3 + 21 minutes

For the last two years, I have published a weekly newsletter that shares the articles I have published on ZDNET. Each newsletter includes a title, a link, and an article description. I pointed Agent at my back issue archive, where close to 300 article summaries would have to be categorized.

Unfortunately, Agent ran into several problems. It was unable to scroll through the list of articles using JavaScript. When I told it to use the web interface instead, it started doing so. But then it reported: “Unfortunately, I’ve reached the end of the allotted browsing sessions for this task, which means I’m unable to explore further pages and collect the additional data at this time.”

Also: Is ChatGPT Plus worth $20, considering the free version has so many premium features already?

I’m paying $200 per month for OpenAI’s best plan and it still won’t allow me to look up 300 articles. This is a real gotcha. It’s also disappointing, because a task such as scrolling through an article archive or doing some tabulating would be the kind of task that you might assign to an assistant. If the AI stops because it takes too much time, we can’t rely on AI to do all assistant-type tasks. No one wants an assistant who is fussy and picky.

Agent did, however, give me a spreadsheet and a slide based on what limited data it was able to find before my little request exceeded the hourly budget for power in the City of Las Vegas.

Screenshot by David Gewirtz/ZDNET

You can watch a replay of the entire session here.

5. Extract remembered text from a video

  • Understanding of the problem: Partial
  • Execution: Didn’t return a full transcription on the first run, corrected on the second run
  • Hallucination: Decided on its own what to return on the first run
  • Processing time: 2 minutes

I watch a lot of YouTube videos to enhance my learning and research. Plus, nothing beats a relaxing video about how pavers are made. It’s easy to get the transcript of a whole video, either directly from YouTube or by using Apple Voice Memos. However, finding the particular segment you want to explore can be time-consuming.

Here’s an illustration. When OpenAI introduced Agent in a video, CEO Sam Altman discussed cautions and warnings about using ChatGPT Agent mode. I knew they were near the very end of the video, but I didn’t want to spend time going through it to find the exact quotes.

I delegated this assignment to Agent instead. It found the segment with ease, but instead of returning a word-for-word transcript, it returned quotes interspersed with its own analysis.

Also: I mapped my iPhone Control Button to ChatGPT. Here are 5 ways I use this every day

After I clarified what it was I wanted, the second time it ran, it gave me what I needed. In this case though, the prompt wasn’t unclear. I had to repeat my request for a transcript a second time to get the AI to follow through.

Unfortunately, this extra review cycle reduced the time-saving benefit for me. I still think that using Agent was quicker than if I had sifted through the video myself. But I had a second prompt to write and a second result to wait for, which cost me time.

This is still a useful tool. You can watch a replay of the entire session here.

6. Research report and presentation

I spend many days learning about new topics.

ChatGPT Agent was able to prepare a full report and presentation on remote working trends. I told it that the PowerPoint was destined for my management team, so it should be comprehensive and professional-looking.

The analysis document returned was very similar to what we’ve seen from ChatGPT’s Deep Research. The report contains many assertions and statistics, which I did not have time to confirm.

Also: ChatGPT is able to record, transcribe and analyze your meetings

The majority of the conclusions are in line with my understanding of work-from-home trends. But we’re all aware of the model’s tendency to hallucinate, so I would be very worried about using this data professionally without further vetting.

Agent produced a 17-slide PowerPoint presentation that was well organized. The graphic quality was not as good as in previous experiments. The first slide looks good.

Screenshot by David Gewirtz/ZDNET

But later in the deck, it doesn’t look right. Notice how the following slide has graphics on top of text, and bullets in front of bullets on top of empty bullets.

Screenshot by David Gewirtz/ZDNET

In the following slide, not only is the text running off the end of the page, but there’s no legend. As such, it’s not clear what’s represented by red and by blue.

Screenshot by David Gewirtz/ZDNET

Once again, you can see how Python is used to construct the deck.

Screenshot by David Gewirtz/ZDNET

Agent does a fair job, so I’m fairly confident that the AI will get better over time. Programmatic construction of slides based on templates is not a new technology. I just don’t think OpenAI prioritized slide presentation aesthetics as part of this release.

You can watch a replay of the entire session here.

7. The accuracy of a presentation

  • Understanding of the problem: Solid
  • Execution: Excellent
  • Hallucination: Seems complete, but it’s still from an AI

This was just plain fun. I decided to give Agent the presentation created in the previous test and ask it to validate the claims.

Agent concluded, “Several quantitative claims—especially those concerning productivity/innovation impacts, the size and growth of the gig economy, rates of side‑gig participation, and the influence of politics and culture—could not be verified with accessible evidence during this review.”

Agent provided a detailed evaluation of each assertion. I’ve summarized the results below. Compare this to how GPT-4o analyzed the same material: when it was given the same PowerPoint presentation, GPT-4o considered all of the claims confirmed. You can view the details of the GPT-4o results here.

Although I used an AI to validate an AI, I wouldn’t feel comfortable using any of these presumed facts without personal, Mark 1 Eyeball confirmation. It was still a fun experiment, and it was fascinating to see the differences between ChatGPT Agent and GPT-4o.

You can watch a replay of the entire session here.

8. Analyzing building code for a fence installation

  • Execution: Near perfect
  • Hallucination: None. It got everything right but one graphic
  • Processing time: 4 minutes

When we lived in Palm Bay, Florida, we had a corner lot. The house came with a fence that we had to replace and, since we wanted privacy, we wanted to know how much fence was legal.

I spent many hours with the planning office over the course of two years to understand what I could and could not do with fencing, and to see what other options I might have.

Because I had a lot of experience with this project and was still very familiar with Palm Bay code (even years after moving away), I decided it would make a good test for ChatGPT Agent.

In just four minutes, Agent returned an accurate and detailed analysis. It created working diagrams to illustrate the options. Based on my experience, the results are accurate.

Screenshot by David Gewirtz/ZDNET

ChatGPT Agent produced output that could be used to take this project to the next step. Back when I lived in Palm Bay, the equivalent probably took me 20 calls, a ton of emails, and a few visits to City Hall to come up with options. The level of presentation and organization I came up with wasn’t even close.

If Agent can up its game elsewhere to be on a par with this test, then it will have some legs.

You can watch a replay of the entire session here.

What does it all mean?

It’s not sentient yet. It’s at best like the administrative assistant you hired just because your mother said you had to employ her cousin’s unemployed slacker child. There are flashes of promise, but the output is mostly the result of aggressively not following directions and inventing alternative facts.

Is the Pro plan worth $200/month? Not for Agent. Not yet. Agent is unreliable, and performs poorly in general. I’m certain that it will improve within a year. But now? No. The only reason I spend $200 a month on it is to test it and see where the technology stands at present.

Keep an eye on this, because despite the inaccuracies and other problems, it shows where AI technology can go. If a web-browsing AI agent is the way of the future and all content sites block it because AI steals our content, we’ll be in for a very interesting situation.

Also: I’m an AI tools expert, and these are the two I pay for (plus three others I’m considering)

It is still early days. It remains to be determined whether this technology will be a boon for all humanity, or a technology that destroys the internet and kills people in their sleep.

But, hey, while you wait, the rest of the ZDNET team and I will try to make sense of everything for you. Keep coming back. We’ll tell you more. I’ll keep playing with Agent, and I’m certain I’ll have more to say.

Do you have any experience with ChatGPT Agent? If so, was it able to follow your instructions, or did it go with its own interpretation? Did it hit the mark or hallucinate? What do you think about allowing AI tools to access your files, accounts or browser? Do you see the value of this type of automation, or are you waiting for it to become useful? Comment below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.
