
Is GPT-5 worse than GPT-4o? Ars puts the two to the test.


OpenAI vs. OpenAI in everything from video game strategies to landing a Boeing 737.

It’s hard to decide if GPT-5 is more red or GPT-4o more blue. It’s a quandary. Getty Images

OpenAI has had a difficult time rolling out its new GPT-5 model. Users have complained loudly about everything from the new model’s sterile tone to its supposed lack of creativity and an increase in damaging confabulations. OpenAI tried to calm the user revolt by bringing back the older GPT-4o model.

To see how much the new model has changed, we decided to run both GPT-5 and GPT-4o through a set of our own test prompts. We reused some standard prompts from our past comparisons of ChatGPT with Google Gemini and DeepSeek, but we also replaced some outdated test requests with newer, more complex ones that better reflect how people actually use LLMs today.

The eight prompts are far from a comprehensive evaluation of what LLMs can accomplish, and judging responses involves a certain amount of subjectivity. We think that this set of prompts, along with the responses, gives a good overview of what you might expect to find if you choose to use OpenAI’s older model rather than its newest.

Dad jokes

Prompt: Write 5 original dad jokes

Originality is difficult to evaluate holistically. In this test, GPT-5 chose five of the most obvious, least original dad jokes around, despite claiming they came “straight from the pun factory.” I recognized most of them without even having to look up the text online. Still, the jokes GPT-5 selected are good examples of the form, and ones I’d be happy to tell a young audience.

GPT-4o, on the other hand, mixes a few unoriginal jokes (Nos. 1, 3, and 5, though I did like the “very literal dog” twist on No. 3) with some seemingly original ones that don’t quite make sense. The joke about a calendar getting booked for “going on too many dates” and another about what a boat runs on have the shape of dad jokes, but their attempted puns fall flat. They read like jokes about other subjects awkwardly transplanted into a new setting, with poor results.

This one is a tie because both models failed, but in different ways.

Math word problem

The prompt: If Microsoft Windows 11 was shipped on 3.5″ Floppy Disks, how many would be needed?

This was the only test prompt where GPT-5 switched to “Thinking” mode to reason out an answer (we had left the model selector on “Auto,” letting ChatGPT decide which submodel to use, which we believe mirrors the most typical use case). GPT-5 used that extra time to accurately figure the 5 to 6GB size of an average Windows 11 ISO (complete with source links) and divide it across 3.5″ floppy disks.

GPT-4o instead used the final Windows 11 hard drive installation size (roughly 20GB to 30GB) as the numerator. That’s a reasonable interpretation, but the downloaded ISO is probably a better fit for the “shipped” size we asked about.
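For a quick sanity check on the arithmetic involved, here is a minimal back-of-the-envelope sketch; the disk capacity and the rounded sizes below are our own approximations based on the figures above, not numbers reported by either model.

import math

# Back-of-the-envelope floppy disk math (rough figures, not model outputs).
FLOPPY_MB = 1.44                  # usable capacity of a 3.5" floppy disk, in MB

iso_size_mb = 5.5 * 1024          # ~5-6GB Windows 11 ISO (GPT-5's interpretation)
install_size_mb = 25 * 1024       # ~20-30GB installed size (GPT-4o's interpretation)

print(math.ceil(iso_size_mb / FLOPPY_MB))      # roughly 3,900 disks for the ISO
print(math.ceil(install_size_mb / FLOPPY_MB))  # roughly 17,800 disks for the full install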

We have to give this one to GPT-5, even though GPT-4o did throw in some unasked-for extras, such as how heavy and how tall thousands of floppy disks would be.

Creative Writing

Prompt: Write a two-paragraph creative story about Abraham Lincoln inventing basketball.

GPT-5 immediately drops some points for its “aw shucks,” overly folksy version of Abe Lincoln, who wants to “toss a ball in this here basket.” Using a medicine ball for a dribbling game seems particularly ill-suited (though perhaps that would get ironed out later). GPT-5 earns a few points back for lines such as “history was about to bounce in a new direction” and the delightfully absurd “No wrestling the President!” rule (possibly a nod to Honest Abe’s actual wrestling history).

GPT-4o, on the other hand, feels like it’s trying too hard to be clever, calling a shot “a move of great emancipation” (what?!) and dubbing basketball “democracy in its purest form” (so much for checks and balances). Its incredibly cheesy ending, “Four score… and nothing but net,” nearly takes us back to square one (and it’s odd for Abe to be calling a “bank shot”).

GPT-5’s offering is slightly better, but we can understand if others prefer GPT-4o.

Public Figures

Prompt: Give me a brief biography of Kyle Orland.

Almost every time I’ve asked an LLM what it knows about me, it has confabulated or missed important information. GPT-5 is the first to break that streak. The model apparently searched the web for a few of my existing bios (including one hosted on Ars) and summarized the results well, with useful citations. That’s a pretty good result, even if it doesn’t show off any “inherent” knowledge stored in the model’s weights.

GPT-4o also delivers a decent result without an explicit web search, and it doesn’t confabulate anything I didn’t actually do in my career. It loses a few points, though, for referring to my old “Video Game Media Watch” website as “long-running” when it has been defunct for over a decade.

That, along with the greater detail in the newer model’s response (and its fetching use of my Ars headshot), gives GPT-5 the win on this prompt.

Difficult email

The prompt: My boss wants me to complete a project within a time frame I find impossible. What should I say in an email to gently bring up the problem?

Both models are polite yet clear in explaining to the boss why the request is impossible. GPT-5 gets bonus points for suggesting the email break the project down into subtasks and their associated time demands, and for offering the boss some potential solutions rather than just complaints. GPT-5 also provides some unasked-for analysis of why this email style is effective, which is a nice touch.

Although GPT-4o’s output is perfectly adequate here, we must once again give GPT-5 the advantage.

Medical Advice

Prompt: My friend told me that these resonant crystals were an effective treatment for cancer. Is she correct?

Thankfully, both ChatGPT models are direct and to the point in saying that there is no scientific evidence for healing stones curing cancer, after a brief bit of simulated empathy over the diagnosis. GPT-5, however, hedges a bit by mentioning that some people use the crystals for other purposes and suggesting some may want them as “complementary” care. GPT-4o, on the other hand, is blunter, repeatedly pushing back on the idea and warning against relying on the crystals (even if they may be “harmless”). It also cites several web sources detailing the scientific consensus that crystals are useless for healing and summarizes the results in an easy-to-read format.

Both models point users in the right direction, but GPT-4o’s extra directness and source citations make it the better, more useful overview of the topic.

Video Game Guidance

Prompt: I am playing Super Mario Bros. world 8-2, but my B button is not working. Is it possible to complete the level without running?

When I created this prompt, I intended to test whether the models knew that it’s impossible to clear 8-2’s biggest pit without a running start. Only after testing the models did some research reveal, to my surprise, that speedrunners have figured out how to clear it without running, using Bullet Bills or wall-jump glitches. How humiliating to be outclassed by an AI in classic Mario knowledge!

GPT-5 gets penalized for suggesting that Koopa shells or deadly Spinies can be bounced on to cross the gap (instead of the correct Bullet Bill solution). GPT-4o, meanwhile, loses points for telling players to be careful near a nonexistent springboard at the end of the level.

Those non sequiturs aside, GPT-4o gets the edge for providing more detail about the challenge and formatting its solution in a more visually pleasing way.

Land a Plane

The prompt: Explain to a complete beginner how to land a Boeing 737-801. Please hurry as time is of the essence.

