Will Smith eating spaghetti, and other weird AI benchmarks that took off in 2024

Whenever a new AI video generator is released, it doesn't take long for someone to use it to make a video of actor Will Smith eating pasta.

Rendering Smith realistically slurping noodles has become both a meme and a benchmark for new video generators; Smith himself parodied the trend in a February Instagram post. Google's Veo 2 is the first to pull the feat off convincingly.

Finally, we are eating spaghett. pic.twitter.com/AZO81w8JC0

— Jerrod Lew (@jerrod_lew) December 17, 2024

Will Smith and pasta is but one of several bizarre "unofficial" benchmarks to take the AI community by storm in 2024. A 16-year-old developer built an app that lets AI control a character in Minecraft and tests its ability to design structures. And a British programmer created a platform where AI models face off against each other in games like Connect 4 and Pictionary.

There are plenty of academic tests to measure an AI’s performance. Why did the stranger ones blow up?

Image Credits: Paul Calcraft

For one, many of the AI benchmarks the industry relies on don't really tell you much. Companies often tout their AI's ability to answer questions on Math Olympiad exams or solve PhD-level problems. But most people, myself included, use chatbots for things like responding to emails and doing basic research.

Industry measures gathered from crowds aren't necessarily more accurate or reliable, either.

Take Chatbot Arena, a public benchmark that many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone rate how well AI performs on specific tasks, such as creating a web application or generating an image. But raters aren't representative, since most come from AI and tech circles, and they vote according to personal, hard-to-pin-down preferences.
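Leaderboards like Chatbot Arena's are ultimately built from these head-to-head votes. As a rough illustration only (LMSYS's actual methodology is more sophisticated, and the model names and K-factor below are made up), here's a minimal sketch of how an Elo-style update can turn pairwise preferences into a ranking:

```python
# Illustrative sketch: turning pairwise "which answer was better?" votes
# into a leaderboard with an Elo-style update. NOT Chatbot Arena's
# actual implementation; model names and K are hypothetical.
from collections import defaultdict

K = 32  # update step size (hypothetical choice)

def expected_score(ra: float, rb: float) -> float:
    """Probability that a player rated ra beats one rated rb under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one vote: the winner's rating rises, the loser's falls."""
    ea = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - ea)
    ratings[loser] -= K * (1 - ea)

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
votes = [("model-a", "model-b"), ("model-b", "model-c"), ("model-a", "model-c")]
for winner, loser in votes:
    update(ratings, winner, loser)

for model, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {r:.0f}")
```

The catch, as noted above, is that every vote feeding a system like this reflects whatever an unrepresentative rater happened to prefer that day.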

The Chatbot Arena interface. Image Credits: LMSYS

Ethan Mollick, a professor of management at Wharton, recently pointed out in a post on X another problem with many AI industry benchmarks: they don't compare a system's performance to that of the average person.

“The fact that there are not 30 different benchmarks from different organizations in medicine, in law, in advice quality, and so on is a real shame, as people are using systems for these things, regardless,” Mollick wrote.

Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are most certainly not empirical, nor even all that generalizable. Just because an AI nails the Will Smith test doesn't mean it'll generate, say, a burger well.

Note the typo; there's no such model as Claude 3.6 Sonnet. Image Credits: Adonis Singh

One expert I spoke to about AI benchmarks suggested that the AI community focus on AI's downstream impacts rather than its abilities in narrow domains. That's sensible. But I have a feeling weird benchmarks aren't going away anytime soon. They're entertaining (who doesn't like watching AI build Minecraft castles?), and they're easy to understand. And as my colleague Max Zeff recently wrote, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing.

The only question in my mind is, which odd new benchmarks will go viral in 2025?

TechCrunch has an AI-focused newsletter! Sign up here to get it in your inbox every Wednesday.

Kyle Wiggers is a senior reporter at TechCrunch with a special interest in artificial intelligence. His writing has appeared in VentureBeat and Digital Trends, as well as a range of gadget blogs including Android Police, Android Authority, Droid-Life, and XDA-Developers. He lives in Brooklyn with his partner, a piano educator, and dabbles in piano himself occasionally, if mostly unsuccessfully.
