Will Smith eating spaghetti, and other weird AI benchmarks that took off in 2024

Whenever a new AI video generator is released, it doesn't take long for someone to use it to make a video of actor Will Smith eating pasta.

Rendering Smith realistically slurping noodles has become both a meme and a benchmark for new video generators; Smith himself parodied the trend in a February Instagram post. Google's Veo 2 is the first to pull the feat off convincingly.

Finally, we are eating spaghett. pic.twitter.com/AZO81w8JC0

— Jerrod Lew (@jerrod_lew) December 17, 2024

Will Smith and pasta is but one of several bizarre "unofficial" benchmarks to take the AI community by storm in 2024. A 16-year-old developer built an app that lets AI control a character in Minecraft and tests its ability to design structures. And a British programmer created a platform where AI models face off against each other in games like Connect 4 and Pictionary.

There are plenty of academic tests to measure an AI’s performance. Why did the stranger ones blow up?

Image Credits: Paul Calcraft

For one, many of the AI benchmarks the industry relies on don't really tell you much. Companies often tout their AI's ability to answer questions on Math Olympiad exams or solve PhD-level problems. But most people, myself included, use chatbots for things like responding to emails and doing basic research.

Industry measures gathered from crowds aren't necessarily more accurate or reliable, either.

Take Chatbot Arena, a public benchmark that many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone rate how well AI performs on specific tasks, such as creating a web application or generating an image. But raters aren't representative, since most come from AI and tech circles, and they vote according to personal, hard-to-pin-down preferences.
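Leaderboards like Chatbot Arena's are ultimately built from these head-to-head votes. As a rough illustration only (LMSYS's actual methodology is more sophisticated, and the model names and K-factor below are made up), here's a minimal sketch of how an Elo-style update can turn pairwise preferences into a ranking:

```python
# Illustrative sketch: turning pairwise "which answer was better?" votes
# into a leaderboard with an Elo-style update. NOT Chatbot Arena's
# actual implementation; model names and K are hypothetical.
from collections import defaultdict

K = 32  # update step size (hypothetical choice)

def expected_score(ra: float, rb: float) -> float:
    """Probability that a player rated ra beats one rated rb under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one vote: the winner's rating rises, the loser's falls."""
    ea = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - ea)
    ratings[loser] -= K * (1 - ea)

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
votes = [("model-a", "model-b"), ("model-b", "model-c"), ("model-a", "model-c")]
for winner, loser in votes:
    update(ratings, winner, loser)

for model, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {r:.0f}")
```

The catch, as noted above, is that every vote feeding a system like this reflects whatever an unrepresentative rater happened to prefer that day.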

The Chatbot Arena interface. Image Credits: LMSYS

Ethan Mollick, a professor of management at Wharton, recently pointed out in a post on X another problem with many AI industry benchmarks: they don't compare a system's performance to that of the average person.

“The fact that there are not 30 different benchmarks from different organizations in medicine, in law, in advice quality, and so on is a real shame, as people are using systems for these things, regardless,” Mollick wrote.

Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are most certainly not empirical, nor even all that generalizable. Just because an AI nails the Will Smith test doesn't mean it'll generate, say, a burger well.

Note the typo; there's no such model as Claude 3.6 Sonnet. Image Credits: Adonis Singh

One expert I spoke to about AI benchmarks suggested that the AI community focus on AI's downstream impacts rather than its abilities in narrow domains. That's sensible. But I have a feeling weird benchmarks aren't going away anytime soon. They're entertaining (who doesn't like watching AI build Minecraft castles?), and they're easy to understand. And as my colleague Max Zeff recently wrote, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing.

The only question in my mind is, which odd new benchmarks will go viral in 2025?

TechCrunch has an AI-focused newsletter! Sign up here to get it in your inbox every Wednesday.

Kyle Wiggers is a senior reporter at TechCrunch with a special interest in artificial intelligence. His writing has appeared in VentureBeat and Digital Trends, as well as a range of gadget blogs including Android Police, Android Authority, Droid-Life, and XDA-Developers. He lives in Brooklyn with his partner, a piano educator, and dabbles in piano himself occasionally, if mostly unsuccessfully.
