AI video has just taken a huge leap in realism. Are we doomed?

Tales of the cultural singularity.

Google Veo 3 produces AI videos with realistic people and music. We put it to the test.

Image from an AI-generated Veo 3 video of “A 1980s fitness video with models in leotards wearing werewolf masks.” Credit: Google

Google launched a new product last week: Veo 3, the company’s newest video generation tool. It can create eight-second clips complete with sound effects and audio dialogue, a first among mainstream AI video tools. The model, which generates videos in 720p resolution from text descriptions (called “prompts”) or still image inputs, may be the most capable consumer video generator to date, bringing video synthesis to the point where it is difficult to distinguish AI-generated media from the “authentic” kind.

Google also launched Flow, an online AI filmmaking tool that combines Veo 3, the Imagen 4 image generator, and Gemini language models. It lets creators describe scenes in natural language and manage characters, locations, and visual styles through a web interface.

An AI-generated video from Veo 3: “ASMR scene of a woman whispering “Moonshark” into a microphone while shaking a tambourine”

Both tools are now available to US subscribers of Google AI Ultra, a plan that costs $250 a month and comes with 12,500 credits. Veo 3 videos cost 150 credits per generation, which allows 83 videos on that plan before you run out. Extra credits cost 1 cent each and are sold in blocks of $25, $50, or $200, which works out to about $1.50 per video generation. But is the price worth it? We ran some tests with various prompts to see what this technology is truly capable of.
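For those keeping score, a minimal back-of-the-envelope sketch in Python reproduces that math. The constants are simply the figures quoted above; nothing here touches any Google API.

```python
# Back-of-the-envelope Veo 3 cost math using the figures quoted in this article.
PLAN_CREDITS = 12_500           # credits included with Google AI Ultra
CREDITS_PER_VIDEO = 150         # credits consumed per Veo 3 generation
EXTRA_CREDIT_PRICE_USD = 0.01   # extra credits sell for 1 cent each

videos_included = PLAN_CREDITS // CREDITS_PER_VIDEO
cost_per_extra_video = CREDITS_PER_VIDEO * EXTRA_CREDIT_PRICE_USD

print(f"Videos included with the plan: {videos_included}")              # 83
print(f"Cost per video on extra credits: ${cost_per_extra_video:.2f}")  # $1.50
```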

How does Veo work?

Like other modern video generation models, Veo 3 is built on diffusion technology—the same approach that powers image generators like Stable Diffusion and Flux. The training process works by taking real videos and progressively adding noise to them until they become pure static, then teaching a neural network to reverse this process step by step. During generation, Veo 3 starts with random noise and a text prompt, then iteratively refines that noise into a coherent video that matches the description.
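To make that loop concrete, here is a heavily simplified, conceptual sketch of diffusion-style sampling. It is not Veo 3’s actual architecture (Google hasn’t published it); the `denoiser` network, the step count, and the update rule are illustrative stand-ins for a real scheduler such as DDPM or DDIM.

```python
import torch

def generate_video(denoiser, prompt_embedding, steps=50, frames=192, height=720, width=1280):
    """Conceptual diffusion sampling: start from pure noise, refine step by step.

    `denoiser` is a hypothetical neural network trained to predict the noise
    present in its input, conditioned on a text prompt embedding and a timestep.
    """
    # Begin with random static shaped like the target clip: (frames, RGB, H, W).
    video = torch.randn(frames, 3, height, width)

    for t in reversed(range(steps)):
        # Predict how much noise remains at this timestep, given the prompt.
        predicted_noise = denoiser(video, t, prompt_embedding)
        # Remove a fraction of that noise. A real scheduler uses a carefully
        # derived update; dividing by `steps` is just an illustrative stand-in.
        video = video - predicted_noise / steps

    return video  # Ideally, a coherent clip matching the text description
```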

AI-generated video from Veo 3: “An old professor in front of a class says, ‘Without a firm historical context, we are looking at the dawn of a new era of civilization: post-history.'”

DeepMind won’t say exactly where it sourced the content to train Veo 3, but YouTube is a strong possibility. Google owns YouTube, and DeepMind previously told TechCrunch that Google models such as Veo “may” be trained on some YouTube material.

Veo 3 is composed of several AI models working together: a large language model (LLM) that interprets user prompts and helps flesh out detailed video descriptions, a video diffusion model that creates the frames, and an audio generation model that adds sound to the result.
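In rough outline, that multi-model pipeline looks something like the sketch below. This is purely illustrative: the stage names, objects, and interfaces are our assumptions for explanation, not Google’s actual code.

```python
from dataclasses import dataclass

@dataclass
class GeneratedClip:
    frames: list   # rendered video frames
    audio: bytes   # synthesized soundtrack (dialogue, music, effects)

def veo_style_pipeline(prompt: str, llm, video_diffuser, audio_model) -> GeneratedClip:
    """Illustrative three-stage pipeline mirroring the description above."""
    # 1. A large language model interprets the prompt and expands it into a
    #    detailed scene description (characters, shots, dialogue).
    scene_plan = llm.expand(prompt)

    # 2. A video diffusion model renders the frames from that scene plan.
    frames = video_diffuser.generate(scene_plan)

    # 3. An audio generation model produces sound synchronized to the frames.
    audio = audio_model.generate(scene_plan, frames)

    return GeneratedClip(frames=frames, audio=audio)
```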

An AI-generated video from Veo 3: “A male stand-up comic on stage in a night club telling a hilarious joke about AI and crypto with a silly punchline.” An AI language model built into Veo 3 wrote the joke.

In an attempt to prevent misuse, DeepMind says it’s using its proprietary watermarking technology, SynthID, to embed invisible markers into the frames Veo 3 generates. These watermarks persist even when videos are compressed or edited, helping people potentially identify AI-generated content. As we’ll discuss more later, though, this may not be enough to prevent deception.

Google also censors certain prompts and outputs that breach the company’s content policies. During testing, we encountered “generation failure” messages for prompts involving romantic or sexual material, some types of violence, mentions of certain trademarked or copyrighted media properties, some company names, certain celebrities, and some historical events.

Putting Veo 3 to the test

Perhaps the biggest change with Veo 3 is integrated audio generation, although Meta previewed a similar audio-generation capability with “Movie Gen” last October, and AI researchers have experimented with using AI to add soundtracks to silent videos for some time. Google DeepMind itself showed off an AI soundtrack-generating model in June 2024.

An AI-generated video from Veo 3: “A middle-aged balding man rapping indie core about Atari, IBM, TRS-80, Commodore, VIC-20, Atari 800, NES, VCS, Tandy 100, Coleco, Timex-Sinclair, Texas Instruments”

Veo 3 can generate everything from traffic sounds to music and character dialogue, though our early testing reveals occasional glitches. Spaghetti makes crunching sounds when eaten (as we covered last week, with a nod to the famous Will Smith AI spaghetti video), and in scenes with multiple people, dialogue sometimes comes from the wrong character’s mouth. But overall, Veo 3 feels like a step change in video synthesis quality and coherency over models from OpenAI, Runway, Minimax, Pika, Meta, Kling, and Hunyuan Video.

The videos tend to show garbled subtitles that almost match the spoken words. This is likely an artifact of subtitles on videos in the training data: the AI model is mimicking what it has “seen” before.

An AI-generated video from Veo 3: “A beer commercial for ‘CATNIP’ beer featuring a real cat in a pickup truck driving down a dusty dirt road in a trucker hat drinking a can of beer while country music plays in the background, a man sings a jingle ‘Catnip beeeeeeeeeeeeeeeeer’ holding the note for 6 seconds”

We generated each of the eight-second-long 720p videos seen below using Google’s Flow platform. Each video took between three and five minutes to create, and we paid for the generations ourselves. It’s important to note that better results come from cherry-picking: running the same prompt multiple times until you find a good result. Due to the cost, we only ran each prompt once.

Audio prompts

Let’s dive into the deep end of audio generation to get an idea of what this technology is capable of. In our last Veo 3 video, we showed a man singing and rapping about spaghetti. Now let’s try some more complex dialogue.

We’ve been testing AI image generators such as Midjourney since 2022 using the prompt “a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting”. It’s time for that barbarian to come to life.

A muscular man holding an axe standing next to a television with a CRT. He looks at the television, then turns to the camera, and says: “You’ve been looking for this for years: a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting. Got that, Benj?”

This video represents significant progress in AI media synthesis over only three years. We’ve gone from a blurry, colorful still image of a barbarian to a photorealistic guy who talks to us in 720p high definition with audio. There’s no reason to expect the technical capabilities of AI generation to slow down.

Horror movie: A woman in Victorian clothing running through a forest in dolly shot being chased by man in peanut costume screaming. “Wait! You forgot your wallet!”

Tim Burton’s The Haunted Basketball Train trailer: a 1990s basketball player is stuck in the end of a haunted train with basketball court car and must beat different ghosts in each car at basketball.

ASMR of a muscular Barbarian man whispering into a mic. “You love CRTs, don’t you? That’s OK. It’s OK to love CRT televisions and barbarians.”

A 1980s video of a woman. She says: “Oh my lord, look at that Atari 800 you have behind you! I can’t believe how nice it is!”

One can imagine a virtual world filled with AI personalities designed to flatter people. This is an innocent example involving a vintage computer, but you can extrapolate: the fake person could be made to speak about almost any topic. Google’s filters block some of this, but based on what we’ve seen so far, a future AI video generator with similar capabilities and looser restrictions seems likely.

Screenshot of a Zoom call. A psychologist in an office with a cozy, dark atmosphere. The therapist says, in a friendly tone, “Hi Tom, thanks for calling. Tell me about how you’re feeling today. Is the depression still getting to you? Let’s work on that.”

1960s NASA video of the first man to step onto the surface of the Moon. He squishes down into a pile of mud, yells and screams in a hillbilly accent, “What in tarnation??”

Local TV news interview with a muscular barbarian discussing why he always carries a CRT television around with him.

Footage of a news report on Russia’s invasion of the United States.

Attempts at making music

Veo 3’s AI audio generator can create music in a variety of genres, a capability that is still new among AI video generators. Here are some examples.

PBS show with a crazy barbarian painting pictures of Trees while singing “HAPPY BIG TREES” along to some music.

1950s cowboy riding up to the cameras and singing in country music. “I love mah biiig ooold donkeee”

1980s hair metal group driving up to camera and singing in rock music. “Help me with my huge huge huge hair!”

Mister Rogers Neighborhood PBS kids show

Veo 3 has much better temporal coherency than any of the earlier video synthesis models we tested. It’s not perfect.

Aerial view of a herd of 1 million cats running up a hillside

Video game footage from a 1990s third-person 3D game featuring an anthropomorphic Shark Boy

Video game footage of a dynamic 90s third-person 3D game starring a shark boy in a rubber shark costume

Some notable fails

As we noted in previous coverage, AI video generators such as Google’s Veo 3 are fundamentally imitative: they make predictions based on statistical patterns rather than a real understanding of how the world works.

If you see, for example, mouths moving correctly when characters speak, or clothes wrinkling in a particular way when touched, it means the neural networks doing the video generation have “seen” enough similar examples in the training dataset to render the scenario convincingly and apply it to other situations.

When a novel situation or combination of themes is not well-represented in training data, however, you will see “impossible” and illogical things happening, such as weird bodies, magically appearing clothes, or an item that “shatters” in the scene but remains there afterward.

In the introduction, we mentioned audio and video glitches. Scenes with multiple characters in particular can confuse the model about which character is speaking, as in this argument between tech enthusiasts.

A 2000s television debate between fans of Intel Pentium and PowerPC chips

A 1980s infomercial for Ars Technica’s online service, with cheesy music and user testimonials

1980s Rambo battling Soviets on the Moon.

Some requests don’t produce coherent results. In this case, “Rambo” may be on the Moon firing a gun, but he is not wearing a spacesuit. He’s tougher than we expected.

An animated infographic showing the number of floppy disks needed to install Windows 11

Veo 3 is also weak with large amounts of on-screen text, but when a short quote is specified explicitly in the prompt, it usually renders correctly.

A woman performing a complex floor routine at the Olympics that includes running and flips.

Veo 3 has made improvements in audio generation and temporal coherency, but it still suffers from the same “jabberwockies” we saw in OpenAI’s viral Sora gymnast video: implausible video hallucinations such as impossibly morphing body parts.

A group of men and woman cartwheeling along the road while singing “CHEEEESE” for 8 seconds and then falling over.

YouTube-style video of a person wearing various corncob costumes. They shout, “Corncob haul!!”

The glass man runs into a wall and breaks, screaming.

The man in the spacesuit holds up five fingers and counts down to zero before blasting off with rocket boots.

Veo 3 has difficulty counting down with fingers, likely because this action is not well represented in its training data. Hands are probably shown in only a few common positions, such as a fist, an open five-fingered palm, a one-finger “number one,” or a two-finger peace sign; a deliberate finger-by-finger countdown is far rarer.

As future models are trained on vastly bigger datasets with exponentially more computing power, these systems may form deeper statistical connections between the concepts they observe in video, dramatically improving output quality and the ability to generalize to novel prompts. So what can we say about the “cultural singularity”?

Some of you may be concerned that society is in trouble because of this technology’s potential for deception. There’s good reason to be concerned: the American pop culture diet relies heavily on clips shared by strangers on social media platforms such as TikTok, and now all of it can be easily faked. Automated generations could argue for ideologies at a scale designed to manipulate the masses.

AI-generated video by Veo 3: “A man on the street interview about someone who fears they live in a time where nothing can be believed”

Such videos could be (and were) faked through various means prior to Veo 3, but now the barrier to entry has collapsed from requiring specialized skills, expensive software, and hours of painstaking work to simply typing a prompt and waiting three minutes. What once required a team of VFX artists, or at least someone proficient in After Effects, can now be done by anyone with a credit card and an Internet connection.

But let’s take a moment to catch our breath. At Ars Technica, we’ve been warning about the deceptive potential of realistic AI-generated media since at least 2019. In 2022, we covered the AI image generator Stable Diffusion and the ability to train custom AI image models on photos of real people. We discussed Sora “collapsing media reality” and talked about persistent media skepticism during the “deep doubt era.”

AI-generated video with Veo 3: “A man on the street ranting about the ‘cultural singularity’ and the ‘cultural apocalypse’ due to AI”

I have also written in detail about the future ability of people to pollute the historical record with AI-generated sound. In that article, I used the term “cultural singularity” to describe a point where truth and fiction become impossible to distinguish in media, both because of the deceptiveness of AI-generated content and because of the vast quantities of AI-generated and AI-augmented media we’ll soon be inundated with.

In an article I wrote last year about using AI to clone my father’s handwriting, I came to the conclusion that my fears about the cultural singularity were overblown. Since ancient times, media has been susceptible to forgery; trust in any remote communication ultimately depends on its source.

AI-generated video with Veo 3: “A news set. There is an ‘Ars Technica News’ logo behind a man. The man has a beard and a suit and is doing a sit-down interview. He says, ‘This is the age of post-history: a new epoch of civilization where the historical record is so full of fabrication that it becomes effectively meaningless.’”

The Romans had laws against forgery in 80 BC, and people have been doctoring photos since the medium’s invention. What has changed isn’t the possibility of deception but its accessibility and scale.

With Veo 3’s ability to generate convincing video with synchronized dialogue and sound effects, we’re not witnessing the birth of media deception—we’re seeing its mass democratization. What once cost millions of dollars in Hollywood special effects can now be created for pocket change.

An AI-generated video created with Google Veo-3: “A candid interview of a woman who doesn’t believe anything she sees online unless it’s on Ars Technica.”

As these tools become more powerful and affordable, skepticism in media will grow. But the question isn’t whether we can trust what we see and hear. It’s whether we can trust who’s showing it to us. In an era where anyone can generate a realistic video of anything for $1.50, the credibility of the source becomes our primary anchor to truth. The medium was never the message—the messenger always was.

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
