Synthesia’s AI clones are more expressive than ever. Soon they’ll be able to talk back.

Earlier this summer, I stepped into the sleek, glass-fronted lobby of a prestigious London office building, took an elevator up, and followed a corridor to a pristine, carpeted studio bathed in natural sunlight. Overhead, large umbrella-shaped lights enhanced the brightness. Standing before a tripod-mounted camera and a laptop displaying a teleprompter, I braced myself and began reading a script aloud.

I’m neither a professional broadcaster nor an actor auditioning for a role. Instead, I was at Synthesia’s AI studio to provide the raw material for a hyperrealistic AI avatar of myself. Synthesia’s avatars are a striking example of how rapidly AI technology has evolved in recent years, and I was eager to see how closely its newest model, launched just last month, could mimic my appearance and mannerisms.

From Humble Beginnings to Cutting-Edge Avatars

When Synthesia debuted in 2017, its main goal was to create AI-generated faces of real people, such as former athletes, paired with dubbed voices in multiple languages. By 2020, the platform had expanded to allow businesses to produce polished presentation videos featuring AI versions of employees or consenting actors. However, early avatars often exhibited stiff, unnatural movements, inconsistent accents, and emotional expressions that didn’t match the voice.

Today, Synthesia’s avatars have undergone significant refinement. They now display fluid gestures, nuanced facial expressions, and voices that better preserve the speaker’s unique accent and intonation. For corporate clients, this means more engaging and professional presentations for everything from financial briefings to internal training sessions.

Experiencing the Avatar Creation Firsthand

My colleague Melissa’s previous visit to Synthesia involved a lengthy calibration process, requiring her to read scripts in various emotional tones and mouth specific sounds to help the avatar articulate vowels and consonants. Fifteen months later, my experience was notably more streamlined. Josh Baker-Mendoza, Synthesia’s technical supervisor, encouraged me to use natural hand gestures while cautioning against excessive movement. I read an enthusiastic script designed to elicit expressive delivery, resulting in a digital persona that felt like a blend of Steve Jobs’ charisma and a British monotone.

Though the script made me sound like a Synthesia spokesperson (“I am thrilled to share our groundbreaking innovations with you today”), capturing all the necessary footage took just an hour. Within weeks, I received two avatars: one created with the older Express-1 model and another with the latest Express-2 technology. Synthesia claims the newer model produces avatars with more lifelike hand gestures, facial expressions, and speech patterns.

Video comparison courtesy of Synthesia

Melissa’s earlier Express-1 avatar struggled to capture her transatlantic accent and emotional range: when asked to sound angry, it came across as more whiny than furious. My Express-1 avatar likewise exhibited rapid blinking and awkward synchronization between speech and body language. In contrast, the Express-2 version closely resembled me, with facial features and a voice that were eerily accurate. Although it gestured more than I typically do, its movements generally aligned with the spoken words.

Yet subtle giveaways remain: unnaturally smooth, pink palms; stiff strands of hair that don’t move naturally; glassy, infrequent blinking; and the occasional odd vocal inflection, such as an out-of-place “This is great!” before the voice settles back into a more measured tone.

The Uncanny Valley and Emotional Authenticity

Anna Eiserbeck, a postdoctoral psychology researcher at Humboldt University of Berlin who studies how people react to deepfake faces, admitted she might not have immediately recognized my avatar as artificial. But she noted subtle inconsistencies, like a static earring and abrupt body movements, that eventually gave away its synthetic nature.

More profoundly, she sensed an emotional void. “There’s no genuine feeling behind it; it’s not a conscious being,” she explained. Watching the avatar evoked an uncanny sensation: a reminder that despite its visual realism, the AI lacks true emotional depth.

Reflecting on this, I realized part of my discomfort stemmed from the avatar’s overly cheerful tone, which contrasts with my typically reserved British demeanor. Repeatedly watching the video loop made me question my own gestures and speech patterns, much like the humbling experience of seeing oneself on a Zoom call, but amplified by the presence of a full digital double.

Back when Facebook was new in the UK nearly two decades ago, my friends and I found it hilarious to hack each other’s accounts and post outrageous updates. I wonder if the future equivalent will be manipulating someone’s avatar to say embarrassing things: perhaps endorsing a controversial figure, or confessing a guilty pleasure like enjoying pop music.

Express-2 transforms every subject into a polished, energetic presenter with exaggerated body language, ideal for corporate videos but less reflective of individual personality. Watching my avatar felt less like seeing myself and more like encountering a distinct, artificial persona.

Behind the Scenes: The Technology Powering AI Avatars

According to Björn Schuller, a professor of AI at Imperial College London, the main challenge isn’t replicating physical appearance but capturing authentic behavior. “Getting the micro-gestures, intonation, voice tone, and timing right is crucial,” he says. “An AI frowning at the wrong moment could completely change the message.”

Synthesia’s latest advancements involve multiple AI models working in concert. A voice cloning system preserves the speaker’s accent and expressiveness, avoiding the robotic tones common in other voice synthesis technologies.

When a script is uploaded to Express-1, the system analyzes the text to determine the appropriate emotional tone, which feeds into a diffusion model that animates facial expressions and movements accordingly.
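In rough outline, that flow amounts to a two-stage pipeline: classify the script’s tone, then condition an animation model on it. Here is a minimal, hypothetical Python sketch of that idea; the function names, tone labels, and keyword heuristic are my own illustrative assumptions, not Synthesia’s actual code or API.

```python
# Hypothetical sketch of the Express-1 flow described above.
# All names and the keyword heuristic are illustrative assumptions.

def estimate_tone(script: str) -> str:
    """Stub sentiment pass: map the script to a coarse emotional label."""
    upbeat = ("thrilled", "excited", "groundbreaking")
    return "enthusiastic" if any(w in script.lower() for w in upbeat) else "neutral"

def animate_face(script: str, tone: str) -> str:
    """Stub for the diffusion model that drives expressions to match the tone."""
    return f"animation<{tone}, {len(script.split())} words>"

script = "I am thrilled to share our groundbreaking innovations with you today."
print(animate_face(script, estimate_tone(script)))  # animation<enthusiastic, 11 words>
```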

Express-2 enhances this process with three additional models: one generates gestures synchronized with speech, another evaluates alignment between audio and motion to select the best match, and a powerful rendering model produces the final avatar. This rendering engine boasts billions of parameters (far surpassing Express-1’s few hundred million), enabling faster and more nuanced avatar creation.
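That description suggests a generate-and-rank loop: propose several gesture tracks, score each against the audio, and render only the winner. The runnable sketch below shows the shape of such a loop; every function name, signature, and the candidate count are assumptions for illustration, since Synthesia has not published its implementation.

```python
# Hypothetical generate-and-rank pipeline in the spirit of the Express-2
# description above. Stubs stand in for the three models; none of this is
# Synthesia's actual code.
import random

def propose_gestures(audio: list[float], n_candidates: int = 8) -> list[list[float]]:
    """Model 1 (stub): sample several plausible gesture tracks from the speech audio."""
    return [[random.random() for _ in audio] for _ in range(n_candidates)]

def alignment_score(audio: list[float], motion: list[float]) -> float:
    """Model 2 (stub): estimate how well motion timing matches the audio envelope."""
    return -sum((a - m) ** 2 for a, m in zip(audio, motion))  # higher is better

def render_avatar(identity: str, audio: list[float], motion: list[float]) -> str:
    """Model 3 (stub): the heavyweight rendering model that produces the final video."""
    return f"video<{identity}, {len(motion)} frames>"

def synthesize(identity: str, audio: list[float]) -> str:
    candidates = propose_gestures(audio)                             # generate gesture candidates
    best = max(candidates, key=lambda m: alignment_score(audio, m))  # pick the best audio-motion fit
    return render_avatar(identity, audio, best)                      # render the chosen take

print(synthesize("rhiannon", [0.2, 0.9, 0.4, 0.7]))
```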

Youssef Alami Mejjati, Synthesia’s head of R&D, explains, “Previously, the system needed to observe someone expressing emotions to replicate them. Now, trained on vast and diverse datasets, it learns these associations automatically.”

Bridging the Gap: Making AI Avatars More Relatable

While AI-generated humanlike avatars have existed for years, the surge in generative AI has made creating realistic synthetic humans far more accessible and affordable. Synthesia, alongside competitors such as Rephrase.ai and Hour One, lets businesses produce engaging videos featuring AI actors or digital replicas of their own staff, offering cost-effective marketing and training solutions.

In China, AI avatars have gained popularity in e-commerce, where they can promote products around the clock without fatigue or salary demands.

Currently, Synthesia focuses primarily on corporate applications but is exploring expansion into education and entertainment. A recent partnership with Google integrates Google’s generative video model, Veo 3, enabling users to embed AI-generated clips seamlessly into Synthesia videos. This hints at a future where AI avatars could star in dynamic virtual environments with customizable backdrops.

For example, a video might feature a Synthesia avatar explaining the operation of meat-processing machinery alongside generated footage of the equipment. Future iterations could tailor educational content to individual knowledge levels, pitching the same biology lecture at either experts or high school students. Alex Voica, Synthesia’s head of corporate affairs, envisions this as a more engaging and personalized learning experience.

The Next Step: Interactive AI Avatars

Synthesia aims to develop avatars capable of real-time interaction: understanding and responding to user input much like ChatGPT, but embodied in a lifelike digital human. Currently, users can interact with avatars during quizzes, but the goal is to enable natural conversations in which the avatar can pause, elaborate, or answer questions on demand.

“Our mission is to create the best learning experience through entertaining, personalized, and interactive video,” says Alami Mejjati. “This is the missing piece in today’s online education, and we’re close to achieving it.”

Research shows humans can form emotional bonds with AI, even with simple text-based chatbots. Pat Pataranutaporn, an assistant professor at the MIT Media Lab, warns that adding a realistic human face could intensify this effect, potentially leading to new forms of AI dependency.

“If the system becomes too lifelike, people might develop strong attachments,” he notes. “We’ve seen cases where users become emotionally invested in AI companions through text alone. A talking avatar would be even more compelling.”

Schuller concurs, predicting future avatars will be finely tuned to maintain engagement by modulating emotion and charisma. “It will be tough for humans to compete with AI that’s always available, attentive, and understanding,” he says. “AI will transform human-to-human connection.”

Contemplating the Digital Doppelgänger

As I watch my Express-2 avatar, I imagine conversing with this perpetually upbeat, ever-present digital version of myself: an entity made of pixels and algorithms that looks and sounds like me but lacks my lived experiences. This virtual Rhiannon has never laughed until tears fell, fallen in love, run a marathon, or witnessed a sunset in a foreign land.

Yet she could deliver an impeccable presentation on why Ed Sheeran is the UK’s greatest musical export, and only those closest to me would know she isn’t truly me.
