Microsoft Unveils Powerful AI Video Generator That Only Needs 1 Pic & Audio Clip
By Mikelle Leow, 19 Apr 2024
Image via Microsoft, background generated with AI
Microsoft’s research arm has unveiled a new AI model called VASA-1 that aims to push video generation past the uncanny valley, creating hyperrealistic talking faces from just a single photograph and an audio sample.
VASA-1, whose name refers to “visual affective skills” (VAS), goes beyond basic lip-syncing: it animates the entire face with natural expressions and head movements that correspond to the provided audio, creating a convincing illusion of a real person speaking. The source image is static and can even be AI-generated (the company used StyleGAN2 and DALL-E 3 for its inputs), yet VASA-1 makes it look believably alive.
Video via Microsoft
The tool uses a “face latent space” model to generate facial dynamics and head movements holistically. The model disentangles an image’s appearance, 3D head pose, and facial expressions, and this separation allows each to be controlled and customized independently in the generated video. VASA-1 is also flexible: it can handle photo and audio inputs that fall outside its training data, such as artistic photos, singing, and non-English speech.
While similar technology exists from companies like Runway and Nvidia, VASA-1 appears to reach a new level of realism by reducing mouth distortions and capturing a wider range of facial subtleties. The research follows Google’s recent VLOGGER model, pointing to a growing trend toward expressive, lifelike AI-generated characters.
Here’s what the model can achieve with just a single portrait and a minute-long audio clip:
Video via Microsoft
And here’s the Mona Lisa alive, kicking, and rapping at over 500 years young.
Video via Microsoft
Potential applications include creating personalized avatars for educational purposes, generating realistic spokespersons for virtual assistants, building eerily lifelike video game non-player characters (NPCs), or even reviving historical figures in interactive exhibits.
However, ethical concerns also arise. The ability to map speech and movement onto any image raises questions about potential misuse for deepfakes or misinformation. As such, the tech giant is currently keeping the technology behind closed doors.
Microsoft has yet to announce public availability for VASA-1, as it remains a research project. For now, you will have to attend your own virtual meetings—no digital stand-ins yet.
[via BGR, Android Authority, SiliconANGLE, Tom’s Guide, videos and cover image via Microsoft]