Microsoft says its new VASA-1 AI framework for generating lifelike talking faces of virtual characters is so good that it could easily be misused to impersonate real people. The company therefore says it has “no plans” to release any aspect of it until it can be sure it will be used responsibly.
What’s The Problem?
2024 is an election year in at least 64 countries (including the US, UK, India, and South Africa) and the risk of AI being misused to spread misinformation has grown dramatically. In the US, for example, the Senate Committee on the Judiciary, Subcommittee on Privacy, Technology, and the Law has held a hearing titled “Oversight of AI: Election Deepfakes”. There is also now widespread recognition of the threats posed by deepfakes, and governments and the private sector are taking proactive measures to safeguard electoral integrity. AI companies are keenly aware of the risks and have been taking their own measures. For example, Google has restricted the kinds of election-related questions that its Gemini AI chatbot will answer.
Google has also recently (in a blog post) addressed concerns in India about AI’s potential impact (deepfakes and misinformation) on what is the world’s largest election. None of the main AI companies, therefore, has wanted to simply release its latest generative AI models without being seen to test them and include what safeguards it can against misuse. Nor is any of them keen to be publicly singled out as enabling electoral interference.
VASA-1
Microsoft says its VASA-1 AI can produce lifelike audio-driven talking faces, generated in real-time, all from a single static portrait photo and a speech audio clip.
How Good Is It?
Microsoft says that its premier model, VASA-1, is “capable of not only producing lip movements that are exquisitely synchronised with the audio, but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness.”
The “core innovations” of VASA-1 include “a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos”.
See some demos of VASA-1 in action here: https://www.microsoft.com/en-us/research/project/vasa-1/
Key Benefits
Microsoft says some of the key benefits of the VASA-1 model that set it apart are:
– Realism and liveliness. The model can produce convincing lip-audio synchronisation, and a large spectrum of expressive facial nuances and natural head motions. It can also handle arbitrary-length audio and stably output seamless talking face videos.
– Controllability of generation. Microsoft says its diffusion model accepts optional signals as conditions, such as main eye gaze direction and head distance, and emotion offsets.
– Out-of-distribution generalisation. In other words, the model can handle photo and audio inputs that fall outside its training distribution, e.g., artistic photos, singing audio, and non-English speech.
– Power of disentanglement. VASA-1’s latent representation disentangles appearance, 3D head pose, and facial dynamics, enabling separate attribute control and editing of the generated content.
– Real-time efficiency. Microsoft says VASA-1 generates video frames of 512×512 size at 45fps in the offline batch processing mode and can support up to 40fps in the online streaming mode with a preceding latency of only 170ms, evaluated on a desktop PC with a single NVIDIA RTX 4090 GPU.
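To put the real-time figures quoted above into perspective, the short Python sketch below converts them into per-frame time budgets. It is illustrative arithmetic only, not anything published by Microsoft: the function and variable names are made up for this example, and it assumes that the quoted “preceding latency” means the delay before the first frame appears.

```python
# Illustrative arithmetic on the figures Microsoft quotes for VASA-1.
# All names here are hypothetical; nothing below is Microsoft code.

def frame_interval_ms(fps: float) -> float:
    """Time available to produce one frame at a given frame rate."""
    return 1000.0 / fps

OFFLINE_FPS = 45            # quoted offline batch processing rate
STREAMING_FPS = 40          # quoted online streaming rate (upper bound)
STARTUP_LATENCY_MS = 170.0  # quoted "preceding latency" (assumed: delay before first frame)
FRAME_SIDE_PX = 512         # frames are 512x512 pixels

offline_ms = frame_interval_ms(OFFLINE_FPS)      # about 22.2 ms per frame
streaming_ms = frame_interval_ms(STREAMING_FPS)  # 25.0 ms per frame

# How many streaming-rate frames fit into the start-up delay.
frames_of_delay = STARTUP_LATENCY_MS / streaming_ms  # about 6.8 frames

# Raw pixels generated per second when streaming at 40fps.
pixels_per_second = FRAME_SIDE_PX * FRAME_SIDE_PX * STREAMING_FPS

print(f"Offline: {offline_ms:.1f} ms per frame")
print(f"Streaming: {streaming_ms:.1f} ms per frame")
print(f"Start-up delay is roughly {frames_of_delay:.1f} frames")
print(f"Pixels per second when streaming: {pixels_per_second:,}")
```

In other words, on the quoted hardware the model would have roughly 25ms to produce each streamed frame, and the 170ms start-up delay is equivalent to fewer than seven frames.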
Not Yet
However, Microsoft says it is holding back the release of VASA-1 until privacy and usage issues have been addressed, stating that: “we have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations”.
What Does This Mean For Your Business?
Given what VASA-1 can do, you’d think Microsoft would be itching to get it out there, monetised, and competing with the likes of Google’s Gemini family of models. However, as with Gemini and other generative AI, it may not be fully ready and may have some issues. Gemini, for example, received widespread criticism and had to be worked on to correct ‘historical inaccuracies’ and outputs that critics labelled ‘woke’.
This is also, crucially, a busy electoral year globally. Governments are nervous, are trying to introduce legislation and safeguards, and are keeping a close eye on AI companies and on their products’ potential to fuel deepfakes, misinformation/disinformation, electoral interference, and cybercrime. As such, AI companies are queuing up to be seen to be acting as responsibly and ethically as possible, saying that they are holding back and testing every aspect of their products that could be misused. Doing so also helps them avoid the scrutiny of governments and regulators, bad publicity, and potential penalties.
As some have pointed out, however, it would be difficult for anyone to police who uses particular AI models and for what purpose, and determined individuals can already build very sophisticated models from open source code found on GitHub. All that said, it shouldn’t be forgotten that VASA-1 appears to be very advanced and could offer many benefits and useful value-adding applications, e.g. for personalising emails and other business mass-communication. It remains to be seen how long Microsoft is prepared to wait before making VASA-1 generally available.