Meta’s new text-to-video AI generator is like DALL-E for video

AI text-to-image generators have been making headlines in recent months, but researchers are already moving on to the next frontier: AI text-to-video generators.

A team of machine learning engineers from Facebook’s parent company Meta has unveiled a new system called Make-A-Video. As the name suggests, this AI model allows users to type in a rough description of a scene, and it will generate a short video matching their text. The videos are clearly artificial, with blurred subjects and distorted animation, but still represent a significant development in the field of AI content generation.

The model’s output is clearly artificial but still impressive

“Generative AI research is pushing creative expression forward by giving people tools to quickly and easily create new content,” said Meta in a blog post announcing the work. “With just a few words or lines of text, Make-A-Video can bring imagination to life and create one-of-a-kind videos full of vivid colors and landscapes.”

In a Facebook post, Meta CEO Mark Zuckerberg described the work as “amazing progress,” adding: “It’s much harder to generate video than photos because beyond correctly generating each pixel, the system also has to predict how they’ll change over time.”

The clips are no longer than five seconds and contain no audio but span a huge range of prompts. The best way to judge the model’s performance is to watch its output. Each of the videos below was generated by Make-A-Video and is captioned with the prompt used to produce it. It’s worth noting, however, that each video was provided to The Verge by Meta, which is not currently allowing anyone access to the model. That means the clips could have been cherry-picked to show the system in its best light.

Again, while it’s clear these videos are computer-generated, the output of such AI models will improve rapidly in the near future. As a comparison, in just the space of a few years, AI image generators have gone from creating borderline incomprehensible pictures to photorealistic content. And though progress in video could be slower given the near-limitless complexity of the subject matter, the prize of seamless video generation will motivate many institutions and companies to pour great resources into such projects.

As with text-to-image models, there is potential for harmful applications

In Meta’s blog post announcing Make-A-Video, the company notes that video generation tools could be invaluable “for creators and artists.” But, as with text-to-image models, there are worrying prospects, too. The output of these tools could be used for misinformation, propaganda, and (more likely, based on what we’ve seen with AI image systems and deepfakes) generating nonconsensual pornography that can be used to harass and intimidate women.

Meta says it wants to be “thoughtful about how we build new generative AI systems like this” and is currently publishing only a paper on the Make-A-Video model. The company says it plans to release a demo of the system but does not say when or how access to the model might be limited.

Meta is not the only institution working on AI video generators, either. Earlier this year, for example, a group of researchers from Tsinghua University and the Beijing Academy of Artificial Intelligence (BAAI) released their own text-to-video model, named CogVideo, which is the only other publicly available text-to-video model. Sample output from CogVideo, which you can watch online, is limited in much the same way as Meta’s work.

In a paper describing the model, Meta’s researchers note that Make-A-Video was trained on pairs of images and captions as well as unlabeled video footage. Training content was sourced from two datasets (WebVid-10M and HD-VILA-100M), which together contain millions of videos spanning hundreds of thousands of hours of footage. This includes stock video footage created by sites like Shutterstock and scraped from the web.

The researchers note in the paper that the model has many technical limitations beyond blurry footage and disjointed animation. For example, their training methods are unable to learn information that can only be inferred by a human watching a video, such as whether a waving hand is moving left to right or right to left. Other limitations include generating videos longer than five seconds, videos with multiple scenes and events, and footage at higher resolutions. Make-A-Video currently outputs 16 frames of video at a resolution of 64 by 64 pixels, which are then upscaled to 768 by 768 pixels by a separate AI model.
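To make that two-stage output pipeline concrete, here is a minimal sketch in Python using PyTorch. The function names and interfaces are hypothetical stand-ins, since Make-A-Video’s actual components are not public, and plain bilinear interpolation is used purely as a placeholder for the separate upscaling model.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for illustration only; Make-A-Video's real
# components and interfaces are not publicly available.

def generate_low_res_clip(prompt: str) -> torch.Tensor:
    """Placeholder text-to-video step: returns 16 RGB frames at 64x64."""
    torch.manual_seed(abs(hash(prompt)) % (2**31))
    return torch.rand(16, 3, 64, 64)  # (frames, channels, height, width)

def upscale_frames(frames: torch.Tensor, size: int = 768) -> torch.Tensor:
    """Placeholder for the separate super-resolution model: here, plain
    bilinear interpolation boosts each 64x64 frame to 768x768."""
    return F.interpolate(frames, size=(size, size), mode="bilinear",
                         align_corners=False)

low_res = generate_low_res_clip("a teddy bear painting a portrait")
high_res = upscale_frames(low_res)
print(low_res.shape)   # torch.Size([16, 3, 64, 64])
print(high_res.shape)  # torch.Size([16, 3, 768, 768])
```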

Meta’s team also notes that, like all AI models trained on data scraped from the web, Make-A-Video has “learnt and likely exaggerated social biases, including harmful ones.” In text-to-image models, these biases often reinforce social prejudices. For example, ask a model to generate an image of a “terrorist,” and it will likely depict someone wearing a turban. However, it’s impossible to say what biases Meta’s model has learned without open access.

Meta says it is “openly sharing this generative AI research and results with the community for their feedback, and will continue to use our responsible AI framework to refine and evolve our approach to this emerging technology.”