Generative AI models have come a remarkably long way in recent years. Their capabilities have grown considerably with the advancement of diffusion models. In particular, text-to-image generation with diffusion models has produced truly impressive results.
It didn’t end there, however. We’ve seen AI models successfully tackle a whole range of text-to-X tasks. From style transfer to 3D object generation, these diffusion-based models have outperformed previous approaches, producing semantically correct and visually pleasing results.
Probably the most appealing frontier has been text-to-video models. The idea of seeing what “an astronaut on horseback with a koala by his side on the Moon” would look like without spending hours on CGI was obviously very interesting to people. However, despite a few successful attempts, text-to-video generation is still an under-explored task.
The task of generating video from text is extremely difficult by nature, and it is particularly hard to tackle with diffusion models for several reasons. First, building a large-scale dataset of video-text description pairs is much more difficult than collecting image-text pairs. It is not easy to describe video content in a single sentence, and a video may contain several scenes in which most frames carry little useful information.
Moreover, video itself is a complex source of information. It contains intricate visual dynamics that are much harder to learn than static images. Add the temporal relationships between frames on top of that, and modeling video content becomes really difficult.
Finally, a typical video contains around 30 frames per second, so there will be hundreds, if not thousands, of frames in a single video clip. Therefore, processing long videos requires a huge amount of computational resources.
These limitations have forced recent diffusion-based models to generate low-resolution video and then apply super-resolution to improve visual quality. However, even this trick is not enough to tame the enormous computational complexity.
So what is the solution? How can we transfer the success of image generation models to the video generation task? Can we build a diffusion model capable of generating high-quality, temporally consistent videos? The answer is yes, and it has a name: MagicVideo.
MagicVideo generates videos in a latent space obtained with a pre-trained variational autoencoder (VAE). This trick drastically reduces MagicVideo’s computational requirements. Moreover, it has several more tricks up its sleeve to address the problems mentioned above.
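To make the latent-space idea concrete, here is a minimal PyTorch-style sketch. It is not the authors’ code: the encoder, decoder, and all shapes are stand-ins chosen for illustration. The point is simply that the expensive diffusion process runs on small latents rather than full-resolution frames.

```python
import torch

# Stand-ins for a pre-trained VAE; names and shapes are illustrative only.
vae_encoder = torch.nn.Conv2d(3, 4, kernel_size=8, stride=8)            # 8x spatial downsampling
vae_decoder = torch.nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)   # maps latents back to pixels

frames = torch.randn(16, 3, 256, 256)   # a 16-frame clip in pixel space
latents = vae_encoder(frames)           # -> (16, 4, 32, 32): far fewer spatial positions

# Diffusion (noising and iterative denoising) would operate on these small latents.
noisy_latents = latents + torch.randn_like(latents)   # schematic forward-diffusion step
# ... a denoising network would iteratively remove the noise here ...

video = vae_decoder(latents)            # decode latents back to pixel-space frames
print(video.shape)                      # torch.Size([16, 3, 256, 256])
```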
MagicVideo uses 2D convolutions instead of 3D convolutions to get around the lack of a large paired video-text dataset. Temporal operators are combined with the 2D convolution operations so that both the spatial and the temporal information in a video are processed, as sketched below. Additionally, relying on 2D convolutions allows MagicVideo to reuse the pre-trained weights of text-to-image models.
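A common way to realize this, which the description above suggests, is to fold the time axis into the batch axis so that a plain 2D (image) convolution processes each frame, and then apply a separate temporal operator that mixes information across frames. The block below is a hedged sketch of that pattern, not the paper’s exact architecture; the class name and the choice of a 1D temporal convolution are assumptions.

```python
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Illustrative block: per-frame 2D convolution followed by a simple
    temporal operator (here, a 1D convolution over the time axis)."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal_op = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape

        # Fold time into the batch axis so a 2D image conv handles every frame.
        x = self.spatial_conv(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)

        # Temporal operator: mix each spatial position across frames.
        x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)   # (b*h*w, c, t)
        x = self.temporal_op(x)
        x = x.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)     # back to (b, t, c, h, w)
        return x

block = SpatialTemporalBlock(channels=8)
out = block(torch.randn(2, 16, 8, 32, 32))
print(out.shape)   # torch.Size([2, 16, 8, 32, 32])
```

Because the spatial part is an ordinary 2D convolution, its weights can in principle be initialized from a pre-trained text-to-image model, which is the benefit the article points to.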
Even though switching from 3D to 2D convolutions greatly reduces the computational complexity, the memory cost is still too high. Thus, MagicVideo shares the same 2D convolution weights across frames. On its own, this would hurt generation quality, because it implicitly assumes that all frames are nearly identical when in reality they change over time. To overcome this problem, MagicVideo uses a lightweight, frame-wise adaptor module to adjust the feature distribution of each frame.
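The paper’s exact adaptor is not reproduced here; the sketch below only illustrates the general idea under an assumption about its form: the heavy 2D convolution is shared by all frames, and each frame gets just a per-channel scale and shift of its own to correct its feature distribution.

```python
import torch
import torch.nn as nn

class FrameAdaptor(nn.Module):
    """Illustrative lightweight adaptor: one shared 2D conv for all frames,
    plus a tiny per-frame, per-channel scale and shift."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.shared_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Very few extra parameters: one scale and one shift per frame and channel.
        self.scale = nn.Parameter(torch.ones(num_frames, channels, 1, 1))
        self.shift = nn.Parameter(torch.zeros(num_frames, channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        x = self.shared_conv(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Adjust each frame's feature distribution with its own scale/shift.
        return x * self.scale.unsqueeze(0) + self.shift.unsqueeze(0)

adaptor = FrameAdaptor(channels=8, num_frames=16)
out = adaptor(torch.randn(2, 16, 8, 32, 32))
print(out.shape)   # torch.Size([2, 16, 8, 32, 32])
```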
MagicVideo learns the relationships between frames with a directed temporal self-attention module: each frame attends only to the frames that come before it, similar to the way video codecs predict frames from previous ones. Finally, the generated video clips are refined by a post-processing module.
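Directed self-attention over frames can be read as causal attention along the time axis. The snippet below is a minimal sketch of that idea (an assumption about the mechanism, not the authors’ exact module), using a boolean mask to block attention to future frames.

```python
import torch
import torch.nn as nn

def directed_temporal_attention(x: torch.Tensor, attn: nn.MultiheadAttention) -> torch.Tensor:
    """x: (batch, time, dim) per-frame feature vectors.
    Each frame attends only to itself and to earlier frames."""
    t = x.shape[1]
    # True entries above the diagonal forbid attending to future frames.
    causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    out, _ = attn(x, x, x, attn_mask=causal_mask)
    return out

dim = 64
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
frame_features = torch.randn(2, 16, dim)   # 2 clips, 16 frames each
out = directed_temporal_attention(frame_features, attn)
print(out.shape)   # torch.Size([2, 16, 64])
```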
MagicVideo is another step towards reliable text-to-video generation. It transfers the success of image generation models to the video domain and keeps the computational cost manageable by generating videos in latent space.
This was a brief summary of MagicVideo. You can find more information in the links below if you want to learn more about it.
Check out the Paper and Project. All credit for this research goes to the researchers on this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Ekrem Çetinkaya obtained his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.