MAGVIT: Masked Generative Video Transformer
We introduce MAGVIT to tackle diverse video synthesis tasks with a single model and demonstrate its quality, efficiency, and flexibility.
A single MAGVIT model is trained to perform 10 different video generation tasks. All examples in the left column are produced by the same MAGVIT model trained only on the public Something-Something-v2 dataset.
We compare different decoding methods to show the quality and efficiency of MAGVIT's COMMIT decoding. For each method, we train a base transformer with the same 3D-VQ tokenizer.
Condition frames
Real videos
MAGVIT COMMIT decoding 12 steps (ours)
Autoregressive decoding 1024 steps
MaskGIT (Chang et al. 2022) MTM decoding 12 steps
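The efficiency gap above comes from decoding many tokens per step in parallel rather than one token at a time. As a rough illustration only (not COMMIT's condition handling, and with random scores standing in for model confidences), the shared iterative masked-decoding schedule can be sketched as:

```python
import math
import random

def iterative_masked_decode(num_tokens=1024, num_steps=12, seed=0):
    """Toy sketch of non-autoregressive masked decoding (MaskGIT-style).

    Autoregressive decoding emits one token per step (num_tokens steps).
    Iterative masked decoding instead predicts all masked tokens in
    parallel each step and keeps only the most confident ones, following
    a cosine unmasking schedule. A real model would supply token ids and
    confidences; random scores stand in for them here.
    """
    rng = random.Random(seed)
    masked = set(range(num_tokens))
    for step in range(1, num_steps + 1):
        # Cosine schedule: fraction of tokens still masked after this step.
        frac_masked = math.cos(math.pi / 2 * step / num_steps)
        keep_masked = int(num_tokens * frac_masked)
        # Stand-in confidence score for each currently masked position.
        scores = {pos: rng.random() for pos in masked}
        # Unmask the highest-confidence positions down to keep_masked.
        num_to_unmask = len(masked) - keep_masked
        for pos in sorted(scores, key=scores.get, reverse=True)[:num_to_unmask]:
            masked.remove(pos)
    return masked  # empty: all 1024 tokens decoded in 12 parallel passes

remaining = iterative_masked_decode()
```

With 1024 tokens, this fills the whole sequence in 12 forward passes instead of 1024 sequential ones, which is where the speedup in the comparison above comes from.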
Comparing VQ Tokenizers on UCF-101
We compare different VQ tokenizers to demonstrate the superior reconstruction quality of MAGVIT 3D-VQ. These models are trained only on the 9.5K training videos of the small UCF-101 dataset. See Perceptual Compression for large real-world examples of MAGVIT 3D-VQ.
Real
MAGVIT 3D-VQ (ours)
TATS (Ge et al. 2022) 3D-VQ
MaskGIT (Chang et al. 2022) 2D-VQ
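All three tokenizers share the same core quantization step: each continuous latent vector is replaced by the id of its nearest codebook entry. A 3D-VQ produces those latents with spatiotemporal convolutions, a 2D-VQ per frame; the shapes and codebook size below are illustrative, not the paper's exact configuration. A minimal NumPy sketch:

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Nearest-codebook-entry lookup shared by 2D-VQ and 3D-VQ tokenizers.

    Replaces each continuous latent vector with the index of its closest
    codebook entry under Euclidean distance, yielding discrete token ids.
    """
    # Squared distances between every latent and every codebook entry,
    # via broadcasting: (N, 1, D) - (1, K, D) -> (N, K).
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d.argmin(axis=1)            # one token id per latent vector
    return indices, codebook[indices]     # ids and their quantized vectors

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 256))   # 1024 codes, 256-dim (assumed sizes)
latents = rng.normal(size=(64, 256))      # e.g. a flattened latent grid
ids, quantized = vector_quantize(latents, codebook)
```

The resulting token ids are what the transformer models; reconstruction quality then depends on how well the encoder/decoder around this lookup preserves appearance and motion, which is what the comparison above visualizes.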
Panoramic Video
Given a narrow vertical clip, MAGVIT can turn it into a panoramic video by applying video outpainting repeatedly on both sides.
Outpaint 10 times on each side
Outpaint 5 times on each side
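The panorama is built by growing the canvas one strip at a time: each pass conditions on the current edge and synthesizes a new strip beyond it. A toy sketch of that loop, with a stand-in `outpaint` callable in place of an actual MAGVIT outpainting call:

```python
def build_panorama(times_per_side,
                   outpaint=lambda edge, side: f"{side}-strip"):
    """Toy sketch of widening a video by repeated outpainting.

    The canvas starts as the original narrow clip; each pass outpaints one
    new strip on the left and one on the right, conditioned on the current
    edge. `outpaint` is a stand-in for a real model call (hypothetical
    signature), so strips are just labels here.
    """
    canvas = ["center"]
    for _ in range(times_per_side):
        canvas.insert(0, outpaint(canvas[0], "left"))    # extend leftward
        canvas.append(outpaint(canvas[-1], "right"))     # extend rightward
    return canvas

panorama = build_panorama(times_per_side=5)   # 5 passes per side
```

Ten passes per side, as in the first example above, yields a canvas of 21 strips; five passes yields 11.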
Image to Animation
Given a single image, MAGVIT can turn it into an animation by frame prediction, optionally with action conditions.
Stop Motion
Given two images, MAGVIT can turn them into a stop-motion animation by frame interpolation.
Future Prediction
Given a single image, MAGVIT predicts future frames to extend it into a video, optionally with action conditions.
Perceptual Compression
MAGVIT compresses a video by over 600x into a learned latent space.
We compare the original (top) and reconstructed (bottom) videos at 240p resolution.
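The compression factor follows from counting bits. Assuming the numbers we understand the MAGVIT setup to use (a 16-frame 128×128 RGB clip tokenized into a 4×16×16 grid with a 1024-entry codebook, i.e. 10 bits per token), a back-of-the-envelope check:

```python
import math

def compression_ratio(frames=16, height=128, width=128,
                      latent_t=4, latent_h=16, latent_w=16,
                      codebook_size=1024, bits_per_channel=8):
    """Bits in the raw RGB clip divided by bits in its token grid.

    Default values are assumed from the MAGVIT setup as we understand it;
    adjust them to check other configurations.
    """
    pixel_bits = frames * height * width * 3 * bits_per_channel
    token_bits = latent_t * latent_h * latent_w * math.log2(codebook_size)
    return pixel_bits / token_bits

ratio = compression_ratio()   # roughly 614x, i.e. over 600x
```

Under these assumptions the ratio works out to about 614×, consistent with the "over 600×" figure above.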