Learning to predict the long-term future of video frames is notoriously challenging due to the inherent ambiguities of the distant future and the dramatic amplification of prediction error over time. Despite recent advances in the literature, existing approaches are limited to moderately short-term prediction (less than a few seconds), and extrapolating them to a longer future quickly destroys structure and content. In this work, we revisit hierarchical models for video prediction. Our method generates future frames by first estimating a sequence of dense semantic structures and then translating the estimated structures to pixels with a video-to-video translation model. Despite its simplicity, we show that modeling structures and their dynamics in a categorical structure space with a stochastic sequential estimator leads to surprisingly successful long-term prediction. We evaluate our method on two challenging video prediction scenarios, car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon (i.e., thousands of frames), setting a new standard for video prediction with prediction times orders of magnitude longer than those of existing approaches.
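The two-stage idea above can be illustrated with a minimal sketch. It is not the authors' implementation: `rollout`, `toy_step`, `toy_translate`, and `PALETTE` are hypothetical names, and the toy functions are trivial stand-ins for the learned stochastic structure estimator and the video-to-video translation model. The sketch only shows the control flow: label maps are predicted autoregressively and stay categorical at every step, and pixels are produced afterwards from the predicted labels.

```python
import random

def rollout(context_labels, num_future, step_fn, translate_fn, rng):
    """Stage 1: predict future label maps one step at a time.
    Stage 2: translate each predicted label map to pixels."""
    history = list(context_labels)
    for _ in range(num_future):
        # Each prediction is fed back as input, always as discrete class labels.
        history.append(step_fn(history, rng))
    future_labels = history[len(context_labels):]
    frames = [translate_fn(labels) for labels in future_labels]
    return future_labels, frames

# Hypothetical 3-class palette used by the toy translator.
PALETTE = {0: (0, 0, 0), 1: (255, 0, 0), 2: (0, 255, 0)}

def toy_step(history, rng):
    """Toy stand-in for a stochastic estimator: shift the last label map
    one pixel to the right, randomly keeping the old class at some pixels."""
    last = history[-1]
    out = []
    for row in last:
        shifted = row[-1:] + row[:-1]
        out.append([old if rng.random() < 0.05 else new
                    for old, new in zip(row, shifted)])
    return out

def toy_translate(labels):
    """Toy stand-in for the translation model: color each class label."""
    return [[PALETTE[c] for c in row] for row in labels]

rng = random.Random(0)
# Five random 4x6 label maps play the role of the context frames.
context = [[[rng.randrange(3) for _ in range(6)] for _ in range(4)]
           for _ in range(5)]
future_labels, frames = rollout(context, 40, toy_step, toy_translate, rng)
```

Because every intermediate prediction is a valid label map, errors cannot drift into arbitrary pixel values during the rollout, which is the intuition behind predicting in structure space first.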
We collected Web videos of a single person performing various dance moves (link). All models are trained to predict 40 future frames given 5 context frames, where the frames are subsampled at an interval of 2. All predictions shown below are conditioned on 5 ground-truth context frames.
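As a small illustration of the clip layout described above, the sketch below computes which raw-video frame indices would serve as context and as prediction targets. The helper name `split_clip` and the assumption that clips start at frame 0 are ours, not part of the original setup.

```python
def split_clip(num_context, num_future, interval, start=0):
    """Return (context_indices, future_indices) into the raw video,
    keeping every `interval`-th frame beginning at `start`."""
    total = num_context + num_future
    indices = [start + i * interval for i in range(total)]
    return indices[:num_context], indices[num_context:]

# Dance setting: 5 context frames, 40 future frames, interval of 2.
context_idx, future_idx = split_clip(num_context=5, num_future=40, interval=2)
print(context_idx)                  # [0, 2, 4, 6, 8]
print(future_idx[0], future_idx[-1])  # 10 88
```

Under these assumptions, one training clip in this setting spans 89 consecutive raw frames.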
Given the predicted masks in the first column, we can synthesize dancing sequences with diverse human appearances, as shown in the remaining columns. To do this, we train a separate Vid2Vid model for each appearance and use it for the pixel-level translation.
All models are trained to predict 15 future frames given 5 context frames, where the frames are subsampled at an interval of 2. All predictions shown below are conditioned on 5 ground-truth context frames.
All models are trained to predict 3 future frames given 4 context frames, where the frames are subsampled at an interval of 3. All predictions are conditioned on 4 ground-truth context frames.
In the Cityscapes dataset, we observe that some structures far from the dashboard camera tend to remain static. This is because the Cityscapes dataset consists of very short training sequences (fewer than 30 frames), which prevents the model from learning global scene transitions. Please refer to our paper for a more detailed discussion.