This was my first research project on video generation. It focused on improving long-sequence video prediction by integrating memory modules into conventional CNN architectures.
Predicting long sequences of high-quality video with simple, low-cost models remains a challenge. Recent state-of-the-art methods such as SimVP have demonstrated significant improvements in generation quality using a block-to-block approach with convolutional neural networks (CNNs), eliminating frequent frame-level processing. However, block-based prediction still struggles on long sequences. In this work, we present a new model, ViP-LVM, which introduces three main ideas to improve on this front. First, we shift the conditioning of the generator from previously generated frames to pre-generation latent representations, thereby minimizing encoding inaccuracies and reducing the compounding errors that arise from re-encoding generated frames. Second, we introduce STEAM blocks, CNN-based network blocks that contain a learnable memory; they improve long-term information preservation and counteract information loss over long time horizons. Third, we introduce a memory update mechanism that enables read and write operations on the global context, effectively giving the model control over what information to keep or discard. We show that these changes improve video prediction on Moving MNIST, KTH, Cityscapes, and BAIR.
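
To make the first idea concrete, here is a minimal sketch of a latent-space rollout. The names (`rollout`, `encoder`, `predictor`, `decoder`, `horizon`) are illustrative placeholders, not the actual ViP-LVM interface; the point is only that each step conditions on the previous latent representation instead of re-encoding the decoded frame.

```python
import torch

def rollout(encoder, predictor, decoder, context_frames, horizon):
    """Hypothetical latent-space rollout: decoded frames are never re-encoded,
    so encoder inaccuracies do not compound across the prediction horizon."""
    z = encoder(context_frames)             # encode the observed context once
    predictions = []
    for _ in range(horizon):
        z = predictor(z)                    # step forward in latent space
        predictions.append(decoder(z))      # decode for output only
    return torch.stack(predictions, dim=1)  # (batch, horizon, C, H, W)
```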

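As a rough illustration of the second and third ideas, the sketch below shows a CNN block with a learnable, slot-based global memory that is read via attention and written back through a sigmoid gate. The class and parameter names (`SteamBlockSketch`, `n_slots`, `slot_dim`) and the specific read/write rules are illustrative assumptions, not the exact STEAM implementation.

```python
import torch
import torch.nn as nn

class SteamBlockSketch(nn.Module):
    """Illustrative CNN block with a learnable global memory (not the exact
    STEAM block). Memory is read with attention and updated with a gate."""

    def __init__(self, channels, n_slots=16, slot_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(8, channels),
            nn.SiLU(),
        )
        # Learnable memory shared across time steps (idea two).
        self.memory = nn.Parameter(0.02 * torch.randn(n_slots, slot_dim))
        self.to_query = nn.Conv2d(channels, slot_dim, kernel_size=1)
        self.from_read = nn.Conv2d(slot_dim, channels, kernel_size=1)
        self.write_gate = nn.Linear(slot_dim, slot_dim)

    def forward(self, x, memory=None):
        b, _, h, w = x.shape
        # The first step uses the learnable initial memory; later steps carry
        # the updated memory forward (idea three: read/write global context).
        mem = self.memory.expand(b, -1, -1) if memory is None else memory
        feat = self.conv(x)

        # Read: every spatial position attends over the memory slots.
        q = self.to_query(feat).flatten(2).transpose(1, 2)        # (b, hw, d)
        attn = torch.softmax(q @ mem.transpose(1, 2), dim=-1)     # (b, hw, slots)
        read = (attn @ mem).transpose(1, 2).reshape(b, -1, h, w)  # (b, d, h, w)
        out = feat + self.from_read(read)                         # residual read

        # Write: gate each slot between its old content and a summary of the
        # current features, letting the model keep or discard information.
        summary = q.mean(dim=1, keepdim=True)                     # (b, 1, d)
        gate = torch.sigmoid(self.write_gate(mem))                # (b, slots, d)
        new_mem = gate * mem + (1.0 - gate) * summary
        return out, new_mem


block = SteamBlockSketch(channels=64)
x = torch.randn(2, 64, 16, 16)
y, mem = block(x)        # first call initializes the memory
y, mem = block(y, mem)   # later calls read and update the shared context
```

The real block's slot count, write rule, and placement in the network would differ; this sketch only pins down the read/write pattern described above.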