Predict the future semantic masks of moving objects.
Problem Context
Video prediction is often considered a more challenging task than language prediction: given the previous frames of a video, a model must predict what happens next. The difficulty increases further when prediction is combined with semantic segmentation, since the system must both predict the future and segment the semantic objects in the predicted scene. In this project, we were given 22 consecutive video frames, and the task was to predict the semantic masks for all 22 frames using only the first 11. During the course, we used two separate models: UNet for semantic segmentation and SimVP for video prediction. The project is ongoing, and we are exploring a Vector Quantized Variational AutoEncoder (VQVAE) for video representation pre-training and semantic prediction fine-tuning.
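As a point of reference, the sketch below spells out the task's input/output contract. The tensor layout, frame resolution, and class count are illustrative assumptions, not the dataset's actual specification.

```python
import torch

B = 4                  # batch size (illustrative)
T_IN, T_OUT = 11, 22   # observed frames in, masks to predict out
C, H, W = 3, 160, 240  # RGB frames; resolution is an assumption
NUM_CLASSES = 49       # hypothetical number of semantic classes

# Input: only the first 11 RGB frames of each clip.
frames = torch.randn(B, T_IN, C, H, W)

# Target: per-pixel semantic masks for all 22 frames.
masks = torch.randint(0, NUM_CLASSES, (B, T_OUT, H, W))
```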
Implementation
The training scheme of our initial attempt proceeds in three stages. First, we trained a UNet for semantic segmentation with cross-entropy loss. Second, we employed SimVP, an architecture consisting of a spatial encoder, a temporal translator, and a spatial decoder, to predict future frames at the pixel level. Finally, we combined the two models and fine-tuned the spatial decoder on semantic segmentation while freezing the first two components, as sketched below.
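The following is a minimal PyTorch sketch of the third stage. The module definitions are toy stand-ins for the trained networks; their names, channel widths, and the class count are assumptions for illustration, not the project's actual code. The point it demonstrates is the freezing pattern: the pretrained spatial encoder and temporal translator are held fixed while the spatial decoder is re-headed and fine-tuned with cross-entropy loss against semantic masks.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the trained networks. Names, channel widths, and the
# class count are illustrative assumptions, not the project's actual code.
HID, T_IN, T_OUT, NUM_CLASSES = 32, 11, 22, 49

class SpatialEncoder(nn.Module):
    """Per-frame feature extractor (pretrained in stage 2; frozen here)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, HID, 3, padding=1)

    def forward(self, x):               # x: (B*T_IN, 3, H, W)
        return self.net(x)

class TemporalTranslator(nn.Module):
    """Maps 11 observed feature maps to 22 future ones (frozen here)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(T_IN * HID, T_OUT * HID, 3, padding=1)

    def forward(self, x, b):            # x: (B*T_IN, HID, H, W)
        _, _, h, w = x.shape
        x = self.net(x.reshape(b, T_IN * HID, h, w))  # stack time as channels
        return x.reshape(b * T_OUT, HID, h, w)

class SpatialDecoder(nn.Module):
    """Decoder re-headed to emit per-pixel class logits (fine-tuned)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(HID, NUM_CLASSES, 3, padding=1)

    def forward(self, x):
        return self.net(x)

B, H, W = 2, 40, 60  # small spatial size keeps the demo light
encoder, translator, decoder = SpatialEncoder(), TemporalTranslator(), SpatialDecoder()

# Stage 3: freeze the pretrained encoder and translator, tune only the decoder.
for module in (encoder, translator):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

frames = torch.randn(B, T_IN, 3, H, W)                    # observed clip
masks = torch.randint(0, NUM_CLASSES, (B * T_OUT, H, W))  # dummy targets

feats = encoder(frames.flatten(0, 1))   # (B*T_IN, HID, H, W)
future = translator(feats, B)           # (B*T_OUT, HID, H, W)
logits = decoder(future)                # (B*T_OUT, NUM_CLASSES, H, W)
loss = criterion(logits, masks)
loss.backward()                         # gradients reach only the decoder
optimizer.step()
```

Re-heading only the decoder keeps the pretrained pixel-level dynamics intact while letting the segmentation objective reshape the output layer, which is the rationale for freezing the first two components.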