Anticipating future events is an important prerequisite towards intelligent behavior. Video forecasting has been studied as a proxy task towards this goal. Recent work has shown that to predict semantic segmentation of future frames, forecasting at the semantic level is more effective than forecasting RGB frames and then segmenting these. In this paper we consider the more challenging problem of future instance segmentation, which additionally segments out individual objects. To deal with a varying number of output labels per image, we develop a predictive model in the space of fixed-sized convolutional features of the Mask R-CNN instance segmentation model. We apply the "detection head" of Mask R-CNN on the predicted features to produce the instance segmentation of future frames. Experiments show that this approach significantly improves over baselines based on optical flow.
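To make the forecasting idea concrete, below is a minimal PyTorch sketch of a feature-to-feature predictor. The module name, layer widths and number of input frames are illustrative, not taken from our released code; the actual F2F model forecasts each level of Mask R-CNN's FPN features with a multi-scale architecture, and the pre-trained detection head is then applied to the predicted features.

```python
import torch
import torch.nn as nn

class F2FPredictor(nn.Module):
    """Illustrative sketch: regress the convolutional features of a future
    frame from the features of the past frames (concatenated along the
    channel axis). The real model is multi-scale and forecasts every FPN
    level of Mask R-CNN; this single conv stack is a simplification."""

    def __init__(self, feat_channels=256, n_input_frames=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_channels * n_input_frames, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, feat_channels, kernel_size=3, padding=1),
        )

    def forward(self, past_feats):
        # past_feats: list of (B, C, H, W) feature tensors, one per input frame
        return self.net(torch.cat(past_feats, dim=1))
```

In this setup only the predictor is trained, e.g. with an L2 loss between the predicted features and the features that the frozen Mask R-CNN backbone extracts from the true future frame; the backbone and detection head are kept fixed.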
Here we show the videos corresponding to the figures in our paper. In each example, we display the mid-term predictions (0.5 seconds ahead) for future instance segmentation from the following methods: the oracle (Mask R-CNN), our Warp baseline, and our F2F method. We also show the future semantic segmentations predicted by our re-implementation of the S2S method, as well as the semantic segmentations obtained by converting the predictions of Warp and of F2F. We superimpose the predictions on the corresponding real RGB frames and frame them in red. We also precede each sequence with the last two input images, superimposed with the oracle predictions for that task.
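This page does not spell out how instance predictions are converted to semantic segmentations; a simple, plausible rule is to rasterize the predicted masks so that, at each pixel, the most confident instance determines the class. A minimal sketch with illustrative names:

```python
import numpy as np

def instances_to_semantic(masks, classes, scores, shape, background=0):
    """Rasterize instance predictions (boolean masks with class labels and
    confidence scores) into a semantic map. Painting in increasing score
    order lets higher-confidence instances win at overlapping pixels.
    Pixels covered by no instance keep the background label; the actual
    conversion may additionally merge in background-class predictions."""
    semantic = np.full(shape, background, dtype=np.int64)
    for idx in np.argsort(scores):
        semantic[masks[idx]] = classes[idx]
    return semantic
```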
We provide additional examples here.
You can also find our models' predictions here. (Code coming soon!)
Here we explore the limits of our F2F model by predicting 10 frames ahead, i.e. just under 1.8 seconds. For each example, we show the instance segmentation prediction. In these examples, we have implemented a simple tracking algorithm to yield temporally consistent predictions.
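The tracking procedure is not detailed on this page; the sketch below shows one simple approach, greedy IoU matching of predicted masks between consecutive predicted frames, which is not necessarily the exact algorithm we used. All names and the threshold value are illustrative.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def track_instances(prev_masks, prev_ids, curr_masks, next_id, iou_thresh=0.3):
    """Assign instance ids to `curr_masks` by greedy IoU matching against
    the previous frame: the best-overlapping pairs inherit the previous id,
    and unmatched masks receive fresh ids."""
    ids = [None] * len(curr_masks)
    candidates = sorted(
        ((mask_iou(p, c), i, j)
         for i, p in enumerate(prev_masks)
         for j, c in enumerate(curr_masks)),
        reverse=True,
    )
    matched_prev = set()
    for iou, i, j in candidates:
        if iou < iou_thresh:
            break  # remaining pairs overlap too little to be the same object
        if i in matched_prev or ids[j] is not None:
            continue  # each mask participates in at most one match
        ids[j] = prev_ids[i]
        matched_prev.add(i)
    for j, assigned in enumerate(ids):
        if assigned is None:
            ids[j] = next_id
            next_id += 1
    return ids, next_id
```

Running this frame by frame over the 10 predicted steps keeps a consistent id (and hence a consistent color in the videos) for each object as long as its predicted masks keep overlapping.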