Predicting Future Instance Segmentations by Forecasting Convolutional Features

Pauline Luc, Camille Couprie, Yann LeCun and Jakob Verbeek

Anticipating future events is an important prerequisite towards intelligent behavior. Video forecasting has been studied as a proxy task towards this goal. Recent work has shown that to predict semantic segmentation of future frames, forecasting at the semantic level is more effective than forecasting RGB frames and then segmenting these. In this paper we consider the more challenging problem of future instance segmentation, which additionally segments out individual objects. To deal with a varying number of output labels per image, we develop a predictive model in the space of fixed-sized convolutional features of the Mask R-CNN instance segmentation model. We apply the "detection head" of Mask R-CNN on the predicted features to produce the instance segmentation of future frames. Experiments show that this approach significantly improves over baselines based on optical flow.

Here we show the videos corresponding to the figures in our paper. In each example, we display the mid-term predictions (0.5 seconds ahead) for future instance segmentation from the following methods: the oracle (Mask R-CNN), our Warp baseline, and our F2F method. We also show the future semantic segmentations predicted by our re-implementation of the S2S method, as well as the predictions of Warp and F2F converted to semantic segmentation. We superimpose the predictions on the corresponding real RGB frames and frame them in red. We also precede each sequence with the last two input images, superimposed with the oracle predictions for that task.
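As an illustration of how instance predictions can be converted to a semantic segmentation, here is a minimal sketch. It is not the conversion used in the paper. It assumes each predicted instance comes with a binary mask, a class id and a confidence score, and that overlaps are resolved in favour of the higher-scoring instance. The name `instances_to_semantic` and the `BACKGROUND` label are hypothetical.

```python
from typing import List, Sequence

BACKGROUND = 0  # assumed label for pixels not covered by any instance


def instances_to_semantic(
    masks: Sequence[Sequence[Sequence[bool]]],
    class_ids: Sequence[int],
    scores: Sequence[float],
    height: int,
    width: int,
) -> List[List[int]]:
    """Collapse per-instance binary masks into one per-pixel class map."""
    semantic = [[BACKGROUND] * width for _ in range(height)]
    # Paint low-confidence instances first so that higher-confidence
    # instances overwrite them where masks overlap (an assumed rule).
    order = sorted(range(len(masks)), key=lambda i: scores[i])
    for i in order:
        for y in range(height):
            for x in range(width):
                if masks[i][y][x]:
                    semantic[y][x] = class_ids[i]
    return semantic
```

Painting in ascending score order keeps the logic to a single pass per instance. Other tie-breaking rules, such as per-pixel mask probabilities, would be equally plausible.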

We provide additional examples here.

You can also find our models' predictions here. (Code coming soon!)

Figures 1 & 4 - Qualitative comparison

[Videos: five examples. Each shows instance segmentation from Oracle, Warp, and F2F, and semantic segmentation from S2S, Warp, and F2F.]



Figure 5 - Incorrect instance masks can lead to acceptable semantic segmentation

[Videos: two examples. Each shows instance segmentation from Oracle, Warp, and F2F, and semantic segmentation from S2S, Warp, and F2F.]



Figure 6 - Failure modes of F2F

[Videos: three examples. Each shows instance segmentation from Oracle, Warp, and F2F, and semantic segmentation from S2S, Warp, and F2F.]



Figure 7 - Longer predictions (1.5 seconds) from our F2F model

Here we explore the limits of our F2F model by predicting 10 frames, i.e. just under 1.8 seconds. For each example, we show the predicted instance segmentation. In these examples, we have implemented a simple tracking algorithm to yield temporally consistent predictions.
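The tracking algorithm itself is not described on this page. As one plausible sketch of how temporally consistent instance ids could be obtained, the code below greedily matches masks between consecutive frames by intersection-over-union (IoU). The function names, the representation of masks as sets of pixel coordinates, and the `min_iou` threshold are all assumptions made for illustration. This is not the authors' implementation.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection-over-union of two masks given as pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def track(frames: List[List[Set[Pixel]]], min_iou: float = 0.3) -> List[List[int]]:
    """Assign a track id to every mask so that ids persist across frames."""
    next_id = 0
    all_ids: List[List[int]] = []
    prev: List[Tuple[int, Set[Pixel]]] = []  # (track id, mask) from the previous frame
    for masks in frames:
        # Score every (previous, current) pair and match the best overlaps first.
        pairs = sorted(
            (
                (iou(prev_mask, mask), pi, ci)
                for pi, (_, prev_mask) in enumerate(prev)
                for ci, mask in enumerate(masks)
            ),
            reverse=True,
        )
        ids = [-1] * len(masks)
        used_prev = set()
        for score, pi, ci in pairs:
            if score < min_iou:
                break  # remaining pairs overlap too little to be the same object
            if pi in used_prev or ids[ci] != -1:
                continue
            ids[ci] = prev[pi][0]
            used_prev.add(pi)
        # Unmatched masks start new tracks.
        for ci in range(len(masks)):
            if ids[ci] == -1:
                ids[ci] = next_id
                next_id += 1
        prev = list(zip(ids, masks))
        all_ids.append(ids)
    return all_ids
```

With consistent ids, each instance can keep the same display colour from one predicted frame to the next.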