Will person detection help bag-of-features action recognition?

One limitation of the bag-of-features (BoF) approach is that, due to its orderless representation, it has no explicit notion of objects or actors. This lack of explicit object knowledge prevents the modeling of spatial layout information, which can help to increase performance. Furthermore, BoF provides a global video representation, which is inherently sensitive to background clutter. Human-centric (or holistic) approaches, on the other hand, inherently model spatial layout information and are robust to background variations since they are based on human detections or tracks.

In this collaborative work with Marcin Marszałek and Ivan Laptev, we explore a method that combines a "loose" bag-of-features model with a human-centric approach in order to benefit from the strengths of both. To this end, we investigate how tracking human actors can address the aforementioned deficiencies of the BoF representation and to what extent it can improve action recognition performance.

Our experiments show that action recognition can benefit from human localization in videos. Quite surprisingly, this gain is not due to suppressing background clutter. Background suppression helps to improve classification results only in simple scenarios. In realistic settings, removing the background can also remove valuable context. As a result, background suppression yielded only minor improvements in recognition accuracy overall, and for a few action classes (getting out of a car, kissing) we even observed a performance degradation. In contrast, narrowing the attention to human actors allows more layout information to be incorporated into the learned model. In general, this improved recognition accuracy, although for some action classes we observed no or only minor improvement.
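
To make the two effects discussed above more concrete, here is a minimal sketch in Python. It is not the implementation used in this work. It assumes that local features have already been quantized into visual-word indices, that a single fixed person box stands in for a human track, and that layout is encoded with a simple spatial grid over that box. The names bof_histogram and person_centric_bof, the (x0, y0, x1, y1) box format, and the 2x2 grid are all hypothetical choices made for illustration.

```python
import numpy as np

def bof_histogram(words, vocab_size):
    """Orderless bag-of-features: an L1-normalized count of visual words."""
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist

def person_centric_bof(positions, words, box, vocab_size, grid=(2, 2)):
    """Drop features outside the person box (background suppression), then
    build one histogram per grid cell of the box (layout information) and
    concatenate them. All names and the box format are illustrative."""
    x0, y0, x1, y1 = box
    x, y = positions[:, 0], positions[:, 1]
    inside = (x >= x0) & (x < x1) & (y >= y0) & (y < y1)
    x, y, words = x[inside], y[inside], words[inside]
    rows, cols = grid
    # Map each remaining feature to a grid cell relative to the box.
    col = np.minimum(((x - x0) / (x1 - x0) * cols).astype(int), cols - 1)
    row = np.minimum(((y - y0) / (y1 - y0) * rows).astype(int), rows - 1)
    cells = []
    for r in range(rows):
        for c in range(cols):
            in_cell = (row == r) & (col == c)
            cells.append(bof_histogram(words[in_cell], vocab_size))
    return np.concatenate(cells)
```

Calling bof_histogram on all features of a video gives the global, layout-free representation, while person_centric_bof discards everything outside the box. That includes any context that might have been informative, which is one way to picture why background suppression can hurt in realistic settings.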

For more information, see also:

Our technical report
My PhD thesis (chapter 6)

by Alexander Kläser 2010