In this collaborative work with Marcin Marszałek, we address the problem of localizing human actions both in space (the 2D image region) and in time (the temporal range). As data, we use real-world movies with crowded, dynamic environments, partial occlusion, and cluttered backgrounds. As is well known from the results of the PASCAL Visual Object Classes challenges, localization is a much more demanding task than classification.
To accomplish this task, we propose an approach that explicitly splits action localization into two stages. In the first stage, humans are detected and tracked; this determines the spatial localization of the action. In the second stage, given the human tracks, we determine whether and when the action occurs (temporal localization) by using a sliding-window classifier based on a novel spatio-temporal, track-adapted 3D-HOG descriptor.
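The second stage can be pictured as a temporal sliding window run along each human track. The following is only a minimal sketch under stated assumptions, not the implementation used in this work: `score_fn` is a hypothetical stand-in for the classifier applied to the track-adapted 3D-HOG descriptor, the per-frame features are placeholders, and the temporal non-maximum suppression step is an assumption added for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# A detection is (start_frame, end_frame, score); end_frame is exclusive.
Detection = Tuple[int, int, float]


def sliding_window_detections(
    track_features: Sequence[float],
    score_fn: Callable[[Sequence[float]], float],
    window_lengths: Sequence[int],
    stride: int,
    threshold: float,
) -> List[Detection]:
    """Score every temporal window along one track and keep those above threshold.

    `score_fn` is a hypothetical placeholder for a trained action classifier.
    """
    detections: List[Detection] = []
    n_frames = len(track_features)
    for length in window_lengths:
        for start in range(0, n_frames - length + 1, stride):
            window = track_features[start:start + length]
            score = score_fn(window)
            if score >= threshold:
                detections.append((start, start + length, score))
    return detections


def temporal_iou(a: Detection, b: Detection) -> float:
    """Intersection over union of two temporal intervals."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def temporal_nms(detections: List[Detection], max_overlap: float = 0.3) -> List[Detection]:
    """Greedy non-maximum suppression in time (an assumed post-processing step)."""
    kept: List[Detection] = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(temporal_iou(det, k) <= max_overlap for k in kept):
            kept.append(det)
    return kept


def mean_score(window: Sequence[float]) -> float:
    """Toy scoring function: average of placeholder per-frame values."""
    return sum(window) / len(window)
```

With a toy track such as `[0.0] * 10 + [1.0] * 10 + [0.0] * 10`, calling `sliding_window_detections(track, mean_score, [10], 5, 0.4)` and then `temporal_nms` on the result would keep only the window covering the middle ten frames. Because the windows are defined over a track rather than the full video, the same tracks can be rescored with a different `score_fn` for each action class.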
The advantage of this two-stage design is that, for each dataset, the same tracks can be reused to localize different types of actions. This allows natural human actions to be recognized effectively in challenging environments. Our results outperform the current state of the art; they are illustrated in the plots and detection samples below.
The first plot shows precision-recall curves comparing our results to previously reported results for the actions drinking and smoking:
Here are the top 12 drinking detections:
The top drinking detections are also shown as a movie; note that these results were obtained fully automatically:
The top 12 smoking detections:
This precision-recall plot shows results for the actions answering phone and standing up on our Hollywood-Localization dataset:
The top 12 answering phone detections:
And the top detections as a movie:
The top 12 standing up detections:
And the top detections as a movie:
For more information, see also: