This paper addresses the task of group activity recognition in multi-person
videos. Existing approaches decompose this task into feature learning and
relational reasoning. Despite showing progress, these methods rely only on
appearance features for people and overlook the available contextual
information, which can play an important role in group activity understanding.
In this work, we focus on the feature learning aspect and propose a two-stream
architecture that not only considers person-level appearance features, but also
makes use of contextual information present in videos for group activity
recognition. In particular, we propose to use two types of contextual
information, pose context and scene context, which benefit two different
scenarios and provide crucial cues for group activity understanding. We combine
appearance and contextual features to encode each person with an enriched
representation. Finally, these combined features are used in relational
reasoning for predicting group activities. We evaluate our method on two
benchmarks, Volleyball and Collective Activity, and show that jointly modeling
contextual information with appearance features benefits group activity
understanding.
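
The pipeline summarized above (enrich each person's appearance features with context, then reason over relations between people) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the fusion operator, so concatenation is used here; the relational step is a generic dot-product attention stand-in rather than the paper's module; and the function names, the per-person scene features, and the feature sizes are hypothetical.

```python
import math
import random


def fuse_person_features(appearance, pose_context, scene_context):
    """Concatenate the three per-person feature lists (assumed fusion operator)."""
    return [a + p + s for a, p, s in zip(appearance, pose_context, scene_context)]


def relational_reasoning(people):
    """Add to each person a softmax-weighted sum of all people's features.

    Generic dot-product attention, used only as a stand-in for a relational module.
    """
    dim = len(people[0])
    refined = []
    for query in people:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in people]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        context = [sum(w * other[d] for w, other in zip(weights, people))
                   for d in range(dim)]
        refined.append([x + c for x, c in zip(query, context)])
    return refined


def pool_group_feature(people):
    """Max-pool over people to obtain one group-level feature vector."""
    return [max(column) for column in zip(*people)]


# Hypothetical sizes: 6 people, 8 appearance, 3 pose, and 4 scene dimensions.
random.seed(0)
num_people = 6
appearance = [[random.gauss(0, 1) for _ in range(8)] for _ in range(num_people)]
pose_context = [[random.gauss(0, 1) for _ in range(3)] for _ in range(num_people)]
scene_context = [[random.gauss(0, 1) for _ in range(4)] for _ in range(num_people)]

fused = fuse_person_features(appearance, pose_context, scene_context)
refined = relational_reasoning(fused)
group_feature = pool_group_feature(refined)
```

In this sketch the group-level vector would then be passed to a classifier over activity labels; that classifier is omitted because the abstract gives no details about it.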