Current state-of-the-art approaches to action recognition emphasize training ConvNets on large amounts of data, using 3D convolutions to process the temporal dimension. These 3D convolutions are memory-intensive and constitute a major performance bottleneck for existing methods. Moreover, video inputs typically mix irrelevant content with useful features, which limits the level of detail that networks can process regardless of the quality of the original video. Models that can focus computational resources on the relevant training signal are therefore desirable. To address this problem, we rely on network-specific saliency outputs to drive an attention model that produces tighter crops around relevant video regions. We experimentally validate this approach and show that it improves performance on the action recognition task.
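To make the idea of saliency-driven cropping concrete, the following is a minimal sketch, not the paper's exact pipeline: it assumes a gradient-based saliency map taken from a pretrained recognition network over a clip of shape (1, C, T, H, W), and all names (saliency_map, saliency_bbox, crop_clip) and the 0.5 relative threshold are illustrative assumptions rather than details from this work.

import torch

def saliency_map(model, clip, label):
    # Per-pixel saliency: |d score_label / d input|, reduced over channels and time.
    # clip: float tensor of shape (1, C, T, H, W); label: int class index.
    clip = clip.detach().clone().requires_grad_(True)
    score = model(clip)[0, label]          # assumes model outputs (1, num_classes) logits
    score.backward()
    # Collapse channel and time dimensions -> (H, W) saliency heat map.
    return clip.grad.abs().amax(dim=(1, 2)).squeeze(0)

def saliency_bbox(saliency, rel_thresh=0.5):
    # Bounding box (y0, y1, x0, x1) of pixels above rel_thresh * max saliency.
    mask = saliency >= rel_thresh * saliency.max()
    ys, xs = torch.nonzero(mask, as_tuple=True)
    return ys.min().item(), ys.max().item() + 1, xs.min().item(), xs.max().item() + 1

def crop_clip(clip, bbox):
    # Apply the same spatial crop to every frame of a (1, C, T, H, W) clip.
    y0, y1, x0, x1 = bbox
    return clip[..., y0:y1, x0:x1]

In this sketch the cropped clip would then be resized and fed back to the recognition network, so that its capacity is spent on the salient region rather than on background pixels.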