|9:45||registration + coffee|
|10:00 - 11:00||Thomas Brox||Bored by Classification ConvNets? End-to-end Learning of other Computer Vision Tasks Abstract [slides]|
Convolutional Networks have suddenly become very popular in computer vision since they ticked off some major challenges of recent years: feature design, transfer learning, object classification. Will the conquest of ConvNets stop here? Most likely not. I will present our latest networks for three very different computer vision tasks: image generation, image segmentation, and optical flow estimation. All three networks can do surprising things although they have a disarmingly simple structure.
|11:00 - 11:15||coffee|
|11:15 - 12:15||Jason Corso||Toward the Who and Where of Action Recognition Abstract [slides]|
Action recognition has been hotly studied in computer vision for more than two decades. Recent action recognition systems are adept at classifying web videos in a closed world of action categories. But next-generation cognitive systems will require far more than action classification. Full action recognition requires not only classifying the action, but also localizing it and potentially even finely segmenting its boundaries. It requires focusing not only on human action but also on the action of other agents in the environment, such as animals or vehicles. In this talk, I will describe our recent work toward these more rigorous aspects of action recognition. Our work is the first effort in the computer vision community to jointly consider various types of actors undergoing various actions. We consider seven actor types and eight action types in three action understanding problems: single-label action classification, multi-label action classification, and actor-action joint semantic segmentation. We propose graduated strata of models and analyze the performance of each on all three tasks. The talk will thoroughly discuss these models, the results, and a new dataset that we released to support these more rigorous action understanding problems. The talk covers work appearing at CVPR 2015 as well as new material.
|12:15 - 14:00||lunch (for registered participants)|
|14:00 - 15:00||Marco Baroni||Grounding word representations in the visual world [slides]|
|15:00 - 15:30||coffee|
|15:30 - 16:30||Andrew Zisserman||Human Pose Estimation in Videos and Spatial Transformers [slides: part 1 part 2]|
|16:30 - 19:30||Kinovis demo & posters (with wine + cheese)|
Relja Arandjelovic, PowerPCA: Dimensionality Reduction for Nearest Neighbour Search
Guilhem Cheron, P-CNN: Pose-based CNN Features for Action Recognition
Minsu Cho, Unsupervised Object Discovery and Localization in the Wild
Bumsub Ham, Robust Image Filtering Using Joint Static and Dynamic Guidance
Yang Hua, Online Object Tracking with Proposal Selection
Vicky Kalogeiton, Analysing domain shift factors between videos and images for object detection
Suha Kwak, Unsupervised Object Discovery and Tracking in Video Collections
Diane Larlus, Fisher Vectors Meet Neural Networks: A Hybrid Classification Architecture
Hongzhou Lin, A Universal Catalyst for First-Order Optimization
Julien Mairal, Convolutional Kernel Networks
Jerome Revaud, EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow
Gregory Rogez, First-Person Pose Recognition Using Egocentric Workspaces
Guillaume Seguin, Multi-instance video segmentation from object tracks
Matthew Trager, Visual hulls and duality
Tuan-Hung Vu, Context-aware CNNs for person detection
Philippe Weinzaepfel, Learning to Detect Motion Boundaries
|9:45||welcome + coffee|
|10:00 - 11:00||Cees Snoek||What objects tell about actions Abstract [slides]|
This talk is about automatic classification and localization of human actions in video. Whereas motion is the key ingredient in modern approaches, we assess the benefits of having objects in the video representation. Rather than considering a handful of carefully selected and localized objects, we conduct an empirical study on the benefit of encoding 15,000 object categories for action using 6 datasets totaling more than 200 hours of video and covering 180 action classes. Our key contributions are: i) the first in-depth study of encoding objects for actions; ii) we show that objects matter for actions and are often semantically relevant as well; iii) we establish that actions have object preferences: rather than using all objects, selection is advantageous for action recognition; iv) we reveal that object-action relations are generic, which allows transferring these relationships from one domain to another; and v) objects, when combined with motion, improve the state of the art for both action classification and localization.
This is joint work with Mihir Jain and Jan van Gemert.
|11:00 - 11:15||coffee|
|11:15 - 12:15||Patrick Perez||On learned visual embedding Abstract [slides]|
Once described with state-of-the-art techniques, images and image fragments are turned into fixed-size, high-dimensional real-valued vectors that can be used in a number of ways. In particular, they can be compared or analyzed in a meaningful way. Yet, it is often beneficial to further encode these descriptors: such a final encoding is learned to obtain speed and/or performance gains. We shall put this generic mechanism to work on three distinct problems: image search by non-linear similarity; image search and classification based on Euclidean distance; and face track verification. The corresponding encodings are based on kernel PCA, exemplar SVMs, and latent metric learning, respectively.
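As a rough illustration of the kernel-PCA encoding idea mentioned in the abstract, the sketch below compresses high-dimensional descriptors into a short non-linear code and searches it by Euclidean distance. This is a minimal scikit-learn sketch on synthetic data, not the speaker's actual pipeline; all dimensions and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Synthetic stand-ins for high-dimensional image descriptors
# (e.g. 100 database images with 512-D descriptors).
rng = np.random.default_rng(0)
database = rng.normal(size=(100, 512))
query = rng.normal(size=(1, 512))

# Learn a compact non-linear encoding with kernel PCA (RBF kernel).
encoder = KernelPCA(n_components=32, kernel="rbf", gamma=1.0 / 512)
db_codes = encoder.fit_transform(database)  # shape (100, 32)
q_code = encoder.transform(query)           # shape (1, 32)

# Nearest-neighbour search by Euclidean distance in the learned space.
dists = np.linalg.norm(db_codes - q_code, axis=1)
nearest = int(np.argmin(dists))
```

The point of the encoding step is that distances in the learned low-dimensional space can approximate a non-linear similarity on the original descriptors while being much cheaper to compute and store.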