|9:45||registration + coffee|
|10:00 - 11:00||Thomas Brox||Bored by Classification ConvNets? End-to-end Learning of other Computer Vision Tasks Abstract [slides]|
Convolutional Networks have suddenly become very popular in computer vision since they ticked off some major challenges of recent years: feature design, transfer learning, object classification. Will the conquest of ConvNets stop here? Most likely not. I will present our latest networks for three very different computer vision tasks: image generation, image segmentation, and optical flow estimation. All three networks can do surprising things although they have a disarmingly simple structure.
|11:00 - 11:15||coffee|
|11:15 - 12:15||Jason Corso||Toward the Who and Where of Action Recognition Abstract [slides]|
Action recognition has been hotly studied in computer vision for more than two decades. Recent action recognition systems are adept at classifying web videos in a closed world of action categories. But next-generation cognitive systems will require far more than action classification. Full action recognition requires not only classifying the action, but also localizing it and potentially even finely segmenting its boundaries. It requires focusing not only on human action but also on the action of other agents in the environment, such as animals or vehicles. In this talk, I will describe our recent work toward these more rigorous aspects of action recognition. Our work is the first effort in the computer vision community to jointly consider various types of actors undergoing various actions. We consider seven actor types and eight action types in three action understanding problems: single-label action classification, multi-label action classification, and actor-action joint semantic segmentation. We propose graduated strata of models and analyze the performance of each on all three tasks. The talk will thoroughly discuss these models, the results, and a new dataset that we released to support these more rigorous action understanding problems. The talk covers work appearing at CVPR 2015 as well as new material.
|12:15 - 14:00||lunch (for registered participants)|
|14:00 - 15:00||Marco Baroni||Grounding word representations in the visual world [slides]|
|15:00 - 15:30||coffee|
|15:30 - 16:30||Andrew Zisserman||Human Pose Estimation in Videos and Spatial Transformers [slides: part 1 part 2]|
|16:30 - 19:30||Kinovis demo & posters (with wine + cheese)|
Relja Arandjelovic, PowerPCA: Dimensionality Reduction for Nearest Neighbour Search
Guilhem Cheron, P-CNN: Pose-based CNN Features for Action Recognition
Minsu Cho, Unsupervised Object Discovery and Localization in the Wild
Bumsub Ham, Robust Image Filtering Using Joint Static and Dynamic Guidance
Yang Hua, Online Object Tracking with Proposal Selection
Vicky Kalogeiton, Analysing domain shift factors between videos and images for object detection
Suha Kwak, Unsupervised Object Discovery and Tracking in Video Collections
Diane Larlus, Fisher Vectors Meet Neural Networks: A Hybrid Classification Architecture
Hongzhou Lin, A Universal Catalyst for First-Order Optimization
Julien Mairal, Convolutional Kernel Networks
Jerome Revaud, EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow
Gregory Rogez, First-Person Pose Recognition Using Egocentric Workspaces
Guillaume Seguin, Multi-instance video segmentation from object tracks
Matthew Trager, Visual hulls and duality
Tuan-Hung Vu, Context-aware CNNs for person detection
Philippe Weinzaepfel, Learning to Detect Motion Boundaries
|9:45||welcome + coffee|
|10:00 - 11:00||Cees Snoek||What objects tell about actions Abstract [slides]|
This talk is about automatic classification and localization of human actions in video. Whereas motion is the key ingredient in modern approaches, we assess the benefits of having objects in the video representation. Rather than considering a handful of carefully selected and localized objects, we conduct an empirical study on the benefit of encoding 15,000 object categories for action using 6 datasets totaling more than 200 hours of video and covering 180 action classes. Our key contributions are: i) the first in-depth study of encoding objects for actions; ii) we show that objects matter for actions and are often semantically relevant as well; iii) we establish that actions have object preferences: rather than using all objects, selection is advantageous for action recognition; iv) we reveal that object-action relations are generic, which allows transferring these relationships from one domain to another; and v) objects, when combined with motion, improve the state of the art for both action classification and localization.
This is joint work with Mihir Jain and Jan van Gemert.
|11:00 - 11:15||coffee|
|11:15 - 12:15||Patrick Perez||On learned visual embedding Abstract [slides]|
Once described with state-of-the-art techniques, images and image fragments are turned into fixed-size, high-dimensional real-valued vectors that can be used in a number of ways. In particular, they can be compared or analyzed in a meaningful way. Yet, it is often beneficial to further encode these descriptors: such a final encoding is learned to obtain speed and/or performance gains. We shall put this generic mechanism to work on three distinct problems: image search by non-linear similarity; image search and classification based on Euclidean distance; and face track verification. The corresponding encodings are based on kernel PCA, exemplar SVMs, and latent metric learning, respectively.
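As a rough illustration of the kernel-PCA encoding idea mentioned in the abstract, the sketch below compresses high-dimensional descriptors into a short non-linear code and searches it by Euclidean distance. This is a minimal scikit-learn sketch on synthetic data, not the speaker's actual pipeline; all dimensions and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Synthetic stand-ins for high-dimensional image descriptors
# (e.g. 100 database images with 512-D descriptors).
rng = np.random.default_rng(0)
database = rng.normal(size=(100, 512))
query = rng.normal(size=(1, 512))

# Learn a compact non-linear encoding with kernel PCA (RBF kernel).
encoder = KernelPCA(n_components=32, kernel="rbf", gamma=1.0 / 512)
db_codes = encoder.fit_transform(database)  # shape (100, 32)
q_code = encoder.transform(query)           # shape (1, 32)

# Nearest-neighbour search by Euclidean distance in the learned space.
dists = np.linalg.norm(db_codes - q_code, axis=1)
nearest = int(np.argmin(dists))
```

The point of the encoding step is that distances in the learned low-dimensional space can approximate a non-linear similarity on the original descriptors while being much cheaper to compute and store.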