ERC ALLEGRO workshop


Venue: Grand Amphithéâtre, Inria Grenoble - Rhône-Alpes (Montbonnot/Inovallée site: Directions)

Wednesday, 23rd July

9:45   registration + coffee
10:00 - 11:00   Deva Ramanan, UC Irvine Hands, Objects, and Videotape: Recognizing object interactions from streaming wearable cameras Abstract [slides]
This talk will look at various technical issues motivated by the application of recognizing actions from wearable cameras. Such an application raises many technical challenges, including processing of (possibly) infinitely-long temporal streams and the characterization of hand-object manipulations which typically requires detailed 3D understanding. The first part of the talk will consider the task of general action recognition from streaming video footage, casting this problem as one of online parsing of temporal grammars. The second part of the talk will consider the task of detailed 3D object reconstruction, with a focus on manipulatable objects. The talk will conclude with in-progress work that combines both approaches into a system for wearable action recognition. Notably, this last part will also explore the use of wearable depth cameras as a (somewhat) novel sensor for extracting near-field 3D geometry.
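As a rough, hypothetical illustration of the online-parsing idea (the action vocabulary, toy grammar, and scores below are invented for the example, not taken from the talk), a streaming recognizer can maintain best-path scores under a simple action grammar and update them one frame at a time:

    import numpy as np

    states = ["reach", "grasp", "use", "release"]        # hypothetical action vocabulary
    n = len(states)

    # Allowed transitions of the toy grammar (log-probabilities); -inf forbids a move.
    trans = np.full((n, n), -np.inf)
    for i in range(n):
        trans[i, i] = np.log(0.7)                        # stay in the same action
        trans[i, (i + 1) % n] = np.log(0.3)              # or advance to the next action

    def update(log_alpha, frame_log_likelihoods):
        """Consume one frame's per-action evidence; return updated best-path scores."""
        return frame_log_likelihoods + (log_alpha[:, None] + trans).max(axis=0)

    rng = np.random.default_rng(0)
    log_alpha = np.zeros(n)                              # running scores over the stream
    for t in range(5):                                   # simulate a short video stream
        log_alpha = update(log_alpha, rng.normal(size=n))
        print(t, states[int(np.argmax(log_alpha))])      # current best action hypothesis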
11:00 - 11:15   coffee
11:15 - 12:00   Armand Joulin, Stanford University Efficient weakly supervised learning methods in large video collections Abstract [slides]
Natural language descriptions of videos provide a potentially rich and vast source of supervision. However, the highly-varied nature of language presents a major barrier to its effective use. What is needed are models that can reason about uncertainty in both videos and text. In this work, we tackle the core task of person naming: assigning names of people in the cast to human tracks in TV videos. Screenplay scripts accompanying the video provide some crude supervision about who's in the video. However, even the basic problem of knowing who is mentioned in the script is often difficult, since language often refers to people using pronouns (e.g., "he") and nominals (e.g., "man") rather than actual names (e.g., "Susan"). Resolving the identity of these mentions is the task of coreference resolution, which is an active area of research in natural language processing. We develop a joint model for person naming and coreference resolution, and in the process, infer a latent alignment between tracks and mentions. We evaluate our model on both vision and NLP tasks on a new dataset of 19 TV episodes. On both tasks, we significantly outperform the independent baselines. In the second part of the talk, we will briefly discuss methods for co-localization of objects in large datasets.
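As a toy illustration of the alignment sub-problem only (the names and cost matrix below are made up, and this is a plain assignment solver, not the joint model described in the talk), aligning script mentions to person tracks can be posed as a minimum-cost matching:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Hypothetical mentions extracted from a script and person tracks from the video.
    mentions = ["Susan", "he", "man"]
    tracks = ["track_0", "track_1", "track_2"]

    # cost[i, j]: cost of assigning mention i to track j; in a real system this would
    # come from appearance models and coreference cues, here it is hand-written.
    cost = np.array([
        [0.1, 0.9, 0.8],
        [0.7, 0.2, 0.6],
        [0.8, 0.5, 0.3],
    ])

    # Minimum-cost one-to-one alignment between mentions and tracks.
    row_ind, col_ind = linear_sum_assignment(cost)
    for i, j in zip(row_ind, col_ind):
        print(f"{mentions[i]} -> {tracks[j]} (cost {cost[i, j]:.1f})")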
12:00 - 12:45   Ivan Laptev, Inria Weakly-supervised learning from videos and scripts [slides]
12:45 - 14:15   lunch
14:15 - 15:15   Thomas Brox, University of Freiburg Unsupervised Feature Learning and Benchmarking Video Segmentation Abstract [slides]
The talk will consist of two independent parts. The first is about unsupervised feature learning by convolutional neural networks. I will present our latest, still unpublished concept to learn invariant features and show that it clearly outperforms all previous unsupervised feature learning techniques. There will be another surprising result for those who have not seen this talk or the underlying arXiv papers yet. In the second part, I will present two new benchmark datasets that we recently published. One dataset is on motion segmentation and extends the earlier Berkeley Motion Segmentation benchmark. We also improved the evaluation metric. The second benchmark is on general video segmentation, that is, objects do not necessarily move. I will present the metric and show a comparison of current video segmentation methods. This also includes a straightforward method that surprisingly outperforms previous video segmentation approaches on this benchmark.
15:15 - 15:30   coffee
15:30 - 16:15   Karteek Alahari, Inria Occlusion and motion reasoning for tracking and human pose estimation in videos Abstract [slides]
Video provides not only rich visual cues such as motion and appearance, but also much less explored long-range temporal interactions among objects. The first part of the talk will present a method to capture such interactions and to construct an intermediate-level video representation. We also use this representation for tracking objects, and develop a tracking-by-detection approach that exploits occlusion and motion reasoning. This reasoning is based on long-term trajectories, which are labelled as object or background tracks with an energy-based formulation. In the second part of the talk we show the use of temporal constraints for estimating articulated human poses, which is cast as an optimization problem. We present a new approximate scheme to solve it, with two steps dedicated to pose estimation. First, our approach takes into account temporal links with subsequent frames for the less-certain parts, namely elbows and wrists. Second, our method decomposes poses into limbs, generates limb sequences across time, and recomposes poses by mixing these body part sequences.
16:15 - 16:30   break
16:30 - 17:30   Florent Perronnin, XRCE Output Embedding for Large-Scale Visual Recognition Abstract [slides]
The focus of the computer vision community has long been on input embedding: how to transform an image into a suitable descriptor which can be subsequently used as input to simple classifiers such as linear SVMs? In this talk, we will consider the problem of output embedding: how to embed classes in a Euclidean space? We will show that such an embedding is a must for large-scale visual recognition as it enables parameter sharing: this yields classifiers which are more accurate when training data is scarce (including zero-shot recognition) and which are faster to train and evaluate. We will provide a taxonomy of output embeddings: data-independent embeddings, embeddings based on a priori information, or learned embeddings. We will also explain how to measure the compatibility between input embeddings and output embeddings.
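To make the compatibility idea concrete, here is a minimal numerical sketch (the dimensions, theta, phi, and W below are placeholders, not the speaker's model): classes are ranked by a bilinear score between an image embedding and a class embedding, which also applies to classes that have no training images.

    import numpy as np

    rng = np.random.default_rng(0)
    d_img, d_cls, n_classes = 128, 64, 10

    # theta_x: input embedding of one image (e.g., a Fisher vector or CNN feature).
    theta_x = rng.normal(size=d_img)

    # phi: output embedding of each class (e.g., attribute vectors or word embeddings);
    # rows can exist for classes with no training images, enabling zero-shot ranking.
    phi = rng.normal(size=(n_classes, d_cls))

    # W: compatibility matrix, normally learned from labelled data; random stand-in here.
    W = rng.normal(size=(d_img, d_cls))

    # Compatibility score F(x, y) = theta(x)^T W phi(y) for every class y.
    scores = phi @ (W.T @ theta_x)
    predicted_class = int(np.argmax(scores))
    print(predicted_class, scores.shape)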
17:30 - 19:00   Posters (with wine + cheese)
Poster presentations:
Anoop Cherian, Mixing Body-part Sequences for Human Pose Estimation
Minsu Cho, Finding Matches in a Haystack: A Max-Pooling Strategy for Graph Matching in the Presence of Outliers
R. Gokberk Cinbis, Multi-fold MIL Training for Weakly Supervised Object Localization
Yang Hua, Occlusion and motion reasoning for long-term tracking
Naila Murray, Generalized Max Pooling
Dan Oneata, Efficient action localization with approximately normalized Fisher vectors
Mattis Paulin, Transformation pursuit for image classification
Danila Potapov, Category-specific video summarization
Eleonora Vig, Large-Scale Optimization of Hierarchical Features for Saliency Prediction in Natural Images
Tuan-Hung Vu, Predicting Actions from Static Scenes
Philippe Weinzaepfel, DeepFlow: Large displacement optical flow with deep matching

Thursday, 24th July

9:45   welcome + coffee
10:00 - 11:00   Martial Hebert, CMU Learning from 3D Data for Image Interpretation [slides]
11:00 - 11:15   coffee
11:15 - 11:45   Maxime Oquab, Inria Weakly Supervised Object Recognition with Convolutional Neural Networks Abstract [slides]
Successful visual object recognition methods typically rely on training datasets containing lots of richly annotated images. Annotating object bounding boxes is both expensive and subjective. We describe a weakly supervised convolutional neural network (CNN) for object recognition that does not rely on detailed object annotation and yet returns 86.3% mAP on the Pascal VOC classification task, outperforming previous fully supervised systems by a sizeable margin. Despite the lack of bounding box supervision, the network produces maps that clearly localize the objects in cluttered scenes. We also show that adding fully supervised object examples to our weakly-supervised setup does not increase the classification performance.
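A minimal sketch of the weak-supervision mechanism (the dimensions and maps below are dummy stand-ins, and the details are assumed rather than taken from the talk): per-class score maps are reduced to image-level scores by global max pooling, so training needs only image-level labels, while the pooled maxima hint at object locations.

    import numpy as np

    rng = np.random.default_rng(0)
    n_classes, H, W = 20, 14, 14          # e.g. Pascal VOC classes, coarse output grid

    # score_maps: output of a fully convolutional network on one image,
    # one spatial map of class evidence per class (random stand-in here).
    score_maps = rng.normal(size=(n_classes, H, W))

    # Global max pooling turns each map into a single image-level class score,
    # which can be trained against image-level labels with a multi-label loss.
    image_scores = score_maps.reshape(n_classes, -1).max(axis=1)

    # The location of each maximum gives a coarse localization "for free".
    flat_argmax = score_maps.reshape(n_classes, -1).argmax(axis=1)
    peak_locations = np.stack(np.unravel_index(flat_argmax, (H, W)), axis=1)

    print(image_scores.shape, peak_locations.shape)   # (20,), (20, 2)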
11:45 - 12:45   John Canny, UC Berkeley Interactive Machine Learning and a new INRIA Project Abstract
Machine learning is now an essential tool in business and the sciences. Much of the value from Big Data is untapped, and will require interactive tools that support rapid exploration and hypothesis-testing. The current wave of analytic tools for Big Data relies primarily on cluster computing for acceleration. The BID Data project focuses on single-node performance and has developed new, hardware-accelerated learning tools (BIDMach) with much higher efficiency and real-time performance. Single machine speeds for BIDMach on many common learning tasks exceed those of any reported cluster system. Furthermore, we show that these gains can be scaled up on clusters of machines using new, faster communication primitives. Interactive analysis of massive datasets is now possible, and has many advantages. The first part of the talk will overview BIDMach and describe some of the key technologies. The second part of the talk will describe a new INRIA project: language learning for young children by interacting with vision- and speech-enabled agents. This is a work-in-progress that will involve co-recognition of visual and linguistic structures, and should fit well with the current wave of DNN approaches in both domains.