9:45 | registration + coffee | |
10:00 - 11:00 | Deva Ramanan, UC Irvine | Hands, Objects, and Videotape: Recognizing object interactions from streaming wearable cameras Abstract [slides] |
This talk will look at various technical issues motivated by the application of recognizing actions from wearable cameras. Such an application raises many challenges, including the processing of (possibly) infinitely long temporal streams and the characterization of hand-object manipulations, which typically requires detailed 3D understanding. The first part of the talk will consider the task of general action recognition from streaming video footage, casting this problem as one of online parsing of temporal grammars. The second part of the talk will consider the task of detailed 3D object reconstruction, with a focus on manipulatable objects. The talk will conclude with in-progress work that combines both approaches into a system for wearable action recognition. Notably, this last part will also explore the use of wearable depth cameras as a (somewhat) novel sensor for extracting near-field 3D geometry.
| ||
11:00 - 11:15 | coffee | |
11:15 - 12:00 | Armand Joulin, Stanford University | Efficient weakly supervised learning methods in large video collections Abstract [slides] |
Natural language descriptions of videos provide a potentially rich and vast source of supervision. However, the highly varied nature of language presents a major barrier to its effective use. What is needed are models that can reason about uncertainty in both video and text. In this work, we tackle the core task of person naming: assigning names of people in the cast to human tracks in TV videos. Screenplay scripts accompanying the video provide some crude supervision about who is in the video. However, even the basic problem of knowing who is mentioned in the script is often difficult, since language often refers to people using pronouns (e.g., "he") and nominals (e.g., "man") rather than actual names (e.g., "Susan"). Resolving the identity of these mentions is the task of coreference resolution, an active area of research in natural language processing. We develop a joint model for person naming and coreference resolution, and in the process infer a latent alignment between tracks and mentions. We evaluate our model on both vision and NLP tasks on a new dataset of 19 TV episodes. On both tasks, we significantly outperform the independent baselines.
In the second part of the talk, we will briefly discuss methods for co-localization of objects in large datasets.
| ||
12:00 - 12:45 | Ivan Laptev, Inria | Weakly-supervised learning from videos and scripts [slides] |
12:45 - 14:15 | lunch | |
14:15 - 15:15 | Thomas Brox, University of Freiburg | Unsupervised Feature Learning and Benchmarking Video Segmentation Abstract [slides] |
The talk will consist of two independent parts. The first is about unsupervised feature learning by convolutional neural networks. I will present our latest, still unpublished approach to learning invariant features and show that it clearly outperforms all previous unsupervised feature learning techniques. There will be another surprising result for those who have not yet seen this talk or the underlying arXiv papers.
In the second part, I will present two new benchmark datasets that we recently published. One dataset is on motion segmentation and extends the earlier Berkeley Motion Segmentation benchmark; we also improved the evaluation metric. The second benchmark is on general video segmentation, that is, segmentation where objects do not necessarily move. I will present the metric and show a comparison of current video segmentation methods, including a straightforward method that surprisingly outperforms previous approaches on this benchmark.
| ||
15:15 - 15:30 | coffee | |
15:30 - 16:15 | Karteek Alahari, Inria | Occlusion and Motion Reasoning for Tracking, and Human Pose Estimation in Videos Abstract [slides] |
Video provides not only rich visual cues, such as motion and appearance, but also much less explored long-range temporal interactions among objects. The first part of the talk will present a method to capture such interactions and to construct an intermediate-level video representation. We also use these interactions for tracking objects and develop a tracking-by-detection approach that exploits occlusion and motion reasoning. This reasoning is based on long-term trajectories, which are labelled as object or background tracks with an energy-based formulation.
In the second part of the talk we show the use of temporal constraints for estimating articulated human poses, which is cast as an optimization problem. We present a new approximate scheme to solve it, with two steps dedicated to pose estimation. First, our approach takes into account temporal links with subsequent frames for the less-certain parts, namely elbows and wrists. Second, our method decomposes poses into limbs, generates limb sequences across time, and recomposes poses by mixing these body part sequences.
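A minimal sketch of the limb-sequence idea in the second step, assuming per-frame candidate limb placements with detection scores and a pairwise temporal smoothness penalty; all names and functions below are illustrative placeholders, not the method's actual code:

    # Illustrative sketch only: track each limb independently across frames with a
    # Viterbi-style dynamic program (detection score vs. temporal smoothness),
    # then mix the selected limb sequences back into per-frame poses.

    def best_limb_sequence(candidates, unary, smoothness):
        """candidates[t]: candidate placements for one limb in frame t;
        unary(c): detection score of candidate c (higher is better);
        smoothness(a, b): penalty for jumping from candidate a to b.
        Returns one chosen candidate index per frame."""
        T = len(candidates)
        score = [[unary(c) for c in candidates[0]]]
        back = []
        for t in range(1, T):
            row, ptr = [], []
            for c in candidates[t]:
                best_j, best_val = None, float("-inf")
                for j, prev in enumerate(candidates[t - 1]):
                    val = score[t - 1][j] - smoothness(prev, c)
                    if val > best_val:
                        best_j, best_val = j, val
                row.append(best_val + unary(c))
                ptr.append(best_j)
            score.append(row)
            back.append(ptr)
        # Backtrack from the best final candidate.
        idx = max(range(len(candidates[-1])), key=lambda j: score[-1][j])
        path = [idx]
        for ptr in reversed(back):
            idx = ptr[idx]
            path.append(idx)
        return list(reversed(path))

    def recompose_poses(limb_candidates, unary, smoothness):
        """Pick the best sequence for each limb independently, then recompose
        per-frame poses (all limbs are assumed to cover the same frames)."""
        chosen = {limb: best_limb_sequence(cands, unary, smoothness)
                  for limb, cands in limb_candidates.items()}
        T = len(next(iter(limb_candidates.values())))
        return [{limb: limb_candidates[limb][t][chosen[limb][t]]
                 for limb in limb_candidates} for t in range(T)]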
| ||
16:15 - 16:30 | break | |
16:30 - 17:30 | Florent Perronnin, XRCE | Output Embedding for Large-Scale Visual Recognition Abstract [slides] |
The focus of the computer vision community has long been on input embedding: how to transform an image into a suitable descriptor that can subsequently be used as input to simple classifiers such as linear SVMs. In this talk, we will consider the problem of output embedding: how to embed classes in a Euclidean space. We will show that such an embedding is a must for large-scale visual recognition, as it enables parameter sharing: this yields classifiers that are more accurate when training data is scarce (including zero-shot recognition) and that are faster to train and evaluate. We will provide a taxonomy of output embeddings: data-independent embeddings, embeddings based on a priori information, and learned embeddings. We will also explain how to measure the compatibility between input embeddings and output embeddings.
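As a minimal sketch of what such a compatibility can look like, assuming a bilinear form between an image descriptor theta(x) and a class embedding phi(y) (e.g. an attribute vector); the names and toy numbers below are illustrative, not taken from the talk:

    # Illustrative sketch: score an image against classes via a bilinear
    # compatibility between an input embedding theta(x) and an output (class)
    # embedding phi(y). New classes can be scored at test time by supplying
    # only their phi(y), which is what enables zero-shot recognition.
    import numpy as np

    def compatibility_scores(theta_x, Phi, W):
        """theta_x: (d,) image descriptor; Phi: (C, e) one class embedding per row;
        W: (d, e) learned compatibility matrix. Returns one score per class."""
        return Phi @ (W.T @ theta_x)

    # Toy usage with random numbers, purely to show the shapes involved.
    rng = np.random.default_rng(0)
    d, e, C = 128, 64, 10
    theta_x = rng.normal(size=d)
    Phi = rng.normal(size=(C, e))
    W = rng.normal(size=(d, e))
    predicted_class = int(np.argmax(compatibility_scores(theta_x, Phi, W)))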
| ||
17:30 - 19:00 | Posters (with wine + cheese) | Poster presentations |
Anoop Cherian, Mixing Body-part Sequences for Human Pose Estimation
Minsu Cho, Finding Matches in a Haystack: A Max-Pooling Strategy for Graph Matching in the Presence of Outliers
R. Gokberk Cinbis, Multi-fold MIL Training for Weakly Supervised Object Localization
Yang Hua, Occlusion and motion reasoning for long-term tracking
Naila Murray, Generalized Max Pooling
Dan Oneata, Efficient action localization with approximately normalized Fisher vectors
Mattis Paulin, Transformation pursuit for image classification
Danila Potapov, Category-specific video summarization
Eleonora Vig, Large-Scale Optimization of Hierarchical Features for Saliency Prediction in Natural Images
Tuan-Hung Vu, Predicting Actions from Static Scenes
Philippe Weinzaepfel, DeepFlow: Large displacement optical flow with deep matching |
9:45 | welcome + coffee | |
10:00 - 11:00 | Martial Hebert, CMU | Learning from 3D Data for Image Interpretation [slides] |
11:00 - 11:15 | coffee | |
11:15 - 11:45 | Maxime Oquab, Inria | Weakly Supervised Object Recognition with Convolutional Neural Networks Abstract [slides] |
Successful visual object recognition methods typically rely on training datasets containing large numbers of richly annotated images. Annotating object bounding boxes is both expensive and subjective. We describe a weakly supervised convolutional neural network (CNN) for object recognition that does not rely on detailed object annotation and yet returns 86.3% mAP on the Pascal VOC classification task, outperforming previous fully supervised systems by a sizeable margin. Despite the lack of bounding-box supervision, the network produces maps that clearly localize the objects in cluttered scenes. We also show that adding fully supervised object examples to our weakly supervised setup does not increase the classification performance.
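A minimal sketch of one way image-level supervision can yield localization, assuming per-class convolutional score maps reduced to image-level scores by global max-pooling; the function and array names below are illustrative, not the network described in the talk:

    # Illustrative sketch: given per-class score maps from a convolutional
    # network, global max-pooling produces image-level class scores, so only
    # image-level labels are needed for training; the argmax position gives a
    # coarse localization of each class.
    import numpy as np

    def weak_supervision_readout(score_maps):
        """score_maps: (C, H, W) array of per-class activation maps.
        Returns (image_level_scores, locations), where locations[c] is the
        (row, col) of the strongest response for class c."""
        C, H, W = score_maps.shape
        flat = score_maps.reshape(C, -1)
        image_level_scores = flat.max(axis=1)   # trained against image-level labels
        idx = flat.argmax(axis=1)
        locations = np.stack(np.unravel_index(idx, (H, W)), axis=1)
        return image_level_scores, locations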
| ||
11:45 - 12:45 | John Canny, UC Berkeley | Interactive Machine Learning and a new INRIA Project Abstract |
Machine learning is now an essential tool in business and the sciences. Much of the value from Big Data is untapped and will require interactive tools that support rapid exploration and hypothesis testing. The current wave of analytic tools for Big Data relies primarily on cluster computing for acceleration. The BID Data project focuses on single-node performance and has developed new, hardware-accelerated learning tools (BIDMach) with much higher efficiency and real-time performance. Single-machine speeds for BIDMach on many common learning tasks exceed those of any reported cluster system. Furthermore, we show that these gains can be scaled up on clusters of machines using new, faster communication primitives. Interactive analysis of massive datasets is now possible and has many advantages. The first part of the talk will give an overview of BIDMach and describe some of the key technologies. The second part of the talk will describe a new Inria project: language learning for young children by interacting with vision- and speech-enabled agents. This is work in progress that will involve co-recognition of visual and linguistic structures, and should fit well with the current wave of DNN approaches in both domains.
|