PhD thesis: Large-scale machine learning for video analysis
Supervisors:
Jakob Verbeek and Cordelia Schmid
Duration:
3 years, preferably starting September 2011.
Topics:
statistical machine learning, computer vision
Keywords:
classification, ranking, local descriptors, compression
Expected skills:
strong knowledge of machine learning and/or computer vision, good
programming skills in Python and/or C, ability to make things work
Context:
Video interpretation and understanding is one of the long-term research goals in computer vision.
Realistic videos such as movies [LMSR08, MLS09, KMSZ10] present a variety of challenging machine learning
problems, such as action classification/action retrieval, human tracking, human/object interaction
classification, etc.
Recently, robust visual descriptors for video classification have been developed, and it has been shown that visual classifiers can be learned in realistic, difficult settings [GMS09, WUK+09].
However, to deploy visual recognition systems at large scale in practice, it becomes important to address the scalability of these techniques.
Goals:
The main goal of this thesis is to develop scalable methods for video content analysis (e.g. for ranking or classification).
In order to address scalability, a variety of topics are of interest:
The number of training samples needs to be large in order to learn good models; therefore, the visual descriptors must be relatively small so that all video representations fit in memory when training the models (as disk access would be too slow).
The computation of visual representations of videos is costly, and is a bottleneck when large video collections have to be analysed.
Development of expressive, but efficiently computable, video representations is therefore important.
To encode spatial and temporal structure, more efficient methods than rigid, spatial-pyramid-like structures [LSP06a] are needed.
If videos need to be classified over a large number of classes, it is desirable to have methods that do not evaluate a separate model for each class. Instead, models need to share parameters and computation in order to classify the videos more efficiently.
When training models for many classes, it becomes impractical to manually annotate training data for all classes.
Instead, it becomes important to develop methods that automatically harvest (noisy) training examples from video archives based on associated text (e.g. subtitles in movies or broadcast TV).
All of these topics require the design of novel machine learning methods, and large-scale experimental evaluation (for which we have the required infrastructure).
Therefore, it is important that applicants have both a very good understanding of diverse machine learning techniques and excellent programming skills.
Application:
Please send applications by email to both Jakob Verbeek and Cordelia Schmid (firstname.lastname@inria.fr), along with:
a complete resume
1 or 2 letters of recommendation, preferably from your master's thesis supervisor and sent directly to us by him/her
topic and details of your master's thesis (include a PDF of the thesis if possible)
if you already have research experience, please include a publication list and references
graduation marks
References:
[GMS09] A. Gaidon, M. Marszalek, and C. Schmid. Mining visual actions from
movies. In BMVC, 2009.
[KMSZ10] A. Klaeser, M. Marszalek, C. Schmid, and A. Zisserman. Human
focused action localization in video. In International Workshop on Sign, Gesture, and
Activity (SGA) in Conjunction with ECCV, 2010.
[LMSR08] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human
actions from movies. In CVPR, 2008.
[LSP06a] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In CVPR, 2006.
[MLS09] M. Marszalek, I. Laptev, and C. Schmid. Actions in contexts. In CVPR, 2009.
[WUK+09] H. Wang, M. M. Ullah, A. Klaeser, I. Laptev, and C. Schmid. Evaluation of local
spatio-temporal features for action recognition. In BMVC, 2009.