Master internship on learning from video and text (jointly with the AMA team at LIG Grenoble)

The LEAR research group at INRIA Grenoble and the AMA team at LIG Grenoble are looking for a Master's student. The candidate will be jointly supervised and will spend time at both institutions (http://lear.inrialpes.fr/; http://mrim.imag.fr/eric.gaussier/).

Topic: The goal of this project is to better understand videos by exploiting associated textual data. For still images, some success has been achieved in learning correspondences between objects and textual keywords using techniques from statistical machine translation [1]. This research direction has been successfully extended to learning correspondences between face detections in images and names extracted from the surrounding text on web pages [2]. In video, it has been demonstrated that transcripts aligned with the video can be a very useful source of weak supervision for learning the appearance of characters [3] and human actions [4]. This line of work has been extended to allow for imprecise alignment [5].

Existing work does not attempt to use text as a form of supervision for learning spatio-temporal constraints between scenes, humans, objects and their interactions in video. In addition, the text is typically treated only as a supervisory signal for visual learning; the opposite direction, where visual information would help disambiguate the interpretation of the text, is not considered.

In this project, we propose to go beyond the state of the art and turn textual annotations into a more complete and accurate supervisory signal for the different stages of the scene/object/human action interpretation process. In particular, we want to develop spatio-temporal correspondences between videos and the available text annotations, and exploit these correspondences as constraints for learning actions in videos. During the internship, we will define the form these constraints can take and the associated learning framework. The approach will then be validated through experiments on real data sets.

Your profile:

Duration: 3 to 6 months

Start date: As soon as possible

Location: This is a joint project between INRIA Grenoble, Montbonnot, and LIG Grenoble, Saint Martin d'Heres. The candidate will be required to spend time in both institutions.

Contacts:

Res. Dir. Cordelia Schmid, schmid@inrialpes.fr

Prof. Eric Gaussier, Eric.Gaussier@imag.fr

Please send applications via email, including:

Literature:

[1] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 2003.

[2] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Automatic face naming with caption-based supervision. In Conference on Computer Vision and Pattern Recognition, 2008.

[3] M. Everingham, J. Sivic, and A. Zisserman. Taking the bite out of automatic naming of characters in TV video. Image and Vision Computing, 2009.

[4] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce. Automatic annotation of human actions in video. In International Conference on Computer Vision, 2009.

[5] T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar. Movie/script: Alignment and parsing of video and text transcription. In European Conference on Computer Vision, 2008.