MED Summaries

MED Summaries is a new dataset for the evaluation of dynamic video summaries. It contains annotations of 160 videos: a validation set of 60 videos and a test set of 100 videos. There are 10 event categories in the test set.

The videos come from the TRECVID MED 2011 challenge. They are available to registered participants.

Download

Annotation, subsets, descriptors, segmentations (954 MB)
Segment descriptors for the training set (12 GB)
Evaluation code
Temporal segmentation code

Paper

Category-specific video summarization. D. Potapov, M. Douze, Z. Harchaoui, C. Schmid. ECCV 2014.
PDF, Supplementary material, Poster
@inproceedings{potapov2014category,
    url = {http://hal.inria.fr/hal-01022967},
    title = {{Category-specific video summarization}},
    author = {Potapov, Danila and Douze, Matthijs and Harchaoui, Zaid and Schmid, Cordelia},
    booktitle = {{ECCV 2014 - European Conference on Computer Vision}},
    year = {2014},
}

Annotation protocol

First, we asked a user to annotate temporal segments. Temporal segments should be semantically consistent, i.e. long enough for a user to grasp what is going on, yet short enough to be described in a single short sentence. For example, a segment can be "a group of people marching in the street" for a video of the class "Parade", or "putting one slice of bread onto another" for the class "Making a sandwich".

Then, for each semantic segment, we asked a user: "Does the segment contain evidence of the given event category?" The possible answers are:
  1. No evidence
  2. Some hints suggest that the whole video could belong to the category
  3. The segment contains significant evidence of the category
  4. The segment alone classifies the video to the category

Annotation of semantic segments and their importance

Annotation of semantic segments

This task consists in annotating temporal segments in a video. For a given video, annotating temporal segments corresponds to finding time-stamps called "change-points" such that the video chunk between two consecutive time-stamps is "semantically consistent".

A video chunk between two consecutive annotated time-stamps is called a "segment". We consider a segment to be "semantically consistent" (or a semantic segment for short) if a human can describe it with a short sentence. At the same time, the segment should be delimited so that watching it alone is sufficient for a user to grasp what is going on. For example, it can be "a group of people marching in the street" for a "Parade" video or "putting one slice of bread onto another" for a "Making a sandwich" video (as in the examples).

The whole video has to be covered by non-intersecting semantic segments without gaps. Annotating semantic segments corresponds to specifying the segments' change-points. We require all shot boundaries to be annotated as change-points (a "shot" is a part of the video that is continuous in time and space). Note that a change-point does not necessarily correspond to a shot boundary, but every shot boundary must be a change-point.
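The coverage rule above can be sketched in code. This is a minimal illustration, not part of the released annotation tools: the function names, the use of frame indices, and the half-open `[start, end)` convention are assumptions made for the example.

```python
from typing import List, Tuple


def segments_from_change_points(
    change_points: List[int], num_frames: int
) -> List[Tuple[int, int]]:
    """Turn change-points into half-open frame segments [start, end).

    A change-point at frame f is read as "a new segment starts at f",
    so the segments cover the whole video with no gaps and no overlaps.
    """
    points = sorted(set(change_points))
    if any(p <= 0 or p >= num_frames for p in points):
        raise ValueError("change-points must lie strictly inside the video")
    bounds = [0] + points + [num_frames]
    return list(zip(bounds[:-1], bounds[1:]))


def missing_shot_boundaries(
    change_points: List[int], shot_boundaries: List[int]
) -> List[int]:
    """Return the shot boundaries that were not annotated as change-points."""
    return sorted(set(shot_boundaries) - set(change_points))


print(segments_from_change_points([120, 45, 300], num_frames=400))
# [(0, 45), (45, 120), (120, 300), (300, 400)]
print(missing_shot_boundaries([45, 120, 300], [120, 200]))
# [200]
```

Building segments from a sorted list of change-points makes the "no gaps, no intersections" property hold by construction, so only the shot-boundary requirement needs a separate check.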

A gradual transition (a non-abrupt shot boundary) gets a single change-point in its middle if it lasts less than 1 second. Otherwise the gradual transition must be treated as a separate segment. The video below shows an example of a short gradual transition.
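The rule for gradual transitions can be written as a small function. This is only a sketch: the function name and the use of seconds as the unit are assumptions for illustration.

```python
from typing import List


def gradual_transition_change_points(start_s: float, end_s: float) -> List[float]:
    """Return the change-points (in seconds) implied by a gradual transition.

    Shorter than 1 second: one change-point in the middle of the transition.
    Otherwise: the transition is its own segment, delimited by two change-points.
    """
    if end_s <= start_s:
        raise ValueError("a transition must have a positive duration")
    if end_s - start_s < 1.0:
        return [(start_s + end_s) / 2.0]
    return [start_s, end_s]


print(gradual_transition_change_points(10.0, 10.5))  # [10.25]
print(gradual_transition_change_points(10.0, 12.0))  # [10.0, 12.0]
```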

If a shot is long and contains several actions, you must annotate the starting and ending frames of these actions. Often a shot contains a single action, but the main part is shorter than the whole segment. In this case you should also annotate the main part. See the example video below.

[Example videos: "Too long" and "Ok"]

Some actions are repetitive or homogeneous, e.g. running, sewing, etc. In that case you should specify the "minimum duration" of a subsegment that fully represents the whole segment. For example, watching 2-3 seconds of a running person is sufficient to understand what is going on and describe the segment as "a person is running". We require the "minimum duration" of the segment to be at least 2 seconds. On the other hand, 10 seconds of a running person is too long a segment to concisely represent the sentence "a person is running".

Therefore, we give the following durations as a recommendation:

  • segments longer than 5-10 seconds can usually be split into subsegments (when non-repetitive)
  • "minimum duration" is usually shorter than half of the segment
  • "minimum duration" is usually smaller than 10 seconds
  • the whole duration of a repetitive segment is usually smaller than 20 seconds
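The recommendations above can be turned into a simple self-check. This is a hedged sketch: the function name is hypothetical, the 10-second threshold for non-repetitive segments takes the upper end of the "5-10 seconds" range as an assumption, and all checks other than the 2-second minimum are soft recommendations rather than hard rules.

```python
from typing import List, Optional


def check_segment_durations(
    duration_s: float, min_duration_s: Optional[float] = None
) -> List[str]:
    """Return warnings for one segment.

    min_duration_s is given only for repetitive or homogeneous segments.
    """
    warnings = []
    if min_duration_s is None:
        # Non-repetitive segment: long ones can usually be split.
        if duration_s > 10.0:
            warnings.append("non-repetitive segment longer than 10 s: consider splitting")
        return warnings
    if min_duration_s < 2.0:
        warnings.append("minimum duration must be at least 2 s")
    if min_duration_s >= duration_s / 2.0:
        warnings.append("minimum duration is usually shorter than half of the segment")
    if min_duration_s >= 10.0:
        warnings.append("minimum duration is usually shorter than 10 s")
    if duration_s >= 20.0:
        warnings.append("repetitive segment is usually shorter than 20 s")
    return warnings


print(check_segment_durations(15.0, min_duration_s=3.0))  # []
print(len(check_segment_durations(25.0, min_duration_s=1.5)))  # 2
```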

The interface lets you navigate the video with a step of 5 frames. You should specify change-points as accurately as possible. A change-point is inserted just before the frame that you see.

Annotation of importance

For each semantic segment we ask you to annotate importance. You should answer the question: "Does the segment contain evidence of the given event category?" Please choose one of the answers:

  • 0 - No evidence
  • 1 - Some hints suggest that the whole video could belong to the category
  • 2 - The segment contains significant evidence that suggests that the video belongs to the category
  • 3 - The segment alone is sufficient to decide that the video belongs to the category

If something is only mentioned in text or speech, then do not report it as important.
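For anyone loading the annotations programmatically, the four-level importance scale above could be encoded as an integer enumeration. The class and member names below are hypothetical and are not taken from the released evaluation code.

```python
from enum import IntEnum


class Importance(IntEnum):
    """Importance scores 0-3, following the answers listed above."""

    NO_EVIDENCE = 0           # no evidence of the category
    SOME_HINTS = 1            # hints that the whole video could belong to it
    SIGNIFICANT_EVIDENCE = 2  # significant evidence of the category
    SUFFICIENT_ALONE = 3      # the segment alone decides the category


def parse_importance(value: int) -> Importance:
    """Convert a raw score to Importance; raises ValueError outside 0..3."""
    return Importance(value)


print(parse_importance(2).name)  # SIGNIFICANT_EVIDENCE
print(Importance.SUFFICIENT_ALONE > Importance.SOME_HINTS)  # True
```

Using `IntEnum` keeps the scores comparable and usable as plain integers while rejecting out-of-range values at parse time.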

Annotation statistics

Subset                      Training      Validation     NULL    Test
MED dataset (our split)
  Total videos              1338 (520)    1311 (877)     9600    31820
  Total duration, hours     60 (25)       57 (39)        408     980
MED-Summaries (subset)
  Annotated videos          -             60 (40)        -       100
  Total duration, hours     -             3 (2)          -       4
  Annotators per video      -             1              -       2-4
  Total annotated segments  -             1680 (1122)    -       8904

There are 15 event categories in the training set and 10 in the test set, hence the numbers in brackets.
Most of the test videos in MED-Summaries are between 1 and 5 minutes long. In total, 12 people participated in the annotation.
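As a rough sanity check on the statistics above, the average number of annotated segments per video can be derived with simple arithmetic. The sketch below assumes that the 60 validation videos (one annotator each) account for the 1680 segments, and that the 8904 test segments are summed over all annotators of the 100 test videos; the second assumption is not stated in the statistics, so the per-annotator range is only indicative.

```python
# Figures copied from the statistics above; how they are divided is an assumption.
validation_videos, validation_segments = 60, 1680
test_videos, test_segments = 100, 8904
min_annotators, max_annotators = 2, 4


def mean_segments(segments: int, videos: int, annotators: int = 1) -> float:
    """Average number of segments per video and per annotator."""
    return segments / (videos * annotators)


print(mean_segments(validation_segments, validation_videos))      # 28.0
print(mean_segments(test_segments, test_videos, max_annotators))  # 22.26
print(mean_segments(test_segments, test_videos, min_annotators))  # 44.52
```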

Examples of video summaries

Questions and Answers

Contact

Danila Potapov