Alexander Kläser

Tool for computing 3D descriptors in videos

You can download the tool for computing the 3D gradient descriptor either as a statically linked binary for x86_64 (64-bit) machines running Linux (size 13 MB) or as source code (1.1 MB) to compile under Linux.

Notes:

Before using the tool, make sure that your video is read correctly. Simply use the option --dump-frame for a couple of consecutive frames and inspect them visually. If you encounter problems, try to convert your video to a different format (e.g., with mencoder). Since we use the ffmpeg library, various formats should work with the tool. Please note also that this tool is intended only for scientific or personal use. If you have problems running the tool, if you find bugs, or if you need support for platforms other than Linux, feel free to contact me.

History

19 August 2010: Release of version 1.3.0 as binary [64bit] and source code [tar.gz]. Most importantly, I added a new set of descriptor parameters that I optimized on the Hollywood2 training set and another one optimized on the KTH training set.

10 May 2010: Added some info about how to use and convert Harris3D points (by Ivan Laptev) with HOG3D.

1 September 2009: Release of version 1.2.0 [64bit].
This version includes more possibilities to work with dense sampling. Sorry, the 32bit version is not supported for now. I added a point to the FAQ regarding dense sampling.

6 February 2009: Update of version 1.1.0 [64bit, 32bit].
There were some problems with video access for the first 25 frames of the video.

20 January 2009: Release of version 1.1.0.
The video processing has been completely rewritten using ffmpeg (it should work much better now, let me know if there are problems!). There is also support for video shot boundaries now, and there are some features for computing descriptors within bounding boxes.

10 October 2008: Bug fix (version 1.0.1) [64bit, 32bit].
In the output, t-scale and xy-scale were swapped. I have also updated the FAQ over time based on the emails I received. Thanks to those who sent me an email :).

31 August 2008: First version (version 1.0) of our descriptor online.

FAQ

How can I compute HOG3D descriptors from Harris3D interest points?

You can download the Harris3D detector + HOGHOF descriptor from the website of Ivan Laptev. Follow his instructions on the website to compile and run his tool (called stipdet).

Then, I usually apply the following commands to compute HOG3D descriptors for Harris3D points. First, create a bash script "convert.sh" that uses awk to convert the Harris3D output into a valid position file for HOG3D:

### file: convert.sh ###
#!/bin/bash
# Ivan's format: point-type x y t sigma2 tau2 confidence desc.
awk '{
print $2 " " $3 " " $4 " " sqrt($5) " " sqrt($6)
}'
### end file ###
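
To sanity-check the conversion, you can push a single made-up stipdet line through the same awk program (the numbers below are fabricated purely for illustration):

```shell
# fake stipdet line: point-type x y t sigma2 tau2 confidence
echo "1 160 120 40 4 9 0.5" | awk '{ print $2 " " $3 " " $4 " " sqrt($5) " " sqrt($6) }'
# -> 160 120 40 2 3   (sigma = sqrt(4) = 2, tau = sqrt(9) = 3)
```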

Note that "stipdet" outputs squared values for the scales (sigma and tau); these need to be adapted for HOG3D, hence the sqrt calls. Make the script executable with chmod +x convert.sh. The following is a sample pipeline. You can even use gzip/gunzip to compress the data:

# Harris3D + HOGHOF points
# ... '>()' creates a new sub process and pipes the output there, very handy :)
./stipdet -f video_file -o >(gzip > harris_hoghof.gz) -vis no

# create position list
zcat harris_hoghof.gz | ./convert.sh > harris.pos

# compute HOG3D
./extractFeatures -p harris.pos -S start_frame -E end_frame video | gzip > harris_hog3d.gz

Hope that helps to get you started :).

Is there a version for Windows?

No, unfortunately not yet; we are developing on Linux. But we are looking into it and hope to have a Windows binary at some point. But hey, why not use Linux, too :) ? Check out one of the many free Linux distributions. You can even run some of them directly from CD without installation.

I have all my video sequences in .avi format. I discovered that it doesn't work with your tool. Which format do you advise to use?

This should be fixed by now. Since we are using ffmpeg, a lot of different video formats should be supported.

How can I extract a feature for a given specific point (x, y, t)?

Yep, you can do this. You need to create a file with the (x, y, t, sigma, tau) positions of the descriptors you would like to compute in the video; this should do exactly what you are looking for. Use the option -p (or --position-file) for this. Here is the explanation from --help:

position file for sampling features; each line represents one feature position 
and consists of 5 elements (they may be floating points) which are: 
'<x-position> <y-position> <frame-number> <xy-scale> <t-scale>'
lines beginning with '#' are ignored

(x-position, y-position, frame-number) give the position of the center point of the descriptor you would like to compute. Sigma (or xy-scale) is the characteristic spatial scale, and tau (or t-scale) is the characteristic temporal scale. For more information on the scales, please have a look at the FAQ point "In the output format, what exactly is xy-scale and t-scale?", which explains how we deal with them.
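
As a small sketch, a minimal position file could look like this (the coordinates and scales here are made up purely for illustration):

```shell
# write a hypothetical position file: <x> <y> <frame> <xy-scale> <t-scale>
cat > sample.pos <<'EOF'
# x y frame xy-scale t-scale
120.0 80.0 15 2.0 1.5
200.5 140.25 42 1.0 1.0
EOF
# then run, e.g.: ./extractFeatures -p sample.pos myvideo.avi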

Where can I find the output file which includes all the descriptors, etc.?

All the information about the descriptor, its xyt-position, etc. is printed to the terminal, to the standard output stream. Simply pipe the data into a file of your choice, e.g.:

extractFeatures ... myvideo.vob > mydescriptors.txt
And if you would like to save memory, you can compress the data directly:
extractFeatures ... myvideo.vob | gzip > mydescriptors.txt.gz
There are tools like zcat, zless that deal with compressed text files and that are handy to work with.
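
As a quick sketch of that workflow (using a stand-in file, since the real lines come from extractFeatures):

```shell
# stand-in for real extractFeatures output, just to demonstrate the tools
printf 'descriptor line 1\ndescriptor line 2\n' | gzip > mydescriptors.txt.gz
zcat mydescriptors.txt.gz | wc -l    # count descriptor lines without decompressing to disk
rm mydescriptors.txt.gz
```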

Is there any kind of overlapping and which parameter is responsible for that?

That depends on which part you are referring to. Some answers...

(a) If you refer to the descriptor cuboid itself: you can either extract the descriptor at given (x, y, t, xy-scale, t-scale) positions (with a position file) or in a dense manner. If you do dense sampling, then there is a parameter that controls the overlap of neighboring descriptors, namely --scale-overlap.

(b) If you refer to the orientation histogram cuboids (i.e., histogram cells) within the descriptor (controlled by --xy-ncells and --t-ncells; note that the x- and y-dimensions are coupled, i.e., if you choose 4, you will have 4 cells in x and 4 in y direction): there is no overlap. The histogram cells are simply aligned next to each other. The same applies to the mean gradients that are computed; there is no overlap.

How do you know the dimensionality of the descriptors in advance?

The dimensionality depends on the quantization type (icosahedron, dodecahedron); there is also the possibility to use polar coordinates (then the dimensionality depends on the number of bins for the polar coordinates):

nOrientations * xy-ncells * xy-ncells * t-ncells

with nOrientations depending on --quantization-type and --half-orientation:

--quantization-type == 'icosahedron' => nOrientations = 20
--quantization-type == 'dodecahedron' => nOrientations = 12
--half-orientation => nOrientations = nOrientations / 2

--quantization-type == 'polar' && --polar-bins-xy == N && --polar-bins-t == M
    => nOrientations = M * N

Note that --half-orientation does not apply to polar quantization since there you give the number of bins directly.
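
As a quick check, plugging the two parameter settings shipped with the tool (see "descriptor parameters" in the --help output) into this formula gives:

```shell
# dimensionality = nOrientations * xy-ncells * xy-ncells * t-ncells
# default polar setting: nOrientations = polar-bins-xy * polar-bins-t = 5 * 3
echo $(( 5 * 3 * 2 * 2 * 5 ))    # -> 300
# KTH setting: icosahedron with half orientation, nOrientations = 20 / 2
echo $(( 20 / 2 * 5 * 5 * 4 ))   # -> 1000
```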

In the output format, do x, y, frame refer to the descriptor center point?

Yes :).

In the output format, why do I get non-integer numbers for the frames (e.g., 7.38461...)?

Internally, the tool works with floating point values, also for dense sampling. The coordinates are rounded only at the moment pixel values are accessed.

In the output format, what exactly is xy-scale and t-scale?

We have sample points with five parameters (x, y, t, xy-scale, t-scale). xy-scale and t-scale denote the characteristic scale: xy-scale is also referred to as sigma, and t-scale as tau. We work in scale-space, i.e., we can have two points at the exact same position but at different scales. Different scales mean that the cuboid size of the descriptor is different. However, the characteristic scale does not directly give the size of the descriptor; for that, one needs to define a support size. More concretely, given a point (x, y, t, xy-scale, t-scale), the descriptor center position is at (x, y, t), and its width, height, and depth are computed as:

width = sigma-support * xy-scale
height = sigma-support * xy-scale
depth = tau-support * t-scale

where --sigma-support and --tau-support are two parameters that need to be chosen (we evaluated them in our paper). If we consider the original video unscaled, i.e., xy-scale = t-scale = 1, the support refers directly to the size of the descriptor:

width = sigma-support
height = sigma-support
depth = tau-support
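
A small numeric sketch with the default supports (sigma-support=24, tau-support=12); the scale values in the second line are made up for illustration:

```shell
# descriptor extent (width height depth):
# width = height = sigma-support * xy-scale, depth = tau-support * t-scale
awk 'BEGIN {
    printf "%d %d %d\n", 24 * 1.0, 24 * 1.0, 12 * 1.0   # xy-scale = t-scale = 1
    printf "%d %d %d\n", 24 * 2.0, 24 * 2.0, 12 * 1.5   # xy-scale = 2, t-scale = 1.5
}'
# -> 24 24 12
# -> 48 48 18
```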

A first note on dense sampling...

... how can you do dense sampling with the tool? It's simple: you have 5 dimensions which you need to sample in a dense manner: x, y, t, spatial scale, and temporal scale. --xy-stride/--xy-nstride cover x and y, --t-stride/--t-nstride cover t, --xy-scale covers the spatial scale (i.e., increasing spatial size of the descriptor), and --t-scale covers the temporal scale (i.e., increasing length of the descriptor). --xy-scale and --t-scale give the step size for dense sampling along the spatial and temporal scale space. For example: set --xy-scale to 2, and the initial width/height of your descriptor will be doubled at every step. (Therefore you shouldn't set the scale parameters to 1, since something * 1 = something.)

Another note: when doing dense sampling, --tau-support and --sigma-support have no influence on the descriptor size, since the size is determined by --xy-stride/--xy-nstride, --t-stride/--t-nstride, and --scale-overlap.

I cannot figure out the dense sampling parameters. ...

... I am using the following parameters:

extractFeatures --xy-ncells 4 --t-ncells 3 --npix 3 --xy-stride 8 --t-stride 6 --sigma-support 8 --tau-support 6 --cut-zero 0.25 -P icosahedron <video>

With these parameters, I get a 450MB descriptor file. What are the optimal parameters?

First note that the *-support parameters are actually of no use for dense sampling, since the size of the descriptors is induced from the sampling parameters. Note also that the default descriptor parameters (--xy-ncells, --t-ncells, ...) correspond to the optimized ones from our BMVC08 experiments, so there is no need to set them by hand.

There are several parameters for dense sampling; see 'dense sampling options' in the '--help' output below. *-stride is the absolute stride in pixels, *-nstride is the stride relative to the length/size of the video sequence. There is also *-max-scale (or *-max-stride), which controls at which temporal/spatial scale the sampling is stopped. --scale-overlap is by default set to 2, i.e., 50% overlap with neighboring descriptors (this is more or less standard in the literature). Pay attention, since dense sampling in videos will give you a lot of descriptors (you sample in 5 dimensions: x, y, t, spatial scale, temporal scale). It helps to use the *-max-scale (or *-max-stride) options.

For dense sampling parameters, our BMVC09 video descriptor/detector evaluation might be of interest to you. Here is an example according to our experiments in that evaluation. We use 8 spatial scales {1, sqrt(2), 2, sqrt(2)*2, 4, sqrt(2)*4, 8, sqrt(2)*8} and 2 temporal ones {1, sqrt(2)}. The lowest spatial scale (=1) refers to a stride of 9x9 pixels (i.e., a descriptor size of 18x18 pixels). The lowest temporal scale (=1) refers to a stride of 5 frames (i.e., a descriptor length of 10 frames). Note that in our experiments we subsampled videos of standard size (around 600-800 pixels in width and ~400 in height) by a factor of 2 (i.e., to 300-400 width and ~200 height). For this setup you can use the following parameters:

extractFeatures --xy-stride 9 --xy-max-scale 12 --t-stride 5 --t-max-scale 2 <videoFile>

You can use the parameter "--simg 0.5" to scale down the video frames by a factor of 2. Hope these comments help getting you started :).
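
The spatial scale series above can be reproduced with a one-liner: start at 1, multiply by the default --xy-scale factor sqrt(2), and stop once --xy-max-scale (12 in the command above) is exceeded:

```shell
# enumerate the sampled spatial scales: 1, sqrt(2), 2, ..., up to --xy-max-scale 12
awk 'BEGIN { for (s = 1; s <= 12; s *= sqrt(2)) printf "%.3f\n", s }'
# prints the 8 scales: 1.000 1.414 2.000 2.828 4.000 5.657 8.000 11.314
```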

Command line options

The following is the output with --help:

usage:
    extractFeatures [options] <video-file>
or (for testing):
    extractFeatures --dump-frame <frame-number> <video-file>

output format is:
    <x> <y> <frame> <x-normalized> <y-norm.> <t-norm.> <xy-scale> <t-scale> <descriptor>

version: 1.3.0

For license information use the option '--license'.


command line arguments:
  --video-file arg      video file to process

general options:
  -h [ --help ]               produce help message
  --license                   print license information
  -v [ --verbose ]            verbose mode, the 3D boxes for which the 
                              descriptor is computed will be printed to stderr
  -p [ --position-file ] arg  position file for sampling features; each line 
                              represents one feature position and consists of 5
                              elements (they may be floating points) which are:
                              '<x-position> <y-position> <frame-number> 
                              <xy-scale> <t-scale>'; lines beginning with '#' 
                              are ignored
  -q [ --position-file2 ] arg similar to --position-file, however with a 
                              different format; each line consists of 6 
                              elements and describes a spatio-temporal cuboid 
                              for which the descriptor is computed: '<x> <y> 
                              <frame-start> <width> <height> <length>'; lines 
                              beginning with '#' are ignored
  -t [ --track-file ] arg     track file for sampling features; each line 
                              represents a bounding box at a given frame; a 
                              line consists of 5 elements (they may be floating
                              points) which are: '<frame-number> 
                              <top-left-x-position> <top-left-y-position> 
                              <width> <height>'; 
                              lines beginning with '#' are ignored; for a given
                              position file, features with a center point that 
                              lies outside the set of given bounding boxes are 
                              ignored; for dense sampling, the --xy-nstride and
                              --t-nstride options will be relative to the track
                              (length and bounding boxes)
  --loose-track               denotes that the track file is a collection of 
                              bounding boxes; for dense sampling, descriptors 
                              will be computed for each given bounding box in 
                              the track file
  --shot-file arg             simple file with shot boundaries (i.e., frame 
                              number of the beginning of a new shot) separated 
                              by any whitespace character; features crossing shot 
                              any whitespace character; features crossing shot 
                              boundaries will be suppressed
  -f [ --force ]              force computation of features, no suppression if 
                              features do not fit completely in the full 
                              video/buffer
  --n-subsample arg (=1)      subsample every n-th feature (on average)
  --seed arg (=timestamp)     seed for subsampling (default current time)
  --dump-frame arg            this option lets you double check whether the 
                              video is read correctly, give it a frame number 
                              and it will save this frame from the video as 
                              'frame<x>.png' in the current working directory, 
                              then it will quit

descriptor options:
  --kth-optimized                         by default, the parameter setting 
                                          optimized on the Hollywood2 training 
                                          dataset (see explanations at the end)
                                          is employed; if this flag is set, the
                                          parameter settings optimized on the 
                                          KTH training dataset are being used.
  --xy-ncells arg (=2)                    number of HOG3D cells in x and y 
                                          direction
  --t-ncells arg (=5)                     number of HOG3D cells in time
  --npix arg (=4)                         number of hyper pixel support for 
                                          each HOG3D cell, i.e., the histogram 
                                          of a cell is computed for SxSxS 
                                          subblocks
  -n [ --norm-threshold ] arg (=0.1)      suppress features with a descriptor 
                                          L2-norm lower than the given 
                                          threshold
  --cut-zero arg (=0.25)                  descriptor vector is normalized, its 
                                          values are limited to given value, 
                                          and the vector is renormalized again
  --sigma-support arg (=24)               on the original scale, sigma 
                                          determines the size of the region 
                                          around a sampling point for which the
                                          descriptor is computed; for a 
                                          different scale, this size is given 
                                          by: <characteristic-scale> * 
                                          <sigma-support>; sigma is the support 
                                          in x- and y-direction
  --tau-support arg (=12)                 similar to 'sigma-support'; tau is 
                                          the support in time
  -P [ --quantization-type ] arg (=polar) method that shall be taken for 3D 
                                          gradient quantization: 'icosahedron' 
                                          (20 faces=bins), 'dodecahedron' (12 
                                          faces=bins), 'polar' (binning on 
                                          polar coordinates, you can specify 
                                          the binning)
  --polar-bins-xy arg (=5)                number of bins for the XY-plane 
                                          orientation using polar coordinate 
                                          quantization (has either full or half
                                          orientation)
  --polar-bins-t arg (=3)                 number of bins for the XT-plane 
                                          orientation using polar coordinate 
                                          quantization (has always half 
                                          orientation)
  -F [ --full-orientation ] arg (=0)      By default, the half orientation is 
                                          used (thus resulting in half the 
                                          number of bins); if this flag is set,
                                          the full sphere is used for 
                                          quantization, thus doubling the 
                                          number of bins
  -G [ --gauss-weight ] arg (=0)          By default, each (hyper) pixel has a 
                                          weight = 1; this flag enables 
                                          Gaussian weighting similar to the 
                                          SIFT descriptor
  -O [ --overlap-cells ] arg (=0)         Given this flag, cells in the 
                                          descriptor will be 50% overlapping
  -N [ --norm-global ] arg (=0)           By default, each cell in the 
                                          descriptor is normalized; given this 
                                          flag, normalization is carried out on
                                          the complete descriptor
  --l1-norm arg (=0)                      Given this flag, each cell 
                                          descriptor (or the full descriptor if 
                                          given '--norm-global') will be 
                                          normalized with L1 norm; by default 
                                          normalization is done using L2-norm

dense sampling options:
  --xy-nstride arg          how many features are sampled in x/y direction on 
                            the smallest scale (specify either xy-nstride or 
                            xy-stride)
  --xy-stride arg           specifies the stride in x/y direction (in pixel) on
                            the smallest scale (specify either xy-nstride or 
                            xy-stride)
  --xy-max-stride arg       specifies the maximum stride (and indirectly its 
                            scale) for x/y
  --xy-max-scale arg        specifies the maximum scale for x/y
  --xy-scale arg (=sqrt(2)) scale factor for different scales in x/y direction
  --t-nstride arg           how many features are sampled in time on the 
                            smallest scale (specify either t-nstride or 
                            t-stride)
  --t-stride arg            specifies the stride in t direction (in frames) on 
                            the smallest scale (specify either t-nstride or 
                            t-stride)
  --t-max-stride arg        specifies the maximum stride (and indirectly its 
                            scale) for t
  --t-max-scale arg         specifies the maximum scale for t
  --t-scale arg (=sqrt(2))  scale factor for different scales in time
  --scale-overlap arg (=2)  controls overlap of adjacent features, scales size 
                            of 3D box; for a factor of 1, features will be 
                            adjacent, any factor greater than 1 will cause 
                            overlapping features; a factor of 2 will double the
                            size of the box (along each dimension) and thus 
                            will result in an overlap of 50%

video options:
  -S [ --start-frame ] arg          if given, feature extraction starts at 
                                    given frame
  -E [ --end-frame ] arg            if given, feature extraction ends at given 
                                    frame (including this frame)
  -s [ --start-time ] arg           if given, feature extraction starts at the 
                                    first frame at or after the given time (in 
                                    seconds)
  -e [ --end-time ] arg             if given, feature extraction ends at the 
                                    last frame before or at the given time (in 
                                    seconds)
  -b [ --buffer-length ] arg (=100) length of the internal video buffer; 
                                    controls the memory usage as well as the 
                                    maximal scale in time for features
  --simg arg (=1)                   scale the input video by this given factor

descriptor parameters:
  By default, descriptor parameters are employed that have been learned on the
  training set of the Hollywood2 actions database. The setting is as follows:
    xy-ncells=2
    t-ncells=5
    npix=4
    sigma-support=24
    tau-support=12
    quantization-type=polar
    polar-bins-xy=5
    polar-bins-t=3
    half-orientation
    normalization of cell histograms independently with L2 norm

  Optionally, a parameter setting learned on the KTH training set can be 
  chosen by setting the option '--kth-optimized'. The parameters are then:
    xy-ncells=5
    t-ncells=4
    npix=4
    sigma-support=16
    tau-support=4
    quantization-type=icosahedron
    half-orientation
    normalization of cell histograms independently with L2 norm

  More information can be obtained in my PhD thesis: 
    Alexander Klaeser, Learning Human Actions in Video, July 2010

by Alexander Kläser 2010