Alexander Kläser

You can download the tool for computing the 3D gradient descriptor either as a statically linked binary for x86_64 machines (64 bit, size 13 MB) running Linux, or as source code (1.1 MB) to compile yourself, also under Linux.

**Features:**

- orientation quantization using an icosahedron, a dodecahedron, or polar coordinates
- descriptor computation either for given positions in the video or via dense sampling
- control over all parameters
- testing option for dumping video frames

Before using the tool, make sure that your video is read correctly. Simply use the option --dump-frame for a couple of consecutive frames and inspect them visually. If you encounter problems, try to convert your video to a different format (e.g., with mencoder). Since we use the ffmpeg library, various formats should work with the tool. Please note also that this tool is intended only for scientific or personal use. If you have problems running the tool, if you found bugs, or if you need support for platforms other than Linux, feel free to contact me.

**19 August 2010:** Release of version 1.3.0 as binary [64bit] and source code [tar.gz]. Most importantly, I added a new set of descriptor parameters that I optimized on the Hollywood2 training set and another one optimized on the KTH training set.

**10 May 2010:** Added some info about how to use and convert Harris3D points (by Ivan Laptev) with HOG3D.

**1 September 2009:** Release of version 1.2.0 [64bit].

This version includes more possibilities to work with dense sampling. Sorry, the 32bit version is not supported for now. I added a point to the FAQ regarding dense sampling.

**6 February 2009:** Update of version 1.1.0 [64bit, 32bit].

There were some problems with the video access for the first 25 frames of the video.

**20 January 2009:** Release of version 1.1.0.

The video processing has been completely rewritten using ffmpeg (it should work much better now; let me know if there are problems!). There is also support for using video shot boundaries now, and there are some features for computing descriptors within bounding boxes.

**10 October 2008:** Bug fix (version 1.0.1) [64bit, 32bit].

In the output, t-scale and xy-scale were swapped. BTW, I have also been updating the FAQ based on the emails that I got. Thanks to those who sent me an email :).

**31 August 2008:** First version (version 1.0) of our descriptor online.

You can download the Harris3D detector + HOGHOF descriptor from the website of Ivan Laptev. Follow his instructions on the website to compile and run his tool (called stipdet).

Then, I usually apply the following set of commands to compute HOG3D descriptors for Harris3D points. First, create a bash script "convert.sh" that uses awk to convert the Harris3D output into a valid position file for HOG3D:

```bash
### file: convert.sh ###
#!/bin/bash
# Ivan's format: point-type x y t sigma2 tau2 confidence desc.
awk '{ print $2 " " $3 " " $4 " " sqrt($5) " " sqrt($6) }'
### end file ###
```

Note that "stipdet" outputs squared values for the scales (sigma and tau); this needs to be adapted for HOG3D. The following is a sample pipeline. You can even use gzip/gunzip to compress the data:

```bash
# Harris3D + HOGHOF points
# ... '>()' creates a new sub-process and pipes the output there, very handy :)
./stipdet -f video_file -o >(gzip > harris_hoghof.gz) -vis no

# create position list
zcat harris_hoghof.gz | ./convert.sh > harris.pos

# compute HOG3D
./extractFeatures -p harris.pos -S start_frame -E end_frame video | gzip > harris_hog3d.gz
```

Hope that helps to get you started :) .

No, unfortunately not yet. We are developing on Linux, but we are looking into it and hope to have a Windows binary at some point. But hey, why not use Linux, too :) ? Check out one of the many free Linux distributions. You can even run some of them directly from CD without installation.

This should be fixed by now: we are using ffmpeg, so a lot of different video formats should be supported.

Yep, you can do this. You need to create a file with the (x, y, t, sigma, tau) positions of the descriptors you would like to compute in the video; this should do exactly what you are looking for. Use the option -p (or --position-file) for this .. here is the explanation from --help:

position file for sampling features; each line represents one feature position and consists of 5 elements (they may be floating points) which are: '<x-position> <y-position> <frame-number> <xy-scale> <t-scale>'; lines beginning with '#' are ignored

(x-position, y-position, frame-number) is the position of the center point of the descriptor you would like to compute. Sigma (or xy-scale) is the characteristic spatial scale, and tau (or t-scale) is the characteristic temporal scale. For more information on the scales, please have a look at the FAQ point "In the output format, what exactly is xy-scale and t-scale?", which explains how we deal with them.
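For illustration, here is a sketch of such a position file; the coordinates and scales are made-up example values, and the video name in the commented command is a placeholder:

```shell
# Sketch: create a small position file (values are made-up examples).
# Columns: <x> <y> <frame> <xy-scale> <t-scale>
cat > positions.pos << 'EOF'
# x      y      frame  xy-scale  t-scale
160.0    120.0  25     2.0       1.5
80.5     60.5   40     1.0       1.0
EOF

# Then compute descriptors only at these positions:
# ./extractFeatures -p positions.pos myvideo.avi > mydescriptors.txt
```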

All the information about the descriptor, its xyt-position, etc. is printed to the terminal, to the standard output stream. Simply pipe the data into a file of your choice, e.g.:

extractFeatures ... myvideo.vob > mydescriptors.txt

And if you would like to save disk space, you can compress the data directly:

extractFeatures ... myvideo.vob | gzip > mydescriptors.txt.gz

There are tools like zcat and zless that deal with compressed text files and are handy to work with.
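To get a feel for the output, each line starts with 8 position/scale fields followed by the descriptor values (see the --help output). Here is a small sketch with a synthetic, gzip-compressed file; the two "descriptors" are only 3-dimensional for brevity, while real ones have hundreds of dimensions:

```shell
# Synthetic example file: 2 descriptors, 8 position fields + 3 descriptor values each
printf '10 20 5 0.1 0.2 0.3 1 1 0.5 0.5 0.0\n30 40 8 0.3 0.4 0.5 2 1 0.0 1.0 0.0\n' \
  | gzip > mydescriptors.txt.gz

zcat mydescriptors.txt.gz | wc -l                               # number of descriptors: 2
zcat mydescriptors.txt.gz | head -n 1 | awk '{ print NF - 8 }'  # descriptor dimension: 3
```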

That depends on which part you refer to. Some answers...

**(a)** If you refer to the descriptor cuboid itself, you can either extract the descriptor at given (x, y, t, xy-scale, t-scale) positions (with a position file) or in a dense manner. If you do dense sampling, then there is a parameter that controls the overlap of neighboring descriptors, which is --scale-overlap.

**(b)** If you refer to the orientation histogram cuboids (i.e., histogram cells) within the descriptor (controlled by --xy-ncells and --t-ncells .. note that the x- and y-dimensions are coupled, that is, if you choose 4, you will have 4 cells in x and 4 cells in y direction), there is no overlap. The histogram cells are just aligned next to each other. The same applies to the mean gradients that are computed; there is no overlap.

The dimensionality depends on the quantization type (icosahedron, dodecahedron); there is also the possibility to use polar coordinates (then the dimensionality depends on the number of bins for the polar coordinates):

nOrientations * xy-ncells * xy-ncells * t-ncells

with nOrientations depending on --quantization-type and --half-orientation:

- --quantization-type == 'icosahedron' => nOrientations = 20
- --quantization-type == 'dodecahedron' => nOrientations = 12
- --quantization-type == 'polar' && --polar-bins-xy == N && --polar-bins-t == M => nOrientations = M * N
- --half-orientation => nOrientations = nOrientations / 2

Note that --half-orientation does not apply to polar quantization since there you give the number of bins directly.
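As a quick sanity check, the formula can be evaluated directly in the shell; this sketch uses the Hollywood2-optimized defaults from the --help output (xy-ncells=2, t-ncells=5, polar-bins-xy=5, polar-bins-t=3):

```shell
xy_ncells=2; t_ncells=5   # Hollywood2-optimized defaults

# polar quantization with the default bins: nOrientations = 5 * 3 = 15
echo $(( 5 * 3 * xy_ncells * xy_ncells * t_ncells ))    # -> 300

# icosahedron with half orientation: nOrientations = 20 / 2 = 10
echo $(( 20 / 2 * xy_ncells * xy_ncells * t_ncells ))   # -> 200
```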

Yes :).

Internally, the tool works with floating-point values, also for dense sampling. The coordinates are rounded only at the moment pixel values are accessed.

We have sample points with five parameters (x, y, t, xy-scale, t-scale). xy-scale and t-scale denote the characteristic scale; xy-scale is also referred to as sigma, and t-scale as tau. We work in scale space, i.e., we can have two points at the exact same position, but at different scales. Different scales mean that the cuboid size of the descriptor differs. However, the characteristic scale does not directly give the size of the descriptor; for that, one needs to define a support size. More concretely, given a point (x, y, t, xy-scale, t-scale), the descriptor center position is at (x, y, t), and its width, height, and depth are computed as:

width = sigma-support * xy-scale

height = sigma-support * xy-scale

depth = tau-support * t-scale

where --sigma-support and --tau-support are two parameters that need to be chosen (we evaluated them in our paper). If we consider the original video unscaled, i.e., xy-scale = t-scale = 1, the support refers directly to the size of the descriptor:

width = sigma-support

height = sigma-support

depth = tau-support
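A small numeric sketch of these formulas, using the default support values from the --help output (sigma-support=24, tau-support=12) and made-up characteristic scales:

```shell
sigma_support=24; tau_support=12   # defaults from --help
xy_scale=2; t_scale=3              # example characteristic scales (made up)

echo "width:  $(( sigma_support * xy_scale ))"   # -> width:  48
echo "height: $(( sigma_support * xy_scale ))"   # -> height: 48
echo "depth:  $(( tau_support * t_scale ))"      # -> depth:  36
```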

... how can you do dense sampling with the tool? It's simple: you have 5 dimensions which you need to sample in a dense manner: x, y, t, spatial scale, temporal scale. --xy-stride/--xy-nstride cover x and y, --t-stride/--t-nstride cover t, --xy-scale covers the spatial scale (i.e., increasing spatial size of the descriptor), and --t-scale covers the temporal scale (i.e., increasing length of the descriptor). --xy-scale and --t-scale give the step size for dense sampling along the spatial and temporal scale dimensions. For example, set --xy-scale to 2; then the initial width/height of your descriptor will be doubled at every step. (Therefore you shouldn't set the scale parameters to 1, since something * 1 = something.)

Another note: when doing dense sampling, --tau-support and --sigma-support have no influence on the descriptor size, since the size will be determined by --xy-stride/--xy-nstride, --t-stride/--t-nstride, and --scale-overlap.

*... I am using the following parameters:*

extractFeatures --xy-ncells 4 --t-ncells 3 --npix 3 --xy-stride 8 --t-stride 6 --sigma-support 8 --tau-support 6 --cut-zero 0.25 -P icosahedron <video>

*With these parameters, I get a 450MB descriptor file. What are the optimal parameters?*

First note that the *-support parameters are actually of no use for dense sampling, since the size of the descriptors is induced from the sampling parameters. Note also that the default descriptor parameters (--xy-ncells, --t-ncells, ...) correspond to the optimized ones from our BMVC08 experiments. So there is no need to set them by hand.

There are several parameters for dense sampling; see 'dense sampling options' in the '--help' output below. *-stride is the absolute stride in pixels, *-nstride is the stride relative to the length/size of the video sequence. There is also *-max-scale (or *-max-stride), which controls at which temporal/spatial scale the sampling is stopped. --scale-overlap is by default set to 2, i.e., 50% overlap with neighboring descriptors (this is kind of standard in the literature). But pay attention: dense sampling in videos will give you **a lot of** descriptors (you sample in 5 dimensions: x, y, t, spatial scale, temporal scale). It helps to use the *-max-scale (or *-max-stride) options.

For dense sampling parameters, our BMVC09 video descriptor/detector evaluation might be of interest to you. Here is an example according to our experiments in that evaluation. We use 8 spatial scales {1, sqrt(2), 2, sqrt(2)*2, 4, sqrt(2)*4, 8, sqrt(2)*8} and 2 temporal ones {1, sqrt(2)}. The lowest spatial scale (=1) refers to a stride of 9x9 pixels (i.e., a descriptor size of 18x18 pixels). The lowest temporal scale (=1) refers to a stride of 5 frames (i.e., a descriptor length of 10 frames). Note that in our experiments we subsampled videos with standard sizes (around 600-800 pixels width and ~400 pixels height) by a factor of 2 (i.e., 300-400 width and ~200 height). For this setup you can use the following parameters:

extractFeatures --xy-stride 9 --xy-max-scale 12 --t-stride 5 --t-max-scale 2 <videoFile>

You can use the parameter "--simg 0.5" to scale down the video frames by a factor of 2. Hope these comments help getting you started :) .
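The 8 spatial scales above are just powers of sqrt(2). As a small sketch, the resulting x/y strides for --xy-stride 9 can be listed like this:

```shell
# Sketch: spatial scales sqrt(2)^k for k = 0..7 and the resulting x/y stride
# when the stride at the smallest scale is 9 pixels (as in the example above).
awk 'BEGIN {
  for (k = 0; k < 8; k++)
    printf "scale %.2f -> stride %.1f px\n", sqrt(2)^k, 9 * sqrt(2)^k
}'
# first line: scale 1.00 -> stride 9.0 px
```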

The following is the output with --help:

```
usage: extractFeatures [options] <video-file>
or (for testing): extractFeatures --dump-frame <frame-number> <video-file>

output format is:
<x> <y> <frame> <x-normalized> <y-norm.> <t-norm.> <xy-scale> <t-scale> <descriptor>

version: 1.3.0
For license information use the option '--license'.

command line arguments:
  --video-file arg          video file to process

general options:
  -h [ --help ]             produce help message
  --license                 print license information
  -v [ --verbose ]          verbose mode, the 3D boxes for which the descriptor
                            is computed will be printed to stderr
  -p [ --position-file ] arg
                            position file for sampling features; each line
                            represents one feature position and consists of 5
                            elements (they may be floating points) which are:
                            '<x-position> <y-position> <frame-number>
                            <xy-scale> <t-scale>'; lines beginning with '#'
                            are ignored
  -q [ --position-file2 ] arg
                            similar to --position-file, however with a
                            different format; each line consists of 6 elements
                            and describes a spatio-temporal cuboid for which
                            the descriptor is computed: '<x> <y> <frame-start>
                            <width> <height> <length>'; lines beginning with
                            '#' are ignored
  -t [ --track-file ] arg   track file for sampling features; each line
                            represents a bounding box at a given frame; a line
                            consists of 5 elements (they may be floating
                            points) which are: '<frame-number>
                            <top-left-x-position> <top-left-y-position>
                            <width> <height>'; lines beginning with '#' are
                            ignored; for a given position file, features with
                            a center point that lies outside the set of given
                            bounding boxes are ignored; for dense sampling,
                            the --xy-nstride and --t-nstride options will be
                            relative to the track (length and bounding boxes)
  --loose-track             denotes that the track file is a collection of
                            bounding boxes; for dense sampling, descriptors
                            will be computed for each given bounding box in
                            the track file
  --shot-file arg           simple file with shot boundaries (i.e., frame
                            number of beginning of a new shot) separated by
                            any whitespace character; features crossing shot
                            boundaries will be suppressed
  -f [ --force ]            force computation of features, no suppression if
                            features do not fit completely in the full
                            video/buffer
  --n-subsample arg (=1)    subsample every n-th feature (in average)
  --seed arg (=timestamp)   seed for subsampling (default current time)
  --dump-frame arg          this option lets you double check whether the
                            video is read correctly; give it a frame number
                            and it will save this frame from the video as
                            'frame<x>.png' in the current working directory,
                            then it will quit

descriptor options:
  --kth-optimized           by default, the parameter setting optimized on the
                            Hollywood2 training dataset (see explanations at
                            the end) is employed; if this flag is set, the
                            parameter settings optimized on the KTH training
                            dataset are being used
  --xy-ncells arg (=2)      number of HOG3D cells in x and y direction
  --t-ncells arg (=5)       number of HOG3D cells in time
  --npix arg (=4)           number of hyper pixel support for each HOG3D cell,
                            i.e., the histogram of a cell is computed for
                            SxSxS subblocks
  -n [ --norm-threshold ] arg (=0.1)
                            suppress features with a descriptor L2-norm lower
                            than the given threshold
  --cut-zero arg (=0.25)    descriptor vector is normalized, its values are
                            limited to the given value, and the vector is
                            renormalized again
  --sigma-support arg (=24) on the original scale, sigma determines the size
                            of the region around a sampling point for which
                            the descriptor is computed; for a different scale,
                            this size is given by:
                            <characteristic-scale>*<sigma-support>; sigma is
                            the support in x- and y-direction
  --tau-support arg (=12)   similar to 'sigma-support'; tau is the support in
                            time
  -P [ --quantization-type ] arg (=polar)
                            method that shall be taken for 3D gradient
                            quantization: 'icosahedron' (20 faces=bins),
                            'dodecahedron' (12 faces=bins), 'polar' (binning
                            on polar coordinates, you can specify the binning)
  --polar-bins-xy arg (=5)  number of bins for the XY-plane orientation using
                            polar coordinate quantization (has either full or
                            half orientation)
  --polar-bins-t arg (=3)   number of bins for the XT-plane orientation using
                            polar coordinate quantization (has always half
                            orientation)
  -F [ --full-orientation ] arg (=0)
                            by default, the half orientation is used (thus
                            resulting in half the number of bins); if this
                            flag is set, the full sphere is used for
                            quantization, thus doubling the number of bins
  -G [ --gauss-weight ] arg (=0)
                            by default, each (hyper) pixel has a weight = 1;
                            this flag enables Gaussian weighting similar to
                            the SIFT descriptor
  -O [ --overlap-cells ] arg (=0)
                            given this flag, cells in the descriptor will be
                            50% overlapping
  -N [ --norm-global ] arg (=0)
                            by default, each cell in the descriptor is
                            normalized; given this flag, normalization is
                            carried out on the complete descriptor
  --l1-norm arg (=0)        given this flag, each cell descriptor (or the full
                            descriptor if given '--norm-global') will be
                            normalized with the L1 norm; by default,
                            normalization is done using the L2 norm

dense sampling options:
  --xy-nstride arg          how many features are sampled in x/y direction on
                            the smallest scale (specify either xy-nstride or
                            xy-stride)
  --xy-stride arg           specifies the stride in x/y direction (in pixels)
                            on the smallest scale (specify either xy-nstride
                            or xy-stride)
  --xy-max-stride arg       specifies the maximum stride (and indirectly its
                            scale) for x/y
  --xy-max-scale arg        specifies the maximum scale for x/y
  --xy-scale arg (=sqrt(2)) scale factor for different scales in x/y direction
  --t-nstride arg           how many features are sampled in time on the
                            smallest scale (specify either t-nstride or
                            t-stride)
  --t-stride arg            specifies the stride in t direction (in frames) on
                            the smallest scale (specify either t-nstride or
                            t-stride)
  --t-max-stride arg        specifies the maximum stride (and indirectly its
                            scale) for t
  --t-max-scale arg         specifies the maximum scale for t
  --t-scale arg (=sqrt(2))  scale factor for different scales in time
  --scale-overlap arg (=2)  controls overlap of adjacent features, scales size
                            of 3D box; for a factor of 1, features will be
                            adjacent, any factor greater than 1 will cause
                            overlapping features; a factor of 2 will double
                            the size of the box (along each dimension) and
                            thus will result in an overlap of 50%

video options:
  -S [ --start-frame ] arg  if given, feature extraction starts at given frame
  -E [ --end-frame ] arg    if given, feature extraction ends at given frame
                            (including this frame)
  -s [ --start-time ] arg   if given, feature extraction starts at the first
                            frame at or after the given time (in seconds)
  -e [ --end-time ] arg     if given, feature extraction ends at the last
                            frame before or at the given time (in seconds)
  -b [ --buffer-length ] arg (=100)
                            length of the internal video buffer .. controls
                            the memory usage as well as the maximal scale in
                            time for features
  --simg arg (=1)           scale the input video by this given factor

descriptor parameters:
  By default, descriptor parameters are employed that have been learned on the
  training set of the Hollywood2 actions database. The setting is as follows:
    xy-ncells=2  t-ncells=5  npix=4  sigma-support=24  tau-support=12
    quantization-type=polar  polar-bins-xy=5  polar-bins-t=3  half-orientation
    normalization of cell histograms independently with L2 norm
  Optionally, a parameter setting learned on the KTH training set can be
  chosen by setting the option '--kth-optimized'. The parameters are then:
    xy-ncells=5  t-ncells=4  npix=4  sigma-support=16  tau-support=4
    quantization-type=icosahedron  half-orientation
    normalization of cell histograms independently with L2 norm
  More information can be obtained in my PhD thesis:
    Alexander Klaeser, Learning Human Actions in Video, July 2010
```

by Alexander Kläser 2010