Together with Marcin Marszałek, we published a novel spatio-temporal descriptor which we evaluated for action recognition. The descriptor, called HOG3D, is based on histograms of 3D gradient orientations and is thus similar in spirit to the popular SIFT descriptor.
However, working in 3D raises the question of how to actually bin a gradient in 3D. A straightforward way is to compute the magnitude and the spherical coordinates of the gradient; binning can then be done in spherical coordinate space. The resulting bins look like the grid of longitudes and latitudes on a globe, which leads to singularities at the poles, where the bins become progressively smaller. To avoid this, we proposed to use regular convex polyhedra (Platonic solids) for the quantization: all their faces are congruent and evenly distributed over the sphere, so each face can serve as one histogram bin. Furthermore, we introduced integral videos, which enable us to compute the descriptor at any spatial and temporal scale without additional memory cost.
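The two ideas above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the function names (`icosahedron_face_axes`, `quantize_gradient`, `integral_video`, `box_sum`) are hypothetical, the icosahedron with its 20 faces is only one possible choice of Platonic solid, and the thresholding step (subtracting the projection value shared by two neighbouring faces, so that a gradient aligned with a face centre votes for that face only) is an assumption made for this sketch. The integral video is shown for a single scalar channel; for gradients it would be built once per gradient component.

```python
import math
from itertools import product

PHI = (1 + math.sqrt(5)) / 2      # golden ratio
THRESHOLD = math.sqrt(5) / 3      # dot product of two neighbouring unit face axes


def icosahedron_face_axes():
    """Unit vectors through the 20 face centres of a regular icosahedron."""
    pts = [tuple(float(c) for c in s) for s in product((-1, 1), repeat=3)]
    for a, b in product((-1 / PHI, 1 / PHI), (-PHI, PHI)):
        pts += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]
    norm = math.sqrt(3)           # every point above has length sqrt(3)
    return [(x / norm, y / norm, z / norm) for x, y, z in pts]


def quantize_gradient(g, axes):
    """Spread a 3D gradient g = (gx, gy, gt) over the polyhedron faces."""
    mag = math.sqrt(sum(c * c for c in g))
    if mag == 0.0:
        return [0.0] * len(axes)
    # Project the normalised gradient on each face axis, then threshold
    # (assumption of this sketch) so that far-away faces receive no vote.
    q = [max(0.0, sum(a * c for a, c in zip(axis, g)) / mag - THRESHOLD)
         for axis in axes]
    qnorm = math.sqrt(sum(v * v for v in q))
    # Rescale so the histogram vote carries the gradient magnitude.
    return [mag * v / qnorm for v in q] if qnorm > 0.0 else q


def integral_video(vol):
    """Cumulative sums of vol[t][y][x], padded with a zero border."""
    T, H, W = len(vol), len(vol[0]), len(vol[0][0])
    ii = [[[0.0] * (W + 1) for _ in range(H + 1)] for _ in range(T + 1)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                ii[t + 1][y + 1][x + 1] = (
                    vol[t][y][x]
                    + ii[t][y + 1][x + 1] + ii[t + 1][y][x + 1] + ii[t + 1][y + 1][x]
                    - ii[t][y][x + 1] - ii[t][y + 1][x] - ii[t + 1][y][x]
                    + ii[t][y][x])
    return ii


def box_sum(ii, t0, t1, y0, y1, x0, x1):
    """Sum of the volume over [t0, t1) x [y0, y1) x [x0, x1) with 8 lookups."""
    return (ii[t1][y1][x1]
            - ii[t0][y1][x1] - ii[t1][y0][x1] - ii[t1][y1][x0]
            + ii[t0][y0][x1] + ii[t0][y1][x0] + ii[t1][y0][x0]
            - ii[t0][y0][x0])


axes = icosahedron_face_axes()
histogram_vote = quantize_gradient((0.3, -1.2, 0.5), axes)   # 20 bin weights
```

Once the integral video is built, the sum over any space-time box costs eight lookups regardless of the box size, which is why changing the spatial or temporal scale requires no additional precomputed data.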
To determine a suitable set of descriptor parameters, we optimized them via cross-validation on the training sets of common action recognition datasets. For my PhD manuscript, we updated the results and learned two sets of parameters, one on the training set of the KTH dataset and one on the training set of the Hollywood2 (HW2) actions dataset. The table below gives the final results, using Harris3D as interest point detector, in comparison with Ivan Laptev's HOG/HOF descriptor. The evaluation was carried out on four action recognition datasets (KTH, Weizmann, UCF, Hollywood2).
Descriptor | KTH | Weizmann | UCF | Hollywood2 |
---|---|---|---|---|
HOG3D (KTH optimized) | 92.6% | 90.7% | 68.3% | 45.1% |
HOG3D (HW2 optimized) | 89.5% | 85.6% | 68.1% | 48.6% |
HOG/HOF | 91.1% | 85.6% | 71.2% | 47.7% |
HOG | 81.9% | 75.3% | 68.0% | 38.2% |
HOF | 92.7% | 88.8% | 63.9% | 43.8% |
For more information, see also: