We provide a two implementations of the product quantizer search method described in this report.
This technique allows the search in a large vector dataset indexed in a limited amount of memory.
Matlab version
Only the ADC version is given in this package.
This toy implementation is provided for the sake of exposition only.
It is significantly slower than the C version.
This package was written by Herve Jegou and Matthijs Douze in July 2009.
Copyright (C) INRIA 2009.
This version is provided for research purposes only, without any support and with no guarantee.
Pre-compiled library version with sample test programe
The ADC, SDC and IVFADC variants are implemented.
This package is a pre-compiled version of our C library.
It is provided for three architectures: MacOS i386 (32bits), ELF32 (Linux 32bits) and ELF64 (Linux 64 bits)
This package was written by Herve Jegou and Matthijs Douze in July 2009.
Copyright (C) INRIA 2009.
This version is provided for research purposes only, without any support and with no guarantee.
Please contact us if you are interested in a commercial version.
Contact: Herve Jegou and Matthijs Douze
Evaluation datasets for approximate nearest neighbor search
Overview: We provide three evaluation sets: the two datasets used in the evaluation of
our Technical report no 7020:
and a small dataset used to quickly test the sample programs provided on this web page.
All sets are gzipped archives and each of them comprises 4 files of vectors:
base vectors: the vectors in which the search is performed
query vectors
learning vectors: to find the parameters involved in a particular method without overfitting to the base set
a groundtruth file: contains the pre-computed 100 nearest neighbors
The vector files are stored in .fvecs format, while the groundtruth file in is .ivecs format.
.fvecs and .ivecs vector file formats:
The vectors are stored in raw little endian. Each vector takes 4+d*4 bytes,
where d is the dimensionality of the vector, as shown below.
field
field type
description
d
int
the vector dimension
components
(float | int)*d
the vector components
The only difference between .fvecs and .ivecs is the base type for the
vector components, which is float or int, respectively.
In the matlab section below, we provide two functions
to read such files in matlab.
The groundtruth files contain, for each query, the
identifers (vector number, starting at 0) of its 100 nearest
neighbors, ordered by increasing distance.
Therefore, the first element of each integer vector is the nearest
neighbor identifier associated with the query.
Matlab ressources: I/O, kmeans and exact k-nearest neighbor search
We provide several matlab functions:
to read the .fvecs and .ivecs files formats mentionned above ;
to optimize some critical steps involved in nearest neighbor search.
Some of them are pre-compiled matlab packages created using MEX-files.
They replace several functions that have slow matlab implementations.
In particular, the proposed k-means is significantly faster than the one provided
in matlab (based on Blas/Lapack and multi-threaded).
These mex-files are available for Linux only (both 32 and 64 bits).