"""Python wrapper for FALCONN.

This is a Python wrapper for [FALCONN](http://falconn-lib.org/) that is
meant to be easy to use. It exposes classes `LSHIndex`,
`LSHConstructionParameters` and `QueryStatistics` together with
helper functions `get_default_parameters()` and
`compute_number_of_hash_functions()`.

For now, the Python wrapper supports only _static dense_ datasets;
support for other types of data is planned.

FALCONN is based on Locality-Sensitive Hashing (LSH), which is briefly
covered [here](https://github.com/FALCONN-LIB/FALCONN/wiki/LSH-Primer)
and [here](https://github.com/FALCONN-LIB/FALCONN/wiki/LSH-Families).

The main class is `LSHIndex`, which takes a dataset and builds an LSH
data structure. The dataset is represented as a two-dimensional
[NumPy](http://www.numpy.org/) array with data points being _rows_.
Since FALCONN uses NumPy arrays
only as a convenient way to store and pass around data, it does not
matter if NumPy is compiled with a fancy BLAS library: this has zero
effect on the performance of FALCONN.

To construct an instance of `LSHIndex`, one needs to prepare an instance
of `LSHConstructionParameters`, which stores parameters used to build the
data structure. To get a sense of the parameters, see
[here](https://github.com/FALCONN-LIB/FALCONN/wiki/How-to-Use-FALCONN).
To get a reasonable setting of parameters, one can (and should!) use
two helper functions we provide: `get_default_parameters()` and
`compute_number_of_hash_functions()`.

Besides the documentation, we provide two examples:

* [here](https://github.com/FALCONN-LIB/FALCONN/tree/master/src/benchmark)
the LSH data structure for a random dataset is built and used;
* [here](https://github.com/FALCONN-LIB/FALCONN/tree/master/examples/glove)
we show how to use LSH to perform similarity search on a GloVe dataset.

An intended workflow is as follows (see the sketch at the end of this
docstring):

* first, use `get_default_parameters()` to get a reasonable setting of
parameters (you can tune them later if necessary);
* second, create an instance of `LSHIndex` passing the
`LSHConstructionParameters` object we've just built;
* third, call the `setup()` method of the `LSHIndex` object,
passing the dataset;
* then, construct a query object by calling the
`construct_query_object()` method of the `LSHIndex` object;
* then, increase the number of table probes using the `set_num_probes()`
method of the query object until you get the desired accuracy on
a sample set of queries;
* finally, use the other methods of the query object
to execute queries.

If you would like to query `LSHIndex` from several threads, create
a dedicated query object per thread using `construct_query_object()`.
Alternatively, one can build a query pool using the
`construct_query_pool()` method of `LSHIndex`; the resulting pool can
be safely queried from several threads in parallel.

A couple of useful tricks:

* Cast your dataset to `numpy.float32` (by doing
`dataset = dataset.astype(numpy.float32)`);
* Center your dataset and your queries (by doing
`center = numpy.mean(dataset, axis=0)`, then `dataset -= center`, and
subtracting the same `center` from every query): this has no effect on
the correctness, but greatly improves the performance of our LSH
families. Don't forget to use the Euclidean distance in this case.
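
A minimal end-to-end sketch of this workflow on a random dataset (the
dataset, the probe count, and the parameter choices are illustrative,
not tuned):

```python
import numpy as np
import falconn

# Random dataset: 10000 points in 128 dimensions, cast to float32.
dataset = np.random.randn(10000, 128).astype(np.float32)
center = np.mean(dataset, axis=0)
dataset -= center  # centering improves our LSH families

params = falconn.get_default_parameters(dataset.shape[0],
                                        dataset.shape[1])
index = falconn.LSHIndex(params)
index.setup(dataset)

query_object = index.construct_query_object()
# Increase the number of probes until the accuracy on a sample of
# queries is acceptable; params.l is the minimum allowed value.
query_object.set_num_probes(2 * params.l)

# Queries must be centered with the same mean as the dataset.
query = np.random.randn(128).astype(np.float32) - center
nearest_key = query_object.find_nearest_neighbor(query)
```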
"""
import numpy as _numpy
import _falconn as _internal
from _falconn import (LSHConstructionParameters, QueryStatistics,
                      DistanceFunction, LSHFamily, StorageHashTable,
                      get_default_parameters,
                      compute_number_of_hash_functions)


class Queryable:
    """A simple wrapper for query objects and query pools.
    
    Instances of `Queryable` are returned by the methods
    `construct_query_object` and `construct_query_pool` of `LSHIndex`.
    You are not expected to construct instances of `Queryable` directly.
    """

    def __init__(self, inner_entity, parent):
        self._inner_entity = inner_entity
        self._parent = parent

    def _check_query(self, query):
        if not isinstance(query, _numpy.ndarray):
            raise TypeError('query must be an instance of numpy.ndarray')
        if len(query.shape) != 1:
            raise ValueError('query must be one-dimensional')
        if self._parent._dataset.dtype != query.dtype:
            raise TypeError('dataset and query must have the same dtype')
        if query.shape[0] != self._parent._params.dimension:
            raise ValueError(
                'query dimension mismatch: {} expected, but {} found'.format(
                    self._parent._params.dimension, query.shape[0]))

    def find_k_nearest_neighbors(self, query, k):
        """Retrieve the closest `k` neighbors to `query`.

        Finds the keys of the `k` closest candidates in the probing
        sequence for the query. The keys are returned in order of
        increasing distance to the query.

        Arguments:

        * `query`: a query given as a one-dimensional NumPy array of the same
        `dtype` as the dataset; the dimension of `query` must match
        the second dimension of the dataset;
        * `k`: the number of closest candidates to retrieve.
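
        A hedged usage sketch (`query_object`, `query` and `dataset` as
        in the module-level example):

        ```python
        # Keys of the 10 closest candidates, ordered by increasing
        # distance to the query; keys are row indices into the dataset.
        neighbor_keys = query_object.find_k_nearest_neighbors(query, 10)
        neighbors = dataset[neighbor_keys]
        ```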
        """
        self._check_query(query)
        if k <= 0:
            raise ValueError('k must be positive rather than {}'.format(k))
        return self._inner_entity.find_k_nearest_neighbors(query, k)

    def find_near_neighbors(self, query, threshold):
        """Find all the points within `threshold` distance from `query`.

        Returns the keys corresponding to the candidates in the probing
        sequence for the query that have distance to `query` at most `threshold`.

        Arguments:

        * `query`: a query given as a one-dimensional NumPy array of the same
        `dtype` as the dataset; the dimension of `query` must match
        the second dimension of the dataset;
        * `threshold`: a distance threshold up to which we enumerate
        the candidates; negative thresholds would make sense for the
        distance function `'negative_inner_product'`, but this wrapper
        currently requires `threshold` to be non-negative.
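
        A hedged usage sketch (assuming the `'euclidean_squared'`
        distance, so the threshold is a *squared* Euclidean distance;
        the value `10.0` is illustrative):

        ```python
        keys = query_object.find_near_neighbors(query, 10.0)
        ```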
        """
        self._check_query(query)
        if threshold < 0:
            raise ValueError('threshold must be non-negative rather than {}'.
                             format(threshold))
        return self._inner_entity.find_near_neighbors(query, threshold)

    def find_nearest_neighbor(self, query):
        """Find the key of the closest candidate.

        Finds the key of the closest candidate in the probing sequence
        for a query.

        Arguments:

        * `query`: a query given as a one-dimensional NumPy array of the same
        `dtype` as the dataset; the dimension of `query` must match
        the second dimension of the dataset.
        """
        self._check_query(query)
        return self._inner_entity.find_nearest_neighbor(query)

    def get_candidates_with_duplicates(self, query):
        """Retrieve all the candidates for a given query.

        Returns the keys of all candidates in the probing sequence for
        the query. If a candidate key is found in multiple tables, it will
        appear multiple times in the result. The candidates are
        returned in the order in which they appear in the probing
        sequence.

        Arguments:

        * `query`: a query given as a one-dimensional NumPy array of the same
        `dtype` as the dataset; the dimension of `query` must match
        the second dimension of the dataset.
        """
        self._check_query(query)
        return self._inner_entity.get_candidates_with_duplicates(query)

    def get_max_num_candidates(self):
        """Get the maximum number of candidates considered in each query."""
        return self._inner_entity.get_max_num_candidates()

    def get_num_probes(self):
        """Get the number of probes used for each query."""
        return self._inner_entity.get_num_probes()

    def get_query_statistics(self):
        """Return the query statistics.

        Returns an instance of `QueryStatistics`
        that summarizes the statistics for queries
        made so far (after the last `reset_query_statistics()`
        or the construction).
        """
        return self._inner_entity.get_query_statistics()

    def get_unique_candidates(self, query):
        """Retrieve all the candidates (each at most once) for a query.

        Returns the keys of all candidates in the probing sequence for
        the query. If a candidate key is found in multiple tables, it will
        appear only once in the result.

        Arguments:

        * `query`: a query given as a one-dimensional NumPy array of the same
        `dtype` as the dataset; the dimension of `query` must match
        the second dimension of the dataset.
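
        A hedged sketch of manual filtering on top of the raw candidates
        (`dataset`, `query_object` and `query` as in the module-level
        example):

        ```python
        import numpy as np

        candidates = query_object.get_unique_candidates(query)
        # Rerank the candidates by their true distance to the query.
        dists = np.linalg.norm(dataset[candidates] - query, axis=1)
        best_key = candidates[int(np.argmin(dists))]
        ```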
        """
        self._check_query(query)
        return self._inner_entity.get_unique_candidates(query)

    def reset_query_statistics(self):
        """Reset the query statistics."""
        self._inner_entity.reset_query_statistics()

    def set_max_num_candidates(self, max_num_candidates=-1):
        """Set the maximum number of candidates considered in each query.

        The value `-1` indicates that all candidates retrieved
        in the probing sequence should be considered. This is the
        default and usually a good setting. A maximum number of
        candidates is mainly useful to give worst-case running time
        guarantees for every query.

        Arguments:

        * `max_num_candidates` (default `-1`): the maximum number of
        candidates.
        """
        if max_num_candidates < -1:
            raise ValueError(
                'invalid max_num_candidates: {}'.format(max_num_candidates))
        self._inner_entity.set_max_num_candidates(max_num_candidates)

    def set_num_probes(self, num_probes):
        """Set the number of probes used for each query.

        The default setting is the number of hash tables (`l` in the
        construction parameters), which effectively disables multiprobing
        (the probing sequence only contains a single candidate per table).

        Arguments:

        * `num_probes`: the total number of probes per query.
        """
        if num_probes < self._parent._params.l:
            raise ValueError(
                'number of probes must be at least the number of tables ({})'.
                format(self._parent._params.l))
        self._inner_entity.set_num_probes(num_probes)


class LSHIndex:
    """The main class that represents the LSH data structure.

    To construct an instance of `LSHIndex`, one needs to pass an instance
    of `LSHConstructionParameters`. During the construction, the
    parameters are not (yet) checked for correctness.

    After creating an instance of `LSHIndex`, one needs to call
    `setup()` passing a dataset. A dataset must be a two-dimensional
    NumPy array with dtype `numpy.float32` or `numpy.float64`. Rows
    of the array are interpreted as data points. Thus, the second
    dimension of the array must match the dimension from the parameters.

    We recommend converting a dataset to `numpy.float32` and centering
    it: both tricks usually improve the performance a lot.

    After building the LSH data structures, one needs to construct
    a query object by calling `construct_query_object()`.
    Then, one needs to set the number
    of probes by calling the `set_num_probes()` method of the query object
    to achieve the desired accuracy
    (the more probes, the better the accuracy). This can be done
    empirically on a sample set of queries using binary search, as in
    the sketch below.
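
    A hedged sketch of such a binary search (the `evaluate` helper, the
    sample `queries` with ground-truth `answers`, the search range, and
    the target accuracy of 0.9 are all illustrative; it assumes the
    accuracy grows monotonically with the number of probes):

    ```python
    def evaluate(query_object, queries, answers):
        # Fraction of queries whose true nearest neighbor is found.
        correct = sum(1 for q, a in zip(queries, answers)
                      if query_object.find_nearest_neighbor(q) == a)
        return correct / len(queries)

    lo, hi = params.l, 100 * params.l  # params.l is the minimum allowed
    while lo < hi:
        mid = (lo + hi) // 2
        query_object.set_num_probes(mid)
        if evaluate(query_object, queries, answers) >= 0.9:
            hi = mid
        else:
            lo = mid + 1
    query_object.set_num_probes(lo)
    ```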

    Then, one can use the following member functions of the query object:

    * `find_k_nearest_neighbors()`
    * `find_near_neighbors()`
    * `find_nearest_neighbor()`
    * `get_candidates_with_duplicates()`
    * `get_unique_candidates()`

    to execute queries.

    In short, the LSH data structure works as follows. First, it generates
    a (hopefully short) list of candidate points and then filters this list
    according to the distance. The query object provides low-level functions
    that just generate the list of candidates:

    * `get_candidates_with_duplicates()`
    * `get_unique_candidates()`

    as well as high-level functions that filter this list for you:

    * `find_nearest_neighbor()`
    * `find_k_nearest_neighbors()`
    * `find_near_neighbors()`

    In addition to this, the query object exposes the
    following helper functions:

    * `get_max_num_candidates()`
    * `get_num_probes()`
    * `get_query_statistics()`
    * `reset_query_statistics()`
    * `set_max_num_candidates()`.

    See their respective documentation for help.

    Alternatively, one can call the method `construct_query_pool()`
    to construct a *pool* of query objects, which can be safely queried
    in parallel. The vanilla query object is not thread-safe, so each
    thread must have its own query object.
    """

    def __init__(self, params):
        """Initialize with an instance of `LSHConstructionParameters`.

        Arguments:

        * `params`: an instance of `LSHConstructionParameters`
        """
        # TODO: check params for correctness
        self._params = params
        self._dataset = None
        self._table = None

    def setup(self, dataset):
        """Build the LSH data structure from a given dataset.

        The method builds the LSH data structure using the parameters
        passed during the construction and a given dataset (which is
        retained as `self._dataset`).

        A dataset must be a two-dimensional
        NumPy array with `dtype` `numpy.float32` or `numpy.float64`.
        The rows
        of the array are interpreted as data points. Thus, the second
        dimension of the array must match the dimension from parameters
        (`self._params.dimension`).

        An important caveat: **DON'T DELETE THE DATASET WHILE USING
        `LSHIndex`**. This can lead to silent crashes and can be very
        confusing.

        Arguments:

        * `dataset`: a two-dimensional NumPy array with dtype
        `numpy.float32` or `numpy.float64`; the second dimension must match the
        dimension from the LSH parameters (`self._params.dimension`).
        """
        if self._dataset is not None or self._table is not None:
            raise RuntimeError('setup() has already been called')
        if not isinstance(dataset, _numpy.ndarray):
            raise TypeError('dataset must be an instance of numpy.ndarray')
        if len(dataset.shape) != 2:
            raise ValueError('dataset must be a two-dimensional array')
        if dataset.dtype != _numpy.float32 and dataset.dtype != _numpy.float64:
            raise TypeError('dataset must consist of floats or doubles')
        if dataset.shape[1] != self._params.dimension:
            raise ValueError(
                'dataset dimension mismatch: {} expected, but {} found'.format(
                    self._params.dimension, dataset.shape[1]))
        self._dataset = dataset
        if dataset.dtype == _numpy.float32:
            self._table = _internal.construct_table_dense_float(
                dataset, self._params)
        else:
            self._table = _internal.construct_table_dense_double(
                dataset, self._params)

    def _check_built(self):
        if self._dataset is None or self._table is None:
            raise RuntimeError('LSH table is not built (use setup())')

    def construct_query_object(self, num_probes=-1, max_num_candidates=-1):
        """Construct a query object.

        This method constructs and returns a query object, which can be used
        to query the LSH data structure. The query object is not thread-safe;
        for a thread-safe version, see the `construct_query_pool` method.
        Alternatively, you can construct a separate query object per thread.

        Arguments:

        * `num_probes` (default `-1`): the number of buckets the query algorithm
        probes. This number can later be modified using the `set_num_probes`
        method of the query object. The higher the number of probes, the better
        the accuracy, but the slower the queries. If `num_probes` is equal to
        `-1`, then we probe one bucket per table (i.e., `params.l` buckets
        in total);
        * `max_num_candidates` (default `-1`): the maximum number of candidate
        points we retrieve. The value `-1` means the number of candidates is
        unbounded.
        """
        self._check_built()
        return Queryable(
            self._table.construct_query_object(num_probes, max_num_candidates),
            self)

    def construct_query_pool(self,
                             num_probes=-1,
                             max_num_candidates=-1,
                             num_query_objects=0):
        """Construct a pool of query objects.

        This method constructs and returns a pool of query objects, which
        can be used to query the LSH data structure from several threads.

        Arguments:

        * `num_probes` (default `-1`): the number of buckets the query algorithm
        probes. This number can later be modified using the `set_num_probes`
        method of the query object. The higher the number of probes, the better
        the accuracy, but the slower the queries. If `num_probes` is equal to
        `-1`, then we probe one bucket per table (i.e., `params.l` buckets
        in total);
        * `max_num_candidates` (default `-1`): the maximum number of candidate
        points we retrieve. The value `-1` means the number of candidates is
        unbounded;
        * `num_query_objects` (default `0`): the number of query objects in the
        pool. The value `0` sets the number of query objects to twice the
        number of hardware threads on the machine.
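
        A hedged sketch of parallel querying with a pool (`index`, `params`
        and `queries` as in the module-level example; the number of workers
        is illustrative, and it is assumed that the underlying queries
        release the GIL so the threads actually run concurrently):

        ```python
        from concurrent.futures import ThreadPoolExecutor

        pool = index.construct_query_pool(num_query_objects=8)
        pool.set_num_probes(2 * params.l)
        with ThreadPoolExecutor(max_workers=8) as executor:
            keys = list(executor.map(pool.find_nearest_neighbor, queries))
        ```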
        """
        self._check_built()
        return Queryable(
            self._table.construct_query_pool(num_probes, max_num_candidates,
                                             num_query_objects), self)
