Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data

Renga Bashyam, KG and Vadhiyar, S (2020) Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data. In: 22nd IEEE International Conference on Cluster Computing, CLUSTER 2020, 14-17 September 2020, Kobe; Japan, pp. 294-302.

pro_iee_int_con_clu_com_icc_2020_294-302_2020.pdf - Published Version
Restricted to Registered users only

Official URL: http://dx.doi.org/10.1109/CLUSTER49012.2020.00040

Abstract

K-Nearest Neighbor (k-NN) search is one of the most commonly used approaches for similarity search. It finds extensive applications in machine learning and data mining. This era of big data warrants efficiently scaling k-NN search algorithms for billion-scale datasets with high dimensionality. In this paper, we propose a solution towards this end where we use vantage point trees for partitioning the dataset across multiple processes and exploit an existing graph-based sequential approximate k-NN search algorithm called HNSW (Hierarchical Navigable Small World) for searching locally within a process. Our hybrid MPI-OpenMP solution employs techniques including exploiting MPI one-sided communication for reducing communication times and partition replication for better load balancing across processes. We demonstrate computation of k-NN for 10,000 queries in the order of seconds using our approach on ~8000 cores on a dataset with a billion points in a 128-dimensional space. We also show a 10X speedup over a completely k-d tree-based solution for the same dataset, thus demonstrating better suitability of our solution for high-dimensional datasets. Our solution shows almost linear strong scaling. © 2020 IEEE.
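
The partition-then-search-locally idea in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names are hypothetical, the partitions are plain Python lists rather than MPI processes, and an exact brute-force scan stands in for HNSW as the local search. The sketch also queries every partition, since the abstract does not say how queries are routed to partitions.

```python
import heapq
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def vp_partition(points, depth, rng):
    """Recursively split `points` into up to 2**depth partitions.

    At each level a random vantage point is chosen and the points are
    split at the median distance to it: the nearer half goes to one
    side, the farther half to the other.
    """
    if depth == 0 or len(points) < 2:
        return [points]
    vantage = rng.choice(points)
    ordered = sorted(points, key=lambda p: distance(p, vantage))
    mid = len(ordered) // 2
    return (vp_partition(ordered[:mid], depth - 1, rng)
            + vp_partition(ordered[mid:], depth - 1, rng))


def local_knn(partition, query, k):
    """Exact k-NN inside one partition (brute-force stand-in for HNSW)."""
    return heapq.nsmallest(k, ((distance(p, query), p) for p in partition))


def global_knn(partitions, query, k):
    """Merge the per-partition candidates into a global top-k."""
    candidates = []
    for part in partitions:
        candidates.extend(local_knn(part, query, k))
    return heapq.nsmallest(k, candidates)


if __name__ == "__main__":
    rng = random.Random(0)
    # Small synthetic dataset: 200 points in 8 dimensions (assumed sizes).
    points = [tuple(rng.random() for _ in range(8)) for _ in range(200)]
    partitions = vp_partition(points, depth=3, rng=rng)
    query = tuple(rng.random() for _ in range(8))
    for dist, point in global_knn(partitions, query, k=5):
        print(round(dist, 4), point[:2])
```

Because each partition returns its own k best candidates and every partition is searched, the merged result here matches a brute-force scan of the whole dataset. Replacing `local_knn` with an approximate index, or searching only some partitions, would trade that exactness for speed.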

Item Type: Conference Paper
Publication: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Publisher: Institute of Electrical and Electronics Engineers Inc.
Additional Information: The copyright of this paper belongs to Institute of Electrical and Electronics Engineers Inc.
Keywords: Application programming interfaces (API); Balancing; Cluster computing; Clustering algorithms; Data mining; Graph algorithms; Graphic methods; Large dataset; Learning algorithms; Trees (mathematics), High dimensional data; High dimensional datasets; High dimensionality; K-nearest neighbors; Multiple process; One sided communication; Similarity search; Vantage-point trees, Nearest neighbor search
Department/Centre: Division of Interdisciplinary Sciences > Computational and Data Sciences
Date Deposited: 19 Jan 2021 05:36
Last Modified: 19 Jan 2021 05:36
URI: http://eprints.iisc.ac.in/id/eprint/67386
