Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data

Renga Bashyam, KG and Vadhiyar, S (2020) Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data. In: 22nd IEEE International Conference on Cluster Computing, CLUSTER 2020, 14-17 September 2020, Kobe, Japan, pp. 294-302.


Official URL: http://dx.doi.org/10.1109/CLUSTER49012.2020.00040

Abstract

K-Nearest Neighbor (k-NN) search is one of the most commonly used approaches for similarity search. It finds extensive applications in machine learning and data mining. This era of big data warrants efficiently scaling k-NN search algorithms for billion-scale datasets with high dimensionality. In this paper, we propose a solution towards this end where we use vantage point trees for partitioning the dataset across multiple processes and exploit an existing graph-based sequential approximate k-NN search algorithm called HNSW (Hierarchical Navigable Small World) for searching locally within a process. Our hybrid MPI-OpenMP solution employs techniques including exploiting MPI one-sided communication for reducing communication times and partition replication for better load balancing across processes. We demonstrate computation of k-NN for 10,000 queries in the order of seconds using our approach on ~8000 cores on a dataset with a billion points in a 128-dimensional space. We also show 10X speedup over a completely k-d tree-based solution for the same dataset, thus demonstrating better suitability of our solution for high-dimensional datasets. Our solution shows almost linear strong scaling. © 2020 IEEE.
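
The partition-then-search-locally idea in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names (`build_vp_tree`, `route`, `local_knn`) are hypothetical, each leaf merely stands in for the partition one process would own, the vantage point is picked at random, and an exact brute-force scan replaces HNSW as the local search. MPI communication, OpenMP threading and partition replication are left out entirely.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_vp_tree(points, ids, depth, rng):
    """Split ids into 2**depth partitions by median distance to a vantage point.

    Assumption: the vantage point is chosen uniformly at random.
    """
    if depth == 0 or len(ids) < 2:
        return {"leaf": ids}
    vp = points[rng.choice(ids)]
    ordered = sorted(ids, key=lambda i: dist(points[i], vp))
    mid = len(ordered) // 2
    radius = dist(points[ordered[mid]], vp)  # median distance to the vantage point
    return {
        "vp": vp,
        "radius": radius,
        "inner": build_vp_tree(points, ordered[:mid], depth - 1, rng),
        "outer": build_vp_tree(points, ordered[mid:], depth - 1, rng),
    }


def route(tree, q):
    """Follow the tree to the single partition on the query's side of each split."""
    while "leaf" not in tree:
        tree = tree["inner"] if dist(q, tree["vp"]) < tree["radius"] else tree["outer"]
    return tree["leaf"]


def local_knn(points, ids, q, k):
    """Brute-force k-NN within one partition (a stand-in for HNSW)."""
    return sorted(ids, key=lambda i: dist(points[i], q))[:k]
```

Calling `route(tree, q)` and then `local_knn(points, partition, q, k)` answers a query from one partition only, which is why the result is approximate: true neighbors lying just across a split boundary are missed unless neighboring partitions are also searched.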

Item Type:	Conference Paper
Publication:	Proceedings - IEEE International Conference on Cluster Computing, ICCC
Publisher:	Institute of Electrical and Electronics Engineers Inc.
Additional Information:	The copyright of this paper belongs to Institute of Electrical and Electronics Engineers Inc.
Keywords:	Application programming interfaces (API); Balancing; Cluster computing; Clustering algorithms; Data mining; Graph algorithms; Graphic methods; Large dataset; Learning algorithms; Trees (mathematics); High dimensional data; High dimensional datasets; High dimensionality; K-nearest neighbors; Multiple process; One sided communication; Similarity search; Vantage-point trees; Nearest neighbor search
Department/Centre:	Division of Interdisciplinary Sciences > Computational and Data Sciences
Date Deposited:	19 Jan 2021 05:36
Last Modified:	19 Jan 2021 05:36
URI:	http://eprints.iisc.ac.in/id/eprint/67386


