Renga Bashyam, KG and Vadhiyar, S (2020) Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data. In: 22nd IEEE International Conference on Cluster Computing, CLUSTER 2020, 14-17 September 2020, Kobe, Japan, pp. 294-302.
Abstract
K-Nearest Neighbor (k-NN) search is one of the most commonly used approaches for similarity search. It finds extensive applications in machine learning and data mining. This era of big data warrants efficiently scaling k-NN search algorithms for billion-scale datasets with high dimensionality. In this paper, we propose a solution towards this end where we use vantage point trees for partitioning the dataset across multiple processes and exploit an existing graph-based sequential approximate k-NN search algorithm called HNSW (Hierarchical Navigable Small World) for searching locally within a process. Our hybrid MPI-OpenMP solution employs techniques including exploiting MPI one-sided communication for reducing communication times and partition replication for better load balancing across processes. We demonstrate computation of k-NN for 10,000 queries in the order of seconds using our approach on ~8000 cores on a dataset with a billion points in a 128-dimensional space. We also show a 10X speedup over a completely k-d tree-based solution for the same dataset, thus demonstrating better suitability of our solution for high-dimensional datasets. Our solution shows almost linear strong scaling. © 2020 IEEE.
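The partition-then-search-locally idea in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation: the function names (`vp_partition`, `local_knn`, `knn`) are hypothetical, a brute-force scan stands in for HNSW, every partition is searched instead of routing each query to a subset of processes, and there is no MPI, OpenMP, or partition replication. It only shows how a vantage point splits data by median distance and how per-partition results are merged.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def vp_partition(points, depth, rng):
    """Split points into up to 2**depth partitions.

    At each level a random vantage point is chosen and the points are
    divided at the median distance from it (assumption: the paper may
    choose vantage points and split thresholds differently).
    """
    if depth == 0 or len(points) < 2:
        return [points]
    vp = points[rng.randrange(len(points))]
    ordered = sorted(points, key=lambda p: dist(vp, p))
    mid = len(ordered) // 2
    near, far = ordered[:mid], ordered[mid:]
    return vp_partition(near, depth - 1, rng) + vp_partition(far, depth - 1, rng)


def local_knn(partition, query, k):
    """Brute-force k-NN inside one partition (stand-in for HNSW)."""
    return sorted(partition, key=lambda p: dist(query, p))[:k]


def knn(partitions, query, k):
    """Search every partition locally, then merge the candidate lists."""
    candidates = []
    for part in partitions:
        candidates.extend(local_knn(part, query, k))
    return sorted(candidates, key=lambda p: dist(query, p))[:k]
```

In a distributed setting each partition would live on a separate process, and the merge step would combine results received from the processes that were queried. Because this sketch searches all partitions with an exact local search, its answers are exact; an approximate local index and selective routing trade some accuracy for speed.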
Item Type: | Conference Paper |
---|---|
Publication: | Proceedings - IEEE International Conference on Cluster Computing, ICCC |
Publisher: | Institute of Electrical and Electronics Engineers Inc. |
Additional Information: | The copyright of this paper belongs to Institute of Electrical and Electronics Engineers Inc. |
Keywords: | Application programming interfaces (API); Balancing; Cluster computing; Clustering algorithms; Data mining; Graph algorithms; Graphic methods; Large dataset; Learning algorithms; Trees (mathematics), High dimensional data; High dimensional datasets; High dimensionality; K-nearest neighbors; Multiple process; One sided communication; Similarity search; Vantage-point trees, Nearest neighbor search |
Department/Centre: | Division of Interdisciplinary Sciences > Computational and Data Sciences |
Date Deposited: | 19 Jan 2021 05:36 |
Last Modified: | 19 Jan 2021 05:36 |
URI: | http://eprints.iisc.ac.in/id/eprint/67386 |