Fu, Xiao and Huang, Kejun and Papalexakis, Evangelos E. and Song, Hyun Ah and Talukdar, Partha and Sidiropoulos, Nicholas D. and Faloutsos, Christos and Mitchell, Tom (2019) Efficient and Distributed Generalized Canonical Correlation Analysis for Big Multiview Data. In: IEEE Transactions on Knowledge and Data Engineering, 31 (12). pp. 2304-2318.
PDF: Iee_Tra_Kno_Dat_Eng_31-12_2304.pdf - Published Version (639kB; restricted to registered users)
Abstract
Generalized canonical correlation analysis (GCCA), an extension of classical two-view CCA, integrates information from data samples acquired in multiple feature spaces (or 'views') to produce low-dimensional representations. Since the 1960s, (G)CCA has attracted much attention in statistics, machine learning, and data mining because of its importance in data analytics. Despite these efforts, the existing GCCA algorithms have serious complexity issues: their memory and computational costs usually grow as quadratic and cubic functions of the problem dimension (the number of samples / features), respectively. For example, handling views with ≈ 1,000 features using such algorithms already requires ≈ 10^6 memory, with a per-iteration complexity of ≈ 10^9 flops, which makes it hard to push these methods much further. To circumvent such difficulties, we first propose a GCCA algorithm whose memory and computational costs scale linearly in the problem dimension and the number of nonzero data elements, respectively. Consequently, the proposed algorithm can easily handle very large sparse views whose sample and feature dimensions both exceed ≈ 100,000. Our second contribution is a pair of distributed GCCA algorithms that compute the canonical components of different views in parallel and thus can further reduce the runtime significantly when multiple computing agents are available. We provide detailed convergence analyses and show that all of the proposed large-scale GCCA algorithms converge to a Karush-Kuhn-Tucker (KKT) point at least sublinearly. Judiciously designed synthetic and real-data experiments showcase the effectiveness of the proposed algorithms.
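The record contains no code. To make the abstract's scaling argument concrete, below is a minimal, illustrative sketch of a naive solver for the MAX-VAR formulation of GCCA (one common GCCA formulation); it is not the paper's proposed scalable or distributed algorithm, and all function and variable names are hypothetical. It shows where the quadratic memory (an L x L matrix over L samples) and cubic flop count (a dense eigendecomposition) that the abstract criticizes come from.

```python
# Illustrative sketch only: naive MAX-VAR GCCA on small dense views.
# NOT the paper's algorithm; included to show O(L^2) memory and O(L^3) flops.
import numpy as np

def naive_maxvar_gcca(views, k, reg=1e-6):
    """views: list of (L x M_i) matrices sharing L samples; k: number of components."""
    L = views[0].shape[0]
    # Sum of (regularized) projections onto each view's column space: an L x L matrix,
    # which is the quadratic-memory bottleneck for large sample counts L.
    M = np.zeros((L, L))
    for X in views:
        # (X^T X + reg*I)^{-1} X^T via a linear solve in the feature dimension.
        G = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T)
        M += X @ G
    # Shared low-dimensional representation: top-k eigenvectors of M,
    # a dense O(L^3) eigendecomposition -- the cubic-cost bottleneck.
    eigvals, eigvecs = np.linalg.eigh(M)
    G_common = eigvecs[:, -k:]
    # Per-view canonical loadings mapping each view onto the shared representation.
    Q = [np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ G_common)
         for X in views]
    return G_common, Q

# Tiny usage example with random data: 3 views over 100 shared samples.
rng = np.random.default_rng(0)
views = [rng.standard_normal((100, m)) for m in (20, 30, 25)]
G, Q = naive_maxvar_gcca(views, k=5)
```

With L samples, the L x L matrix and its eigendecomposition already become impractical around L ≈ 100,000, which is the regime the paper's linear-cost and distributed algorithms are designed to handle.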
| Item Type: | Journal Article |
| --- | --- |
| Publication: | IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING |
| Publisher: | IEEE COMPUTER SOC |
| Additional Information: | Copyright of this article belongs to IEEE COMPUTER SOC |
| Keywords: | Distributed algorithms; Sparse matrices; Correlation; Machine learning algorithms; Electronic mail; Data mining; Machine learning; Generalized canonical correlation analysis; multiview learning; multilingual word embedding; distributed GCCA |
| Department/Centre: | Division of Electrical Sciences > Computer Science & Automation; Division of Interdisciplinary Sciences > Computational and Data Sciences |
| Date Deposited: | 23 Dec 2019 09:40 |
| Last Modified: | 23 Dec 2019 09:40 |
| URI: | http://eprints.iisc.ac.in/id/eprint/64098 |