Learning and Evaluating Contextual Embedding of Source Code

Kanade, A and Maniatis, P and Balakrishnan, G and Shi, K (2020) Learning and Evaluating Contextual Embedding of Source Code. In: 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, pp. 5066-5077.

Full text not available from this repository.


Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline. © 2020 by the Authors.

Item Type: Conference Paper
Publication: 37th International Conference on Machine Learning, ICML 2020
Publisher: International Machine Learning Society (IMLS)
Additional Information: The copyright for this article belongs to International Machine Learning Society (IMLS)
Keywords: Budget control; Computer programming languages; Computer software; Machine learning; Classification tasks; Code understanding; Multiple program; Natural language understanding; Natural languages; Recent researches; Training budgets; Transformer models; Embeddings
Department/Centre: Division of Electrical Sciences > Computer Science & Automation
Date Deposited: 04 Aug 2021 06:22
Last Modified: 04 Aug 2021 06:22
URI: http://eprints.iisc.ac.in/id/eprint/68987
