Runtime Programmable and Memory Bandwidth Optimized FPGA-Based Coprocessor for Deep Convolutional Neural Network

Shah, Nimish and Chaudhari, Paragkumar and Varghese, Kuruvilla (2018) Runtime Programmable and Memory Bandwidth Optimized FPGA-Based Coprocessor for Deep Convolutional Neural Network. In: IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 29 (12). pp. 5922-5934.

Official URL: http://dx.doi.org/10.1109/TNNLS.2018.2815085

Abstract

The deep convolutional neural network (DCNN) is a class of machine learning algorithms based on feed-forward artificial neural networks and is widely used for image processing applications. Applying DCNNs to real-world problems requires high computational power and high memory bandwidth in a power-constrained environment. A general-purpose CPU cannot exploit the different forms of parallelism offered by these algorithms and hence is slow and energy inefficient for practical use. We propose a field-programmable gate array (FPGA)-based runtime programmable coprocessor to accelerate feed-forward computation of DCNNs. The coprocessor can be programmed for a new network architecture at runtime without resynthesizing the FPGA hardware. Hence, it acts as a plug-and-use peripheral for the host computer. Caching is implemented for input features and filter weights using on-chip memory to reduce the external memory bandwidth requirement. Data are prefetched at several stages to avoid stalling the computational units, and different optimization techniques are used to efficiently reuse the fetched data. Dataflow is dynamically adjusted at runtime for each DCNN layer to achieve consistent computational throughput across a wide range of input feature sizes and filter sizes. The coprocessor is prototyped on a Xilinx Virtex-7 XC7VX485T FPGA-based VC707 board and operates at 150 MHz. Experimental results show that our implementation is 15x more energy efficient than a highly optimized CPU implementation and achieves a consistent computational throughput of more than 140 G operations/s for a wide range of input feature sizes and filter sizes. Off-chip memory transactions decrease by 111x due to the use of the on-chip cache.
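The abstract's point about on-chip caching can be illustrated with a toy count of external memory reads for one convolution layer. This is only a sketch under simplifying assumptions that are not taken from the paper: a single input channel, a single square filter, no padding, and a hypothetical cache large enough to hold the whole input feature map. The function names are illustrative, and the model does not attempt to reproduce the 111x figure reported above.

```python
def offchip_reads_no_cache(h, w, k, stride=1):
    """Input reads when every multiply fetches its pixel from external memory."""
    out_h = (h - k) // stride + 1   # output rows (no padding assumed)
    out_w = (w - k) // stride + 1   # output columns
    return out_h * out_w * k * k    # one fetch per filter tap per output pixel


def offchip_reads_with_cache(h, w):
    """Input reads when each pixel is fetched once and then reused on chip."""
    return h * w


def reuse_factor(h, w, k, stride=1):
    """Ratio of uncached to cached external reads for the input features."""
    return offchip_reads_no_cache(h, w, k, stride) / offchip_reads_with_cache(h, w)


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    for k in (1, 3, 5, 7):
        print(f"32x32 input, {k}x{k} filter: reuse factor {reuse_factor(32, 32, k):.2f}")
```

In this simplified model the saving grows with filter size, because each input pixel is shared by up to k x k overlapping windows. A real design also reuses filter weights and shares input features across many filters, which this sketch leaves out.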

Item Type:	Journal Article
Publication:	IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
Publisher:	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Additional Information:	Copyright for this article belongs to IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Keywords:	Accelerator; coprocessor; deep convolutional neural network (DCNN); deep learning; field-programmable gate array (FPGA); runtime programmable
Department/Centre:	Division of Electrical Sciences > Electronic Systems Engineering (Formerly Centre for Electronic Design & Technology)
Date Deposited:	11 Dec 2018 12:12
Last Modified:	11 Dec 2018 12:12
URI:	http://eprints.iisc.ac.in/id/eprint/61218