Sparse Convolution FPGA Accelerator Based on Multi-Bank Hash Selection

Bibliographic Details
Main Authors: Jia Xu, Han Pu, Dong Wang
Format: Article
Language: English
Published: MDPI AG, 2024-12-01
Series: Micromachines
Subjects: deep convolutional neural network; FPGA; heterogeneous computing; high-level synthesis; cache memory
Online Access: https://www.mdpi.com/2072-666X/16/1/22
author Jia Xu
Han Pu
Dong Wang
collection DOAJ
description Reconfigurable processor-based acceleration of deep convolutional neural network (DCNN) algorithms has emerged as a widely adopted technique, with particular attention on sparse neural network acceleration as an active research area. However, many computing devices that claim high computational power still struggle to execute neural network algorithms with optimal efficiency, low latency, and minimal power consumption. Consequently, there remains significant potential for further exploration into improving the efficiency, latency, and power consumption of neural network accelerators across diverse computational scenarios. This paper investigates three key techniques for hardware acceleration of sparse neural networks. The main contributions are as follows: (1) Most neural network inference tasks are typically executed on general-purpose computing devices, which often fail to deliver high energy efficiency and are not well-suited for accelerating sparse convolutional models. In this work, we propose a specialized computational circuit for the convolutional operations of sparse neural networks. This circuit is designed to detect and eliminate the computational effort associated with zero values in the sparse convolutional kernels, thereby enhancing energy efficiency. (2) The data access patterns in convolutional neural networks introduce significant pressure on the high-latency off-chip memory access process. Due to issues such as data discontinuity, the data reading unit often fails to fully exploit the available bandwidth during off-chip read and write operations. In this paper, we analyze bandwidth utilization in the context of convolutional accelerator data handling and propose a strategy to improve off-chip access efficiency. Specifically, we leverage a compiler optimization plugin developed for Vitis HLS, which automatically identifies and optimizes on-chip bandwidth utilization. (3) In coefficient-based accelerators, the synchronous operation of individual computational units can significantly hinder efficiency. Previous approaches have achieved asynchronous convolution by designing separate memory units for each computational unit; however, this method consumes a substantial amount of on-chip memory resources. To address this issue, we propose a shared feature map cache design for asynchronous convolution in the accelerators presented in this paper. This design resolves address access conflicts when multiple computational units concurrently access a set of caches by utilizing a hash-based address indexing algorithm. Moreover, the shared cache architecture reduces data redundancy and conserves on-chip resources. Using the optimized accelerator, we successfully executed ResNet50 inference on an Intel Arria 10 1150GX FPGA, achieving a throughput of 497 GOPS, or an equivalent computational power of 1579 GOPS, with a power consumption of only 22 watts.
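The zero-skipping idea in contribution (1) can be pictured as storing each sparse kernel in compressed form and multiplying only its nonzero coefficients. The C++ sketch below is purely illustrative of that idea under assumed data structures (SparseKernel, sparse_dot); it is not the authors' circuit or HLS code.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // A kernel stored in compressed form: only nonzero coefficients are kept,
    // each paired with its flattened position inside the K x K window.
    struct SparseKernel {
        std::vector<std::int16_t> value;   // nonzero weights
        std::vector<std::uint8_t> offset;  // position of each weight in the window
    };

    // One output value: multiply-accumulate over nonzero weights only, so the
    // work scales with the number of nonzeros rather than with K * K.
    std::int32_t sparse_dot(const SparseKernel& k, const std::int16_t* window) {
        std::int32_t acc = 0;
        for (std::size_t i = 0; i < k.value.size(); ++i) {
            acc += static_cast<std::int32_t>(k.value[i]) * window[k.offset[i]];
        }
        return acc;
    }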
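For contribution (2), a common way to raise off-chip bandwidth utilization in HLS designs is to read wide, contiguous words in bursts and unpack them on chip. The fragment below shows that generic Vitis HLS idiom only; the port width, pragmas, and function name (burst_read) are assumptions and do not describe the compiler plugin mentioned in the abstract.

    #include <ap_int.h>
    #include <hls_stream.h>

    // Read n_words 512-bit words from external memory as a contiguous burst
    // and unpack each into 32 16-bit feature-map values streamed on chip.
    void burst_read(const ap_uint<512>* ddr, int n_words,
                    hls::stream<ap_int<16> >& out) {
    #pragma HLS INTERFACE m_axi port=ddr offset=slave bundle=gmem
        for (int i = 0; i < n_words; ++i) {
            ap_uint<512> word = ddr[i];      // sequential accesses map to AXI bursts
            for (int j = 0; j < 32; ++j) {
    #pragma HLS PIPELINE II=1
                out.write(ap_int<16>(word.range(16 * j + 15, 16 * j)));
            }
        }
    }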
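For contribution (3), multi-bank hash selection can be pictured as hashing each feature-map address to one of several cache banks so that concurrent compute units usually target different banks and can be served in the same cycle, with simple arbitration when two requests collide. In the sketch below the bank count, hash function, and arbitrate routine are all assumptions for illustration; this is not the published indexing algorithm.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr unsigned kBanks = 8;  // assumed number of shared cache banks

    // Simple hash that spreads nearby feature-map addresses across banks.
    inline unsigned bank_of(std::uint32_t addr) {
        return (addr ^ (addr >> 3) ^ (addr >> 7)) % kBanks;
    }

    // Per cycle, grant at most one request per bank; conflicting compute
    // units are stalled and retry in the next cycle.
    std::vector<bool> arbitrate(const std::vector<std::uint32_t>& requests) {
        std::array<bool, kBanks> busy{};
        std::vector<bool> grant(requests.size(), false);
        for (std::size_t pe = 0; pe < requests.size(); ++pe) {
            unsigned b = bank_of(requests[pe]);
            if (!busy[b]) {
                busy[b] = true;
                grant[pe] = true;
            }
        }
        return grant;
    }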
format Article
id doaj-art-b34647b6453447daae408a7c4d304a83
institution Kabale University
issn 2072-666X
language English
publishDate 2024-12-01
publisher MDPI AG
record_format Article
series Micromachines
spelling doaj-art-b34647b6453447daae408a7c4d304a83 2025-01-24T13:41:52Z eng MDPI AG Micromachines 2072-666X 2024-12-01 vol. 16, no. 1, article 22 doi:10.3390/mi16010022 Sparse Convolution FPGA Accelerator Based on Multi-Bank Hash Selection Jia Xu, Han Pu, Dong Wang (Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China) https://www.mdpi.com/2072-666X/16/1/22 deep convolutional neural network; FPGA; heterogeneous computing; high-level synthesis; cache memory
title Sparse Convolution FPGA Accelerator Based on Multi-Bank Hash Selection
topic deep convolutional neural network
FPGA
heterogeneous computing
high-level synthesis
cache memory
url https://www.mdpi.com/2072-666X/16/1/22