Sparse Convolution FPGA Accelerator Based on Multi-Bank Hash Selection

Bibliographic Details
Main Authors: Jia Xu, Han Pu, Dong Wang
Format: Article
Language: English
Published: MDPI AG, 2024-12-01
Series: Micromachines
Subjects: deep convolutional neural network; FPGA; heterogeneous computing; high-level synthesis; cache memory
Online Access: https://www.mdpi.com/2072-666X/16/1/22
author Jia Xu
Han Pu
Dong Wang
collection DOAJ
description Reconfigurable processor-based acceleration of deep convolutional neural network (DCNN) algorithms has emerged as a widely adopted technique, with particular attention on sparse neural network acceleration as an active research area. However, many computing devices that claim high computational power still struggle to execute neural network algorithms with optimal efficiency, low latency, and minimal power consumption. Consequently, there remains significant potential for further exploration into improving the efficiency, latency, and power consumption of neural network accelerators across diverse computational scenarios. This paper investigates three key techniques for hardware acceleration of sparse neural networks. The main contributions are as follows: (1) Most neural network inference tasks are typically executed on general-purpose computing devices, which often fail to deliver high energy efficiency and are not well-suited for accelerating sparse convolutional models. In this work, we propose a specialized computational circuit for the convolutional operations of sparse neural networks. This circuit is designed to detect and eliminate the computational effort associated with zero values in the sparse convolutional kernels, thereby enhancing energy efficiency. (2) The data access patterns in convolutional neural networks introduce significant pressure on the high-latency off-chip memory access process. Due to issues such as data discontinuity, the data reading unit often fails to fully exploit the available bandwidth during off-chip read and write operations. In this paper, we analyze bandwidth utilization in the context of convolutional accelerator data handling and propose a strategy to improve off-chip access efficiency. Specifically, we leverage a compiler optimization plugin developed for Vitis HLS, which automatically identifies and optimizes on-chip bandwidth utilization. (3) In coefficient-based accelerators, the synchronous operation of individual computational units can significantly hinder efficiency. Previous approaches have achieved asynchronous convolution by designing separate memory units for each computational unit; however, this method consumes a substantial amount of on-chip memory resources. To address this issue, we propose a shared feature map cache design for asynchronous convolution in the accelerators presented in this paper. This design resolves address access conflicts when multiple computational units concurrently access a set of caches by utilizing a hash-based address indexing algorithm. Moreover, the shared cache architecture reduces data redundancy and conserves on-chip resources. Using the optimized accelerator, we successfully executed ResNet50 inference on an Intel Arria 10 1150GX FPGA, achieving a throughput of 497 GOPS, or an equivalent computational power of 1579 GOPS, with a power consumption of only 22 watts.
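The zero-skipping idea in contribution (1) can be pictured as storing each sparse kernel in compressed form and multiplying only its nonzero coefficients. The C++ sketch below is purely illustrative of that idea under assumed data structures (SparseKernel, sparse_dot); it is not the authors' circuit or HLS code.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // A kernel stored in compressed form: only nonzero coefficients are kept,
    // each paired with its flattened position inside the K x K window.
    struct SparseKernel {
        std::vector<std::int16_t> value;   // nonzero weights
        std::vector<std::uint8_t> offset;  // position of each weight in the window
    };

    // One output value: multiply-accumulate over nonzero weights only, so the
    // work scales with the number of nonzeros rather than with K * K.
    std::int32_t sparse_dot(const SparseKernel& k, const std::int16_t* window) {
        std::int32_t acc = 0;
        for (std::size_t i = 0; i < k.value.size(); ++i) {
            acc += static_cast<std::int32_t>(k.value[i]) * window[k.offset[i]];
        }
        return acc;
    }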
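For contribution (2), a common way to raise off-chip bandwidth utilization in HLS designs is to read wide, contiguous words in bursts and unpack them on chip. The fragment below shows that generic Vitis HLS idiom only; the port width, pragmas, and function name (burst_read) are assumptions and do not describe the compiler plugin mentioned in the abstract.

    #include <ap_int.h>
    #include <hls_stream.h>

    // Read n_words 512-bit words from external memory as a contiguous burst
    // and unpack each into 32 16-bit feature-map values streamed on chip.
    void burst_read(const ap_uint<512>* ddr, int n_words,
                    hls::stream<ap_int<16> >& out) {
    #pragma HLS INTERFACE m_axi port=ddr offset=slave bundle=gmem
        for (int i = 0; i < n_words; ++i) {
            ap_uint<512> word = ddr[i];      // sequential accesses map to AXI bursts
            for (int j = 0; j < 32; ++j) {
    #pragma HLS PIPELINE II=1
                out.write(ap_int<16>(word.range(16 * j + 15, 16 * j)));
            }
        }
    }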
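For contribution (3), multi-bank hash selection can be pictured as hashing each feature-map address to one of several cache banks so that concurrent compute units usually target different banks and can be served in the same cycle, with simple arbitration when two requests collide. In the sketch below the bank count, hash function, and arbitrate routine are all assumptions for illustration; this is not the published indexing algorithm.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr unsigned kBanks = 8;  // assumed number of shared cache banks

    // Simple hash that spreads nearby feature-map addresses across banks.
    inline unsigned bank_of(std::uint32_t addr) {
        return (addr ^ (addr >> 3) ^ (addr >> 7)) % kBanks;
    }

    // Per cycle, grant at most one request per bank; conflicting compute
    // units are stalled and retry in the next cycle.
    std::vector<bool> arbitrate(const std::vector<std::uint32_t>& requests) {
        std::array<bool, kBanks> busy{};
        std::vector<bool> grant(requests.size(), false);
        for (std::size_t pe = 0; pe < requests.size(); ++pe) {
            unsigned b = bank_of(requests[pe]);
            if (!busy[b]) {
                busy[b] = true;
                grant[pe] = true;
            }
        }
        return grant;
    }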
format Article
id doaj-art-b34647b6453447daae408a7c4d304a83
institution Kabale University
issn 2072-666X
language English
publishDate 2024-12-01
publisher MDPI AG
record_format Article
series Micromachines
spelling doaj-art-b34647b6453447daae408a7c4d304a83 2025-01-24T13:41:52Z eng MDPI AG Micromachines 2072-666X 2024-12-01 vol. 16, no. 1, article 22 doi:10.3390/mi16010022 Sparse Convolution FPGA Accelerator Based on Multi-Bank Hash Selection Jia Xu, Han Pu, Dong Wang (Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China) https://www.mdpi.com/2072-666X/16/1/22 deep convolutional neural network; FPGA; heterogeneous computing; high-level synthesis; cache memory
title Sparse Convolution FPGA Accelerator Based on Multi-Bank Hash Selection
topic deep convolutional neural network
FPGA
heterogeneous computing
high-level synthesis
cache memory
url https://www.mdpi.com/2072-666X/16/1/22