Convolutional Neural Networks (CNNs) are widely used in applications such as speech recognition, image recognition, and natural language processing. However, due to their computational complexity, deploying CNNs on resource-limited edge devices remains a significant challenge. Sparse CNNs exploit the sparsity of the networks' weight matrices to reduce computation while maintaining accuracy. By storing only the nonzero values, the Compressed Sparse Row (CSR) format compresses the sparse matrix, lowering the memory requirements and computational cost of the network. This work presents a novel approach for accelerating Sparse CNNs on Field-Programmable Gate Arrays (FPGAs) using the CSR format and systolic arrays. The proposed method exploits the parallel processing capabilities of systolic arrays to perform CSR-based sparse convolutions. Furthermore, an algorithm is presented that optimizes the data layout to maximize data reuse and minimize data movement between the processing elements of the systolic array and external memory. The architecture is evaluated against a state-of-the-art GPU implementation on several benchmark datasets, outperforming it by 1.42x in throughput and 22.4x in power efficiency. The presented approach offers a promising solution for accelerating Sparse CNNs on resource-constrained devices, enabling their deployment in a wide range of applications.
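To make the CSR representation mentioned above concrete, the following is a minimal illustrative sketch (not the paper's implementation) of converting a dense weight matrix into the three CSR arrays: nonzero values, their column indices, and row pointers marking where each row begins.

```python
def to_csr(dense):
    """Convert a dense matrix (list of lists) to CSR arrays.

    Returns (values, col_idx, row_ptr), where row i's nonzeros occupy
    values[row_ptr[i]:row_ptr[i+1]] with columns col_idx[row_ptr[i]:row_ptr[i+1]].
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # Record the running count of nonzeros at the end of each row.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

# Example: a 3x4 sparse weight matrix with 3 nonzeros
W = [[0, 2, 0, 0],
     [1, 0, 0, 3],
     [0, 0, 0, 0]]
vals, cols, ptrs = to_csr(W)
print(vals)  # [2, 1, 3]
print(cols)  # [1, 0, 3]
print(ptrs)  # [0, 1, 3, 3]
```

Because only the 3 nonzeros (plus index metadata) are stored instead of all 12 entries, both the memory footprint and the number of multiply-accumulates fed to the systolic array shrink with increasing sparsity.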