FPGA-Based Architectures for Deep Learning Accelerators
Authors: Zobov O.V., Shakhnov V.A. | Published: 23.01.2026
Published in issue: #4(153)/2025
DOI:
Category: Informatics, Computer Engineering and Control | Chapter: Computing Systems and their Elements
Keywords: deep learning, hardware accelerators, structurally fixed accelerators, software-configurable hardware accelerators, computer system architecture
Abstract
The rapid development of deep learning technologies and their widespread adoption across diverse fields call for efficient hardware acceleration of computationally intensive neural network models. Field-programmable gate arrays are of particular interest as a hardware platform for accelerating deep learning workloads, since they combine the flexibility of reprogramming with the efficiency of a hardware implementation. They allow computational pipelines to be fine-tuned and the memory hierarchy to be optimized, which significantly reduces latency and increases energy efficiency in both the training and the inference phases. The article reviews theoretical and practical advances in optimizing the architecture of field-programmable gate arrays for efficient acceleration of deep learning algorithms. Various approaches to building accelerators are considered, from structurally fixed accelerators to software-configurable hardware accelerators that balance performance and flexibility. Special attention is paid to improving the classical components of field-programmable gate arrays and specializing them for the efficient implementation of basic deep learning operations, including matrix computations and multiply-accumulate operations of various precisions.
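One of the basic operations mentioned in the abstract, the low-precision multiply-accumulate, is often mapped onto FPGA DSP blocks by packing several narrow multiplications into a single wide multiplier, as in the DSP-packing works cited below (e.g., [43], [47], [48]). The following minimal C sketch is an illustration of that general idea rather than code from the article; the operand widths (8-bit activation, 4-bit unsigned weights), the guard shift, and the example values are assumptions chosen for demonstration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: two unsigned low-precision products obtained from
 * one wide multiplication. With an 8-bit activation a and 4-bit weights
 * w1, w2, each partial product fits in 12 bits, so a 12-bit guard shift
 * keeps the two fields from overlapping. */
int main(void) {
    const unsigned A_BITS = 8, W_BITS = 4, SHIFT = A_BITS + W_BITS;

    uint32_t a = 200, w1 = 13, w2 = 7;            /* example operands */
    uint32_t packed_w = (w1 << SHIFT) | w2;       /* w1 above w2, guard gap between */
    uint32_t product  = a * packed_w;             /* single wide multiplication */

    uint32_t p2 = product & ((1u << SHIFT) - 1);  /* low field  -> a * w2 */
    uint32_t p1 = product >> SHIFT;               /* high field -> a * w1 */

    assert(p1 == a * w1 && p2 == a * w2);
    printf("a*w1 = %u, a*w2 = %u\n", p1, p2);
    return 0;
}
```

Signed operands and accumulation over many packed products require additional correction logic; handling these cases efficiently is precisely what the packing and approximate-multiplier schemes referenced below address.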
Some of the results of this work were obtained within the framework of the State Assignment (FSFN-2024-0086).
Please cite this article in English as:
Zobov O.V., Shakhnov V.A. FPGA-based architectures for deep learning accelerators. Herald of the Bauman Moscow State Technical University, Series Instrument Engineering, 2025, no. 4 (153), pp. 78--101 (in Russ.). EDN: KHNNVS
References
[1] LeCun Y., Bengio Y., Hinton G. Deep learning. Nature, 2015, vol. 521, no. 7553, pp. 436--444. DOI: https://doi.org/10.1038/nature14539
[2] Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw., 2015, vol. 61, pp. 85--117. DOI: https://doi.org/10.1016/j.neunet.2014.09.003
[3] Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev., 1958, vol. 65, no. 6, pp. 386--408. DOI: https://doi.org/10.1037/h0042519
[4] LeCun Y., Bottou L., Bengio Y., et al. Gradient-based learning applied to document recognition. Proc. IEEE, 1998, vol. 86, no. 11, pp. 2278--2324. DOI: https://doi.org/10.1109/5.726791
[5] Rumelhart D.E., Hinton G.E., Williams R.J. Learning internal representations by error propagation. Technical Report ICS-8504. San Diego, University of California, Institute for Cognitive Science, 1985.
[6] Vaswani A., Shazeer N., Parmar N., et al. Attention is all you need. Proc. 31st NIPS, 2017, pp. 6000--6010. DOI: https://doi.org/10.1007/s11704-025-50480-3
[7] Shakhnov V.A., Vlasov A.I., Polyakov Yu.A., et al. Neyrokompyutery: arkhitektura i skhemotekhnika [Neurocomputers: architecture and circuitry]. Moscow, Mashinostroenie Publ., 2000. EDN: RVYJUX
[8] Levin I.I., Dordopulo A.I., Kalyaev I.A., et al. High-performance reconfigurable computing systems based on Virtex-7 FPGAs. Trudy Instituta matematiki i informatiki Natsionalnoy akademii nauk Belarusi [Proceedings of the Institute of Mathematics of the National Academy of Sciences of Belarus], 2014, no. 6, pp. 3--7 (in Russ.).
[9] Kalyaev I.A., Levin I.I. Reconfigurable multipipeline computing systems for data-driven tasks of information handling and control solution. Izvestiya Yuzhnogo federalnogo universiteta. Tekhnicheskie nauki [Izvestiya SFedU. Engineering Sciences], 2011, no. 2, pp. 12--22 (in Russ.). EDN: OZQLKX
[10] Akhmetov N.R., Vlasov A.I., Dimitrov D.A., et al. A promising element base for smart systems in a digital transformation of industry. Datchiki i sistemy [Sensors & Systems], 2021, no. 1, pp. 9--17 (in Russ.). DOI: https://doi.org/10.25728/datsys.2021.1.2
[11] Dordopulo A.I., Kalyaev I.A., Levin I.I., et al. High-performance reconfigurable computer systems of new generation. Vychislitelnye metody i programmirovanie [Numerical Methods and Programming], 2011, vol. 12, no. 4, pp. 82--89 (in Russ.). EDN: OJAZNN
[12] Vlasov A.I. Hardware implementation of neurocomputing control systems. Upravlenie, kontrol, diagnostika [Instruments and Systems: Monitoring, Control, and Diagnostics], 1999, no. 2, pp. 61--65 (in Russ.). EDN: TEKPVZ
[13] Sozzo E.D., Conficconi D., Zeni A., et al. Pushing the level of abstraction of digital system design: a survey on how to program FPGAs. ACM Comput. Surv., 2023, vol. 55, no. 5, art. 106. DOI: https://doi.org/10.1145/3532989
[14] Zhang X., Wang J., Zhu C., et al. DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs. ICCAD’18, 2018, art. 56. DOI: https://doi.org/10.1145/3240765.3240801
[15] Zhang X., Wang J., Zhu C., et al. AccDNN: an IP-Based DNN generator for FPGAs. IEEE 26th FCCM, 2018, p. 210. DOI: https://doi.org/10.1109/FCCM.2018.00044
[16] Guan Y., Liang H., Xu N., et al. FP-DNN: an automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. IEEE 25th FCCM, 2017, pp. 152--159. DOI: https://doi.org/10.1109/FCCM.2017.25
[17] Zhang X., Ye H., Wang J., et al. DNNExplorer: a framework for modeling and exploring a novel paradigm of FPGA-based DNN accelerator. ICCAD’20, 2020, art. 61. DOI: https://doi.org/10.1145/3400302.3415609
[18] Feng L., Liu W., Guo C., et al. GANDSE: generative adversarial network-based design space exploration for neural network accelerator design. ACM TODAES, 2023, vol. 28, no. 3, art. 35. DOI: https://doi.org/10.1145/3570926
[19] Xu P., Zhang X., Hao C., et al. AutoDNNchip: an automated DNN chip predictor and builder for both FPGAs and ASICs. FPGA’20, 2020, pp. 40--50. DOI: https://doi.org/10.1145/3373087.3375306
[20] Venieris S.I., Bouganis C. fpgaConvNet: a framework for mapping convolutional neural networks on FPGAs. IEEE 24th FCCM, 2016, pp. 40--47. DOI: https://doi.org/10.1109/FCCM.2016.22
[21] Wang Y., Xu J., Han Y., et al. DeepBurning: automatic generation of FPGA-based learning accelerators for the neural network family. DAC’16, 2016, art. 110. DOI: https://doi.org/10.1145/2897937.2898003
[22] Guan Y., Liang H., Xu N., et al. FP-DNN: an automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. IEEE 25th FCCM, 2017, pp. 152--159. DOI: https://doi.org/10.1109/FCCM.2017.25
[23] Ding Y., Wu J., Gao Y., et al. Model-platform optimized deep neural network accelerator generation through mixed-integer geometric programming. IEEE 31st FCCM, 2023, pp. 83--93. DOI: https://doi.org/10.1109/FCCM57271.2023.00018
[24] Wang C., Zhang X., Cong J., et al. Addressing architectural obstacles for overlay with stream network abstraction. arXiv:2411.17966. URL: https://arxiv.org/abs/2411.17966v1
[25] Abdelfattah M., Han D., Bitar A., et al. DLA: compiler and FPGA overlay for neural network inference acceleration. 28th FPL, 2018, pp. 411--417. DOI: https://doi.org/10.1109/FPL.2018.00077
[26] Hu W., Xu D., Fan Z., et al. Vis-TOP: visual transformer overlay processor. arXiv:2110.10957. DOI: https://doi.org/10.48550/arXiv.2110.10957
[27] Zhang X., Wang J., Zhu C., et al. DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs. ICCAD’18, 2018, art. 56. DOI: https://doi.org/10.1145/3240765.3240801
[28] Bai Y., Zhou H., Zhao K., et al. LTrans-OPU: a low-latency FPGA-based overlay processor for transformer networks. 33rd FPL, 2023, pp. 283--287. DOI: https://doi.org/10.1109/FPL60245.2023.00048
[29] Khan H., Khan A., Khan Z.F., et al. NPE: an FPGA-based overlay processor for natural language processing. FPGA’21, 2021, p. 227. DOI: https://doi.org/10.1145/3431920.3439477
[30] Zhao B.-B., Wang Y., Zhang H., et al. 4-bit CNN quantization method with compact LUT-based multiplier implementation on FPGA. IEEE Trans. Instrum. Meas., 2023, vol. 72, art. 2008110. DOI: https://doi.org/10.1109/TIM.2023.3324357
[31] Gerlinghoff D., Choong B.C.M., Goh R., et al. Table-lookup MAC: scalable processing of quantised neural networks in FPGA soft logic. FPGA’24, 2024, pp. 235--245. DOI: https://doi.org/10.1145/3626202.3637576
[32] Vakili S., Vaziri M., Zarei A., et al. DyRecMul: fast and low-cost approximate multiplier for FPGAs using dynamic reconfiguration. arXiv:2310.10053. DOI: https://doi.org/10.48550/arXiv.2310.10053
[33] Awais M., Zahir A., Shah S.A.A., et al. Toward optimal softcore carry-aware approximate multipliers on Xilinx FPGAs. ACM TECS, 2023, vol. 22, no. 4, art. 76. DOI: https://doi.org/10.1145/3564243
[34] Haghi P., Kamal M., Afzali-Kusha A., et al. O4-DNN: a hybrid DSP-LUT-based processing unit with operation packing and out-of-order execution for efficient realization of convolutional neural networks on FPGA devices. IEEE Trans. Circuits Syst. I: Regul. Pap., 2020, vol. 67, no. 9, pp. 3056--3069. DOI: https://doi.org/10.1109/TCSI.2020.2986350
[35] Wang E., Davis J.J., Cheung P., et al. LUTNet: rethinking inference in FPGA soft logic. IEEE 27th FCCM, 2019, pp. 26--34. DOI: https://doi.org/10.1109/FCCM.2019.00014
[36] Wang E., Davis J.J., Cheung P., et al. LUTNet: learning FPGA configurations for highly efficient neural network inference. IEEE Trans. Comput., 2020, vol. 69, no. 12, pp. 1795--1808. DOI: https://doi.org/10.1109/TC.2020.2978817
[37] Wang E., Auffret M., Stavrou G., et al. Logic shrinkage: learned connectivity sparsification for LUT-based neural networks. ACM TRETS, 2023, vol. 16, no. 4, art. 57. DOI: https://doi.org/10.1145/3583075
[38] Wang E., Davis J.J., Stavrou G., et al. Logic shrinkage: learned FPGA netlist sparsity for efficient neural network inference. FPGA’22, 2022, pp. 101--111. DOI: https://doi.org/10.1145/3490422.3502360
[39] Xie Y., Li Z., Diaconu D., et al. LUTMUL: exceed conventional FPGA roofline limit by LUT-based efficient multiplication for neural network inference. arXiv:2411.11852. DOI: https://doi.org/10.48550/arXiv.2411.11852
[40] Cao Y., Wang C., Tang Y. Explore efficient LUT-based architecture for quantized convolutional neural networks on FPGA. IEEE 28th FCCM, 2020, p. 232. DOI: https://doi.org/10.1109/FCCM48280.2020.00065
[41] Cao Y., Song C., Tang Y. Efficient LUT-based FPGA accelerator design for universal quantized CNN inference. ASSE’21, 2021, pp. 108--115. DOI: https://doi.org/10.1145/3456126.3456140
[42] Neda N., Ullah S., Ghanbari A., et al. Multi-precision deep neural network acceleration on FPGAs. 27th ASP-DAC, 2022, pp. 454--459. DOI: https://doi.org/10.1109/asp-dac52403.2022.9712485
[43] Lee S., Kim D., Nguyen D., et al. Double MAC on a DSP: boosting the performance of convolutional neural networks on FPGAs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2019, vol. 38, no. 5, pp. 888--897. DOI: https://doi.org/10.1109/TCAD.2018.2824280
[44] Ding C. Dynamic precision multiplier for deep neural network accelerators. IEEE 33rd SOCC, 2020, pp. 180--184. DOI: https://doi.org/10.1109/socc49529.2020.9524752
[45] Neda N. Multi-precision deep neural network acceleration on FPGAs. 27th ASP-DAC, 2022, pp. 454--459. DOI: https://doi.org/10.1109/asp-dac52403.2022.9712485
[46] Raees P.C.M., Akshayraj M.R., Gopi Varun P., et al. Dynamic precision scaling in MAC units for energy-efficient computations in deep neural network accelerators. 28th VDAT, 2024. DOI: https://doi.org/10.1109/VDAT63601.2024.10705697
[47] Sommer J., Ozkan A., Keszocze O., et al. DSP-packing: squeezing low-precision arithmetic into FPGA DSP blocks. 32nd FPL, 2022, pp. 160--166. DOI: https://doi.org/10.1109/FPL57034.2022.00035
[48] Zhang J., Zhang M., Cao X., et al. Uint-packing: multiply your DNN accelerator performance via unsigned integer DSP packing. 60th ACM/IEEE DAC, 2023. DOI: https://doi.org/10.1109/DAC56929.2023.10247773
[49] Kalali E., van Leuken R. Near-precise parameter approximation for multiple multiplications on a single DSP block. IEEE Trans. Comput., 2022, vol. 71, no. 9, pp. 2036--2047. DOI: https://doi.org/10.1109/TC.2021.3119187
[50] Li R., Jiang B., Xu H. Mixed DSP packing method for convolutional neural network on FPGA. Proc. SPIE, 2023, vol. 12800. DOI: https://doi.org/10.1117/12.3004070
[51] Rasoulinezhad S., Zhou H., Wang L., et al. PIR-DSP: an FPGA DSP block architecture for multi-precision deep neural networks. IEEE 27th FCCM, 2019, pp. 35--44. DOI: https://doi.org/10.1109/FCCM.2019.00015
[52] Liu Y., Rai S., Ullah S., et al. High-flexibility designs of quantized runtime reconfigurable multi-precision multipliers. IEEE Embed. Syst. Lett., 2023, vol. 15, no. 4, pp. 194--197. DOI: https://doi.org/10.1109/LES.2023.3298736
[53] Liu X., Wu X., Shao H., et al. A flexible FPGA-based accelerator for efficient inference of multi-precision CNNs. IEEE ISCAS, 2024. DOI: https://doi.org/10.1109/ISCAS58744.2024.10557882
[54] Huang M., Liu Y., Huang S., et al. Multi-bit-width CNN accelerator with systolic-in-systolic dataflow and single DSP multiple multiplication scheme. FPGA’23, 2023, p. 229. DOI: https://doi.org/10.1145/3543622.3573209
[55] Huang M., Liu Y., Man C., et al. A high performance multi-bit-width booth vector systolic accelerator for NAS optimized deep learning neural networks. IEEE Trans. Circuits Syst. I: Regul. Pap., 2022, vol. 69, no. 9, pp. 3619--3631. DOI: https://doi.org/10.1109/TCSI.2022.3178474
[56] Zheng Y., Li Z., Sun K., et al. A 40 nm area-efficient effective-bit-combination-based DNN accelerator with the reconfigurable multiplier. IEEE 5th AICAS, 2023. DOI: https://doi.org/10.1109/AICAS57966.2023.10168550
[57] Ghavami B., Sajadi M., Shannon L., et al. Boosting multiple multipliers packing on FPGA DSP blocks via truncation and compensation-based approximation. IEEE ISVLSI, 2024, pp. 222--227. DOI: https://doi.org/10.1109/ISVLSI61997.2024.00049
[58] Rehman A., Vakili S. A cost-effective FPGA-based approximate multiplier for machine learning acceleration. IEEE 14th PAAP, 2023. DOI: https://doi.org/10.1109/PAAP60200.2023.10391619
[59] Ullah S., Rehman S., Prabakaran B., et al. Area-optimized low-latency approximate multipliers for FPGA-based hardware accelerators. DAC’18, 2018, art. 159. DOI: https://doi.org/10.1145/3195970.3195996
[60] Chen Y., Dotzel J., Abdelfattah M. M4BRAM: mixed-precision matrix-matrix multiplication in FPGA block RAMs. ICFPT, 2023, pp. 69--78. DOI: https://doi.org/10.1109/ICFPT59805.2023.00013
[61] Luo E., Huang H., Liu C., et al. DeepBurning-MixQ: an open source mixed-precision neural network accelerator design framework for FPGAs. IEEE/ACM ICCAD, 2023. DOI: https://doi.org/10.1109/ICCAD57390.2023.10323831
[62] Chen Y., Abdelfattah M. BRAMAC: compute-in-BRAM architectures for multiply-accumulate on FPGAs. 31st IEEE FCCM, 2023, pp. 52--62. DOI: https://doi.org/10.1109/FCCM57271.2023.00015
[63] Kabir M.A., Kamucheka T., Fredricks N., et al. IMAGine: an in-memory accelerated GEMV engine overlay. 34th FPL, 2024, pp. 220--226. DOI: https://doi.org/10.1109/FPL64840.2024.00038
[64] Kabir M.A., Kamucheka T., Fredricks N., et al. The BRAM is the limit: shattering myths, shaping standards, and building scalable PIM accelerators. 32nd IEEE FCCM, 2024, p. 223. DOI: https://doi.org/10.1109/FCCM60383.2024.00045
