Notes on Writing a Survey of Deep Learning Hardware Acceleration

In the first semester of my junior year, in order to write a survey on FPGA-based hardware acceleration, I read a large number of related papers and, over about three months, completed my first English-language survey paper.

For advice on how to read and collect papers, along with tool recommendations, see these two articles:

  1. Andrew Ng's advice on machine-learning careers and on reading papers (with PDFs of 10 must-read AI papers)
  2. With these treasured practical tools and learning websites, self-study is more fun! (that article includes a way to download papers for free)

Here is a rough account of how the article came together. If you are interested, feel free to leave a comment; I may later publish a detailed article on how to write a survey.

On the second day of the winter break in January, I attended the Jiangsu Province "Ten Thousand Undergraduates Program" academic winter camp, and was fortunate to be admitted to Nanjing University's winter camp on frontier technologies in electronic information. I have to say I learned a great deal during the camp. Our main activities, by the way, were academic lectures, hands-on AI training, and academic salons. After the camp ended I spent about a week at home, during which I kept thinking about a direction I found fascinating: the design of AI hardware accelerators. Thanks to earlier course projects, FPGA competitions, and other digital-logic design experience, together with my study of machine-learning algorithms, I resolved to search the literature to understand the current state of research on hardware acceleration, and at the same time to practice the Web of Science search methods introduced by Professor Tao Tao of Nanjing University.


So, from home, I logged into the Zhejiang Library's online portal and collected many important domestic and international papers on this topic. At first, faced with so many dense, difficult articles, I was at a complete loss. On closer inspection, I noticed that many well-known professors at home and abroad had written survey articles. I read about five of these carefully and picked up some techniques for constructing a writing framework, and I immediately drafted an outline of my own.

Next, I collected many papers on concrete accelerator implementations. These were harder to follow: some were based on CPUs, GPUs, or DSPs, others on memristors and similar devices. I focused on FPGA implementation methods and on which applications had been accelerated on FPGAs, such as handwritten-character recognition and image compression.

At that point, I had to classify and organize the implementation methods for each platform, along with their advances and shortcomings, then summarize the differences between the platforms and bring out FPGA-based hardware acceleration through comparison.

In my article, I devoted considerable space to the development of accelerator designs for deep learning and CNNs, made comprehensive comparisons, and finally offered constructive suggestions and conclusions.

[Figure: article structure]

Two pieces of advice for writing a survey:

  1. The narrower the focus of a survey, the more targeted it is, and the better the survey you can write;

  2. The other piece of advice I especially want to give: make good use of tables, statistical charts, and other data graphics to summarize the various methods and compare their development over time. This approach works remarkably well, and I strongly recommend it!
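To make the second point concrete, here is a minimal sketch of building such a platform-comparison table programmatically so it stays easy to update as you survey more papers. The platforms and the qualitative ratings below are illustrative placeholders, not findings from my survey:

```python
# A minimal sketch: render a comparison table of accelerator platforms.
# The entries are illustrative placeholders -- fill them in from the
# papers you actually survey.
platforms = [
    ("CPU",  "low",  "medium", "high"),
    ("GPU",  "high", "high",   "high"),
    ("FPGA", "high", "low",    "medium"),
    ("ASIC", "high", "low",    "low"),
]

header = f"{'Platform':<10}{'Parallelism':<14}{'Power':<10}{'Flexibility':<12}"
lines = [header, "-" * len(header)]
for name, parallelism, power, flexibility in platforms:
    lines.append(f"{name:<10}{parallelism:<14}{power:<10}{flexibility:<12}")

table = "\n".join(lines)
print(table)
```

Keeping the data in one small structure and regenerating the table means the draft's comparison tables never drift out of sync as new entries are added.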


Below is the cover page of my draft:

[Figure: draft cover]

The ability to search the literature is extremely important. Below are the references I found while writing, for your reference.

References

[1] Boukaye Boubacar Traore, Bernard Kamsu-Foguem, Fana Tangara, Deep convolution neural network for image recognition, Ecological Informatics, Volume 48, 2018, Pages 257-268, ISSN 1574-9541.

[2] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[3] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014.

[4] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, 2015, pp. 91–99.

[5] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in: Proceedings of The IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.

[6] Dawid Połap, Marcin Woźniak. Voice Recognition by Neuro-Heuristic Method[J]. Tsinghua Science and Technology, 2019, 24(01): 9-17.

[7] A. Ucar, Y. Demir, C. Guzelis, Object recognition and detection with deep learning for autonomous driving applications, (in English), Simul.-Trans. Soc. Model. Simul. Int. 93 (9) (Sep 2017) 759–769, doi:10.1177/0037549717709932.

[8] P. Pelliccione, E. Knauss, R. Heldal, et al., Automotive architecture framework: the experience of volvo cars, J. Syst. Archit. 77 (2017) 83–100. 06/01/ 2017 https://doi.org/10.1016/j.sysarc.2017.02.005.

[9] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,no. 7553, pp. 436–444, 2015.

[10] D. Aysegul, J. Jonghoon, G. Vinayak, K. Bharadwaj,C. Alfredo, M. Berin, and C. Eugenio. Accelerating deep neural networks on mobile processor with embedded programmable logic. In NIPS 2013. IEEE, 2013.

[11] S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf. A programmable parallel accelerator for learning and classification. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pages 273-284. ACM, 2010.

[12] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun. CNP: An FPGA-based processor for convolutional networks. In Field Programmable Logic and Applications (FPL 2009), International Conference on, pages 32-37. IEEE, 2009.

[13] M. Peemen, A. A. Setio, B. Mesman, and H. Corporaal. Memory-centric accelerator design for convolutional neural networks. In Computer Design (ICCD), 2013 IEEE 31st International Conference on, pages 13-19. IEEE, 2013.

[14] 侯宇青陽,全吉成,王宏偉.深度學習發展綜述[J].艦船電子工程,2017,37(04):5-9+111.

[15] 張榮,李偉平,莫同.深度學習研究綜述[J].信息與控制,2018,47(04):385-397+410.

[16] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, Tsuhan Chen, Recent advances in convolutional neural networks, Pattern Recognition, Volume 77, 2018, Pages 354-377, ISSN 0031-3203.

[17] McCulloch W S, Pitts W. A logical calculus of the ideas immanent in nervous activity [J]. Bulletin of Mathematical Biophysics,1943,5(4): 115-133

[18] Rosenblatt F. The perceptron: A probabilistic model for information storage and organization in the brain [J].Psychological Review, 1958,65(6):386-408

[19] Hubel D H, Wiesel T N. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex[J]. Journal of Physiology, 1962,160(1):106-154

[20] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf. A massively parallel coprocessor for convolutional neural networks. In Application-specific Systems, Architectures and Processors (ASAP 2009), 20th IEEE International Conference on, pages 53-60. IEEE, 2009.

[21] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, J. Cong, “Optimizing fpga-based accelerator design for deep convolutional neural networks”, FPGA, 2015.

[22] Liu Shaoli, Du Zidong, Tao Jinhua, et al. Cambricon: An instruction set architecture for neural networks[C]//Proc of the 43rd Int Symp on Computer Architecture. Piscataway, NJ:IEEE,2016:393-405

[23] Qianru Zhang, Meng Zhang, Tinghuan Chen, Zhifei Sun, Yuzhe Ma, Bei Yu, Recent advances in convolutional neural network acceleration, Neurocomputing, Volume 323, 2019, Pages 37-51, ISSN 0925-2312.

[24] 吳豔霞,梁楷,劉穎,崔慧敏.深度學習FPGA加速器的進展與趨勢[J/OL].計算機學報,2019:1-20[2019-03-19].
http://kns.cnki.net/kcms/detail/11.1826.TP.20190114.1037.002.html.

[25] Cavigelli L, Gschwend D, Mayer C, et al. Origami: A convolutional network accelerator//Proceedings of the Great Lakes Symposium on VLSI. Pittsburgh, USA, 2015: 199-204

[26] Chen Y-H, Krishna T, Emer J, et al. 14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks //Proceedings of the 2016 IEEE International Solid-State Circuits Conference (ISSCC). San Francisco, USA, 2016: 262-263

[27] Shafiee A, Nag A, Muralimanohar N, et al. ISAAC: A convolutional neural network accelerator with In-situ analog arithmetic in crossbars//Proceedings of the ISCA. Seoul, ROK, 2016: 14-26

[28] Andri R, Cavigelli L, Rossi D, et al. YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights//Proceedings of the IEEE Computer Society Annual Symposium on VLSI. Pittsburgh, USA, 2016: 236-241

[29] Gokmen T, Vlasov Y. Acceleration of deep neural network training with resistive cross-point devices: design considerations. Front neurosci, 2016, 10(51): 333

[30] 陳桂林,馬勝,郭陽.硬件加速神經網絡綜述[J/OL].計算機研究與發展,2019(02)[2019-03-20].
http://kns.cnki.net/kcms/detail/11.1777.TP.20190129.0940.004.html.

[31] 瀋陽靖,沈君成,葉俊,馬琪.基於FPGA的脈衝神經網絡加速器設計[J].電子科技,2017,30(10):89-92+96.

[32] 王思陽. 基於FPGA的卷積神經網絡加速器設計[D].電子科技大學,2017.

[33] Nurvitadhi E, Venkatesh G, Sim J, et al. Can FPGAs beat GPUs in accelerating next-generation deep neural networks?//Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey, USA, 2017: 5-14

[34] Wang, T., Wang, C., Zhou, X., & Chen, H. (2018). A Survey of FPGA Based Deep Learning Accelerators: Challenges and Opportunities. arXiv preprint arXiv:1901.04988.

[35] C. Farabet, Y. LeCun, K. Kavukcuoglu, et al. Large-scale FPGA-based convolutional networks[J]. In Scaling up Machine Learning: Parallel and Distributed Approaches eds Bekkerman, 2011, 399–419.

[36] C. Farabet, B. Martini, B. Corda, et al. NeuFlow: A runtime reconfigurable dataflow processor for vision[C]. In Computer Vision and Pattern Recognition Workshops, 2011, 109-116.

[37] M. Peemen, A. Setio, B. Mesman, et al. Memory-centric accelerator design for convolutional neural networks[C]. IEEE International Conference on Computer Design, 2013, 13-19.

[38] M. Sankaradas, V. Jakkula, S. Cadambi, et al. A massively parallel coprocessor for convolutional neural networks[C]. In Application Specific Systems, Architectures and Processors, 2009, 53-60.

[39] Wei Ding, Zeyu Huang, Zunkai Huang, Li Tian, Hui Wang, Songlin Feng, Designing efficient accelerator of depthwise separable convolutional neural network on FPGA, Journal of Systems Architecture, 2018, ISSN 1383-7621.

[40] 劉勤讓,劉崇陽.利用參數稀疏性的卷積神經網絡計算優化及其FPGA加速器設計[J].電子與信息學報,2018,40(06):1368-1374.

[41] Yufei Ma, Naveen Suda, Yu Cao, Sarma Vrudhula, and Jae Sun Seo. 2018. ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler. Integration, the VLSI Journal. doi:10.1016/j.vlsi.2017.12.009.

[42] 餘子健,馬德,嚴曉浪,沈君成.基於FPGA的卷積神經網絡加速器[J].計算機工程,2017,43(01):109-114+119.

[43] T. Chen et al., “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proc. ASPLOS, Salt Lake City, UT, USA, 2014, pp. 269–284.

[44] D. L. Ly and P. Chow, “A high-performance FPGA architecture for restricted Boltzmann machines,” in Proc. FPGA, Monterey, CA, USA,2009, pp. 73–82.

[45] S. K. Kim, L. C. McAfee, P. L. McMahon, and K. Olukotun, “A highly scalable restricted Boltzmann machine FPGA implementation,” in Proc. FPL, Prague, Czech Republic, 2009, pp. 367–372.

[46] J. Qiu et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proc. FPGA, Monterey, CA, USA, 2016,pp. 26–35.

[47] Q. Yu, C. Wang, X. Ma, X. Li, and X. Zhou, “A deep learning prediction process accelerator based FPGA,” in Proc. CCGRID, Shenzhen, China, 2015, pp. 1159–1162.

[48] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, X. Zhou, “DLAU: A scalable deep learning accelerator unit on FPGA”, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 36, no. 3, pp. 513-517, Mar. 2017.

[49] 陳煌,祝永新,田犁,汪輝,封松林.基於FPGA的卷積神經網絡卷積層並行加速結構設計[J].微電子學與計算機,2018,35(10):85-88.





