DeepLearning4j in Practice (7): Handwritten Digit Recognition on GPU and a Performance Comparison

Eclipse Deeplearning4j GitChat course: https://gitbook.cn/gitchat/column/5bfb6741ae0e5f436e35cd9f
Eclipse Deeplearning4j blog series: https://blog.csdn.net/wangongxi
Eclipse Deeplearning4j on GitHub: https://github.com/eclipse/deeplearning4j

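Earlier posts in this series trained deep neural networks in two ways: on a single machine and distributed on Spark. DeepLearning4j also supports multi-GPU training. In this post I summarize how to train and evaluate a DNN/CNN on GPUs, and I compare performance across CPU, a single GPU, and multiple GPUs. Because the focus is on training on the MNIST dataset on GPUs, I do not cover environment setup, such as installing Java and CUDA, in detail.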

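Setting up the software environment comes down to two things: installing the JDK and installing CUDA. At the time of writing, the latest DeepLearning4j and Nd4j releases support CUDA 8.0, and the JDK must be 1.7 or later.

Once everything is installed, run java -version and nvidia-smi to confirm that the environment is set up correctly. Output similar to the following means the setup is fine; otherwise, reinstall.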
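The same two checks can also be made from inside the JVM, which is handy on a machine where you only have a jar to run. The class below is a minimal sketch using only the standard library; the name EnvCheck and its methods are my own for illustration and are not part of DeepLearning4j.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reports the JVM version and the output of nvidia-smi. */
final class EnvCheck {

    /** Returns the version string of the JVM that is running this code. */
    static String javaVersion() {
        return System.getProperty("java.version");
    }

    /**
     * Runs "nvidia-smi" and returns its output, or null when the tool is
     * missing, cannot be started, or exits with a non-zero status.
     */
    static String nvidiaSmi() {
        try {
            Process p = new ProcessBuilder("nvidia-smi").redirectErrorStream(true).start();
            StringBuilder out = new StringBuilder();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) {
                    out.append(line).append('\n');
                }
            }
            return p.waitFor() == 0 ? out.toString() : null;
        } catch (IOException e) {
            return null; // executable not found or not runnable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + javaVersion());
        String smi = nvidiaSmi();
        System.out.println(smi != null ? smi : "nvidia-smi not available");
    }
}
```

A null result from nvidiaSmi() only says that the command could not be run successfully; it does not tell you whether the driver, the PATH, or the hardware is at fault.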

GPU configuration (screenshot):

Java environment (screenshot):

The output shows that the JDK is OpenJDK 1.7 and that the machine has two P40 GPU cards.

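The code is organized as follows.

I am using v0.8 here, the latest DeepLearning4j release at the time of writing, so the pom file differs slightly from the one in earlier posts: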

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>DeepLearning</groupId>
  <artifactId>DeepLearning</artifactId>
  <version>2.0</version>
  
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <nd4j.version>0.8.0</nd4j.version>
    <dl4j.version>0.8.0</dl4j.version>
    <datavec.version>0.8.0</datavec.version>
    <scala.binary.version>2.11</scala.binary.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-native</artifactId>
      <version>${nd4j.version}</version>
    </dependency>
    <dependency>
      <groupId>org.deeplearning4j</groupId>
      <artifactId>deeplearning4j-core</artifactId>
      <version>${dl4j.version}</version>
    </dependency>
    <dependency>
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-cuda-8.0</artifactId>
      <version>${nd4j.version}</version>
    </dependency>
    <dependency>
      <groupId>org.deeplearning4j</groupId>
      <artifactId>deeplearning4j-parallel-wrapper_${scala.binary.version}</artifactId>
      <version>${dl4j.version}</version>
    </dependency>
  </dependencies>
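  <build>
    <plugins>
      <!-- source/target are maven-compiler-plugin settings; maven-jar-plugin does not use them -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <archive>
            <manifest>
              <mainClass>cn.live.wangongxi.cv.CNNMnist</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>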
</project>

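With the Maven project created and the POM content above in place, we can start on the application logic. I followed the example from the official site, which consists of the following parts:

1. Initialize the CUDA environment (under the hood this covers hardware detection, CUDA version checks, and some GPU parameters)

2. Read the MNIST binary files (same as in earlier posts)

3. Define the CNN; I am using LeNet again here

4. Train the model and evaluate its metrics

Here is the code for the first part: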

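        import org.nd4j.jita.conf.CudaEnvironment;
        import org.nd4j.linalg.api.buffer.DataBuffer;
        import org.nd4j.linalg.api.buffer.util.DataTypeUtil;

        // Precision setting; the common choices are single, double, and half precision
        // HALF: half precision
        DataTypeUtil.setDTypeForContext(DataBuffer.Type.HALF);
        // FLOAT: single precision
        //DataTypeUtil.setDTypeForContext(DataBuffer.Type.FLOAT);
        // DOUBLE: double precision
        //DataTypeUtil.setDTypeForContext(DataBuffer.Type.DOUBLE);

        // Get the CUDA environment instance and set its parameters
        CudaEnvironment.getInstance().getConfiguration()
            // whether to allow multiple GPUs
            .allowMultiGPU(false)
            // maximum amount of device (GPU) memory used for caching, in bytes
            .setMaximumDeviceCache(2L * 1024L * 1024L * 1024L)
            // whether to allow peer-to-peer (P2P) memory access between GPUs
            .allowCrossDeviceAccess(false);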

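Set the GPU compute precision to suit your needs. As the code shows, the common choices are single, double, and half precision, and you select one by passing the corresponding value of the Type enum defined in DataBuffer. If you set nothing, single precision is the default.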
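To get a feel for what the precision choice means for memory, here is a small worked calculation. It assumes the standard widths of 2, 4, and 8 bytes per value for half, single, and double precision, and uses the batch size of 128 from this post with 28x28 MNIST images. The class name PrecisionFootprint is hypothetical and the code does not touch ND4J.

```java
/** Hypothetical helper: memory needed for one MNIST mini-batch at each precision. */
final class PrecisionFootprint {

    /** Bytes needed to store count values at bytesPerValue bytes each. */
    static long bytes(long count, int bytesPerValue) {
        return count * bytesPerValue;
    }

    public static void main(String[] args) {
        // batchSize * height * width = 128 * 28 * 28 = 100352 pixel values
        long values = 128L * 28 * 28;
        System.out.println("HALF   (2 bytes): " + bytes(values, 2)); // 200704
        System.out.println("FLOAT  (4 bytes): " + bytes(values, 4)); // 401408
        System.out.println("DOUBLE (8 bytes): " + bytes(values, 8)); // 802816
    }
}
```

The input batch is tiny either way; the point is only that every array, including weights and activations, scales by the same factor of two when you move from one precision to the next.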

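The next part sets a few CUDA context parameters, such as those listed in the code: the amount of device memory used for caching, the flag for P2P memory access, and the flag for multi-GPU execution. For a relatively simple network and a modest amount of data, the defaults are enough. Only a few parameters are set here, which is sufficient for training LeNet on the MNIST dataset.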
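One detail in the cache setting is easy to miss: the size is written as 2L * 1024L * 1024L * 1024L, with long literals. That product is 2^31 bytes (2 GiB), which is one more than the largest int, so the same expression in int arithmetic silently wraps around. The sketch below shows the difference using plain Java; the class and method names are made up for illustration.

```java
/** Hypothetical helper: why the cache size is computed with long literals. */
final class CacheSize {

    /** n GiB in bytes, computed in long arithmetic. */
    static long gib(long n) {
        return n * 1024L * 1024L * 1024L;
    }

    /** The same product in int arithmetic; wraps around for n >= 2. */
    static int gibAsInt(int n) {
        return n * 1024 * 1024 * 1024;
    }

    public static void main(String[] args) {
        System.out.println(gib(2));      // 2147483648
        System.out.println(gibAsInt(2)); // -2147483648
    }
}
```

Dropping the L suffixes would therefore hand a negative number to the configuration instead of 2 GiB.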

Parts 2 to 4 are almost identical to the code in earlier posts, so here is the code directly:

        int nChannels = 1;
        int outputNum = 10;

        int batchSize = 128;
        int nEpochs = 10;
        int iterations = 1;
        int seed = 123;

        log.info("Load data....");
        DataSetIterator mnistTrain = new MnistDataSetIterator(batchSize,true,12345);
        DataSetIterator mnistTest = new MnistDataSetIterator(batchSize,false,12345);

        log.info("Build model....");
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
            .seed(seed)
            .iterations(iterations)
            .regularization(true).l2(0.0005)
            .learningRate(.01)
            .weightInit(WeightInit.XAVIER)
            .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
            .updater(Updater.NESTEROVS).momentum(0.9)
            .list()
            .layer(0, new ConvolutionLayer.Builder(5, 5)
                .nIn(nChannels)
                .stride(1, 1)
                .nOut(20)
                .activation(Activation.IDENTITY)
                .build())
            .layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2,2)
                .stride(2,2)
                .build())
            .layer(2, new ConvolutionLayer.Builder(5, 5)
                .stride(1, 1)
                .nOut(50)
                .activation(Activation.IDENTITY)
                .build())
            .layer(3, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
                .kernelSize(2,2)
                .stride(2,2)
                .build())
            .layer(4, new DenseLayer.Builder().activation(Activation.RELU)
                .nOut(500).build())
            .layer(5, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nOut(outputNum)
                .activation(Activation.SOFTMAX)
                .build())
            .setInputType(InputType.convolutionalFlat(28,28,1))
            .backprop(true).pretrain(false).build();
        MultiLayerNetwork model = new MultiLayerNetwork(conf);
        model.init();
        log.info("Train model....");
        model.setListeners(new ScoreIterationListener(100));
        long timeX = System.currentTimeMillis();

        for( int i=0; i<nEpochs; i++ ) {
            long time1 = System.currentTimeMillis();
            model.fit(mnistTrain);
            long time2 = System.currentTimeMillis();
            log.info("*** Completed epoch {}, time: {} ***", i, (time2 - time1));
        }
        long timeY = System.currentTimeMillis();

        log.info("*** Training complete, time: {} ***", (timeY - timeX));

        log.info("Evaluate model....");
        Evaluation eval = new Evaluation(outputNum);
        while(mnistTest.hasNext()){
            DataSet ds = mnistTest.next();
            INDArray output = model.output(ds.getFeatureMatrix(), false);
            eval.eval(ds.getLabels(), output);
        }
        log.info(eval.stats());

        log.info("****************Example finished********************");
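The layer sizes in the LeNet configuration above can be checked by hand. The sketch below assumes no padding, so that each convolution or pooling layer produces (in - kernel) / stride + 1 outputs per dimension, and it assumes the standard MNIST training set of 60,000 images. LeNetShapes is a hypothetical helper written for this post, not a DL4J class.

```java
/** Hypothetical helper: traces feature-map sizes through the LeNet layers. */
final class LeNetShapes {

    /** Output size along one dimension for a window with no padding. */
    static int outSize(int in, int kernel, int stride) {
        return (in - kernel) / stride + 1;
    }

    /** Number of mini-batches needed to cover all examples once. */
    static int batchesPerEpoch(int examples, int batchSize) {
        return (examples + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        int s = 28;                      // input is 28x28x1
        s = outSize(s, 5, 1);            // layer 0, conv 5x5:  24x24x20
        s = outSize(s, 2, 2);            // layer 1, max pool:  12x12x20
        s = outSize(s, 5, 1);            // layer 2, conv 5x5:  8x8x50
        s = outSize(s, 2, 2);            // layer 3, max pool:  4x4x50
        int denseInputs = s * s * 50;    // flattened input to the dense layer
        System.out.println("final map: " + s + "x" + s + ", dense inputs: " + denseInputs);
        System.out.println("iterations per epoch: " + batchesPerEpoch(60000, 128));
    }
}
```

So the dense layer of 500 units receives 800 inputs, and with a batch size of 128 one epoch is 469 iterations, which is why ScoreIterationListener(100) logs only a handful of scores per epoch.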

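The training and evaluation code shown earlier runs on a single GPU card. Parallel training on multiple GPUs needs a few changes. First, when setting up the CUDA environment in step one, multi-GPU and P2P memory access must be allowed, that is, allowMultiGPU and allowCrossDeviceAccess must be set to true. Then add the parallel training logic: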

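        ParallelWrapper wrapper = new ParallelWrapper.Builder(model)
            // number of DataSets to prefetch
            .prefetchBuffer(24)
            // number of worker threads
            .workers(4)
            // average the workers' parameters every 3 iterations
            .averagingFrequency(3)
            .reportScoreAfterAveraging(true)
            .useLegacyAveraging(true)
            .build();

        // in the training loop, call the wrapper in place of model.fit(mnistTrain)
        wrapper.fit(mnistTrain);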

With this in place, a machine with several GPU cards can train in parallel across them.
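As I understand it, the idea behind averagingFrequency is parameter averaging: each worker trains its own copy of the model on different mini-batches, and every few iterations the copies are averaged and the result is pushed back to all workers. The toy code below illustrates only that idea with plain arrays. It is not how ParallelWrapper is implemented internally, and the names are made up for this example.

```java
import java.util.Arrays;

/** Toy illustration of parameter averaging across workers (not DL4J code). */
final class ParameterAveraging {

    /** Element-wise mean of the workers' parameter vectors. */
    static double[] average(double[][] workerParams) {
        int n = workerParams[0].length;
        double[] mean = new double[n];
        for (double[] params : workerParams) {
            for (int i = 0; i < n; i++) {
                mean[i] += params[i];
            }
        }
        for (int i = 0; i < n; i++) {
            mean[i] /= workerParams.length;
        }
        return mean;
    }

    /** True when the given 1-based iteration is an averaging step. */
    static boolean isAveragingStep(int iteration, int averagingFrequency) {
        return iteration % averagingFrequency == 0;
    }

    public static void main(String[] args) {
        // two workers whose parameters have drifted apart after local updates
        double[][] workers = { {1.0, 2.0}, {3.0, 6.0} };
        for (int iteration = 1; iteration <= 3; iteration++) {
            if (isAveragingStep(iteration, 3)) {
                System.out.println("iteration " + iteration + ": "
                        + Arrays.toString(average(workers))); // [2.0, 4.0]
            }
        }
    }
}
```

Averaging less often means less synchronization between cards, at the cost of letting the copies drift further apart between averaging steps.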

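Below are the performance comparison for training on MNIST on CPU, a single GPU, and multiple GPUs, along with GPU utilization during training:

Single-card training (screenshot):

Two-card parallel training (screenshot):

Training time comparison:

To wrap up with a short summary: Deeplearning4j supports training on a single GPU, on multiple GPUs, and on a cluster. The low-level interfaces are already well encapsulated and the exposed APIs are fairly high-level, so in most cases setting a few properties is all it takes. The prerequisite, of course, is that the hardware and CUDA are installed correctly.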
