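Preface

I started learning the hardware implementation of the YOLO neural network for object detection last November, and six months have passed since then. As a hardware engineer with weak C/C++ fundamentals, I felt my way forward bit by bit, working from the darknet author's source code and from explanations of the principles pieced together from around the web. Progress was slow at first, and on some days I felt I had understood only a few lines of code. Later, once I had found my footing, every week brought solid gains and real progress. Here I summarize what I have learned over this past half year and record the journey, in the hope that it offers at least a little help to others working on the same problem.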

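How yolov3-tiny works

YOLO uses a single CNN model for end-to-end object detection: the input image is first resized to a fixed size (448x448 in the original YOLO; yolov3-tiny uses 416x416), then fed to the CNN, and finally the network's predictions are post-processed to obtain the detected objects.
The core idea of YOLO is to take the whole image as the network input and regress the positions of the bounding boxes and their classes directly at the output layer. The image is divided into an SxS grid (grid cells); if the center of an object falls inside a grid cell, that cell is responsible for predicting the object.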

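Each bounding box predicts (x, y, w, h) and a confidence, 5 values in total, plus class information over C classes. With SxS grid cells, each predicting B bounding boxes, and each box carrying a probability for each of the C classes, the output is a tensor of size S x S x B x (5+C).

Note: in the original YOLO, class information is predicted per grid cell and confidence per bounding box. In YOLOv3, and therefore in yolov3-tiny, every bounding box carries its own class probabilities, which is where the B x (5+C) above comes from.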

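yolov3-tiny has two output layers (yolo layers), with 13x13 and 26x26 grids. Each grid cell predicts 3 bounding boxes over 80 classes, so the inputs to the two yolo layers have sizes 13x13x255 and 26x26x255, where 255 = 3 x (5+80).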
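As a quick check of these sizes, here is a minimal sketch of the arithmetic. The function names `yolo_channels` and `yolo_outputs` are made up for this illustration and do not exist in darknet.

```c
/* Channels of one yolo output layer: each of the `boxes` predictors in a
 * grid cell emits 4 box coordinates + 1 confidence + `classes` scores.
 * (Hypothetical helper, not part of darknet.) */
int yolo_channels(int boxes, int classes)
{
    return boxes * (4 + 1 + classes);
}

/* Total number of values in a grid x grid yolo output layer.
 * (Hypothetical helper, not part of darknet.) */
int yolo_outputs(int grid, int boxes, int classes)
{
    return grid * grid * yolo_channels(boxes, classes);
}
```

With 3 boxes and 80 classes, `yolo_channels(3, 80)` gives 255, and the two layers hold 13 x 13 x 255 = 43095 and 26 x 26 x 255 = 172380 values.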
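The layer structure of yolov3-tiny is as follows:

[Figure: yolov3-tiny layer table]

A more intuitive diagram of the model:

[Figure: yolov3-tiny model diagram]

As can be seen, yolov3-tiny has 24 layers (indexed 0 to 23) of five different types: convolutional layers (13), maxpool layers (6), route layers (2), an upsample layer (1) and yolo output layers (2).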

yolov3-tiny source code analysis

Configuring the network structure

Forward propagation in yolov3-tiny is mainly carried out in the test_detector function in detector.c:

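/** Forward-inference test function for a detection model.
* @param datacfg       path to the dataset information file (a cfg/*.data file), which describes the dataset, e.g. cfg/coco.data
* @param cfgfile       path to the network configuration file (a cfg/*.cfg file), which holds all structural parameters of a network, e.g. cfg/yolo.cfg
* @param weightfile    path to the trained network weights file, e.g. the yolo.weights file downloaded from the darknet website
* @param filename      path to the image to run detection on (a single image)
* @param thresh        threshold; a detection is considered valid only if its class probability exceeds this value
* @param hier_thresh
* @param outfile
* @param fullscreen
* @details This function only performs forward inference and contains no training, so the network must be trained beforehand and the trained weights file loaded;
*          these files can be downloaded from the author's website by following the instructions there. The function is called from main in darknet.c. Strictly speaking, this file does not belong in the darknet network structure folder:
*          it is only a test file, or rather an example, and belongs in the example folder (newer versions of darknet already do this, as can be seen on GitHub).
*          The flow of this function is: .
*/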
void test_detector(char *datacfg, char *cfgfile, char *weightfile, char *filename, float thresh,
    float hier_thresh, int dont_show, int ext_output, int save_labels, char *outfile, int letter_box, int benchmark_layers)
{
	// Read the dataset information (test and training data) from the data file datacfg (.data file) into options.
	// options is a list whose nodes hold void pointers to kvp items, i.e. key/value pairs (similar to a Map in C++)
    list *options = read_data_cfg(datacfg);
	// Get the path of the class-name file; the key "names" selects that entry in options (e.g. names = data/coco.names)
    char *name_list = option_find_str(options, "names", "data/names.list");
    int names_size = 0;
	// Read the object names/labels from data/**.names
    char **names = get_labels_custom(name_list, &names_size); //get_labels(name_list);
	
    // Load all the character label images in the data/labels/ folder
    image **alphabet = load_alphabet();

    network net = parse_network_cfg_custom(cfgfile, 1, 1); // set batch=1; configures the parameters of every layer (important)

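The parse_network_cfg_custom function in parser.c configures the network structure from the yolov3-tiny.cfg file, fixing each layer's type, input and output channel counts, feature-map size, kernel size, and so on.

// Configure the parameters of every network layer
network parse_network_cfg_custom(char *filename, int batch, int time_steps)
{
	// Read the structural parameters of all layers from the network configuration file into sections;
	// each node of sections holds all the structural parameters of one layer
    list *sections = read_cfg(filename);
	// Take the first node of sections. As a look at any cfg/***.cfg file shows, the first block (starting with [net])
	// does not describe a layer; it holds general parameters for the whole network, such as the learning rate,
	// decay, input image width and height, and batch size.
	// The parameters of the individual layers start from the second block, e.g. [convolutional], [maxpool], ...
	// These blocks are not numbered and only name the layer type, but they appear in the file in order,
	// so the order of the sections list is the order in the file.
    node *n = sections->front;
    if(!n) error("Config file has no sections");
	// Create the network and allocate its memory: the layer count is sections->size - 1, because the first section holds the general network parameters rather than a layer
    network net = make_network(sections->size - 1);
	// Index of the GPU in use (gpu_index is declared with extern in cuda.c).
	// Before parse_network_cfg() is called, cuda_set_device() has already set gpu_index to the currently active GPU
    net.gpu_index = gpu_index;
	// The members of the size_params struct are not pointers
    size_params params;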
    size_params params;

    if (batch > 0) params.train = 0;    // allocates memory for Detection only
    else params.train = 1;              // allocates memory for Detection & Training

    section *s = (section *)n->val;
    list *options = s->options;
    if(!is_network(s)) error("First section must be [net] or [network]");
    parse_net_options(options, &net);

#ifdef GPU
    printf("net.optimized_memory = %d \n", net.optimized_memory);
    if (net.optimized_memory >= 2 && params.train) {
        pre_allocate_pinned_memory((size_t)1024 * 1024 * 1024 * 8);   // pre-allocate 8 GB CPU-RAM for pinned memory
    }
#endif  // GPU

    params.h = net.h;
    params.w = net.w;
    params.c = net.c;
    params.inputs = net.inputs;
    if (batch > 0) net.batch = batch;
    if (time_steps > 0) net.time_steps = time_steps;
    if (net.batch < 1) net.batch = 1;
    if (net.time_steps < 1) net.time_steps = 1;
    if (net.batch < net.time_steps) net.batch = net.time_steps;
    params.batch = net.batch;
    params.time_steps = net.time_steps;
    params.net = net;
    printf("mini_batch = %d, batch = %d, time_steps = %d, train = %d \n", net.batch, net.batch * net.subdivisions, net.time_steps, params.train);

    int avg_outputs = 0;
    float bflops = 0;
    size_t workspace_size = 0;
    size_t max_inputs = 0;
    size_t max_outputs = 0;
    n = n->next;
    int count = 0;
    free_section(s);

	// stderr is used here to print the network structure, not to report an error
    fprintf(stderr, "   layer   filters  size/strd(dil)      input                output\n");
    while(n){
        params.index = count;
        fprintf(stderr, "%4d ", count);
        s = (section *)n->val;
        options = s->options;
		// Define the layer
        layer l = { (LAYER_TYPE)0 };
		// Get the layer type

        LAYER_TYPE lt = string_to_layer_type(s->type);
		
		// Configure each layer's parameters according to its type
        if(lt == CONVOLUTIONAL){// yolov3-tiny: convolutional layers, 13 in total
            l = parse_convolutional(options, params);
        }else if(lt == LOCAL){
            l = parse_local(options, params);
        }else if(lt == ACTIVE){
            l = parse_activation(options, params);
        }else if(lt == RNN){
            l = parse_rnn(options, params);
        }else if(lt == GRU){
            l = parse_gru(options, params);
        }else if(lt == LSTM){
            l = parse_lstm(options, params);
        }else if (lt == CONV_LSTM) {
            l = parse_conv_lstm(options, params);
        }else if(lt == CRNN){
            l = parse_crnn(options, params);
        }else if(lt == CONNECTED){
            l = parse_connected(options, params);
        }else if(lt == CROP){
            l = parse_crop(options, params);
        }else if(lt == COST){
            l = parse_cost(options, params);
            l.keep_delta_gpu = 1;
        }else if(lt == REGION){
            l = parse_region(options, params);
            l.keep_delta_gpu = 1;
        }else if (lt == YOLO) {// yolov3-tiny: yolo layers, 2 in total
            l = parse_yolo(options, params);
            l.keep_delta_gpu = 1;
        }else if (lt == GAUSSIAN_YOLO) {
            l = parse_gaussian_yolo(options, params);
            l.keep_delta_gpu = 1;
        }else if(lt == DETECTION){
            l = parse_detection(options, params);
        }else if(lt == SOFTMAX){
            l = parse_softmax(options, params);
            net.hierarchy = l.softmax_tree;
            l.keep_delta_gpu = 1;
        }else if(lt == NORMALIZATION){
            l = parse_normalization(options, params);
        }else if(lt == BATCHNORM){
            l = parse_batchnorm(options, params);
        }else if(lt == MAXPOOL){// yolov3-tiny: maxpool layers, 6 in total
            l = parse_maxpool(options, params);
        }else if (lt == LOCAL_AVGPOOL) {
            l = parse_local_avgpool(options, params);
        }else if(lt == REORG){
            l = parse_reorg(options, params);
        }else if (lt == REORG_OLD) {
            l = parse_reorg_old(options, params);
        }else if(lt == AVGPOOL){
            l = parse_avgpool(options, params);
        }else if(lt == ROUTE){// yolov3-tiny: route layers, 2 in total
            l = parse_route(options, params);
            int k;
            for (k = 0; k < l.n; ++k) {
                net.layers[l.input_layers[k]].use_bin_output = 0;
                net.layers[l.input_layers[k]].keep_delta_gpu = 1;
            }
        }else if (lt == UPSAMPLE) {// yolov3-tiny: upsample layer, 1 in total
            l = parse_upsample(options, params, net);
        }else if(lt == SHORTCUT){
            l = parse_shortcut(options, params, net);
            net.layers[count - 1].use_bin_output = 0;
            net.layers[l.index].use_bin_output = 0;
            net.layers[l.index].keep_delta_gpu = 1;
        }else if (lt == SCALE_CHANNELS) {
            l = parse_scale_channels(options, params, net);
            net.layers[count - 1].use_bin_output = 0;
            net.layers[l.index].use_bin_output = 0;
            net.layers[l.index].keep_delta_gpu = 1;
        }
        else if (lt == SAM) {
            l = parse_sam(options, params, net);
            net.layers[count - 1].use_bin_output = 0;
            net.layers[l.index].use_bin_output = 0;
            net.layers[l.index].keep_delta_gpu = 1;
        }else if(lt == DROPOUT){
            l = parse_dropout(options, params);
            l.output = net.layers[count-1].output;
            l.delta = net.layers[count-1].delta;
            .........

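Loading the weights file

load_weights_upto in parser.c reads the weights of each layer from the weights file, following the configuration of the convolutional layers.

// Reads the weights file
void load_weights_upto(network *net, char *filename, int cutoff)//cutoff = net->n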
{
#ifdef GPU
    if(net->gpu_index >= 0){
        cuda_set_device(net->gpu_index);
    }
#endif
    fprintf(stderr, "Loading weights from %s...\n", filename);
    fflush(stdout);
    FILE *fp = fopen(filename, "rb");
    if(!fp) file_error(filename);

    int major;
    int minor;
    int revision;
    fread(&major, sizeof(int), 1, fp);// read one 4-byte value
    fread(&minor, sizeof(int), 1, fp);// read one 4-byte value
    fread(&revision, sizeof(int), 1, fp);// read one 4-byte value
	printf("the size of int in x64 is %zu bytes,attention!!!\n", sizeof(int));// 4 on both x86 and x64
	printf("major ,minor,revision of weight is %d, %d ,%d\n", major, minor, revision);// 0.2.0
    if ((major * 10 + minor) >= 2) {// this branch runs
        printf("\n seen 64");
        uint64_t iseen = 0;
        fread(&iseen, sizeof(uint64_t), 1, fp);// read one 8-byte value
		printf("the size of uint64_t is %zu\n", sizeof(uint64_t));
        *net->seen = iseen;
    }
    else {
        printf("\n seen 32");
        uint32_t iseen = 0;
        fread(&iseen, sizeof(uint32_t), 1, fp);
        *net->seen = iseen;
    }
    *net->cur_iteration = get_current_batch(*net);
    printf(", trained: %.0f K-images (%.0f Kilo-batches_64) \n", (float)(*net->seen / 1000), (float)(*net->seen / 64000));
    int transpose = (major > 1000) || (minor > 1000);

    int i;
    for(i = 0; i < net->n && i < cutoff; ++i){// cutoff = net->n
        layer l = net->layers[i];
        if (l.dontload) continue;// always 0 here; when set, it skips the rest of the loop body and goes straight to ++i
        if(l.type == CONVOLUTIONAL && l.share_layer == NULL){ // the only branch that runs for yolov3-tiny
            load_convolutional_weights(l, fp);
			//printf("network layer [%d] is CONVOLUTIONAL \n",i);
        }
        .......

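Before the weights of the individual yolov3-tiny layers are read, four training-related values are read first: major, minor, revision and iseen. They are not actually used in the forward-propagation project.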
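The header layout described above can be sketched as a small standalone reader. This is only an illustration under the assumption that `int` is 4 bytes, as the listing reports; the `weights_header` struct and `read_weights_header` function are hypothetical names, not darknet APIs.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the four header fields. */
typedef struct {
    int32_t major, minor, revision;
    uint64_t seen;   /* number of images seen during training */
} weights_header;

/* Reads the header from an open weights file.
 * Returns 0 on success and -1 if the file is too short. */
int read_weights_header(FILE *fp, weights_header *h)
{
    if (fread(&h->major, sizeof(int32_t), 1, fp) != 1) return -1;
    if (fread(&h->minor, sizeof(int32_t), 1, fp) != 1) return -1;
    if (fread(&h->revision, sizeof(int32_t), 1, fp) != 1) return -1;

    if (h->major * 10 + h->minor >= 2) {
        /* Newer files store `seen` as a 64-bit value (8 bytes). */
        if (fread(&h->seen, sizeof(uint64_t), 1, fp) != 1) return -1;
    } else {
        /* Older files store it as a 32-bit value (4 bytes). */
        uint32_t seen32 = 0;
        if (fread(&seen32, sizeof(uint32_t), 1, fp) != 1) return -1;
        h->seen = seen32;
    }
    return 0;
}
```

For a version 0.2.0 file such as the one in the listing, the header therefore takes 3 x 4 + 8 = 20 bytes, and the first layer's data starts right after it.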

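The load_convolutional_weights function in parser.c does the actual loading of the yolov3-tiny weights, including the kernel weights (weight), the biases (bias) and the batch-normalization (BN) parameters.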

void load_convolutional_weights(layer l, FILE *fp)
{
	static int flipped_num;
    if(l.binary){
        //load_convolutional_weights_binary(l, fp);
        //return;
    }
    int num = l.nweights;
	//int num = l.n*l.c*l.size*l.size;// l.n: number of output channels, l.c: number of input channels
    int read_bytes;
    read_bytes = fread(l.biases, sizeof(float), l.n, fp);// read the biases: l.n floats
    if (read_bytes > 0 && read_bytes < l.n) printf("\n Warning: Unexpected end of wights-file! l.biases - l.index = %d \n", l.index);
    //fread(l.weights, sizeof(float), num, fp); // as in connected layer
    if (l.batch_normalize && (!l.dontloadscales)){
        read_bytes = fread(l.scales, sizeof(float), l.n, fp);// read the batch-normalization scales: l.n floats
        if (read_bytes > 0 && read_bytes < l.n) printf("\n Warning: Unexpected end of wights-file! l.scales - l.index = %d \n", l.index);
        read_bytes = fread(l.rolling_mean, sizeof(float), l.n, fp);// read the batch-normalization rolling means: l.n floats
        if (read_bytes > 0 && read_bytes < l.n) printf("\n Warning: Unexpected end of wights-file! l.rolling_mean - l.index = %d \n", l.index);
        read_bytes = fread(l.rolling_variance, sizeof(float), l.n, fp);// read the batch-normalization rolling variances: l.n floats
        if (read_bytes > 0 && read_bytes < l.n) printf("\n Warning: Unexpected end of wights-file! l.rolling_variance - l.index = %d \n", l.index);

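Folding batch normalization into the weights

In yolov3-tiny, every convolutional layer except the two that feed the yolo layers (layers 15 and 22) applies Batch Normalization to its result, after the convolution and before the activation function:
Because both the BN layer and the convolution are linear operations, the BN parameters can be folded into the weights and biases in advance, which replaces the BN layer after the convolution: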
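The two relations can be written out as follows. This is a reconstruction that matches the fuse_conv_batchnorm code shown next, not necessarily the exact notation of the original figures. For one output channel, $w$ is the kernel, $\beta$ the bias (biases[f]), $\gamma$ the BN scale (scales[f]), $\mu$ the rolling mean, $\sigma^2$ the rolling variance and $\epsilon = 0.00001$:

```latex
\begin{aligned}
y &= \gamma \,\frac{(w * x) - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \\
  &= \underbrace{\frac{\gamma\, w}{\sqrt{\sigma^2 + \epsilon}}}_{w'} * x
     \;+\; \underbrace{\beta - \frac{\gamma\, \mu}{\sqrt{\sigma^2 + \epsilon}}}_{b'}
\end{aligned}
```

After the fold, the layer computes $y = w' * x + b'$ with no separate BN step, where $w'$ and $b'$ are the values the code writes back into weights and biases.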

The fuse_conv_batchnorm function in network.c merges the BN layer into the weights.

void fuse_conv_batchnorm(network net)
{
    int j;
    for (j = 0; j < net.n; ++j) {
        layer *l = &net.layers[j];
		    // printf("the %d layer batch_normalize is %d,   groups is %d \n", j, l->batch_normalize, l->groups);
        if (l->type == CONVOLUTIONAL) { // the only branch that runs: fuse the convolutional layer with its batch norm
             //printf(" Merges Convolutional-%d and batch_norm \n", j);

            if (l->share_layer != NULL) {// l->share_layer is always 0 here, so this branch does not run
                l->batch_normalize = 0;
            }

            if (l->batch_normalize) {// convolutional layers 15 and 22 have no batch normalization; all others take this branch
                int f;
                for (f = 0; f < l->n; ++f)// loop over the l->n output channels (filters) of this layer
                {
                    l->biases[f] = l->biases[f] - (double)l->scales[f] * l->rolling_mean[f] / (sqrt((double)l->rolling_variance[f] + .00001));

                    const size_t filter_size = l->size*l->size*l->c / l->groups;// kernel_size * kernel_size * c / groups; l->groups is always 1 for these convolutional layers
                    int i;
                    for (i = 0; i < filter_size; ++i) {
                        int w_index = f*filter_size + i;

                        l->weights[w_index] = (double)l->weights[w_index] * l->scales[f] / (sqrt((double)l->rolling_variance[f] + .00001));
                    }
                }

                free_convolutional_batchnorm(l);//no use
                l->batch_normalize = 0;
                ......

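Input image

The image fed to the yolov3-tiny network is 416x416. An image of any other size is resized (scaled with bilinear interpolation, not cropped) by the resize_image function in image.c. This is essentially the only preprocessing the whole YOLO pipeline applies to the input image, and it is one of the things that makes YOLO so convenient in engineering practice: there is no denoising, filtering or similar preprocessing, and the image goes straight into the network.

// im: input image  w: 416  h: 416
// Purpose: resize the input image to 416x416 by scaling it down or up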
image resize_image(image im, int w, int h)
{
    if (im.w == w && im.h == h) return copy_image(im);

    image resized = make_image(w, h, im.c);// empty 416 x 416 x 3 buffer
    image part = make_image(w, im.h, im.c);// empty 416 x im.h x im.c buffer
    int r, c, k;
    float w_scale = (float)(im.w - 1) / (w - 1);// horizontal scale factor
    float h_scale = (float)(im.h - 1) / (h - 1);// vertical scale factor
    for(k = 0; k < im.c; ++k){
        for(r = 0; r < im.h; ++r){
            for(c = 0; c < w; ++c){// 416
                float val = 0;
                if(c == w-1 || im.w == 1){// c = 415, the last column
                    val = get_pixel(im, im.w-1, r, k);// take the pixel from the last column of the source image
                } else {
                    float sx = c*w_scale;
                    int ix = (int) sx;
                    float dx = sx - ix;
                    val = (1 - dx) * get_pixel(im, ix, r, k) + dx * get_pixel(im, ix+1, r, k);
                }
                set_pixel(part, c, r, k, val);
            }
        }
    }
    for(k = 0; k < im.c; ++k){
        for(r = 0; r < h; ++r){
            float sy = r*h_scale;
            int iy = (int) sy;
            float dy = sy - iy;
            for(c = 0; c < w; ++c){
                float val = (1-dy) * get_pixel(part, c, iy, k);
                set_pixel(resized, c, r, k, val);
            }
            if(r == h-1 || im.h == 1) continue;
            for(c = 0; c < w; ++c){
                float val = dy * get_pixel(part, c, iy+1, k);
                add_pixel(resized, c, r, k, val);
            }
        }
    }

    free_image(part);
    return resized;
}
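The horizontal pass of resize_image is easier to follow on a single row. The sketch below applies the same mapping to one row of samples; `resize_row` is a made-up name for this illustration and is not a darknet function.

```c
/* Linearly resamples a row of `in_w` samples to `out_w` samples, using the
 * same mapping as the horizontal pass above: the first and last samples of
 * the source and the destination coincide. (Hypothetical helper.) */
void resize_row(const float *in, int in_w, float *out, int out_w)
{
    float scale = (out_w > 1) ? (float)(in_w - 1) / (out_w - 1) : 0.0f;
    for (int c = 0; c < out_w; ++c) {
        if (c == out_w - 1 || in_w == 1) {
            out[c] = in[in_w - 1];          /* last column: copy directly */
        } else {
            float sx = c * scale;           /* position in the source row */
            int ix = (int)sx;               /* left neighbour */
            float dx = sx - ix;             /* fractional distance from it */
            out[c] = (1 - dx) * in[ix] + dx * in[ix + 1];
        }
    }
}
```

Stretching the row {0, 10, 20} to five samples gives {0, 5, 10, 15, 20}: the scale factor is (3 - 1) / (5 - 1) = 0.5, so every second output sample falls halfway between two source samples. The vertical pass does the same along columns.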

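Forward propagation through the network

The forward_network function in network.c is the core of the whole neural network: every layer is run through the function pointer call l.forward(l, state).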

void forward_network(network net, network_state state)
{
    state.workspace = net.workspace;
    int i;
	   /// Iterate over all layers from first to last, running forward propagation layer by layer (the network has net.n layers)
    for(i = 0; i < net.n; ++i){		  
        state.index = i;/// mark layer i as the currently active layer		  
        layer l = net.layers[i];/// get the current layer		  
        if(l.delta && state.train){// not executed during inference
			/// If l.delta has been allocated for this layer, call scal_cpu() to reset all of its elements to 0			   
            scal_cpu(l.outputs * l.batch, 0, l.delta, 1);/// first argument: number of elements in l.delta; second argument: the scale factor, 0
			printf("forward_network scal_cpu of %d layer done!\n ", i);
        }
           //double time = get_time_point();
		l.forward(l, state);// run the layer: convolution, activation, pooling, ...
		   //if layer_type = convolutional ;   l.forward = forward_convolutional_layer;
		   //if layer_type = maxpool           l.forward = forward_maxpool_layer;
		   //if layer_type = yolo              l.forward = forward_yolo_layer;
		   //if layer_type = ROUTE             l.forward = forward_route_layer; in effect this only copies and moves data
		   //if layer_type = upsample          l.forward = forward_upsample_layer;;		  
           //printf("%d - Predicted in %lf milli-seconds.\n", i, ((double)get_time_point() - time) / 1000);
		   /// After a layer finishes, the network input is set to that layer's output (which becomes the next layer's input). Note that this changes
		   /// the value of the pointer net.input itself: the pointer is made to point to a different address, and the data at the old address is not modified.
		   /// The change to net.input is therefore lost on leaving forward_network(), and net.input returns to the value it had before the call.	
		   ......

Convolutional layer [convolution]

The convolutional layer is implemented by the forward_convolutional_layer function in convolutional_layer.c.

void forward_convolutional_layer(convolutional_layer l, network_state state)
{
    
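	int out_h = convolutional_out_height(l);// height and width of this layer's output feature map
    int out_w = convolutional_out_width(l);
    int i, j;
	
	// l.outputs = l.out_h * l.out_w * l.out_c is set when the layer is created (e.g. in make_convolutional_layer());
	// it is the total number of output elements for one input image (each input image yields n, i.e. l.out_c, feature maps).
	// Initialize l.output to all 0.0: l.outputs*l.batch is the total number of output elements, where l.outputs is the
	// number of output elements for one input in the batch and l.batch is the number of images in the batch; the 0 is the fill value.
    fill_cpu(l.outputs*l.batch, 0, l.output, 1);// set the l.outputs*l.batch floats starting at l.output to 0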
    .......

Before running the convolution, the author rearranges the input feature map into columns (im2col):

void im2col_cpu(float* data_im,
     int channels,  int height,  int width,
     int ksize,  int stride, int pad, float* data_col)
{
    int c,h,w;
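	// Compute the output size of this layer. (Recomputing it here is not really necessary: when the layer is built,
	// make_convolutional_layer() already calls convolutional_out_width() and convolutional_out_height() to obtain both values,
	// so l.out_h and l.out_w could be used directly, and passing a pointer to the layer would avoid this many parameters.)
    int height_col = (height + 2*pad - ksize) / stride + 1;
    int width_col = (width + 2*pad - ksize) / stride + 1;
	
	/// Kernel size: ksize*ksize is the size of one kernel slice. It is multiplied by channels because the input has several channels and each
	/// kernel convolves all channels at the same position at once. To do this, the kernel slices for all channels are laid side by side, so a
	/// kernel is really three-dimensional rather than two-dimensional. For a 3-channel image and a 3*3 kernel, the kernel acts on all three
	/// channels together and has 27 elements, each an independent trainable parameter. So when counting trainable parameters, remember that
	/// the count per kernel must be multiplied by the number of input channels.
    int channels_col = channels * ksize * ksize;// number of rows of data_col
	// The outer loop runs once per kernel element; its trip count is the total number of rows of data_col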
    for (c = 0; c < channels_col; ++c) {

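		// The row and column offsets locate the kernel element handled in this iteration; the channel offset is the index of the input channel that element belongs to

		// Column offset: a kernel slice is a 2-D matrix stored row by row in a 1-D array, so the remainder gives the column within the kernel.
		// For a 3*3 kernel with 3 channels: c=0 is in the first column, c=5 is in the third column, c=9 is in the first column of the
		// second channel's slice, and c=26 is in the third column (of the third input channel)
        int w_offset = c % ksize;//0,1,2
		// Row offset: a kernel slice is a 2-D matrix stored row by row (all rows joined into one) in a 1-D array.
		// For a 3*3 kernel on a 3-channel image, one kernel has 27 elements, 9 per channel.
		// Each time c reaches a multiple of 3 the kernel moves to the next row; h_offset takes the values 0,1,2, i.e. rows 1, 2, 3 of the 3*3 kernel
        int h_offset = (c / ksize) % ksize;//0,1,2
		// Channel offset: channels_col joins the kernel slices of all channels, so for 3 channels and a 3*3 kernel the channel changes every 9 elements:
		// c=0~8 gives c_im=0; c=9~17 gives c_im=1; c=18~26 gives c_im=2
        int c_im = c / ksize / ksize;
		// The middle loop runs height_col times (the output height): each row of data_col holds one output-sized map, itself stored row by row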
        for (h = 0; h < height_col; ++h) {
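			// The inner loop runs width_col times (the output width), so data_col ends up with channels_col rows and height_col*width_col columns
            for (w = 0; w < width_col; ++w) {
				// For a 3*3 kernel, h_offset is 0, 1 or 2; with h_offset=0 this picks out every pixel that is multiplied by the first kernel row,
				// and so on. Adding h*stride moves the kernel down: if the kernel starts at image position (0,0), it first covers the pixels
				// from (0,0) to (2,2); with stride=2, the next row of convolutions starts at element (2,0) (2 is the image row, 0 the column)
                int im_row = h_offset + h * stride;// stride = 1 for the convolutions in yolov3-tiny
				// For a 3*3 kernel, w_offset is also 0, 1 or 2; with w_offset=1 this picks out every pixel that is multiplied by the second kernel column.
				// The kernel scans the image row by row, and adding w*stride moves it along the row:
				// if one convolution starts at pixel (0,0), then with stride=2 the next one starts at (0,2) (0 is the row, 2 the column)
                int im_col = w_offset + w * stride;
				// col_index is the index into the rearranged data: c * height_col * width_col + h * width_col + w (stored row by row, with all rows joined),
				// i.e. the element at row c, output row h, output column w
                int col_index = (c * height_col + h) * width_col + w;// 1-D index into the rearranged data, in top-left to bottom-right order

				// im_col + width*im_row + width*height*channel is the index of the element in the original feature map.
				// im2col_get_pixel reads the value at channel c_im, row im_row, column im_col of data_im and stores it in the rearranged data.
				// height and width are the real height and width of data_im, and pad is the width of the zero padding on each side (im_row and im_col
				// are positions in the padded image, not in the real input, so pad must be subtracted to obtain the real row and column)
                data_col[col_index] = im2col_get_pixel(data_im, height, width, channels,
                        im_row, im_col, c_im, pad);
				// return data_im[im_col + width*im_row + width*height*channel];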
            }
        }
    }
}

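The multiply-accumulate of the convolution is done by gemm, and the biases are added by add_bias.

// Multiply-accumulate of the convolution; the biases are not involved here
gemm(0, 0, m, n, k, 1, a, k, b, n, 1, c, n);

add_bias(l.output, l.biases, l.batch, l.n, out_h*out_w);// add each channel's bias to every element of its output feature map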
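To make the shapes in the gemm call concrete, here is a minimal sketch of the output-size formula and of a plain matrix product. It assumes, as my reading of the call above, that m is the number of filters, k is size*size*channels and n is out_h*out_w. `conv_out_size` and `gemm_naive` are illustrative names, and the triple loop stands in for darknet's optimized gemm.

```c
/* Output height or width of a convolution, using the same formula as
 * im2col_cpu above. (Hypothetical helper.) */
int conv_out_size(int in, int ksize, int stride, int pad)
{
    return (in + 2 * pad - ksize) / stride + 1;
}

/* Plain triple-loop product C += A * B for row-major matrices:
 * A is M x K (one row per filter), B is K x N (the im2col output),
 * C is M x N (one row per output feature map). */
void gemm_naive(int M, int N, int K, const float *A, const float *B, float *C)
{
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```

Assuming the first layer of the standard yolov3-tiny.cfg (16 filters of 3x3, stride 1, pad 1, on a 416x416x3 input), `conv_out_size(416, 3, 1, 1)` is 416, so M = 16, K = 3 x 3 x 3 = 27 and N = 416 x 416 = 173056. Because C is accumulated into, it has to be zeroed first, which is what the fill_cpu call at the top of forward_convolutional_layer does.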

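Pooling layer [maxpool]

The forward_maxpool_layer function in maxpool_layer.c performs the pooling. yolov3-tiny keeps its pooling layers and uses max pooling, which retains the largest value inside each 2x2 window. The loop shown below is forward_maxpool_layer_avx.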

void forward_maxpool_layer_avx(float *src, float *dst, int *indexes, int size, int w, int h, int out_w, int out_h, int c,
    int pad, int stride, int batch)
{

    const int w_offset = -pad / 2;
    const int h_offset = -pad / 2;
    int b, k;

    for (b = 0; b < batch; ++b) {
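		// For each input image the output has the same number of channels. The loops walk the output in channel, row, column order
		// (matching how images are stored in l.output: each image is flattened row by row, and images are stored one after another).
		// Walking the output this way visits every position at which the pooling window is applied to the input.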
        #pragma omp parallel for
        for (k = 0; k < c; ++k) {
            int i, j, m, n;
            for (i = 0; i < out_h; ++i) {
                //for (j = 0; j < out_w; ++j) {
                j = 0;
                for (; j < out_w; ++j) {
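					// out_index is the index into the output
                    int out_index = j + out_w*(i + out_h*(k + c*b));// j + out_w*i + out_w*out_h*k when b = 0
                    float max = -FLT_MAX;// FLT_MAX is the largest float, defined in float.h; the running maximum starts at the most negative float
                    int max_i = -1;// index of the maximum, initialized to -1
                    // The two loops below go back to the input: cur_h and cur_w are positions in this layer's input. Together they search the pooling window
                    // of size `size` whose top-left corner is (h_offset + i*stride, w_offset + j*stride) for the maximum value max and its index max_i in the input
                    for (n = 0; n < size; ++n) {//2
                        for (m = 0; m < size; ++m) {//2
                            // cur_h and cur_w are the row and column within channel k of the input; index is the overall index into the input.
                            // There is no extra loop over input channels here because a max-pooling layer has as many output channels as input channels, and the channel loop above already covers them.
                            int cur_h = h_offset + i*stride + n;
                            int cur_w = w_offset + j*stride + m;
                            int index = cur_w + w*(cur_h + h*(k + b*c));
							// Bounds check: normally the window stays inside the input, but with padding it can fall outside. Such elements are given the value -FLT_MAX
							// (so although this is called zero padding, the padded value is not 0), which means a padded element can never be the maximum.
                            int valid = (cur_h >= 0 && cur_h < h &&
                                cur_w >= 0 && cur_w < w);
                            float val = (valid != 0) ? src[index] : -FLT_MAX;
							// Track the maximum in this pooling window and its overall index in the input
                            max_i = (val > max) ? index : max_i;
                            max = (val > max) ? val : max;
                        }
                    }
					// This gives every output element of the max-pooling layer together with its index in the input.
					// The index is recorded because the backward pass needs it: when computing the gradient for the layer before this one,
					// it must know which input element each output element came from (see the comments in backward_maxpool_layer()).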
                    dst[out_index] = max;
                    if (indexes) indexes[out_index] = max_i;
                }
            }
        }
    }
}
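Stripped of batches, channels and padding, the pooling loop reduces to the following sketch for a single channel. `maxpool_plane` is a made-up name for this illustration, not a darknet function.

```c
#include <float.h>

/* Max pooling over one channel with a size x size window and no padding.
 * `in` is h x w (row-major); `out` is (h/stride) x (w/stride).
 * (Hypothetical helper.) */
void maxpool_plane(const float *in, int w, int h, int size, int stride, float *out)
{
    int out_w = w / stride;
    int out_h = h / stride;
    for (int i = 0; i < out_h; ++i) {
        for (int j = 0; j < out_w; ++j) {
            float max = -FLT_MAX;
            for (int n = 0; n < size; ++n) {
                for (int m = 0; m < size; ++m) {
                    int cur_h = i * stride + n;
                    int cur_w = j * stride + m;
                    /* Positions outside the input never win the comparison. */
                    if (cur_h >= h || cur_w >= w) continue;
                    float val = in[cur_h * w + cur_w];
                    if (val > max) max = val;
                }
            }
            out[i * out_w + j] = max;
        }
    }
}
```

With a 2x2 window and stride 2, the 4x4 input {1,3,2,4, 5,6,7,8, 9,2,1,0, 3,4,5,6} becomes the 2x2 output {6, 8, 9, 6}: each output value is the largest of one non-overlapping 2x2 block.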

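Route layer [route]

yolov3-tiny has two route layers. Layer 17 (counting from 0) simply forwards the output of layer 13. Layer 20 concatenates the outputs of layers 19 and 8, with layer 19 first and layer 8 after it. Both are implemented by the forward_route_layer function in route_layer.c.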

void forward_route_layer(const route_layer l, network_state state)
{
    int i, j;
    int offset = 0;
    for(i = 0; i < l.n; ++i){// l.n: for a convolutional layer, the number of output channels; for a route layer, the number of layers feeding it. Layer 17: 1 (routes layer 13); layer 20: 2 (routes layers 19 and 8)
        int index = l.input_layers[i];// index of a layer feeding this one, e.g. 13, 19, 8
        float *input = state.net.layers[index].output;// the input is the output (.output) of that earlier layer
        int input_size = l.input_sizes[i];// number of values coming from that layer
        int part_input_size = input_size / l.groups;// no grouping here
        for(j = 0; j < l.batch; ++j){
            //copy_cpu(input_size, input + j*input_size, 1, l.output + offset + j*l.outputs, 1);
			// copy the values starting at input into l.output
            copy_cpu(part_input_size, input + j*input_size + part_input_size*l.group_id, 1, l.output + offset + j*l.outputs, 1);//l.group_id = 0
			// in effect: copy_cpu(part_input_size, input, 1, l.output + offset, 1);
        }
        //offset += input_size;
        offset += part_input_size;
    }
}
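For a batch of one and no groups, the route layer amounts to appending one buffer after another. A minimal sketch, with `route_concat` as a hypothetical name:

```c
#include <string.h>

/* Concatenates two feature maps along the channel axis (planar layout,
 * batch size 1): all values of the first input come first, followed by
 * all values of the second. (Hypothetical helper, not part of darknet.) */
void route_concat(const float *a, int a_size, const float *b, int b_size, float *out)
{
    memcpy(out, a, a_size * sizeof(float));            /* offset 0 */
    memcpy(out + a_size, b, b_size * sizeof(float));   /* offset a_size */
}
```

For layer 20 this would mean the 26x26x128 values of layer 19 followed by the values of layer 8. Assuming the standard yolov3-tiny.cfg, where layer 8 outputs 26x26x256, the result is 26x26x384. Because each channel is stored as one contiguous plane, appending the buffers is the same as stacking the channels.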

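Upsample layer [upsample]

Layer 19 of yolov3-tiny is the upsample layer, which turns the 13x13x128 feature map from layer 18 into a 26x26x128 output feature map. This is done by the forward_upsample_layer function in upsample_layer.c, through upsample_cpu: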

void upsample_cpu(float *in, int w, int h, int c, int batch, int stride, int forward, float scale, float *out)
{
	
    int i, j, k, b;
    for (b = 0; b < batch; ++b) {
        for (k = 0; k < c; ++k) {
            for (j = 0; j < h*stride; ++j) {
                for (i = 0; i < w*stride; ++i) {
                    int in_index = b*w*h*c + k*w*h + (j / stride)*w + i / stride;
                    int out_index = b*w*h*c*stride*stride + k*w*h*stride*stride + j*w*stride + i;
                    if (forward) out[out_index] = scale*in[in_index];
                    else in[in_index] += scale*out[out_index];
                }
            }
        }
    }
}

Upsampling result: [figure not reproduced]
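The effect is easy to see on a tiny example. The sketch below is the forward branch of upsample_cpu reduced to one channel, one image and scale = 1; `upsample_plane` is a made-up name for this illustration.

```c
/* Nearest-neighbour upsampling of one channel: every input pixel is repeated
 * in a stride x stride block of the output (forward pass only, scale = 1).
 * (Hypothetical helper.) */
void upsample_plane(const float *in, int w, int h, int stride, float *out)
{
    for (int j = 0; j < h * stride; ++j) {
        for (int i = 0; i < w * stride; ++i) {
            int in_index = (j / stride) * w + i / stride;
            int out_index = j * w * stride + i;
            out[out_index] = in[in_index];
        }
    }
}
```

With stride 2, the 2x2 input

    1 2
    3 4

becomes the 4x4 output

    1 1 2 2
    1 1 2 2
    3 3 4 4
    3 3 4 4

No new values are computed; the feature map only gets larger so that it can be concatenated with a 26x26 map.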

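Output layer [yolo]

The yolo layers apply the logistic (sigmoid) function to the 13x13x255 and 26x26x255 input feature maps. The predicted width and height of each box are left out of the logistic step. This is done by the forward_yolo_layer function in yolo_layer.c.

// The two yolo layers only apply the logistic function to the data; they do not decode box positions
// Logistic is applied to channels 0-1 (x, y) and 4-84 (confidence + classes) of each of the three priors (anchor boxes)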
void forward_yolo_layer(const layer l, network_state state)
{
    int i, j, b, t, n;
	// Copy the data from state.input to l.output
    memcpy(l.output, state.input, l.outputs*l.batch * sizeof(float));

#ifndef GPU
	printf("yolo v3 tiny l.n and l.batch of yolo layer is %d and %d  \n ",l.n,l.batch);
    for (b = 0; b < l.batch; ++b) {//l.batch = 1
        for (n = 0; n < l.n; ++n) {// l.n: 3 for a yolo layer (mask 0,1,2), i.e. each grid cell predicts three boxes
			
			//printf("l.coords is %d in yolov3 tiny yolo layer ,l.scale_x_y is %f \n", l.coords, l.scale_x_y);
            // l.coords: 0   l.classes (number of classes): 80   l.scale_x_y: 1
			// l.w: input feature-map width   l.h: input feature-map height
			int index = entry_index(l, b, n*l.w*l.h, 0);//index = n*l.w*l.h*(4 + l.classes + 1)
			
		   // start address: l.output + index, count: 2 * l.w*l.h; compute the logistic values in place
            activate_array(l.output + index, 2 * l.w*l.h, LOGISTIC);  // x,y,

			// start address: l.output + index, count: 2 * l.w*l.h; computes x = x*l.scale_x_y + -0.5*(l.scale_x_y - 1), which reduces to x = x
			// because l.scale_x_y = 1 in yolov3-tiny, so scal_add_cpu changes nothing here
            scal_add_cpu(2 * l.w*l.h, l.scale_x_y, -0.5*(l.scale_x_y - 1), l.output + index, 1);    // scale x,y
            
			//
			index = entry_index(l, b, n*l.w*l.h, 4);//index = n*l.w*l.h*(4 + l.classes + 1)+ 4*l.w*l.h
            
			// start address: l.output + index, count: (1+80)*l.w*l.h; compute the logistic values in place
			activate_array(l.output + index, (1 + l.classes)*l.w*l.h, LOGISTIC);
        }
    }
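The index formulas in the comments above describe a planar layout: for each of the 3 anchors there are 85 planes of w*h values (x, y, w, h, confidence, then 80 class scores). A minimal sketch of that indexing for a batch of one, with `yolo_entry_index` as a hypothetical stand-in for entry_index:

```c
/* Offset of value `entry` (0 .. 4+classes) for anchor n at grid position loc
 * in a yolo layer output with batch size 1. The layout is
 * [anchor][entry][row][col], so each entry occupies one w*h plane.
 * (Hypothetical helper written from the formulas in the comments.) */
int yolo_entry_index(int w, int h, int classes, int n, int loc, int entry)
{
    return n * w * h * (4 + 1 + classes) + entry * w * h + loc;
}
```

For the 13x13 layer, the confidence plane of anchor 0 starts at `yolo_entry_index(13, 13, 80, 0, 0, 4)` = 4 x 169 = 676, and anchor 1 starts at 169 x 85 = 14365. This is why one activate_array call over 2*l.w*l.h values covers all x and y of an anchor, and one call over (1 + l.classes)*l.w*l.h values covers its confidence and class planes, leaving the w and h planes between them untouched.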

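Collecting the predictions [detection]


// w: width of the input image, 640 here (not necessarily 416)   h: height of the input image, 424 here (not necessarily 416)   thresh: confidence threshold, 0.25   hier: 0.5
// map: 0   relative: 1   num: 0   letter: 0
// Purpose: count the boxes in the two yolo layers whose confidence exceeds the threshold and allocate the array dets for them,
// then fill dets from the network:
// for each box that passes the confidence threshold, compute its relative predicted position, width and height from the yolo layer and store them in dets[count].bbox
// Each box has 80 classes and one confidence; the score prob of a class is class probability * confidence
/// Class scores below the threshold 0.25 are set to 0
// The number of boxes that pass the threshold is stored in num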
detection *get_network_boxes(network *net, int w, int h, float thresh, float hier, int *map, int relative, int *num, int letter)
{
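	//printf("w、h、thresh、hier and letter is %d 、%d 、%f 、%f and %d\n", w, h, thresh, hier, letter);

	// Count the boxes in the two yolo layers whose confidence exceeds the threshold and allocate the array dets for them;
	// the number of such boxes is stored in num
    detection *dets = make_network_boxes(net, thresh, num);
	
	// Fill dets from the network:
	// for each box that passes the confidence threshold, compute its relative predicted position, width and height from the yolo layer and store them in dets[count].bbox
	// Each box has 80 classes and one confidence; the score prob of a class is class probability * confidence
	/// Class scores below the threshold 0.25 are set to 0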
    fill_network_boxes(net, w, h, thresh, hier, map, relative, dets, letter);
    return dets;
}

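make_network_boxes allocates the array that holds the predictions:

// thresh: confidence threshold
// num: 0
// Purpose: count the boxes whose confidence exceeds the threshold and allocate storage for them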
detection *make_network_boxes(network *net, float thresh, int *num)
{
    layer l = net->layers[net->n - 1];// the last layer of the network; net->n is 24, so this is the last yolo layer
	//printf(" net->n  of network is %d\n " ,(net->n));
    int i;
	// -thresh 0.25
	// yolo layers: yolov3-tiny has two
	// Each has three priors (anchor boxes); every predicted box whose confidence exceeds thresh (0.25) is counted, and the counts are summed
	// nboxes: the number of boxes to keep, summed over both yolo layers
    int nboxes = num_detections(net, thresh);//-thresh 0.25

	if (num) {
		printf("nbox = %d \n", nboxes);
		*num = nboxes;// pass the box count back to the caller
	}
    // Allocate nboxes elements of sizeof(detection) bytes each
    detection* dets = (detection*)xcalloc(nboxes, sizeof(detection));

	// For every box, allocate 80 floats (l.classes) for dets.prob
	// and 4 floats for dets.uc (uncertainty of the box coordinates)
    for (i = 0; i < nboxes; ++i) {
        dets[i].prob = (float*)xcalloc(l.classes, sizeof(float));
        // tx,ty,tw,th uncertainty
        dets[i].uc = (float*)xcalloc(4, sizeof(float)); // Gaussian_YOLOv3
        
		if (l.coords > 4) {// not executed: l.coords is 0
            dets[i].mask = (float*)xcalloc(l.coords - 4, sizeof(float));
        }
    }
    return dets;
}

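get_yolo_detections collects the predictions of each of the two yolo layers:

// w,h: 640,424    netw, neth: 416,416   thresh: confidence threshold, 0.25   hier: 0.5
// map: 0   relative: 1    letter: 0
// For each box that passes the confidence threshold, compute its relative predicted position, width and height from the yolo layer and store them in dets[count].bbox
// Each box has 80 classes and one confidence; the score prob of a class is class probability * confidence
/// Class scores below the threshold 0.25 are set to 0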
int get_yolo_detections(layer l, int w, int h, int netw, int neth, float thresh, int *map, int relative, detection *dets, int letter)
{
    printf("\n l.batch = %d, l.w = %d, l.h = %d, l.n = %d ,netw = %d, neth = %d \n", l.batch, l.w, l.h, l.n, netw, neth);
    int i,j,n;
    float *predictions = l.output;// output of the yolo layer
    // This snippet below is not necessary
    // Need to comment it in order to batch processing >= 2 images
    //if (l.batch == 2) avg_flipped_yolo(l);
    int count = 0;

	//printf("yolo layer l.mask[0] is %d, l.mask[1] is %d, l.mask[2] is %d\n", l.mask[0], l.mask[1], l.mask[2]);
	//printf("yolo layer l.biases[l.mask[0]*2] is %f, l.biases[l.mask[1]*2] is %f, l.biases[l.mask[2]*2] is %f\n", l.biases[l.mask[0] * 2], l.biases[l.mask[1] * 2], l.biases[l.mask[2] * 2]);
	// Walk over the yolo layer
    for (i = 0; i < l.w*l.h; ++i){// width x height of this yolo layer's feature map: 13x13 or 26x26
        int row = i / l.w;
        int col = i % l.w;
        for(n = 0; n < l.n; ++n){// yolo layer: l.n = 3
			
            // obj_index: index of the confidence value
            int obj_index  = entry_index(l, 0, n*l.w*l.h + i, 4);//obj_index  = n*l.w*l.h*(4+l.classes+1) + 4*l.w*l.h + i;
            float objectness = predictions[obj_index];// the corresponding confidence
            //if(objectness <= thresh) continue;    // incorrect behavior for Nan values
            
			if (objectness > thresh) {// this branch runs only when the confidence exceeds the threshold
                //printf("\n objectness = %f, thresh = %f, i = %d, n = %d \n", objectness, thresh, i, n);
                
				// box_index: every grid position of the yolo layer has three boxes; this is the index of the current one
				int box_index = entry_index(l, 0, n*l.w*l.h + i, 0);//box_index = n*l.w*l.h*(4+l.classes+1)+ i;

				// l.biases: start address of the bias parameters   l.mask[n]: 3, 4, 5 and 0, 1, 2 for the two layers, the offset into biases
				// Compute the relative predicted position, width and height of the box and store them in dets[count].bbox
                dets[count].bbox = get_yolo_box(predictions, l.biases, l.mask[n], box_index, col, row, l.w, l.h, netw, neth, l.w*l.h);

				// The corresponding confidence, already passed through logistic
                dets[count].objectness = objectness;

				// Number of classes: 80 (an int)
                dets[count].classes = l.classes;
                for (j = 0; j < l.classes; ++j) {
					// 80 classes, each with its own probability; class_index is the index of that value
                    int class_index = entry_index(l, 0, n*l.w*l.h + i, 4 + 1 + j);//class_index  = n*l.w*l.h*(4+l.classes+1) + (4+1+j)*l.w*l.h + i;
                    // Each box has 80 classes and one confidence; the score prob of a class is class probability * confidence
					float prob = objectness*predictions[class_index];
					
					// Scores that do not exceed the threshold 0.25 are set to 0
                    dets[count].prob[j] = (prob > thresh) ? prob : 0;
                }
                ++count;
            }
        }
    }
    correct_yolo_boxes(dets, count, w, h, netw, neth, relative, letter);
    return count;
}
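The scoring rule in the inner loop can be isolated as a one-line sketch; `class_score` is a made-up name for this illustration.

```c
/* Class score of one detection: class probability times objectness,
 * zeroed when it does not exceed the threshold. (Hypothetical helper.) */
float class_score(float objectness, float class_prob, float thresh)
{
    float prob = objectness * class_prob;
    return (prob > thresh) ? prob : 0.0f;
}
```

With thresh = 0.25, a box with objectness 0.8 and class probability 0.5 keeps a score of 0.4, while a box with objectness 0.3 and the same class probability gets 0.15, which is zeroed. Note that in the code above a box is counted as soon as its objectness exceeds the threshold, so a box can remain in dets with all of its class scores set to 0.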

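Non-maximum suppression [NMS]

// dets: array of detections   total: number of boxes that passed the threshold   classes (l.classes): 80   thresh = 0.45f
// For each class, every pair of boxes goes through non-maximum suppression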
void do_nms_sort(detection *dets, int total, int classes, float thresh)
{
    int i, j, k;
    k = total - 1;
    for (i = 0; i <= k; ++i) {// loop over the boxes
        if (dets[i].objectness == 0) {// confidence == 0; not expected to run, since in theory no box has objectness = 0
			printf("there is no objectness == 0 !!! \n");
            detection swap = dets[i];
            dets[i] = dets[k];
            dets[k] = swap;
            --k;
            --i;
        }
    }
    total = k + 1;
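	// Compare boxes within the same class
    for (k = 0; k < classes; ++k) {// 80 classes        
        // class used for sorting
		for (i = 0; i < total; ++i) {// loop over the boxes
            dets[i].sort_class = k;
        }
		// Sort so that boxes with a larger prob for this class come first
        qsort(dets, total, sizeof(detection), nms_comparator_v3);
        for (i = 0; i < total; ++i) {// suppress pairs of boxes within the same class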
            //printf("  k = %d, \t i = %d \n", k, i);
            if (dets[i].prob[k] == 0) continue;
            box a = dets[i].bbox;
            for (j = i + 1; j < total;++j){
				box b = dets[j].bbox;
				if( box_iou(a, b) > thresh) dets[j].prob[k] = 0;
            }
        }
    }
}
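The suppression test relies on the intersection over union (IoU) of two boxes. The sketch below shows one way to compute it for boxes given by center and size; `box_xywh`, `overlap_1d` and `iou` are names chosen for this illustration and are not claimed to match darknet's box_iou line for line.

```c
/* Box given by its centre (x, y) and size (w, h). (Hypothetical type.) */
typedef struct { float x, y, w, h; } box_xywh;

/* Length of the overlap of two 1-D intervals given by centre and width;
 * negative when the intervals do not touch. */
static float overlap_1d(float c1, float w1, float c2, float w2)
{
    float left  = (c1 - w1 / 2 > c2 - w2 / 2) ? c1 - w1 / 2 : c2 - w2 / 2;
    float right = (c1 + w1 / 2 < c2 + w2 / 2) ? c1 + w1 / 2 : c2 + w2 / 2;
    return right - left;
}

/* Intersection over union of two boxes; 0 when they do not overlap. */
float iou(box_xywh a, box_xywh b)
{
    float w = overlap_1d(a.x, a.w, b.x, b.w);
    float h = overlap_1d(a.y, a.h, b.y, b.h);
    if (w <= 0 || h <= 0) return 0.0f;
    float inter = w * h;
    float uni = a.w * a.h + b.w * b.h - inter;
    return inter / uni;
}
```

Two 2x2 boxes centered at (1, 1) and (2, 1) overlap in a 1x2 strip, so their IoU is 2 / (4 + 4 - 2) = 1/3. That is below the 0.45 threshold, and both boxes would be kept. Identical boxes have an IoU of 1, and the lower-scoring one would have its score for that class set to 0.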
