[Innovation Training, Week 4] An Incomplete CTPN Wrap-up Post 2019.4.11

Progress this week

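After two painful weeks of debugging, and countless failures even with the regression branch left out, today I finally trained the first CTPN model that produces presentable results. This post is the wrap-up of my CTPN journey, even though only the classification branch remains after the fully connected layer, and text-box merging is not implemented either.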

Detailed work

① Model design

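First, the input image passes through VGG16, which shrinks its height and width to 1/16 of the original and yields the feature map. One feature-map pixel therefore corresponds to a 16*16 patch of the original image, which is also why the anchor width is fixed at 16.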
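
The 1/16 factor can be checked with a little arithmetic. The sketch below assumes that four stride-2 max-pooling layers sit before block5_conv3 in VGG16 and that each one floors odd sizes; the helper names are made up for illustration.

```python
# Assumption: four stride-2 poolings come before block5_conv3 in VGG16.
POOL_STRIDES = [2, 2, 2, 2]


def total_stride(strides):
    """Multiply the strides to get the overall downsampling factor."""
    factor = 1
    for s in strides:
        factor *= s
    return factor


def feature_map_size(h, w, strides):
    """Apply each pooling in turn, flooring odd sizes."""
    for s in strides:
        h, w = h // s, w // s
    return h, w


stride = total_stride(POOL_STRIDES)
fm_h, fm_w = feature_map_size(600, 900, POOL_STRIDES)
print(stride, fm_h, fm_w)  # 16 37 56
```

For a (600, 900) input this gives a 37*56 feature map, each cell covering a 16-pixel-wide column of the original image.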

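Next, each feature-map pixel is concatenated with its neighbours in the surrounding 3*3 window (nine pixels including itself). With c channels per pixel, this produces one pixel with 9c channels. In practice, a 3*3 convolution with 9c output channels can stand in for this step.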

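The pixels of the new feature map are then fed row by row into a bidirectional LSTM, which captures the horizontal sequential relationship between anchors.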
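
One way to picture "row by row": every row of every image becomes its own sequence of w steps. The sketch below only does that regrouping on nested lists; the function name and the toy data are hypothetical. By contrast, a Keras Reshape to (-1, channels) produces a single sequence of h*w steps per image, with the rows laid end to end, which is not quite the same thing.

```python
def rows_to_sequences(batch):
    """batch is shaped [n][h][w][c]; returns n * h sequences of w steps each."""
    return [row for image in batch for row in image]


# Toy batch: 2 images, 3 rows, 4 columns, 1 channel.
batch = [[[[float(100 * i + 10 * r + col)] for col in range(4)]
          for r in range(3)]
         for i in range(2)]
sequences = rows_to_sequences(batch)
print(len(sequences), len(sequences[0]))  # 6 4
```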

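Each feature-map pixel then goes through a fully connected layer, which in turn feeds three outputs: 2k scores (the only one I ended up implementing), 2k vertical localization values, and k side-refinement offsets.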

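# Imports assumed by the snippets in this post
import datetime

import tensorflow as tf
from tensorflow import keras


# VGG16 without the fully connected layers
def vgg16_no_tail():
    # include_top must be False; otherwise input_shape
    # defaults to 224*224 and the model fails
    vgg = keras.applications.VGG16(include_top=False)
    vgg_no_tail = keras.Model(
        inputs=vgg.input,
        outputs=vgg.get_layer("block5_conv3").output)

    return vgg_no_tail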
 
 
# Build the training model
def ctpn_model(h=600, w=900, k=10, anchor_size=16):
    conv_h = h // anchor_size
    conv_w = w // anchor_size
    input_layer = vgg16_no_tail()
    layer = input_layer.output

    # 3*3 convolution standing in for the neighbourhood concatenation
    layer = keras.layers.Convolution2D(
        512 * 9, (3, 3),
        activation='relu',
        padding='same',
        name='cnn2rnn')(layer)

    # Reshape into a sequence so the LSTM can relate pixels horizontally
    layer = keras.layers.Reshape((-1, 512 * 9))(layer)

    # bi-lstm
    layer = keras.layers.Bidirectional(
        keras.layers.LSTM(128, return_sequences=True))(layer)

    # Restore the spatial shape
    layer = keras.layers.Reshape((conv_h, conv_w, 256))(layer)

    # FC
    layer = keras.layers.Convolution2D(512, (1, 1), activation='relu')(layer)

    # score
    sc_layer = keras.layers.Convolution2D(2 * k, (1, 1), activation='relu')(layer)

    # Group the last dimension into k pairs of scores
    sc_layer = keras.layers.Reshape((conv_h, conv_w, k, 2))(sc_layer)

    # Softmax over each pair so the positive and negative scores sum to 1
    sc_layer = keras.layers.Softmax()(sc_layer)

    model = keras.Model(inputs=input_layer.input,
                        outputs=sc_layer)
    return model

The final output tensor has shape [batch_size, conv_h, conv_w, anchor_count, 2].

With the default (600, 900) input and 10 anchors of different heights per feature-map pixel, the output shape is [batch_size, 37, 56, 10, 2].
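
As a quick sanity check, the per-image output shape and the number of anchors it implies can be worked out by hand. The helper below is a hypothetical illustration that mirrors the integer division in ctpn_model; it does not build any model.

```python
def score_output_shape(h=600, w=900, k=10, anchor_size=16):
    """Per-image shape of the score output: (conv_h, conv_w, k, 2)."""
    conv_h, conv_w = h // anchor_size, w // anchor_size
    return (conv_h, conv_w, k, 2)


shape = score_output_shape()
anchors_per_image = shape[0] * shape[1] * shape[2]
print(shape, anchors_per_image)  # (37, 56, 10, 2) 20720
```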

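Next comes the loss function, which here covers the score only. y_true and y_pred both have the shape above, and the loss is cross-entropy. Note, however, that each image produces 37 * 56 * 10 = 20720 anchors, while at most a few hundred of them contain text, so positive and negative samples are severely imbalanced. Feeding y_true and y_pred straight into binary_crossentropy may leave the model predicting nothing at all. My approach is therefore to compute the loss for positive and negative samples separately.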

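def ctpn_loss_only_score(y_true, y_pred):
    # Mask the predictions with the one-hot labels, so each anchor keeps
    # only the predicted probability of its true class
    y_pred = tf.multiply(y_true, y_pred)
    loss = keras.losses.binary_crossentropy

    # Flatten to one row per anchor: column 0 = text, column 1 = background
    y_true = tf.reshape(y_true, (-1, 2))
    y_pred = tf.reshape(y_pred, (-1, 2))
    y_true_pos = y_true[:, 0]
    y_true_neg = y_true[:, 1]
    y_pred_pos = y_pred[:, 0]
    y_pred_neg = y_pred[:, 1]

    # Anchor counts per class; the +1 avoids division by zero
    pos_sum = tf.reduce_sum(y_true_pos) + 1
    neg_sum = tf.reduce_sum(y_true_neg) + 1
    total = pos_sum + neg_sum

    # binary_crossentropy averages over all anchors; rescaling by
    # total / class count turns that into roughly a per-class mean
    return total * loss(y_true_pos, y_pred_pos) / pos_sum + \
           total * loss(y_true_neg, y_pred_neg) / neg_sum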
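
To see why the per-class split matters, here is a small pure-Python comparison on a toy case. It is a simplified stand-in, not the loss above: it averages each class separately and adds the two means, without the +1 smoothing, and the function names are hypothetical.

```python
import math


def plain_cross_entropy(labels, probs, eps=1e-7):
    """Mean cross-entropy over all anchors. labels: 1 = text, 0 = background."""
    losses = [-math.log(max(p if y == 1 else 1.0 - p, eps))
              for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)


def balanced_cross_entropy(labels, probs, eps=1e-7):
    """Average each class on its own, then add the two means."""
    pos = [-math.log(max(p, eps)) for y, p in zip(labels, probs) if y == 1]
    neg = [-math.log(max(1.0 - p, eps)) for y, p in zip(labels, probs) if y == 0]
    pos_mean = sum(pos) / max(len(pos), 1)
    neg_mean = sum(neg) / max(len(neg), 1)
    return pos_mean + neg_mean


# One text anchor among 99 background anchors, and a model that
# answers "background" (P(text) = 0.01) everywhere.
labels = [1] + [0] * 99
probs = [0.01] * 100
plain = plain_cross_entropy(labels, probs)
balanced = balanced_cross_entropy(labels, probs)
print(round(plain, 3), round(balanced, 3))
```

The plain mean stays small even though the only text anchor is missed, while the balanced version is dominated by that miss, which is the behaviour the split is after.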

Now the model training.

def ctpn_model_run():
    model = ctpn_model()

    # Plain gradient descent is fairly stable; I used Adam at first,
    # but the loss kept climbing and never converged
    model.compile(optimizer=tf.train.GradientDescentOptimizer(0.001),
                  loss=ctpn_loss_only_score,
                  metrics=['accuracy'])

    train_x, train_y, test_x, test_y = load_data()
    # x: input images as a numpy array, [n, 600, 900, 3]
    # y: positive/negative scores of the 10 anchors per feature-map pixel, [n, 37, 56, 10, 2]
    # train and test y must share the same format, otherwise an error is raised

    time = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
    model.fit(train_x, train_y, batch_size=4, epochs=50,
              validation_data=(test_x, test_y), callbacks=[
            keras.callbacks.ModelCheckpoint(
                "./model/model_real_only_score_" + time + "_{epoch:02d}-{val_loss:.2f}.hdf5",
                monitor='val_loss', verbose=1,
                save_best_only=True, period=1),
            keras.callbacks.TensorBoard("./model/logs_real_only_score_" + time,
                                        batch_size=4)
        ])

② Dataset preprocessing

Input images must be resized to a fixed size. I usually use (600, 900), which makes the label shape [batch_size, 37, 56, 10, 2].

What follows is taken straight from my week-two post.

First, following this post, cut each box into anchors of equal 16-pixel width.

At this point the anchors are a list of (x_position, y, h) tuples.
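
As a rough illustration of this step, the sketch below splits an axis-aligned box into 16-pixel-wide columns. It is a simplified stand-in for generate_gt_anchor, whose code is not shown here: the function name, the (x_min, y_min, x_max, y_max) box format with an exclusive x_max, and the use of the column index as x_position are all assumptions.

```python
def split_box_into_anchors(x_min, y_min, x_max, y_max, anchor_width=16):
    """Return (x_position, y, h) for every 16-pixel column the box touches.

    x_position is the column index, y the vertical centre, h the height.
    Assumes an axis-aligned box whose x_max is exclusive.
    """
    first_col = x_min // anchor_width
    last_col = (x_max - 1) // anchor_width
    y_center = (y_min + y_max) / 2
    height = y_max - y_min
    return [(col, y_center, height) for col in range(first_col, last_col + 1)]


anchors = split_box_into_anchors(40, 100, 100, 130)
print(anchors)
# [(2, 115.0, 30), (3, 115.0, 30), (4, 115.0, 30), (5, 115.0, 30), (6, 115.0, 30)]
```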

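However, they need to be converted into the same layout as the model output, [batch, h, w, k=10, 4], where the "4" stands for text score, background score, vertical coordinate y, and height h. Every 16*16-pixel cell needs 10 anchors, with heights [11, 16, 23, 33, 46, 66, 94, 134, 191, 273]. Of these, only the ones that share a horizontal position with a ground-truth anchor from the previous step and overlap it with an IoU above 0.7 are labeled as text.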

def overlap_anchors(img, box, anchor_width=16):
    iou_threshold = 0.7
    anchor_sizes = [11, 16, 23, 33, 46, 66, 94, 134, 191, 273]
    # Ground-truth anchors (x_position, y, h), keyed by x_position
    anchors = generate_gt_anchor(img, box, anchor_width)
    anchors = {x[0]: (x[1], x[2]) for x in anchors}
    # print(anchors)
    total_anchors = []
    for h in range(img.shape[0] // anchor_width):
        curH = []
        total_anchors.append(curH)
        for w in range(img.shape[1] // anchor_width):
            curW = []
            curH.append(curW)
            for k in range(len(anchor_sizes)):
                # Each label is [text score, background score, y, h]
                if w not in anchors:
                    curW.append([0, 1, 0, 1])
                else:
                    cy, ch = anchors[w]
                    ty, th = h * anchor_width + anchor_width / 2, anchor_sizes[k]
                    if iou(cy, ch, ty, th) > iou_threshold:
                        curW.append([1, 0, ty, th])
                    else:
                        curW.append([0, 1, 0, 0])
    return total_anchors
 
 
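def iou(y1, h1, y2, h2):
    # 1-D IoU of two vertical intervals given as (centre, height); the
    # anchors share the same width, so this equals the area IoU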
    b1, u1 = y1 - h1 / 2, y1 + h1 / 2
    b2, u2 = y2 - h2 / 2, y2 + h2 / 2
    if u2 > u1:
        b1, u1, b2, u2 = b2, u2, b1, u1
    if b1 >= u2:
        return 0
    else:
        if b2 > b1:
            return (u2 - b2) / (u1 - b1)
        else:
            return (u2 - b1) / (u1 - b2)
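
To get a feel for the 0.7 threshold, the sketch below checks which anchor heights would be labeled as text for one ground-truth anchor. It rewrites the vertical IoU with min/max, which should be equivalent to the function above but is not the same code, and the helper names are hypothetical. It also relies on an observation that the post does not state: the ten heights happen to equal ceil(11 / 0.7**i) for i = 0..9.

```python
import math

# Observation: this reproduces [11, 16, 23, 33, 46, 66, 94, 134, 191, 273].
ANCHOR_HEIGHTS = [math.ceil(11 / 0.7 ** i) for i in range(10)]


def vertical_iou(y1, h1, y2, h2):
    """IoU of two vertical intervals given as (centre, height)."""
    top1, bottom1 = y1 - h1 / 2, y1 + h1 / 2
    top2, bottom2 = y2 - h2 / 2, y2 + h2 / 2
    inter = max(0.0, min(bottom1, bottom2) - max(top1, top2))
    union = h1 + h2 - inter
    return inter / union if union > 0 else 0.0


def matching_heights(gt_y, gt_h, cell_y, threshold=0.7):
    """Anchor heights whose IoU with the ground truth exceeds the threshold."""
    return [h for h in ANCHOR_HEIGHTS
            if vertical_iou(gt_y, gt_h, cell_y, h) > threshold]


# Ground-truth anchor 30 pixels high, centred on a cell centre (7 * 16 + 8).
matches = matching_heights(gt_y=120, gt_h=30, cell_y=120)
print(ANCHOR_HEIGHTS)
print(matches)  # [23, 33]
```

Even for a perfectly centred ground-truth anchor, only the two nearest heights pass the threshold, which shows how sparse the positive labels are.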

The final result: