Converting MVAD Annotation Data to CSV (the YouTube Clips Data Format)

Original

2018-08-22 01:10

Preface

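I am currently working with several video captioning datasets. One of them is the YouTubeClips dataset, whose annotations come from the Microsoft Research Video Description Corpus released by Microsoft. After installing it, you get a CSV file with the following layout.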

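The first column is the video name. The second column, Start, marks where the annotated segment begins, and the third column, End, marks where it ends. The seventh column, Language, is the language of the annotation, and the last column holds the annotation text.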
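To make the column layout concrete, here is a minimal sketch of loading such a file with pandas and keeping only the English captions. The two sample rows are made up for illustration, and the names WorkerID, Source and AnnotationTime are assumptions for the columns this post does not describe; the `clip_name` convention is likewise only an assumption.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real corpus CSV.
# WorkerID, Source and AnnotationTime are assumed names for the
# columns that the text above does not describe.
sample_csv = io.StringIO(
    "VideoID,Start,End,WorkerID,Source,AnnotationTime,Language,Description\n"
    "vid_a,10,20,w1,clean,30,English,a man is playing a guitar\n"
    "vid_b,5,15,w2,clean,25,German,ein Hund rennt\n"
)

df = pd.read_csv(sample_csv)

# Keep only English captions that actually contain text.
english = df[(df["Language"] == "English") & df["Description"].notnull()]

# Build a per-clip key of the form <VideoID>_<Start>_<End> (assumed convention).
english = english.assign(
    clip_name=english["VideoID"]
    + "_" + english["Start"].astype(str)
    + "_" + english["End"].astype(str)
)

print(english[["clip_name", "Description"]])
```

With the real file, `sample_csv` would simply be replaced by the path to the CSV.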

The other dataset, however, MVAD (Montreal Video Annotation Dataset), stores its annotations in srt files rather than in a CSV file.

So, to reuse the code written for training on YouTubeClips, the MVAD annotations have to be converted into a CSV file.

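The conversion relies on the well-known pandas module. I had not used pandas before, so this was my first time with it. The module turned out to be convenient and simple: I wrote a short script that performs the conversion and saves the result as a CSV file. The code is below.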

Code

# -*- coding: utf-8 -*-

import glob

import pandas as pd

train_videos_path = '/home/ou-lc/chenxp/Downloads/MVAD/train_videos'

train_srt_txt_path = '/home/ou-lc/chenxp/Downloads/M-VAD_txtfiles/srt_files/train_srt'
train_txt_files = glob.glob(train_srt_txt_path + '/*.srt')

video_information = []
for each_train_srt in train_txt_files:
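    # Read the whole srt file as a list of lines; the with-block closes the file.
    with open(each_train_srt, 'r') as srt_file:
        train_srts = srt_file.read().splitlines()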

    videos_ID = []
    #videos_Time_Stamp = []; videos_Start = []; videos_End = []
    videos_Language = []
    videos_Descriptions = []
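    # Each entry is assumed to span exactly four lines:
    # line 0 = ID, line 1 = time stamp, line 2 = description, line 3 = separator.
    for idx_srt, video_srt in enumerate(train_srts):
        if idx_srt % 4 == 0:
            videos_ID.append(video_srt)
        #if idx_srt % 4 == 1:
        #    videos_Time_Stamp.append(video_srt)
        if idx_srt % 4 == 2:
            # The language is hard-coded to English for every description.
            videos_Language.append('English')
            videos_Descriptions.append(video_srt)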
    for idx, each_video_name in enumerate(videos_ID):
        video_information.append((each_video_name, videos_Language[idx], videos_Descriptions[idx]))

df = pd.DataFrame(video_information, columns=['VideoID', 'Language', 'Description'])

print(df.shape)

df.to_csv('convert_MVAD_train.csv', sep=',', encoding='utf-8')
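The script above relies on every srt entry being exactly four lines long. If a description wraps onto a second line, or a blank line is missing, the modulo-4 indexing drifts out of step. A more forgiving variant, sketched here under the assumption that entries are separated by blank lines, splits the text into blocks first. The function name `parse_srt_blocks` and the sample text are made up for illustration and are not taken from the MVAD files.

```python
import re


def parse_srt_blocks(text):
    """Split srt text into (first_line, time_stamp, description) tuples."""
    entries = []
    # Entries are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 3:
            # Skip malformed blocks instead of misaligning the ones after them.
            continue
        # Join a description that wraps over several lines into one string.
        entries.append((lines[0], lines[1], " ".join(lines[2:])))
    return entries


# Hypothetical sample; the first entry has a two-line description.
sample = """1
00:00:01,000 --> 00:00:04,000
A man walks into
the room.

2
00:00:05,000 --> 00:00:08,000
He sits down.
"""

rows = parse_srt_blocks(sample)
for row in rows:
    print(row)
```

Each tuple could then be extended with a language field and appended to `video_information` in the same way as in the script above.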

Reference

The two key pieces of the script above are based on the following resources:
1. http://stackoverflow.com/questions/16923281/pandas-writing-dataframe-to-csv-file
2. http://stackoverflow.com/questions/19961490/construct-pandas-dataframe-from-list-of-tuples
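The two techniques these links cover, building a DataFrame from a list of tuples and writing it to CSV, can be tried in isolation with the sketch below. The rows are invented sample data. Note that `to_csv` writes the row index as an extra first column by default; passing `index=False`, as done here, leaves it out, which may or may not be what the downstream training code expects.

```python
import io

import pandas as pd

# Invented sample rows in the same (VideoID, Language, Description) shape as the script.
rows = [
    ("clip_1", "English", "a man opens a door"),
    ("clip_2", "English", "a car drives away"),
]

# Construct the DataFrame directly from the list of tuples.
df = pd.DataFrame(rows, columns=["VideoID", "Language", "Description"])

# Write to an in-memory buffer instead of a file so the example has no side effects.
buffer = io.StringIO()
df.to_csv(buffer, sep=",", index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the CSV text back to check that nothing was lost.
restored = pd.read_csv(io.StringIO(csv_text))
```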
