Torch Study Notes

Torch Notes (5): Multi-class Classification with a DNN

The previous post covered the two ways to train a DNN in Torch. Now it's time to sharpen our knives and get practical: we'll use Torch's components to solve multi-class classification, one of the most common problems in machine learning.

We'll pick a classic three-class dataset, available at http://mlr.cs.umass.edu/ml/datasets/Wine. It is a dataset about wine: 178 samples, 13 attributes, 3 classes in total, no missing values, and all 13 attributes are continuous. The first column holds the wine's class number (1, 2, or 3); columns 2 through 14 hold attribute values such as the wine's alcohol content, acidity, and density. Here is a preview of a few samples:

1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
1,14.2,1.76,2.45,15.2,112,3.27,3.39,.34,1.97,6.75,1.05,2.85,1450
1,14.39,1.87,2.45,14.6,96,2.5,2.52,.3,1.98,5.25,1.02,3.58,1290
1,14.06,2.15,2.61,17.6,121,2.6,2.51,.31,1.25,5.05,1.06,3.58,1295

OK. Given this dataset, let's train a DNN model so that, knowing a wine's 13 attribute values, we can accurately predict its class (1, 2, or 3).

First we build the network. Since the dataset is simple, the network structure is very simple too:

    -- A network with two hidden layers. How many layers and how many neurons
    -- per layer usually has to be found by experiment; the best-performing
    -- structure often takes many tries. I still don't know a shortcut for
    -- picking a good architecture, so if any expert does, please tell me!
    mlp = nn.Sequential()
    mlp:add(nn.Linear(13, 13)) -- the input layer takes 13 features
    mlp:add(nn.ReLU())
    mlp:add(nn.Linear(13, 13))
    mlp:add(nn.ReLU())
    mlp:add(nn.Linear(13, 3)) -- the output layer has 3 neurons (three classes)
    mlp:add(nn.LogSoftMax()) -- ClassNLLCriterion below expects log-probabilities,
                             -- so LogSoftMax is used rather than SoftMax
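A quick smoke test of the untrained network, just to confirm the shapes line up (random input, purely illustrative):

out = mlp:forward(torch.randn(13))
print(out)  -- a 3-element tensor of log-probabilities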

Next we define the loss criterion, tailor-made for multi-class classification. Note that nn.ClassNLLCriterion expects log-probabilities as input, which is why the network above ends in LogSoftMax:

criterion = nn.ClassNLLCriterion()
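As a quick sanity check of how the criterion consumes the network's log-probability output, here is a minimal sketch with made-up values:

-- minimal sketch: ClassNLLCriterion on one sample (values are made up)
logprobs = torch.log(torch.Tensor({0.7, 0.2, 0.1}))
target = 1
print(criterion:forward(logprobs, target))  -- -log(0.7), about 0.357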

Applying the training methods from the previous post, let's write out the training.
The first method: training with a hand-written for loop.

-- mlp is the network, x is the model input, y is the target label,
-- criterion is the loss function, learningRate is the learning rate
function gradUpdate(mlp, x, y, criterion, learningRate)
    local pred = mlp:forward(x)
    local err = criterion:forward(pred, y)
    mlp:zeroGradParameters()
    local gradCriterion = criterion:backward(pred, y)
    mlp:backward(x, gradCriterion)
    mlp:updateParameters(learningRate)
end
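Calling it is just a per-sample loop; a sketch (the 40 epochs and 0.01 learning rate match the complete code at the end of this post):

-- x and y are the input/label tensors built from trainset (see the data loading below)
for epoch = 1, 40 do
    for j = 1, #trainset.data do
        gradUpdate(mlp, x[j], y[j], criterion, 0.01)
    end
end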

The second method: training with the optim package.

-- trainset.data is the training set, stored as a Lua table;
-- x and y are the input and label tensors built from it (see the data loading below)
w, dl_dw = mlp:getParameters()
config = {
    learningRate = 1e-2,
}

for i = 1,50 do
    for j = 1,#trainset.data do
        input = x[j]
        output = y[j]
        feval = function(w_new)
            if w ~= w_new then w:copy(w_new) end
            dl_dw:zero()
            pred = mlp:forward(input)
            loss = criterion:forward(pred, output)
            gradCriterion = criterion:backward(pred, output)
            gradInput = mlp:backward(input, gradCriterion)
            return loss, dl_dw
        end
        optim.rprop(feval,w,config)
    end
end
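One caveat worth flagging: as far as I can tell, optim.rprop has its own step-size parameters (stepsize, etaplus, etaminus, ...) and ignores learningRate in the config. If you want the learning rate to actually take effect, optim.sgd is a drop-in alternative with the same feval interface:

-- drop-in alternative to the rprop call; config.learningRate = 1e-2 is actually used by sgd
optim.sgd(feval, w, config)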

OK, the framework is in place; all that's left is formatting and loading the data. Since Torch is built on Lua, we need Lua to do the loading. It's slightly tedious, but once you're familiar with it, it only takes a few minutes. Let me focus on the loading process here.

Suppose the downloaded dataset file is named "rawdata.txt". We need to parse it line by line. Unfortunately, Lua has no string-split function like Java's or Python's split (such a convenient function!), but no matter: if the library doesn't have it, we'll build the wheel ourselves.
The split function below works like split in Java and Python: it returns a table containing the pieces after splitting. The default separator is ','.

function lua_string_split(str, split_char)
    split_char = split_char or ','
    local sub_str_tab = {};
    local i = 0;
    local j = 0;
    while true do
        j = string.find(str, split_char,i+1);
        if j == nil then
            local last_str = string.sub(str,i+1,#str)
            if last_str ~= "" and last_str ~= split_char then
                table.insert(sub_str_tab,last_str);
            end
            break;
        end;
        local subtmp = string.sub(str,i+1,j-1)
        if subtmp ~= "" and subtmp ~= split_char then
            table.insert(sub_str_tab,subtmp);
        end
        i = j;
    end
    return sub_str_tab;
end
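For example, splitting one line of the raw data (the pieces come back as strings, which is why getData below calls tonumber on them):

t = lua_string_split("1,14.23,1.71,2.43")
print(#t)     -- 4
print(t[2])   -- "14.23"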

Once every line is parsed, the dataset is loaded. Next comes input normalization. For continuous values this step should be considered mandatory: the network needs to learn the data's true distribution, and if the attributes differ wildly in scale, the raw values will drag the model off course.

-- This uses tensor indexing operators, which are covered in detail in
-- [Torch筆記 (二)快速入門](http://blog.csdn.net/whgyxy/article/details/52204206)
-- Normalize the input data
function inputNormlization(data)
    mean = {}
    stdv = {}
    for i = 1,data:size(2) do
    -- compute the mean and standard deviation of each column, then normalize
        mean[i] = data[{{},{i}}]:mean()
        print(string.format("mean[%d]=%f",i,mean[i]))
        data[{{},{i}}]:add(-mean[i])
        stdv[i] = data[{{},{i}}]:std()
        print(string.format("stdv[%d]=%f",i,stdv[i]))
        data[{{},{i}}]:div(stdv[i])
    end
end
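One caveat: inputNormlization computes statistics from whatever tensor it is given, and further below I call it separately on the training and test sets. Strictly speaking, the test set should be normalized with the training set's mean and std. A minimal sketch of that variant (normalizeWith is a hypothetical helper, not used in the code below):

-- hypothetical variant: normalizes `data` in place and returns the statistics,
-- so the training-set statistics can be reused on the test set
function normalizeWith(data, mean, stdv)
    mean = mean or {}
    stdv = stdv or {}
    for i = 1, data:size(2) do
        mean[i] = mean[i] or data[{{},{i}}]:mean()
        data[{{},{i}}]:add(-mean[i])
        stdv[i] = stdv[i] or data[{{},{i}}]:std()
        data[{{},{i}}]:div(stdv[i])
    end
    return mean, stdv
end
-- usage: m, s = normalizeWith(xtrain); normalizeWith(xtest, m, s)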

OK, now that we're familiar with those two functions, let's go through data loading from the top. First we parse each line of the dataset to get every sample's attribute data and class label. After reading the whole file we end up with two tables: one holding the dataset's attribute data, the other holding its labels.

function getData(filepath)
    file = torch.DiskFile(filepath,"r")
    strfile = file:readString("*a")
    file:close()
    lines = lua_string_split(strfile,"\n")

    label = {} -- the labels of all samples
    data = {}  -- the attribute data of all samples
    for i =1,#lines do
        record = {}
        elements = lua_string_split(lines[i],",")
        for j = 2,#elements do
            table.insert(record,tonumber(elements[j]))
        end
        table.insert(label,tonumber(elements[1]))
        table.insert(data,record)
    end
    print(#data)
    return data,label
end

That gives us the whole dataset. To evaluate the model we need cross-validation, so we split the data into n groups: n-1 groups form the training set and the remaining group is the test set. Numbering the groups 1, 2, ..., n, the first run uses group 1 as the test set and the other groups for training, the second run uses group 2 as the test set, and so on, n runs in all. While grouping, we also shuffle the data, i.e. randomize the order of the samples, so that the input order has no influence on the model.

Let's do the grouping first:

-- data is the dataset obtained above
-- label holds the labels of the samples, obtained above
-- num is the number of groups n
-- returns a table `batch` holding the data of every group,
-- and another table `label_b` holding the labels of every group
function splitDataSet(data,label,num)
    local samplenum = #data
    local batch = {}
    local label_b = {}
    for i = 1,num do
        batch[i] = {}
        label_b[i] = {}
    end
    local batchsize = samplenum / num  -- may be fractional; each group fills up to the next integer
    local lastbatchsize = samplenum - batchsize * (num - 1)  -- computed but not used below
    flag  = {} -- marks whether a sample has been assigned to a group (for shuffling)
    for i = 1,samplenum do
        flag[i] = 0  -- initially no sample is assigned
    end
    math.randomseed(os.time())
    i = 1
    isbreak = 0
    -- every group gets the same amount of data (except possibly the last); fill from the first group
    while 1 do
        for batch_index = 1,num do
            index = math.random(samplenum)  -- draw a random sample index;
            -- if that sample is unassigned and the current group is not full yet
            if flag[index] == 0 and #(batch[batch_index]) < batchsize then
                table.insert(batch[batch_index] , data[index])
                table.insert(label_b[batch_index] ,label[index])
                flag[index] = 1
                i = i+1
                if i > samplenum then  -- all samples have been assigned
                    isbreak = 1
                    break
                end
            end
        end
        if isbreak == 1 then break end
    end
    return batch,label_b
end

Then, given the grouped data and labels plus a group index, we build the training and test sets:

-- use group number `index` as the test set and merge the remaining groups into the training set
function generateTrain_Test(batch,label_b,index)
    trainset = {data = {},label = {}}
    testset = {data = {},label = {} }
    testset.data = batch[index]
    testset.label = label_b[index]
    tmpdata = {}
    tmplabel = {}
    tmpi = 1
    for i =1,#batch do
        if i ~= index then
            for j = 1,#(batch[i]) do
                tmpdata[tmpi] = batch[i][j]
                tmplabel[tmpi] = label_b[i][j]
                tmpi = tmpi + 1
            end
        end
    end
    trainset.data = tmpdata
    trainset.label = tmplabel
    return trainset,testset
end

Finally, let's wrap the whole loading process up:

-- filepath is the path of the dataset file
-- batchsize is the number of groups to split the data into (not the group size);
-- testbatch_index is the index of the group used as the test set
-- returns the training set and the test set
function load_data(filepath,batchsize,testbatch_index)
    rawdata,rawlabel = getData(filepath)
    batch,label_b = splitDataSet(rawdata,rawlabel,batchsize)
    -- Generate the trainset and testset
    trainset,testset = generateTrain_Test(batch,label_b,testbatch_index)
    return trainset,testset
end
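With load_data in hand, the n-fold cross-validation protocol described earlier is just a loop over the test-group index; a sketch (train_and_eval is hypothetical shorthand for the training and evaluation code below):

for fold = 1, 6 do
    local trainset, testset = load_data('./datasets/rawdata.txt', 6, fold)
    train_and_eval(trainset, testset)  -- hypothetical: train mlp, then Test_MultiClass
end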

When training finishes, we evaluate the model's prediction accuracy. Since this is classification, we measure it with classification accuracy:

-- model is the trained model
-- data is the test set's attribute data, label holds the test set's labels
function Test_MultiClass(model,data,label)
    pred = model:forward(data)
    print("pred")
    print(pred)
    -- sort the predictions and take the class with the highest probability as the prediction
    tmp,index = torch.sort(pred,2,true)
    print(tmp)
    print(index)
    correct = 0
    print("comp")
    -- compare each true label against the prediction and count the hits
    -- (label[i] is a Tensor object, but index[i][1] is a plain number)
    for i = 1,pred:size(1) do
        -- we could also write it as: if label[i]:eq(index[i][1]):all() then
        if torch.eq(label[i],index[i][1])[1] == 1 then
            correct = correct + 1
        end
    end
    print("correct")
    print(correct)
    correctRate = correct * 1.0 / pred:size(1)
    print(string.format("correct is %f",correctRate))
    if correctRate > 0.98 then torch.save('model', model) end -- save good models
    for i = 1,pred:size(1) do
        print(pred[i][1],pred[i][2],pred[i][3],label[i][1])
    end
end
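As an aside, since we only need the argmax and not the full ordering, torch.max along dimension 2 does the same job as the torch.sort call above:

-- equivalent to the sort above, but without ordering all three classes
tmp, index = torch.max(pred, 2)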

Alright, that's it: time to start training and play with the DNN. I have to say, the data-processing part is much more work than the model itself; don't you feel the same? And even once the data is ready, tuning the model is a process that takes patience too.

The complete code for this post follows.

require 'torch'
require 'nn'
require 'optim'


-------------------------------------------------------------------------------------
-- Split String
-- str: string to be split
-- split_char: the separator, default is ','
-------------------------------------------------------------------------------------
function lua_string_split(str, split_char)
    split_char = split_char or ','
    local sub_str_tab = {};
    local i = 0;
    local j = 0;
    while true do
        j = string.find(str, split_char,i+1);
        if j == nil then
            local last_str = string.sub(str,i+1,#str)
            if last_str ~= "" and last_str ~= split_char then
                table.insert(sub_str_tab,last_str);
            end
            break;
        end;
        local subtmp = string.sub(str,i+1,j-1)
        if subtmp ~= "" and subtmp ~= split_char then
            table.insert(sub_str_tab,subtmp);
        end
        i = j;
    end
    return sub_str_tab;
end

-------------------------------------------------------------------------------------
-- Normalize the input data
-- data: the input data
-- Each feature has its mean subtracted and is divided by its std
-------------------------------------------------------------------------------------
function inputNormlization(data)
    mean = {}
    stdv = {}
    for i = 1,data:size(2) do
        mean[i] = data[{{},{i}}]:mean()
        print(string.format("mean[%d]=%f",i,mean[i]))
        data[{{},{i}}]:add(-mean[i])
        stdv[i] = data[{{},{i}}]:std()
        print(string.format("stdv[%d]=%f",i,stdv[i]))
        data[{{},{i}}]:div(stdv[i])
    end
end


function Test_MultiClass(model,data,label)
    pred = model:forward(data)
    print("pred")
    print(pred)
    tmp,index = torch.sort(pred,2,true)
    print(tmp)
    print(index)
    correct = 0
    print("comp")
    -- label[i] is a Tensor object, but index[i][1] is a plain number
    for i = 1,pred:size(1) do
        -- we could also write it as: if label[i]:eq(index[i][1]):all() then
        if torch.eq(label[i],index[i][1])[1] == 1 then
            correct = correct + 1
        end
    end
    print("correct")
    print(correct)
    correctRate = correct * 1.0 / pred:size(1)
    print(string.format("correct is %f",correctRate))
    if correctRate > 0.98 then torch.save('model', model) end -- save good models
    for i = 1,pred:size(1) do
        print(pred[i][1],pred[i][2],pred[i][3],label[i][1])
    end
end


-------------------------------------------------------------------------------------
-- Get raw data and the raw label
-- filepath:the path of file
-- Return a table of raw data and it's label
-- This function is for training data that has labels
-------------------------------------------------------------------------------------
function getData(filepath)
    file = torch.DiskFile(filepath,"r")
    strfile = file:readString("*a")
    file:close()
    lines = lua_string_split(strfile,"\n")

    label = {}
    data = {}
    for i =1,#lines do
        record = {}
        elements = lua_string_split(lines[i],",")
        for j = 2,#elements do
            table.insert(record,tonumber(elements[j]))
        end
        table.insert(label,tonumber(elements[1]))
        table.insert(data,record)
    end
    print(#data)
    return data,label
end

-------------------------------------------------------------------------------------
-- Split DataSet
-- data: raw data
-- label: the labels of the raw data
-- num: split the data into num groups
-- This function splits the raw data into num groups, used for selecting the
--    training data and the validation data
-------------------------------------------------------------------------------------
function splitDataSet(data,label,num)
    local samplenum = #data
    local batch = {}
    local label_b = {}
    for i = 1,num do
        batch[i] = {}
        label_b[i] = {}
    end
    local batchsize = samplenum / num  -- may be fractional; each group fills up to the next integer
    local lastbatchsize = samplenum - batchsize * (num - 1)  -- computed but not used below
    flag  = {}
    for i = 1,samplenum do
        flag[i] = 0
    end
    math.randomseed(os.time())
    i = 1
    isbreak = 0
    while 1 do
        for batch_index = 1,num do
            index = math.random(samplenum)
            if flag[index] == 0 and #(batch[batch_index]) < batchsize then
                table.insert(batch[batch_index] , data[index])
                table.insert(label_b[batch_index] ,label[index])
                flag[index] = 1
                i = i+1
                if i > samplenum then
                    isbreak = 1
                    break
                end
            end
        end
        if isbreak == 1 then break end
    end
    return batch,label_b
end


-------------------------------------------------------------------------------------
-- Generate Train and Test data
-- batch: the grouped data
-- label_b: the grouped labels
-- index: the index of batch that will be selected as test data
-------------------------------------------------------------------------------------
function generateTrain_Test(batch,label_b,index)
    trainset = {data = {},label = {}}
    testset = {data = {},label = {} }
    testset.data = batch[index]
    testset.label = label_b[index]
    tmpdata = {}
    tmplabel = {}
    tmpi = 1
    for i =1,#batch do
        if i ~= index then
            for j = 1,#(batch[i]) do
                tmpdata[tmpi] = batch[i][j]
                tmplabel[tmpi] = label_b[i][j]
                tmpi = tmpi + 1
            end
        end
    end
    trainset.data = tmpdata
    trainset.label = tmplabel
    return trainset,testset
end

-------------------------------------------------------------------------------------
-- Load the data
-- filepath: the path of dataset
-- batchsize: the number of groups to split the data into
-- testbatch_index: the index of batch that will be selected as test data
-- Return trainset and testset
-------------------------------------------------------------------------------------
function load_data(filepath,batchsize,testbatch_index)
    rawdata,rawlabel = getData(filepath)
    batch,label_b = splitDataSet(rawdata,rawlabel,batchsize)
    -- Generate the trainset and testset
    trainset,testset = generateTrain_Test(batch,label_b,testbatch_index)
    return trainset,testset
end

trainset,testset = load_data('./datasets/rawdata.txt',6,3)
x = torch.Tensor(trainset.data)
y = torch.Tensor(trainset.label)
x = x:reshape(#trainset.data,13)
y = y:reshape(#trainset.label,1)
inputNormlization(x)

-- Build the network (same structure as the snippet above)
function init_model()
    local model = nn.Sequential()
    model:add(nn.Linear(13, 13)) -- the input layer takes 13 features
    model:add(nn.ReLU())
    model:add(nn.Linear(13, 13))
    model:add(nn.ReLU())
    model:add(nn.Linear(13, 3)) -- the output layer has 3 neurons (three classes)
    model:add(nn.LogSoftMax()) -- ClassNLLCriterion expects log-probabilities
    return model
end

mlp = init_model()
-- The criterion for multi-class classification (expects log-probabilities)
criterion = nn.ClassNLLCriterion()
-- Get the parameters and derivative of the parameters
w, dl_dw = mlp:getParameters()

config = {
    learningRate = 1e-2,
}
-- One way to train the model: optim.rprop, which can be fast.
-- It takes the closure feval as its first parameter, the flattened parameters w
--   as the second, and a config table as the third.
-- Note: rprop has its own step-size parameters and ignores config.learningRate;
--   an optimizer such as optim.sgd would actually use it.
--for i = 1,50 do
--    for j = 1,#trainset.data do
--        input = x[j]
--        output = y[j]
--        feval = function(w_new)
--            if w ~= w_new then w:copy(w_new) end
--            dl_dw:zero()
--            pred = mlp:forward(input)
--            loss = criterion:forward(pred, output)
--            gradCriterion = criterion:backward(pred, output)
--            gradInput = mlp:backward(input, gradCriterion)
--            return loss, dl_dw
--        end
--        optim.rprop(feval,w,config)
--    end
--end

-- The other way to train the model: update the parameters sample by sample
function gradUpdate(mlp, x, y, criterion, learningRate)
    local pred = mlp:forward(x)
    local err = criterion:forward(pred, y)
    mlp:zeroGradParameters()
    local gradCriterion = criterion:backward(pred, y)
    mlp:backward(x, gradCriterion)
    mlp:updateParameters(learningRate)
end
for i = 1,40 do
    for j = 1,#trainset.data do
        input = x[j]
        output = y[j]
        gradUpdate(mlp, input, output, criterion, 0.01)
    end
end


x = torch.Tensor(testset.data)
y = torch.Tensor(testset.label)
-- the dataset is a Lua table; convert it to tensors that torch can accept
x = x:reshape(#testset.data,13) 
y = y:reshape(#testset.label,1)
inputNormlization(x)  -- normalize the test data (see the caveat about train statistics above)
Test_MultiClass(mlp,x,y)  -- evaluate and compute the classification accuracy




