Torch Notes (5): Multi-Class Classification with a DNN
The previous post covered the two ways of training a DNN in torch. Now it's time to sharpen our tools and put them to work: we will use torch's building blocks to solve a common machine-learning task, multi-class classification.
Let's pick a classic three-class dataset: http://mlr.cs.umass.edu/ml/datasets/Wine. It is a wine dataset with 178 samples, 13 attributes, and 3 classes; there are no missing values and all 13 attributes are continuous. The first column is the wine's class label (1, 2, or 3), and columns 2 through 14 hold attribute values such as alcohol content, acidity, and density. Here is a preview of a few samples:
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
1,14.2,1.76,2.45,15.2,112,3.27,3.39,.34,1.97,6.75,1.05,2.85,1450
1,14.39,1.87,2.45,14.6,96,2.5,2.52,.3,1.98,5.25,1.02,3.58,1290
1,14.06,2.15,2.61,17.6,121,2.6,2.51,.31,1.25,5.05,1.06,3.58,1295
OK, now let's train a DNN model on this dataset: given the 13 attribute values of a wine, it should accurately predict the wine's class (1, 2, or 3).
First, build the network. The dataset is simple, so the architecture can be simple too:
-- A network with two hidden layers. How many layers to use, and how many
-- neurons per layer, generally has to be determined experimentally; the
-- best-performing architecture usually takes several trials to find. I still
-- don't know how to pick a good architecture quickly; if you do, please tell me.
mlp = nn.Sequential()
mlp:add(nn.Linear(13, 13)) -- 13 input neurons
mlp:add(nn.ReLU())
mlp:add(nn.Linear(13, 13))
mlp:add(nn.ReLU())
mlp:add(nn.Linear(13, 3)) -- 3 output neurons (three classes)
mlp:add(nn.LogSoftMax()) -- ClassNLLCriterion expects log-probabilities, so LogSoftMax, not SoftMax
Define the criterion tailor-made for multi-class classification:
criterion=nn.ClassNLLCriterion()
Using the training methods from the previous post, write out the training code.
The first method: a hand-written training loop.
-- mlp is the network, x is the model input, y is the target output, criterion is the loss function, learningRate is the learning rate
function gradUpdate(mlp, x, y, criterion, learningRate)
   local pred = mlp:forward(x)
   local err = criterion:forward(pred, y)
   mlp:zeroGradParameters()
   local gradCriterion = criterion:backward(pred, y)
   mlp:backward(x, gradCriterion)
   mlp:updateParameters(learningRate)
end
The second method: training with the optim package.
-- trainset.data is the training set, stored as a table
w, dl_dw = mlp:getParameters()
config = {
   learningRate = 1e-2,
}
for i = 1, 50 do
   for j = 1, #trainset.data do
      input = x[j]
      output = y[j]
      feval = function(w_new)
         if w ~= w_new then w:copy(w_new) end
         dl_dw:zero()
         pred = mlp:forward(input)
         loss = criterion:forward(pred, output)
         gradCriterion = criterion:backward(pred, output)
         gradInput = mlp:backward(input, gradCriterion)
         return loss, dl_dw
      end
      optim.rprop(feval, w, config)
   end
end
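The key idea here is the `feval` contract: the closure takes the current parameter vector and returns the loss and the gradient, and the optimizer only ever talks to `feval`. To see the contract in isolation, here is a toy sketch in plain Lua (no torch): plain gradient descent minimizing (w - 3)^2 through an feval-style closure.

```lua
-- feval(w) returns (loss, gradient); the update loop plays the
-- role of the optimizer. This is a sketch, not part of the script.
w = 0
function feval(w_new)
   local loss = (w_new - 3) ^ 2
   local grad = 2 * (w_new - 3)
   return loss, grad
end
for step = 1, 100 do
   local loss, grad = feval(w)
   w = w - 0.1 * grad -- gradient-descent step
end
-- w converges toward the minimizer, 3
```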
OK, the framework is in place; all that remains is formatting and loading the data. Since torch is built on Lua, this has to be done with Lua, which is slightly tedious at first but takes only minutes once you are used to it. Let's focus on the loading process.
Suppose the downloaded dataset file is named "rawdata.txt". We need to parse it line by line. Unfortunately, Lua has no built-in split function like Java's or Python's (and that function really is convenient), but no matter: if the library doesn't provide it, we roll our own.
The string-split function below works like split in Java or Python: it returns a table containing the pieces after splitting; the default separator is ','.
function lua_string_split(str, split_char)
   split_char = split_char or ','
   local sub_str_tab = {}
   local i = 0
   local j = 0
   while true do
      j = string.find(str, split_char, i + 1)
      if j == nil then
         local last_str = string.sub(str, i + 1, #str)
         if last_str ~= "" and last_str ~= split_char then
            table.insert(sub_str_tab, last_str)
         end
         break
      end
      local subtmp = string.sub(str, i + 1, j - 1)
      if subtmp ~= "" and subtmp ~= split_char then
         table.insert(sub_str_tab, subtmp)
      end
      i = j
   end
   return sub_str_tab
end
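For comparison, the same splitting can be written more compactly with Lua's built-in `string.gmatch` pattern iterator. This is just a sketch (the hand-rolled version above behaves the same on our comma-separated lines):

```lua
-- Compact alternative splitter using string.gmatch. Note: the separator
-- is interpolated into a Lua pattern character set, so characters special
-- inside a set (e.g. '%', ']', '^') would need escaping; ',' and '\n' are safe.
function gmatch_split(str, sep)
   sep = sep or ','
   local fields = {}
   for field in string.gmatch(str, '([^' .. sep .. ']+)') do
      table.insert(fields, field)
   end
   return fields
end

fields = gmatch_split('1,14.23,1.71,2.43,15.6')
print(#fields)    -- 5
print(fields[2])  -- 14.23
```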
Once every line is parsed we have the dataset loaded; the next step is input normalization. For continuous values this is pretty much mandatory: the network needs to learn the true distribution of the data, and if the attribute scales differ wildly, training will go astray.
-- Tensor indexing with the [{{},{i}}] operator is covered in detail in
-- [Torch Notes (2): Quick Start](http://blog.csdn.net/whgyxy/article/details/52204206)
-- Input normalization
function inputNormlization(data)
   mean = {}
   stdv = {}
   for i = 1, data:size(2) do
      -- Compute the mean and standard deviation of each column, then normalize it
      mean[i] = data[{{},{i}}]:mean()
      print(string.format("mean[%d]=%f", i, mean[i]))
      data[{{},{i}}]:add(-mean[i])
      stdv[i] = data[{{},{i}}]:std()
      print(string.format("stdv[%d]=%f", i, stdv[i]))
      data[{{},{i}}]:div(stdv[i])
   end
end
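To make the transform concrete, here is the same per-column computation in plain Lua on a toy column (no torch needed). Note that `Tensor:std()` uses the sample standard deviation, i.e. the n-1 divisor:

```lua
-- Plain-Lua sketch of the per-column normalization: x <- (x - mean) / std
col = {13.2, 14.0, 12.8, 13.6}

sum = 0
for _, v in ipairs(col) do sum = sum + v end
mean = sum / #col                    -- 13.4 for this column

sq = 0
for _, v in ipairs(col) do sq = sq + (v - mean) ^ 2 end
std = math.sqrt(sq / (#col - 1))     -- sample std, n-1 divisor

for k, v in ipairs(col) do col[k] = (v - mean) / std end
-- After normalization the column has mean 0 and sample std 1
```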
With those two functions in hand, let's walk through data loading from the start: parse each line of the file into one sample's attribute values and its class label. After reading the whole file we end up with two tables, one holding the attribute data of the dataset and the other holding its labels.
function getData(filepath)
   file = torch.DiskFile(filepath, "r")
   strfile = file:readString("*a")
   file:close()
   lines = lua_string_split(strfile, "\n")
   label = {} -- labels of all samples
   data = {}  -- attribute data of all samples
   for i = 1, #lines do
      record = {}
      elements = lua_string_split(lines[i], ",")
      for j = 2, #elements do
         table.insert(record, tonumber(elements[j]))
      end
      table.insert(label, tonumber(elements[1]))
      table.insert(data, record)
   end
   print(#data)
   return data, label
end
This gives us the full dataset, but to evaluate the model we need cross-validation: split the data into n groups, train on n-1 of them and test on the remaining one. Numbering the groups 1, 2, …, n, the first run uses group 1 as the test set and the others for training, the second run uses group 2 as the test set, and so on, for n runs in total. While grouping we also shuffle the data, i.e. randomize the sample order, so that the input order cannot influence the model.
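The random-retry loop below works, but the same shuffle-and-partition can be sketched more directly: Fisher-Yates shuffle the sample indices once, then deal them into the folds round-robin, with no retries needed. `makeFolds` is a hypothetical helper for illustration, not part of the final script:

```lua
-- Hypothetical alternative to splitDataSet: shuffle all indices once,
-- then deal them into num folds round-robin.
function makeFolds(samplenum, num)
   local idx = {}
   for i = 1, samplenum do idx[i] = i end
   for i = samplenum, 2, -1 do        -- Fisher-Yates shuffle
      local j = math.random(i)
      idx[i], idx[j] = idx[j], idx[i]
   end
   local folds = {}
   for f = 1, num do folds[f] = {} end
   for i = 1, samplenum do            -- round-robin deal
      table.insert(folds[((i - 1) % num) + 1], idx[i])
   end
   return folds
end

folds = makeFolds(178, 6)
-- 178 = 6*29 + 4, so folds 1-4 get 30 indices and folds 5-6 get 29
```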
First, split the dataset into groups:
-- data is the dataset obtained above
-- label holds the label of each sample
-- num is the number of groups n
-- Returns a table batch holding the data of every group,
-- and a table label_b holding the labels of every group
function splitDataSet(data, label, num)
   local samplenum = #data
   local batch = {}
   local label_b = {}
   for i = 1, num do
      batch[i] = {}
      label_b[i] = {}
   end
   -- samplenum may not divide evenly; the fractional batchsize
   -- effectively caps each group at ceil(samplenum / num) samples
   local batchsize = samplenum / num
   flag = {} -- shuffle bookkeeping: marks whether a sample has been assigned
   for i = 1, samplenum do
      flag[i] = 0 -- initially unassigned
   end
   math.randomseed(os.time())
   i = 1
   isbreak = 0
   -- Fill the groups one sample at a time, cycling through the groups
   while 1 do
      for batch_index = 1, num do
         index = math.random(samplenum) -- draw a random sample index
         -- if that sample is unassigned and the current group is not yet full
         if flag[index] == 0 and #(batch[batch_index]) < batchsize then
            table.insert(batch[batch_index], data[index])
            table.insert(label_b[batch_index], label[index])
            flag[index] = 1
            i = i + 1
            if i > samplenum then -- all samples assigned
               isbreak = 1
               break
            end
         end
      end
      if isbreak == 1 then break end
   end
   return batch, label_b
end
Then, given the grouped data, the grouped labels, and a group index, build the training and test sets:
-- group index selects the test set; the remaining groups are merged into the training set
function generateTrain_Test(batch, label_b, index)
   trainset = {data = {}, label = {}}
   testset = {data = {}, label = {}}
   testset.data = batch[index]
   testset.label = label_b[index]
   tmpdata = {}
   tmplabel = {}
   tmpi = 1
   for i = 1, #batch do
      if i ~= index then
         for j = 1, #(batch[i]) do
            tmpdata[tmpi] = batch[i][j]
            tmplabel[tmpi] = label_b[i][j]
            tmpi = tmpi + 1
         end
      end
   end
   trainset.data = tmpdata
   trainset.label = tmplabel
   return trainset, testset
end
Now let's put the whole data-loading process together:
-- filepath: path to the dataset file
-- batchsize: the number of groups to split the data into (it is passed to splitDataSet as num)
-- testbatch_index: the index of the group used as the test set
-- Returns the training set and the test set
function load_data(filepath, batchsize, testbatch_index)
   rawdata, rawlabel = getData(filepath)
   batch, label_b = splitDataSet(rawdata, rawlabel, batchsize)
   -- Generate the trainset and testset
   trainset, testset = generateTrain_Test(batch, label_b, testbatch_index)
   return trainset, testset
end
Once training is done we want to evaluate the model's prediction accuracy; since this is classification, classification accuracy is the natural metric.
-- model is the trained model
-- data is the test-set attribute data, label holds the test labels
function Test_MultiClass(model, data, label)
   pred = model:forward(data)
   print("pred")
   print(pred)
   -- Sort the predictions; the class with the highest probability
   -- is taken as the predicted class of the sample
   tmp, index = torch.sort(pred, 2, true)
   print(tmp)
   print(index)
   correct = 0
   print("comp")
   -- Compare the true label of every sample with the prediction and compute the accuracy
   for i = 1, pred:size(1) do -- label[i] is a Tensor, but index[i][1] is a number
      -- if label[i]:eq(index[i][1]):all() then -- we can also write it like this
      if torch.eq(label[i], index[i][1])[1] == 1 then
         correct = correct + 1
      end
   end
   print("correct")
   print(correct)
   correctRate = correct * 1.0 / pred:size(1)
   print(string.format("correct rate is %f", correctRate))
   if correctRate > 0.98 then torch.save('model', model) end
   for i = 1, pred:size(1) do
      print(pred[i][1], pred[i][2], pred[i][3], label[i][1])
   end
end
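The accuracy computation boils down to an argmax per row. Here is the same idea as a plain-Lua sketch working on ordinary tables rather than Tensors (`accuracy` is a hypothetical helper for illustration):

```lua
-- Per-row argmax, compared against the true class index
function accuracy(pred, label)
   local correct = 0
   for i = 1, #pred do
      local best, besti = pred[i][1], 1
      for j = 2, #pred[i] do
         if pred[i][j] > best then best, besti = pred[i][j], j end
      end
      if besti == label[i] then correct = correct + 1 end
   end
   return correct / #pred
end

pred  = {{0.7, 0.2, 0.1}, {0.1, 0.3, 0.6}, {0.2, 0.5, 0.3}}
label = {1, 3, 1}
-- rows 1 and 2 are classified correctly, row 3 is not: accuracy 2/3
print(accuracy(pred, label))
```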
That's it: we can start training and playing with the DNN. The data handling turned out to be far more work than the model itself, don't you think? And once the data is ready, tuning the model is another exercise in patience.
The complete code for this post follows:
require 'torch'
require 'nn'
require 'optim'
-------------------------------------------------------------------------------------
-- Split String
-- str: the string to be split
-- split_char: the separator, defaults to ','
-------------------------------------------------------------------------------------
function lua_string_split(str, split_char)
   split_char = split_char or ','
   local sub_str_tab = {}
   local i = 0
   local j = 0
   while true do
      j = string.find(str, split_char, i + 1)
      if j == nil then
         local last_str = string.sub(str, i + 1, #str)
         if last_str ~= "" and last_str ~= split_char then
            table.insert(sub_str_tab, last_str)
         end
         break
      end
      local subtmp = string.sub(str, i + 1, j - 1)
      if subtmp ~= "" and subtmp ~= split_char then
         table.insert(sub_str_tab, subtmp)
      end
      i = j
   end
   return sub_str_tab
end
-------------------------------------------------------------------------------------
-- Normalize the input data
-- data: the input Tensor
-- Each feature has its mean subtracted and is divided by its std
-------------------------------------------------------------------------------------
function inputNormlization(data)
   mean = {}
   stdv = {}
   for i = 1, data:size(2) do
      mean[i] = data[{{},{i}}]:mean()
      print(string.format("mean[%d]=%f", i, mean[i]))
      data[{{},{i}}]:add(-mean[i])
      stdv[i] = data[{{},{i}}]:std()
      print(string.format("stdv[%d]=%f", i, stdv[i]))
      data[{{},{i}}]:div(stdv[i])
   end
end
function Test_MultiClass(model, data, label)
   pred = model:forward(data)
   print("pred")
   print(pred)
   tmp, index = torch.sort(pred, 2, true)
   print(tmp)
   print(index)
   correct = 0
   print("comp")
   for i = 1, pred:size(1) do -- label[i] is a Tensor, but index[i][1] is a number
      -- if label[i]:eq(index[i][1]):all() then -- we can also write it like this
      if torch.eq(label[i], index[i][1])[1] == 1 then
         correct = correct + 1
      end
   end
   print("correct")
   print(correct)
   correctRate = correct * 1.0 / pred:size(1)
   print(string.format("correct rate is %f", correctRate))
   if correctRate > 0.98 then torch.save('model', model) end
   for i = 1, pred:size(1) do
      print(pred[i][1], pred[i][2], pred[i][3], label[i][1])
   end
end
-------------------------------------------------------------------------------------
-- Get raw data and the raw label
-- filepath:the path of file
-- Returns a table of raw data and a table of its labels
-- This function is for labeled training data
-------------------------------------------------------------------------------------
function getData(filepath)
   file = torch.DiskFile(filepath, "r")
   strfile = file:readString("*a")
   file:close()
   lines = lua_string_split(strfile, "\n")
   label = {}
   data = {}
   for i = 1, #lines do
      record = {}
      elements = lua_string_split(lines[i], ",")
      for j = 2, #elements do
         table.insert(record, tonumber(elements[j]))
      end
      table.insert(label, tonumber(elements[1]))
      table.insert(data, record)
   end
   print(#data)
   return data, label
end
-------------------------------------------------------------------------------------
-- Split DataSet
-- data: the raw data
-- label: the labels of the raw data
-- num: split the data into num batches
-- This function splits the raw data into num batches, used for selecting
-- the training data and the validation data
-------------------------------------------------------------------------------------
function splitDataSet(data, label, num)
   local samplenum = #data
   local batch = {}
   local label_b = {}
   for i = 1, num do
      batch[i] = {}
      label_b[i] = {}
   end
   local batchsize = samplenum / num
   flag = {}
   for i = 1, samplenum do
      flag[i] = 0
   end
   math.randomseed(os.time())
   i = 1
   isbreak = 0
   while 1 do
      for batch_index = 1, num do
         index = math.random(samplenum)
         if flag[index] == 0 and #(batch[batch_index]) < batchsize then
            table.insert(batch[batch_index], data[index])
            table.insert(label_b[batch_index], label[index])
            flag[index] = 1
            i = i + 1
            if i > samplenum then
               isbreak = 1
               break
            end
         end
      end
      if isbreak == 1 then break end
   end
   return batch, label_b
end
-------------------------------------------------------------------------------------
-- Generate Train and Test data
-- batch: pieces of batches data
-- label_b: pieces of batches label
-- index: the index of batch that will be selected as test data
-------------------------------------------------------------------------------------
function generateTrain_Test(batch, label_b, index)
   trainset = {data = {}, label = {}}
   testset = {data = {}, label = {}}
   testset.data = batch[index]
   testset.label = label_b[index]
   tmpdata = {}
   tmplabel = {}
   tmpi = 1
   for i = 1, #batch do
      if i ~= index then
         for j = 1, #(batch[i]) do
            tmpdata[tmpi] = batch[i][j]
            tmplabel[tmpi] = label_b[i][j]
            tmpi = tmpi + 1
         end
      end
   end
   trainset.data = tmpdata
   trainset.label = tmplabel
   return trainset, testset
end
-------------------------------------------------------------------------------------
-- Load the data
-- filepath: the path of dataset
-- batchsize: the number of batches to split the data into
-- testbatch_index: the index of batch that will be selected as test data
-- Return trainset and testset
-------------------------------------------------------------------------------------
function load_data(filepath, batchsize, testbatch_index)
   rawdata, rawlabel = getData(filepath)
   batch, label_b = splitDataSet(rawdata, rawlabel, batchsize)
   -- Generate the trainset and testset
   trainset, testset = generateTrain_Test(batch, label_b, testbatch_index)
   return trainset, testset
end
trainset,testset = load_data('./datasets/rawdata.txt',6,3)
x = torch.Tensor(trainset.data)
y = torch.Tensor(trainset.label)
x = x:reshape(#trainset.data,13)
y = y:reshape(#trainset.label,1)
inputNormlization(x)
-- Build the network: two hidden layers, ending in LogSoftMax because
-- ClassNLLCriterion expects log-probabilities
function init_model()
   local model = nn.Sequential()
   model:add(nn.Linear(13, 13))
   model:add(nn.ReLU())
   model:add(nn.Linear(13, 13))
   model:add(nn.ReLU())
   model:add(nn.Linear(13, 3))
   model:add(nn.LogSoftMax())
   return model
end
mlp = init_model()
-- The criterion for multiclassification
criterion=nn.ClassNLLCriterion()
-- Get the parameters and derivative of the parameters
w, dl_dw = mlp:getParameters()
config = {
learningRate = 1e-2,
}
-- One way to train the model: optim.rprop, which can be fast.
-- It takes an evaluation function as its first parameter, the parameter
-- vector as its second, and a config table as its third.
--for i = 1, 50 do
--   for j = 1, #trainset.data do
--      input = x[j]
--      output = y[j]
--      feval = function(w_new)
--         if w ~= w_new then w:copy(w_new) end
--         dl_dw:zero()
--         pred = mlp:forward(input)
--         loss = criterion:forward(pred, output)
--         gradCriterion = criterion:backward(pred, output)
--         gradInput = mlp:backward(input, gradCriterion)
--         return loss, dl_dw
--      end
--      optim.rprop(feval, w, config)
--   end
--end
-- The other way to train the model: update the parameters sample by sample
function gradUpdate(mlp, x, y, criterion, learningRate)
   local pred = mlp:forward(x)
   print("pred")
   print(pred)
   local err = criterion:forward(pred, y)
   mlp:zeroGradParameters()
   local gradCriterion = criterion:backward(pred, y)
   print("grad of loss")
   print(gradCriterion)
   mlp:backward(x, gradCriterion)
   mlp:updateParameters(learningRate)
end
for i = 1, 40 do
   for j = 1, #trainset.data do
      input = x[j]
      output = y[j]
      print("train data")
      print(x[j])
      print(y[j])
      gradUpdate(mlp, input, output, criterion, 0.01)
   end
end
x = torch.Tensor(testset.data)
y = torch.Tensor(testset.label)
-- The dataset is a Lua table; convert it into a Tensor that torch accepts
x = x:reshape(#testset.data, 13)
y = y:reshape(#testset.label, 1)
inputNormlization(x) -- normalize the test data
Test_MultiClass(mlp, x, y) -- evaluate and compute the classification accuracy