R語言數據抽樣（數據失衡處理、sample隨機抽樣、數據等比抽樣、交叉驗證抽樣）

關注微信公共號：小程在線

關注CSDN博客：程志偉的博客

詳細內容爲《R語言遊戲數據分析與挖掘》第五章學習筆記

數據抽樣包括：

1.數據類不平衡問題解決

2.隨機抽樣

3.數據等比例抽樣（用於多分類）

4.用於交叉驗證的樣本抽取

5.1.2類失衡處理方法

在R中，DMwR包中的SMOTE()函數可以實現SMOTE方法。

perc.over=500表示對原始數據集中的每個少數樣本，都生成5個新的少數樣本；

perc.under=80表示從原始數據集中選擇的多數類的樣本是新生的數據集中少數樣本的80%。

> hyper <-read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.data',
+ header=F)
> names <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.names',
+ header=F, sep='\t')[[1]]
> names <- gsub(pattern =":|[.]", replacement="", x = names)
> colnames(hyper)<-names
> colnames(hyper)
[1] "hypothyroid, negative" "age"
[3] "sex" "on_thyroxine"
[5] "query_on_thyroxine" "on_antithyroid_medication"
[7] "thyroid_surgery" "query_hypothyroid"
[9] "query_hyperthyroid" "pregnant"
[11] "sick" "tumor"
[13] "lithium" "goitre"
[15] "TSH_measured" "TSH"
[17] "T3_measured" "T3"
[19] "TT4_measured" "TT4"
[21] "T4U_measured" "T4U"
[23] "FTI_measured" "FTI"
[25] "TBG_measured" "TBG"
> # 我們將第一列的列名從 hypothyroid, negative改成target，並將該列中的因子negative變成0，其他值變成1
> colnames(hyper)[1]<-"target"
> colnames(hyper)
[1] "target" "age"
[3] "sex" "on_thyroxine"
[5] "query_on_thyroxine" "on_antithyroid_medication"
[7] "thyroid_surgery" "query_hypothyroid"
[9] "query_hyperthyroid" "pregnant"
[11] "sick" "tumor"
[13] "lithium" "goitre"
[15] "TSH_measured" "TSH"
[17] "T3_measured" "T3"
[19] "TT4_measured" "TT4"
[21] "T4U_measured" "T4U"
[23] "FTI_measured" "FTI"
[25] "TBG_measured" "TBG"
> hyper$target<-ifelse(hyper$target=="negative",0,1)
> # 檢查下0、1的結果
> table(hyper$target)

0 1
3012 151
> prop.table(table(hyper$target))

0 1
0.95226051 0.04773949

數據集存在嚴重額失衡數據。

# 利用SMOTE對類失衡問題進行處理
> # 將變量target變成因子型
> hyper$target <- as.factor(hyper$target)
> # 加載DMwR包
> if(!require(DMwR)) install.packages("DMwR")
載入需要的程輯包：DMwR
載入需要的程輯包：lattice
載入需要的程輯包：grid
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Warning message:
程輯包‘DMwR’是用R版本3.6.3 來建造的

# 進行類失衡處理
# perc.over=100:表示少數樣本數=151+151*100%=302
# perc.under=200:表示多數樣本數(新增少數樣本數*200%=151*200%=302)
> hyper_new <- SMOTE(target~.,hyper,perc.over = 100,perc.under = 200)
> # 查看處理後變量target的0、1個數
> table(hyper_new$target)

0 1
302 302

# perc.over=200:表示少數樣本數=151+151*200%=453
# perc.under=300:表示多數樣本數(新增少數樣本數*300%=151*200%*300%=906)
> hyper_new1 <- SMOTE(target~.,hyper,perc.over = 200,perc.under = 300)
> # 查看處理後變量target的0、1個數
> table(hyper_new1$target)

0 1
906 453
> # 對活躍用戶是否付費數據進行研究
> # 導入數據
> user <- read.csv("H:\\程志偉\\R語言遊戲數據分析與挖掘\\Game_DataMining_With_R-master\\data\\第5章\\活躍用戶是否付費數據.csv",header=T)
> # 查看變量名
> colnames(user)
[1] "用戶id" "是否付費"
[3] "註冊至今距離天數" "最後一週登陸天數"
[5] "最後一週登陸次數" "最後一週0.8點登陸次數"
[7] "最後一週8.18點登陸次數" "最後一週18.24點登陸次數"
> # 查看是否付費的類別佔比(0:非付費，1:付費)
> prop.table(table(user$是否付費))

0 1
0.8589596 0.1410404
> table(user$是否付費)

0 1
106176 17434
> # 將是否付費變量轉換成因子型
> user$是否付費 <- as.factor(user$是否付費)
> library(DMwR)
> # 對類失衡數據進行處理
> user_new <- SMOTE(是否付費~.,data=user,perc.over=100,perc.under=200)
> # 查看處理後的結果
> table(user_new$是否付費)

0 1
34868 34868

5.1.3 數據隨機抽樣
> # sample小例子
> set.seed(1234)
> # 創建對象x，有1~10組成
> x <- seq(1,10);x
[1] 1 2 3 4 5 6 7 8 9 10

# 利用sample函數對x進行無放回抽樣
> a <- sample(x,8,replace=FALSE);a
[1] 10 6 5 4 1 8 2 7

# 利用sample函數對x進行有放回抽樣
> b <- sample(x,8,replace=TRUE);b
[1] 7 6 10 6 4 8 4 4
> # 當size大於x的長度
> (c <- sample(x,15,replace = F))
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
> (c <- sample(x,15,replace = T))
[1] 5 8 4 8 3 4 10 5 2 8 4 3 7 9 3

> # 利用sample對活躍用戶數據進行抽樣
> # 導入數據
> #user <- read.csv("活躍用戶是否付費數據.csv",T)
> # 查看數據user的行數
> nrow(user)
[1] 123610

# 利用sample函數對user數據進行無放回抽樣
> set.seed(1234)
> # 提取下標集
> index <- sample(nrow(user),10000,replace=TRUE)
> # 將抽樣數據賦予對象user_sample
> user_sample <- user[index,]
> # 查看user_sample的行數
> nrow(user_sample)
[1] 10000
> # 現在我們分別查看user與user_sample變量“是否付費”中0、1佔比。
> round(prop.table(table(user$是否付費)),3)

0 1
0.859 0.141
> round(prop.table(table(user_sample$是否付費)),3)

0 1
0.853 0.147

# 以下代碼實現抽樣後的“是否付費”的0、1佔比不變
> # 計算出“是否付費”中0的佔比
> rate <- sum(user$是否付費==0)/nrow(user)
> # 提取未付費用戶的下標子集
> d <- 1:nrow(user)
> index1 <- sample(d[user$是否付費==0],10000*rate)
> # 提取付費用戶的下標子集
> index2 <- sample(d[user$是否付費==1],10000*(1-rate))
> # 將抽樣數據賦予對象user_sample1
> user_sample1 <- user[c(index1,index2),]
> # 查看“是否付費”的0、1佔比
> round(prop.table(table(user_sample1$是否付費)),3)

0 1
0.859 0.141

5.1.4利用createDataPartition函數對數據進行等比抽樣

createDataPartition（y,times=1,p=0.5,list=TRUE,groups=min(5,length(y))）

y：一個向量；

times:需要進行抽樣的次數；

p:需要從數據中抽取的樣本比例；

list：結果是否是list形式；

groups：如果輸出變量爲數值型數據，默認按分位數分組進行取樣。

> library(caret)
載入需要的程輯包：ggplot2
Warning messages:
1: 程輯包‘caret’是用R版本3.6.2 來建造的
2: 程輯包‘ggplot2’是用R版本3.6.2 來建造的
> # 提取下標集
> splitindex <- createDataPartition(iris$Species,times=1,p=0.1,list=FALSE)
> splitindex
Resample1
[1,] 4
[2,] 14
[3,] 26
[4,] 27
[5,] 48
[6,] 63
[7,] 67
[8,] 71
[9,] 74
[10,] 97
[11,] 101
[12,] 105
[13,] 107
[14,] 109
[15,] 110
> # 提取符合子集
> sample <- iris[splitindex,]

# 查看Species變量中各類別的個數和佔比
> table(sample$Species);

setosa versicolor virginica
5 5 5
> prop.table(table(sample$Species))

setosa versicolor virginica
0.3333333 0.3333333 0.3333333
> # 設置list爲TRUE
> # 提取下標集
> splitindex1 <- createDataPartition(iris$Species,times=1,p=0.1,list=TRUE)
> # 查看下標集
> splitindex1
$Resample1
[1] 6 10 13 27 38 56 64 97 98 99 113 128 143 145 149

> # 提取子集
> iris[splitindex1$Resample1,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6 5.4 3.9 1.7 0.4 setosa
10 4.9 3.1 1.5 0.1 setosa
13 4.8 3.0 1.4 0.1 setosa
27 5.0 3.4 1.6 0.4 setosa
38 4.9 3.6 1.4 0.1 setosa
56 5.7 2.8 4.5 1.3 versicolor
64 6.1 2.9 4.7 1.4 versicolor
97 5.7 2.9 4.2 1.3 versicolor
98 6.2 2.9 4.3 1.3 versicolor
99 5.1 2.5 3.0 1.1 versicolor
113 6.8 3.0 5.5 2.1 virginica
128 6.1 3.0 4.9 1.8 virginica
143 5.8 2.7 5.1 1.9 virginica
145 6.7 3.3 5.7 2.5 virginica
149 6.2 3.4 5.4 2.3 virginica

# 設置times=2
> splitindex2 <- createDataPartition(iris$Species,times=2,p=0.1,list=TRUE)
> splitindex2
$Resample1
[1] 15 18 20 27 29 51 54 84 89 91 102 105 117 129 131

$Resample2
[1] 18 19 22 29 49 55 58 70 76 84 108 113 119 124 133

# 對12萬本週活躍用戶的數據按照“是否付費”的比例隨機抽取1萬的活躍用戶進行探索性分析
> # 導入數據
> #user <- read.csv("活躍用戶是否付費數據.csv",T)
> # 將“是否付費”改爲因子型變量
> user$是否付費 <- as.factor(user$是否付費)
> # 提取下標集
> ind <- createDataPartition(user$是否付費,p=10000/nrow(user),
+ times=1,list=FALSE)
> # 查看子集中0、1佔比
> prop.table(table(user[ind,'是否付費']))

0 1
0.8589141 0.1410859
> # 利用sample函數對數據分區
> # 提取訓練數據集的下標
> ind <- sample(nrow(user),0.7*nrow(user),replace=F)
> # 構建訓練集數據
> traindata <- user[ind,]
> # 構建測試集數據
> testdata <- user[-ind,]
> # 查看“是否付費”的0、1佔比
> prop.table(table(user$是否付費))

0 1
0.8589596 0.1410404
> prop.table(table(traindata$是否付費))

0 1
0.8590382 0.1409618
> prop.table(table(testdata$是否付費))

0 1
0.8587763 0.1412237

# 利用createDataPartition函數按照”是否付費“等比例對數據進行分區
> library(caret)
> # 將”是否付費“變量轉換成因子型
> user$是否付費 <- as.factor(user$是否付費)
> # 構建訓練數據下標集
> idx <- createDataPartition(user$是否付費,p=0.7,list=FALSE)
> # 構建訓練數據集
> train <- user[idx,]
> # 構建測試數據集
> test <- user[-idx,]
> # 查看”是否付費“的0、1佔比
> prop.table(table(user$是否付費))

0 1
0.8589596 0.1410404
> prop.table(table(train$是否付費))

0 1
0.8589589 0.1410411
> prop.table(table(test$是否付費))

0 1
0.8589612 0.1410388

5.1.5 用於交叉驗證的樣本抽樣
> # zz1爲所有觀測值的下標
> n <- nrow(user);zz1 <- 1:n
> # zz2爲1:5的隨機排列
> set.seed(1234)
> zz2 <- rep(1:5,ceiling(n/5))[1:n]
> zz2 <- sample(zz2,n)
> # 構建訓練集及測試集
> for(i in 1:5){
+ m <- zz1[zz2==i]
+ train <- user[-m,]
+ test <- user[m,]
+ # 接下來就可以利用訓練集建立模型，測試集驗證模型，並計算5次MSE
+ }

利用createFoldsh函數構建五折交叉驗證的訓練集和測試集

createFoldsh（y,k=10,list=TRUE,returnTrain=FALSE）

y:要依據分類的變量；

k:交叉驗證的樣本，默認是10，每重的樣本量爲樣本總量/10；

list:是否以列表或矩陣的形式存儲，默認爲FLASE；

returnTrain：是否返回抽樣的真實值，默認返回樣本的索引值。

> user$是否付費 <- as.factor(user$是否付費)
> index <- createFolds(user$是否付費,k=5,list=FALSE)
> prop.table(table(user[index==1,'是否付費']))

0 1
0.8589515 0.1410485
> prop.table(table(user[index==2,'是否付費']))

0 1
0.8589863 0.1410137
> prop.table(table(user[index==3,'是否付費']))

0 1
0.8589572 0.1410428
> prop.table(table(user[index==4,'是否付費']))

0 1
0.8589515 0.1410485
> prop.table(table(user[index==5,'是否付費']))

0 1
0.8589515 0.1410485

R語言數據抽樣（數據失衡處理、sample隨機抽樣、數據等比抽樣、交叉驗證抽樣）

.NET有哪些好用的定時任務調度框架

Python 將PDF轉爲PDF/A、PDF/X，以及PDF/A轉回PDF

elk3

Kafka存儲機制

aws語音呼叫調用，告警電話

深度學習框架火焰圖pprof和CUDA Nsys配置指南

爬蟲兩種繞過5s盾的方法

【轉】[C#] WebAPI 防止併發調用二（冥等性）

【轉】[SQL Server]關掉 SSMS 的 IntelliSense

號稱能打敗MLP的KAN到底行不行？數學核心原理全面解析

Kettle 安裝與簡單案例介紹

GIT 史上最詳細Git使用教程

Julia（未來可能替代Python與R語言）數據抽樣與結果評價

mysql 免安裝版本

R語言兩種方法連接oracle以及將處理後的數據導入數據庫中

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結

R語言 數據抽樣（數據失衡處理、sample隨機抽樣、數據等比抽樣、交叉驗證抽樣）

R語言數據抽樣（數據失衡處理、sample隨機抽樣、數據等比抽樣、交叉驗證抽樣）