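Data Mining Test Example
Cluster Analysis of User Viewing Habits
User viewing habits differ by hour of the day and by day of the week. The goal here is to cluster IPTV users by their viewing duration in each hour of the day.
Test sample:
Users in the Nanjing area who watched IPTV on June 6, 2013 (a Thursday, not a holiday)
Number of users: 269,745
Data preparation: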
1. Create the temporary table
select s_userid, s_hour, s_timelen into tmp_user_hour_len from tst_fct_d20130606_4 where s_city_id=1
2. Generate the target table
-- sum() with group by collapses the hourly records into one row per user
select s_userid,
sum(case when s_hour='00' then s_timelen else 0 end) as hour00,
sum(case when s_hour='01' then s_timelen else 0 end) as hour01,
sum(case when s_hour='02' then s_timelen else 0 end) as hour02,
sum(case when s_hour='03' then s_timelen else 0 end) as hour03,
sum(case when s_hour='04' then s_timelen else 0 end) as hour04,
sum(case when s_hour='05' then s_timelen else 0 end) as hour05,
sum(case when s_hour='06' then s_timelen else 0 end) as hour06,
sum(case when s_hour='07' then s_timelen else 0 end) as hour07,
sum(case when s_hour='08' then s_timelen else 0 end) as hour08,
sum(case when s_hour='09' then s_timelen else 0 end) as hour09,
sum(case when s_hour='10' then s_timelen else 0 end) as hour10,
sum(case when s_hour='11' then s_timelen else 0 end) as hour11,
sum(case when s_hour='12' then s_timelen else 0 end) as hour12,
sum(case when s_hour='13' then s_timelen else 0 end) as hour13,
sum(case when s_hour='14' then s_timelen else 0 end) as hour14,
sum(case when s_hour='15' then s_timelen else 0 end) as hour15,
sum(case when s_hour='16' then s_timelen else 0 end) as hour16,
sum(case when s_hour='17' then s_timelen else 0 end) as hour17,
sum(case when s_hour='18' then s_timelen else 0 end) as hour18,
sum(case when s_hour='19' then s_timelen else 0 end) as hour19,
sum(case when s_hour='20' then s_timelen else 0 end) as hour20,
sum(case when s_hour='21' then s_timelen else 0 end) as hour21,
sum(case when s_hour='22' then s_timelen else 0 end) as hour22,
sum(case when s_hour='23' then s_timelen else 0 end) as hour23
into user_hour_len_nj_20130606
from tmp_user_hour_len
group by s_userid
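The same pivot, from (user, hour, duration) records to one 24-column row per user, can be sketched in Python to check the logic on a few rows. This is only an illustration: the function name pivot_hourly and the sample records are hypothetical, and it assumes that several records for the same user and hour should be added together.

```python
from collections import defaultdict

def pivot_hourly(rows):
    """Turn (user_id, hour, duration) records into one 24-slot list per user."""
    table = defaultdict(lambda: [0] * 24)
    for user_id, hour, duration in rows:
        # hour is a two-character string such as '00'..'23', like s_hour
        table[user_id][int(hour)] += duration
    return dict(table)

# Hypothetical sample records: (s_userid, s_hour, s_timelen)
sample = [
    ("u1", "21", 31),
    ("u1", "22", 12),
    ("u2", "12", 26),
    ("u2", "13", 59),
    ("u2", "13", 1),
]

wide = pivot_hourly(sample)
for user_id, hours in sorted(wide.items()):
    print(user_id, ",".join(str(v) for v in hours))
```

Each user ends up with exactly 24 values, with 0 for hours in which nothing was watched.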
3. On server 211, export the table to a local file
bcp user_hour_len_nj_20130606 out user_hour_len_nj_20130606.txt -UXXX -PXXX -SXXX -c -t '|' -r '\n'
4. Take the first 200 instances for testing
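Analysis method:
Cluster analysis using the k-means algorithm
Data source format:
Attribute set:
The attribute set describes the 24 hourly slots, in the following format (real can also be written as numeric):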
@relation cluster
@attribute H00 real
@attribute H01 real
@attribute H02 real
@attribute H03 real
@attribute H04 real
@attribute H05 real
@attribute H06 real
@attribute H07 real
@attribute H08 real
@attribute H09 real
@attribute H10 real
@attribute H11 real
@attribute H12 real
@attribute H13 real
@attribute H14 real
@attribute H15 real
@attribute H16 real
@attribute H17 real
@attribute H18 real
@attribute H19 real
@attribute H20 real
@attribute H21 real
@attribute H22 real
@attribute H23 real
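Data set:
The data set holds each user's viewing duration for each of the 24 hours, one row per user, in the following format: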
@data
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,12,0
0,0,0,0,0,0,0,0,0,0,0,0,26,59,16,0,0,0,50,55,56,58,59,10
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,59,59,18,0
57,35,0,0,0,0,20,0,0,0,0,0,0,0,15,59,59,59,59,59,59,58,54,35
.....
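Converting the pipe-delimited export into this ARFF layout could be sketched as follows. It assumes each exported row is the user id followed by the 24 hourly values, that the user id is dropped because the attribute set above has only the 24 hour columns, and that only the first 200 valid rows are kept. The function name to_arff and the sample row are hypothetical.

```python
def to_arff(lines, limit=200, relation="cluster"):
    """Build ARFF text from rows of the form 'userid|h00|h01|...|h23'."""
    out = ["@relation " + relation]
    out += ["@attribute H%02d real" % h for h in range(24)]
    out.append("@data")
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) != 25:
            continue  # skip blank or malformed rows
        out.append(",".join(fields[1:]))  # drop the user id column
        count += 1
        if count >= limit:
            break
    return "\n".join(out) + "\n"

# Hypothetical exported row: user id, then 24 hourly values
row = "u1|" + "|".join(["0"] * 21 + ["31", "12", "0"]) + "\n"
arff_text = to_arff([row, "\n"])
print(arff_text)
```

To convert a real export, pass an open file object instead of the list, for example to_arff(open("user_hour_len_nj_20130606.txt")).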
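Test procedure:
Open the Weka Explorer and use Open file to load the feature file (e.g. example_cluster_ID_H24_200.arff). Then switch to the Cluster tab, choose the SimpleKMeans algorithm, and select EuclideanDistance as the distance function. Set maxIterations=500, numClusters=5 (3 or 4 also work) and seed=10, then click Start.
With numClusters=5, the results are as follows:
1)
This part gives the number of instances in each cluster and their share of the whole sample set.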
2)
Number of iterations: 7
Within cluster sum of squared errors:228.6644541918032
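Within cluster sum of squared errors measures the distance within clusters: the smaller the value, the tighter the clusters (although it also shrinks as the number of clusters grows). With the number of clusters fixed, changing seed changes the initial centroids and therefore this value, so it is worth trying several seeds and keeping the run with the smallest error.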
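The effect of the seed can be illustrated with a minimal plain k-means written from scratch. This is a sketch, not Weka's implementation: Weka's EuclideanDistance normalizes attribute ranges by default and SimpleKMeans uses its own initialization, so the numbers printed here will not match Weka's output. The four points are the sample rows from the data set above; the function names are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, max_iter=500):
    """Plain Lloyd's k-means; returns (assignments, centroids, sse)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:
            break  # assignments are stable, so the centroids are too
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assign))
    return assign, centroids, sse

points = [
    [0] * 21 + [31, 12, 0],
    [0] * 12 + [26, 59, 16, 0, 0, 0, 50, 55, 56, 58, 59, 10],
    [0] * 19 + [34, 59, 59, 18, 0],
    [57, 35, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 15, 59, 59, 59, 59, 59, 59, 58, 54, 35],
]

# Try several seeds with k fixed and keep the smallest within-cluster error
results = [(seed, kmeans(points, 2, seed)[2]) for seed in range(5)]
best_seed, best_sse = min(results, key=lambda r: r[1])
for seed, sse in results:
    print("seed=%d sse=%.1f" % (seed, sse))
print("best seed:", best_seed)
```

Comparing runs this way only makes sense for a fixed number of clusters, since a larger k lowers the error on its own.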
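Parameter description:
The options in the SimpleKMeans parameter window are:
displayStdDevs whether to display the standard deviations of numeric attributes and the counts of nominal attributes
distanceFunction the distance function used to compare instances (default: weka.core.EuclideanDistance). The chooser lists Manhattan, Euclidean, Minkowski distance and others, but SimpleKMeans itself supports only Euclidean and Manhattan distance.
dontReplaceMissingValues if set, missing values are not replaced globally with the mean/mode
maxIterations the maximum number of iterations
numClusters the number of clusters
preserveInstancesOrder whether to preserve the order of the instances
seed the random seed value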
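Questions:
1. How to find out which ID was assigned to which cluster;
A: For the training samples, right-click the entry in the result list and choose "Visualize cluster assignments". In the window that opens, click Save to write an ARFF file. In that file, the last attribute of each sample ("@attribute Cluster") gives the cluster the sample was assigned to;
In addition, the first value is the index of the training sample.
Taking part of the saved file (save_file_ID2Class.arff) as an example: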
----------------------------------------------------------------------------------------------------------------
@attribute H22 numeric
@attribute H23 numeric
@attribute Cluster {cluster0,cluster1,cluster2,cluster3}
@data
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,12,0,cluster1
1,0,0,0,0,0,0,0,0,0,0,0,0,26,59,16,0,0,0,50,55,56,58,59,10,cluster2
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,59,59,18,0,cluster2
3,57,35,0,0,0,0,20,0,0,0,0,0,0,0,15,59,59,59,59,59,59,58,54,35,cluster3
----------------------------------------------------------------------------------------------------------------
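Reading the saved file back to get the cluster of each sample could be sketched as follows. It relies only on the layout shown in the excerpt (sample index first, cluster label last); the function name cluster_by_instance is hypothetical, and the header lines above H22 are left out of the sample just as they are in the excerpt.

```python
from collections import Counter

def cluster_by_instance(arff_lines):
    """Map the first value of each data row to the cluster label in its last column."""
    mapping = {}
    in_data = False
    for line in arff_lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if line.lower() == "@data":
            in_data = True
            continue
        if in_data:
            fields = line.split(",")
            mapping[int(fields[0])] = fields[-1]
    return mapping

saved = """@attribute H22 numeric
@attribute H23 numeric
@attribute Cluster {cluster0,cluster1,cluster2,cluster3}
@data
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,12,0,cluster1
1,0,0,0,0,0,0,0,0,0,0,0,0,26,59,16,0,0,0,50,55,56,58,59,10,cluster2
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,59,59,18,0,cluster2
3,57,35,0,0,0,0,20,0,0,0,0,0,0,0,15,59,59,59,59,59,59,58,54,35,cluster3
""".splitlines()

mapping = cluster_by_instance(saved)
sizes = Counter(mapping.values())
print(mapping)
print(sizes)
```

Because the input file keeps the export order, the sample index can be matched back to the user id in the same row of the exported text file, provided no rows were dropped or reordered in between.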