Cluster Analysis of User Viewing Habits

Data Mining Test Example

Users' viewing habits differ by hour of the day and by day of the week. The goal here is to cluster IPTV users according to their viewing duration in each hour.

Test sample:

Users in the Nanjing area who watched IPTV on June 6, 2013 (a Thursday, not a holiday)

Number of users: 269,745

Data preparation:

1. Create a temporary table

select s_userid, s_hour, s_timelen into tmp_user_hour_len from tst_fct_d20130606_4 where s_city_id=1

2. Generate the target table

-- One row per user, one column per hour.
-- sum(...) with group by collapses the per-hour rows of each user into a single row,
-- which matches the one-row-per-user data format used later.
select s_userid,
sum(case when s_hour='00' then s_timelen else 0 end) as hour00,
sum(case when s_hour='01' then s_timelen else 0 end) as hour01,
sum(case when s_hour='02' then s_timelen else 0 end) as hour02,
sum(case when s_hour='03' then s_timelen else 0 end) as hour03,
sum(case when s_hour='04' then s_timelen else 0 end) as hour04,
sum(case when s_hour='05' then s_timelen else 0 end) as hour05,
sum(case when s_hour='06' then s_timelen else 0 end) as hour06,
sum(case when s_hour='07' then s_timelen else 0 end) as hour07,
sum(case when s_hour='08' then s_timelen else 0 end) as hour08,
sum(case when s_hour='09' then s_timelen else 0 end) as hour09,
sum(case when s_hour='10' then s_timelen else 0 end) as hour10,
sum(case when s_hour='11' then s_timelen else 0 end) as hour11,
sum(case when s_hour='12' then s_timelen else 0 end) as hour12,
sum(case when s_hour='13' then s_timelen else 0 end) as hour13,
sum(case when s_hour='14' then s_timelen else 0 end) as hour14,
sum(case when s_hour='15' then s_timelen else 0 end) as hour15,
sum(case when s_hour='16' then s_timelen else 0 end) as hour16,
sum(case when s_hour='17' then s_timelen else 0 end) as hour17,
sum(case when s_hour='18' then s_timelen else 0 end) as hour18,
sum(case when s_hour='19' then s_timelen else 0 end) as hour19,
sum(case when s_hour='20' then s_timelen else 0 end) as hour20,
sum(case when s_hour='21' then s_timelen else 0 end) as hour21,
sum(case when s_hour='22' then s_timelen else 0 end) as hour22,
sum(case when s_hour='23' then s_timelen else 0 end) as hour23
into user_hour_len_nj_20130606
from tmp_user_hour_len
group by s_userid
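The reshaping in step 2 turns (user, hour, duration) rows into one 24-column row per user. As a rough illustration only, the same pivot can be sketched in Python. The function name `pivot_hourly` and the sample rows are hypothetical and are not taken from the real table; the sketch assumes hours are the strings '00' to '23', as in the SQL.

```python
from collections import defaultdict


def pivot_hourly(rows):
    """Pivot (user_id, hour, duration) rows into {user_id: [24 durations]}.

    Durations for a repeated (user, hour) pair are added together.
    """
    table = defaultdict(lambda: [0] * 24)
    for user_id, hour, duration in rows:
        table[user_id][int(hour)] += duration
    return dict(table)


# Hypothetical sample rows, for illustration only.
sample = [("u1", "21", 31), ("u1", "22", 12), ("u2", "19", 34), ("u2", "19", 5)]
wide = pivot_hourly(sample)
print(wide["u1"][21], wide["u1"][22], wide["u2"][19])  # 31 12 39
```

Hours in which a user has no row stay at 0, which corresponds to the `else 0` branch of each case expression.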

3. On the 211 server, export the table to a local file

bcp user_hour_len_nj_20130606 out user_hour_len_nj_20130606.txt -UXXX -PXXX -SXXX -c -t '|' -r '\n'

4. Take the first 200 instances for testing

Analysis method:

The k-means algorithm is used for the cluster analysis.
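For readers unfamiliar with k-means, the following is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm) in plain Python. It is not Weka's implementation: the function names are hypothetical, the initial centroids are simply k randomly sampled points, and no attribute normalization is applied. The defaults of 500 iterations and seed 10 only mirror the settings used later in this test.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=500, seed=10):
    """Return (assignments, centroids) for a list of numeric vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Two well-separated toy groups (illustrative data only).
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(points, k=2)
print(labels)
```

In the real test each point would be a user's 24 hourly durations rather than a 2-dimensional toy vector.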

Data source format:

Attribute set:

The attribute set describes the 24 hourly periods, in the following format (real may also be written as numeric):

@relation cluster

@attribute H00 real

@attribute H01 real

@attribute H02 real

@attribute H03 real

@attribute H04 real

@attribute H05 real

@attribute H06 real

@attribute H07 real

@attribute H08 real

@attribute H09 real

@attribute H10 real

@attribute H11 real

@attribute H12 real

@attribute H13 real

@attribute H14 real

@attribute H15 real

@attribute H16 real

@attribute H17 real

@attribute H18 real

@attribute H19 real

@attribute H20 real

@attribute H21 real

@attribute H22 real

@attribute H23 real

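Data set:

The data set contains one row per user, holding that user's viewing duration in each of the 24 hours, in the following format: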

@data

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,12,0

0,0,0,0,0,0,0,0,0,0,0,0,26,59,16,0,0,0,50,55,56,58,59,10

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,59,59,18,0

57,35,0,0,0,0,20,0,0,0,0,0,0,0,15,59,59,59,59,59,59,58,54,35

.....
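The exported text file has to be converted into the ARFF layout shown above. A minimal sketch of that conversion follows, assuming each exported line is the user ID followed by the 24 hourly values separated by '|' (the delimiter given to bcp). The function name `to_arff` is hypothetical, and the user ID is dropped so that the result has the 24 attributes listed above.

```python
def to_arff(lines, limit=200, relation="cluster"):
    """Build ARFF text from '|'-delimited lines: user id + 24 hourly values."""
    header = ["@relation " + relation, ""]
    header += ["@attribute H%02d real" % hour for hour in range(24)]
    header += ["", "@data"]
    rows = []
    for line in lines:
        if len(rows) >= limit:
            break  # keep only the first `limit` instances
        fields = line.strip().split("|")
        if len(fields) != 25:
            continue  # skip blank or malformed lines
        rows.append(",".join(fields[1:]))  # drop the user id column
    return "\n".join(header + rows) + "\n"


# Hypothetical exported lines, for illustration only.
exported = [
    "u1|" + "|".join(["0"] * 21 + ["31", "12", "0"]),
    "",
    "u2|" + "|".join(["0"] * 19 + ["34", "59", "59", "18", "0"]),
]
arff_text = to_arff(exported)
print(arff_text)
```

With real data, `lines` could be an open file object for the bcp output, and the returned text could be written to a `.arff` file for Weka.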

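Test procedure:

Open the Weka Explorer and use Open file to load the feature file (for example, example_cluster_ID_H24_200.arff). Switch to the Cluster tab, choose the SimpleKMeans algorithm, and select Euclidean distance (or similarity) function as the distance function. Set maxIterations=500, numClusters=5 (3 or 4 also work) and seed=10, then click Start.

With numClusters=5, the results are as follows: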

1)

The clustering output lists the number of instances in each cluster and that number as a percentage of the whole sample set.

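2)

Number of iterations: 7

Within cluster sum of squared errors: 228.6644541918032

Within cluster sum of squared errors measures the distance of instances from their cluster centers. The smaller this value, the tighter the clusters (although it also shrinks as the number of clusters grows, so it is only comparable between runs with the same number of clusters). With the number of clusters fixed, changing the seed changes this value; try several seeds and keep the one that gives the smallest within-cluster error.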
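To make the quantity concrete, here is a minimal sketch that computes a within-cluster sum of squared errors from raw values and a given assignment. The function name `within_cluster_sse` and the toy points are hypothetical. Weka may normalize attributes before measuring distance, so a figure computed this way on raw durations is not expected to match the number Weka reports.

```python
def within_cluster_sse(points, assignments):
    """Sum of squared distances from each point to its cluster's mean."""
    clusters = {}
    for point, label in zip(points, assignments):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            (x - c) ** 2 for p in members for x, c in zip(p, centroid)
        )
    return total


# Toy data: two tight pairs, ten units apart (illustrative only).
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
print(within_cluster_sse(points, [0, 0, 1, 1]))  # 4.0, the natural split
print(within_cluster_sse(points, [0, 1, 0, 1]))  # 100.0, a poor split
print(within_cluster_sse(points, [0, 1, 2, 3]))  # 0.0, one cluster per point
```

The last call shows why the value always falls as clusters are added: with one cluster per point the error is zero, which says nothing about clustering quality.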

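Parameter description:

The parameters are set in the SimpleKMeans parameter selection window. Each one is described below: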

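displayStdDevs: whether to display the standard deviations of numeric attributes and the counts of nominal attributes.
distanceFunction: the distance function used to compare instances (default: weka.core.EuclideanDistance). SimpleKMeans supports Euclidean and Manhattan distance.
dontReplaceMissingValues: whether to skip the global replacement of missing values with the mean/mode.
maxIterations: the maximum number of iterations.
numClusters: the number of clusters.
preserveInstancesOrder: whether to preserve the order of the instances.
seed: the random number seed.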

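Questions:

1. How do I find out which ID was assigned to which cluster?

A: For the training samples, right-click the clustering result and choose "Visualize cluster assignments". In the window that opens, click Save to write out an ARFF file. In that file, the last attribute of each instance ("@attribute Cluster") gives the cluster the instance was assigned to.

In addition, the first value of each row is the index of the training instance.

Part of such a file (save_file_ID2Class.arff) is shown below as an example: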

----------------------------------------------------------------------------------------------------------------

@attribute H22 numeric

@attribute H23 numeric

@attribute Cluster {cluster0,cluster1,cluster2,cluster3}

@data

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,12,0,cluster1

1,0,0,0,0,0,0,0,0,0,0,0,0,26,59,16,0,0,0,50,55,56,58,59,10,cluster2

2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,59,59,18,0,cluster2

3,57,35,0,0,0,0,20,0,0,0,0,0,0,0,15,59,59,59,59,59,59,58,54,35,cluster3

----------------------------------------------------------------------------------------------------------------
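Once the file is saved, the instance-to-cluster mapping can be read back programmatically. The sketch below assumes the layout shown in the excerpt above: after the @data line, each row starts with the instance index and ends with the cluster label. The function name `read_cluster_assignments` is hypothetical, and the parser is deliberately simple rather than a full ARFF reader.

```python
def read_cluster_assignments(arff_text):
    """Map instance index -> cluster label from a saved assignments ARFF."""
    assignments = {}
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if not in_data:
            # Everything before @data is header and is ignored.
            in_data = line.lower().startswith("@data")
            continue
        fields = line.split(",")
        # First field is the instance index, last field is the cluster.
        assignments[int(float(fields[0]))] = fields[-1]
    return assignments


# Rows copied from the excerpt above, with a shortened header.
saved = """@attribute H23 numeric
@attribute Cluster {cluster0,cluster1,cluster2,cluster3}
@data
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,12,0,cluster1
1,0,0,0,0,0,0,0,0,0,0,0,0,26,59,16,0,0,0,50,55,56,58,59,10,cluster2
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,59,59,18,0,cluster2
3,57,35,0,0,0,0,20,0,0,0,0,0,0,0,15,59,59,59,59,59,59,58,54,35,cluster3
"""
mapping = read_cluster_assignments(saved)
print(mapping)
```

Because the instance index follows the order of the rows in the input file, it can be joined back to the user IDs in the original export, provided the export is kept in the same row order.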
