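When people talk about recommendation algorithms, the one everyone knows is CF (collaborative filtering). This post uses ALS (alternating least squares), one of the CF algorithms, as a worked example.
CF covers quite a few algorithms, including item-based and user-based ones; ALS is the one based on matrix factorization. For a summary of recommendation algorithms, see my overview post, Recommendation.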
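To make "based on matrix factorization" concrete, here is the objective in its usual textbook form. The symbols are my own notation rather than anything from MLlib: r_{ui} is the observed rating of item i by user u, \Omega is the set of observed (user, item) pairs, x_u and y_i are the latent factor vectors, and \lambda is the regularization weight. MLlib's exact regularization scaling may differ from this sketch.

```latex
\min_{X,\,Y} \sum_{(u,i)\in\Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)
```

"Alternating" refers to how this is minimized: with all y_i held fixed the problem is an ordinary regularized least-squares problem in each x_u, and vice versa, so the algorithm switches between the two until the error stops improving.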
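First, a word about MLlib: it is a machine learning library that runs on Spark. Thanks to Spark's in-memory computation, model training time can be cut substantially.
As of Spark 1.0.0, MLlib already includes a range of algorithms; see the official guide at http://spark.apache.org/docs/latest/mllib-guide.html
As we know, collaborative filtering is a recommendation approach based on user behavior, so it needs users' ratings of items.
So we turn to the classic MovieLens dataset, available at http://grouplens.org/datasets/movielens/
Download ml-100k and unpack it. There are many files; the README describes the dataset.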
user id | item id | rating | timestamp
1 1 5 874965758
1 2 3 876893171
1 3 4 878542960
1 4 3 876893119
1 5 3 889751712
1 7 4 875071561
1 8 1 875072484
1 9 5 878543541
2 1 2 875072262
2 3 5 875071805
2 5 5 875071608
2 6 5 878543541
2 8 4 887432020
2 9 5 875071515
2 1 1 878542772
2 2 4 875072404
Each row holds a user's rating of a movie and the timestamp of that rating.
First, preprocess the data:
cat u1.base | awk -F "\t" '{print $1"::"$2"::"$3"::"$4}' > ratings.dat
cat u.item | awk -F "|" '{print $1"::"$2"::"$3}' > movies.dat
Resulting data:
user id::movie id::rating::timestamp
1::1::5::874965758
1::5::3::889751712
1::7::4::875071561
1::8::1::875072484
1::9::5::878543541
2::258::3::888549961
2::269::4::888550774
2::272::5::888979061
2::273::4::888551647
2::274::3::888551497
movie id::movie title::release date
1::Toy Story (1995)::01-Jan-1995
2::GoldenEye (1995)::01-Jan-1995
3::Four Rooms (1995)::01-Jan-1995
4::Get Shorty (1995)::01-Jan-1995
5::Copycat (1995)::01-Jan-1995
6::Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)::01-Jan-1995
7::Twelve Monkeys (1995)::01-Jan-1995
8::Babe (1995)::01-Jan-1995
9::Dead Man Walking (1995)::01-Jan-1995
10::Richard III (1995)::22-Jan-1996
11::Seven (Se7en) (1995)::01-Jan-1995
12::Usual Suspects, The (1995)::14-Aug-1995
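OK, next we run the recommendation using the ALS example from the official documentation.
First import the MLlib classes; we need the ALS algorithm class and the Rating class: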
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
// Load the data
val data = sc.textFile("/app/hadoop/ml-100k/ratings.dat")
// Each line is split on "::" into an array; the pattern match then builds a Rating object from it
val ratings = data.map(_.split("::") match { case Array(user, item, rate, ts) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})
The result is an RDD of Rating objects:
scala> ratings take 2
......
14/06/25 17:51:06 INFO scheduler.DAGScheduler: Computing the requested partition locally
14/06/25 17:51:06 INFO rdd.HadoopRDD: Input split: file:/app/hadoop/ml-100k/ratings.dat:0+1826544
14/06/25 17:51:07 INFO spark.SparkContext: Job finished: take at <console>:22, took 0.062239021 s
res0: Array[org.apache.spark.mllib.recommendation.Rating] = Array(Rating(1,1,5.0), Rating(1,2,3.0))
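The pattern match above throws a MatchError on any line that does not split into exactly four fields. A more defensive parser could look like the sketch below. It is plain Scala so that it runs without Spark; ParsedRating and parseRating are hypothetical names introduced here for illustration, with ParsedRating standing in for MLlib's Rating.

```scala
import scala.util.Try

// Hypothetical stand-in for MLlib's Rating, so the sketch runs without Spark.
final case class ParsedRating(user: Int, product: Int, rating: Double)

// Returns None instead of throwing when a line does not have exactly four
// "::"-separated fields, or when a field is not numeric.
def parseRating(line: String): Option[ParsedRating] =
  line.trim.split("::") match {
    case Array(user, item, rate, _) =>
      Try(ParsedRating(user.toInt, item.toInt, rate.toDouble)).toOption
    case _ => None
  }

// Two well-formed lines from ratings.dat plus one malformed line.
val sample = Seq("1::1::5::874965758", "1::2::3::876893171", "bad line")

// Malformed lines are silently dropped; only valid ratings remain.
val parsed = sample.flatMap(line => parseRating(line).toList)
```

Whether to drop bad lines silently or to count and log them depends on how much you trust the input; for the cleaned ml-100k files the simple pattern match is enough.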
// Use 10 latent factors
scala> val rank = 10
rank: Int = 10
// Run 30 iterations
scala> val numIterations = 30
numIterations: Int = 30
Next, call ALS.train() to train the model; the last argument, 0.01, is the regularization parameter lambda:
val model = ALS.train(ratings, rank, numIterations, 0.01)
14/06/25 17:53:04 INFO storage.MemoryStore: ensureFreeSpace(200) called with curMem=84002, maxMem=308713881
14/06/25 17:53:04 INFO storage.MemoryStore: Block broadcast_60 stored as values to memory (estimated size 200.0 B, free 294.3 MB)
model: org.apache.spark.mllib.recommendation.MatrixFactorizationModel = org.apache.spark.mllib.recommendation.MatrixFactorizationModel@17596ee0
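To give a feel for what the training call does, here is a deliberately tiny, single-machine sketch of the alternating idea for rank 1, where every user and every item gets a single number instead of a vector of `rank` numbers. It is not MLlib's implementation, which is distributed and works with rank-k factors; alsRank1 and all its parameter names are made up for this illustration.

```scala
// Observed ratings are (userIndex, itemIndex, rating) with 0-based indices.
// Rank-1 ALS: user u gets a scalar x(u), item i a scalar y(i), and a rating
// is approximated by x(u) * y(i).
def alsRank1(obs: Seq[(Int, Int, Double)], numUsers: Int, numItems: Int,
             iterations: Int, lambda: Double): (Array[Double], Array[Double]) = {
  val x = Array.fill(numUsers)(1.0)
  val y = Array.fill(numItems)(1.0)
  for (_ <- 1 to iterations) {
    // Hold y fixed: each x(u) has a closed-form regularized least-squares solution.
    for (u <- 0 until numUsers) {
      val mine = obs.filter(_._1 == u)
      val num = mine.map { case (_, i, r) => r * y(i) }.sum
      val den = mine.map { case (_, i, _) => y(i) * y(i) }.sum + lambda
      if (den > 0) x(u) = num / den
    }
    // Hold x fixed: solve for each y(i) in the same way.
    for (i <- 0 until numItems) {
      val mine = obs.filter(_._2 == i)
      val num = mine.map { case (u, _, r) => r * x(u) }.sum
      val den = mine.map { case (u, _, _) => x(u) * x(u) }.sum + lambda
      if (den > 0) y(i) = num / den
    }
  }
  (x, y)
}

// Toy data: user 1 rates exactly twice what user 0 does, but has not rated item 2.
val observed = Seq((0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 2.0), (1, 1, 4.0))
val (userF, itemF) = alsRank1(observed, numUsers = 2, numItems = 3,
                              iterations = 50, lambda = 0.0)

// Predicted rating for the missing (user 1, item 2) entry.
val missing = userF(1) * itemF(2)
```

On this toy data the missing entry should come out very close to 6, twice user 0's rating of the same item. The arguments line up with the ALS.train call above: rank is fixed at 1 here, iterations plays the role of numIterations, and lambda is the 0.01 passed as the last argument.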
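After training, we compare the predictions against the actual ratings. Here we simply reuse the training set as the test set (this measures fit on data the model has already seen, so it will understate the error on unseen ratings):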
scala> val usersProducts = ratings.map { case Rating(user, product, rate) =>
| (user, product)
| }
usersProducts: org.apache.spark.rdd.RDD[(Int, Int)] = MappedRDD[623] at map at <console>:21
// Predict ratings and key them by (user, movie)
scala> val predictions =
| model.predict(usersProducts).map { case Rating(user, product, rate) =>
| ((user, product), rate)
| }
predictions: org.apache.spark.rdd.RDD[((Int, Int), Double)] = MappedRDD[632] at map at <console>:30
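We judge the model by its mean squared error (MSE): the smaller the value, the closer the predictions are to the actual ratings.
Join the two datasets, then compute:
// original {(user, movie), rating} join predicted {(user, movie), rating}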
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
((user, product), rate)
}.join(predictions)
ratesAndPreds.collect take 3
14/06/25 17:59:35 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 632.0, whose tasks have all completed, from pool
14/06/25 17:59:35 INFO scheduler.DAGScheduler: Stage 632 (collect at <console>:34) finished in 1.906 s
14/06/25 17:59:35 INFO spark.SparkContext: Job finished: collect at <console>:34, took 1.939437725 s
res11: Array[((Int, Int), (Double, Double))] = Array(((933,627),(2.0,1.6977799770529198)), ((537,24),(1.0,2.3191609228008327)), ((717,125),(4.0,3.795616142104737)))
The joined result pairs each user's actual rating of a movie with the predicted rating, for example:
(user, movie),(actual rating, predicted rating)
(933,627),(2.0,1.6977799770529198)
(537,24),(1.0,2.3191609228008327)
(717,125),(4.0,3.795616142104737)
......
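If the join semantics are unfamiliar: a join on key-value pairs keeps only the keys that appear on both sides and pairs up their values. The sketch below imitates that on plain Scala collections; innerJoin is a hypothetical helper written for this post, not a Spark API.

```scala
// Inner join on plain Seq[(K, V)] collections: for every key present on both
// sides, emit (key, (leftValue, rightValue)). A key that occurs several times
// produces every combination of its values.
def innerJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] = {
  // Index the right side by key so each left element needs only one lookup.
  val rightByKey: Map[K, Seq[B]] =
    right.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  for {
    (k, a) <- left
    b <- rightByKey.getOrElse(k, Seq.empty)
  } yield (k, (a, b))
}

// Actual ratings keyed by (user, movie), taken from the output above.
val actual = Seq(((933, 627), 2.0), ((537, 24), 1.0), ((717, 125), 4.0))
// Predictions for two of those keys only; (717, 125) is deliberately left out.
val predicted = Seq(((933, 627), 1.6977799770529198), ((537, 24), 2.3191609228008327))

// Only the two keys present on both sides survive the join.
val joined = innerJoin(actual, predicted)
```

Because the predictions in this post are generated for exactly the (user, movie) pairs in the ratings, nothing is dropped by the join in the real run.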
Finally, compute the mean squared error:
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
val err = (r1 - r2)
err * err
}.mean()
14/06/25 18:02:28 INFO scheduler.TaskSetManager: Finished TID 79 in 554 ms on localhost (progress: 1/1)
14/06/25 18:02:28 INFO scheduler.DAGScheduler: Stage 702 (mean at <console>:36) finished in 0.556 s
14/06/25 18:02:28 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 702.0, whose tasks have all completed, from pool
14/06/25 18:02:28 INFO spark.SparkContext: Job finished: mean at <console>:36, took 0.585592521 s
MSE: Double = 0.4254804561682655
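MSE and RMSE are easy to mix up, so here is a small plain-Scala sketch of both on an in-memory list of (actual, predicted) pairs. The function names mse and rmse are mine; the sample pairs are the three shown in the join output earlier.

```scala
// Mean squared error: the average of the squared differences.
def mse(pairs: Seq[(Double, Double)]): Double = {
  require(pairs.nonEmpty, "need at least one (actual, predicted) pair")
  pairs.map { case (actual, predicted) =>
    val err = actual - predicted
    err * err
  }.sum / pairs.size
}

// Root mean squared error: the square root of the MSE, expressed in the same
// units as the ratings themselves.
def rmse(pairs: Seq[(Double, Double)]): Double = math.sqrt(mse(pairs))

// (actual, predicted) pairs copied from the join output above.
val samplePairs = Seq(
  (2.0, 1.6977799770529198),
  (1.0, 2.3191609228008327),
  (4.0, 3.795616142104737)
)
val sampleMse = mse(samplePairs)
val sampleRmse = rmse(samplePairs)
```

Taking the square root of the MSE reported above (about 0.425) gives an RMSE of roughly 0.65, meaning predictions on the training data are off by about two thirds of a star on a five-star scale.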
Incidentally, predict has three overloads; the one used above is the second.
The model's predict API:
scala> model.predict
def predict(user: Int, product: Int): Double
def predict(usersProducts: org.apache.spark.rdd.RDD[(Int, Int)]): org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating]
def predict(usersProductsJRDD: org.apache.spark.api.java.JavaRDD[Array[Byte]]): org.apache.spark.api.java.JavaRDD[Array[Byte]]
We can also pass a user id and a product id to predict a particular user's rating of a particular movie:
model.predict(1,2)
14/06/25 18:17:36 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 963.0, whose tasks have all completed, from pool
14/06/25 18:17:36 INFO scheduler.DAGScheduler: Stage 963 (lookup at MatrixFactorizationModel.scala:46) finished in 0.035 s
14/06/25 18:17:36 INFO spark.SparkContext: Job finished: lookup at MatrixFactorizationModel.scala:46, took 0.066978561 s
res13: Double = 2.9204503120927363
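Conceptually, a matrix factorization model scores a (user, product) pair as the dot product of the user's factor vector and the product's factor vector; as far as I understand, that is what the single-pair predict does after looking up the two vectors. A minimal sketch, with predictRating and the example vectors invented for illustration (they are not values from the trained model):

```scala
// Score a (user, item) pair as the dot product of their latent factor vectors.
def predictRating(userFactors: Array[Double], itemFactors: Array[Double]): Double = {
  require(userFactors.length == itemFactors.length, "factor vectors must have the same rank")
  userFactors.zip(itemFactors).map { case (u, i) => u * i }.sum
}

// Made-up rank-3 factor vectors, purely for illustration.
val exampleUser = Array(1.0, 0.5, 2.0)
val exampleItem = Array(2.0, 1.0, 0.25)
val score = predictRating(exampleUser, exampleItem)
```

With rank = 10 as in this post, each vector would have ten entries instead of three.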
With the model in hand, how to design the recommender system depends on your actual requirements :)
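One common design, sketched here only as an assumption about what such a system might do, is to score every candidate movie for a user, drop the ones the user has already rated, and return the N highest-scoring ones. topN and its parameters are hypothetical names; the scores would come from the model's predictions.

```scala
// Given predicted scores per item id, return the n best items the user has
// not rated yet, highest score first.
def topN(scores: Map[Int, Double], alreadyRated: Set[Int], n: Int): Seq[(Int, Double)] =
  scores.toSeq
    .filterNot { case (item, _) => alreadyRated.contains(item) }
    .sortBy { case (_, score) => -score }
    .take(n)

// Made-up predicted scores for four movie ids; the user has already rated movie 3.
val candidateScores = Map(1 -> 4.5, 2 -> 3.0, 3 -> 4.9, 4 -> 2.0)
val recommendations = topN(candidateScores, alreadyRated = Set(3), n = 2)
```

Movie 3 has the highest score but is excluded because the user has already rated it, so the result contains movies 1 and 2.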
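Summary:
MLlib takes full advantage of Spark's fast in-memory computation and efficient iteration, which greatly improves the performance of training machine learning models. This is a large part of why Spark has become so popular recently.
MLlib's algorithm library is still fairly small, but Mahout has announced that it will no longer accept new MapReduce-based algorithm implementations and is moving to Spark, so Spark looks set to play a major role in machine learning. As for why ALS was the first collaborative filtering algorithm to be supported, it is probably because ALS parallelizes well and can therefore make better use of Spark.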