Item-Based Collaborative Filtering (a Spark Scala Implementation)

Spark ML does not currently ship a neighborhood-based collaborative filtering recommender (its built-in recommender is ALS matrix factorization). Drawing on the theory of neighborhood-based collaborative filtering, this article implements an item-based collaborative filtering recommender on top of Spark ML. The similarity model computes item-to-item similarities and supports three measures: co-occurrence similarity, Euclidean distance similarity, and cosine similarity; a sketch of the three measures follows the caption below. Figure 1 shows the item similarity matrix being computed from the user rating matrix with the co-occurrence measure.

Figure 1. Computing the item similarity matrix from the user rating matrix with the co-occurrence measure
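For reference, the three measures can be written as plain Scala functions; this is an illustrative sketch (the helper names and signatures are mine, not from the companion code, and the article below implements only the co-occurrence measure):

import scala.math.sqrt

// co-occurrence similarity: nI = users who rated item i, nJ = users who rated item j,
// nIJ = users who rated both items
def cooccurrenceSim(nI: Long, nJ: Long, nIJ: Long): Double =
  nIJ / sqrt(nI.toDouble * nJ.toDouble)

// Euclidean distance similarity over the two items' ratings from their common users,
// mapped into (0, 1] so that a smaller distance yields a larger similarity
def euclideanSim(prefI: Array[Double], prefJ: Array[Double]): Double = {
  val dist = sqrt(prefI.zip(prefJ).map { case (a, b) => (a - b) * (a - b) }.sum)
  1.0 / (1.0 + dist)
}

// cosine similarity over the two items' rating vectors
def cosineSim(prefI: Array[Double], prefJ: Array[Double]): Double = {
  val dot = prefI.zip(prefJ).map { case (a, b) => a * b }.sum
  val norm = sqrt(prefI.map(a => a * a).sum) * sqrt(prefJ.map(b => b * b).sum)
  if (norm == 0.0) 0.0 else dot / norm
}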

Note: the code in this article is run in a Zeppelin notebook.

1. Data Preparation

Data source: MovieLens (https://grouplens.org/datasets/movielens/). MovieLens offers several editions (e.g. 1M, 10M, 20M); the code below uses the ml-latest-small dataset.

1.1 Import dependencies

import scala.math._
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import scala.collection.mutable.WrappedArray
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

1.2 Load the item metadata table

Load movies.csv into a DataFrame:

val item_conf_path = "hdfs://mycluster/user/***/ml-latest-small/movies.csv"
val item_conf_df = spark.read.options(Map(("delimiter", ","), ("header", "true"))).csv(item_conf_path)
item_conf_df.show(5)

Output:

+-------+--------------------+--------------------+
|movieId| title| genres|
+-------+--------------------+--------------------+
| 1| Toy Story (1995)|Adventure|Animati...|
| 2| Jumanji (1995)|Adventure|Childre...|
| 3|Grumpier Old Men ...| Comedy|Romance|
| 4|Waiting to Exhale...|Comedy|Drama|Romance|
| 5|Father of the Bri...| Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows

Build the (movieId → title) map:

val item_id2title_map = item_conf_df.select("movieId", "title").collect().map(row => (row(0).toString(), row(1).toString())).toMap

Build the (movieId → genres) map:

val item_id2genres_map = item_conf_df.select("movieId", "genres").collect().map(row => (row(0).toString(), row(1).toString())).toMap

Broadcast both maps to the executors:

val item_id2title_map_BC = spark.sparkContext.broadcast(item_id2title_map)
val item_id2genres_map_BC = spark.sparkContext.broadcast(item_id2genres_map)

1.3 Load the user rating data

val user_rating_path = "hdfs://mycluster/user/***/ml-latest-small/ratings.csv"
val user_rating_df = spark.read.options(Map(("delimiter", ","), ("header", "true"))).csv(user_rating_path)
user_rating_df.show(5)

Output:

user_rating_path: String = hdfs://mycluster/user/***/ml-latest-small/ratings.csv
user_rating_df: org.apache.spark.sql.DataFrame = [userId: string, movieId: string ... 2 more fields]
+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
| 1| 1| 4.0|964982703|
| 1| 3| 4.0|964981247|
| 1| 6| 4.0|964982224|
| 1| 47| 5.0|964983815|
| 1| 50| 5.0|964982931|
+------+-------+------+---------+

Inspect the column types of user_rating_df:

user_rating_df.dtypes

Output:

res210: Array[(String, String)] = Array((userId,StringType), (movieId,StringType), (rating,StringType), (timestamp,StringType))

Convert the types (Spark's CSV reader leaves every column as StringType unless schema inference is enabled):

// target type after conversion
case class ItemPref(userid: String, itemid: String, pref: Double)

// convert each row: keep userid and itemid as strings, parse rating as Double
val user_ds = user_rating_df.map {
  case Row(userId: String, movieId: String, rating: String, timestamp: String) =>
    ItemPref(userId, movieId, rating.toDouble)
}
println("user_ds.show(10)")
user_ds.show(10)
user_ds.cache()
user_ds.count()

Output:

defined class ItemPref
user_ds: org.apache.spark.sql.Dataset[ItemPref] = [userid: string, itemid: string ... 1 more field]
user_ds.show(10)
+------+------+----+
|userid|itemid|pref|
+------+------+----+
| 1| 1| 4.0|
| 1| 3| 4.0|
| 1| 6| 4.0|
| 1| 47| 5.0|
| 1| 50| 5.0|
| 1| 70| 3.0|
| 1| 101| 5.0|
| 1| 110| 4.0|
| 1| 151| 5.0|
| 1| 157| 5.0|
+------+------+----+
only showing top 10 rows
res213: user_ds.type = [userid: string, itemid: string ... 1 more field]
res214: Long = 100836

2. Similarity Computation

The distributed computation of the co-occurrence similarity matrix works as follows.

  • First, group by user id to collect the set of items each user has rated;
  • then flatMap over each user's item set, emitting every (item, item) pair but generating only the upper triangle;
  • next, group by the (item, item) pair to count the number of users in which the two items co-occur;
  • finally, apply the co-occurrence similarity formula w(i,j) = |N(i) ∩ N(j)| / sqrt(|N(i)| × |N(j)|), where the numerator is the number of users who rated both i and j, and |N(i)| and |N(j)| are the numbers of users who rated i and j respectively. This yields the similarity for every upper-triangle pair; a small worked example follows the figure caption, and the overall process is shown in Figure 2.

Figure 2. Distributed computation of the co-occurrence similarity matrix
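To make the formula concrete, a tiny worked example with hypothetical counts: if 2 users rated item i, 8 users rated item j, and both items appear together in 2 users' lists, then w(i,j) = 2 / sqrt(2 × 8) = 2 / 4 = 0.5.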

The code for the distributed co-occurrence similarity computation follows, step by step.

First, group by user id to collect the set of items each user has rated.

// (user, item) => (user, set of items)
val user_ds1 = user_ds.groupBy("userid").agg(collect_set("itemid")).withColumnRenamed("collect_set(itemid)", "itemid_set")
user_ds1.show(5)

Output:

user_ds1: org.apache.spark.sql.DataFrame = [userid: string, itemid_set: array<string>]
+------+--------------------+
|userid| itemid_set|
+------+--------------------+
| 296|[110, 65261, 356,...|
| 467|[389, 41, 780, 23...|
| 125|[151695, 62434, 7...|
| 451|[376, 1, 1356, 14...|
| 124|[110, 1, 296, 795...|
+------+--------------------+

Then flatMap over each user's item set to generate every (item, item) pair, keeping only the upper triangle.

// (item, item) pairs, upper triangle only
val user_ds2 = user_ds1.flatMap { row =>
  // sorting (lexicographically) ensures itemidI < itemidJ, i.e. the upper triangle
  val itemlist = row.getAs[scala.collection.mutable.WrappedArray[String]](1).toArray.sorted
  val result = new ArrayBuffer[(String, String, Double)]()
  for (i <- 0 to itemlist.length - 2) {
    for (j <- i + 1 to itemlist.length - 1) {
      result += ((itemlist(i), itemlist(j), 1.0))
    }
  }
  result
}.withColumnRenamed("_1", "itemidI").withColumnRenamed("_2", "itemidJ").withColumnRenamed("_3", "score")

user_ds2.show(5)

Output:

user_ds2: org.apache.spark.sql.DataFrame = [itemidI: string, itemidJ: string ... 1 more field]
+-------+-------+-----+
|itemidI|itemidJ|score|
+-------+-------+-----+
| 110| 1201| 1.0|
| 110| 160848| 1.0|
| 110| 166528| 1.0|
| 110| 169034| 1.0|
| 110| 1704| 1.0|
+-------+-------+-----+

Next, group by the (item, item) pair to obtain each pair's co-occurrence count.

// co-occurrence counts for the upper-triangle (item, item) pairs
val user_ds3 = user_ds2.groupBy("itemidI", "itemidJ").agg(sum("score").as("sumIJ"))
user_ds3.show(5)

Output:

user_ds3: org.apache.spark.sql.DataFrame = [itemidI: string, itemidJ: string ... 1 more field]
+-------+-------+-----+
|itemidI|itemidJ|sumIJ|
+-------+-------+-----+
| 171765| 79132| 2.0|
| 2028| 5618| 46.0|
| 2334| 2384| 3.0|
| 100083| 89190| 1.0|
| 100882| 162578| 1.0|
+-------+-------+-----+

Compute each item's total frequency, i.e. N(i) and N(j):

val user_ds0 = user_ds.withColumn("score", lit(1)).groupBy("itemid").agg(sum("score").as("score"))
user_ds0.show(5)

Output:

// compute N(i)
user_ds0: org.apache.spark.sql.DataFrame = [itemid: string, score: bigint]
+------+-----+
|itemid|score|
+------+-----+
| 296| 307|
| 1090| 63|
|115713| 28|
| 3210| 42|
| 88140| 32|
+------+-----+
// compute N(j): rename the columns for the join
user_ds0.withColumnRenamed("itemid", "itemidJ").withColumnRenamed("score", "sumJ").select("itemidJ", "sumJ").show(5)

Output:

+-------+----+
|itemidJ|sumJ|
+-------+----+
| 296| 307|
| 1090| 63|
| 115713| 28|
| 3210| 42|
| 88140| 32|
+-------+----+

Compute the co-occurrence similarity by joining in N(j) and N(i), then applying the formula:

val user_ds4 = user_ds3.join(user_ds0.withColumnRenamed("itemid", "itemidJ").withColumnRenamed("score", "sumJ").select("itemidJ", "sumJ"), "itemidJ")
user_ds4.show(5)

Output:

user_ds4: org.apache.spark.sql.DataFrame = [itemidJ: string, itemidI: string ... 2 more fields]
+-------+-------+-----+----+
|itemidJ|itemidI|sumIJ|sumJ|
+-------+-------+-----+----+
| 79132| 171765| 2.0| 143|
| 5618| 2028| 46.0| 87|
| 2384| 2334| 3.0| 29|
| 89190| 100083| 1.0| 2|
| 162578| 100882| 1.0| 6|
+-------+-------+-----+----+
user_ds0.withColumnRenamed("itemid", "itemidI").withColumnRenamed("score", "sumI").select("itemidI", "sumI").show(5)

Output:

+-------+----+
|itemidI|sumI|
+-------+----+
| 296| 307|
| 1090| 63|
| 115713| 28|
| 3210| 42|
| 88140| 32|
+-------+----+
val user_ds5 = user_ds4.join(user_ds0.withColumnRenamed("itemid", "itemidI").withColumnRenamed("score", "sumI").select("itemidI", "sumI"), "itemidI")
user_ds5.show(5)

Output:

user_ds5: org.apache.spark.sql.DataFrame = [itemidI: string, itemidJ: string ... 3 more fields]
+-------+-------+-----+----+----+
|itemidI|itemidJ|sumIJ|sumJ|sumI|
+-------+-------+-----+----+----+
| 171765| 79132| 2.0| 143| 2|
| 2028| 5618| 46.0| 87| 188|
| 2334| 2384| 3.0| 29| 13|
| 100083| 89190| 1.0| 2| 3|
| 100882| 162578| 1.0| 6| 2|
+-------+-------+-----+----+----+
// apply w(i,j) = |N(i) ∩ N(j)| / sqrt(|N(i)| × |N(j)|): the numerator is the co-occurrence count of i and j (sumIJ); |N(i)| is i's frequency (sumI) and |N(j)| is j's frequency (sumJ)
val user_ds6 = user_ds5.withColumn("result", col("sumIJ") / sqrt(col("sumI") * col("sumJ")))
user_ds6.show(5)

Output:

user_ds6: org.apache.spark.sql.DataFrame = [itemidI: string, itemidJ: string ... 4 more fields]
+-------+-------+-----+----+----+-------------------+
|itemidI|itemidJ|sumIJ|sumJ|sumI| result|
+-------+-------+-----+----+----+-------------------+
| 171765| 79132| 2.0| 143| 2|0.11826247919781652|
| 2028| 5618| 46.0| 87| 188|0.35968247729147257|
| 2334| 2384| 3.0| 29| 13| 0.1545078607873814|
| 100083| 89190| 1.0| 2| 3| 0.4082482904638631|
| 100882| 162578| 1.0| 6| 2| 0.2886751345948129|
+-------+-------+-----+----+----+-------------------+
// merge the upper and lower triangles into the full similarity matrix
println(s"user_ds6.count(): ${user_ds6.count()}")
val user_ds7 = user_ds6.select("itemidI", "itemidJ", "result").union(user_ds6.select($"itemidJ".as("itemidI"), $"itemidI".as("itemidJ"), $"result"))
println(s"user_ds7.count(): ${user_ds7.count()}")
user_ds7.show(5)

Output:

user_ds6.count(): 13157672
user_ds7: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [itemidI: string, itemidJ: string ... 1 more field]
user_ds7.count(): 26315344
+-------+-------+-------------------+
|itemidI|itemidJ| result|
+-------+-------+-------------------+
| 171765| 79132|0.11826247919781652|
| 2028| 5618|0.35968247729147257|
| 2334| 2384| 0.1545078607873814|
| 100083| 89190| 0.4082482904638631|
| 100882| 162578| 0.2886751345948129|
+-------+-------+-------------------+
// assemble the result

// item-to-item similarity record
case class ItemSimi(itemidI: String, itemidJ: String, similar: Double)

val out = user_ds7.select("itemidI", "itemidJ", "result").map { row =>
  val itemidI = row.getString(0)
  val itemidJ = row.getString(1)
  val similar = row.getDouble(2)
  ItemSimi(itemidI, itemidJ, similar)
}

out.show(5)

Output:

out: org.apache.spark.sql.Dataset[ItemSimi] = [itemidI: string, itemidJ: string ... 1 more field]
+-------+-------+-------------------+
|itemidI|itemidJ| similar|
+-------+-------+-------------------+
| 171765| 79132|0.11826247919781652|
| 2028| 5618|0.35968247729147257|
| 2334| 2384| 0.1545078607873814|
| 100083| 89190| 0.4082482904638631|
| 100882| 162578| 0.2886751345948129|
+-------+-------+-------------------+
// attach titles and genres to the co-occurrence similarities
val items_similar_cooccurrence = out.map {
  case ItemSimi(itemidI: String, itemidJ: String, similar: Double) =>
    val i_title = item_id2title_map_BC.value.getOrElse(itemidI, "")
    val j_title = item_id2title_map_BC.value.getOrElse(itemidJ, "")
    val i_genres = item_id2genres_map_BC.value.getOrElse(itemidI, "")
    val j_genres = item_id2genres_map_BC.value.getOrElse(itemidJ, "")
    (itemidI, itemidJ, similar, i_title, j_title, i_genres, j_genres)
}.withColumnRenamed("_1", "itemidI").
  withColumnRenamed("_2", "itemidJ").
  withColumnRenamed("_3", "similar").
  withColumnRenamed("_4", "i_title").
  withColumnRenamed("_5", "j_title").
  withColumnRenamed("_6", "i_genres").
  withColumnRenamed("_7", "j_genres")

Output:

items_similar_cooccurrence: org.apache.spark.sql.DataFrame = [itemidI: string, itemidJ: string ... 5 more fields]
+-------+-------+-------------------+--------------------+--------------------+--------------------+--------------------+
|itemidI|itemidJ| similar| i_title| j_title| i_genres| j_genres|
+-------+-------+-------------------+--------------------+--------------------+--------------------+--------------------+
| 171765| 79132|0.11826247919781652| Okja (2017)| Inception (2010)|Action|Adventure|...|Action|Crime|Dram...|
| 2028| 5618|0.35968247729147257|Saving Private Ry...|Spirited Away (Se...| Action|Drama|War|Adventure|Animati...|
| 2334| 2384| 0.1545078607873814| Siege, The (1998)|Babe: Pig in the ...| Action|Thriller|Adventure|Childre...|
| 100083| 89190| 0.4082482904638631| Movie 43 (2013)|Conan the Barbari...| Comedy|Action|Adventure|...|
| 100882| 162578| 0.2886751345948129|Journey to the We...|Kubo and the Two ...|Adventure|Comedy|...|Adventure|Animati...|
+-------+-------+-------------------+--------------------+--------------------+--------------------+--------------------+

3. Recommendation Computation

The item-based collaborative filtering recommendation model is implemented on Spark ML: given the item similarity model W, the user rating matrix A, and a maximum number of recommendations K, it produces per-user recommendations. As Figure 3 shows, the user recommendation list is computed from the item similarity matrix and the user ratings as R = W × A; items the user has already rated are excluded, and the remainder are recommended in descending order of score.

Figure 3. Collaborative recommendation computation

The recommendation step is likewise distributed: the similarity table and the user rating table are first joined on item id, producing a (rating × similarity) contribution per user-item pair; the contributions are then summed with a group by; finally, items the user has already rated are filtered out and the rest are recommended in descending score order. The process is shown in Figure 4, and a small worked example follows the caption.

Figure 4. Distributed computation of collaborative filtering recommendations
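For intuition, a worked example with hypothetical numbers: suppose sim(i, j) = 0.5 and sim(i, k) = 0.2, and a user has rated item i with 4.0 but has rated neither j nor k. Then score(j) = 4.0 × 0.5 = 2.0 and score(k) = 4.0 × 0.2 = 0.8, so j is recommended ahead of k; when several rated items are similar to the same candidate, their contributions are summed by the group by.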

The concrete code is as follows:

// fetch the item similarities
val items_similar_ds1 = items_similar_cooccurrence.select("itemidI", "itemidJ", "similar").map {
  case Row(itemidI: String, itemidJ: String, similar: Double) =>
    ItemSimi(itemidI, itemidJ, similar)
}

items_similar_ds1.show(5)

Output:

items_similar_ds1: org.apache.spark.sql.Dataset[ItemSimi] = [itemidI: string, itemidJ: string ... 1 more field]
+-------+-------+-------------------+
|itemidI|itemidJ| similar|
+-------+-------+-------------------+
| 171765| 79132|0.11826247919781652|
| 2028| 5618|0.35968247729147257|
| 2334| 2384| 0.1545078607873814|
| 100083| 89190| 0.4082482904638631|
| 100882| 162578| 0.2886751345948129|
+-------+-------+-------------------+

Recall candidate items similar to those each user has rated. The join below uses user_prefer_ds1, whose presumed definition is sketched next.
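The original article never defines user_prefer_ds1; presumably it is just the (userid, itemid, pref) rating Dataset built in section 1.3. A minimal sketch under that assumption (this line is my reconstruction, not from the original):

// presumed definition (not in the original article): the user rating table from section 1.3
val user_prefer_ds1 = user_ds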

val user_prefer_ds2 = items_similar_ds1.join(user_prefer_ds1, $"itemidI" === $"itemid", "inner")
user_prefer_ds2.show(5)

Output:

user_prefer_ds2: org.apache.spark.sql.DataFrame = [itemidI: string, itemidJ: string ... 4 more fields]
+-------+-------+-------------------+------+------+----+
|itemidI|itemidJ| similar|userid|itemid|pref|
+-------+-------+-------------------+------+------+----+
| 171765| 79132|0.11826247919781652| 414|171765| 4.0|
| 171765| 79132|0.11826247919781652| 296|171765| 3.5|
| 2028| 5618|0.35968247729147257| 610| 2028| 5.0|
| 2028| 5618|0.35968247729147257| 608| 2028| 4.5|
| 2028| 5618|0.35968247729147257| 607| 2028| 5.0|
+-------+-------+-------------------+------+------+----+

Compute the score of each recalled (user, item) pair as rating × similarity:

val user_prefer_ds3 = user_prefer_ds2.withColumn("score", col("pref") * col("similar")).select("userid", "itemidJ", "score")
user_prefer_ds3.show(5)

Output:

user_prefer_ds3: org.apache.spark.sql.DataFrame = [userid: string, itemidJ: string ... 1 more field]
+------+-------+-------------------+
|userid|itemidJ| score|
+------+-------+-------------------+
| 414| 79132|0.47304991679126607|
| 296| 79132| 0.4139186771923578|
| 610| 5618| 1.798412386457363|
| 608| 5618| 1.6185711478116265|
| 607| 5618| 1.798412386457363|
+------+-------+-------------------+

Aggregate the scores per user and candidate item:

val user_prefer_ds4 = user_prefer_ds3.groupBy("userid", "itemidJ").agg(sum("score").as("score")).withColumnRenamed("itemidJ", "itemid")
user_prefer_ds4.show(5)

Output:

user_prefer_ds4: org.apache.spark.sql.DataFrame = [userid: string, itemid: string ... 1 more field]
+------+------+------------------+
|userid|itemid| score|
+------+------+------------------+
| 105| 5618| 693.772655810524|
| 83| 5618|121.67206123775786|
| 264| 62434| 37.52938682929201|
| 110| 62434| 32.87763551021598|
| 241|149902|16.321437468092682|
+------+------+------------------+

Remove items the user has already rated: a left join against the rating table keeps only the rows where pref is null:

val user_prefer_ds5 = user_prefer_ds4.join(user_prefer_ds1, Seq("userid", "itemid"), "left").where("pref is null")
user_prefer_ds5.show(5)

Output:

user_prefer_ds5: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userid: string, itemid: string ... 2 more fields]
+------+------+------------------+----+
|userid|itemid| score|pref|
+------+------+------------------+----+
| 83| 5618|121.67206123775786|null|
| 264| 62434| 37.52938682929201|null|
| 110| 62434| 32.87763551021598|null|
| 241|149902|16.321437468092682|null|
| 2|149902| 5.85755844914133|null|
+------+------+------------------+----+

Assemble the output:

// user recommendation record; this definition is implied by its usage here and below
case class UserRecomm(userid: String, itemid: String, pref: Double)

val out1 = user_prefer_ds5.select("userid", "itemid", "score").map { row =>
  val userid = row.getString(0)
  val itemid = row.getString(1)
  val pref = row.getDouble(2)
  UserRecomm(userid, itemid, pref)
}

out1.show(5)

Output:

out1: org.apache.spark.sql.Dataset[UserRecomm] = [userid: string, itemid: string ... 1 more field]
+------+------+------------------+
|userid|itemid| pref|
+------+------+------------------+
| 83| 5618|121.67206123775786|
| 264| 62434| 37.52938682929201|
| 110| 62434| 32.87763551021598|
| 241|149902|16.321437468092682|
| 2|149902| 5.85755844914133|
+------+------+------------------+

Finally, generate the user predictions and attach titles and genres:
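Neither ItemSimilarity nor cooccurrence is defined in this article; both come from the companion GitHub code linked in section 4, with cooccurrence presumably being the item similarity Dataset (items_similar_ds1 above). A minimal sketch of what Recommend might look like, assembled from the steps in this section, assuming spark.implicits._ is in scope as in the Zeppelin session (the signature is my assumption, not the repository's API):

// a sketch: wraps the recall / score / aggregate / filter steps shown above
def Recommend(itemSimi: Dataset[ItemSimi], userPrefer: Dataset[ItemPref]): Dataset[UserRecomm] = {
  // recall: join similarities with ratings on the rated item
  val recalled = itemSimi.join(userPrefer, $"itemidI" === $"itemid", "inner")
  // score each candidate: rating × similarity
  val scored = recalled.withColumn("score", col("pref") * col("similar"))
    .select("userid", "itemidJ", "score")
  // aggregate contributions per (user, candidate item)
  val summed = scored.groupBy("userid", "itemidJ").agg(sum("score").as("score"))
    .withColumnRenamed("itemidJ", "itemid")
  // drop items the user has already rated
  summed.join(userPrefer, Seq("userid", "itemid"), "left").where("pref is null")
    .select($"userid", $"itemid", $"score".as("pref")).as[UserRecomm]
}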

val user_predictr_cooccurrence = ItemSimilarity.Recommend(cooccurrence, user_ds).map {
case UserRecomm(userid: String, itemid: String, pref: Double) =>
val title = item_id2title_map_BC.value.getOrElse(itemid, "")
val genres = item_id2genres_map_BC.value.getOrElse(itemid, "")
(userid, itemid, title, genres, pref)
}.withColumnRenamed("_1", "userid").
withColumnRenamed("_2", "itemid").
withColumnRenamed("_3", "title").
withColumnRenamed("_4", "genres").
withColumnRenamed("_5", "pref")

user_predictr_cooccurrence.count()
println("user_predictr_cooccurrence.show(5)")
user_predictr_cooccurrence.orderBy($"userid".asc, $"pref".desc).show(5)

Output:

res269: Long = 5777812
user_predictr_cooccurrence.show(5)
+------+------+--------------------+--------------------+------------------+
|userid|itemid| title| genres| pref|
+------+------+--------------------+--------------------+------------------+
| 1| 2918|Ferris Bueller's ...| Comedy|366.73977336626604|
| 1| 1036| Die Hard (1988)|Action|Crime|Thri...| 355.2698121740735|
| 1| 1391|Mars Attacks! (1996)|Action|Comedy|Sci-Fi|350.94166057250465|
| 1| 2011|Back to the Futur...|Adventure|Comedy|...| 349.2778478948547|
| 1| 1968|Breakfast Club, T...| Comedy|Drama| 347.6108775933672|
+------+------+--------------------+--------------------+------------------+
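Section 3 mentions a maximum number of recommendations K, but the code above never applies it. One way to keep only the top K items per user is a window function; a sketch, with K = 10 chosen arbitrarily:

import org.apache.spark.sql.expressions.Window

val K = 10
val byUser = Window.partitionBy("userid").orderBy(col("pref").desc)
val topK = user_predictr_cooccurrence
  .withColumn("rank", row_number().over(byUser))
  .where(col("rank") <= K)
  .drop("rank")
topK.show(5)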

4. Related Links

1. 黃美靈 (Huang Meiling). 推薦系統算法實踐 (Recommender System Algorithms in Practice). Publishing House of Electronics Industry.

2. Neighborhood-based collaborative filtering algorithms

3. GitHub implementation: https://github.com/freeshow/RecommenderSystems.git


This article was originally shared on the WeChat public account 大數據AI (songxt1990).