Item-Based Collaborative Filtering (a Spark Scala Implementation)

Spark ML does not currently ship a neighborhood-based collaborative filtering recommender (its built-in recommender is ALS matrix factorization). Drawing on the theory of neighborhood-based collaborative filtering, this article implements an item-based collaborative filtering recommender on top of Spark ML. The similarity model computes item-to-item similarities and supports three measures: co-occurrence similarity, Euclidean distance similarity, and cosine similarity; a sketch of the three measures follows the caption below. Figure 1 shows the item similarity matrix being computed from the user rating matrix with the co-occurrence measure.

Figure 1. Computing the item similarity matrix from the user rating matrix with the co-occurrence measure
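For reference, the three measures can be written as plain Scala functions; this is an illustrative sketch (the helper names and signatures are mine, not from the companion code, and the article below implements only the co-occurrence measure):

import scala.math.sqrt

// co-occurrence similarity: nI = users who rated item i, nJ = users who rated item j,
// nIJ = users who rated both items
def cooccurrenceSim(nI: Long, nJ: Long, nIJ: Long): Double =
  nIJ / sqrt(nI.toDouble * nJ.toDouble)

// Euclidean distance similarity over the two items' ratings from their common users,
// mapped into (0, 1] so that a smaller distance yields a larger similarity
def euclideanSim(prefI: Array[Double], prefJ: Array[Double]): Double = {
  val dist = sqrt(prefI.zip(prefJ).map { case (a, b) => (a - b) * (a - b) }.sum)
  1.0 / (1.0 + dist)
}

// cosine similarity over the two items' rating vectors
def cosineSim(prefI: Array[Double], prefJ: Array[Double]): Double = {
  val dot = prefI.zip(prefJ).map { case (a, b) => a * b }.sum
  val norm = sqrt(prefI.map(a => a * a).sum) * sqrt(prefJ.map(b => b * b).sum)
  if (norm == 0.0) 0.0 else dot / norm
}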

Note: the code in this article is run in a Zeppelin notebook.

1. Data Preparation

Data source: MovieLens (https://grouplens.org/datasets/movielens/). MovieLens offers several editions (e.g. 1M, 10M, 20M); the code below uses the ml-latest-small dataset.

1.1 Import dependencies

import scala.math._
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import scala.collection.mutable.WrappedArray
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

1.2 Load the item metadata table

Load movies.csv into a DataFrame:

val item_conf_path = "hdfs://mycluster/user/***/ml-latest-small/movies.csv"
val item_conf_df = spark.read.options(Map(("delimiter", ","), ("header", "true"))).csv(item_conf_path)
item_conf_df.show(5)

Output:

+-------+--------------------+--------------------+
|movieId| title| genres|
+-------+--------------------+--------------------+
| 1| Toy Story (1995)|Adventure|Animati...|
| 2| Jumanji (1995)|Adventure|Childre...|
| 3|Grumpier Old Men ...| Comedy|Romance|
| 4|Waiting to Exhale...|Comedy|Drama|Romance|
| 5|Father of the Bri...| Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows

Build the (movieId → title) map:

val item_id2title_map = item_conf_df.select("movieId", "title").collect().map(row => (row(0).toString(), row(1).toString())).toMap

Build the (movieId → genres) map:

val item_id2genres_map = item_conf_df.select("movieId", "genres").collect().map(row => (row(0).toString(), row(1).toString())).toMap

Broadcast both maps to the executors:

val item_id2title_map_BC = spark.sparkContext.broadcast(item_id2title_map)
val item_id2genres_map_BC = spark.sparkContext.broadcast(item_id2genres_map)

1.3 Load the user rating data

val user_rating_path = "hdfs://mycluster/user/***/ml-latest-small/ratings.csv"
val user_rating_df = spark.read.options(Map(("delimiter", ","), ("header", "true"))).csv(user_rating_path)
user_rating_df.show(5)

Output:

user_rating_path: String = hdfs://mycluster/user/***/ml-latest-small/ratings.csv
user_rating_df: org.apache.spark.sql.DataFrame = [userId: string, movieId: string ... 2 more fields]
+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
| 1| 1| 4.0|964982703|
| 1| 3| 4.0|964981247|
| 1| 6| 4.0|964982224|
| 1| 47| 5.0|964983815|
| 1| 50| 5.0|964982931|
+------+-------+------+---------+

Inspect the column types of user_rating_df:

user_rating_df.dtypes

Output:

res210: Array[(String, String)] = Array((userId,StringType), (movieId,StringType), (rating,StringType), (timestamp,StringType))

Convert the types (Spark's CSV reader leaves every column as StringType unless schema inference is enabled):

// target type after conversion
case class ItemPref(userid: String, itemid: String, pref: Double)

// convert each row: keep userid and itemid as strings, parse rating as Double
val user_ds = user_rating_df.map {
  case Row(userId: String, movieId: String, rating: String, timestamp: String) =>
    ItemPref(userId, movieId, rating.toDouble)
}
println("user_ds.show(10)")
user_ds.show(10)
user_ds.cache()
user_ds.count()

Output:

defined class ItemPref
user_ds: org.apache.spark.sql.Dataset[ItemPref] = [userid: string, itemid: string ... 1 more field]
user_ds.show(10)
+------+------+----+
|userid|itemid|pref|
+------+------+----+
| 1| 1| 4.0|
| 1| 3| 4.0|
| 1| 6| 4.0|
| 1| 47| 5.0|
| 1| 50| 5.0|
| 1| 70| 3.0|
| 1| 101| 5.0|
| 1| 110| 4.0|
| 1| 151| 5.0|
| 1| 157| 5.0|
+------+------+----+
only showing top 10 rows
res213: user_ds.type = [userid: string, itemid: string ... 1 more field]
res214: Long = 100836

2. Similarity Computation

The distributed computation of the co-occurrence similarity matrix works as follows.

  • First, group by user id to collect the set of items each user has rated;
  • then flatMap over each user's item set, emitting every (item, item) pair but generating only the upper triangle;
  • next, group by the (item, item) pair to count the number of users in which the two items co-occur;
  • finally, apply the co-occurrence similarity formula w(i,j) = |N(i) ∩ N(j)| / sqrt(|N(i)| × |N(j)|), where the numerator is the number of users who rated both i and j, and |N(i)| and |N(j)| are the numbers of users who rated i and j respectively. This yields the similarity for every upper-triangle pair; a small worked example follows the figure caption, and the overall process is shown in Figure 2.

Figure 2. Distributed computation of the co-occurrence similarity matrix
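To make the formula concrete, a tiny worked example with hypothetical counts: if 2 users rated item i, 8 users rated item j, and both items appear together in 2 users' lists, then w(i,j) = 2 / sqrt(2 × 8) = 2 / 4 = 0.5.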

The code for the distributed co-occurrence similarity computation follows, step by step.

First, group by user id to collect the set of items each user has rated.

// (user, item) => (user, set of items)
val user_ds1 = user_ds.groupBy("userid").agg(collect_set("itemid")).withColumnRenamed("collect_set(itemid)", "itemid_set")
user_ds1.show(5)

Output:

user_ds1: org.apache.spark.sql.DataFrame = [userid: string, itemid_set: array<string>]
+------+--------------------+
|userid| itemid_set|
+------+--------------------+
| 296|[110, 65261, 356,...|
| 467|[389, 41, 780, 23...|
| 125|[151695, 62434, 7...|
| 451|[376, 1, 1356, 14...|
| 124|[110, 1, 296, 795...|
+------+--------------------+

Then flatMap over each user's item set to generate every (item, item) pair, keeping only the upper triangle.

// (item, item) pairs, upper triangle only
val user_ds2 = user_ds1.flatMap { row =>
  // sorting (lexicographically) ensures itemidI < itemidJ, i.e. the upper triangle
  val itemlist = row.getAs[scala.collection.mutable.WrappedArray[String]](1).toArray.sorted
  val result = new ArrayBuffer[(String, String, Double)]()
  for (i <- 0 to itemlist.length - 2) {
    for (j <- i + 1 to itemlist.length - 1) {
      result += ((itemlist(i), itemlist(j), 1.0))
    }
  }
  result
}.withColumnRenamed("_1", "itemidI").withColumnRenamed("_2", "itemidJ").withColumnRenamed("_3", "score")

user_ds2.show(5)

Output:

user_ds2: org.apache.spark.sql.DataFrame = [itemidI: string, itemidJ: string ... 1 more field]
+-------+-------+-----+
|itemidI|itemidJ|score|
+-------+-------+-----+
| 110| 1201| 1.0|
| 110| 160848| 1.0|
| 110| 166528| 1.0|
| 110| 169034| 1.0|
| 110| 1704| 1.0|
+-------+-------+-----+

Next, group by the (item, item) pair to obtain each pair's co-occurrence count.

// co-occurrence counts for the upper-triangle (item, item) pairs
val user_ds3 = user_ds2.groupBy("itemidI", "itemidJ").agg(sum("score").as("sumIJ"))
user_ds3.show(5)

Output:

user_ds3: org.apache.spark.sql.DataFrame = [itemidI: string, itemidJ: string ... 1 more field]
+-------+-------+-----+
|itemidI|itemidJ|sumIJ|
+-------+-------+-----+
| 171765| 79132| 2.0|
| 2028| 5618| 46.0|
| 2334| 2384| 3.0|
| 100083| 89190| 1.0|
| 100882| 162578| 1.0|
+-------+-------+-----+

Compute each item's total frequency, i.e. N(i) and N(j):

val user_ds0 = user_ds.withColumn("score", lit(1)).groupBy("itemid").agg(sum("score").as("score"))
user_ds0.show(5)

Output:

// compute N(i)
user_ds0: org.apache.spark.sql.DataFrame = [itemid: string, score: bigint]
+------+-----+
|itemid|score|
+------+-----+
| 296| 307|
| 1090| 63|
|115713| 28|
| 3210| 42|
| 88140| 32|
+------+-----+
// compute N(j): rename the columns for the join
user_ds0.withColumnRenamed("itemid", "itemidJ").withColumnRenamed("score", "sumJ").select("itemidJ", "sumJ").show(5)

Output:

+-------+----+
|itemidJ|sumJ|
+-------+----+
| 296| 307|
| 1090| 63|
| 115713| 28|
| 3210| 42|
| 88140| 32|
+-------+----+

Compute the co-occurrence similarity by joining in N(j) and N(i), then applying the formula:

val user_ds4 = user_ds3.join(user_ds0.withColumnRenamed("itemid", "itemidJ").withColumnRenamed("score", "sumJ").select("itemidJ", "sumJ"), "itemidJ")
user_ds4.show(5)

Output:

user_ds4: org.apache.spark.sql.DataFrame = [itemidJ: string, itemidI: string ... 2 more fields]
+-------+-------+-----+----+
|itemidJ|itemidI|sumIJ|sumJ|
+-------+-------+-----+----+
| 79132| 171765| 2.0| 143|
| 5618| 2028| 46.0| 87|
| 2384| 2334| 3.0| 29|
| 89190| 100083| 1.0| 2|
| 162578| 100882| 1.0| 6|
+-------+-------+-----+----+
user_ds0.withColumnRenamed("itemid", "itemidI").withColumnRenamed("score", "sumI").select("itemidI", "sumI").show(5)

Output:

+-------+----+
|itemidI|sumI|
+-------+----+
| 296| 307|
| 1090| 63|
| 115713| 28|
| 3210| 42|
| 88140| 32|
+-------+----+
val user_ds5 = user_ds4.join(user_ds0.withColumnRenamed("itemid", "itemidI").withColumnRenamed("score", "sumI").select("itemidI", "sumI"), "itemidI")
user_ds5.show(5)

Output:

user_ds5: org.apache.spark.sql.DataFrame = [itemidI: string, itemidJ: string ... 3 more fields]
+-------+-------+-----+----+----+
|itemidI|itemidJ|sumIJ|sumJ|sumI|
+-------+-------+-----+----+----+
| 171765| 79132| 2.0| 143| 2|
| 2028| 5618| 46.0| 87| 188|
| 2334| 2384| 3.0| 29| 13|
| 100083| 89190| 1.0| 2| 3|
| 100882| 162578| 1.0| 6| 2|
+-------+-------+-----+----+----+
// apply w(i,j) = |N(i) ∩ N(j)| / sqrt(|N(i)| × |N(j)|): the numerator is the co-occurrence count of i and j (sumIJ); |N(i)| is i's frequency (sumI) and |N(j)| is j's frequency (sumJ)
val user_ds6 = user_ds5.withColumn("result", col("sumIJ") / sqrt(col("sumI") * col("sumJ")))
user_ds6.show(5)

Output:

user_ds6: org.apache.spark.sql.DataFrame = [itemidI: string, itemidJ: string ... 4 more fields]
+-------+-------+-----+----+----+-------------------+
|itemidI|itemidJ|sumIJ|sumJ|sumI| result|
+-------+-------+-----+----+----+-------------------+
| 171765| 79132| 2.0| 143| 2|0.11826247919781652|
| 2028| 5618| 46.0| 87| 188|0.35968247729147257|
| 2334| 2384| 3.0| 29| 13| 0.1545078607873814|
| 100083| 89190| 1.0| 2| 3| 0.4082482904638631|
| 100882| 162578| 1.0| 6| 2| 0.2886751345948129|
+-------+-------+-----+----+----+-------------------+
// merge the upper and lower triangles into the full similarity matrix
println(s"user_ds6.count(): ${user_ds6.count()}")
val user_ds7 = user_ds6.select("itemidI", "itemidJ", "result").union(user_ds6.select($"itemidJ".as("itemidI"), $"itemidI".as("itemidJ"), $"result"))
println(s"user_ds7.count(): ${user_ds7.count()}")
user_ds7.show(5)

Output:

user_ds6.count(): 13157672
user_ds7: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [itemidI: string, itemidJ: string ... 1 more field]
user_ds7.count(): 26315344
+-------+-------+-------------------+
|itemidI|itemidJ| result|
+-------+-------+-------------------+
| 171765| 79132|0.11826247919781652|
| 2028| 5618|0.35968247729147257|
| 2334| 2384| 0.1545078607873814|
| 100083| 89190| 0.4082482904638631|
| 100882| 162578| 0.2886751345948129|
+-------+-------+-------------------+
// assemble the result

// item-to-item similarity record
case class ItemSimi(itemidI: String, itemidJ: String, similar: Double)

val out = user_ds7.select("itemidI", "itemidJ", "result").map { row =>
  val itemidI = row.getString(0)
  val itemidJ = row.getString(1)
  val similar = row.getDouble(2)
  ItemSimi(itemidI, itemidJ, similar)
}

out.show(5)

Output:

out: org.apache.spark.sql.Dataset[ItemSimi] = [itemidI: string, itemidJ: string ... 1 more field]
+-------+-------+-------------------+
|itemidI|itemidJ| similar|
+-------+-------+-------------------+
| 171765| 79132|0.11826247919781652|
| 2028| 5618|0.35968247729147257|
| 2334| 2384| 0.1545078607873814|
| 100083| 89190| 0.4082482904638631|
| 100882| 162578| 0.2886751345948129|
+-------+-------+-------------------+
// attach titles and genres to the co-occurrence similarities
val items_similar_cooccurrence = out.map {
  case ItemSimi(itemidI: String, itemidJ: String, similar: Double) =>
    val i_title = item_id2title_map_BC.value.getOrElse(itemidI, "")
    val j_title = item_id2title_map_BC.value.getOrElse(itemidJ, "")
    val i_genres = item_id2genres_map_BC.value.getOrElse(itemidI, "")
    val j_genres = item_id2genres_map_BC.value.getOrElse(itemidJ, "")
    (itemidI, itemidJ, similar, i_title, j_title, i_genres, j_genres)
}.withColumnRenamed("_1", "itemidI").
  withColumnRenamed("_2", "itemidJ").
  withColumnRenamed("_3", "similar").
  withColumnRenamed("_4", "i_title").
  withColumnRenamed("_5", "j_title").
  withColumnRenamed("_6", "i_genres").
  withColumnRenamed("_7", "j_genres")

Output:

items_similar_cooccurrence: org.apache.spark.sql.DataFrame = [itemidI: string, itemidJ: string ... 5 more fields]
+-------+-------+-------------------+--------------------+--------------------+--------------------+--------------------+
|itemidI|itemidJ| similar| i_title| j_title| i_genres| j_genres|
+-------+-------+-------------------+--------------------+--------------------+--------------------+--------------------+
| 171765| 79132|0.11826247919781652| Okja (2017)| Inception (2010)|Action|Adventure|...|Action|Crime|Dram...|
| 2028| 5618|0.35968247729147257|Saving Private Ry...|Spirited Away (Se...| Action|Drama|War|Adventure|Animati...|
| 2334| 2384| 0.1545078607873814| Siege, The (1998)|Babe: Pig in the ...| Action|Thriller|Adventure|Childre...|
| 100083| 89190| 0.4082482904638631| Movie 43 (2013)|Conan the Barbari...| Comedy|Action|Adventure|...|
| 100882| 162578| 0.2886751345948129|Journey to the We...|Kubo and the Two ...|Adventure|Comedy|...|Adventure|Animati...|
+-------+-------+-------------------+--------------------+--------------------+--------------------+--------------------+

3. Recommendation Computation

The item-based collaborative filtering recommendation model is implemented on Spark ML: given the item similarity model W, the user rating matrix A, and a maximum number of recommendations K, it produces per-user recommendations. As Figure 3 shows, the user recommendation list is computed from the item similarity matrix and the user ratings as R = W × A; items the user has already rated are excluded, and the remainder are recommended in descending order of score.

Figure 3. Collaborative recommendation computation

The recommendation step is likewise distributed: the similarity table and the user rating table are first joined on item id, producing a (rating × similarity) contribution per user-item pair; the contributions are then summed with a group by; finally, items the user has already rated are filtered out and the rest are recommended in descending score order. The process is shown in Figure 4, and a small worked example follows the caption.

Figure 4. Distributed computation of collaborative filtering recommendations
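For intuition, a worked example with hypothetical numbers: suppose sim(i, j) = 0.5 and sim(i, k) = 0.2, and a user has rated item i with 4.0 but has rated neither j nor k. Then score(j) = 4.0 × 0.5 = 2.0 and score(k) = 4.0 × 0.2 = 0.8, so j is recommended ahead of k; when several rated items are similar to the same candidate, their contributions are summed by the group by.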

The concrete code is as follows:

// fetch the item similarities
val items_similar_ds1 = items_similar_cooccurrence.select("itemidI", "itemidJ", "similar").map {
  case Row(itemidI: String, itemidJ: String, similar: Double) =>
    ItemSimi(itemidI, itemidJ, similar)
}

items_similar_ds1.show(5)

Output:

items_similar_ds1: org.apache.spark.sql.Dataset[ItemSimi] = [itemidI: string, itemidJ: string ... 1 more field]
+-------+-------+-------------------+
|itemidI|itemidJ| similar|
+-------+-------+-------------------+
| 171765| 79132|0.11826247919781652|
| 2028| 5618|0.35968247729147257|
| 2334| 2384| 0.1545078607873814|
| 100083| 89190| 0.4082482904638631|
| 100882| 162578| 0.2886751345948129|
+-------+-------+-------------------+

Recall candidate items similar to those each user has rated. The join below uses user_prefer_ds1, whose presumed definition is sketched next.
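The original article never defines user_prefer_ds1; presumably it is just the (userid, itemid, pref) rating Dataset built in section 1.3. A minimal sketch under that assumption (this line is my reconstruction, not from the original):

// presumed definition (not in the original article): the user rating table from section 1.3
val user_prefer_ds1 = user_ds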

val user_prefer_ds2 = items_similar_ds1.join(user_prefer_ds1, $"itemidI" === $"itemid", "inner")
user_prefer_ds2.show(5)

Output:

user_prefer_ds2: org.apache.spark.sql.DataFrame = [itemidI: string, itemidJ: string ... 4 more fields]
+-------+-------+-------------------+------+------+----+
|itemidI|itemidJ| similar|userid|itemid|pref|
+-------+-------+-------------------+------+------+----+
| 171765| 79132|0.11826247919781652| 414|171765| 4.0|
| 171765| 79132|0.11826247919781652| 296|171765| 3.5|
| 2028| 5618|0.35968247729147257| 610| 2028| 5.0|
| 2028| 5618|0.35968247729147257| 608| 2028| 4.5|
| 2028| 5618|0.35968247729147257| 607| 2028| 5.0|
+-------+-------+-------------------+------+------+----+

Compute the score of each recalled (user, item) pair as rating × similarity:

val user_prefer_ds3 = user_prefer_ds2.withColumn("score", col("pref") * col("similar")).select("userid", "itemidJ", "score")
user_prefer_ds3.show(5)

Output:

user_prefer_ds3: org.apache.spark.sql.DataFrame = [userid: string, itemidJ: string ... 1 more field]
+------+-------+-------------------+
|userid|itemidJ| score|
+------+-------+-------------------+
| 414| 79132|0.47304991679126607|
| 296| 79132| 0.4139186771923578|
| 610| 5618| 1.798412386457363|
| 608| 5618| 1.6185711478116265|
| 607| 5618| 1.798412386457363|
+------+-------+-------------------+

Aggregate the scores per user and candidate item:

val user_prefer_ds4 = user_prefer_ds3.groupBy("userid", "itemidJ").agg(sum("score").as("score")).withColumnRenamed("itemidJ", "itemid")
user_prefer_ds4.show(5)

Output:

user_prefer_ds4: org.apache.spark.sql.DataFrame = [userid: string, itemid: string ... 1 more field]
+------+------+------------------+
|userid|itemid| score|
+------+------+------------------+
| 105| 5618| 693.772655810524|
| 83| 5618|121.67206123775786|
| 264| 62434| 37.52938682929201|
| 110| 62434| 32.87763551021598|
| 241|149902|16.321437468092682|
+------+------+------------------+

Remove items the user has already rated: a left join against the rating table keeps only the rows where pref is null:

val user_prefer_ds5 = user_prefer_ds4.join(user_prefer_ds1, Seq("userid", "itemid"), "left").where("pref is null")
user_prefer_ds5.show(5)

Output:

user_prefer_ds5: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [userid: string, itemid: string ... 2 more fields]
+------+------+------------------+----+
|userid|itemid| score|pref|
+------+------+------------------+----+
| 83| 5618|121.67206123775786|null|
| 264| 62434| 37.52938682929201|null|
| 110| 62434| 32.87763551021598|null|
| 241|149902|16.321437468092682|null|
| 2|149902| 5.85755844914133|null|
+------+------+------------------+----+

Assemble the output:

// user recommendation record; this definition is implied by its usage here and below
case class UserRecomm(userid: String, itemid: String, pref: Double)

val out1 = user_prefer_ds5.select("userid", "itemid", "score").map { row =>
  val userid = row.getString(0)
  val itemid = row.getString(1)
  val pref = row.getDouble(2)
  UserRecomm(userid, itemid, pref)
}

out1.show(5)

Output:

out1: org.apache.spark.sql.Dataset[UserRecomm] = [userid: string, itemid: string ... 1 more field]
+------+------+------------------+
|userid|itemid| pref|
+------+------+------------------+
| 83| 5618|121.67206123775786|
| 264| 62434| 37.52938682929201|
| 110| 62434| 32.87763551021598|
| 241|149902|16.321437468092682|
| 2|149902| 5.85755844914133|
+------+------+------------------+

Finally, generate the user predictions and attach titles and genres:
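Neither ItemSimilarity nor cooccurrence is defined in this article; both come from the companion GitHub code linked in section 4, with cooccurrence presumably being the item similarity Dataset (items_similar_ds1 above). A minimal sketch of what Recommend might look like, assembled from the steps in this section, assuming spark.implicits._ is in scope as in the Zeppelin session (the signature is my assumption, not the repository's API):

// a sketch: wraps the recall / score / aggregate / filter steps shown above
def Recommend(itemSimi: Dataset[ItemSimi], userPrefer: Dataset[ItemPref]): Dataset[UserRecomm] = {
  // recall: join similarities with ratings on the rated item
  val recalled = itemSimi.join(userPrefer, $"itemidI" === $"itemid", "inner")
  // score each candidate: rating × similarity
  val scored = recalled.withColumn("score", col("pref") * col("similar"))
    .select("userid", "itemidJ", "score")
  // aggregate contributions per (user, candidate item)
  val summed = scored.groupBy("userid", "itemidJ").agg(sum("score").as("score"))
    .withColumnRenamed("itemidJ", "itemid")
  // drop items the user has already rated
  summed.join(userPrefer, Seq("userid", "itemid"), "left").where("pref is null")
    .select($"userid", $"itemid", $"score".as("pref")).as[UserRecomm]
}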

val user_predictr_cooccurrence = ItemSimilarity.Recommend(cooccurrence, user_ds).map {
case UserRecomm(userid: String, itemid: String, pref: Double) =>
val title = item_id2title_map_BC.value.getOrElse(itemid, "")
val genres = item_id2genres_map_BC.value.getOrElse(itemid, "")
(userid, itemid, title, genres, pref)
}.withColumnRenamed("_1", "userid").
withColumnRenamed("_2", "itemid").
withColumnRenamed("_3", "title").
withColumnRenamed("_4", "genres").
withColumnRenamed("_5", "pref")

user_predictr_cooccurrence.count()
println("user_predictr_cooccurrence.show(5)")
user_predictr_cooccurrence.orderBy($"userid".asc, $"pref".desc).show(5)

Output:

res269: Long = 5777812
user_predictr_cooccurrence.show(5)
+------+------+--------------------+--------------------+------------------+
|userid|itemid| title| genres| pref|
+------+------+--------------------+--------------------+------------------+
| 1| 2918|Ferris Bueller's ...| Comedy|366.73977336626604|
| 1| 1036| Die Hard (1988)|Action|Crime|Thri...| 355.2698121740735|
| 1| 1391|Mars Attacks! (1996)|Action|Comedy|Sci-Fi|350.94166057250465|
| 1| 2011|Back to the Futur...|Adventure|Comedy|...| 349.2778478948547|
| 1| 1968|Breakfast Club, T...| Comedy|Drama| 347.6108775933672|
+------+------+--------------------+--------------------+------------------+
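Section 3 mentions a maximum number of recommendations K, but the code above never applies it. One way to keep only the top K items per user is a window function; a sketch, with K = 10 chosen arbitrarily:

import org.apache.spark.sql.expressions.Window

val K = 10
val byUser = Window.partitionBy("userid").orderBy(col("pref").desc)
val topK = user_predictr_cooccurrence
  .withColumn("rank", row_number().over(byUser))
  .where(col("rank") <= K)
  .drop("rank")
topK.show(5)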

4. Related Links

1. 黃美靈 (Huang Meiling). 推薦系統算法實踐 (Recommender System Algorithms in Practice). Publishing House of Electronics Industry.

2. Neighborhood-based collaborative filtering algorithms

3. GitHub implementation: https://github.com/freeshow/RecommenderSystems.git


This article was originally shared on the WeChat public account 大數據AI (songxt1990).