Implementing Collaborative Filtering with Mahout

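Mahout uses Taste to provide its collaborative filtering implementation. Taste is a scalable, efficient recommendation engine written in Java. It implements the basic user-based and item-based recommendation algorithms and also exposes extension interfaces, so users can define and implement their own recommendation algorithms. Taste is not limited to Java applications: it can run as a component of an internal server and expose its recommendation logic to the outside world over HTTP or as a Web Service. Its design aims to meet enterprise requirements for performance, flexibility, and scalability in a recommendation engine.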

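Interface overview

Taste consists mainly of the following interfaces:

  • DataModel is the abstract interface for user preference data. Its concrete implementations can extract preference data from arbitrary kinds of data source. By default Taste provides JDBCDataModel and FileDataModel, which read user preferences from a database and from a file respectively.
  • UserSimilarity and ItemSimilarity. UserSimilarity defines the similarity between two users. It is the core of a collaborative filtering recommendation engine and can be used to compute a user's "neighbors", that is, the users whose tastes resemble the current user's. ItemSimilarity does the same for items, computing the similarity between two items.
  • UserNeighborhood is used by recommendation methods based on user similarity, where recommendations are produced by finding neighbors whose preferences resemble the current user's. UserNeighborhood defines how those neighbors are chosen; concrete implementations are generally computed from a UserSimilarity.
  • Recommender is the abstract interface of the recommendation engine and the core component of Taste. Given a DataModel, it computes recommendations for different users. In practice you mostly use its implementations GenericUserBasedRecommender or GenericItemBasedRecommender, which implement a user-similarity-based engine and an item-similarity-based engine respectively.
  • RecommenderEvaluator: scores the quality of a recommender.
  • RecommenderIRStatsEvaluator: collects information-retrieval metrics for a recommender, including precision and recall.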

Mahout currently provides the following DataModel implementations:

  • org.apache.mahout.cf.taste.impl.model.GenericDataModel
  • org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel
  • org.apache.mahout.cf.taste.impl.model.PlusAnonymousUserDataModel
  • org.apache.mahout.cf.taste.impl.model.file.FileDataModel
  • org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel
  • org.apache.mahout.cf.taste.impl.model.cassandra.CassandraDataModel
  • org.apache.mahout.cf.taste.impl.model.mongodb.MongoDBDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.SQL92JDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.PostgreSQLJDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.GenericJDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.SQL92BooleanPrefJDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.MySQLBooleanPrefJDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.PostgreSQLBooleanPrefJDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel

The purpose of each DataModel can mostly be guessed from its class name. Oddly, there is no DataModel for HDFS; someone has implemented one, see MAHOUT-1579.

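The following implementations of UserSimilarity and ItemSimilarity are available:

  • CityBlockSimilarity: similarity based on Manhattan (city-block) distance
  • EuclideanDistanceSimilarity: similarity based on Euclidean distance
  • LogLikelihoodSimilarity: similarity based on the log-likelihood ratio
  • PearsonCorrelationSimilarity: similarity based on the Pearson correlation coefficient
  • SpearmanCorrelationSimilarity: similarity based on the Spearman rank correlation coefficient
  • TanimotoCoefficientSimilarity: similarity based on the Tanimoto coefficient
  • UncenteredCosineSimilarity: cosine similarity, computed without mean-centering

For explanations of these similarity measures, see "Introduction to the Mahout Recommendation Engine" (Mahout推薦引擎介紹).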
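To make a few of these measures concrete, here is a minimal sketch in plain Java of the textbook formulas for Pearson correlation, uncentered cosine, and the Tanimoto coefficient. It does not use Mahout, and Mahout's own classes may differ in details such as weighting or handling of missing data; the class and method names (`SimilaritySketch`, `pearson`, and so on) are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Textbook versions of three similarity measures; not Mahout's own code. */
final class SimilaritySketch {

    /** Pearson correlation over the items both users rated (arrays aligned by item). */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // NaN when either user gave the same rating to every item (zero variance).
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Cosine of the angle between the raw rating vectors (no mean-centering). */
    static double uncenteredCosine(double[] x, double[] y) {
        double dot = 0.0, xx = 0.0, yy = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            xx += x[i] * x[i];
            yy += y[i] * y[i];
        }
        return dot / Math.sqrt(xx * yy);
    }

    /** Tanimoto coefficient of two item-ID sets: |A ∩ B| / |A ∪ B|. Ignores rating values. */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        double[] u1 = {5.0, 3.0, 2.5};
        double[] u2 = {2.0, 2.5, 5.0};
        System.out.println("pearson  = " + pearson(u1, u2));
        System.out.println("cosine   = " + uncenteredCosine(u1, u2));
        Set<Long> items1 = new HashSet<>(Arrays.asList(101L, 102L, 103L));
        Set<Long> items2 = new HashSet<>(Arrays.asList(102L, 103L, 104L));
        System.out.println("tanimoto = " + tanimoto(items1, items2));
    }
}
```

Pearson and cosine need preference values, whereas Tanimoto only looks at which items each user has interacted with, which is why it suits data without ratings.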

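There are two main UserNeighborhood implementations:

  • NearestNUserNeighborhood: takes a fixed number N of nearest neighbors for each user
  • ThresholdUserNeighborhood: takes as neighbors all users whose similarity to the given user meets a given threshold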
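The difference between the two strategies can be sketched in plain Java as follows. This is an illustration of the idea only, assuming the similarities between the target user and every other user have already been computed into a map; `NeighborhoodSketch` and its methods are hypothetical names, not Mahout classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Two ways of picking neighbors from precomputed similarities (illustrative only). */
final class NeighborhoodSketch {

    /** The n users with the highest similarity to the target user, most similar first. */
    static List<Long> nearestN(Map<Long, Double> similarityToTarget, int n) {
        List<Map.Entry<Long, Double>> entries = new ArrayList<>(similarityToTarget.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<Long> neighbors = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            neighbors.add(entries.get(i).getKey());
        }
        return neighbors;
    }

    /** Every user whose similarity is at least the threshold, sorted by user ID. */
    static List<Long> withinThreshold(Map<Long, Double> similarityToTarget, double threshold) {
        List<Long> neighbors = new ArrayList<>();
        for (Map.Entry<Long, Double> e : similarityToTarget.entrySet()) {
            if (e.getValue() >= threshold) {
                neighbors.add(e.getKey());
            }
        }
        Collections.sort(neighbors);
        return neighbors;
    }

    public static void main(String[] args) {
        Map<Long, Double> sims = new HashMap<>();
        sims.put(2L, 0.9);
        sims.put(3L, 0.1);
        sims.put(4L, 0.7);
        sims.put(5L, -0.3);
        System.out.println(nearestN(sims, 2));
        System.out.println(withinThreshold(sims, 0.5));
    }
}
```

A fixed N always yields neighbors, even poor ones, while a threshold can return very few or very many users depending on how dense the data is.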

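Recommender has the following implementations:

  • GenericUserBasedRecommender: user-based recommender
  • GenericBooleanPrefUserBasedRecommender: user-based recommender for data without preference values
  • GenericItemBasedRecommender: item-based recommender
  • GenericBooleanPrefItemBasedRecommender: item-based recommender for data without preference values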
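As a rough illustration of what a user-based recommender computes, the sketch below estimates a user's preference for one item as the similarity-weighted average of the neighbors' ratings for that item. This is the common textbook formulation and is only an approximation of what the Mahout classes do; the name `UserBasedEstimateSketch` and the map-based inputs are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Similarity-weighted average of neighbor ratings (textbook form, not Mahout's code). */
final class UserBasedEstimateSketch {

    /**
     * @param neighborSimilarity neighbor user ID -> similarity to the target user
     * @param neighborRating     neighbor user ID -> that neighbor's rating of the item
     * @return estimated rating, or NaN if no neighbor has rated the item
     */
    static double estimate(Map<Long, Double> neighborSimilarity, Map<Long, Double> neighborRating) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<Long, Double> e : neighborSimilarity.entrySet()) {
            Double rating = neighborRating.get(e.getKey());
            if (rating == null) {
                continue; // this neighbor has not rated the item
            }
            weightedSum += e.getValue() * rating;
            totalWeight += Math.abs(e.getValue());
        }
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        Map<Long, Double> sims = new HashMap<>();
        sims.put(2L, 0.8);
        sims.put(3L, 0.2);
        sims.put(4L, 0.5); // has not rated the item, so is skipped
        Map<Long, Double> ratings = new HashMap<>();
        ratings.put(2L, 5.0);
        ratings.put(3L, 1.0);
        System.out.println("estimated preference = " + estimate(sims, ratings));
    }
}
```

An item-based recommender follows the same pattern with the roles swapped: it averages the user's own ratings of similar items, weighted by item similarity, so no neighborhood of users is needed.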

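RecommenderEvaluator has the following implementations:

  • AverageAbsoluteDifferenceRecommenderEvaluator: computes the average absolute difference between predicted and actual preferences
  • RMSRecommenderEvaluator: computes the root-mean-square error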
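The two scores can be illustrated with a small plain-Java sketch, assuming we already have the held-out actual preferences and the recommender's predictions for them as parallel arrays. The class `ErrorMetricsSketch` is a made-up name for the example.

```java
/** Average absolute difference and root-mean-square error over parallel arrays. */
final class ErrorMetricsSketch {

    /** Mean of |actual - predicted|. */
    static double averageAbsoluteDifference(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    /** Square root of the mean of (actual - predicted)^2; penalizes large errors more. */
    static double rootMeanSquare(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0};
        double[] predicted = {3.0, 3.0, 3.0};
        System.out.println("average absolute difference = " + averageAbsoluteDifference(actual, predicted));
        System.out.println("root-mean-square error      = " + rootMeanSquare(actual, predicted));
    }
}
```

For both scores, lower is better, and 0 means every prediction matched the held-out preference exactly.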

The implementation of RecommenderIRStatsEvaluator is GenericRecommenderIRStatsEvaluator.
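Precision and recall for a single user can be sketched as below, assuming we know which items count as relevant for that user (how relevance is decided is up to the evaluator and is not modeled here). `IRStatsSketch` is a hypothetical name used only in this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Precision and recall of one recommendation list against a set of relevant items. */
final class IRStatsSketch {

    private static int hits(List<Long> recommended, Set<Long> relevant) {
        int hits = 0;
        for (Long item : recommended) {
            if (relevant.contains(item)) {
                hits++;
            }
        }
        return hits;
    }

    /** Fraction of the recommended items that are relevant. */
    static double precision(List<Long> recommended, Set<Long> relevant) {
        return recommended.isEmpty() ? 0.0 : (double) hits(recommended, relevant) / recommended.size();
    }

    /** Fraction of the relevant items that were recommended. */
    static double recall(List<Long> recommended, Set<Long> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) hits(recommended, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        List<Long> top4 = Arrays.asList(10L, 11L, 12L, 13L);
        Set<Long> relevant = new HashSet<>(Arrays.asList(11L, 13L, 20L));
        System.out.println("precision at 4 = " + precision(top4, relevant));
        System.out.println("recall at 4    = " + recall(top4, relevant));
    }
}
```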

Running on a single machine

First, add the Mahout dependencies to your Maven project:

<code class="language-xml"><dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.9</version>
</dependency>
<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-integration</artifactId>
    <version>0.9</version>
</dependency>
<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-math</artifactId>
    <version>0.9</version>
</dependency>
<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-examples</artifactId>
    <version>0.9</version>
</dependency>
</code>

User-based recommendation, using FileDataModel as an example:

<code class="language-java">import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

File modelFile = new File("intro.csv");
DataModel model = new FileDataModel(modelFile);

// User similarity based on the Pearson correlation coefficient
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

// Neighborhood: NearestNUserNeighborhood picks the 4 nearest users
UserNeighborhood neighborhood = new NearestNUserNeighborhood(4, similarity, model);

Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

// Recommend 4 items to user 1
List<RecommendedItem> recommendations = recommender.recommend(1, 4);

for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}
</code>

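Note:

FileDataModel requires the fields in the input file to be separated by commas or tabs. To use a different delimiter, you can extend FileDataModel; for example, Mahout already ships GroupLensDataModel, an implementation that parses the MovieLens data set (whose delimiter is :: ).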
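An alternative to subclassing is to convert the file up front. The sketch below rewrites a `::`-delimited ratings file into a comma-delimited one using only the JDK; it is a simple preprocessing step, not a description of how GroupLensDataModel works internally, and the class name `DelimiterSketch` and the file names in `main` are placeholders.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Converts "::"-delimited rating lines into comma-delimited lines. */
final class DelimiterSketch {

    /** "1::1193::5::978300760" becomes "1,1193,5,978300760". */
    static String toCsvLine(String line) {
        return String.join(",", line.split("::"));
    }

    /** Rewrites every non-empty line of the input file into the output file. */
    static void convert(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                writer.write(toCsvLine(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder paths; pass real ones as arguments.
        Path in = Paths.get(args.length > 0 ? args[0] : "ratings.dat");
        Path out = Paths.get(args.length > 1 ? args[1] : "ratings.csv");
        if (Files.exists(in)) {
            convert(in, out);
        } else {
            System.out.println(toCsvLine("1::1193::5::978300760"));
        }
    }
}
```

The converted file can then be passed to FileDataModel like any other comma-separated input.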

When recommendations are requested repeatedly for the same user, you can wrap the GenericUserBasedRecommender in a CachingRecommender so that the results are cached:

<code class="language-java">// CachingRecommender lives in org.apache.mahout.cf.taste.impl.recommender
Recommender cachingRecommender = new CachingRecommender(recommender);
</code>
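The idea behind such a wrapper is the decorator pattern: the wrapper has the same interface as the recommender it wraps and remembers earlier answers. The following is a toy sketch of that idea in plain Java, not Mahout's CachingRecommender; the interface `ToyRecommender` and the class `CachingSketch` are invented for the example, and the cache here is unbounded and not thread-safe.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a recommender interface. */
interface ToyRecommender {
    List<Long> recommend(long userID, int howMany);
}

/** Decorator that memoizes results per (userID, howMany) pair. */
final class CachingSketch implements ToyRecommender {
    private final ToyRecommender delegate;
    private final Map<String, List<Long>> cache = new HashMap<>();

    CachingSketch(ToyRecommender delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<Long> recommend(long userID, int howMany) {
        String key = userID + ":" + howMany;
        List<Long> cached = cache.get(key);
        if (cached == null) {
            cached = delegate.recommend(userID, howMany); // expensive call happens once per key
            cache.put(key, cached);
        }
        return cached;
    }

    /** Drops all cached results, e.g. after the underlying data has changed. */
    void clear() {
        cache.clear();
    }

    public static void main(String[] args) {
        ToyRecommender slow = (user, n) -> {
            System.out.println("computing recommendations for user " + user);
            return Arrays.asList(user + 100L, user + 200L);
        };
        ToyRecommender cached = new CachingSketch(slow);
        System.out.println(cached.recommend(1, 2));
        System.out.println(cached.recommend(1, 2)); // served from the cache
    }
}
```

The trade-off of any such cache is staleness: cached results need to be invalidated once new preference data arrives.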

The Mahout snippets above can be run directly inside a main method. Next, we can obtain an evaluation score for the recommender:

<code class="language-java">import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;

// Score the recommender by average absolute difference
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
// The evaluator builds the recommender itself through a RecommenderBuilder
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(4, similarity, model);
        return new GenericUserBasedRecommender(model, neighborhood, similarity);
    }
};
// Use 70% of the data to train and test on the other 30%;
// the last argument (1.0) means all users take part in the evaluation.
double score = evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
System.out.println(score);
</code>
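The hold-out idea behind the training percentage can be sketched in plain Java: each preference is randomly assigned to the training set with the given probability, otherwise to the test set. This only illustrates the principle and is not the evaluator's actual code; `HoldOutSplitSketch` and the fixed seed are assumptions made so the example is reproducible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Random split of preferences into a training part and a test part. */
final class HoldOutSplitSketch {
    final List<Double> training = new ArrayList<>();
    final List<Double> test = new ArrayList<>();

    HoldOutSplitSketch(List<Double> preferences, double trainingPercentage, long seed) {
        Random random = new Random(seed);
        for (Double preference : preferences) {
            // nextDouble() is uniform in [0, 1), so this branch is taken
            // with probability trainingPercentage.
            if (random.nextDouble() < trainingPercentage) {
                training.add(preference);
            } else {
                test.add(preference);
            }
        }
    }

    public static void main(String[] args) {
        List<Double> preferences = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            preferences.add(1.0 + (i % 5));
        }
        HoldOutSplitSketch split = new HoldOutSplitSketch(preferences, 0.7, 42L);
        System.out.println("training: " + split.training.size() + ", test: " + split.test.size());
    }
}
```

Because the split is random, the score varies slightly from run to run unless the random seed is fixed, and on very small data sets the variation can be large.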

Next, we can obtain the precision and recall of the recommendations:

<code class="language-java">import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;

RecommenderIRStatsEvaluator statsEvaluator = new GenericRecommenderIRStatsEvaluator();
// Build the same recommender as before. If this runs in the same scope as the
// previous snippet, reuse that recommenderBuilder instead of declaring it again.
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(4, similarity, model);
        return new GenericUserBasedRecommender(model, neighborhood, similarity);
    }
};
// Compute precision and recall when recommending 4 items
IRStatistics stats = statsEvaluator.evaluate(recommenderBuilder, null, model, null, 4,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
System.out.println(stats.getPrecision());
System.out.println(stats.getRecall());
</code>

Item-based recommendation looks much the same, except that there is no UserNeighborhood and User is replaced by Item in the code above. The complete code is:

<code class="language-java">import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

File modelFile = new File("intro.csv");
DataModel model = new FileDataModel(modelFile);
// Item-based recommender: no UserNeighborhood is needed
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        return new GenericItemBasedRecommender(model, similarity);
    }
};
// Get the recommendations: 4 items for user 1
List<RecommendedItem> recommendations = recommenderBuilder.buildRecommender(model).recommend(1, 4);
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}
// Compute the evaluation score
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
// Use 70% of the data to train; test using the other 30%.
double score = evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
System.out.println(score);
// Compute precision and recall
RecommenderIRStatsEvaluator statsEvaluator = new GenericRecommenderIRStatsEvaluator();
// Evaluate precision and recall "at 4":
IRStatistics stats = statsEvaluator.evaluate(recommenderBuilder, null, model, null, 4,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
System.out.println(stats.getPrecision());
System.out.println(stats.getRecall());
</code>

Running in Spark

To run in Spark, the Mahout jars must be added to Spark's classpath. Edit /etc/spark/conf/spark-env.sh and add the following two lines:

<code class="language-properties">SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/mahout/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/mahout/*"
</code>

Then start spark-shell in local mode and run the following code interactively:

<code class="language-scala">import java.io.File
import org.apache.mahout.cf.taste.eval.RecommenderBuilder
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity
import org.apache.mahout.cf.taste.model.DataModel
import org.apache.mahout.cf.taste.recommender.Recommender
import scala.collection.JavaConversions._

// Note: this is a path on the local file system
val model = new FileDataModel(new File("intro.csv"))

val evaluator = new RMSRecommenderEvaluator()
val recommenderBuilder = new RecommenderBuilder {
  override def buildRecommender(dataModel: DataModel): Recommender = {
    val similarity = new LogLikelihoodSimilarity(dataModel)
    new GenericItemBasedRecommender(dataModel, similarity)
  }
}

// Train on 95% of the data; run the evaluation on 5% of the users
val score = evaluator.evaluate(recommenderBuilder, null, model, 0.95, 0.05)
println(s"Score=$score")

val recommender = recommenderBuilder.buildRecommender(model)
// trainingRatings is not defined in this snippet; it is assumed to be a
// previously created RDD of ratings whose elements have a `user` field.
val users = trainingRatings.map(_.user).distinct().take(20)

// Recommend 40 items to each of the 20 users, in parallel
val result = users.par.map { user =>
  user + "," + recommender.recommend(user, 40).map(_.getItemID).mkString(",")
}
</code>

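https://github.com/sujitpal/mia-scala-examples contains a class named RecommenderEvaluator that evaluates the scores of item-based and user-based recommenders under the various similarity measures; it is a useful reference.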

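Running in distributed mode

Mahout provides the class org.apache.mahout.cf.taste.hadoop.item.RecommenderJob, which implements item-based collaborative filtering as MapReduce jobs. Running it without arguments prints its usage: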

<code class="language-bash">$ hadoop jar /usr/lib/mahout/mahout-examples-0.9-cdh5.4.0-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
15/06/10 16:19:34 ERROR common.AbstractJob: Missing required option --similarityClassname
Missing required option --similarityClassname
Usage:
 [--input <input> --output <output> --numRecommendations <numRecommendations>
--usersFile <usersFile> --itemsFile <itemsFile> --filterFile <filterFile>
--booleanData <booleanData> --maxPrefsPerUser <maxPrefsPerUser>
--minPrefsPerUser <minPrefsPerUser> --maxSimilaritiesPerItem
<maxSimilaritiesPerItem> --maxPrefsInItemSimilarity <maxPrefsInItemSimilarity>
--similarityClassname <similarityClassname> --threshold <threshold>
--outputPathForSimilarityMatrix <outputPathForSimilarityMatrix> --randomSeed
<randomSeed> --sequencefileOutput --help --tempDir <tempDir> --startPhase
<startPhase> --endPhase <endPhase>]
--similarityClassname (-s) similarityClassname    Name of distributed
similarity measures class to
instantiate, alternatively
use one of the predefined
similarities
([SIMILARITY_COOCCURRENCE,
SIMILARITY_LOGLIKELIHOOD,
SIMILARITY_TANIMOTO_COEFFICIEN
T, SIMILARITY_CITY_BLOCK,
SIMILARITY_COSINE,
SIMILARITY_PEARSON_CORRELATION
,
SIMILARITY_EUCLIDEAN_DISTANCE]
)
</code>

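As the output shows, the class accepts the following command-line options:

  • --input (path): directory holding the user preference data; it may contain one or more text files of preference data
  • --output (path): output directory for the computed results
  • --numRecommendations (integer): number of items to recommend to each user; default 10
  • --usersFile (path): path containing one or more files of userIDs; recommendations are computed only for the userIDs in those files (optional)
  • --itemsFile (path): path containing one or more files of itemIDs; recommendations are computed only for the itemIDs in those files (optional)
  • --filterFile (path): path whose files contain [userID,itemID] pairs, with userID and itemID separated by a comma; the item of each pair is never recommended to the user of that pair (optional)
  • --booleanData (boolean): set to true if the input data contains no preference values; default false
  • --maxPrefsPerUser (integer): maximum number of preferences considered per user in the final recommendation phase; default 10
  • --minPrefsPerUser (integer): users with fewer preferences than this are ignored in the similarity computation; default 1
  • --maxSimilaritiesPerItem (integer): maximum number of similarities considered per item; default 100
  • --maxPrefsInItemSimilarity (integer; named --maxPrefsPerUserInItemSimilarity in some older Mahout releases): maximum number of preferences considered per user in the item-similarity phase; default 1000 (check the usage output of your version)
  • --similarityClassname (classname): the vector similarity class to use
  • --outputPathForSimilarityMatrix (path): output directory for the similarity matrix
  • --randomSeed: random seed
  • --sequencefileOutput: flag; write the output as a SequenceFile instead of a text file
  • --tempDir (path): directory for temporary files; defaults to the temp directory under the current user's home directory
  • --startPhase: first phase to run
  • --endPhase: last phase to run
  • --threshold (double): item pairs with a similarity below this threshold are discarded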

An example that recommends items using the SIMILARITY_LOGLIKELIHOOD similarity:

<code class="language-bash"><span class="nv">$ </span>hadoop jar /usr/lib/mahout/mahout-examples-<span class="number">0.9</span>-cdh5<span class="number">.4</span><span class="number">.0</span>-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input /tmp/mahout/part-<span class="number">00000</span> --output /tmp/mahout-<span class="keyword">out</span>  -s SIMILARITY_LOGLIKELIHOOD
</code>
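
For readers wondering what SIMILARITY_LOGLIKELIHOOD measures, the sketch below computes a log-likelihood ratio (the G-test statistic) from a 2x2 co-occurrence table for two items. It is an illustration only: the class name `LogLikelihoodSketch` is made up, and the final mapping `1 - 1/(1 + ratio)` is my understanding of how the ratio is turned into a similarity, so treat it as an assumption rather than as Mahout's exact code.

```java
// Illustrative sketch, not Mahout source code.
final class LogLikelihoodSketch {

    private LogLikelihoodSketch() { }

    // x * ln(x), with 0 * ln(0) defined as 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy: xLogX(total) minus the sum of xLogX(count).
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    // k11: users who have both items, k12: only the first item,
    // k21: only the second item, k22: neither item.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp at zero to absorb tiny negative values from rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Assumed mapping of the ratio into the range [0, 1).
    static double similarity(long k11, long k12, long k21, long k22) {
        return 1.0 - 1.0 / (1.0 + logLikelihoodRatio(k11, k12, k21, k22));
    }

    public static void main(String[] unused) {
        System.out.println(logLikelihoodRatio(1, 0, 0, 1));
        System.out.println(similarity(10, 10, 10, 10));
    }
}
```

A table in which the two items are statistically independent, such as 10, 10, 10, 10, gives a ratio of about zero, so the similarity is about zero as well.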

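Once the RecommenderJob command has finished, a temp directory appears under the current user's HDFS home directory. Its location can be changed with the --tempDir (path) argument: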

<code class="language-bash">$ hadoop fs -ls temp
Found 10 items
-rw-r--r--   3 root hadoop          7 2015-06-10 14:42 temp/maxValues.bin
-rw-r--r--   3 root hadoop    5522717 2015-06-10 14:42 temp/norms.bin
drwxr-xr-x   - root hadoop          0 2015-06-10 14:41 temp/notUsed
-rw-r--r--   3 root hadoop          7 2015-06-10 14:42 temp/numNonZeroEntries.bin
-rw-r--r--   3 root hadoop    3452222 2015-06-10 14:41 temp/observationsPerColumn.bin
drwxr-xr-x   - root hadoop          0 2015-06-10 14:47 temp/pairwiseSimilarity
drwxr-xr-x   - root hadoop          0 2015-06-10 14:52 temp/partialMultiply
drwxr-xr-x   - root hadoop          0 2015-06-10 14:39 temp/preparePreferenceMatrix
drwxr-xr-x   - root hadoop          0 2015-06-10 14:50 temp/similarityMatrix
drwxr-xr-x   - root hadoop          0 2015-06-10 14:42 temp/weights
</code>

In the YARN management UI you can see that the command launches 9 jobs, named in order:

  • PreparePreferenceMatrixJob-ItemIDIndexMapper-Reducer
  • PreparePreferenceMatrixJob-ToItemPrefsMapper-Reducer
  • PreparePreferenceMatrixJob-ToItemVectorsMapper-Reducer
  • RowSimilarityJob-CountObservationsMapper-Reducer
  • RowSimilarityJob-VectorNormMapper-Reducer
  • RowSimilarityJob-CooccurrencesMapper-Reducer
  • RowSimilarityJob-UnsymmetrifyMapper-Reducer
  • partialMultiply
  • RecommenderJob-PartialMultiplyMapper-Reducer

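The job names give a rough idea of what each job does. With different input arguments the number of jobs may differ, so test with your own arguments to confirm.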

Check the output on HDFS:

<code class="language-text">843 [10709679:4.8334665,8389878:4.833426,9133835:4.7503786,10366169:4.7503185,9007487:4.750272,8149253:4.7501993,10366165:4.750115,9780049:4.750108,8581254:4.750071,10456307:4.7500467]
6253    [10117445:3.0375953,10340299:3.0340924,8321090:3.0340924,10086615:3.032164,10436801:3.0187714,9668385:3.0141575,8502110:3.013954,10476325:3.0074399,10318667:3.0004222,8320987:3.0003839]
</code>
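
Each output line is a userID, some whitespace, and a bracketed list of itemID:score pairs. If you need to consume these lines in your own code, a parser might look roughly like the sketch below. `RecommendationLine` is a hypothetical class written for this article, and the line layout is inferred only from the sample output above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for one text output line, for example:
//   6253    [10117445:3.0375953,10340299:3.0340924]
final class RecommendationLine {
    final long userId;
    final List<Long> itemIds = new ArrayList<>();
    final List<Double> scores = new ArrayList<>();

    private RecommendationLine(long userId) {
        this.userId = userId;
    }

    static RecommendationLine parse(String line) {
        // Split the userID from the bracketed list at the first whitespace run.
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length != 2 || !parts[1].startsWith("[") || !parts[1].endsWith("]")) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        RecommendationLine result = new RecommendationLine(Long.parseLong(parts[0]));
        String body = parts[1].substring(1, parts[1].length() - 1);
        if (body.isEmpty()) {
            return result; // user with an empty recommendation list
        }
        for (String pair : body.split(",")) {
            int colon = pair.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Unexpected pair: " + pair);
            }
            result.itemIds.add(Long.parseLong(pair.substring(0, colon).trim()));
            result.scores.add(Double.parseDouble(pair.substring(colon + 1).trim()));
        }
        return result;
    }

    public static void main(String[] unused) {
        RecommendationLine r = parse("6253\t[10117445:3.0375953,10340299:3.0340924]");
        System.out.println(r.userId + " -> " + r.itemIds + " " + r.scores);
    }
}
```

The items stay in the order in which they appear on the line; in the sample output that order is by descending score.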

Running it through the Java API:

<code class="language-java">import org.apache.hadoop.mapred.JobConf;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

// Assumes conf (a Hadoop Configuration), inPath, outPath and tmpPath are
// defined by the caller, and that none of the paths contains a space,
// because the argument string is split on spaces below.
StringBuilder sb = new StringBuilder();
sb.append("--input ").append(inPath);
sb.append(" --output ").append(outPath);
sb.append(" --tempDir ").append(tmpPath);
sb.append(" --booleanData true");
// The class name has to be one unbroken string with no embedded newlines.
sb.append(" --similarityClassname "
    + "org.apache.mahout.math.hadoop.similarity."
    + "cooccurrence.measures.EuclideanDistanceSimilarity");
String[] jobArgs = sb.toString().split(" ");

JobConf jobConf = new JobConf(conf);
jobConf.setJobName("MahoutTest");

RecommenderJob job = new RecommenderJob();
// Pass the JobConf built above; the original passed conf and left jobConf unused.
job.setConf(jobConf);
job.run(jobArgs); // run(String[]) is declared to throw Exception
</code>

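From Scala or Spark, the job can be run either through the Java API or from the command line. Spark can then be used to post-process the recommendations, for example by filtering, deduplicating and backfilling them; that part is not covered in this article.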
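
Although that post-processing step is out of scope here, the three operations are easy to picture in plain Java. The sketch below is only an illustration and does not use Spark: `RecommendationPostProcessor`, the exclusion set and the fallback list are all hypothetical. The same per-user logic could be applied inside a Spark map over the parsed output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical post-processing of one user's recommendation list.
final class RecommendationPostProcessor {

    private RecommendationPostProcessor() { }

    // recommended: itemIDs in ranked order (may contain duplicates)
    // excluded:    itemIDs that must never be shown (filtering)
    // fallback:    itemIDs used to top up short lists (backfilling),
    //              for example a list of popular items
    // n:           desired number of items in the result
    static List<Long> postProcess(List<Long> recommended, Set<Long> excluded,
                                  List<Long> fallback, int n) {
        // LinkedHashSet drops duplicates while keeping the ranked order.
        Set<Long> result = new LinkedHashSet<>();
        for (Long id : recommended) {
            if (result.size() >= n) {
                break;
            }
            if (!excluded.contains(id)) {
                result.add(id);
            }
        }
        // Top up from the fallback list if the result is still too short.
        for (Long id : fallback) {
            if (result.size() >= n) {
                break;
            }
            if (!excluded.contains(id)) {
                result.add(id);
            }
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] unused) {
        List<Long> recommended = Arrays.asList(1L, 2L, 2L, 3L);
        Set<Long> excluded = new HashSet<>(Arrays.asList(3L));
        List<Long> fallback = Arrays.asList(2L, 9L, 8L);
        System.out.println(postProcess(recommended, excluded, fallback, 4));
    }
}
```

Items taken from the fallback list always rank after the genuine recommendations, which keeps the original ordering intact.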

 

http://www.tuicool.com/articles/FzmQziz
