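1. Which two kinds of recommendation (filtering) algorithms does Mahout support?
2. How do you compute, in Java, the books that male users have rated?
3. What algorithms are itemEuclidean and userEuclideanNoPref?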
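1. Project background
Amazon is one of the earliest e-commerce sites. It started out selling books online and grew into a general e-commerce platform covering audio and video, consumer electronics, games, household goods, and more. Amazon's recommender system was the earliest product recommender system on the Internet; it has brought Amazon at least 30% of its traffic, along with considerable sales profit.
Today a recommender system is standard equipment for an e-commerce site. Without one, you would be embarrassed to say you are in e-commerce.
2. Requirements analysis
If recommender systems are this important, how should we understand them?
Open the Amazon page for the book Mahout in Action: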
http://www.amazon.com/Mahout-Action-Sean-Owen/dp/1935182684/ref=pd_sim_b_1?ie=UTF8&refRID=0H4H2NSSR8F34R76E2TP
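Elements on the page:
- Ad slot: where advertisers place their ads. The site earns money from online advertising, so this is usually the best position on the page.
- Average rating: users' ratings of the book
- Association rules: a recommendation slot driven by association rules
- Collaborative filtering: a recommendation slot driven by an item-based collaborative filtering algorithm
- Book attributes: page count, publisher, ISBN, language, etc.
- About the author: an introduction to the author and the author's other works
- User ratings: users' rating behavior
- User reviews: the text of users' reviews
From the two screenshots above, it is easy to see how much recommendation matters to Amazon. Apart from the most prominent ad slot, which goes to advertisers who bring in profit directly, the page has four recommendation slots that use different algorithms, from different angles, to guess which products the user will like.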
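3. Data description
Two data files:
- rating.csv: user rating behavior data
- user.csv: user attribute data
1). rating.csv
- 3 columns: user ID, book ID, the user's rating of the book
- Records: 4,000 book ratings
- Users: 200
- Books: 1,000
- Rating: 1-10
Sample data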
1,565,3
1,807,2
1,201,1
1,557,9
1,987,10
1,59,5
1,305,6
1,153,3
1,139,7
1,875,5
1,722,10
2,977,4
2,806,3
2,654,8
2,21,8
2,662,5
2,437,6
2,576,3
2,141,8
2,311,4
2,101,3
2,540,9
2,87,3
2,65,8
2,501,6
2,710,5
2,331,9
2,542,4
2,757,9
2,590,7
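To make the three-column layout concrete, here is a minimal, Mahout-free sketch of parsing such lines and grouping them by user. The class and method names (`RatingParser`, `parseLine`, `groupByUser`) are hypothetical helpers for illustration and are not part of the original project, which loads the file through its own `RecommendFactory.buildDataModel`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the original project. */
class RatingParser {

    /** One row of the rating file: user ID, book ID, score. */
    static final class Rating {
        final long userId;
        final long bookId;
        final int score;

        Rating(long userId, long bookId, int score) {
            this.userId = userId;
            this.bookId = bookId;
            this.score = score;
        }
    }

    /** Parses a line such as "1,565,3" into a Rating. */
    static Rating parseLine(String line) {
        String[] cols = line.trim().split(",");
        if (cols.length != 3) {
            throw new IllegalArgumentException("expected 3 columns: " + line);
        }
        return new Rating(Long.parseLong(cols[0].trim()),
                Long.parseLong(cols[1].trim()),
                Integer.parseInt(cols[2].trim()));
    }

    /** Groups parsed ratings by user ID, keeping the order of first appearance. */
    static Map<Long, List<Rating>> groupByUser(List<String> lines) {
        Map<Long, List<Rating>> byUser = new LinkedHashMap<Long, List<Rating>>();
        for (String line : lines) {
            Rating r = parseLine(line);
            List<Rating> ratings = byUser.get(r.userId);
            if (ratings == null) {
                ratings = new ArrayList<Rating>();
                byUser.put(r.userId, ratings);
            }
            ratings.add(r);
        }
        return byUser;
    }

    public static void main(String[] args) {
        // Three rows taken from the sample data above
        List<String> lines = Arrays.asList("1,565,3", "1,807,2", "2,977,4");
        Map<Long, List<Rating>> byUser = groupByUser(lines);
        for (Map.Entry<Long, List<Rating>> e : byUser.entrySet()) {
            System.out.println("user " + e.getKey() + " rated " + e.getValue().size() + " book(s)");
        }
    }
}
```

The sketch targets the same Java 6 level as the rest of the article, so it avoids the diamond operator and try-with-resources.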
2). user.csv
- 3 columns: user ID, gender, age
- Users: 200
- Gender: M for male, F for female
- Age: between 11 and 80
Sample data
1,M,40
2,M,27
3,M,41
4,F,43
5,F,16
6,M,36
7,F,36
8,F,46
9,M,50
10,M,21
11,F,11
12,M,42
13,F,40
14,F,28
15,M,25
16,M,68
17,M,53
18,F,69
19,F,48
20,F,56
21,F,36
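4. Algorithm models
This article mainly covers Mahout's item-based collaborative filtering model; other algorithm models are not explained here.
For the data above, I will test 7 algorithm combinations. For a detailed explanation of Mahout's algorithm combinations, see the article: Analyzing the Mahout Recommendation Engine from Its Source Code (從源代碼剖析Mahout推薦引擎)
7 algorithm combinations
- userCF1: EuclideanSimilarity + NearestNUserNeighborhood + GenericUserBasedRecommender
- userCF2: LogLikelihoodSimilarity + NearestNUserNeighborhood + GenericUserBasedRecommender
- userCF3: EuclideanSimilarity + NearestNUserNeighborhood + GenericBooleanPrefUserBasedRecommender
- itemCF1: EuclideanSimilarity + GenericItemBasedRecommender
- itemCF2: LogLikelihoodSimilarity + GenericItemBasedRecommender
- itemCF3: EuclideanSimilarity + GenericBooleanPrefItemBasedRecommender
- slopeOne: SlopeOneRecommender
Two of the metrics used to evaluate them:
- Precision: of the items recommended, the fraction that are relevant
- Recall: of the relevant items, the fraction that are recommended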
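As a rough illustration of the two metrics, the sketch below computes precision and recall for a single user from a list of recommended item IDs and a set of relevant item IDs. The class name `PrecisionRecall` and the item IDs are made up for the example. How Mahout's own IR evaluator decides which items count as relevant is not covered here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not Mahout's evaluator. */
class PrecisionRecall {

    /** Number of recommended items that are also relevant (recommended IDs assumed distinct). */
    static int hits(List<Long> recommended, Set<Long> relevant) {
        int hits = 0;
        for (Long id : recommended) {
            if (relevant.contains(id)) {
                hits++;
            }
        }
        return hits;
    }

    /** hits / number of recommended items; 0 when nothing was recommended. */
    static double precision(List<Long> recommended, Set<Long> relevant) {
        if (recommended.isEmpty()) {
            return 0.0;
        }
        return (double) hits(recommended, relevant) / recommended.size();
    }

    /** hits / number of relevant items; 0 when there are no relevant items. */
    static double recall(List<Long> recommended, Set<Long> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) hits(recommended, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up IDs: 3 items recommended, 2 items relevant, 1 in common
        List<Long> recommended = Arrays.asList(10L, 20L, 30L);
        Set<Long> relevant = new HashSet<Long>(Arrays.asList(20L, 40L));
        System.out.println("precision = " + precision(recommended, relevant)); // 1 hit out of 3 recommended
        System.out.println("recall    = " + recall(recommended, relevant));    // 1 hit out of 2 relevant
    }
}
```

An evaluator that reports one number per algorithm would typically average these per-user values over the users it tests.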
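5. Program development
System architecture: Mahout supports two kinds of recommendation algorithms, single-machine and distributed.
- Single-machine: computes in memory on one machine, supports many recommendation algorithms, and is simple to deploy and run, but the amount of data it can handle is limited.
- Distributed: runs on a Hadoop cluster, supports only a few recommendation algorithms, and is complex to deploy and run, but handles massive data.
Development environment
- Win7 64bit
- Java 1.6.0_45
- Maven3
- Eclipse Juno Service Release 2
- Mahout-0.8
- Hadoop-1.1.2
New Java classes:
- BookEvaluator.java: picks the algorithms that score best under the "recommender evaluators"
- BookResult.java: manual comparison of a fixed number of results
- BookFilterGenderResult.java: keeps only the books rated by male users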
1). BookEvaluator.java: picks the algorithms that score best under the "recommender evaluators"
Source code
package org.conan.mymahout.recommendation.book;

import java.io.IOException;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class BookEvaluator {

    final static int NEIGHBORHOOD_NUM = 2;
    final static int RECOMMENDER_NUM = 3;

    public static void main(String[] args) throws TasteException, IOException {
        String file = "datafile/book/rating.csv";
        DataModel dataModel = RecommendFactory.buildDataModel(file);
        userEuclidean(dataModel);
        userLoglikelihood(dataModel);
        userEuclideanNoPref(dataModel);
        itemEuclidean(dataModel);
        itemLoglikelihood(dataModel);
        itemEuclideanNoPref(dataModel);
        slopeOne(dataModel);
    }

    // userCF1 in the list of combinations
    public static RecommenderBuilder userEuclidean(DataModel dataModel) throws TasteException, IOException {
        System.out.println("userEuclidean");
        UserSimilarity userSimilarity = RecommendFactory.userSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
        UserNeighborhood userNeighborhood = RecommendFactory.userNeighborhood(RecommendFactory.NEIGHBORHOOD.NEAREST, userSimilarity, dataModel, NEIGHBORHOOD_NUM);
        RecommenderBuilder recommenderBuilder = RecommendFactory.userRecommender(userSimilarity, userNeighborhood, true);

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
        return recommenderBuilder;
    }

    // userCF2 in the list of combinations
    public static RecommenderBuilder userLoglikelihood(DataModel dataModel) throws TasteException, IOException {
        System.out.println("userLoglikelihood");
        UserSimilarity userSimilarity = RecommendFactory.userSimilarity(RecommendFactory.SIMILARITY.LOGLIKELIHOOD, dataModel);
        UserNeighborhood userNeighborhood = RecommendFactory.userNeighborhood(RecommendFactory.NEIGHBORHOOD.NEAREST, userSimilarity, dataModel, NEIGHBORHOOD_NUM);
        RecommenderBuilder recommenderBuilder = RecommendFactory.userRecommender(userSimilarity, userNeighborhood, true);

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
        return recommenderBuilder;
    }

    // userCF3 in the list of combinations: same as userEuclidean, but the last argument is false (the "NoPref" variant)
    public static RecommenderBuilder userEuclideanNoPref(DataModel dataModel) throws TasteException, IOException {
        System.out.println("userEuclideanNoPref");
        UserSimilarity userSimilarity = RecommendFactory.userSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
        UserNeighborhood userNeighborhood = RecommendFactory.userNeighborhood(RecommendFactory.NEIGHBORHOOD.NEAREST, userSimilarity, dataModel, NEIGHBORHOOD_NUM);
        RecommenderBuilder recommenderBuilder = RecommendFactory.userRecommender(userSimilarity, userNeighborhood, false);

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
        return recommenderBuilder;
    }

    // itemCF1 in the list of combinations
    public static RecommenderBuilder itemEuclidean(DataModel dataModel) throws TasteException, IOException {
        System.out.println("itemEuclidean");
        ItemSimilarity itemSimilarity = RecommendFactory.itemSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
        RecommenderBuilder recommenderBuilder = RecommendFactory.itemRecommender(itemSimilarity, true);

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
        return recommenderBuilder;
    }

    // itemCF2 in the list of combinations
    public static RecommenderBuilder itemLoglikelihood(DataModel dataModel) throws TasteException, IOException {
        System.out.println("itemLoglikelihood");
        ItemSimilarity itemSimilarity = RecommendFactory.itemSimilarity(RecommendFactory.SIMILARITY.LOGLIKELIHOOD, dataModel);
        RecommenderBuilder recommenderBuilder = RecommendFactory.itemRecommender(itemSimilarity, true);

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
        return recommenderBuilder;
    }

    // itemCF3 in the list of combinations: same as itemEuclidean, but the last argument is false (the "NoPref" variant)
    public static RecommenderBuilder itemEuclideanNoPref(DataModel dataModel) throws TasteException, IOException {
        System.out.println("itemEuclideanNoPref");
        ItemSimilarity itemSimilarity = RecommendFactory.itemSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
        RecommenderBuilder recommenderBuilder = RecommendFactory.itemRecommender(itemSimilarity, false);

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
        return recommenderBuilder;
    }

    // slopeOne in the list of combinations
    public static RecommenderBuilder slopeOne(DataModel dataModel) throws TasteException, IOException {
        System.out.println("slopeOne");
        RecommenderBuilder recommenderBuilder = RecommendFactory.slopeOneRecommender();

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
        return recommenderBuilder;
    }
}
Console output:
userEuclidean
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:0.33333325386047363
Recommender IR Evaluator: [Precision:0.3010752688172043,Recall:0.08542713567839195]
userLoglikelihood
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:2.5245869159698486
Recommender IR Evaluator: [Precision:0.11764705882352945,Recall:0.017587939698492466]
userEuclideanNoPref
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:4.288461538461536
Recommender IR Evaluator: [Precision:0.09045226130653267,Recall:0.09296482412060306]
itemEuclidean
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:1.408880928305655
Recommender IR Evaluator: [Precision:0.0,Recall:0.0]
itemLoglikelihood
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:2.448554412835434
Recommender IR Evaluator: [Precision:0.0,Recall:0.0]
itemEuclideanNoPref
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:2.5665197873957957
Recommender IR Evaluator: [Precision:0.6005025125628134,Recall:0.6055276381909548]
slopeOne
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:2.6893078179405814
Recommender IR Evaluator: [Precision:0.0,Recall:0.0]
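Visualizing the "recommender evaluator" output:
[Figures: average distance of the recommendation results; recommender scores]
Judged by precision and recall, only the itemEuclideanNoPref algorithm evaluates well (about 0.60 on both); the results of the other algorithms are not very good.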
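The AVERAGE_ABSOLUTE_DIFFERENCE score in the console output is, as the name says, an average of absolute differences between predicted and actual ratings, so a lower value means predictions closer to the real ratings. The sketch below shows that calculation on its own. The class name `MeanAbsoluteError` and the sample numbers are made up, and it does not reproduce how Mahout splits the data into training and test sets.

```java
/** Hypothetical helper for illustration; not Mahout's evaluator. */
class MeanAbsoluteError {

    /** Mean of |predicted[i] - actual[i]| over all pairs. */
    static double averageAbsoluteDifference(double[] predicted, double[] actual) {
        if (predicted.length != actual.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        if (predicted.length == 0) {
            throw new IllegalArgumentException("at least one rating is required");
        }
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            sum += Math.abs(predicted[i] - actual[i]);
        }
        return sum / predicted.length;
    }

    public static void main(String[] args) {
        // Made-up predictions and true ratings on the 1-10 scale
        double[] predicted = {8.0, 5.0, 3.0};
        double[] actual = {9.0, 5.0, 1.0};
        // |8-9| + |5-5| + |3-1| = 3, divided by 3 ratings
        System.out.println(averageAbsoluteDifference(predicted, actual)); // prints 1.0
    }
}
```

On a 1-10 rating scale, a score of 1.0 would mean the predictions are off by one rating point on average.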
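2). BookResult.java: manual comparison of a fixed number of results
To get differentiated results, we take 4 algorithms (userEuclidean, itemEuclidean, userEuclideanNoPref, itemEuclideanNoPref) and compare their recommendations by hand.
Source code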
package org.conan.mymahout.recommendation.book;

import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class BookResult {

    final static int NEIGHBORHOOD_NUM = 2;
    final static int RECOMMENDER_NUM = 3;

    public static void main(String[] args) throws TasteException, IOException {
        String file = "datafile/book/rating.csv";
        DataModel dataModel = RecommendFactory.buildDataModel(file);
        RecommenderBuilder rb1 = BookEvaluator.userEuclidean(dataModel);
        RecommenderBuilder rb2 = BookEvaluator.itemEuclidean(dataModel);
        RecommenderBuilder rb3 = BookEvaluator.userEuclideanNoPref(dataModel);
        RecommenderBuilder rb4 = BookEvaluator.itemEuclideanNoPref(dataModel);

        // Print the recommendations of all four algorithms for every user
        LongPrimitiveIterator iter = dataModel.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            System.out.print("userEuclidean =>");
            result(uid, rb1, dataModel);
            System.out.print("itemEuclidean =>");
            result(uid, rb2, dataModel);
            System.out.print("userEuclideanNoPref =>");
            result(uid, rb3, dataModel);
            System.out.print("itemEuclideanNoPref =>");
            result(uid, rb4, dataModel);
        }
    }

    public static void result(long uid, RecommenderBuilder recommenderBuilder, DataModel dataModel) throws TasteException {
        List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
        RecommendFactory.showItems(uid, list, false);
    }
}
Console output (only part of the results is shown):
...
userEuclidean =>uid:63,
itemEuclidean =>uid:63,(984,9.000000)(690,9.000000)(943,8.875000)
userEuclideanNoPref =>uid:63,(4,1.000000)(723,1.000000)(300,1.000000)
itemEuclideanNoPref =>uid:63,(867,3.791667)(947,3.083333)(28,2.750000)
userEuclidean =>uid:64,
itemEuclidean =>uid:64,(368,8.615385)(714,8.200000)(290,8.142858)
userEuclideanNoPref =>uid:64,(860,1.000000)(490,1.000000)(64,1.000000)
itemEuclideanNoPref =>uid:64,(409,3.950000)(715,3.830627)(901,3.444048)
userEuclidean =>uid:65,(939,7.000000)
itemEuclidean =>uid:65,(550,9.000000)(334,9.000000)(469,9.000000)
userEuclideanNoPref =>uid:65,(939,2.000000)(185,1.000000)(736,1.000000)
itemEuclideanNoPref =>uid:65,(666,4.166667)(96,3.093931)(345,2.958333)
userEuclidean =>uid:66,
itemEuclidean =>uid:66,(971,9.900000)(656,9.600000)(918,9.577709)
userEuclideanNoPref =>uid:66,(6,1.000000)(492,1.000000)(676,1.000000)
itemEuclideanNoPref =>uid:66,(185,3.650000)(533,3.617307)(172,3.500000)
userEuclidean =>uid:67,
itemEuclidean =>uid:67,(663,9.700000)(987,9.625000)(486,9.600000)
userEuclideanNoPref =>uid:67,(732,1.000000)(828,1.000000)(113,1.000000)
itemEuclideanNoPref =>uid:67,(724,3.000000)(279,2.950000)(890,2.750000)
...
Let's look at the recommendations for the user with uid=65:
Look the user up in the user.csv dataset
> user[65,]
userid gender age
65 65 M 14
User 65 is male, 14 years old.
Taking the top result of the itemEuclideanNoPref algorithm, look at the ratings for the book with bookid=666
> rating[which(rating$bookid==666),]
userid bookid pref
646 44 666 10
1327 89 666 7
2470 165 666 3
2697 179 666 7
Four users rated book 666. Look at the attribute data of these four users
> user[c(44,89,165,179),]
userid gender age
44 44 F 76
89 89 M 40
165 165 F 59
179 179 F 68
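Of these four users, three are female and one is male.
Let's assume that men share book interests with other men, and women share book preferences with other women. Since user 65 is male, we next exclude the female raters and keep only the rating records of male raters.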
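Before turning to the Mahout version, the stated goal can be sketched with plain collections: collect the books that at least one male user has rated, then keep only the candidates in that set. This is an illustration of the idea only, not the article's implementation. `MaleRatedBooks`, `booksRatedByMales` and `keepOnly` are hypothetical names, and book 1001 is a made-up entry added so that the filter has something to drop. The other rows come from the rating and user listings in this section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of the original project. */
class MaleRatedBooks {

    /** ratings: rows of {userId, bookId}. Returns the books rated by at least one "M" user. */
    static Set<Long> booksRatedByMales(long[][] ratings, Map<Long, String> genderByUser) {
        Set<Long> books = new TreeSet<Long>();
        for (long[] row : ratings) {
            if ("M".equals(genderByUser.get(row[0]))) {
                books.add(row[1]);
            }
        }
        return books;
    }

    /** Keeps the candidates that appear in the allowed set, preserving their order. */
    static List<Long> keepOnly(List<Long> candidates, Set<Long> allowed) {
        List<Long> kept = new ArrayList<Long>();
        for (Long id : candidates) {
            if (allowed.contains(id)) {
                kept.add(id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<Long, String> gender = new HashMap<Long, String>();
        gender.put(44L, "F");
        gender.put(89L, "M");
        gender.put(165L, "F");
        gender.put(179L, "F");

        long[][] ratings = {
            {44, 666}, {89, 666}, {165, 666}, {179, 666}, // raters of book 666 listed above
            {44, 1001}                                    // made-up book, rated only by a female user
        };

        Set<Long> allowed = booksRatedByMales(ratings, gender);
        System.out.println(allowed);                                      // prints [666]
        System.out.println(keepOnly(Arrays.asList(666L, 1001L), allowed)); // prints [666]
    }
}
```

Note that a book stays in the set as soon as one male user has rated it, even if most of its raters are female, as is the case for book 666.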
3). BookFilterGenderResult.java: keeps only the books rated by male users
Source code
package org.conan.mymahout.recommendation.book;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.IDRescorer;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class BookFilterGenderResult {

    final static int NEIGHBORHOOD_NUM = 2;
    final static int RECOMMENDER_NUM = 3;

    public static void main(String[] args) throws TasteException, IOException {
        String file = "datafile/book/rating.csv";
        DataModel dataModel = RecommendFactory.buildDataModel(file);
        RecommenderBuilder rb1 = BookEvaluator.userEuclidean(dataModel);
        RecommenderBuilder rb2 = BookEvaluator.itemEuclidean(dataModel);
        RecommenderBuilder rb3 = BookEvaluator.userEuclideanNoPref(dataModel);
        RecommenderBuilder rb4 = BookEvaluator.itemEuclideanNoPref(dataModel);

        long uid = 65;
        System.out.print("userEuclidean =>");
        filterGender(uid, rb1, dataModel);
        System.out.print("itemEuclidean =>");
        filterGender(uid, rb2, dataModel);
        System.out.print("userEuclideanNoPref =>");
        filterGender(uid, rb3, dataModel);
        System.out.print("itemEuclideanNoPref =>");
        filterGender(uid, rb4, dataModel);
    }

    /**
     * Filter by user gender
     */
    public static void filterGender(long uid, RecommenderBuilder recommenderBuilder, DataModel dataModel) throws TasteException, IOException {
        Set<Long> userids = getMale("datafile/book/user.csv");

        // Collect the books that male users have rated
        Set<Long> bookids = new HashSet<Long>();
        for (long uids : userids) {
            LongPrimitiveIterator iter = dataModel.getItemIDsFromUser(uids).iterator();
            while (iter.hasNext()) {
                long bookid = iter.next();
                bookids.add(bookid);
            }
        }

        IDRescorer rescorer = new FilterRescorer(bookids);
        List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM, rescorer);
        RecommendFactory.showItems(uid, list, false);
    }

    /**
     * Get the IDs of the male users
     */
    public static Set<Long> getMale(String file) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(new File(file)));
        Set<Long> userids = new HashSet<Long>();
        String s = null;
        while ((s = br.readLine()) != null) {
            String[] cols = s.split(",");
            if (cols[1].equals("M")) { // male user
                userids.add(Long.parseLong(cols[0]));
            }
        }
        br.close();
        return userids;
    }
}
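/**
 * Re-score the results
 */
class FilterRescorer implements IDRescorer {

    // Holds the book IDs passed in by filterGender (the field is named userids in the original)
    final private Set<Long> userids;

    public FilterRescorer(Set<Long> userids) {
        this.userids = userids;
    }

    @Override
    public double rescore(long id, double originalScore) {
        return isFiltered(id) ? Double.NaN : originalScore;
    }

    // IDRescorer contract: returning true excludes the item from the results.
    // As written, an item is excluded when its ID is in the set passed to the constructor.
    @Override
    public boolean isFiltered(long id) {
        return userids.contains(id);
    }
}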
Console output:
userEuclidean =>uid:65,
itemEuclidean =>uid:65,(784,8.090909)(276,8.000000)(476,7.666667)
userEuclideanNoPref =>uid:65,
itemEuclideanNoPref =>uid:65,(887,2.250000)(356,2.166667)(430,1.866667)
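We find that, with only the male raters' records kept, there is much less data, and the user-based collaborative filtering algorithms no longer produce any results. The result set of the item-based collaborative filtering algorithms has changed as well.
For the itemEuclideanNoPref algorithm, the top-ranked result is the book with ID 887. Let me trace one step further and find out which users rated book 887.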
> rating[which(rating$bookid==887),]
userid bookid pref
1280 85 887 2
1743 119 887 8
2757 184 887 4
2791 186 887 5
Four users rated book 887. Now look at the attributes of these users
> user[c(85,119,184,186),]
userid gender age
85 85 F 31
119 119 F 49
184 184 M 27
186 186 M 35
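Two of them are male and two are female. Since our algorithm has already excluded the female ratings, we can infer that the recommendation of book 887 comes from the two male raters.
Compute the intersection of the books rated by user 65 with those rated by user 184 and by user 186, respectively.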
> rat65<-rating[which(rating$userid==65),]
> rat184<-rating[which(rating$userid==184),]
> rat186<-rating[which(rating$userid==186),]
> intersect(rat65$bookid ,rat184$bookid)
integer(0)
> intersect(rat65$bookid ,rat186$bookid)
[1] 65 375
In the end, we find that user 65 and user 186 both rated book 65 and book 375. Next we print user 186's rating records.
> rat186
userid bookid pref
2790 186 65 7
2791 186 887 5
2792 186 529 3
2793 186 375 6
2794 186 566 7
2795 186 169 4
2796 186 907 1
2797 186 821 2
2798 186 720 5
2799 186 642 5
2800 186 137 3
2801 186 744 1
2802 186 896 2
2803 186 156 6
2804 186 392 3
2805 186 386 3
2806 186 901 7
2807 186 69 6
2808 186 845 6
2809 186 998 3
User 186 also rated book 887, so recommending book 887 to user 65 is reasonable.
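The overlap check done with R's intersect can also be written in Java with `Set.retainAll`. The sketch below is an assumption-laden illustration: `RatedBookOverlap` and `commonBooks` are hypothetical names. Because user 65's full rating list is not shown in this article, the demo compares user 186's books (from the listing above) with the rows for user 2 in the rating sample, which may not be that user's complete list.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of the original project. */
class RatedBookOverlap {

    /** Returns the book IDs present in both collections, in ascending order. */
    static Set<Long> commonBooks(Collection<Long> a, Collection<Long> b) {
        Set<Long> common = new TreeSet<Long>(a);
        common.retainAll(b); // keep only the IDs that also occur in b
        return common;
    }

    public static void main(String[] args) {
        // Books rated by user 186, from the rat186 listing
        List<Long> user186 = Arrays.asList(65L, 887L, 529L, 375L, 566L, 169L, 907L, 821L, 720L, 642L,
                137L, 744L, 896L, 156L, 392L, 386L, 901L, 69L, 845L, 998L);
        // Books rated by user 2, from the rows shown in the rating sample (possibly incomplete)
        List<Long> user2 = Arrays.asList(977L, 806L, 654L, 21L, 662L, 437L, 576L, 141L, 311L, 101L,
                540L, 87L, 65L, 501L, 710L, 331L, 542L, 757L, 590L);

        System.out.println(commonBooks(user2, user186)); // prints [65]
    }
}
```

Copying the first collection into a new TreeSet leaves both inputs unmodified, which matters because retainAll changes the set it is called on.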