This was my first Tianchi competition; the problem statement is straightforward: https://tianchi.aliyun.com/competition/entrance/231721/tab/158.
The training set consists of three files: a user behavior file, a user information file, and an item information file, described below.
user_behavior.csv
The user behavior file, with 4 comma-separated columns. The meaning and content of each column:
Column | Description |
---|---|
User ID | Positive integer identifying a specific user |
Item ID | Positive integer identifying a specific item |
Behavior type | Enum string, one of ('pv', 'buy', 'cart', 'fav') |
Timestamp | Integer in [0, 1382400): the behavior's time as an offset in seconds from 0:00:00 of a certain Friday |
user.csv
The user information file, with 4 comma-separated columns. The meaning and content of each column:
Column | Description |
---|---|
User ID | Positive integer identifying a specific user |
Gender | Integer encoding the user's gender: 0 = male, 1 = female, 2 = unknown |
Age | Positive integer, the user's age |
Purchasing power | Integer in [1, 9], the user's purchasing-power tier |
item.csv
The item information file, with 4 comma-separated columns. The meaning and content of each column:
Column | Description |
---|---|
Item ID | Positive integer identifying a specific item |
Category ID | Positive integer, the category the item belongs to |
Shop ID | Positive integer, the shop the item belongs to |
Brand ID | Integer, the item's brand; -1 means unknown |
Like the training set, the test set contains the same three files. Contestants must predict, for every user in the test set, the top 50 items that user may be interested in. Concretely, under the definitions above, a user's true "future interest" means any of the four behaviors ('pv', 'buy', 'cart', 'fav') occurring within one day after timestamp 1382400. The candidate item pool for interest prediction is the union of the item catalogs (the item.csv files) of the training and test sets.
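Since the timestamp is an offset in seconds from a Friday 0:00:00, the range [0, 1382400) covers exactly 16 days, and the prediction window is day 16. A small illustrative helper (not competition code) mapping an offset to its day index and weekday:

```python
SECONDS_PER_DAY = 86400
MAX_TS = 1382400  # exclusive upper bound of the training window

def day_index(ts):
    """0-based day index of a behavior timestamp offset."""
    return ts // SECONDS_PER_DAY

def weekday(ts):
    """Weekday name, given that day 0 starts on a Friday."""
    names = ["Fri", "Sat", "Sun", "Mon", "Tue", "Wed", "Thu"]
    return names[day_index(ts) % 7]

print(MAX_TS // SECONDS_PER_DAY)   # 16 days in total
print(weekday(0), weekday(86400))  # Fri Sat
```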
My approach to the problem: if a user has bought an item and that item's repurchase rate (across its buyers) is high, mark it as an item of interest; if the repurchase rate is zero, don't recommend it. For items the user has not bought, use the weighted sum of the user's other actions as the recommendation score. Finally, backfill from the user's demographic group (age, gender, purchasing power), recommending the group's items in descending order of weighted action sum.
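To make the weighted-sum idea concrete, here is a minimal pandas sketch on toy data, using the action weights pv=1, fav=2, cart=3, buy=4 that the Spark code below also assigns:

```python
import pandas as pd

# Action weights: stronger actions count more.
WEIGHTS = {"pv": 1, "fav": 2, "cart": 3, "buy": 4}

behaviors = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "item_id": [10, 10, 11, 10],
    "behavior_type": ["pv", "cart", "buy", "pv"],
})
behaviors["score"] = behaviors["behavior_type"].map(WEIGHTS)

# Per-(user, item) weighted action sum -> the ranking signal.
scores = (behaviors.groupby(["user_id", "item_id"])["score"]
          .sum()
          .reset_index()
          .sort_values(["user_id", "score"], ascending=[True, False]))
print(scores)
```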
The idea is simple, but the data files are large. I do data analysis in Python, and operating on them directly as a pandas.DataFrame was extremely slow (it basically hung). I had hit the same problem in an earlier Tianchi offline contest, and the Spark cluster I set up back then came in handy. The workflow: put the CSV files into Hadoop, read them in PySpark, and save them as Parquet; from there everything is a pyspark.sql.dataframe.DataFrame, and the approach above translates into DataFrame operations. The code follows:
import pyspark.sql.functions as F
import numpy as np
import pandas as pd
## csv -> parquet
user_behaviors=sqlContext.read.format('com.databricks.spark.csv').options(header='false',inferschema='true').load('hdfs://master:8020/item_recommend1_testB/user_behavior.csv')
user_behaviors=user_behaviors.withColumnRenamed('_c0','user_id').withColumnRenamed('_c1','item_id').withColumnRenamed('_c2','behavior_type').withColumnRenamed('_c3','time')
user_behaviors.write.format('parquet').mode('overwrite').save('/item_recommend1_testB/user_behaviors.parquet')
users=sqlContext.read.format('com.databricks.spark.csv').options(header='false',inferschema='true').load('hdfs://master:8020/item_recommend1_testB/user.csv')
users=users.withColumnRenamed('_c0','user_id').withColumnRenamed('_c1','gender').withColumnRenamed('_c2','age').withColumnRenamed('_c3','buy_cap')
users.write.format('parquet').mode('overwrite').save('/item_recommend1_testB/users.parquet')
items=sqlContext.read.format('com.databricks.spark.csv').options(header='false',inferschema='true').load('hdfs://master:8020/item_recommend1_testB/item.csv')
items=items.withColumnRenamed('_c0','item_id').withColumnRenamed('_c1','category_id').withColumnRenamed('_c2','shop_id').withColumnRenamed('_c3','brand_id')
items.write.format('parquet').mode('overwrite').save('/item_recommend1_testB/items.parquet')
## raw parquet data
users_test=spark.read.parquet('/item_recommend1_testB/users.parquet')
items_test=spark.read.parquet('/item_recommend1_testB/items.parquet')
user_behaviors_test=spark.read.parquet('/item_recommend1_testB/user_behaviors.parquet')
users=spark.read.parquet('/item_recommend1/users.parquet')
items=spark.read.parquet('/item_recommend1/items.parquet')
user_behaviors=spark.read.parquet('/item_recommend1/user_behaviors.parquet')
## combined (train + test) data
items_total=items.union(items_test).distinct()
users_total=users.union(users_test).distinct()
user_behaviors_total=user_behaviors.union(user_behaviors_test).distinct()
items_total.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/items.parquet')
users_total.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/users.parquet')
user_behaviors_total.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/user_behaviors.parquet')
## user_behaviors_allaction
user_behaviors=spark.read.parquet('/item_recommend1_totalB/user_behaviors.parquet')
user_behaviors_allaction=user_behaviors.withColumn('behavior_value',F.when(user_behaviors['behavior_type']=='pv',1).when(user_behaviors['behavior_type']=='fav',2).when(user_behaviors['behavior_type']=='cart',3).when(user_behaviors['behavior_type']=='buy',4))
user_behaviors_allaction.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/user_behaviors_allaction.parquet')
user_behaviors_allaction=spark.read.parquet('/item_recommend1_totalB/user_behaviors_allaction.parquet')
## combined user / item data
users=spark.read.parquet('/item_recommend1_totalB/users.parquet')
items=spark.read.parquet('/item_recommend1_totalB/items.parquet')
## all days
full_user_behaviors=user_behaviors_allaction.join(users,on='user_id').join(items,on='item_id')
## time-decay weight: divide the action weight by ceil(16 - day), so actions closer to the end of the 16-day window count more
full_user_behaviors=full_user_behaviors.select(['*',(full_user_behaviors.behavior_value/F.ceil(16-full_user_behaviors.time/86400)).alias('behavior_value_new')])
full_user_behaviors.write.format("parquet").mode("overwrite").save('/item_recommend1_totalB/full_user_behaviors.parquet')
full_user_behaviors=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors.parquet')
## group by 'user_id', 'item_id'
full_user_behaviors_user_item=full_user_behaviors.groupBy(['user_id','item_id']).agg({'behavior_value_new':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'}) #,'behavior_type':'count_distinct'
full_user_behaviors_user_item=full_user_behaviors_user_item.withColumnRenamed('sum(behavior_value_new)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_user_item.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_user_item.parquet')
full_user_behaviors_user_item=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item.parquet')
full_user_behaviors_user_item_user=users.join(full_user_behaviors_user_item,on='user_id')
full_user_behaviors_user_item_user_age_item=full_user_behaviors_user_item_user.groupBy(['age','gender','buy_cap','item_id']).agg({'behavior_value_sum':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'})
full_user_behaviors_user_item_user_age_item=full_user_behaviors_user_item_user_age_item.withColumnRenamed('sum(behavior_value_sum)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_user_item_user_age_item.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_user_item_user_age_item.parquet')
full_user_behaviors_user_item_user_age_item=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item_user_age_item.parquet')
## recent window (time > START_TIME; the `_3` suffix below)
START_TIME=86400*8
full_user_behaviors_3=full_user_behaviors.filter('time>'+str(START_TIME))
full_user_behaviors_3.write.format("parquet").mode("overwrite").save('/item_recommend1_totalB/full_user_behaviors_3.parquet')
full_user_behaviors_3=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_3.parquet')
## group by 'user_id', 'item_id'
# count = number of behavior records for each (user, item) pair
full_user_behaviors_user_item_3=full_user_behaviors_3.groupBy(['user_id','item_id']).agg({'behavior_value_new':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'}) #,'behavior_type':'count_distinct'
full_user_behaviors_user_item_3=full_user_behaviors_user_item_3.withColumnRenamed('sum(behavior_value_new)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_user_item_3.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_user_item_3.parquet')
full_user_behaviors_user_item_3=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item_3.parquet')
full_user_behaviors_user_item_user_3=users.join(full_user_behaviors_user_item_3,on='user_id')
full_user_behaviors_user_item_user_age_item_3=full_user_behaviors_user_item_user_3.groupBy(['age','gender','buy_cap','item_id']).agg({'behavior_value_sum':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'})
full_user_behaviors_user_item_user_age_item_3=full_user_behaviors_user_item_user_age_item_3.withColumnRenamed('sum(behavior_value_sum)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_user_item_user_age_item_3.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_user_item_user_age_item_3.parquet')
full_user_behaviors_user_item_user_age_item_3=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item_user_age_item_3.parquet')
## _3
#.filter('count>1')
dup_buyed_items=full_user_behaviors.filter('behavior_value==4').groupBy(['user_id','item_id']).count().groupBy('item_id').agg({'count':'avg'}).withColumnRenamed('avg(count)','count')
dup_buyed_items.approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
dup_buyed_items=dup_buyed_items.filter('count>1.25')
buyed_items=full_user_behaviors.join(users_test,how='left_semi',on='user_id')
full_user_behaviors_buy=buyed_items.filter('behavior_value==4')
full_user_behaviors_buy_dup=full_user_behaviors_buy.select(['user_id','item_id']).distinct().join(dup_buyed_items,how='inner',on='item_id')
# items bought more than once
# full_user_behaviors_buy_dup_count=full_user_behaviors_buy_dup.groupBy('user_id').count()
# full_user_behaviors_buy_dup_count.approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
# actions in the recent window
full_user_behaviors_3_test=full_user_behaviors_3.join(users_test,how='left_semi',on='user_id')
# drop items the user has already bought
full_user_behaviors_3_notbuy=full_user_behaviors_3_test.join(full_user_behaviors_buy,how='left_anti',on=['user_id','item_id'])
full_user_behaviors_3_notbuy_group=full_user_behaviors_3_notbuy.groupBy(['user_id','item_id']).agg({'behavior_value_new':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'})
full_user_behaviors_3_notbuy_group=full_user_behaviors_3_notbuy_group.withColumnRenamed('sum(behavior_value_new)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_3_notbuy_group.approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
full_user_behaviors_3_notbuy_group.approxQuantile('behavior_value_sum',np.linspace(0,1,50).tolist(),0.01)
# among not-yet-bought items, keep those whose weighted action sum clears the threshold (chosen from the quantiles above)
recommended_notbuy=full_user_behaviors_3_notbuy_group.filter('behavior_value_sum>16')
full_user_behaviors_buy_dup.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_buy.parquet')
full_user_behaviors_buy_dup=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_buy.parquet')
recommended_notbuy.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/recommended_notbuy.parquet')
recommended_notbuy=spark.read.parquet('/item_recommend1_totalB/recommended_notbuy.parquet')
recommended=recommended_notbuy.select(['user_id','item_id','count','behavior_value_sum']).union(full_user_behaviors_buy_dup.selectExpr(['user_id','item_id','count','10000 as behavior_value_sum']))
# recommended.groupBy('user_id').count().approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
recommended1=recommended.selectExpr(['user_id','item_id','count','behavior_value_sum'])
# full_user_behaviors_user_item_user_age_item_3v=full_user_behaviors_user_item_user_age_item_3.selectExpr(['age','gender','buy_cap','item_id','behavior_value_sum/count'])
# full_user_behaviors_user_item_user_age_item_3v.approxQuantile('(behavior_value_sum / count)',np.linspace(0,1,50).tolist(),0.01)
# full_user_behaviors_user_item_user_age_itemP=full_user_behaviors_user_item_user_age_item_3.toPandas()
# full_user_behaviors_user_item_user_age_itemP['ac']=full_user_behaviors_user_item_user_age_itemP['behavior_value_sum']/full_user_behaviors_user_item_user_age_itemP['count']
#
# full_user_behaviors_user_item.filter('behavior_value_sum>1').stat.approxQuantile('behavior_value_sum',np.linspace(0,1,50).tolist(),0.01)
#
# full_user_behaviors_user_item.filter('behavior_value_sum>1').stat.approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
@F.pandas_udf("age int,gender int,buy_cap int,item_id int,count int,behavior_value_sum double", F.PandasUDFType.GROUPED_MAP)
def trim(df):
    return df.nlargest(50, 'behavior_value_sum')
recommend_items_age=full_user_behaviors_user_item_user_age_item_3.select(['age','gender','buy_cap','item_id','count', 'behavior_value_sum']).groupby(['age','gender','buy_cap']).apply(trim)
recommend_items_user=users_test.join(recommend_items_age,on=['age','gender','buy_cap']).select(['user_id','item_id','count','behavior_value_sum'])
recommend_items_user.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/recommend_items_user.parquet')
recommend_items_user=spark.read.parquet('/item_recommend1_totalB/recommend_items_user.parquet')
recommend_items_user_all=recommend_items_user.join(recommended1,how='left_anti',on=['user_id','item_id']).union(recommended1)
recommend_items_user_df=recommend_items_user_all.toPandas()
def gen_itemids(r):
    if 'user_id' not in r.columns:
        return
    user_id = r['user_id'].iloc[0]
    l = [user_id]
    r = r.sort_values(by='behavior_value_sum', ascending=False)
    l.extend(list(r['item_id'])[:50])
    return l
recommend_items_user_series=recommend_items_user_df.groupby('user_id').apply(gen_itemids)
notmatched_users=users_test.select('user_id').subtract(recommend_items_user.select('user_id').distinct()).collect()
for a in notmatched_users:
    recommend_items_user_series[a.user_id] = [a.user_id]
need_more_recommends=recommend_items_user_series[recommend_items_user_series.apply(len)<51]
# if a user still has fewer than 50 recommendations, backfill from the all-days data
for a in need_more_recommends:
    print(a[0])
    user = users_test.filter(' user_id='+str(a[0])).collect()[0]
    j = 0
    while len(a) < 51 and j < 4:
        if j == 0:
            pre_condition = ' age=%d and gender=%d and buy_cap=%d ' % (user.age, user.gender, user.buy_cap)
        elif j == 1:
            pre_condition = ' age between %d and %d and gender=%d and buy_cap=%d ' % (
                user.age - 3, user.age + 3, user.gender, user.buy_cap)
        elif j == 2:
            pre_condition = ' age =%d and gender=%d and buy_cap between %d and %d ' % (
                user.age, user.gender, user.buy_cap - 2, user.buy_cap + 2)
        else:
            pre_condition = ' age between %d and %d and gender=%d and buy_cap between %d and %d ' % (
                user.age - 3, user.age + 3, user.gender, user.buy_cap - 2, user.buy_cap + 2)
        condition = pre_condition
        if len(a) > 1:
            condition += (' and item_id not in (%s)' % ','.join([str(i) for i in a[1:]]))
        print(condition)
        recommend_items = full_user_behaviors_user_item_user_age_item.filter(condition).orderBy(F.desc('behavior_value_sum')).limit(51-len(a)).collect()
        for i in recommend_items:
            if i.item_id not in a[1:]:
                a.append(i.item_id)
                if len(a) >= 51:
                    break
        j = j+1
    recommend_items_user_series[a[0]] = a
# need_more_recommends=recommend_items_user_series[recommend_items_user_series.apply(len)<51]
# for a in need_more_recommends:
# user=users_test.filter(' user_id='+str(a[0])).collect()[0]
# j = 1
# while len(a) < 51 and j < 4:
# if j == 0:
# pre_condition = ' age=%d and gender=%d and buy_cap=%d ' % (user.age, user.gender, user.buy_cap)
# elif j == 1:
# pre_condition = ' age between %d and %d and gender=%d and buy_cap=%d ' % (
# user.age - 3, user.age + 3, user.gender, user.buy_cap)
# elif j == 2:
# pre_condition = ' age =%d and gender=%d and buy_cap between %d and %d ' % (
# user.age, user.gender, user.buy_cap - 2, user.buy_cap + 2)
# else:
# pre_condition = ' age between %d and %d and gender=%d and buy_cap between %d and %d ' % (
# user.age - 3, user.age + 3, user.gender, user.buy_cap - 2, user.buy_cap + 2)
# condition = pre_condition
# if len(a) > 1:
# condition += (' and item_id not in (%s)' % ','.join([str(i) for i in a[1:]]) )
# print(condition)
# recommend_items = full_user_behaviors_user_item_user_age_item_3.filter(condition).orderBy(
# F.desc('count'),F.desc('behavior_value_sum')).limit(51-len(a)).collect()
# for i in recommend_items:
# if i.item_id not in a[1:]:
# a.append(i.item_id)
# if len(a) >= 51:
# break
# j=j+1
# recommend_items_user_series[a[0]]=a
df=pd.DataFrame(list(recommend_items_user_series.values),dtype=int)
df.to_csv('/Users/zhangyugu/Downloads/testb_result_081416.csv',float_format='%d',header=False,index=False)
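For reference, each row of the submission CSV built above is a user_id followed by up to 50 recommended item ids. A toy illustration of the row shape (made-up ids, not real output):

```python
import io
import pandas as pd

# Toy rows in the same shape as recommend_items_user_series values:
# [user_id, item_1, item_2, ...]
rows = [[1001, 5, 7, 9], [1002, 3, 4, 8]]
buf = io.StringIO()
pd.DataFrame(rows, dtype=int).to_csv(buf, header=False, index=False)
print(buf.getvalue())
```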
That logic got me through the preliminary round into the semifinal, where submitting code is required. It got me thinking: all of these filtering rules are hand-crafted by me and are not necessarily the best recommendation policy. How could machine learning find an optimal one?
First, the available data: user attributes (age, gender, purchasing power), item attributes (category, brand, shop), and the users' behavior history; the target is a user's interest level in an item. The interest score can be defined freely; the simplest choice is the weighted sum of the user's historical actions on the item. The data above serve as features, and the hard part is extracting the features latent in the behavior history. Here I use Spark's ALS matrix factorization to obtain the user and item factors hidden in the historical behavior:
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
hadoop_preffix="hdfs://master:8020/item_recommend1.db"
sc.setCheckpointDir(hadoop_preffix+'/item_recommend1_als_sc')
user_behaviors=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item.parquet')
users_test=spark.read.parquet('/item_recommend1_testB/users.parquet')
user_behaviors_test=user_behaviors.join(users_test,how='left_semi',on='user_id')
users=spark.read.parquet('/item_recommend1/users.parquet')
user_behaviors_test_full=user_behaviors.join(users_test,how='inner',on='user_id')
user_behaviors_test=user_behaviors_test.select(['user_id','item_id',"behavior_value_sum"])
import pyspark.mllib.recommendation as rd
user_behaviors_test_rdd=user_behaviors_test.rdd
user_behaviors_test_rddRating=user_behaviors_test.rdd.map(lambda r:rd.Rating(r.user_id,r.item_id,r.behavior_value_sum))
user_behaviors_test_rddRating.checkpoint()
user_behaviors_test_rddRating.cache()
model=rd.ALS.trainImplicit(user_behaviors_test_rddRating,8,50,0.01)
userFeatures=model.userFeatures()
def feature_to_row(a):
    l = list(a[1])
    l.insert(0, a[0])
    return l
userFeaturesRowed=userFeatures.map(feature_to_row)
productFeatures=model.productFeatures()
productFeaturesRowed=productFeatures.map(feature_to_row)
userFeaturesDf=sqlContext.createDataFrame(userFeaturesRowed,['user_id','feature_0','feature_1','feature_2','feature_3','feature_4','feature_5','feature_6','feature_7'])
itemFeaturesDf=sqlContext.createDataFrame(productFeaturesRowed,['item_id','item_feature_0','item_feature_1','item_feature_2','item_feature_3','item_feature_4','item_feature_5','item_feature_6','item_feature_7'])
userFeaturesDf.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/userFeaturesDf.parquet')
itemFeaturesDf.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/itemFeaturesDf.parquet')
user_behaviors_test_fullFeatured=user_behaviors_test_full.join(userFeaturesDf,how='inner',on='user_id').join(itemFeaturesDf,how='inner',on='item_id')
user_behaviors_test_fullFeatured.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/user_behaviors_test_fullFeatured.parquet')
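The factor tables written above are low-rank embeddings: ALS reconstructs the interaction matrix as U·Vᵀ, so the model's predicted interest for a (user, item) pair is the dot product of their 8-dimensional factor vectors. A NumPy illustration with hypothetical factors (not real model output):

```python
import numpy as np

rank = 8
rng = np.random.default_rng(0)

# Hypothetical factor vectors, shaped like one entry of
# model.userFeatures() / model.productFeatures().
user_vec = rng.normal(size=rank)
item_vec = rng.normal(size=rank)

# Predicted preference = inner product of the two factor vectors.
predicted_interest = float(user_vec @ item_vec)
print(predicted_interest)
```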
With the feature data ready, the next step is to predict a user's interest in an item from those features; this part is still in progress.
The idea: either an ordinary regression algorithm such as logistic regression, or a more powerful function approximator such as a neural network.
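To show the shape of the first option, here is a minimal logistic-regression baseline on synthetic data (all dimensions and labels here are made up; the real features would be the user/item attributes plus the ALS factors):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for the real feature matrix
# (user attrs + item attrs + 8 user factors + 8 item factors ~ 19 cols).
X = rng.normal(size=(200, 19))
# Synthetic binary label: "interested" if a noisy linear score is positive.
y = (X[:, :8].sum(axis=1) + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
interest_proba = clf.predict_proba(X)[:, 1]  # predicted interest per row
print(interest_proba[:3])
```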
Here I chose the DeepFM model (see https://www.jianshu.com/p/6f1c2643d31b for an introduction); the code follows:
import numpy as np
import tensorflow as tf
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import roc_auc_score
from time import time
from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm
class DeepFM(BaseEstimator, TransformerMixin):
    def __init__(self, feature_size, field_size,
                 embedding_size=8, dropout_fm=[1.0, 1.0],
                 deep_layers=[32, 32], dropout_deep=[0.5, 0.5, 0.5],
                 deep_layers_activation=tf.nn.relu,
                 epoch=10, batch_size=256,
                 learning_rate=0.001, optimizer_type="adam",
                 batch_norm=0, batch_norm_decay=0.995,
                 verbose=False, random_seed=2016,
                 use_fm=True, use_deep=True,
                 loss_type="logloss", eval_metric=roc_auc_score,
                 l2_reg=0.0, greater_is_better=True):
        assert (use_fm or use_deep)
        assert loss_type in ["logloss", "mse"], \
            "loss_type can be either 'logloss' for classification task or 'mse' for regression task"
        self.feature_size = feature_size  # denote as M, size of the feature dictionary
        self.field_size = field_size  # denote as F, size of the feature fields
        self.embedding_size = embedding_size  # denote as K, size of the feature embedding
        self.dropout_fm = dropout_fm
        self.deep_layers = deep_layers
        self.dropout_deep = dropout_deep
        self.deep_layers_activation = deep_layers_activation
        self.use_fm = use_fm
        self.use_deep = use_deep
        self.l2_reg = l2_reg
        self.epoch = epoch
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.optimizer_type = optimizer_type
        self.batch_norm = batch_norm
        self.batch_norm_decay = batch_norm_decay
        self.verbose = verbose
        self.random_seed = random_seed
        self.loss_type = loss_type
        self.eval_metric = eval_metric
        self.greater_is_better = greater_is_better
        self.train_result, self.valid_result = [], []
        self._init_graph()

    def _init_graph(self):
        self.graph = tf.Graph()
        with self.graph.as_default():
            tf.set_random_seed(self.random_seed)
            self.feat_index = tf.placeholder(tf.int32, shape=[None, None],
                                             name="feat_index")  # None * F
            self.feat_value = tf.placeholder(tf.float32, shape=[None, None],
                                             name="feat_value")  # None * F
            self.label = tf.placeholder(tf.float32, shape=[None, 1], name="label")  # None * 1
            self.dropout_keep_fm = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_fm")
            self.dropout_keep_deep = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_deep")
            self.train_phase = tf.placeholder(tf.bool, name="train_phase")
            self.weights = self._initialize_weights()
            # model
            self.embeddings = tf.nn.embedding_lookup(self.weights["feature_embeddings"],
                                                     self.feat_index)  # None * F * K
            feat_value = tf.reshape(self.feat_value, shape=[-1, self.field_size, 1])
            self.embeddings = tf.multiply(self.embeddings, feat_value)
            # ---------- first order term ----------
            self.y_first_order = tf.nn.embedding_lookup(self.weights["feature_bias"], self.feat_index)  # None * F * 1
            self.y_first_order = tf.reduce_sum(tf.multiply(self.y_first_order, feat_value), 2)  # None * F
            self.y_first_order = tf.nn.dropout(self.y_first_order, self.dropout_keep_fm[0])  # None * F
            # ---------- second order term ---------------
            # sum_square part
            self.summed_features_emb = tf.reduce_sum(self.embeddings, 1)  # None * K
            self.summed_features_emb_square = tf.square(self.summed_features_emb)  # None * K
            # square_sum part
            self.squared_features_emb = tf.square(self.embeddings)
            self.squared_features_emb_sum = tf.reduce_sum(self.squared_features_emb, 1)  # None * K
            # second order
            self.y_second_order = 0.5 * tf.subtract(self.summed_features_emb_square, self.squared_features_emb_sum)  # None * K
            self.y_second_order = tf.nn.dropout(self.y_second_order, self.dropout_keep_fm[1])  # None * K
            # ---------- Deep component ----------
            self.y_deep = tf.reshape(self.embeddings, shape=[-1, self.field_size * self.embedding_size])  # None * (F*K)
            self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[0])
            for i in range(0, len(self.deep_layers)):
                self.y_deep = tf.add(tf.matmul(self.y_deep, self.weights["layer_%d" % i]), self.weights["bias_%d" % i])  # None * layer[i] * 1
                if self.batch_norm:
                    self.y_deep = self.batch_norm_layer(self.y_deep, train_phase=self.train_phase, scope_bn="bn_%d" % i)  # None * layer[i] * 1
                self.y_deep = self.deep_layers_activation(self.y_deep)
                self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[1+i])  # dropout at each Deep layer
            # ---------- concat ----------
            if self.use_fm and self.use_deep:
                concat_input = tf.concat([self.y_first_order, self.y_second_order, self.y_deep], axis=1)
            elif self.use_fm:
                concat_input = tf.concat([self.y_first_order, self.y_second_order], axis=1)
            elif self.use_deep:
                concat_input = self.y_deep
            self.out = tf.add(tf.matmul(concat_input, self.weights["concat_projection"]), self.weights["concat_bias"])
            # loss
            if self.loss_type == "logloss":
                self.out = tf.nn.sigmoid(self.out)
                self.loss = tf.losses.log_loss(self.label, self.out)
            elif self.loss_type == "mse":
                self.loss = tf.nn.l2_loss(tf.subtract(self.label, self.out))
            # l2 regularization on weights
            if self.l2_reg > 0:
                self.loss += tf.contrib.layers.l2_regularizer(
                    self.l2_reg)(self.weights["concat_projection"])
                if self.use_deep:
                    for i in range(len(self.deep_layers)):
                        self.loss += tf.contrib.layers.l2_regularizer(
                            self.l2_reg)(self.weights["layer_%d" % i])
            # optimizer
            if self.optimizer_type == "adam":
                self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate, beta1=0.9, beta2=0.999,
                                                        epsilon=1e-8).minimize(self.loss)
            elif self.optimizer_type == "adagrad":
                self.optimizer = tf.train.AdagradOptimizer(learning_rate=self.learning_rate,
                                                           initial_accumulator_value=1e-8).minimize(self.loss)
            elif self.optimizer_type == "gd":
                self.optimizer = tf.train.GradientDescentOptimizer(learning_rate=self.learning_rate).minimize(self.loss)
            elif self.optimizer_type == "momentum":
                self.optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate, momentum=0.95).minimize(
                    self.loss)
            # elif self.optimizer_type == "yellowfin":
            #     self.optimizer = YFOptimizer(learning_rate=self.learning_rate, momentum=0.0).minimize(
            #         self.loss)
            # init
            self.saver = tf.train.Saver()
            init = tf.global_variables_initializer()
            self.sess = self._init_session()
            self.sess.run(init)
            # number of params
            total_parameters = 0
            for variable in self.weights.values():
                shape = variable.get_shape()
                variable_parameters = 1
                for dim in shape:
                    variable_parameters *= dim.value
                total_parameters += variable_parameters
            if self.verbose > 0:
                print("#params: %d" % total_parameters)

    def _init_session(self):
        config = tf.ConfigProto(device_count={"gpu": 1})
        config.gpu_options.allow_growth = True
        return tf.Session(config=config)

    def _initialize_weights(self):
        weights = dict()
        # embeddings
        weights["feature_embeddings"] = tf.Variable(
            tf.random_normal([self.feature_size, self.embedding_size], 0.0, 0.01),
            name="feature_embeddings")  # feature_size * K
        weights["feature_bias"] = tf.Variable(
            tf.random_uniform([self.feature_size, 1], 0.0, 1.0), name="feature_bias")  # feature_size * 1
        # deep layers
        num_layer = len(self.deep_layers)
        input_size = self.field_size * self.embedding_size
        glorot = np.sqrt(2.0 / (input_size + self.deep_layers[0]))
        weights["layer_0"] = tf.Variable(
            np.random.normal(loc=0, scale=glorot, size=(input_size, self.deep_layers[0])), dtype=np.float32)
        weights["bias_0"] = tf.Variable(np.random.normal(loc=0, scale=glorot, size=(1, self.deep_layers[0])),
                                        dtype=np.float32)  # 1 * layers[0]
        for i in range(1, num_layer):
            glorot = np.sqrt(2.0 / (self.deep_layers[i-1] + self.deep_layers[i]))
            weights["layer_%d" % i] = tf.Variable(
                np.random.normal(loc=0, scale=glorot, size=(self.deep_layers[i-1], self.deep_layers[i])),
                dtype=np.float32)  # layers[i-1] * layers[i]
            weights["bias_%d" % i] = tf.Variable(
                np.random.normal(loc=0, scale=glorot, size=(1, self.deep_layers[i])),
                dtype=np.float32)  # 1 * layer[i]
        # final concat projection layer
        if self.use_fm and self.use_deep:
            input_size = self.field_size + self.embedding_size + self.deep_layers[-1]
        elif self.use_fm:
            input_size = self.field_size + self.embedding_size
        elif self.use_deep:
            input_size = self.deep_layers[-1]
        glorot = np.sqrt(2.0 / (input_size + 1))
        weights["concat_projection"] = tf.Variable(
            np.random.normal(loc=0, scale=glorot, size=(input_size, 1)),
            dtype=np.float32)  # layers[i-1]*layers[i]
        weights["concat_bias"] = tf.Variable(tf.constant(0.01), dtype=np.float32)
        return weights

    def batch_norm_layer(self, x, train_phase, scope_bn):
        bn_train = batch_norm(x, decay=self.batch_norm_decay, center=True, scale=True, updates_collections=None,
                              is_training=True, reuse=None, trainable=True, scope=scope_bn)
        bn_inference = batch_norm(x, decay=self.batch_norm_decay, center=True, scale=True, updates_collections=None,
                                  is_training=False, reuse=True, trainable=True, scope=scope_bn)
        z = tf.cond(train_phase, lambda: bn_train, lambda: bn_inference)
        return z

    def get_batch(self, Xi, Xv, y, batch_size, index):
        start = index * batch_size
        end = (index+1) * batch_size
        end = end if end < len(y) else len(y)
        return Xi[start:end], Xv[start:end], [[y_] for y_ in y[start:end]]

    # shuffle three lists simultaneously
    def shuffle_in_unison_scary(self, a, b, c):
        rng_state = np.random.get_state()
        np.random.shuffle(a)
        np.random.set_state(rng_state)
        np.random.shuffle(b)
        np.random.set_state(rng_state)
        np.random.shuffle(c)

    def fit_on_batch(self, Xi, Xv, y):
        feed_dict = {self.feat_index: Xi,
                     self.feat_value: Xv,
                     self.label: y,
                     self.dropout_keep_fm: self.dropout_fm,
                     self.dropout_keep_deep: self.dropout_deep,
                     self.train_phase: True}
        loss, opt = self.sess.run((self.loss, self.optimizer), feed_dict=feed_dict)
        return loss

    def fit(self, Xi_train, Xv_train, y_train,
            Xi_valid=None, Xv_valid=None, y_valid=None,
            early_stopping=False, refit=False):
        """
        :param Xi_train: [[ind1_1, ind1_2, ...], [ind2_1, ind2_2, ...], ..., [indi_1, indi_2, ..., indi_j, ...], ...]
                         indi_j is the feature index of feature field j of sample i in the training set
        :param Xv_train: [[val1_1, val1_2, ...], [val2_1, val2_2, ...], ..., [vali_1, vali_2, ..., vali_j, ...], ...]
                         vali_j is the feature value of feature field j of sample i in the training set
                         vali_j can be either binary (1/0, for binary/categorical features) or float (e.g., 10.24, for numerical features)
        :param y_train: label of each sample in the training set
        :param Xi_valid: list of list of feature indices of each sample in the validation set
        :param Xv_valid: list of list of feature values of each sample in the validation set
        :param y_valid: label of each sample in the validation set
        :param early_stopping: perform early stopping or not
        :param refit: refit the model on the train+valid dataset or not
        :return: None
        """
        has_valid = Xv_valid is not None
        for epoch in range(self.epoch):
            t1 = time()
            self.shuffle_in_unison_scary(Xi_train, Xv_train, y_train)
            total_batch = int(len(y_train) / self.batch_size)
            for i in range(total_batch):
                Xi_batch, Xv_batch, y_batch = self.get_batch(Xi_train, Xv_train, y_train, self.batch_size, i)
                self.fit_on_batch(Xi_batch, Xv_batch, y_batch)
            # evaluate training and validation datasets
            train_result = self.evaluate(Xi_train, Xv_train, y_train)
            self.train_result.append(train_result)
            if has_valid:
                valid_result = self.evaluate(Xi_valid, Xv_valid, y_valid)
                self.valid_result.append(valid_result)
            if self.verbose > 0 and epoch % self.verbose == 0:
                if has_valid:
                    print("[%d] train-result=%.4f, valid-result=%.4f [%.1f s]"
                          % (epoch + 1, train_result, valid_result, time() - t1))
                else:
                    print("[%d] train-result=%.4f [%.1f s]"
                          % (epoch + 1, train_result, time() - t1))
            if has_valid and early_stopping and self.training_termination(self.valid_result):
                break
        # fit a few more epoch on train+valid until result reaches the best_train_score
        if has_valid and refit:
            if self.greater_is_better:
                best_valid_score = max(self.valid_result)
            else:
                best_valid_score = min(self.valid_result)
            best_epoch = self.valid_result.index(best_valid_score)
            best_train_score = self.train_result[best_epoch]
            Xi_train = Xi_train + Xi_valid
            Xv_train = Xv_train + Xv_valid
            y_train = y_train + y_valid
            for epoch in range(100):
                self.shuffle_in_unison_scary(Xi_train, Xv_train, y_train)
                total_batch = int(len(y_train) / self.batch_size)
                for i in range(total_batch):
                    Xi_batch, Xv_batch, y_batch = self.get_batch(Xi_train, Xv_train, y_train,
                                                                 self.batch_size, i)
                    self.fit_on_batch(Xi_batch, Xv_batch, y_batch)
                # check
                train_result = self.evaluate(Xi_train, Xv_train, y_train)
                if abs(train_result - best_train_score) < 0.001 or \
                    (self.greater_is_better and train_result > best_train_score) or \
                    ((not self.greater_is_better) and train_result < best_train_score):
                    break

    def training_termination(self, valid_result):
        if len(valid_result) > 5:
            if self.greater_is_better:
                if valid_result[-1] < valid_result[-2] and \
                    valid_result[-2] < valid_result[-3] and \
                    valid_result[-3] < valid_result[-4] and \
                    valid_result[-4] < valid_result[-5]:
                    return True
            else:
                if valid_result[-1] > valid_result[-2] and \
                    valid_result[-2] > valid_result[-3] and \
                    valid_result[-3] > valid_result[-4] and \
                    valid_result[-4] > valid_result[-5]:
                    return True
        return False

    def predict(self, Xi, Xv):
        """
        :param Xi: list of list of feature indices of each sample in the dataset
        :param Xv: list of list of feature values of each sample in the dataset
        :return: predicted probability of each sample
        """
        # dummy y
        dummy_y = [1] * len(Xi)
        batch_index = 0
        Xi_batch, Xv_batch, y_batch = self.get_batch(Xi, Xv, dummy_y, self.batch_size, batch_index)
        y_pred = None
        while len(Xi_batch) > 0:
            num_batch = len(y_batch)
            feed_dict = {self.feat_index: Xi_batch,
                         self.feat_value: Xv_batch,
                         self.label: y_batch,
                         self.dropout_keep_fm: [1.0] * len(self.dropout_fm),
                         self.dropout_keep_deep: [1.0] * len(self.dropout_deep),
                         self.train_phase: False}
            batch_out = self.sess.run(self.out, feed_dict=feed_dict)
            if batch_index == 0:
                y_pred = np.reshape(batch_out, (num_batch,))
            else:
                y_pred = np.concatenate((y_pred, np.reshape(batch_out, (num_batch,))))
            batch_index += 1
            Xi_batch, Xv_batch, y_batch = self.get_batch(Xi, Xv, dummy_y, self.batch_size, batch_index)
        return y_pred

    def evaluate(self, Xi, Xv, y):
        """
        :param Xi: list of list of feature indices of each sample in the dataset
        :param Xv: list of list of feature values of each sample in the dataset
        :param y: label of each sample in the dataset
        :return: metric of the evaluation
        """
        y_pred = self.predict(Xi, Xv)
        return self.eval_metric(y, y_pred)
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.metrics import make_scorer
from sklearn.model_selection import StratifiedKFold,train_test_split
import sys
sys.path.append('../')
from DeepFM import DeepFM
import matplotlib.pyplot as plt
def _plot_fig(train_results, valid_results, filename):
colors = ["red", "blue", "green"]
xs = np.arange(1, train_results.shape[1]+1)
plt.figure()
legends = []
for i in range(train_results.shape[0]):
plt.plot(xs, train_results[i], color=colors[i], linestyle="solid", marker="o")
plt.plot(xs, valid_results[i], color=colors[i], linestyle="dashed", marker="o")
legends.append("train-%d"%(i+1))
legends.append("valid-%d"%(i+1))
plt.xlabel("Epoch")
    plt.ylabel("MSE")  # the eval metric below is mean squared error, not a Gini score
plt.legend(legends)
plt.savefig(filename)
plt.close()
# gini_scorer = make_scorer(gini_norm,greater_is_better=True,needs_proba=True)
data_dir = '/Users/zhangyugu/Downloads/'
dfTrain=pd.read_csv(data_dir+'user_behaviors_test_full.csv')
# dfTrain=dfTrain.iloc[:400]
cols=['user_id','item_id','brand_id','shop_id','category_id','gender','age','buy_cap']
X_train=dfTrain[cols]
Y_train=dfTrain['behavior_value_sum'].values
X_test=X_train
cat_features_indices=[]
feat_dim=0
feat_dict = {}
numeric_cols=['buy_cap','age']
for col in cols:
us=dfTrain[col].unique()
    if col in numeric_cols:
        feat_dict[col] = feat_dim  # assign the index before advancing; incrementing first collides with the next column's index range
        feat_dim += 1
else:
feat_dict[col] = dict(zip(us, range(feat_dim, len(us) + feat_dim)))
feat_dim+=len(us)
dfTrain_i=X_train.copy()
dfTrain_v=X_train.copy()
for col in cols:
    if col in numeric_cols:
        # numeric field: a single constant feature index (Series.map on an int raises); the raw value stays in dfTrain_v
        dfTrain_i[col] = feat_dict[col]
    else:
        dfTrain_i[col] = dfTrain_i[col].map(feat_dict[col])
        dfTrain_v[col] = 1.
dfTrain_i=dfTrain_i.values.tolist()
dfTrain_v=dfTrain_v.values.tolist()
folds=[0,1,2]
# list(StratifiedKFold(n_splits=3,shuffle=True,random_state=2017).split(X_train,Y_train))
def gini_norm(actual, predict):
    # despite the name, this computes mean squared error, consistent with loss_type='mse' and greater_is_better=False
    return ((np.array(actual) - np.array(predict)) ** 2).sum() / len(actual)
dfm_params={
'use_fm':True,
'use_deep':True,
'embedding_size':24,
'dropout_fm':[1.0,1.0],
'deep_layers':[48,16],
'dropout_deep':[0.5,0.5,0.5],
'deep_layers_activation':tf.nn.relu,
'epoch':30,
'batch_size':1024,
'learning_rate':0.001,
'optimizer_type':'adam',
'batch_norm':1,
'batch_norm_decay':0.995,
'l2_reg':0.01,
'verbose':True,
'eval_metric':gini_norm,
'random_seed':2017,
'greater_is_better':False,
'loss_type':'mse'
}
dfm_params["feature_size"] = feat_dim  # total number of distinct feature values
dfm_params["field_size"] = len(cols)   # number of fields
y_train_meta = np.zeros((dfTrain.shape[0],1),dtype=float)
y_test_meta = np.zeros((dfTrain.shape[0],1),dtype=float)
gini_results_cv=np.zeros(len(folds),dtype=float)
gini_results_epoch_train=np.zeros((len(folds),dfm_params['epoch']),dtype=float)
gini_results_epoch_valid=np.zeros((len(folds),dfm_params['epoch']),dtype=float)
_get = lambda x, l: [x[i] for i in l]
for i in range(len(folds)):
    # vary the seed per fold; a fixed random_state would make all three folds identical
    train_idx, valid_idx = train_test_split(range(len(dfTrain)), random_state=2017 + i, train_size=2.0 / 3.0)
Xi_train_,Xv_train_,y_train_ = _get(dfTrain_i,train_idx), _get(dfTrain_v,train_idx),_get(Y_train,train_idx)
Xi_valid_, Xv_valid_, y_valid_ = _get(dfTrain_i, valid_idx), _get(dfTrain_v, valid_idx), _get(Y_train, valid_idx)
dfm=DeepFM(**dfm_params)
dfm.fit(Xi_train_,Xv_train_,y_train_,Xi_valid_,Xv_valid_,y_valid_)
y_train_meta[valid_idx,0]=dfm.predict(Xi_valid_,Xv_valid_)
    y_test_meta[:, 0] += dfm.predict(dfTrain_i, dfTrain_v)  # accumulate across folds; divided by len(folds) below
    gini_results_cv[i] = gini_norm(y_valid_, y_train_meta[valid_idx, 0])  # take the column so subtraction broadcasts to (n,), not (n, n)
gini_results_epoch_train[i]=dfm.train_result
gini_results_epoch_valid[i]=dfm.valid_result
y_test_meta/=float(len(folds))
print("DeepFm: %.5f (%.5f)" % (gini_results_cv.mean(),gini_results_cv.std()))
filename = data_dir+"DeepFm_Mean%.5f_Std%.5f.csv" % (gini_results_cv.mean(), gini_results_cv.std())
pd.DataFrame({"user_id": X_train['user_id'],"item_id":X_train['item_id'], "target": y_test_meta.flatten()}).to_csv(
filename, index=False, float_format="%.5f")
_plot_fig(gini_results_epoch_train, gini_results_epoch_valid,data_dir+'DeepFm_ItemRecommend1.png')
Running this on my own machine was slow, so to get a GPU I tried Google Colab. I highly recommend it: a convenient environment for running machine-learning jobs on a remote GPU.
接着對當前的模型進行分析,deepfm其實是分成兩步,將題目中的變量 'user_id','item_id','brand_id','shop_id','category_id','gender','age','buy_cap' 因子分解成幾個特徵變量,然後分別對fm的結果和deepnn的結果組合得到最終的評分。
One problem remains: when predicting which items a user is most likely to buy or browse tomorrow, the pool of candidate items is huge. How do we filter it by feature values? In Alibaba's TDM, how is each item's score computed inside the per-user item-ranking structure? And could that score, and the tree structure itself, be built during model construction?
Back to first principles: a user buys an item because of a set of interest indicators, and an item also has scores on those same indicators. In model terms, over one list of features, the user and the item each have their own value vector. The answer then suggests itself: how can we filter quickly on these vectors to retrieve the items we need?
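Under that view, scoring reduces to an inner product, and the top 50 candidates fall out of a partial sort. A minimal numpy sketch, with random vectors standing in for learned user/item embeddings:

```python
import numpy as np

rng = np.random.default_rng(2017)
n_items, k, top_n = 10000, 24, 50

item_vecs = rng.normal(size=(n_items, k))   # one value vector per candidate item
user_vec = rng.normal(size=k)               # the user's vector over the same features

scores = item_vecs @ user_vec               # interest score = inner product
top = np.argpartition(scores, -top_n)[-top_n:]   # find the top 50 in O(n_items), unordered
top = top[np.argsort(scores[top])[::-1]]         # order just those 50, best first
```

For millions of candidate items even this linear scan becomes the bottleneck, which is exactly the gap TDM's tree structure is meant to close.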
Here is how TDM does it. It first builds an initial tree hierarchy from the items' own category structure, with the items as the bottom-level leaf nodes; it then groups the items by hierarchical clustering on their feature-value vectors, and finally ranks the items within each group by their vector values. The ranking result is in turn fed back into the learning process as a feature to be optimized. This is a promising idea, but how to compute the scores of a group of items quickly, and how to rank across groups quickly, remain open problems.
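The clustering step can be illustrated by recursively bisecting the item vectors into a binary tree. This is a toy sketch using a naive 2-means split, not TDM's actual tree-construction algorithm:

```python
import numpy as np

def two_means(X, iters=10, seed=0):
    """Naive 2-means: split the rows of X into two clusters, return their indices."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), 2, replace=False)]           # two initial centroids
    for _ in range(iters):
        assign = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(1)
        for j in (0, 1):
            if (assign == j).any():
                c[j] = X[assign == j].mean(0)
    return np.where(assign == 0)[0], np.where(assign == 1)[0]

def build_tree(X, idx=None, leaf_size=2):
    """Recursively bisect item vectors; leaves hold item indices."""
    if idx is None:
        idx = np.arange(len(X))
    if len(idx) <= leaf_size:
        return list(idx)
    left, right = two_means(X[idx])
    if len(left) == 0 or len(right) == 0:   # degenerate split: stop recursing
        return list(idx)
    return [build_tree(X, idx[left], leaf_size), build_tree(X, idx[right], leaf_size)]

rng = np.random.default_rng(1)
tree = build_tree(rng.normal(size=(16, 4)))   # a binary tree over 16 toy item vectors
```

Items whose vectors are close end up in the same subtree, so retrieval can descend the tree and score only a logarithmic number of nodes instead of the whole item pool.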