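Background
"Brutal Age is a globally popular SLG (strategy) mobile game. According to App Annie, Brutal Age reached No. 1 on the top-grossing games chart in 12 countries and the top 10 in 82 countries. Accurately understanding each player's value matters for the game's advertising strategy and for efficient live operations (such as targeted promotions and bundle recommendations), and it helps give players a more personalized experience. We therefore want to estimate each player's value accurately early in their time in the game. In this competition, contestants are asked to use players' in-game behavior data from their first 7 days to predict each player's total payment amount within 45 days."
Below I analyze and model this game dataset. Because the competition has already ended, the analysis that follows uses only the training set.
Data overview
The training set has 2,288,007 rows and 109 variables, mainly covering each user's in-game activity and payments in the 7 days after registration.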
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, scale
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 150)
plt.style.use('ggplot')
%matplotlib inline
train = pd.read_csv('tap_fun_train.csv')
# Downcast 64-bit numeric columns to 32-bit to reduce memory usage
float_feat = train.select_dtypes(include='float64').columns.values
int_feat = train.select_dtypes(include='int64').columns.values
train[float_feat] = train[float_feat].astype(np.float32)
train[int_feat] = train[int_feat].astype(np.int32)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2288007 entries, 0 to 2288006
Columns: 109 entries, user_id to prediction_pay_price
dtypes: float32(13), int32(95), object(1)
memory usage: 960.1+ MB
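To see what the downcast buys, here is a minimal sketch on a small synthetic frame. The column names `wood` and `level` and the helper `downcast_32` are made up for illustration; they are not part of the dataset or the original notebook.

```python
import numpy as np
import pandas as pd

def downcast_32(df):
    """Return a copy with float64 -> float32 and int64 -> int32 (hypothetical helper)."""
    mapping = {c: np.float32 for c in df.select_dtypes(include='float64').columns}
    mapping.update({c: np.int32 for c in df.select_dtypes(include='int64').columns})
    return df.astype(mapping)

# Synthetic stand-in for the real table
demo = pd.DataFrame({
    'wood': np.arange(1000, dtype=np.float64),
    'level': np.arange(1000, dtype=np.int64),
})
small = downcast_32(demo)

# Bytes used by the column data, ignoring the index
before = demo.memory_usage(index=False).sum()
after = small.memory_usage(index=False).sum()
print(before, after)
```

The trade-off is precision: float32 carries roughly 7 significant decimal digits, and int32 overflows beyond about 2.1 billion, so it is worth checking the column ranges before downcasting.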
train.head()
| | user_id | register_time | wood_add_value | wood_reduce_value | stone_add_value | stone_reduce_value | ivory_add_value | ivory_reduce_value | meat_add_value | meat_reduce_value | magic_add_value | magic_reduce_value | infantry_add_value | infantry_reduce_value | cavalry_add_value | cavalry_reduce_value | shaman_add_value | shaman_reduce_value | wound_infantry_add_value | wound_infantry_reduce_value | wound_cavalry_add_value | wound_cavalry_reduce_value | wound_shaman_add_value | wound_shaman_reduce_value | general_acceleration_add_value | general_acceleration_reduce_value | building_acceleration_add_value | building_acceleration_reduce_value | reaserch_acceleration_add_value | reaserch_acceleration_reduce_value | training_acceleration_add_value | training_acceleration_reduce_value | treatment_acceleraion_add_value | treatment_acceleration_reduce_value | bd_training_hut_level | bd_healing_lodge_level | bd_stronghold_level | bd_outpost_portal_level | bd_barrack_level | bd_healing_spring_level | bd_dolmen_level | bd_guest_cavern_level | bd_warehouse_level | bd_watchtower_level | bd_magic_coin_tree_level | bd_hall_of_war_level | bd_market_level | bd_hero_gacha_level | bd_hero_strengthen_level | bd_hero_pve_level | sr_scout_level | sr_training_speed_level | sr_infantry_tier_2_level | sr_cavalry_tier_2_level | sr_shaman_tier_2_level | sr_infantry_atk_level | sr_cavalry_atk_level | sr_shaman_atk_level | sr_infantry_tier_3_level | sr_cavalry_tier_3_level | sr_shaman_tier_3_level | sr_troop_defense_level | sr_infantry_def_level | sr_cavalry_def_level | sr_shaman_def_level | sr_infantry_hp_level | sr_cavalry_hp_level | sr_shaman_hp_level | sr_infantry_tier_4_level | sr_cavalry_tier_4_level | sr_shaman_tier_4_level | sr_troop_attack_level | sr_construction_speed_level | sr_hide_storage_level | sr_troop_consumption_level | sr_rss_a_prod_levell | sr_rss_b_prod_level | sr_rss_c_prod_level | sr_rss_d_prod_level | sr_rss_a_gather_level | sr_rss_b_gather_level | sr_rss_c_gather_level | sr_rss_d_gather_level | sr_troop_load_level | sr_rss_e_gather_level | sr_rss_e_prod_level | sr_outpost_durability_level | sr_outpost_tier_2_level | sr_healing_space_level | sr_gathering_hunter_buff_level | sr_healing_speed_level | sr_outpost_tier_3_level | sr_alliance_march_speed_level | sr_pvp_march_speed_level | sr_gathering_march_speed_level | sr_outpost_tier_4_level | sr_guest_troop_capacity_level | sr_march_size_level | sr_rss_help_bonus_level | pvp_battle_count | pvp_lanch_count | pvp_win_count | pve_battle_count | pve_lanch_count | pve_win_count | avg_online_minutes | pay_price | pay_count | prediction_pay_price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2018-02-02 19:47:15 | 20125.0 | 3700.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16375.0 | 2000.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.333333 | 0.0 | 0 | 0.0 |
1 | 1593 | 2018-01-26 00:01:05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.333333 | 0.0 | 0 | 0.0 |
2 | 1594 | 2018-01-26 00:01:58 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.166667 | 0.0 | 0 | 0.0 |
3 | 1595 | 2018-01-26 00:02:13 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.166667 | 0.0 | 0 | 0.0 |
4 | 1596 | 2018-01-26 00:02:46 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.333333 | 0.0 | 0 | 0.0 |
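Data analysis
Distribution of payment amounts
The vast majority of users pay nothing, and the mean 45-day payment per player is 1.79 yuan. Even among paying users, amounts cluster in the low range (the payment rate is examined in more detail below). The maximum exceeds 30,000 yuan, so the distribution is extremely skewed.
# Overall distribution of amounts, and the distribution among paying users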
train['prediction_pay_price'].describe()
count 2.288007e+06
mean 1.793456e+00
std 8.844339e+01
min 0.000000e+00
25% 0.000000e+00
50% 0.000000e+00
75% 0.000000e+00
max 3.297781e+04
Name: prediction_pay_price, dtype: float64
fig, ax = plt.subplots(1, 2, figsize=(14, 6))
train['prediction_pay_price'].plot.hist(ax=ax[0])
train.loc[train['prediction_pay_price'] > 0, 'prediction_pay_price'].plot.hist(ax=ax[1])
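One way to put a number on how unbalanced the payments are is the share of revenue that comes from the top fraction of payers. The sketch below uses a tiny hand-made list rather than the real data, and `top_share` is a hypothetical helper, not something from the original analysis.

```python
import numpy as np

def top_share(amounts, top_frac=0.01):
    """Share of total revenue contributed by the top `top_frac` of paying players."""
    paid = np.sort(np.asarray(amounts, dtype=float))
    paid = paid[paid > 0]                      # keep paying players only
    k = max(1, int(np.ceil(top_frac * paid.size)))
    return paid[-k:].sum() / paid.sum()

# Three non-payers, nine small payers and one big spender (made-up numbers)
amounts = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 91]
share = top_share(amounts, top_frac=0.1)       # top 10% of 10 payers = 1 player
print(share)
```

On the real table the same function could be applied to `train['prediction_pay_price']`. For the histograms themselves, a logarithmic axis would likely make the low-amount bulk and the long tail visible at the same time.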
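Payment amounts and number of payers
After registration:
- Within 7 days: 41,439 paying players and about 1.22 million yuan in total payments
- Within 45 days: 45,988 paying players and about 4.10 million yuan in total payments; the number of payers grows by 11%, while the amount grows by more than 235%
- New payers (no payment in the first 7 days, but some payment within 45 days) number 4,549 and contribute only 6.5% of the growth; in other words, 93.5% of the growth comes from players who had already paid within their first 7 days
# Number of payers and amounts, plus the amount contributed by new payers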
first_7_num = len(train[train['pay_price'] > 0])
first_45_num = len(train[train['prediction_pay_price'] > 0])
first_7_amt = train.loc[train['pay_price'] > 0, 'pay_price'].sum()
first_45_amt = train.loc[train['prediction_pay_price'] > 0, 'prediction_pay_price'].sum()
fig, ax = plt.subplots(1, 2, figsize=(14,6))
ax[0].bar(['first_7_num', 'first_45_num'], [first_7_num, first_45_num], color='lightblue')
ax[0].set_xlabel('First_7 VS First_45')
ax[0].set_ylabel('Number of Player Who Paid')
for i, j in zip([0, 1], [first_7_num, first_45_num]):
    ax[0].text(i, j+500, j, ha='center', fontsize=12)
ax[1].bar(['first_7_amt', 'first_45_amt'], [first_7_amt, first_45_amt], color='pink')
ax[1].set_xlabel('First_7 VS First_45')
ax[1].set_ylabel('Paid Amount')
for i, j in zip([0, 1], [first_7_amt, first_45_amt]):
    ax[1].text(i, j+10000, j, ha='center', fontsize=12)
train.loc[(train['pay_price'] == 0) & (train['prediction_pay_price'] > 0), 'prediction_pay_price'].sum()/(first_45_amt-first_7_amt)
0.06452517
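The ratio just computed can be easier to follow on a toy table. The five players below are invented for illustration; only the two column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy data: 7-day and 45-day payment totals for five players
toy = pd.DataFrame({
    'pay_price':            [0.0, 10.0, 0.0, 5.0, 0.0],
    'prediction_pay_price': [0.0, 40.0, 5.0, 5.0, 0.0],
})

# Revenue growth between day 7 and day 45
growth = toy['prediction_pay_price'].sum() - toy['pay_price'].sum()

# New payers: nothing in the first 7 days, something by day 45
new_payers = (toy['pay_price'] == 0) & (toy['prediction_pay_price'] > 0)

# Share of the growth that comes from those new payers
new_payer_share = toy.loc[new_payers, 'prediction_pay_price'].sum() / growth
print(int(new_payers.sum()), growth, round(new_payer_share, 4))
```

Here one new payer contributes 5 of the 35 units of growth; the remaining 30 come from a player who had already paid in the first week, which is the same pattern the real data shows at a much larger scale.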
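Payment rate
The 7-day payment rate is 1.81%, and 57% of those payers paid more than once within the 7 days; the 45-day payment rate is 2.01%.
# Payment rate and repeat-payment rate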
pay_rate_7 = len(train[train['pay_price'] > 0])/len(train)
pay_rate_45 = len(train[train['prediction_pay_price'] > 0])/len(train)
fig, ax = plt.subplots(figsize=(8,6))
ax.bar(['pay_rate_7', 'pay_rate_45'], [pay_rate_7, pay_rate_45], color='gold')
ax.set_xlabel('First_7 VS First_45')
ax.set_ylabel('Pay Rate')
for i, j in zip([0, 1], [pay_rate_7, pay_rate_45]):
    ax.text(i, j+0.0002, str(round(j*100, 2))+'%', ha='center', fontsize=12)
len(train[train['pay_count'] > 1])/len(train[train['pay_count'] > 0])
0.5747484253963657
Average payment per paying user (ARPPU)
Paying players spend 89 yuan on average within 45 days, which is an impressive level.
train.loc[train['prediction_pay_price'] > 0, 'prediction_pay_price'].sum()/len(train[train['prediction_pay_price'] > 0])
89.21306101591719
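The figure above averages over paying players only, which is usually called ARPPU; averaging over all players gives ARPU (the 1.79 mean reported earlier). A minimal sketch of the distinction, with invented amounts and a hypothetical helper `arpu_arppu`:

```python
import pandas as pd

def arpu_arppu(amounts):
    """Return (ARPU, ARPPU) for per-player payment totals."""
    amounts = pd.Series(amounts, dtype=float)
    arpu = amounts.mean()                  # average over all players
    arppu = amounts[amounts > 0].mean()    # average over paying players only
    return arpu, arppu

# Ten players, two of whom pay (made-up numbers)
arpu, arppu = arpu_arppu([0, 0, 0, 0, 0, 0, 0, 0, 20, 80])
print(arpu, arppu)
```

If nobody pays, the ARPPU here comes out as NaN because the mean of an empty selection is undefined.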
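Registration date and number of registrations
The sample covers registrations from January 26 to March 6, 2018. Daily registrations show a slight downward trend over this period; February 19 is a peak, with 117,311 registrations.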
train['register_time'] = pd.to_datetime(train['register_time'])
train['register_date'] = train['register_time'].dt.date
train.groupby('register_date')['prediction_pay_price'].count().plot(figsize=(12,6), color='grey')
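To read the peak day off the counts rather than off the plot, something like the following could be used. The timestamps are invented; on the real table `reg` would be `train['register_time']`.

```python
import pandas as pd

# Hypothetical registration timestamps
reg = pd.Series(pd.to_datetime([
    '2018-01-26 00:01:05', '2018-01-26 13:20:00',
    '2018-01-27 08:00:00',
    '2018-01-28 09:30:00', '2018-01-28 10:00:00', '2018-01-28 23:59:59',
]))

# Truncate to the day and count registrations per day, in date order
daily = reg.dt.floor('D').value_counts().sort_index()

# Day with the most registrations and its count
peak_day, peak_count = daily.idxmax(), int(daily.max())
print(peak_day.date(), peak_count)
```

Using `dt.floor('D')` keeps the values as datetimes, whereas `dt.date` produces plain Python date objects; either works for grouping.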
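Online time
The mean of avg_online_minutes over the first week is about 10 minutes and the maximum exceeds 2,000 minutes, with the distribution concentrated at the low end (the median is under 2 minutes). Online time shows no obvious linear relationship with payment amount, although the many outliers may be masking one.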
train['avg_online_minutes'].describe()
count 2.288007e+06
mean 1.016729e+01
std 3.869698e+01
min 0.000000e+00
25% 5.000000e-01
50% 1.833333e+00
75% 4.833333e+00
max 2.049667e+03
Name: avg_online_minutes, dtype: float64
train['avg_online_minutes'].plot.hist()
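When outliers are suspected of hiding a relationship, a rank correlation is one possible check, since it only looks at ordering. The sketch below compares Pearson and Spearman on invented numbers; `pearson_and_spearman` is a hypothetical helper and this check is not part of the original analysis.

```python
import pandas as pd

def pearson_and_spearman(x, y):
    """Pearson correlation, and Spearman computed as Pearson on the ranks."""
    x, y = pd.Series(x, dtype=float), pd.Series(y, dtype=float)
    pearson = x.corr(y)                    # sensitive to extreme values
    spearman = x.rank().corr(y.rank())     # depends only on the ordering
    return pearson, spearman

# Strictly increasing toy relationship with one extreme online time
minutes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 2000]
paid = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
p, s = pearson_and_spearman(minutes, paid)
print(round(p, 3), round(s, 3))
```

In this toy case the single extreme value drags the Pearson coefficient well below 1 even though payment rises with every extra minute, while the rank-based coefficient is exactly 1. On the real data the two columns to compare would be `avg_online_minutes` and `prediction_pay_price`.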
Feature engineering
Next comes feature engineering: some variables are added, some are removed, and finally the gameplay data is standardized.
train['register_month'] = train['register_time'].dt.month
train['register_day'] = train['register_time'].dt.day
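# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number (pandas >= 1.1)
train['register_week'] = train['register_time'].dt.isocalendar().week.astype(np.int32)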
train['register_weekday'] = train['register_time'].dt.weekday
# Count how many buildings are above a given level
def level_num(num):
    buildings = train.loc[:, 'bd_training_hut_level':'bd_hero_pve_level']
    # Vectorized comparison; same result as a per-cell applymap, but much faster
    return (buildings > num).sum(axis=1)
train['over_0_level'] = level_num(0)
train['over_3_level'] = level_num(3)
train['over_5_level'] = level_num(5)
train['over_10_level'] = level_num(10)
# Count completed research (sum over the sr_* level columns)
train['sr_done_num'] = train.loc[:, 'sr_scout_level':'sr_rss_help_bonus_level'].sum(axis=1)
# Battle ratios; players with no battles produce 0/0 = NaN
train['pvp_lanch_rate'] = train['pvp_lanch_count']/train['pvp_battle_count']
train['pvp_win_rate'] = train['pvp_win_count']/train['pvp_battle_count']
train['pve_lanch_rate'] = train['pve_lanch_count']/train['pve_battle_count']
train['pve_win_rate'] = train['pve_win_count']/train['pve_battle_count']
train['pvp_pve'] = train['pvp_battle_count']+train['pve_battle_count']
train['pvp_pve_win_rate'] = (train['pvp_win_count']+train['pve_win_count'])/train['pvp_pve']
# Treat the NaN ratios (no battles) as 0
train.fillna(0, inplace=True)
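The ratios above rely on a later `fillna(0)` to clean up divisions by zero. An alternative, shown here only as a sketch with a hypothetical helper `safe_rate`, is to guard the division itself so that no NaN or infinite value is ever produced.

```python
import numpy as np

def safe_rate(numerator, denominator):
    """Element-wise numerator / denominator, returning 0 where the denominator is 0."""
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    out = np.zeros_like(num)
    # Divide only where the denominator is non-zero; other entries keep the 0 default
    np.divide(num, den, out=out, where=den != 0)
    return out

wins = [0, 3, 5]
battles = [0, 4, 5]
rates = safe_rate(wins, battles)
print(rates)
```

With the real table this could be called as `safe_rate(train['pvp_win_count'], train['pvp_battle_count'])`, which would also avoid filling unrelated missing values elsewhere in the frame.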
# Find variables whose variance is zero or very small (sklearn offers a related tool, VarianceThreshold)
def low_var(df):
    columns = df.columns.values
    col_list = []
    for column in columns:
        unique_num = df[column].nunique()
        if unique_num > 1:
            unique_rate = unique_num/len(df)
            first_two = list(df[column].value_counts())[0:2]
            # Ratio of the most frequent value's count to the second most frequent
            first_two_rate = first_two[0]/first_two[1]
            if (unique_rate < 0.01) and (first_two_rate > 80):
                col_list.append(column)
        else:
            # A single distinct value means zero variance
            col_list.append(column)
    return col_list
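# Exclude the payment columns from the low-variance screen so they are never dropped
feats = [c for c in train.columns if c not in ('pay_price', 'pay_count', 'prediction_pay_price')]
low_var_feat = low_var(train[feats])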
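A simpler screen than the rule above, offered only as an illustrative alternative and not as the criterion this analysis uses, is to flag columns whose single most frequent value covers nearly all rows. The helper name `near_constant_columns` and the 0.99 threshold are assumptions.

```python
import pandas as pd

def near_constant_columns(df, max_top_share=0.99):
    """Columns whose most frequent value covers more than `max_top_share` of rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the dominant value
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > max_top_share:
            flagged.append(col)
    return flagged

# Made-up frame: one constant column, one almost constant, one varied
demo = pd.DataFrame({
    'always_zero': [0] * 200,
    'almost_zero': [0] * 199 + [1],
    'varied': list(range(200)),
})
flagged = near_constant_columns(demo)
print(flagged)
```

A single-threshold rule like this is easier to explain, but it ignores how many distinct values a column has, which the two-part rule above does take into account.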
# Standardize the gameplay data, then apply principal component analysis
train.loc[:,'wood_add_value':'avg_online_minutes'] = train.loc[:,'wood_add_value':'avg_online_minutes'].apply(scale)
pca = PCA(n_components=20)
pca.fit(train.loc[:,'wood_add_value':'avg_online_minutes'])
plt.plot(range(0, 20), np.cumsum(pca.explained_variance_ratio_))
pca_df = pd.DataFrame(pca.transform(train.loc[:,'wood_add_value':'avg_online_minutes']), columns=['components_'+str(i) for i in range(20)])
train = pd.concat([train, pca_df], axis=1)
train.drop(low_var_feat, axis=1, inplace=True)
train.drop(['register_time', 'user_id'], axis=1, inplace=True)
train['register_date'] = LabelEncoder().fit_transform(train['register_date'])
train['pay_price'] = np.log(train['pay_price'] + 1)
train['prediction_pay_price'] = np.log(train['prediction_pay_price'] + 1)
float_feat = train.select_dtypes(include='float64').columns.values
int_feat = train.select_dtypes(include='int64').columns.values
train[float_feat] = train[float_feat].astype(np.float32)
train[int_feat] = train[int_feat].astype(np.int32)
del pca_df
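The choice of 20 components above is read off the cumulative explained-variance curve. As a sketch of how that choice could be made programmatically, the NumPy-only example below finds the smallest number of components that reaches a target share of variance. The data is synthetic and `components_for_variance` is a hypothetical helper.

```python
import numpy as np

def components_for_variance(X, target=0.9):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`, plus the cumulative curve."""
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to component variances
    s = np.linalg.svd(Xc, compute_uv=False)
    ratio = s**2 / np.sum(s**2)
    cum = np.cumsum(ratio)
    return int(np.searchsorted(cum, target) + 1), cum

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
# Six columns that are noisy mixtures of two underlying factors
X = base @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(500, 6))
k, cum = components_for_variance(X, target=0.9)
print(k, cum.round(3))
```

scikit-learn's `PCA` can do the same selection directly: passing a float between 0 and 1 as `n_components` with `svd_solver='full'` keeps just enough components to explain that share of the variance.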
Modeling
X = list(train.columns.values)
X.remove('prediction_pay_price')
y = 'prediction_pay_price'
gbm = GradientBoostingRegressor(verbose=1, random_state=123)
kfold = KFold(n_splits=4, shuffle=True, random_state=123)
rmse_list = []
feature_importance = pd.DataFrame()
for fold, (train_idx, test_idx) in enumerate(kfold.split(train)):
    print('='*50)
    print('fold {}'.format(fold))
    gbm.fit(train.iloc[train_idx][X], train.iloc[train_idx][y])
    pred = gbm.predict(train.iloc[test_idx][X])
    # rmse = np.sqrt(mean_squared_error(train.iloc[test_idx][y], pred))
    rmse = np.sqrt(mean_squared_error(np.exp(train.iloc[test_idx][y])-1, np.exp(pred)-1))
    rmse_list.append(rmse)
    print('rmse {}'.format(rmse))
    importance_df = pd.DataFrame()
    importance_df['feature'] = X
    importance_df['importance'] = gbm.feature_importances_
    importance_df['fold'] = fold + 1
    feature_importance = pd.concat([feature_importance, importance_df], axis=0)
==================================================
fold 0
Iter Train Loss Remaining Time
1 0.1259 23.10m
2 0.1065 22.28m
3 0.0908 21.97m
4 0.0779 21.70m
5 0.0675 21.52m
6 0.0591 21.26m
7 0.0522 21.12m
8 0.0466 20.85m
9 0.0421 20.59m
10 0.0383 20.32m
20 0.0240 17.91m
30 0.0219 15.75m
40 0.0215 13.55m
50 0.0213 11.28m
60 0.0213 9.01m
70 0.0212 6.75m
80 0.0211 4.49m
90 0.0211 2.24m
100 0.0210 0.00s
rmse 76.81655221125601
==================================================
fold 1
Iter Train Loss Remaining Time
1 0.1260 22.07m
2 0.1066 21.79m
3 0.0908 21.59m
4 0.0780 21.68m
5 0.0675 21.39m
6 0.0591 21.11m
7 0.0522 20.91m
8 0.0466 20.71m
9 0.0420 20.55m
10 0.0383 20.36m
20 0.0239 18.40m
30 0.0218 16.33m
40 0.0214 14.15m
50 0.0213 11.87m
60 0.0212 9.51m
70 0.0211 7.14m
80 0.0210 4.77m
90 0.0210 2.39m
100 0.0209 0.00s
rmse 80.4590172800173
==================================================
fold 2
Iter Train Loss Remaining Time
1 0.1259 22.36m
2 0.1065 22.10m
3 0.0908 21.89m
4 0.0780 21.69m
5 0.0676 21.45m
6 0.0591 21.29m
7 0.0522 21.24m
8 0.0466 21.17m
9 0.0421 21.08m
10 0.0384 20.96m
20 0.0240 19.00m
30 0.0219 16.78m
40 0.0215 14.45m
50 0.0213 12.06m
60 0.0212 9.66m
70 0.0212 7.26m
80 0.0211 4.84m
90 0.0210 2.42m
100 0.0210 0.00s
rmse 53.03963382354496
==================================================
fold 3
Iter Train Loss Remaining Time
1 0.1261 24.34m
2 0.1067 24.01m
3 0.0908 23.76m
4 0.0780 23.52m
5 0.0675 23.23m
6 0.0590 22.94m
7 0.0522 22.69m
8 0.0465 22.43m
9 0.0420 22.19m
10 0.0382 21.93m
20 0.0238 19.49m
30 0.0217 17.07m
40 0.0213 14.64m
50 0.0211 12.19m
60 0.0210 9.74m
70 0.0210 7.24m
80 0.0209 4.79m
90 0.0209 2.37m
100 0.0208 0.00s
rmse 49.03841446816272
print('Cross-validated RMSE on the training set: {}'.format(np.mean(rmse_list)))
Cross-validated RMSE on the training set: 64.83840444574524
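The model is fitted on log(1 + amount), while the RMSE is reported after transforming predictions back to the original scale. The sketch below isolates that back-transformation on invented numbers. `rmse_original_scale` is a hypothetical helper, and clipping negative predictions to zero is an added assumption that the original loop does not make.

```python
import numpy as np

def rmse_original_scale(y_true_log, y_pred_log):
    """RMSE in currency units for targets stored as log(1 + amount)."""
    y_true = np.expm1(np.asarray(y_true_log, dtype=float))
    y_pred = np.expm1(np.asarray(y_pred_log, dtype=float))
    y_pred = np.clip(y_pred, 0, None)   # assumption: payments cannot be negative
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Four players: three predicted almost perfectly, one big spender missed by 100
amounts = np.array([0.0, 0.0, 10.0, 1000.0])
preds = np.array([0.0, 1.0, 10.0, 900.0])
rmse = rmse_original_scale(np.log1p(amounts), np.log1p(preds))
print(rmse)
```

In this toy case almost all of the error comes from the single large spender. Something similar may explain why the fold scores above range from about 49 to about 80: which fold the few highest-paying players land in can dominate an RMSE measured on the original scale.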
feature_importance.groupby('feature')['importance'].mean().sort_values(ascending=True).plot.barh(figsize=(8, 16))
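Conclusion
I have not played this game, so my understanding of its content and metrics is incomplete in places; playing it would probably yield new insight and other important features. The model above was not tuned, so the results can likely be improved with optimization. One could also segment users by engagement and compare how high-spending and ordinary players differ in their in-game attributes, among other directions.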