Game Payment Amount: Based on DC Game Data (Brutal Age)

Background


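"Brutal Age is a globally popular SLG mobile game. According to App Annie, Brutal Age reached No. 1 on the top-grossing games chart in 12 countries and the top 10 in 82 countries. Accurately understanding the value of each player matters greatly for the game's ad-placement strategy and for efficient live operations (such as targeted promotions and bundle recommendations), and it helps give players a more personalized experience. We therefore want to estimate players' value accurately early after they enter the game. In this competition, participants are asked to use players' in-game behavior data from their first 7 days to predict each player's total payment amount within 45 days."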

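The rest of this post analyzes and models this game dataset. Because the competition has already ended, the analysis below uses only the training set.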

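Data Overview

The training set contains 2,288,007 rows and 109 variables, mainly covering each user's in-game behavior and payment information during the 7 days after registration.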

import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, scale
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 150)
plt.style.use('ggplot')
%matplotlib inline
train = pd.read_csv('tap_fun_train.csv')
float_feat = train.select_dtypes(include='float64').columns.values
int_feat = train.select_dtypes(include='int64').columns.values
train[float_feat] = train[float_feat].astype(np.float32)
train[int_feat] = train[int_feat].astype(np.int32)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2288007 entries, 0 to 2288006
Columns: 109 entries, user_id to prediction_pay_price
dtypes: float32(13), int32(95), object(1)
memory usage: 960.1+ MB
train.head()
user_id register_time wood_add_value wood_reduce_value stone_add_value stone_reduce_value ivory_add_value ivory_reduce_value meat_add_value meat_reduce_value magic_add_value magic_reduce_value infantry_add_value infantry_reduce_value cavalry_add_value cavalry_reduce_value shaman_add_value shaman_reduce_value wound_infantry_add_value wound_infantry_reduce_value wound_cavalry_add_value wound_cavalry_reduce_value wound_shaman_add_value wound_shaman_reduce_value general_acceleration_add_value general_acceleration_reduce_value building_acceleration_add_value building_acceleration_reduce_value reaserch_acceleration_add_value reaserch_acceleration_reduce_value training_acceleration_add_value training_acceleration_reduce_value treatment_acceleraion_add_value treatment_acceleration_reduce_value bd_training_hut_level bd_healing_lodge_level bd_stronghold_level bd_outpost_portal_level bd_barrack_level bd_healing_spring_level bd_dolmen_level bd_guest_cavern_level bd_warehouse_level bd_watchtower_level bd_magic_coin_tree_level bd_hall_of_war_level bd_market_level bd_hero_gacha_level bd_hero_strengthen_level bd_hero_pve_level sr_scout_level sr_training_speed_level sr_infantry_tier_2_level sr_cavalry_tier_2_level sr_shaman_tier_2_level sr_infantry_atk_level sr_cavalry_atk_level sr_shaman_atk_level sr_infantry_tier_3_level sr_cavalry_tier_3_level sr_shaman_tier_3_level sr_troop_defense_level sr_infantry_def_level sr_cavalry_def_level sr_shaman_def_level sr_infantry_hp_level sr_cavalry_hp_level sr_shaman_hp_level sr_infantry_tier_4_level sr_cavalry_tier_4_level sr_shaman_tier_4_level sr_troop_attack_level sr_construction_speed_level sr_hide_storage_level sr_troop_consumption_level sr_rss_a_prod_levell sr_rss_b_prod_level sr_rss_c_prod_level sr_rss_d_prod_level sr_rss_a_gather_level sr_rss_b_gather_level sr_rss_c_gather_level sr_rss_d_gather_level sr_troop_load_level sr_rss_e_gather_level sr_rss_e_prod_level sr_outpost_durability_level sr_outpost_tier_2_level sr_healing_space_level 
sr_gathering_hunter_buff_level sr_healing_speed_level sr_outpost_tier_3_level sr_alliance_march_speed_level sr_pvp_march_speed_level sr_gathering_march_speed_level sr_outpost_tier_4_level sr_guest_troop_capacity_level sr_march_size_level sr_rss_help_bonus_level pvp_battle_count pvp_lanch_count pvp_win_count pve_battle_count pve_lanch_count pve_win_count avg_online_minutes pay_price pay_count prediction_pay_price
0 1 2018-02-02 19:47:15 20125.0 3700.0 0.0 0.0 0.0 0.0 16375.0 2000.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 50 0 50 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.333333 0.0 0 0.0
1 1593 2018-01-26 00:01:05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.333333 0.0 0 0.0
2 1594 2018-01-26 00:01:58 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.166667 0.0 0 0.0
3 1595 2018-01-26 00:02:13 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3.166667 0.0 0 0.0
4 1596 2018-01-26 00:02:46 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.333333 0.0 0 0.0

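Data Analysis

Distribution of Payment Amounts

The vast majority of users pay nothing. The average payment per user within 45 days is 1.79 yuan, and even among paying users the amounts are concentrated in the low range (the payment rate is examined in detail later). The maximum exceeds 30,000 yuan, so the distribution is extremely skewed.

# Overall distribution of amounts, and distribution among paying users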
train['prediction_pay_price'].describe()
count    2.288007e+06
mean     1.793456e+00
std      8.844339e+01
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      3.297781e+04
Name: prediction_pay_price, dtype: float64
fig, ax = plt.subplots(1, 2, figsize=(14, 6))
train['prediction_pay_price'].plot.hist(ax=ax[0])
train.loc[train['prediction_pay_price'] > 0, 'prediction_pay_price'].plot.hist(ax=ax[1])

[Figure: histograms of prediction_pay_price for all users (left) and for paying users only (right)]
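
With a target this skewed, a plain histogram shows little beyond one tall bar at zero. A minimal sketch of a few summary numbers that describe the skew more directly is given here; the series below is made-up toy data standing in for train['prediction_pay_price'], and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train['prediction_pay_price']; the values are made up.
pay = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.99, 4.99, 9.99, 3000.0])

zero_share = (pay == 0).mean()                  # fraction of users who paid nothing
payers = pay[pay > 0]
payer_quantiles = payers.quantile([0.5, 0.9])   # typical vs. high amounts among payers
top_share = payers.max() / pay.sum()            # revenue share of the single largest payer
log_pay = np.log1p(pay)                         # log(1 + x) compresses the long right tail

print(f"zero share: {zero_share:.2f}")
print(payer_quantiles)
print(f"top payer share of revenue: {top_share:.3f}")
```

Plotting log_pay instead of the raw amounts would likely make the shape among paying users easier to read.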

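Payment Amount and Number of Payers

After registration:

  • Within 7 days: 41,439 paying players and 1.22 million yuan in total payments
  • Within 45 days: 45,988 paying players and 4.10 million yuan in total payments; the number of payers grows by 11%, while the amount grows by more than 235%
  • New paying players (no payment within 7 days, but payment within 45 days): 4,549 players, who contribute only 6.5% of the total growth; in other words, 93.5% of the growth comes from players who had already paid within the first 7 days
# Payer counts and amounts, and the amount contributed by new payers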
first_7_num = len(train[train['pay_price'] > 0])
first_45_num = len(train[train['prediction_pay_price'] > 0])
first_7_amt = train.loc[train['pay_price'] > 0, 'pay_price'].sum()
first_45_amt = train.loc[train['prediction_pay_price'] > 0, 'prediction_pay_price'].sum()
fig, ax = plt.subplots(1, 2, figsize=(14,6))
ax[0].bar(['first_7_num', 'first_45_num'], [first_7_num, first_45_num], color='lightblue')
ax[0].set_xlabel('First_7 VS First_45')
ax[0].set_ylabel('Number of Player Who Paid')
for i, j in zip([0, 1], [first_7_num, first_45_num]):
    ax[0].text(i, j+500, j, ha='center', fontsize=12)
ax[1].bar(['first_7_amt', 'first_45_amt'], [first_7_amt, first_45_amt], color='pink')
ax[1].set_xlabel('First_7 VS First_45')
ax[1].set_ylabel('Paid Amount')
for i, j in zip([0, 1], [first_7_amt, first_45_amt]):
    ax[1].text(i, j+10000, j, ha='center', fontsize=12)

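[Figure: bar charts of the number of paying players (left) and the total amount paid (right), first 7 days vs. first 45 days]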

train.loc[(train['pay_price'] == 0) & (train['prediction_pay_price'] > 0), 'prediction_pay_price'].sum()/(first_45_amt-first_7_amt)
0.06452517
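
The split of the growth between new payers and players who had already paid in the first 7 days can be made explicit. The sketch below uses a tiny made-up DataFrame with the dataset's column names, and assumes that prediction_pay_price already includes the first 7 days, so the two shares add up to 1.

```python
import pandas as pd

# Toy data; column names follow the dataset, values are made up.
df = pd.DataFrame({
    "pay_price":            [0.0, 0.0, 5.0, 10.0],   # paid in the first 7 days
    "prediction_pay_price": [0.0, 3.0, 5.0, 40.0],   # paid in the first 45 days
})

# Total growth in revenue between day 7 and day 45.
growth = df["prediction_pay_price"].sum() - df["pay_price"].sum()

new_payer = (df["pay_price"] == 0) & (df["prediction_pay_price"] > 0)
early_payer = df["pay_price"] > 0

# New payers contribute their whole 45-day amount; early payers only the increase.
from_new = df.loc[new_payer, "prediction_pay_price"].sum()
from_early = (df.loc[early_payer, "prediction_pay_price"]
              - df.loc[early_payer, "pay_price"]).sum()

print(from_new / growth, from_early / growth)
```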

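Payment Rate

The payment rate within 7 days is 1.81%, and 57% of those payers paid more than once; the payment rate within 45 days is 2.01%.

# Payment rate and repeat-payment rate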
pay_rate_7 = len(train[train['pay_price'] > 0])/len(train)
pay_rate_45 = len(train[train['prediction_pay_price'] > 0])/len(train)
fig, ax = plt.subplots(figsize=(8,6))
ax.bar(['pay_rate_7', 'pay_rate_45'], [pay_rate_7, pay_rate_45], color='gold')
ax.set_xlabel('First_7 VS First_45')
ax.set_ylabel('Pay Rate')
for i, j in zip([0, 1], [pay_rate_7, pay_rate_45]):
    ax.text(i, j+0.0002, str(round(j*100, 2))+'%', ha='center', fontsize=12)

[Figure: bar chart of the payment rate, first 7 days vs. first 45 days]

len(train[train['pay_count'] > 1])/len(train[train['pay_count'] > 0])
0.5747484253963657

Average Revenue per Paying User (ARPPU)

Paying players spend 89 yuan on average; reaching that level within 45 days is quite impressive.

train.loc[train['prediction_pay_price'] > 0, 'prediction_pay_price'].sum()/len(train[train['prediction_pay_price'] > 0])
89.21306101591719
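
The figure above averages over paying players only, which is usually called ARPPU, whereas the 1.79 yuan reported earlier averages over all players (ARPU). The two are linked by the payment rate, and the article's numbers are consistent with that: 2.01% × 89.21 ≈ 1.79. A minimal sketch of the identity on made-up data:

```python
import pandas as pd

# Made-up 45-day spend for five users.
spend = pd.Series([0.0, 0.0, 0.0, 10.0, 30.0])

arpu = spend.mean()                  # average over all users
arppu = spend[spend > 0].mean()      # average over paying users only
pay_rate = (spend > 0).mean()        # share of users who paid anything

# Identity: ARPU = payment rate * ARPPU
print(arpu, arppu, pay_rate, pay_rate * arppu)
```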

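Registration Date and Number of Registrations

The sample spans January 26 to March 6, 2018. Daily registrations show a slight downward trend over this period; February 19 is a peak, with 117,311 registrations.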

train['register_time'] = pd.to_datetime(train['register_time'])
train['register_date'] = train['register_time'].dt.date
train.groupby('register_date')['prediction_pay_price'].count().plot(figsize=(12,6), color='grey')

[Figure: line chart of daily registrations by register_date]
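
The peak day can also be read off the grouped counts directly rather than from the chart. A small sketch on made-up timestamps (only the column name register_time comes from the dataset):

```python
import pandas as pd

# Made-up registration timestamps.
reg = pd.DataFrame({"register_time": pd.to_datetime([
    "2018-01-26 00:01:05", "2018-01-26 09:30:00",
    "2018-02-19 08:00:00", "2018-02-19 12:00:00", "2018-02-19 23:59:59",
    "2018-03-06 10:00:00",
])})

# Registrations per calendar day.
daily = reg.groupby(reg["register_time"].dt.date).size()

peak_date = daily.idxmax()    # day with the most registrations
peak_count = daily.max()      # number of registrations on that day
print(peak_date, peak_count)
```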

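Online Time

Average play time per user in the first week is about 10 minutes (the mean of avg_online_minutes), and the maximum exceeds 2,000 minutes; the distribution is concentrated at the low end. There is no obvious linear relationship between play time and payment amount, although that may be because there are too many outliers.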

train['avg_online_minutes'].describe()
count    2.288007e+06
mean     1.016729e+01
std      3.869698e+01
min      0.000000e+00
25%      5.000000e-01
50%      1.833333e+00
75%      4.833333e+00
max      2.049667e+03
Name: avg_online_minutes, dtype: float64
train['avg_online_minutes'].plot.hist()

[Figure: histogram of avg_online_minutes]
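
One way to check the relationship between play time and spending when outliers are a concern is to compare the Pearson correlation with a rank-based (Spearman) correlation, which depends only on the ordering of the values. This is a sketch on made-up numbers, not a result from the dataset; Spearman is computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Made-up values; only the column names come from the dataset.
df = pd.DataFrame({
    "avg_online_minutes":   [0.5, 1.0, 2.0, 5.0, 20.0, 300.0],
    "prediction_pay_price": [0.0, 0.0, 0.0, 0.99, 4.99, 9.99],
})

# Pearson: linear association, sensitive to extreme values.
pearson = df["avg_online_minutes"].corr(df["prediction_pay_price"])

# Spearman: Pearson correlation of the ranks (ties get their average rank).
spearman = df["avg_online_minutes"].rank().corr(df["prediction_pay_price"].rank())

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```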

Feature Engineering

Next comes feature engineering: some variables are added, some are removed, and finally the game data are standardized.

train['register_month'] = train['register_time'].dt.month
train['register_day'] = train['register_time'].dt.day
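# Series.dt.week was removed in pandas 2.0; isocalendar().week returns the ISO week number
train['register_week'] = train['register_time'].dt.isocalendar().week.astype(np.int32)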
train['register_weekday'] = train['register_time'].dt.weekday
# Count how many buildings are above a given level
def level_num(num):
    levels = train.loc[:, 'bd_training_hut_level':'bd_hero_pve_level']
    # Vectorized: compare every building level to num, then count the True values per row
    return (levels > num).sum(axis=1)
train['over_0_level'] = level_num(0)
train['over_3_level'] = level_num(3)
train['over_5_level'] = level_num(5)
train['over_10_level'] = level_num(10) 
# Count completed research (row-wise sum of the sr_* level columns)
train['sr_done_num'] = train.loc[:,'sr_scout_level':'sr_rss_help_bonus_level'].sum(axis=1)
train['pvp_lanch_rate'] = train['pvp_lanch_count']/train['pvp_battle_count']
train['pvp_win_rate'] = train['pvp_win_count']/train['pvp_battle_count']
train['pve_lanch_rate'] = train['pve_lanch_count']/train['pve_battle_count']
train['pve_win_rate'] = train['pve_win_count']/train['pve_battle_count']
train['pvp_pve'] = train['pvp_battle_count']+train['pve_battle_count']
train['pvp_pve_win_rate'] = (train['pvp_win_count']+train['pve_win_count'])/train['pvp_pve']
train.fillna(0, inplace=True)
# Find variables with zero or very small variance (sklearn has a similar method that could be used)
def low_var(df):
    columns = df.columns.values
    col_list = []
    for column in columns:
        unique_num = df[column].nunique()
        if unique_num > 1:
            unique_rate = unique_num/len(df)
            first_two = list(df[column].value_counts())[0:2]
            first_two_rate = first_two[0]/first_two[1]
            if (unique_rate < 0.01) and (first_two_rate > 80):
                col_list.append(column)
        else:
            col_list.append(column)
    return col_list
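# Exclude the payment columns (pay_price, pay_count, prediction_pay_price) from the low-variance check
feats = [c for c in train.columns if c not in ('pay_price', 'pay_count', 'prediction_pay_price')]
low_var_feat = low_var(train[feats])
# Standardize the game data and run principal component analysis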
train.loc[:,'wood_add_value':'avg_online_minutes'] = train.loc[:,'wood_add_value':'avg_online_minutes'].apply(scale)
pca = PCA(n_components=20)
pca.fit(train.loc[:,'wood_add_value':'avg_online_minutes'])
plt.plot(range(0, 20), np.cumsum(pca.explained_variance_ratio_))
pca_df = pd.DataFrame(pca.transform(train.loc[:,'wood_add_value':'avg_online_minutes']), columns=['components_'+str(i) for i in range(20)])

[Figure: cumulative explained variance ratio of the first 20 principal components]
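
The cumulative explained-variance curve can also be used to pick the number of components programmatically. The sketch below runs on synthetic data, and the 95% target is an illustrative assumption rather than a value used in the article.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 3 latent factors spread over 10 observed columns, plus small noise.
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

pca = PCA(n_components=10).fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

target = 0.95  # illustrative threshold, not taken from the article
# Index of the first cumulative ratio >= target, converted to a component count.
n_keep = int(np.searchsorted(cum_var, target) + 1)

print(n_keep, cum_var[:n_keep])
```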

train = pd.concat([train, pca_df], axis=1)
train.drop(low_var_feat, axis=1, inplace=True)
train.drop(['register_time', 'user_id'], axis=1, inplace=True)
train['register_date'] = LabelEncoder().fit_transform(train['register_date'])
train['pay_price'] = np.log(train['pay_price'] + 1)
train['prediction_pay_price'] = np.log(train['prediction_pay_price'] + 1)
float_feat = train.select_dtypes(include='float64').columns.values
int_feat = train.select_dtypes(include='int64').columns.values
train[float_feat] = train[float_feat].astype(np.float32)
train[int_feat] = train[int_feat].astype(np.int32)
del pca_df
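
The similar method in sklearn mentioned in the comment above is presumably VarianceThreshold from sklearn.feature_selection. A minimal sketch on a made-up frame follows; the threshold value is an illustrative assumption. The criterion also differs from the article's low_var heuristic: VarianceThreshold looks only at each column's variance, so it depends on the column's scale, while low_var looks at the number of distinct values and the ratio of the two most frequent ones.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up columns: one constant, one almost constant, one clearly varying.
df = pd.DataFrame({
    "constant":   [0, 0, 0, 0, 0, 0],
    "near_const": [0, 0, 0, 0, 0, 1],
    "varied":     [1, 5, 2, 8, 3, 9],
})

selector = VarianceThreshold(threshold=0.2)  # illustrative threshold
selector.fit(df)

# get_support() is True for columns that are kept; invert it to list the dropped ones.
dropped = df.columns[~selector.get_support()].tolist()
print(dropped)
```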

Modeling

X = list(train.columns.values)
X.remove('prediction_pay_price')
y = 'prediction_pay_price'
gbm = GradientBoostingRegressor(verbose=1, random_state=123)
kfold = KFold(n_splits=4, shuffle=True, random_state=123)
rmse_list = []
feature_importance = pd.DataFrame()
for fold, (train_idx, test_idx) in enumerate(kfold.split(train)):
    print('='*50)
    print('fold {}'.format(fold))
    gbm.fit(train.iloc[train_idx][X], train.iloc[train_idx][y])
    pred = gbm.predict(train.iloc[test_idx][X])
#    rmse = np.sqrt(mean_squared_error(train.iloc[test_idx][y], pred))
    rmse = np.sqrt(mean_squared_error(np.exp(train.iloc[test_idx][y])-1, np.exp(pred)-1))
    rmse_list.append(rmse)
    print('rmse {}'.format(rmse))
    
    importance_df = pd.DataFrame()
    importance_df['feature'] = X
    importance_df['importance'] = gbm.feature_importances_
    importance_df['fold'] = fold + 1
    feature_importance = pd.concat([feature_importance, importance_df], axis=0)
==================================================
fold 0
      Iter       Train Loss   Remaining Time 
         1           0.1259           23.10m
         2           0.1065           22.28m
         3           0.0908           21.97m
         4           0.0779           21.70m
         5           0.0675           21.52m
         6           0.0591           21.26m
         7           0.0522           21.12m
         8           0.0466           20.85m
         9           0.0421           20.59m
        10           0.0383           20.32m
        20           0.0240           17.91m
        30           0.0219           15.75m
        40           0.0215           13.55m
        50           0.0213           11.28m
        60           0.0213            9.01m
        70           0.0212            6.75m
        80           0.0211            4.49m
        90           0.0211            2.24m
       100           0.0210            0.00s
rmse 76.81655221125601
==================================================
fold 1
      Iter       Train Loss   Remaining Time 
         1           0.1260           22.07m
         2           0.1066           21.79m
         3           0.0908           21.59m
         4           0.0780           21.68m
         5           0.0675           21.39m
         6           0.0591           21.11m
         7           0.0522           20.91m
         8           0.0466           20.71m
         9           0.0420           20.55m
        10           0.0383           20.36m
        20           0.0239           18.40m
        30           0.0218           16.33m
        40           0.0214           14.15m
        50           0.0213           11.87m
        60           0.0212            9.51m
        70           0.0211            7.14m
        80           0.0210            4.77m
        90           0.0210            2.39m
       100           0.0209            0.00s
rmse 80.4590172800173
==================================================
fold 2
      Iter       Train Loss   Remaining Time 
         1           0.1259           22.36m
         2           0.1065           22.10m
         3           0.0908           21.89m
         4           0.0780           21.69m
         5           0.0676           21.45m
         6           0.0591           21.29m
         7           0.0522           21.24m
         8           0.0466           21.17m
         9           0.0421           21.08m
        10           0.0384           20.96m
        20           0.0240           19.00m
        30           0.0219           16.78m
        40           0.0215           14.45m
        50           0.0213           12.06m
        60           0.0212            9.66m
        70           0.0212            7.26m
        80           0.0211            4.84m
        90           0.0210            2.42m
       100           0.0210            0.00s
rmse 53.03963382354496
==================================================
fold 3
      Iter       Train Loss   Remaining Time 
         1           0.1261           24.34m
         2           0.1067           24.01m
         3           0.0908           23.76m
         4           0.0780           23.52m
         5           0.0675           23.23m
         6           0.0590           22.94m
         7           0.0522           22.69m
         8           0.0465           22.43m
         9           0.0420           22.19m
        10           0.0382           21.93m
        20           0.0238           19.49m
        30           0.0217           17.07m
        40           0.0213           14.64m
        50           0.0211           12.19m
        60           0.0210            9.74m
        70           0.0210            7.24m
        80           0.0209            4.79m
        90           0.0209            2.37m
       100           0.0208            0.00s
rmse 49.03841446816272
print('Cross-validated RMSE on the training set: {}'.format(np.mean(rmse_list)))
Cross-validated RMSE on the training set: 64.83840444574524
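
The training loss in the logs (about 0.02) and the reported RMSE (roughly 49 to 80) are on different scales: the model is fitted on log(1 + amount), while the RMSE is computed after transforming predictions back to currency units. A sketch with made-up numbers shows how a small error in log space on a large payer turns into a large error in yuan.

```python
import numpy as np

y_true = np.array([0.0, 0.0, 0.99, 9.99, 500.0])   # made-up amounts
y_log = np.log1p(y_true)                            # training target: log(1 + y)

# Pretend model output on the log scale (made-up small errors).
pred_log = y_log + np.array([0.0, 0.1, -0.1, 0.1, -0.2])

pred = np.expm1(pred_log)                           # back to currency units: exp(p) - 1
rmse_log = np.sqrt(np.mean((y_log - pred_log) ** 2))
rmse_raw = np.sqrt(np.mean((y_true - pred) ** 2))

print(f"rmse on log scale: {rmse_log:.4f}, rmse in currency units: {rmse_raw:.2f}")
```

np.log1p and np.expm1 are exact inverses of each other and are numerically safer than np.log(x + 1) and np.exp(x) - 1 for values near zero.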
feature_importance.groupby('feature')['importance'].mean().sort_values(ascending=True).plot.barh(figsize=(8, 16))

[Figure: horizontal bar chart of mean feature importance across folds]

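Conclusion

Because I have not played this game, my understanding of some of its content and metrics is incomplete; playing it would probably bring new insight and reveal other important features. The model above was not tuned, so the results should improve with optimization. One could also segment users by their level of engagement and examine how high spenders and ordinary users differ in their in-game attributes, among other directions.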
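
As a starting point for that kind of comparison, the sketch below splits users into spend segments and averages a couple of in-game attributes per segment. The data are made up, and the cut points (0 and 10 yuan) and segment names are hypothetical choices that would need to be set from the real distribution.

```python
import pandas as pd

# Made-up rows; the column names come from the dataset.
df = pd.DataFrame({
    "avg_online_minutes":   [0.3, 1.5, 4.0, 12.0, 45.0, 200.0],
    "bd_stronghold_level":  [1, 1, 2, 4, 7, 12],
    "prediction_pay_price": [0.0, 0.0, 0.0, 0.99, 9.99, 499.0],
})

# Hypothetical cut points: (-inf, 0] non-payer, (0, 10] regular, (10, inf) high spender.
bins = [-float("inf"), 0, 10, float("inf")]
labels = ["non_payer", "regular", "high_spender"]
df["segment"] = pd.cut(df["prediction_pay_price"], bins=bins, labels=labels)

# Average in-game attributes per segment.
profile = df.groupby("segment", observed=True)[
    ["avg_online_minutes", "bd_stronghold_level"]
].mean()
print(profile)
```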
