[Kaggle Competition] IEEE-CIS Fraud Detection

0. Preface

Kaggle competition: IEEE-CIS Fraud Detection

  • Problem description:
    In this competition, you’ll benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta’s real-world e-commerce transactions and contains a wide range of features from device type to product features. You also have the opportunity to create new features to improve your results.
    In this competition you are predicting the probability that an online transaction is fraudulent, as denoted by the binary target isFraud.
    The data is broken into two files identity and transaction, which are joined by TransactionID. Not all transactions have corresponding identity information.

  • LB (Public Leaderboard): AUC score computed on the first 20% of the test set.

  • Private Leaderboard (final score): AUC score computed on the remaining 80% of the test set.

  • Two final submissions are allowed in this competition.

    Having previously entered a few introductory Kaggle competitions, I tried this binary-classification competition hosted by IEEE and Vesta, building the prediction model with LightGBM in Python on Jupyter Notebook. The key to improving the score lies in mining the data and in choosing good strategies for turning it into features, which calls for very careful EDA and feature engineering (FE).
    Final result: bronze medal, 373/6381 (Top 6%), Private Leaderboard score 0.928512.
    The approach described here is meant to aid understanding of the problem and to explain the posted Python code; it is not the optimal solution. Both the ideas and the code are for reference only; for the methods and detailed steps involved, please follow the linked references. Variable naming, comments, and experiment notes in the code are messy and are likewise for reference only.

1. EDA

Please refer to the following Kaggle kernels:
Nanashi - Fraud complete EDA

1.1 Examining the Data

Official data description and Q&A: Data Description (Details and Discussion)

  1. The Transaction table:
    TransactionDT: not a real timestamp, but a timedelta in seconds from some reference datetime.
    TransactionAmt: transaction payment amount in USD; the decimal part is worth attention.
    ProductCD: product code, one of W/H/C/S/R. Not necessarily a physical product; it may also refer to a service.
    card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
    addr1 - addr2: billing region and billing country.
    dist: distances between (not limited to) billing address, mailing address, zip code, IP address, phone area, etc.
    P_ and (R__) emaildomain: purchaser and recipient email domain. Some transactions do not require a recipient, and their R_emaildomain is null.
    C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. Plus like device, ipaddr, billingaddr, etc. Also these are for both purchaser and recipient, which doubles the number.
    D1-D15: timedelta, such as days between previous transaction, etc.
    M1-M9: match, such as names on card and address, etc. All of them are binary (T/F) variables.
    Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations. Different blocks of V features have different missing-value ratios, and their true meaning and the best way to handle them remain unclear.

  2. The Identity table:
    id01 to id11 are numerical features for identity, which is collected by Vesta and security partners such as device rating, ip_domain rating, proxy rating, etc. Also it recorded behavioral fingerprint like account login times/failed to login times, how long an account stayed on the page, etc. All of these are not able to elaborate due to security partner T&C.
    DeviceType, DeviceInfo, and id_12 - id_38 are categorical features.

Many EDA kernels reveal characteristics of the data, in particular how the feature distributions drift over time and how the train and test distributions differ. A minimal sketch of such a check is given below.
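A minimal sketch of such a check (assuming the raw transaction tables have been loaded as train_df and test_df, as in section 2):

import pandas as pd
import matplotlib.pyplot as plt

# Bucket TransactionDT (seconds since an unknown origin) into whole days.
train_day = (train_df['TransactionDT'] // (24 * 3600)).astype(int)
test_day = (test_df['TransactionDT'] // (24 * 3600)).astype(int)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6), sharex=True)

# Daily transaction volume: train and test cover disjoint periods separated by a gap.
train_day.value_counts().sort_index().plot(ax=ax1, label='train')
test_day.value_counts().sort_index().plot(ax=ax1, label='test')
ax1.set_ylabel('transactions / day')
ax1.legend()

# Daily fraud rate (train only), to see how the target itself drifts over time.
train_df.groupby(train_day)['isFraud'].mean().plot(ax=ax2)
ax2.set_ylabel('fraud rate')
ax2.set_xlabel('day index')
plt.show()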

1.2 Handling Missing Values

  1. Missing ratios: see the EDA kernels for the proportion of missing values in each column.
  2. Use feature dependency to find related features and fill missing values from them.
    Reference: Gunes Evitan - IEEE-CIS Fraud Detection Dependency Check
def check_dependency(independent_var, dependent_var):
    
    independent_uniques = []
    temp_df = pd.concat([train_df[[independent_var, dependent_var]], test_df[[independent_var, dependent_var]]])
    
    for value in temp_df[independent_var].unique():
        independent_uniques.append(temp_df[temp_df[independent_var] == value][dependent_var].value_counts().shape[0])

    values = pd.Series(data=independent_uniques, index=temp_df[independent_var].unique())
    
    N = len(values)
    N_dependent = len(values[values == 1])
    N_notdependent = len(values[values > 1])
    N_null = len(values[values == 0])
        
    print(f'In {independent_var}, there are {N} unique values')
    print(f'{N_dependent}/{N} have one unique {dependent_var} value')
    print(f'{N_notdependent}/{N} have more than one unique {dependent_var} values')
    print(f'{N_null}/{N} have only missing {dependent_var} values\n')

For example:

check_dependency('R_emaildomain', 'C5')
print(train_df['C10'].isnull().sum()/train_df.shape[0])
print(test_df['C10'].isnull().sum()/test_df.shape[0])
print(test_df[~test_df['R_emaildomain'].isnull()]['C5'].value_counts())
In R_emaildomain, there are 61 unique values
60/61 have one unique C5 value
0/61 have more than one unique C5 values
1/61 have only missing C5 values
0.0
5.920768278891869e-06
0.0    135867
Name: C5, dtype: int64

We can see that R_emaildomain and C5 are strongly dependent. C5 has a small number of missing values, only in the test set, and they occur when R_emaildomain is present; whenever R_emaildomain is present, C5 is always 0, so filling the missing C5 values with 0 is reasonable.
Following this idea, several strongly dependent feature pairs were found and the corresponding missing values in the test set were filled:

#1.1 find dependency and fillna
#'dist1', 'C3': only the test set has missing C3, and only when dist1 is present; when dist1 is present, C3 is always 0
test_df['C3'] = test_df['C3'].fillna(0)
#'R_emaildomain', 'C5': only the test set has missing C5, almost always when R_emaildomain is present (only 3 missing C5 rows have R_emaildomain missing)
test_df['C5'] = test_df['C5'].fillna(0)
#'id_30', 'C7': only the test set has missing C7, and only when id_30 is present (just 3 such rows); all other C7 values with id_30 present are 0 (device info)
test_df['C7'] = test_df['C7'].fillna(0)
#'id_31', 'C9': only the test set has missing C9, and only when id_31 is present (just 3 such rows); all other C9 values with id_31 present are 0 (browser info)
test_df['C9'] = test_df['C9'].fillna(0)
  3. Use the card features associated with each card1 value to fill the missing values of card2-card6:
#1. More interaction between card features + fill nans
i_cols = ['TransactionID','card1','card2','card3','card4','card5','card6']

full_df = pd.concat([train_df[i_cols], test_df[i_cols]])

## I've used frequency encoding before so we have ints here
## we will drop very rare cards
full_df['card6'] = np.where(full_df['card6']==30, np.nan, full_df['card6'])
full_df['card6'] = np.where(full_df['card6']==16, np.nan, full_df['card6'])

i_cols = ['card2','card3','card4','card5','card6']

## We will find best match for nan values and fill with it (filling card2-card6 this way works much better)
for col in i_cols:
    temp_df = full_df.groupby(['card1',col])[col].agg(['count']).reset_index()
    temp_df = temp_df.sort_values(by=['card1','count'], ascending=False).reset_index(drop=True)
    del temp_df['count']
    temp_df = temp_df.drop_duplicates(keep='first').reset_index(drop=True)
    temp_df.index = temp_df['card1'].values
    temp_df = temp_df[col].to_dict()
    full_df[col] = np.where(full_df[col].isna(), full_df['card1'].map(temp_df), full_df[col])
    
    
i_cols = ['card1','card2','card3','card4','card5','card6']
for col in i_cols:
    train_df[col] = full_df[full_df['TransactionID'].isin(train_df['TransactionID'])][col].values
    test_df[col] = full_df[full_df['TransactionID'].isin(test_df['TransactionID'])][col].values

1.3 Mining Hidden Information for the Model

To protect user privacy, the organizers transformed many features and concealed their real meanings. Careful observation and analysis of the data is needed to infer what each feature means and what information it carries, so that reasonable processing strategies can be chosen.

  1. Dates
    Kevin - TransactionDT startdate
    Choosing 2017-11-30 as the start date makes Black Friday and Cyber Monday line up well; adding the TransactionDT timedelta to it gives the actual date of each transaction (see the short sketch after this list).
  2. D features
    Akasyanama - EDA what's behind D features?
    A Humphrey - Understanding the D features (updated)
    tuttifrutti - Creating features from D columns (guessing userID)
    A few with reasonably clear meanings:
    D1: timedelta (days, rounded down) since first transaction for one card.
    D2: this appears to be the same as D1, except D1 = 0 values have been replaced by NaN.
    D3: timedelta since the previous transaction for one card. As with D1 and D2, this feature appears to count different cards separately.
    D4: timedelta since first transaction for all cards on the account. Using the example of a husband and wife each using their own card on a joint credit card account, this feature would not distinguish between which card was used.
    D5: timedelta since the previous transaction for all cards on the account.
    D6 and D7: some combined transformation of D4 and D5; dropping either one lowers the AUC.
    D8: timedelta (float) since some event.
    D9: the decimal part of D8, i.e. the hour of day. Since the fraud rate (the mean of isFraud) varies only slightly from hour to hour, this feature adds little predictive value, and the plan is to drop it.
    D10: some kind of timedelta for domestic transactions.
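A minimal sketch of the two points above (again assuming train_df is the raw transaction table; the 2017-11-30 start date is the assumption from the referenced kernel, and the same conversion is used in the full code in section 2):

import datetime

# Turn the TransactionDT timedelta (seconds) into an actual datetime, assuming 2017-11-30 as origin.
START_DATE = datetime.datetime.strptime('2017-11-30', '%Y-%m-%d')
dt = train_df['TransactionDT'].apply(lambda x: START_DATE + datetime.timedelta(seconds=x))

# Quick check of the calendar alignment: the busiest days should fall on the holiday season
# (and, in the test set, around Black Friday / Cyber Monday 2018).
print(train_df.groupby(dt.dt.date)['TransactionID'].count().sort_values(ascending=False).head())

# Check the claim about D9: fraud rate by hour of day.
print(train_df.groupby(dt.dt.hour)['isFraud'].mean())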

Chosen processing strategy:

  • Because the D features are time dependent and drift with TransactionDT, we can take some of them (e.g. D1 and D4) and subtract the transaction day derived from TransactionDT. The difference exposes quantities such as the card's first-use date or the date of the previous transaction, whereas the raw D features only reflect an accumulating timedelta since some event and also carry the time drift. The resulting D-minus-DT features can then be used to synthesize user ids (uid) and card ids (cardid), which pin down users much more precisely (see the sketch after the code excerpt below). Although these features may bring some risk of overfitting, this model keeps them.
  • The D features can also be min-max scaled and standardized (std score) within different time periods, implemented in the custom values_normalization function:
dt_df[new_col+'_min_max'] = (dt_df[col]-dt_df['temp_min'])/(dt_df['temp_max']-dt_df['temp_min'])
dt_df[new_col+'_std_score'] = (dt_df[col]-dt_df['temp_mean'])/(dt_df['temp_std'])
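A minimal sketch of the D-minus-DT idea from the first bullet (the full version, together with the uid/cid synthesis, appears in step #3 of section 2):

# Day index of each transaction since the data origin.
train_df['TransactionDTday'] = (train_df['TransactionDT'] / (60 * 60 * 24)).astype(int)

# D1 counts days since the card's first transaction, so D1 minus the current day index is
# (up to a constant offset) the card's first-use day: constant per card and free of time drift.
train_df['D1minusday'] = train_df['D1'] - train_df['TransactionDTday']
# Same idea with D4 (first transaction on the account).
train_df['D4minusday'] = train_df['D4'] - train_df['TransactionDTday']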
  3. C features
    See the EDA kernels for their distributions.
    As mentioned above, the C features count entities tied to the purchaser and the recipient (e.g. billing addresses, email addresses), so some of them are strongly dependent on other features; this can be used to fill their missing values in the test set.
    The train and test distributions differ considerably, so removing outliers (clipping) is used to improve the distributions.
  4. V features
    Reference: Laevatein - Interesting finding about the V columns
    The V features can be split into blocks by their missing-value ratio; each block was presumably generated from the same underlying data (a sketch for recovering this grouping is given at the end of this item).
    V1 ~ V11
    V12 ~ V34
    V35 ~ V52
    V53 ~ V74
    V75 ~ V94
    V95 ~ V137 (highly correlated)
    V126-V138
    V138 ~ V166 (high null ratio)
    V167 ~ V216 (high null ratio)
    V217 ~ V278 (high null ratio, 2 different null ratios)
    V279 ~ V321 (2 different null ratios)
    V289-V318
    V319-V321 (highly correlated)
    V322 ~ V339 (high null ratio)

The numerical V features are:

'V126' 'V127' 'V128' 'V130' 'V131' 'V132' 'V133' 'V134' 'V136' 'V137'
 'V143' 'V144' 'V145' 'V150' 'V159' 'V160' 'V164' 'V165' 'V166' 'V202'
 'V203' 'V204' 'V205' 'V206' 'V207' 'V208' 'V209' 'V210' 'V211' 'V212'
 'V213' 'V214' 'V215' 'V216' 'V263' 'V264' 'V265' 'V266' 'V267' 'V268'
 'V270' 'V271' 'V272' 'V273' 'V274' 'V275' 'V276' 'V277' 'V278' 'V306'
 'V307' 'V308' 'V309' 'V310' 'V312' 'V313' 'V314' 'V315' 'V316' 'V317'
 'V318' 'V320' 'V321' 'V331' 'V332' 'V333' 'V335'

Chosen processing:

  • Scale the numerical V features and apply PCA.
  • Group PCA and some other treatments of the V features were tried, but abandoned since they did not improve the LB.
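The block structure listed above can be recovered mechanically by grouping V columns on their missing-value ratio (a rough sketch, assuming train_df and test_df are loaded):

import pandas as pd

# Columns that share a missing-value ratio were presumably generated from the same source.
v_cols = ['V' + str(i) for i in range(1, 340)]
null_ratio = pd.concat([train_df[v_cols], test_df[v_cols]]).isnull().mean().round(4)

for ratio, cols in null_ratio.groupby(null_ratio):
    print('null ratio {:.4f}: {} ... {} ({} columns)'.format(ratio, cols.index[0], cols.index[-1], len(cols)))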

2. Deep Feature Engineering

For the initial feature-engineering approach (LB -> 0.9487), see:
Konstantin Yakovlev - IEEE - Internal Blend
David Cairuz - Feature Engineering & LightGBM
For the later feature-engineering ideas (LB: 0.9487 -> 0.9526), see the other experiment notes; the feature-engineering code finally adopted is given below:

import numpy as np
import pandas as pd
import gc
import os, sys, random, datetime

Shrink the datasets so they occupy less memory and can be processed more efficiently. Reference: Konstantin Yakovlev - IEEE Data minification

def seed_everything(seed=0):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

## Memory Reducer
# :df pandas dataframe to reduce size             # type: pd.DataFrame()
# :verbose                                        # type: bool
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

Load the train and test sets and reduce their memory footprint.

print('Load Data')
train_df = pd.read_csv('../input/train_transaction.csv')
test_df = pd.read_csv('../input/test_transaction.csv')
test_df['isFraud'] = 0
train_identity = pd.read_csv('../input/train_identity.csv')
test_identity = pd.read_csv('../input/test_identity.csv')
print('Reduce Memory')
train_df = reduce_mem_usage(train_df)
test_df  = reduce_mem_usage(test_df)
train_identity = reduce_mem_usage(train_identity)
test_identity  = reduce_mem_usage(test_identity)
Load Data
Reduce Memory
Mem. usage decreased to 542.35 Mb (69.4% reduction)
Mem. usage decreased to 473.07 Mb (68.9% reduction)
Mem. usage decreased to 25.86 Mb (42.7% reduction)
Mem. usage decreased to 25.44 Mb (42.7% reduction)

Initial processing of the identity data: mainly splitting string features such as DeviceInfo, id_30 (OS info), and id_31 (browser info) into new features, deriving device features from id_33 (screen resolution), converting the remaining categorical features from strings to numerical values, and binning some of them:

def id_split(dataframe):
    
    dataframe['device_name'] = dataframe['DeviceInfo'].str.split('/', expand=True)[0]
    dataframe['device_version'] = dataframe['DeviceInfo'].str.split('/', expand=True)[1]

    dataframe['OS_id_30'] = dataframe['id_30'].str.split(' ', expand=True)[0]
    dataframe['version_id_30'] = dataframe['id_30'].str.split(' ', expand=True)[1]
 
    dataframe['browser_id_31'] = dataframe['id_31'].str.split(' ', expand=True)[0]
    dataframe['version_id_31'] = dataframe['id_31'].str.split(' ', expand=True)[1]

    dataframe['screen_width'] = dataframe['id_33'].str.split('x', expand=True)[0]
    dataframe['screen_height'] = dataframe['id_33'].str.split('x', expand=True)[1]
    dataframe['id_12'] = dataframe['id_12'].map({'Found':1, 'NotFound':0})
    dataframe['id_15'] = dataframe['id_15'].map({'New':2, 'Found':1, 'Unknown':0})
    dataframe['id_16'] = dataframe['id_16'].map({'Found':1, 'NotFound':0})

    dataframe['id_23'] = dataframe['id_23'].map({'TRANSPARENT':4, 'IP_PROXY':3, 'IP_PROXY:ANONYMOUS':2, 'IP_PROXY:HIDDEN':1})

    dataframe['id_27'] = dataframe['id_27'].map({'Found':1, 'NotFound':0})
    dataframe['id_28'] = dataframe['id_28'].map({'New':2, 'Found':1})

    dataframe['id_29'] = dataframe['id_29'].map({'Found':1, 'NotFound':0})

    dataframe['id_35'] = dataframe['id_35'].map({'T':1, 'F':0})
    dataframe['id_36'] = dataframe['id_36'].map({'T':1, 'F':0})
    dataframe['id_37'] = dataframe['id_37'].map({'T':1, 'F':0})
    dataframe['id_38'] = dataframe['id_38'].map({'T':1, 'F':0})

    dataframe['id_34'] = dataframe['id_34'].fillna(':0')
    dataframe['id_34'] = dataframe['id_34'].apply(lambda x: x.split(':')[1]).astype(np.int8)
    dataframe['id_34'] = np.where(dataframe['id_34']==0, np.nan, dataframe['id_34'])
    
    dataframe['id_33'] = dataframe['id_33'].fillna('0x0')
    dataframe['id_33_0'] = dataframe['id_33'].apply(lambda x: x.split('x')[0]).astype(int)
    dataframe['id_33_1'] = dataframe['id_33'].apply(lambda x: x.split('x')[1]).astype(int)
    dataframe['id_33'] = np.where(dataframe['id_33']=='0x0', np.nan, dataframe['id_33'])
    
    for feature in ['id_01', 'id_31', 'id_33', 'id_36']:
        dataframe[feature + '_count_dist'] = dataframe[feature].map(dataframe[feature].value_counts(dropna=False))
    
    dataframe['DeviceType'] = dataframe['DeviceType'].map({'desktop':1, 'mobile':0})
    
    dataframe.loc[dataframe['device_name'].str.contains('SM', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('SAMSUNG', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('GT-', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('Moto G', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('Moto', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('moto', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('LG-', na=False), 'device_name'] = 'LG'
    dataframe.loc[dataframe['device_name'].str.contains('rv:', na=False), 'device_name'] = 'RV'
    dataframe.loc[dataframe['device_name'].str.contains('HUAWEI', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('ALE-', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('-L', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('Blade', na=False), 'device_name'] = 'ZTE'
    dataframe.loc[dataframe['device_name'].str.contains('BLADE', na=False), 'device_name'] = 'ZTE'
    dataframe.loc[dataframe['device_name'].str.contains('Linux', na=False), 'device_name'] = 'Linux'
    dataframe.loc[dataframe['device_name'].str.contains('XT', na=False), 'device_name'] = 'Sony'
    dataframe.loc[dataframe['device_name'].str.contains('HTC', na=False), 'device_name'] = 'HTC'
    dataframe.loc[dataframe['device_name'].str.contains('ASUS', na=False), 'device_name'] = 'Asus'

    dataframe.loc[dataframe.device_name.isin(dataframe.device_name.value_counts()[dataframe.device_name.value_counts() < 200].index), 'device_name'] = "Others"
    dataframe['had_id'] = 1
    gc.collect()
    return dataframe
train_identity = id_split(train_identity)
test_identity = id_split(test_identity)

Initial processing of the transaction data:

  • Transform TransactionAmt (log, decimal part).
  • Process the emaildomain information: bin / prefix and suffix / US grouping / missing-value handling.
  • Process TransactionDT (convert it to an explicit datetime; see 1.3 for the choice of the timedelta start date). The DT features are kept only for aggregating other features; on their own they have no value and are noise for the model.
  • Encode the M features as 0/1.
  • Target-mean encode ProductCD and M4.
  • Combine the card features with other features (addr / email / D features) into pseudo uids, kept for aggregating other features; card1 and most uid features are themselves noise for the model and may cause overfitting.
  • Clip TransactionAmt to remove outliers, and check whether each TransactionAmt value occurs in both train and test.
#new features trans
def gen_new(train_trans,test_trans):

    # New feature - log of transaction amount.
    train_trans['TransactionAmt_Log'] = np.log1p(train_trans['TransactionAmt'])
    test_trans['TransactionAmt_Log'] = np.log1p(test_trans['TransactionAmt'])

    # New feature - decimal part of the transaction amount.
    train_trans['TransactionAmt_decimal'] = ((train_trans['TransactionAmt'] - train_trans['TransactionAmt'].astype(int)) * 1000).astype(int)
    test_trans['TransactionAmt_decimal'] = ((test_trans['TransactionAmt'] - test_trans['TransactionAmt'].astype(int)) * 1000).astype(int)

    
    # New feature - day of week in which a transaction happened.
    train_trans['Transaction_day_of_week'] = np.floor((train_trans['TransactionDT'] / (3600 * 24) - 1) % 7)
    test_trans['Transaction_day_of_week'] = np.floor((test_trans['TransactionDT'] / (3600 * 24) - 1) % 7)

    # New feature - hour of the day in which a transaction happened.
    train_trans['Transaction_hour'] = np.floor(train_trans['TransactionDT'] / 3600) % 24
    test_trans['Transaction_hour'] = np.floor(test_trans['TransactionDT'] / 3600) % 24
    
    #New feature - emaildomain with suffix
    emails = {'gmail': 'google', 'att.net': 'att', 'twc.com': 'spectrum', 'scranton.edu': 'other', 'optonline.net': 'other', 'hotmail.co.uk': 'microsoft', 'comcast.net': 'other', 'yahoo.com.mx': 'yahoo', 'yahoo.fr': 'yahoo', 'yahoo.es': 'yahoo', 'charter.net': 'spectrum', 'live.com': 'microsoft', 'aim.com': 'aol', 'hotmail.de': 'microsoft', 'centurylink.net': 'centurylink', 'gmail.com': 'google', 'me.com': 'apple', 'earthlink.net': 'other', 'gmx.de': 'other', 'web.de': 'other', 'cfl.rr.com': 'other', 'hotmail.com': 'microsoft', 'protonmail.com': 'other', 'hotmail.fr': 'microsoft', 'windstream.net': 'other', 'outlook.es': 'microsoft', 'yahoo.co.jp': 'yahoo', 'yahoo.de': 'yahoo', 'servicios-ta.com': 'other', 'netzero.net': 'other', 'suddenlink.net': 'other', 'roadrunner.com': 'other', 'sc.rr.com': 'other', 'live.fr': 'microsoft', 'verizon.net': 'yahoo', 'msn.com': 'microsoft', 'q.com': 'centurylink', 'prodigy.net.mx': 'att', 'frontier.com': 'yahoo', 'anonymous.com': 'other', 'rocketmail.com': 'yahoo', 'sbcglobal.net': 'att', 'frontiernet.net': 'yahoo', 'ymail.com': 'yahoo', 'outlook.com': 'microsoft', 'mail.com': 'other', 'bellsouth.net': 'other', 'embarqmail.com': 'centurylink', 'cableone.net': 'other', 'hotmail.es': 'microsoft', 'mac.com': 'apple', 'yahoo.co.uk': 'yahoo', 'netzero.com': 'other', 'yahoo.com': 'yahoo', 'live.com.mx': 'microsoft', 'ptd.net': 'other', 'cox.net': 'other', 'aol.com': 'aol', 'juno.com': 'other', 'icloud.com': 'apple','uknown':'uknown'}
    us_emails = ['gmail', 'net', 'edu']

    for c in ['P_emaildomain', 'R_emaildomain']:
        train_trans[c] = train_trans[c].fillna('uknown')
        test_trans[c] = test_trans[c].fillna('uknown')
        
        train_trans[c + '_bin'] = train_trans[c].map(emails)
        test_trans[c + '_bin'] = test_trans[c].map(emails)
    
        train_trans[c + '_suffix'] = train_trans[c].apply(lambda x: str(x).split('.')[-1])
        test_trans[c + '_suffix'] = test_trans[c].apply(lambda x: str(x).split('.')[-1])
        
        train_trans[c + '_prefix'] = train_trans[c].apply(lambda x: str(x).split('.')[0])
        test_trans[c + '_prefix'] = test_trans[c].apply(lambda x: str(x).split('.')[0])

        train_trans[c + '_suffix_us'] = train_trans[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
        test_trans[c + '_suffix_us'] = test_trans[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
    train_trans['email_check'] = np.where((train_trans['P_emaildomain']==train_trans['R_emaildomain'])&(train_trans['P_emaildomain']!='uknown'),1,0)
    test_trans['email_check'] = np.where((test_trans['P_emaildomain']==test_trans['R_emaildomain'])&(test_trans['P_emaildomain']!='uknown'),1,0)
    
    #New feature - dates
    START_DATE = datetime.datetime.strptime('2017-11-30', '%Y-%m-%d')
    from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
    dates_range = pd.date_range(start='2017-10-01', end='2019-01-01')
    us_holidays = calendar().holidays(start=dates_range.min(), end=dates_range.max())

    for df in [train_trans, test_trans]:
        # Temporary
        df['DT'] = df['TransactionDT'].apply(lambda x: (START_DATE + datetime.timedelta(seconds = x)))
        df['DT_M'] = (df['DT'].dt.year-2017)*12 + df['DT'].dt.month
        df['DT_W'] = (df['DT'].dt.year-2017)*52 + df['DT'].dt.weekofyear
        df['DT_D'] = (df['DT'].dt.year-2017)*365 + df['DT'].dt.dayofyear

        df['DT_hour'] = df['DT'].dt.hour
        df['DT_day_week'] = df['DT'].dt.dayofweek
        df['DT_day'] = df['DT'].dt.day
        df['DT_day_month'] = (df['DT'].dt.day).astype(np.int8)
        # Possible solo feature
        df['is_december'] = df['DT'].dt.month
        df['is_december'] = (df['is_december']==12).astype(np.int8)

        # Holidays
        df['is_holiday'] = (df['DT'].dt.date.astype('datetime64').isin(us_holidays)).astype(np.int8)
   
    #New feature - binary encoded 1/0 gen new
    i_cols = ['M1','M2','M3','M5','M6','M7','M8','M9']
    for df in [train_trans, test_trans]:
        df['M_sum'] = df[i_cols].sum(axis=1).astype(np.int8)
        df['M_na'] = df[i_cols].isna().sum(axis=1).astype(np.int8)

    #New feature - ProductCD and M4 Target mean
    for col in ['ProductCD','M4']:
        temp_dict = train_trans.groupby([col])['isFraud'].agg(['mean']).reset_index().rename(columns={'mean': col+'_target_mean'})
        temp_dict.index = temp_dict[col].values
        temp_dict = temp_dict[col+'_target_mean'].to_dict()

        train_trans[col+'_target_mean'] = train_trans[col].map(temp_dict)
        test_trans[col+'_target_mean']  = test_trans[col].map(temp_dict)
    
    #New feature - use it for aggregations
    train_trans['uid1'] = train_trans['card1'].astype(str)+'_'+train_trans['card2'].astype(str) 
    test_trans['uid1'] = test_trans['card1'].astype(str)+'_'+test_trans['card2'].astype(str)

    train_trans['uid2'] = train_trans['uid1'].astype(str)+'_'+train_trans['card3'].astype(str)+'_'+train_trans['card5'].astype(str)
    test_trans['uid2'] = test_trans['uid1'].astype(str)+'_'+test_trans['card3'].astype(str)+'_'+test_trans['card5'].astype(str)

    train_trans['uid3'] = train_trans['uid2'].astype(str)+'_'+train_trans['addr1'].astype(str)+'_'+train_trans['addr2'].astype(str)
    test_trans['uid3'] = test_trans['uid2'].astype(str)+'_'+test_trans['addr1'].astype(str)+'_'+test_trans['addr2'].astype(str)

    # Check if the Transaction Amount is common or not (we can use freq encoding here)
    # In our dialog with a model we are telling to trust or not to these values   
    # Clip Values
    train_trans['TransactionAmt'] = train_trans['TransactionAmt'].clip(0,5000)
    test_trans['TransactionAmt']  = test_trans['TransactionAmt'].clip(0,5000)

    train_trans['TransactionAmt_check'] = np.where(train_trans['TransactionAmt'].isin(test_trans['TransactionAmt']), 1, 0)
    test_trans['TransactionAmt_check']  = np.where(test_trans['TransactionAmt'].isin(train_trans['TransactionAmt']), 1, 0)

    return train_trans,test_trans
train_df,test_df = gen_new(train_df,test_df)

Define the aggregation helpers: timeblock_frequency_encoding, which counts occurrences within time blocks; uid_aggregation and uid_aggregation_and_normalization, which aggregate (and normalize) a column by uid; and frequency_encoding, which encodes columns by their overall frequency:

def timeblock_frequency_encoding(train_df, test_df, periods, columns, 
                                 with_proportions=True, only_proportions=False):
    for period in periods:
        for col in columns:
            new_col = col +'_'+ period
            train_df[new_col] = train_df[col].astype(str)+'_'+train_df[period].astype(str)
            test_df[new_col]  = test_df[col].astype(str)+'_'+test_df[period].astype(str)

            temp_df = pd.concat([train_df[[new_col]], test_df[[new_col]]])
            fq_encode = temp_df[new_col].value_counts().to_dict()

            train_df[new_col] = train_df[new_col].map(fq_encode)
            test_df[new_col]  = test_df[new_col].map(fq_encode)
            
            if only_proportions:
                train_df[new_col] = train_df[new_col]/train_df[period+'_total']
                test_df[new_col]  = test_df[new_col]/test_df[period+'_total']

            if with_proportions:
                train_df[new_col+'_proportions'] = train_df[new_col]/train_df[period+'_total']
                test_df[new_col+'_proportions']  = test_df[new_col]/test_df[period+'_total']

    return train_df, test_df
def uid_aggregation(train_df, test_df, main_columns, uids, aggregations):
    for main_column in main_columns:  
        for col in uids:
            for agg_type in aggregations:
                new_col_name = col+'_'+main_column+'_'+agg_type
                temp_df = pd.concat([train_df[[col, main_column]], test_df[[col,main_column]]])
                temp_df = temp_df.groupby([col])[main_column].agg([agg_type]).reset_index().rename(
                                                        columns={agg_type: new_col_name})

                temp_df.index = list(temp_df[col])
                temp_df = temp_df[new_col_name].to_dict()   

                train_df[new_col_name] = train_df[col].map(temp_df)
                test_df[new_col_name]  = test_df[col].map(temp_df)
    return train_df, test_df

def uid_aggregation_and_normalization(train_df, test_df, main_columns, uids, aggregations):
    for main_column in main_columns:  
        for col in uids:
            
            new_norm_col_name = col+'_'+main_column+'_std_norm'
            norm_cols = []
            
            for agg_type in aggregations:
                new_col_name = col+'_'+main_column+'_'+agg_type
                temp_df = pd.concat([train_df[[col, main_column]], test_df[[col,main_column]]])
                temp_df = temp_df.groupby([col])[main_column].agg([agg_type]).reset_index().rename(
                                                        columns={agg_type: new_col_name})

                temp_df.index = list(temp_df[col])
                temp_df = temp_df[new_col_name].to_dict()   

                train_df[new_col_name] = train_df[col].map(temp_df)
                test_df[new_col_name]  = test_df[col].map(temp_df)
                norm_cols.append(new_col_name)
            
            train_df[new_norm_col_name] = (train_df[main_column]-train_df[norm_cols[0]])/train_df[norm_cols[1]]
            test_df[new_norm_col_name]  = (test_df[main_column]-test_df[norm_cols[0]])/test_df[norm_cols[1]]          
            
            del train_df[norm_cols[0]], train_df[norm_cols[1]]
            del test_df[norm_cols[0]], test_df[norm_cols[1]]
                                              
    return train_df, test_df

def frequency_encoding(train_df, test_df, columns, self_encoding=False):
    for col in columns:
        temp_df = pd.concat([train_df[[col]], test_df[[col]]])
        fq_encode = temp_df[col].value_counts(dropna=False).to_dict()
        if self_encoding:
            train_df[col] = train_df[col].map(fq_encode)
            test_df[col]  = test_df[col].map(fq_encode)            
        else:
            train_df[col+'_fq_enc'] = train_df[col].map(fq_encode)
            test_df[col+'_fq_enc']  = test_df[col].map(fq_encode)
    return train_df, test_df

Next, further feature engineering:

#2. Keep intersections
for col in ['card1']: 
    valid_card = pd.concat([train_df[[col]], test_df[[col]]])
    valid_card = valid_card[col].value_counts()
    valid_card_std = valid_card.values.std()

    invalid_cards = valid_card[valid_card<=2]
    print('Rare cards',len(invalid_cards))

    valid_card = valid_card[valid_card>2]
    valid_card = list(valid_card.index)

    print('No intersection in Train', len(train_df[~train_df[col].isin(test_df[col])]))
    print('Intersection in Train', len(train_df[train_df[col].isin(test_df[col])]))
    
    train_df[col] = np.where(train_df[col].isin(test_df[col]), train_df[col], np.nan)
    test_df[col]  = np.where(test_df[col].isin(train_df[col]), test_df[col], np.nan)

    train_df[col] = np.where(train_df[col].isin(valid_card), train_df[col], np.nan)
    test_df[col]  = np.where(test_df[col].isin(valid_card), test_df[col], np.nan)
    print('#'*20)

for col in ['card2','card3','card4','card5','card6']: 
    print('No intersection in Train', col, len(train_df[~train_df[col].isin(test_df[col])]))
    print('Intersection in Train', col, len(train_df[train_df[col].isin(test_df[col])]))
    
    train_df[col] = np.where(train_df[col].isin(test_df[col]), train_df[col], np.nan)
    test_df[col]  = np.where(test_df[col].isin(train_df[col]), test_df[col], np.nan)
    print('#'*20)
Rare cards 5993
No intersection in Train 10396
Intersection in Train 580144
####################
No intersection in Train card2 0
Intersection in Train card2 590540
####################
No intersection in Train card3 47
Intersection in Train card3 590493
####################
No intersection in Train card4 0
Intersection in Train card4 590540
####################
No intersection in Train card5 176
Intersection in Train card5 590364
####################
No intersection in Train card6 30
Intersection in Train card6 590510
####################
#3.generate accurate userids and cardids
train_df['uid4'] = train_df['uid3'].astype(str)+'_'+train_df['P_emaildomain'].astype(str)
test_df['uid4'] = test_df['uid3'].astype(str)+'_'+test_df['P_emaildomain'].astype(str)

train_df['uid5'] = train_df['uid3'].astype(str)+'_'+train_df['R_emaildomain'].astype(str)
test_df['uid5'] = test_df['uid3'].astype(str)+'_'+test_df['R_emaildomain'].astype(str)

train_df['uid6'] = train_df['card1'].astype(str)+'_'+train_df['D15'].astype(str)
test_df['uid6'] = test_df['card1'].astype(str)+'_'+test_df['D15'].astype(str)

#try to generate more accuracy card_id and user_id
#uid1/uid2 are not of much use any more

#guess_card_id
train_df['TransactionDTday'] = (train_df['TransactionDT']/(60*60*24)).map(int)
test_df['TransactionDTday'] = (test_df['TransactionDT']/(60*60*24)).map(int)
train_df['D1minusday'] = train_df['D1'] - train_df['TransactionDTday'] # card first-use (issue) date
test_df['D1minusday'] = test_df['D1'] - test_df['TransactionDTday']
train_df['D4minusday'] = train_df['D4'] - train_df['TransactionDTday'] # account first-transaction date
test_df['D4minusday'] = test_df['D4'] - test_df['TransactionDTday']

#This should help with D1/D2/D3/D8; D2 does not need changing, and D3/D8 probably have other uses
train_df['cid_1'] = train_df['uid4'].astype(str)+'_'+train_df['D1minusday'].astype(str)
test_df['cid_1'] = test_df['uid4'].astype(str)+'_'+test_df['D1minusday'].astype(str)

#guess_user_id using D4
train_df['uid7'] = train_df['uid4'].astype(str)+'_'+train_df['D4minusday'].astype(str)
test_df['uid7'] = test_df['uid4'].astype(str)+'_'+test_df['D4minusday'].astype(str)

print('#'*10)
print('Most common uIds:')
new_columns = ['uid1','uid2','uid3','uid4','uid5','uid6','uid7','cid_1']
for col in new_columns:
    print('#'*10, col)
    print(train_df[col].value_counts()[:10])

# Do Global frequency encoding 

i_cols = ['card1','card2','card3','card5'] + new_columns
train_df, test_df = frequency_encoding(train_df, test_df, i_cols, self_encoding=False)
##########
Most common uIds:
########## uid1
7919_194.0     14891
9500_321.0     14112
15885_545.0    10332
17188_321.0    10312
15066_170.0     7918
12695_490.0     7079
6019_583.0      6766
12544_321.0     6760
2803_100.0      6126
7585_553.0      5325
Name: uid1, dtype: int64
########## uid2
9500_321.0_150.0_226.0     14112
15885_545.0_185.0_138.0    10332
17188_321.0_150.0_226.0    10312
7919_194.0_150.0_166.0      8844
15066_170.0_150.0_102.0     7918
12695_490.0_150.0_226.0     7079
6019_583.0_150.0_226.0      6766
12544_321.0_150.0_226.0     6760
2803_100.0_150.0_226.0      6126
7919_194.0_150.0_202.0      6047
Name: uid2, dtype: int64
########## uid3
15885_545.0_185.0_138.0_nan_nan       9900
17188_321.0_150.0_226.0_299.0_87.0    5862
12695_490.0_150.0_226.0_325.0_87.0    5766
9500_321.0_150.0_226.0_204.0_87.0     4647
3154_408.0_185.0_224.0_nan_nan        4398
12839_321.0_150.0_226.0_264.0_87.0    3538
16132_111.0_150.0_226.0_299.0_87.0    3523
15497_490.0_150.0_226.0_299.0_87.0    3419
9500_321.0_150.0_226.0_272.0_87.0     2715
5812_408.0_185.0_224.0_nan_nan        2639
Name: uid3, dtype: int64
########## uid4
15885_545.0_185.0_138.0_nan_nan_hotmail.com     4002
15885_545.0_185.0_138.0_nan_nan_gmail.com       3830
17188_321.0_150.0_226.0_299.0_87.0_gmail.com    2235
12695_490.0_150.0_226.0_325.0_87.0_gmail.com    2045
9500_321.0_150.0_226.0_204.0_87.0_gmail.com     1947
3154_408.0_185.0_224.0_nan_nan_hotmail.com      1890
3154_408.0_185.0_224.0_nan_nan_gmail.com        1537
12839_321.0_150.0_226.0_264.0_87.0_gmail.com    1473
15775_481.0_150.0_102.0_330.0_87.0_uknown       1453
15497_490.0_150.0_226.0_299.0_87.0_gmail.com    1383
Name: uid4, dtype: int64
########## uid5
12695_490.0_150.0_226.0_325.0_87.0_uknown      5446
17188_321.0_150.0_226.0_299.0_87.0_uknown      5322
9500_321.0_150.0_226.0_204.0_87.0_uknown       4403
15885_545.0_185.0_138.0_nan_nan_hotmail.com    4002
15885_545.0_185.0_138.0_nan_nan_gmail.com      3830
12839_321.0_150.0_226.0_264.0_87.0_uknown      3365
16132_111.0_150.0_226.0_299.0_87.0_uknown      3212
15497_490.0_150.0_226.0_299.0_87.0_uknown      3027
9500_321.0_150.0_226.0_272.0_87.0_uknown       2601
7664_490.0_150.0_226.0_264.0_87.0_uknown       2396
Name: uid5, dtype: int64
########## uid6
15885.0_0.0    7398
7919.0_0.0     4170
6019.0_nan     3962
nan_0.0        3754
9500.0_0.0     3414
3154.0_0.0     3016
15066.0_0.0    2995
9633.0_0.0     2968
nan_nan        2794
17188.0_0.0    2434
Name: uid6, dtype: int64
########## uid7
15775_481.0_150.0_102.0_330.0_87.0_uknown_nan    1453
12695_490.0_150.0_226.0_325.0_87.0_uknown_nan     928
17188_321.0_150.0_226.0_299.0_87.0_uknown_nan     923
9500_321.0_150.0_226.0_204.0_87.0_uknown_nan      622
16132_111.0_150.0_226.0_299.0_87.0_uknown_nan     622
12839_321.0_150.0_226.0_264.0_87.0_uknown_nan     580
7207_111.0_150.0_226.0_204.0_87.0_uknown_nan      551
7664_490.0_150.0_226.0_264.0_87.0_uknown_nan      545
15497_490.0_150.0_226.0_299.0_87.0_uknown_nan     480
9112_250.0_150.0_226.0_441.0_87.0_uknown_nan      439
Name: uid7, dtype: int64
########## cid_1
15775_481.0_150.0_102.0_330.0_87.0_uknown_-129.0       1414
9500_321.0_150.0_226.0_126.0_87.0_aol.com_85.0          404
8528_215.0_150.0_226.0_387.0_87.0_uknown_159.0          207
7207_111.0_150.0_226.0_204.0_87.0_uknown_465.0          189
12741_106.0_150.0_226.0_143.0_87.0_gmail.com_202.0      156
13597_198.0_150.0_226.0_191.0_87.0_yahoo.com_48.0       145
4121_361.0_150.0_226.0_476.0_87.0_hotmail.com_8.0       141
8900_385.0_150.0_226.0_231.0_87.0_uknown_60.0           132
9323_111.0_150.0_226.0_191.0_87.0_charter.net_50.0      109
3898_281.0_150.0_226.0_181.0_87.0_hotmail.com_188.0     106
Name: cid_1, dtype: int64
#4. period counts
for col in ['DT_M','DT_W','DT_D']:
    temp_df = pd.concat([train_df[[col]], test_df[[col]]])
    fq_encode = temp_df[col].value_counts().to_dict()
            
    train_df[col+'_total'] = train_df[col].map(fq_encode)
    test_df[col+'_total']  = test_df[col].map(fq_encode)
        
#User period counts
periods = ['DT_M','DT_W','DT_D']
i_cols = ['uid4','uid5','uid6','uid7','cid_1']
for period in periods:
    for col in i_cols:
        new_column = col + '_' + period
            
        temp_df = pd.concat([train_df[[col,period]], test_df[[col,period]]])
        temp_df[new_column] = temp_df[col].astype(str) + '_' + (temp_df[period]).astype(str)
        fq_encode = temp_df[new_column].value_counts().to_dict()
            
        train_df[new_column] = (train_df[col].astype(str) + '_' + train_df[period].astype(str)).map(fq_encode)
        test_df[new_column]  = (test_df[col].astype(str) + '_' + test_df[period].astype(str)).map(fq_encode)
        
        train_df[new_column] /= train_df[period+'_total']
        test_df[new_column]  /= test_df[period+'_total']
#5. Prepare bank type feature
for df in [train_df, test_df]:
    df['bank_type'] = df['card3'].astype(str) +'_'+ df['card5'].astype(str)

encoding_mean = {
    1: ['DT_D','DT_hour','_hour_dist','DT_hour_mean'],
    2: ['DT_W','DT_day_week','_week_day_dist','DT_day_week_mean'],
    3: ['DT_M','DT_day_month','_month_day_dist','DT_day_month_mean'],
    }

encoding_best = {
    1: ['DT_D','DT_hour','_hour_dist_best','DT_hour_best'],
    2: ['DT_W','DT_day_week','_week_day_dist_best','DT_day_week_best'],
    3: ['DT_M','DT_day_month','_month_day_dist_best','DT_day_month_best'],   
    }

train_df['DT_day_month'] = (train_df['DT'].dt.day).astype(np.int8)
test_df['DT_day_month'] = (test_df['DT'].dt.day).astype(np.int8)
# Some ugly code here (even worse than in other parts)
for col in ['card3','card5','bank_type']:
    for df in [train_df, test_df]:
        for encode in encoding_mean:
            encode = encoding_mean[encode].copy()
            new_col = col + '_' + encode[0] + encode[2]
            df[new_col] = df[col].astype(str) +'_'+ df[encode[0]].astype(str)

            temp_dict = df.groupby([new_col])[encode[1]].agg(['mean']).reset_index().rename(
                                                                    columns={'mean': encode[3]})
            temp_dict.index = temp_dict[new_col].values
            temp_dict = temp_dict[encode[3]].to_dict()
            df[new_col] = df[encode[1]] - df[new_col].map(temp_dict)

        for encode in encoding_best:
            encode = encoding_best[encode].copy()
            new_col = col + '_' + encode[0] + encode[2]
            df[new_col] = df[col].astype(str) +'_'+ df[encode[0]].astype(str)
            temp_dict = df.groupby([col,encode[0],encode[1]])[encode[1]].agg(['count']).reset_index().rename(
                                                                    columns={'count': encode[3]})

            temp_dict.sort_values(by=[col,encode[0],encode[3]], inplace=True)
            temp_dict = temp_dict.drop_duplicates(subset=[col,encode[0]], keep='last')
            temp_dict[new_col] = temp_dict[col].astype(str) +'_'+ temp_dict[encode[0]].astype(str)
            temp_dict.index = temp_dict[new_col].values
            temp_dict = temp_dict[encode[1]].to_dict()
            df[new_col] = df[encode[1]] - df[new_col].map(temp_dict)
#6. BankType timeblock_frequency_encoding
i_cols = ['bank_type'] 
periods = ['DT_M','DT_W','DT_D']

# We have few options to encode it here:
# - Just count transactions
# (but some timblocks have more transactions than others)
# - Devide to total transactions per timeblock (proportions)
# - Use both
# - Use only proportions
train_df, test_df = timeblock_frequency_encoding(train_df, test_df, periods, i_cols, 
                                 with_proportions=False, only_proportions=True)
#7. Ds uid aggregations (maybe not useful)
i_cols = ['D'+str(i) for i in range(1,16)]
uids = ['uid3','uid4','uid5','bank_type','cid_1','uid6','uid7']
aggregations = ['mean','min']

####### Cleaning negative values and columns transformations
for df in [train_df, test_df]:

    for col in i_cols:
        df[col] = df[col].clip(0) 
    
    # Lets transform D8 and D9 column
    # As we almost sure it has connection with hours
    df['D9_not_na'] = np.where(df['D9'].isna(),0,1)
    df['D8_not_same_day'] = np.where(df['D8']>=1,1,0)
    df['D8_D9_decimal_dist'] = df['D8'].fillna(0)-df['D8'].fillna(0).astype(int)
    df['D8_D9_decimal_dist'] = ((df['D8_D9_decimal_dist']-df['D9'])**2)**0.5
    df['D8'] = df['D8'].fillna(-1).astype(int)
def values_normalization(dt_df, periods, columns):
    for period in periods:
        for col in columns:
            new_col = col +'_'+ period
            dt_df[col] = dt_df[col].astype(float)  

            temp_min = dt_df.groupby([period])[col].agg(['min']).reset_index()
            temp_min.index = temp_min[period].values
            temp_min = temp_min['min'].to_dict()

            temp_max = dt_df.groupby([period])[col].agg(['max']).reset_index()
            temp_max.index = temp_max[period].values
            temp_max = temp_max['max'].to_dict()

            temp_mean = dt_df.groupby([period])[col].agg(['mean']).reset_index()
            temp_mean.index = temp_mean[period].values
            temp_mean = temp_mean['mean'].to_dict()

            temp_std = dt_df.groupby([period])[col].agg(['std']).reset_index()
            temp_std.index = temp_std[period].values
            temp_std = temp_std['std'].to_dict()

            dt_df['temp_min'] = dt_df[period].map(temp_min)
            dt_df['temp_max'] = dt_df[period].map(temp_max)
            dt_df['temp_mean'] = dt_df[period].map(temp_mean)
            dt_df['temp_std'] = dt_df[period].map(temp_std)

            dt_df[new_col+'_min_max'] = (dt_df[col]-dt_df['temp_min'])/(dt_df['temp_max']-dt_df['temp_min'])
            dt_df[new_col+'_std_score'] = (dt_df[col]-dt_df['temp_mean'])/(dt_df['temp_std'])
            del dt_df['temp_min'],dt_df['temp_max'],dt_df['temp_mean'],dt_df['temp_std']
    return dt_df
#8. Ds period calculation (maybe not useful)
####### Values Normalization
i_cols.remove('D1')
i_cols.remove('D2')
i_cols.remove('D9')
periods = ['DT_D','DT_W','DT_M']

for df in [train_df, test_df]:
    df = values_normalization(df, periods, i_cols)


for col in ['D1','D2']:
    for df in [train_df, test_df]:
        df[col+'_scaled'] = df[col]/train_df[col].max()
        
####### Global Self frequency encoding
# self_encoding=True because 
# we don't need original values anymore
i_cols = ['D'+str(i) for i in range(1,16)]
train_df, test_df = frequency_encoding(train_df, test_df, i_cols, self_encoding=True)

Various treatments of TransactionAmt:

#9. TransAmt uids/cids aggregations and calculations(need more fe)
i_cols = ['TransactionAmt','TransactionAmt_decimal']
#uids = ['card1','card2','card3','card5','uid1','uid2','uid3','uid4','uid5','bank_type','uid6']
uids = ['card1','card2','card3','card5','uid3','uid4','uid5','bank_type','uid6','uid7','cid_1']
aggregations = ['mean','std','min']

# uIDs aggregations
train_df, test_df = uid_aggregation(train_df, test_df, i_cols, uids, aggregations)

for df in [train_df,test_df]:
    df['transAmt_mut_C1'] = df['TransactionAmt'] * df['C1']
    df['transAmt_mut_C13'] = df['TransactionAmt'] * df['C13']
    df['transAmt_mut_C14'] = df['TransactionAmt'] * df['C14']
    df['transAmt_dec_diff'] = df['TransactionAmt_decimal'] - ((df['uid4_TransactionAmt_mean']-df['uid4_TransactionAmt_mean'].astype(int)) * 1000).astype(int)
    df['Transdiff_in_uid'] = df['transAmt_dec_diff']*df['uid4_TransactionAmt_mean']/1000

# TransactionAmt Normalization-period scaling
periods = ['DT_D','DT_W','DT_M']
for df in [train_df, test_df]:
    df = values_normalization(df, periods, i_cols)
# Product type
train_df['product_type'] = train_df['ProductCD'].astype(str)+'_'+train_df['TransactionAmt'].astype(str)
test_df['product_type'] = test_df['ProductCD'].astype(str)+'_'+test_df['TransactionAmt'].astype(str)

i_cols = ['product_type']
periods = ['DT_D','DT_W','DT_M']
train_df, test_df = timeblock_frequency_encoding(train_df, test_df, periods, i_cols, 
                                                 with_proportions=False, only_proportions=True)
train_df, test_df = frequency_encoding(train_df, test_df, i_cols, self_encoding=True)

Classify the V features; reference: Rajesh Vikraman - Understanding V columns

def column_value_freq(sel_col,cum_per):
    dfpercount = pd.DataFrame(columns=['col_name','num_values_'+str(round(cum_per,2))])
    for col in sel_col:
        col_value = train_df[col].value_counts(normalize=True)
        colpercount = pd.DataFrame({'value' : col_value.index,'per_count' : col_value.values})
        colpercount['cum_per_count'] = colpercount['per_count'].cumsum()
        if len(colpercount.loc[colpercount['cum_per_count'] < cum_per,] ) < 2:
            num_col_99 = len(colpercount.loc[colpercount['per_count'] > (1- cum_per),]) # count the dominant values (each covering more than 1 - cum_per of the rows)
        else:
            num_col_99 = len(colpercount.loc[colpercount['cum_per_count']< cum_per,] ) # number of values needed to reach cum_per cumulative frequency
        dfpercount=dfpercount.append({'col_name': col,'num_values_'+str(round(cum_per,2)): num_col_99},ignore_index = True)
    dfpercount['unique_values'] = train_df[sel_col].nunique().values
    dfpercount['unique_value_to_num_values'+str(round(cum_per,2))+'_ratio'] = 100 * (dfpercount['num_values_'+str(round(cum_per,2))]/dfpercount.unique_values)
    #dfpercount['percent_missing'] = percent_na(train_transaction[sel_col])['percent_missing'].round(3).values
    return dfpercount
#10. V cols
#Understand V cols
v_cols = ['V'+str(i) for i in range(1,340)]
cum_per = 0.965
colfreq=column_value_freq(v_cols,cum_per)
print(colfreq.head())
colfreq_bool = colfreq[colfreq.unique_values==2]['col_name'].values
colfreq_pseudobool = colfreq[(colfreq.unique_values !=2) & (colfreq['num_values_'+str(round(cum_per,2))] <= 2)]
colfreq_pseudobool_cat = colfreq_pseudobool[colfreq_pseudobool.unique_values <=15]['col_name'].values
colfreq_pseudobool_num = colfreq_pseudobool[colfreq_pseudobool.unique_values >15]['col_name'].values
colfreq_cat = colfreq[(colfreq.unique_values >15) & (colfreq['num_values_'+str(round(cum_per,2))] <= 15) & (colfreq['num_values_'+str(round(cum_per,2))]> 2)]['col_name'].values
colfreq_num = colfreq[colfreq['num_values_'+str(round(cum_per,2))]>15]['col_name'].values
 col_name num_values_0.96  unique_values unique_value_to_num_values0.96_ratio
0       V1               1              2                                   50
1       V2               2              9                              22.2222
2       V3               2             10                                   20
3       V4               2              7                              28.5714
4       V5               2              7                              28.5714

Based on EDA observations, remove some of the outliers (clipping):

#clipping v_num_cats
vcol_spike = ['V96', 'V97','V167', 'V168','V177', 'V178','V179', 'V217', 'V218', 'V219','V231','V280', 'V282','V294', 'V322', 'V323', 'V324']
cols = list(colfreq_pseudobool_num) + vcol_spike
for df in [train_df, test_df]:
    for col in cols :
        max_value = train_df[train_df['DT_M']==train_df['DT_M'].min()][col].max()
        df[col] = df[col].clip(None,max_value) 

Normalize and run PCA only on the numerical V features. Reference: Konstantin Yakovlev - IEEE - V columns pv. Unlike the reference, this model only processes the numerical V features:

#Dealing with V cols
#Scaling with pca - Numerical V cols - scaling still needs to be applied with care
from sklearn.preprocessing import StandardScaler

v_cols = colfreq_num
print(v_cols)
test_group = list(v_cols)
train_df['group_sum'] = train_df[test_group].to_numpy().sum(axis=1)
train_df['group_mean'] = train_df[test_group].to_numpy().mean(axis=1)
    
test_df['group_sum'] = test_df[test_group].to_numpy().sum(axis=1)
test_df['group_mean'] = test_df[test_group].to_numpy().mean(axis=1)
compact_cols = ['group_sum','group_mean']
 
for col in test_group:
    sc = StandardScaler()
    sc.fit(train_df[[col]].fillna(0))
    train_df[col] = sc.transform(train_df[[col]].fillna(0))
    test_df[col] = sc.transform(test_df[[col]].fillna(0))
    
sc_test_group = test_group

# check -> same obviously
features_check = []
from scipy.stats import ks_2samp # tests whether two samples come from the same distribution
for col in sc_test_group:
    features_check.append(ks_2samp(train_df[col], test_df[col])[1])
    
features_check = pd.Series(features_check, index=sc_test_group).sort_values() 
print(features_check)

from sklearn.decomposition import PCA
#PCA is still worthwhile - it is an orthogonal linear denoising step
pca = PCA(random_state=42)
pca.fit(train_df[sc_test_group])
print(len(sc_test_group), pca.transform(train_df[sc_test_group]).shape[-1])
train_df[sc_test_group] = pca.transform(train_df[sc_test_group])
test_df[sc_test_group] = pca.transform(test_df[sc_test_group])

sc_variance =pca.explained_variance_ratio_
print(sc_variance)

# check
features_check = []

for col in sc_test_group:
    features_check.append(ks_2samp(train_df[col], test_df[col])[1])
    
features_check = pd.Series(features_check, index=sc_test_group).sort_values() 
print(features_check)
train_df[col], test_df[col]
['V126' 'V127' 'V128' 'V130' 'V131' 'V132' 'V133' 'V134' 'V136' 'V137'
 'V143' 'V144' 'V145' 'V150' 'V159' 'V160' 'V164' 'V165' 'V166' 'V202'
 'V203' 'V204' 'V205' 'V206' 'V207' 'V208' 'V209' 'V210' 'V211' 'V212'
 'V213' 'V214' 'V215' 'V216' 'V263' 'V264' 'V265' 'V266' 'V267' 'V268'
 'V270' 'V271' 'V272' 'V273' 'V274' 'V275' 'V276' 'V277' 'V278' 'V306'
 'V307' 'V308' 'V309' 'V310' 'V312' 'V313' 'V314' 'V315' 'V316' 'V317'
 'V318' 'V320' 'V321' 'V331' 'V332' 'V333' 'V335']
V130    1.069942e-100
V136     3.228515e-87
V317     1.542568e-65
V133     9.980904e-61
V127     1.833679e-60
            ...      
V206     2.495735e-01
V332     2.967873e-01
V333     2.998484e-01
V331     4.952229e-01
V335     5.364810e-01
Length: 67, dtype: float64
67 67
[3.95243705e-01 1.20604713e-01 9.08724136e-02 7.99145695e-02
 5.93916129e-02 5.00332241e-02 4.54584312e-02 2.89428818e-02
 2.32736617e-02 1.84120687e-02 1.45453003e-02 1.14526355e-02
 7.81445065e-03 7.34786437e-03 5.85068362e-03 4.37949141e-03
 4.02093888e-03 3.46559896e-03 3.18676729e-03 2.44932594e-03
 2.27222589e-03 2.24909807e-03 1.99991560e-03 1.95987640e-03
 1.71973338e-03 1.53540490e-03 1.51142993e-03 1.03556294e-03
 9.71562292e-04 8.96170111e-04 8.77193459e-04 7.01838736e-04
 6.96799764e-04 6.61501420e-04 5.95271089e-04 5.05089251e-04
 4.25160153e-04 3.66978537e-04 3.32092303e-04 3.14215803e-04
 3.07523162e-04 2.74325442e-04 2.09382351e-04 1.37651414e-04
 1.31921650e-04 1.06460734e-04 9.48509909e-05 8.59092947e-05
 7.21478151e-05 6.68975933e-05 5.36152094e-05 4.34649260e-05
 3.15145841e-05 2.53136017e-05 2.00119609e-05 1.58136115e-05
 1.04702022e-05 9.50144927e-06 5.76505723e-06 4.72613777e-06
 2.37874562e-06 2.01718484e-06 5.65667558e-07 2.11127749e-07
 7.10596417e-08 2.13648711e-08 9.31242653e-09]
V216    0.000000e+00
V215    0.000000e+00
V333    0.000000e+00
V316    0.000000e+00
V265    0.000000e+00
            ...     
V310    9.931409e-60
V130    1.223724e-54
V204    6.822684e-54
V209    3.987460e-53
V312    5.351877e-44
Length: 67, dtype: float64
(0         0.000023
 1         0.000009
 2         0.000009
 3         0.000015
 4         0.000141
             ...   
 590535    0.000013
 590536    0.000009
 590537    0.000009
 590538    0.000012
 590539    0.000016
 Name: V335, Length: 590540, dtype: float64, 0         0.000009
 1        -0.000035
 2        -0.000038
 3         0.000018
 4        -0.000004
             ...   
 506686    0.000009
 506687    0.000007
 506688    0.000009
 506689    0.000009
 506690    0.000009
 Name: V335, Length: 506691, dtype: float64)

Clip the C features. EDA shows that many C features have very different distributions in train and test; the training set exhibits pronounced outliers in winter, so removing them should improve the distributions.

#12. Cs frequency encode and clip
i_cols = ['C'+str(i) for i in range(1,15)]

####### Global Self frequency encoding
# self_encoding=False because 
# I want to keep original values
train_df, test_df = frequency_encoding(train_df, test_df, i_cols, self_encoding=False)

####### Clip max values - this effectively removes the winter outliers
for df in [train_df, test_df]:
    for col in i_cols:
        max_value = train_df[train_df['DT_M']==train_df['DT_M'].max()][col].max()
        df[col] = df[col].clip(None,max_value) 

Try out various feature combinations:

#13. More combinations
## Identity columns
from sklearn.preprocessing import LabelEncoder
for col in ['id_33']:
    train_identity[col] = train_identity[col].fillna('unseen_before_label')
    test_identity[col]  = test_identity[col].fillna('unseen_before_label')
    
    le = LabelEncoder()
    le.fit(list(train_identity[col])+list(test_identity[col]))
    train_identity[col] = le.transform(train_identity[col])
    test_identity[col]  = le.transform(test_identity[col])
    
print('train_set shape before merge:',train_df.shape)
train_df1 = train_df.merge(train_identity,how='left',on=['TransactionID'])
print('train_set shape after merge:',train_df1.shape)

print('test_set shape before merge:',test_df.shape)
test_df1 = test_df.merge(test_identity,how='left',on=['TransactionID'])
print('test_set shape after merge:',test_df1.shape)

# New feature - mean of sth
columns_a = ['TransactionAmt', 'id_02', 'D15']
columns_b = ['card1', 'card4', 'addr1']
for col_a in columns_a:
    for col_b in columns_b:
        for df in [train_df1, test_df1]:
            df[f'{col_a}_to_mean_{col_b}'] = df[col_a] / df.groupby([col_b])[col_a].transform('mean')
            df[f'{col_a}_to_std_{col_b}'] = df[col_a] / df.groupby([col_b])[col_a].transform('std')
del columns_a,columns_b
gc.collect()

# Some arbitrary feature interactions (experimental)
for feature in ['id_02__id_20', 'id_02__D8', 'D11__DeviceInfo', 'DeviceInfo__P_emaildomain', 'P_emaildomain__C2', 
                    'card2__dist1', 'card1__card5', 'card2__id_20', 'card5__P_emaildomain', 'addr1__card1','card1__id_02']:

    f1, f2 = feature.split('__')
    train_df1[feature] = train_df1[f1].astype(str) + '_' + train_df1[f2].astype(str)
    test_df1[feature] = test_df1[f1].astype(str) + '_' + test_df1[f2].astype(str)

    le = LabelEncoder()
    le.fit(list(train_df1[feature].astype(str).values) + list(test_df1[feature].astype(str).values))
    train_df1[feature] = le.transform(list(train_df1[feature].astype(str).values))
    test_df1[feature] = le.transform(list(test_df1[feature].astype(str).values))
train_df = train_df1
test_df = test_df1

Use had_id to distinguish traditional (offline) transactions from online ones. This is only a hypothesis, but the resulting split of the transactions fits it and explains the winter surge in transaction amount and count well (offline peaks around Black Friday, Cyber Monday, and Chinese New Year in February). Aggregation features separated by had_id are built below; they gave a slight LB improvement:

train_df['had_id'] = train_df['had_id'].fillna(0)
test_df['had_id'] = test_df['had_id'].fillna(0)
def uid_sep_aggregation(train_df, test_df, main_columns, uids, aggregations):
    for main_column in main_columns:  
        for col in uids:
            for agg_type in aggregations:
                new_col_name = col+'_'+main_column+'_sep_'+agg_type
                
                train_df[col+'_sep'] = train_df[col].astype(str)+train_df['had_id'].astype(str)
                test_df[col+'_sep'] = test_df[col].astype(str)+test_df['had_id'].astype(str)
                
                temp_df = pd.concat([train_df[[col+'_sep', main_column]], test_df[[col+'_sep',main_column]]])
                temp_df = temp_df.groupby([col+'_sep'])[main_column].agg([agg_type]).reset_index().rename(
                                                        columns={agg_type: new_col_name})
                
                temp_df.index = list(temp_df[col+'_sep'])
                temp_df = temp_df[new_col_name].to_dict()   
                
                train_df[new_col_name] = train_df[col+'_sep'].map(temp_df)
                test_df[new_col_name]  = test_df[col+'_sep'].map(temp_df)
                del train_df[col+'_sep'],test_df[col+'_sep']
    return train_df, test_df

def values_sep_normalization(dt_df, periods, columns):
    for period in periods:
        for col in columns:
            new_col = col +'_sep_'+ period
            dt_df[col] = dt_df[col].astype(float)  

            dt_df[period+'_sep'] = dt_df[period].astype(str)+dt_df['had_id'].astype(str)     
                
            temp_min = dt_df.groupby([period+'_sep'])[col].agg(['min']).reset_index()
            temp_min.index = temp_min[period+'_sep'].values
            temp_min = temp_min['min'].to_dict()

            temp_max = dt_df.groupby([period+'_sep'])[col].agg(['max']).reset_index()
            temp_max.index = temp_max[period+'_sep'].values
            temp_max = temp_max['max'].to_dict()

            temp_mean = dt_df.groupby([period+'_sep'])[col].agg(['mean']).reset_index()
            temp_mean.index = temp_mean[period+'_sep'].values
            temp_mean = temp_mean['mean'].to_dict()

            temp_std = dt_df.groupby([period+'_sep'])[col].agg(['std']).reset_index()
            temp_std.index = temp_std[period+'_sep'].values
            temp_std = temp_std['std'].to_dict()

            dt_df['temp_min'] = dt_df[period+'_sep'].map(temp_min)
            dt_df['temp_max'] = dt_df[period+'_sep'].map(temp_max)
            dt_df['temp_mean'] = dt_df[period+'_sep'].map(temp_mean)
            dt_df['temp_std'] = dt_df[period+'_sep'].map(temp_std)

            dt_df[new_col+'_min_max'] = (dt_df[col]-dt_df['temp_min'])/(dt_df['temp_max']-dt_df['temp_min'])
            dt_df[new_col+'_std_score'] = (dt_df[col]-dt_df['temp_mean'])/(dt_df['temp_std'])
            del dt_df['temp_min'],dt_df['temp_max'],dt_df['temp_mean'],dt_df['temp_std'],dt_df[period+'_sep']
    return dt_df
#9.1 TransactionAmt separated by had_id (online/traditional)
# group-by-uid aggregations, separated by online/traditional
i_cols = ['TransactionAmt','TransactionAmt_decimal']
uids = ['uid3','uid4','uid5','bank_type','uid6','uid7','cid_1']
aggregations = ['mean','std','min']

train_df, test_df = uid_sep_aggregation(train_df, test_df, i_cols, uids, aggregations)

# period normalization, separated by online/traditional
periods = ['DT_D','DT_W','DT_M']
for df in [train_df, test_df]:
    df = values_sep_normalization(df, periods, i_cols)

Frequency-encode some of the features, or convert strings to numerical values:

#14. Category Encoding
print('Category Encoding')
from sklearn.preprocessing import LabelEncoder
## card4, card6, ProductCD
# Converting Strings to ints(or floats if nan in column) using frequency encoding
# We will be able to use these columns as category or as numerical feature


for col in ['card4', 'card6', 'ProductCD']:
    print('Encoding', col)
    temp_df = pd.concat([train_df[[col]], test_df[[col]]])
    col_encoded = temp_df[col].value_counts().to_dict()   
    train_df[col] = train_df[col].map(col_encoded)  # encode multi-class categories by their occurrence counts (frequency encoding)
    test_df[col]  = test_df[col].map(col_encoded)
    print(col_encoded)
    del temp_df,col_encoded
    gc.collect()

## M columns
# Converting Strings to ints(or floats if nan in column)

for col in ['M1','M2','M3','M5','M6','M7','M8','M9']:
    train_df[col] = train_df[col].map({'T':1, 'F':0})
    test_df[col]  = test_df[col].map({'T':1, 'F':0})

for col in ['P_emaildomain', 'R_emaildomain','M4']:
    print('Encoding', col)
    temp_df = pd.concat([train_df[[col]], test_df[[col]]])
    col_encoded = temp_df[col].value_counts().to_dict()   
    train_df[col] = train_df[col].map(col_encoded)
    test_df[col]  = test_df[col].map(col_encoded)
    print(col_encoded)
    del temp_df,col_encoded
    gc.collect()
    
i_cols = ['TransactionAmt']
uids = ['card2__id_20','card1__id_02']
aggregations = ['mean','std']

# uIDs aggregations
train_df, test_df = uid_aggregation(train_df, test_df, i_cols, uids, aggregations)
 
    
## Reduce Mem One More Time
train_df = reduce_mem_usage(train_df)
test_df  = reduce_mem_usage(test_df)
Category Encoding
Encoding card4
{'visa': 722693, 'mastercard': 348803, 'american express': 16078, 'discover': 9572}
Encoding card6
{'debit': 828379, 'credit': 268753, 'charge card': 16}
Encoding ProductCD
{'W': 800657, 'C': 137785, 'R': 73346, 'H': 62397, 'S': 23046}
Encoding P_emaildomain
{'gmail.com': 435803, 'yahoo.com': 182784, 'uknown': 163648, 'hotmail.com': 85649, 'anonymous.com': 71062, 'aol.com': 52337, 'comcast.net': 14474, 'icloud.com': 12316, 'outlook.com': 9934, 'att.net': 7647, 'msn.com': 7480, 'sbcglobal.net': 5767, 'live.com': 5720, 'verizon.net': 5011, 'ymail.com': 4075, 'bellsouth.net': 3437, 'yahoo.com.mx': 2827, 'me.com': 2713, 'cox.net': 2657, 'optonline.net': 1937, 'live.com.mx': 1470, 'charter.net': 1443, 'mail.com': 1156, 'rocketmail.com': 1105, 'gmail': 993, 'earthlink.net': 979, 'outlook.es': 863, 'mac.com': 862, 'hotmail.fr': 674, 'hotmail.es': 627, 'frontier.com': 594, 'roadrunner.com': 583, 'juno.com': 574, 'windstream.net': 552, 'web.de': 518, 'aim.com': 468, 'embarqmail.com': 464, 'twc.com': 439, 'frontiernet.net': 397, 'netzero.com': 387, 'centurylink.net': 386, 'q.com': 362, 'yahoo.fr': 344, 'hotmail.co.uk': 334, 'suddenlink.net': 323, 'netzero.net': 319, 'cfl.rr.com': 318, 'cableone.net': 311, 'prodigy.net.mx': 303, 'gmx.de': 298, 'sc.rr.com': 277, 'yahoo.es': 272, 'protonmail.com': 159, 'ptd.net': 140, 'yahoo.de': 137, 'hotmail.de': 130, 'live.fr': 106, 'yahoo.co.uk': 103, 'yahoo.co.jp': 101, 'servicios-ta.com': 80, 'scranton.edu': 2}
Encoding R_emaildomain
{'uknown': 824070, 'gmail.com': 118885, 'hotmail.com': 53166, 'anonymous.com': 39644, 'yahoo.com': 21405, 'aol.com': 7239, 'outlook.com': 5011, 'comcast.net': 3513, 'icloud.com': 2820, 'yahoo.com.mx': 2743, 'msn.com': 1698, 'live.com.mx': 1464, 'live.com': 1444, 'verizon.net': 1202, 'sbcglobal.net': 1163, 'me.com': 1095, 'att.net': 870, 'cox.net': 854, 'outlook.es': 853, 'bellsouth.net': 795, 'hotmail.fr': 667, 'hotmail.es': 595, 'web.de': 514, 'mac.com': 430, 'ymail.com': 405, 'optonline.net': 350, 'mail.com': 341, 'hotmail.co.uk': 317, 'yahoo.fr': 315, 'prodigy.net.mx': 303, 'gmx.de': 297, 'charter.net': 263, 'gmail': 196, 'earthlink.net': 170, 'embarqmail.com': 140, 'yahoo.de': 139, 'hotmail.de': 130, 'rocketmail.com': 126, 'yahoo.es': 124, 'juno.com': 111, 'frontier.com': 110, 'live.fr': 105, 'windstream.net': 104, 'yahoo.co.jp': 104, 'roadrunner.com': 101, 'yahoo.co.uk': 82, 'servicios-ta.com': 80, 'aim.com': 77, 'protonmail.com': 75, 'ptd.net': 70, 'scranton.edu': 69, 'twc.com': 61, 'cfl.rr.com': 57, 'suddenlink.net': 55, 'cableone.net': 46, 'q.com': 45, 'frontiernet.net': 38, 'centurylink.net': 28, 'netzero.com': 24, 'netzero.net': 19, 'sc.rr.com': 14}
Encoding M4
{'M0': 357789, 'M2': 122947, 'M1': 97306}
Mem. usage decreased to 1086.67 Mb (55.8% reduction)
Mem. usage decreased to 939.08 Mb (55.5% reduction)

Save the results as .pkl files; they take little disk space and can be loaded quickly with pd.read_pickle:

## Export
train_df.to_pickle('train_transaction_15.pkl')
test_df.to_pickle('test_transaction_15.pkl')

3. Feature Selection + Dimensionality Reduction (Experiment Log)

  1. Applied PCA-related operations to the V features in blocks. The optimal number of components could not be determined, and many V features are not numerical and thus unsuitable for PCA; adding only the block-wise PCA features did not improve LB, so this was abandoned.

  2. Considering that V126-V138 may be cumulative values of something, computed diff() on them grouped by ProductCD, card1 and addr1; among these, 'V126', 'V127', 'V128', 'V130', 'V131', 'V132', 'V133', 'V134', 'V136', 'V137' are numerical. LB did not improve after the experiment, so this was abandoned.

  3. Dimensionality reduction:
    (1) Recursive feature elimination on blocks of features: built three blocks (UID_block, D_block, Trans_block) and removed 30 features; LB 0.9488 --> 0.9487, acceptable.
    (2) PCA on v_cols: applied scaling plus PCA only to the numerical part of V126-V335, with train and test normalized separately. The train/test distribution mismatch seems mostly resolved (the distributions no longer differ much), but test still contains many extremely large values with a much higher threshold. The other V features are not suitable for PCA.
    (3) Permutation importance:
    tried it on solo features only. Across 5 folds, the features with importance = 0 were ['C7', 'C7_fq_enc', 'C10_fq_enc', 'addr2', 'M7', 'C4']; the fluctuation should be small, so tried dropping them. LB: 0.9486 --> 0.9482, a drop; they cannot be removed yet, and they seem to help mitigate overfitting.
    (4) Wanted to filter the mean/std and uid-agg combination features more carefully, since the overfitting seems to come from them.
    Added the Di_uidagg part and removed the Duid part, cutting about 200 dimensions at once; LB: 0.9486 --> 0.9478, a drop of almost one part per thousand... The D uid-agg features are meaningless and are no doubt a cause of overfitting, but they cannot be dropped without good replacement features. (Later, using D to build a new uid and aggregating TransactionAmt on it raised LB substantially.)

  4. Changing the distribution:
    1) Pseudo labeling (two approaches) / oversampling
    (1) Take the most extreme test predictions (within 0.01) from one prediction round, add them to train, and rerun the 6-fold LightGBM:
    the positive-class rate fell from 0.03499 to 0.019082, half of before, and the distribution gap widened; abandoned.
    In short, oversampling on this dataset would only make the class imbalance worse.
    (2) Negative downsampling:
    saves time and is useful for testing new features, but it does not improve model performance and discards part of the training data (a minimal sketch appears after this list).

  5. Deeper FE (finding the magic):
    (1) Built a rolling window of duplicates: the ratio is about the same in train and test; LB: 0.9487 --> 0.9486, abandoned.
    (2) Improve the distributions + try to raise prediction accuracy for winter (where the distributions differ the most):
    a. Cs: made 15- and 30-day shifts plus 30-day rolling mean and median; LB: 0.9487 --> 0.9486, basically useless.
    b. Vs: 'V144', 'V145', 'V150', 'V159', 'V160' show small hill-shaped bumps in winter; made 15- and 30-day shifts on them before the scaling + PCA step.
    c. Lag features on the user id: my laptop does not have enough memory to build rolling-window lag features; abandoned.
    d. Applied uid_aggregation to part of the C columns; could this distinguish merchant users from ordinary users? In a sense this also amplifies the influence of the winter data, but the corresponding samples are too few, and it did not lift the first fold's AUC, so it does not look workable; abandoned.
    f. Binned the email features and added a few TransactionAmt*C interactions. The features rank fairly important, but LB: 0.9487 --> 0.9486; CV is quite high, so keeping them for now.
    g. Clipped the multi-class V columns to remove outliers; no LB gain, keeping for now.
    h. The earlier handling of the D columns was too rough, so D was reprocessed; the mean and std obtained earlier from uid-agg were essentially noise... Keeping only the D uid-agg gives LB: 0.9478 --> 0.9480 (+0.0002), and adding scaling on top gives LB: 0.9480 --> 0.9486 (+0.0006); it is safe to conclude the old D uid-agg features were the main culprit behind the overfitting.
    i. Made D-minus-DT features; could also try dropping uid1 and uid2. Some D columns split into two kinds ('to id cards' and 'to id users'), so consider generating a new uid and cardid; added the min of TransactionAmt to Trans_agg; may keep the scaling part depending on results. Current dimensionality: 502:

    • CV: Mean AUC = 0.9460381025052613, Out of folds AUC = 0.9449767448300119 (a big jump)
    • LB: 0.9487 --> 0.9525 (MAGIC HERE)
    • Winter performance: Fold 1 | AUC: 0.9147271019498214

    j. Analysing the Christmas spike via had_id: is it driven by Black Friday and Cyber Monday?
    Is the February spike in offline transaction amounts the Chinese New Year? Does had_id mark online transactions?
    Reference: miguel perez: Physical vs e-commerce (real dates, clearer)
    (figure: daily transaction count)
    (figure: daily transaction amount)
    Splitting traditional vs. online by had_id: 75.6% of train has a null had_id (i.e., traditional), and 71.9% of test has a null had_id. Both daily_trans_count and daily_trans_amt show that the data changes fit the shopping-festival hypothesis, so consider generating new features from had_id.
    How about this: for the part with non-null had_id (i.e., online), compute the transaction count/uid-agg on online transactions only; for the traditional part, use the corresponding traditional transactions; then merge them into one column as a new feature. Also add Ds_period_normalization.

    • CV: Mean AUC = 0.9465119707733309, Out of folds AUC = 0.9457792719899325 (a slight improvement)
    • LB: 0.9525 --> 0.9526 (Best Single Model)
    • Winter performance improved: Fold 1 | AUC: 0.9179858958040653

    k. Try PCA on the V groups as new features?
    Not much use: Mean AUC = 0.9466500259256587, Out of folds AUC = 0.9458734168996746, a tiny CV gain, but LB stayed at exactly 0.9526.

  6. Trying a different CV strategy (currently GroupKFold); a TimeSeriesSplit sketch follows this list:
    (1) The previous CV strategies all leak future information into validation; tried a single-fold time split, and the score dropped to 0.9389, abandoned.
    (2) Tried sklearn's TimeSeriesSplit directly; the score dropped to 0.9339, abandoned.
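
For reference, the two sketches below illustrate negative downsampling (point 4) and the TimeSeriesSplit CV experiment (point 6). They are minimal illustrations under assumptions (train_df, X, y, the sampling fraction and the number of splits are placeholders), not the exact code used in these experiments.

# A minimal sketch of negative downsampling (point 4 above), useful only for quickly testing new
# features; train_df, the 'isFraud' target and the 0.2 fraction are illustrative assumptions.
import pandas as pd

def negative_downsample(train_df, target='isFraud', frac=0.2, seed=42):
    pos = train_df[train_df[target] == 1]                                        # keep all fraud rows
    neg = train_df[train_df[target] == 0].sample(frac=frac, random_state=seed)   # subsample the rest
    return pd.concat([pos, neg]).sort_index()

# small_train = negative_downsample(train_df)  # fit a quick LightGBM on this to sanity-check a new feature

# A minimal sketch of the sklearn TimeSeriesSplit experiment from point 6; X and y are assumed to be
# sorted by TransactionDT, as in the training code in section 4. This setup was tried and then dropped.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for fold_n, (train_index, valid_index) in enumerate(tscv.split(X)):
    X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
    # ...train LightGBM on the earlier slice and validate on the later one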

4.lightGBM+best_parameters

First, load the already processed .pkl files:

import lightgbm as lgb
import pandas as pd
import numpy as np
import os, sys
import logging
import operator
import gc
from sklearn import metrics

train_trans = pd.read_pickle('train_transaction_14.pkl')
test_trans = pd.read_pickle('test_transaction_14.pkl')
print('train_set shape after merge:',train_trans.shape)
print('test_set shape after merge:',test_trans.shape)
train_set shape after merge: (590540, 765)
test_set shape after merge: (506691, 765)

Next, drop the unneeded, low-value, or unusable intermediate feature columns, and label-encode the remaining object-type features so that LightGBM can handle them.
Some features are dropped because they tend to overfit, act as noise that hurts model performance, or receive very low LightGBM feature importance; for the rest of the dropped features, please refer to: Roman: Recursive feature elimination

#drop cols
not_use = ['dist2', 'C3', 'D7', 'M1', 'id_04', 'id_07', 'id_08', 'id_10', 'id_16', 'id_18', 'id_21', 'id_22', 'id_23', 'id_24', 'id_25', 'id_26', 'id_27', 'id_28', 'id_29', 'id_34', 'id_35']
rm_cols = ['bank_type','uid1','uid2','uid3','uid4','uid5','DT','DT_W','DT_D','DT_hour','DT_day_week','DT_day','DT_D_total','DT_W_total','DT_M_total','id_30','id_31','id_33']
drop_v_vols = ['V1', 'V2', 'V14', 'V15', 'V16', 'V18', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V31', 'V32', 'V39', 'V41', 'V42', 'V43', 'V50', 'V55', 'V57', 'V65', 'V66', 'V67', 'V68', 'V77', 'V79', 'V86', 'V88', 'V89', 'V98', 'V101', 'V102', 'V103', 'V104', 'V105', 'V106', 'V107', 'V108', 'V109', 'V110', 'V111', 'V112', 'V113', 'V114', 'V115', 'V116', 'V117', 'V118', 'V119', 'V120', 'V121', 'V122', 'V123', 'V124', 'V125', 'V129', 'V132', 'V133', 'V134', 'V135', 'V136', 'V137', 'V141', 'V142', 'V144', 'V148', 'V153', 'V155', 'V157', 'V168', 'V174', 'V179', 'V181', 'V183', 'V185', 'V186', 'V190', 'V191', 'V192', 'V193', 'V194', 'V196', 'V198', 'V199', 'V211', 'V218', 'V230', 'V232', 'V235', 'V236', 'V237', 'V240', 'V241', 'V248', 'V250', 'V252', 'V254', 'V255', 'V260', 'V269', 'V281', 'V284', 'V286', 'V290', 'V293', 'V295', 'V296', 'V297', 'V298', 'V299', 'V300', 'V301', 'V302', 'V305', 'V309', 'V311', 'V316', 'V318', 'V319', 'V320', 'V321', 'V325', 'V327', 'V328', 'V330', 'V334', 'V337', 'V339']
drop_cols2 = ['had_id','M_sum','D9','V138','D9_not_na','card1','TransactionDTday','card1_TransactionAmt_decimal_min', 'bank_type_TransactionAmt_decimal_min', 'card2_TransactionAmt_decimal_min', 'card5_TransactionAmt_decimal_min', 'card3_TransactionAmt_decimal_min']
#rfe_not1 = ['card3_fq_enc','bank_type_D1_mean','bank_type_D7_mean','bank_type_D10_mean','bank_type_D11_mean','D6_DT_W_min_max','D7_DT_W_min_max', 'D7_DT_W_std_score', 'D12_DT_W_min_max', 'D13_DT_W_min_max', 'D6_DT_M_min_max', 'D7_DT_M_std_score','D12_DT_M_min_max','D12_DT_M_std_score','D13_DT_M_min_max']

not_use = not_use + rm_cols + drop_cols2 + drop_v_vols 
train_trans = train_trans.drop(not_use,axis=1)
test_trans = test_trans.drop(not_use,axis=1)
from sklearn.preprocessing import LabelEncoder
for col in train_trans.columns:
    if train_trans[col].dtype == 'object':
        print(col)  # log which object columns get label-encoded (output listed below)
        le = LabelEncoder()
        le.fit(list(train_trans[col].astype(str).values) + list(test_trans[col].astype(str).values))
        train_trans[col] = le.transform(list(train_trans[col].astype(str).values))
        test_trans[col] = le.transform(list(test_trans[col].astype(str).values))
P_emaildomain_bin
P_emaildomain_suffix
P_emaildomain_prefix
P_emaildomain_suffix_us
R_emaildomain_bin
R_emaildomain_suffix
R_emaildomain_prefix
R_emaildomain_suffix_us
uid6
cid_1
uid7
DeviceType
DeviceInfo
device_name
device_version
OS_id_30
version_id_30
browser_id_31
version_id_31
screen_width
screen_height

Drop the pure-noise columns 'TransactionDT' and 'TransactionID', as well as the label 'isFraud'. Split the training set with 6-fold GroupKFold, convert the train and test sets into the format LightGBM expects, set the LightGBM parameters, and train; the final prediction is the average of each fold's predictions on the test set.

#fit_lgb
X = train_trans.sort_values('TransactionDT').drop(['isFraud', 'TransactionDT','TransactionID'], axis=1)
y = train_trans.sort_values('TransactionDT')['isFraud']
X_test = test_trans.drop(['TransactionDT', 'isFraud','TransactionID'], axis=1)
print('the shape of train_df is:',X.shape)
print('the shape of test_df is:',X_test.shape)
the shape of train_df is: (590540, 580)
the shape of test_df is: (506691, 580)
#fit_lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold
NFOLDS = 6
folds = GroupKFold(n_splits=NFOLDS)

params = {'num_leaves': 491,
          'min_child_weight': 0.03454472573214212,
          'feature_fraction': 0.3797454081646243,
          'bagging_fraction': 0.4181193142567742,
          'min_data_in_leaf': 106,
          'objective': 'binary',
          'max_depth': -1,
          'learning_rate': 0.006883242363721497,
          "boosting_type": "gbdt",
          "bagging_seed": 11,
          "metric": 'auc',
          "verbosity": -1,
          'reg_alpha': 0.3899927210061127,
          'reg_lambda': 0.6485237330340494,
          'random_state': 47,
          'num_threads':4,
          'n_estimators':1800
         }

columns = X.columns
split_groups = X['DT_M']    
splits = folds.split(X, y,groups=split_groups)
y_preds = np.zeros(X_test.shape[0])
y_oof = np.zeros(X.shape[0])
score = 0
feature_importances = pd.DataFrame()
feature_importances['feature'] = columns

for fold_n, (train_index, valid_index) in enumerate(splits):
    print('Fold:',fold_n)
    X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
    
    dtrain = lgb.Dataset(X_train, label=y_train)
    dvalid = lgb.Dataset(X_valid, label=y_valid)

    clf = lgb.train(params, dtrain, 10000, valid_sets = [dtrain, dvalid], verbose_eval=200, early_stopping_rounds=200)
    
    feature_importances[f'fold_{fold_n + 1}'] = clf.feature_importance()
    
    y_pred_valid = clf.predict(X_valid)
    y_oof[valid_index] = y_pred_valid
    print(f"Fold {fold_n + 1} | AUC: {roc_auc_score(y_valid, y_pred_valid)}")
    
    score += roc_auc_score(y_valid, y_pred_valid) / NFOLDS
    y_preds += clf.predict(X_test) / NFOLDS
    
    del X_train, X_valid, y_train, y_valid
    gc.collect()

print(f"\nMean AUC = {score}")
print(f"Out of folds AUC = {roc_auc_score(y, y_oof)}")
Fold: 0
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200]	training's auc: 0.970841	valid_1's auc: 0.888492
[400]	training's auc: 0.989367	valid_1's auc: 0.902119
[600]	training's auc: 0.99698	valid_1's auc: 0.909347
[800]	training's auc: 0.999369	valid_1's auc: 0.913024
[1000]	training's auc: 0.999881	valid_1's auc: 0.915294
[1200]	training's auc: 0.999982	valid_1's auc: 0.916427
[1400]	training's auc: 0.999998	valid_1's auc: 0.916971
[1600]	training's auc: 1	valid_1's auc: 0.917535
[1800]	training's auc: 1	valid_1's auc: 0.917986
Did not meet early stopping. Best iteration is:
[1800]	training's auc: 1	valid_1's auc: 0.917986
Fold 1 | AUC: 0.9179858958040653
Fold: 1
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200]	training's auc: 0.96973	valid_1's auc: 0.920247
[400]	training's auc: 0.988597	valid_1's auc: 0.934074
[600]	training's auc: 0.996845	valid_1's auc: 0.941598
[800]	training's auc: 0.999326	valid_1's auc: 0.944648
[1000]	training's auc: 0.999883	valid_1's auc: 0.946392
[1200]	training's auc: 0.999983	valid_1's auc: 0.947476
[1400]	training's auc: 0.999998	valid_1's auc: 0.948062
[1600]	training's auc: 1	valid_1's auc: 0.94845
[1800]	training's auc: 1	valid_1's auc: 0.948716
Did not meet early stopping. Best iteration is:
[1798]	training's auc: 1	valid_1's auc: 0.948727
Fold 2 | AUC: 0.9487270375003357
Fold: 2
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200]	training's auc: 0.967541	valid_1's auc: 0.920002
[400]	training's auc: 0.987839	valid_1's auc: 0.935714
[600]	training's auc: 0.996483	valid_1's auc: 0.943642
[800]	training's auc: 0.999224	valid_1's auc: 0.94732
[1000]	training's auc: 0.999857	valid_1's auc: 0.949401
[1200]	training's auc: 0.999978	valid_1's auc: 0.950679
[1400]	training's auc: 0.999998	valid_1's auc: 0.951512
[1600]	training's auc: 1	valid_1's auc: 0.952117
[1800]	training's auc: 1	valid_1's auc: 0.952431
Did not meet early stopping. Best iteration is:
[1798]	training's auc: 1	valid_1's auc: 0.95242
Fold 3 | AUC: 0.9524195779901877
Fold: 3
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200]	training's auc: 0.968861	valid_1's auc: 0.917606
[400]	training's auc: 0.987988	valid_1's auc: 0.931882
[600]	training's auc: 0.996424	valid_1's auc: 0.93957
[800]	training's auc: 0.999185	valid_1's auc: 0.94231
[1000]	training's auc: 0.999843	valid_1's auc: 0.943998
[1200]	training's auc: 0.999974	valid_1's auc: 0.945387
[1400]	training's auc: 0.999996	valid_1's auc: 0.946162
[1600]	training's auc: 1	valid_1's auc: 0.946454
[1800]	training's auc: 1	valid_1's auc: 0.946702
Did not meet early stopping. Best iteration is:
[1800]	training's auc: 1	valid_1's auc: 0.946702
Fold 4 | AUC: 0.9467022988334145
Fold: 4
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200]	training's auc: 0.967885	valid_1's auc: 0.933123
[400]	training's auc: 0.987962	valid_1's auc: 0.945288
[600]	training's auc: 0.996395	valid_1's auc: 0.950617
[800]	training's auc: 0.999154	valid_1's auc: 0.952375
[1000]	training's auc: 0.999834	valid_1's auc: 0.953323
[1200]	training's auc: 0.999972	valid_1's auc: 0.954082
[1400]	training's auc: 0.999997	valid_1's auc: 0.954569
[1600]	training's auc: 1	valid_1's auc: 0.954912
[1800]	training's auc: 1	valid_1's auc: 0.955194
Did not meet early stopping. Best iteration is:
[1800]	training's auc: 1	valid_1's auc: 0.955194
Fold 5 | AUC: 0.9551941380371436
Fold: 5
//anaconda3/lib/python3.7/site-packages/lightgbm/engine.py:113: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Training until validation scores don't improve for 200 rounds.
[200]	training's auc: 0.96762	valid_1's auc: 0.92463
[400]	training's auc: 0.987262	valid_1's auc: 0.942792
[600]	training's auc: 0.996126	valid_1's auc: 0.951216
[800]	training's auc: 0.999056	valid_1's auc: 0.954502
[1000]	training's auc: 0.999806	valid_1's auc: 0.956104
[1200]	training's auc: 0.999966	valid_1's auc: 0.957193
[1400]	training's auc: 0.999995	valid_1's auc: 0.958021
[1600]	training's auc: 0.999999	valid_1's auc: 0.958572
[1800]	training's auc: 1	valid_1's auc: 0.958972
Did not meet early stopping. Best iteration is:
[1800]	training's auc: 1	valid_1's auc: 0.958972
Fold 6 | AUC: 0.9589715765186857

Mean AUC = 0.9465119707733309 
Out of folds AUC = 0.9457792719899325

Next, write the predictions to a submission file:

#prediction
submission = pd.DataFrame({'TransactionID':test_trans['TransactionID'],'isFraud':y_preds})
print('submission_shape is:',submission.shape)
submission.to_csv('submission_16.3.csv',index = False)

Upload speeds from within China keep getting slower; the results can also be submitted to Kaggle from a Linux shell with a command of the following form:

kaggle competitions submit -c ieee-fraud-detection -f submission.csv -m "Message"

Take a quick look at the feature importances, starting with the highest-ranked features and their scores.

feature_importances['average'] = feature_importances[[f'fold_{fold_n + 1}' for fold_n in range(folds.n_splits)]].mean(axis=1)
feature_importances.to_csv('feature_importances.csv')
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
plt.figure(figsize=(16, 16))
sns.barplot(data=feature_importances.sort_values(by='average', ascending=False).head(50), x='average', y='feature');
plt.title('50 TOP feature importance over {} folds average'.format(folds.n_splits));

(figure: top-50 feature importance averaged over the 6 folds)
Then look at the lowest-importance features and decide whether to drop them or process them differently during optimization.
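
A minimal sketch of that check (the cutoff of 30 features is an arbitrary choice for illustration):

# Lowest average importance across the 6 folds; candidates for dropping or reworking
low_importance = feature_importances.sort_values(by='average', ascending=True).head(30)
print(low_importance[['feature', 'average']])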

After these feature-engineering improvements, LightGBM is trained to produce the predictions; by analysing AUC and LB and repeatedly refining the feature processing, the final single-model LB rose to 0.9526.

5. Internal blend

Blend considerations:
(1) submission16 CV: Mean AUC = 0.9460381025052613, Out of folds AUC = 0.9449767448300119; LB: 0.9525
(2) submission16.2 CV: Mean AUC = 0.9464855222093469, Out of folds AUC = 0.9457199040811567; LB: 0.9525 --> winter seems to have improved, around 0.917
(3) submission16.3 CV: Mean AUC = 0.9465119707733309, Out of folds AUC = 0.9457792719899325, a slight improvement; LB: 0.9525 --> 0.9526; winter performance improved: Fold 1 | AUC: 0.9179858958040653

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
sub_1 = pd.read_csv('blends/submission_16.9449.csv')
sub_2 = pd.read_csv('blends/submission_16.94571.csv')
sub_3 = pd.read_csv('blends/submission_16.94577.csv')
sub_4 = pd.read_csv('output/submission_09.9487.csv')
sub_5 = pd.read_csv('output/submission_07.9477.csv')

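# Summing the five predictions rather than averaging leaves the AUC unchanged, since AUC depends only on the ranking of the scores.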
sub_1['isFraud'] = sub_1['isFraud'] + sub_2['isFraud'] + sub_3['isFraud'] + sub_4['isFraud'] + sub_5['isFraud'] 
sub_1.to_csv('submission_blend1.csv', index=False)

6. Final Results

(figure: final Private Leaderboard result)
Although the blended model's LB of 0.9532 is noticeably higher than the single model's 0.9526, the single model actually performed better on the final Private Leaderboard; even though the blend is built from several models, it still overfits to the public LB.
