2019 Tencent Advertising Algorithm Competition: Tidying the Test Set and Building the Training Set

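Before rebuilding the training samples, we first need to tidy up the samples in the test set, because the training samples must have the same dimensions (attribute columns) as the test samples. First, a look at the format of the raw samples:

Apart from the audience-targeting column, which has to be split by keyword, I left every other attribute unchanged. The changes to the targeting column fall into two cases: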

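Case 1: the record contains keywords of the form (column name: value). Columns that appear are saved with their values; columns that do not appear are set to -1 (the script below currently leaves them as empty strings).

Case 2: the keyword is all. First collect every distinct value of each user_data column (data.drop_duplicates()), then assign those values to the corresponding column.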
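The two cases above can be sketched as one small function. This is a minimal illustration and not the script used later in the post: `split_targeting`, `TARGET_COLUMNS` and the `missing` parameter are hypothetical names, and the input format (`name:v1,v2|name:v3`, or the single word `all`) is an assumption based on the description.

```python
# Hypothetical helper; the column names follow the header list used in the post.
TARGET_COLUMNS = ['age', 'gender', 'area', 'education', 'device',
                  'consuptionAbility', 'status', 'connectionType', 'behavior']


def split_targeting(raw, all_values, missing='-1'):
    """Split one targeting string into one value per column.

    raw        -- e.g. 'age:1,2|gender:2' or 'all' (format assumed)
    all_values -- dict mapping column name to the space-joined distinct
                  values seen in the user data, used when raw is 'all'
    missing    -- placeholder for columns the record does not mention
    """
    raw = raw.strip()
    if raw == 'all':
        return [all_values.get(col, missing) for col in TARGET_COLUMNS]

    parsed = {}
    for part in raw.split('|'):
        name, _, values = part.partition(':')
        # Multi-valued fields use commas; store them space-separated.
        parsed[name] = ' '.join(values.split(','))
    return [parsed.get(col, missing) for col in TARGET_COLUMNS]


all_values = {col: '1 2 3' for col in TARGET_COLUMNS}  # toy data
row = split_targeting('age:1,2|gender:2', all_values)
print(dict(zip(TARGET_COLUMNS, row)))
```

Keeping the column order in a single list means the header row and every data row are built from the same source, so they cannot drift apart.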

The final layout of the test samples is:

 

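Since the test set is given, the goal now is to join the files generated earlier so that the joined result has the same layout as the test samples. The joining steps are:

First: rewrite the audience-targeting column in the ad operation dataset into the same format as the test set.

Second: join the static ad data with the ad operation data using pd.merge().

Third: join the exposure log, which now carries the click counts, with the ad data on the ad ID.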
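Steps two and three can be illustrated on toy frames. This is only a sketch: the three small DataFrames below are invented stand-ins for the real files, and `how='inner'` is spelled out even though it is the default of `pd.merge`, to make it visible that ads missing from either side are dropped.

```python
import pandas as pd

# Toy stand-ins for the three inputs; all values are invented.
operation = pd.DataFrame({'ad_id': [1, 2], 'ad_bid': [100, 80],
                          'Chose_People': ['all', 'age:1,2|gender:2']})
static = pd.DataFrame({'ad_id': [1, 2, 3], 'Ad_Industry_Id': [10, 20, 30]})
exposure = pd.DataFrame({'ad_id': [1, 1, 2], 'num_click': [5, 7, 3]})

# Step 2: static ad data joined with the operation data on ad_id.
ads = pd.merge(static, operation, on='ad_id', how='inner')
# Step 3: exposure rows (with click counts) joined with the ad table.
train = pd.merge(exposure, ads, on='ad_id', how='inner')
print(train)
```

Ad 3 has no operation record, so it disappears after the first merge; whether that is acceptable depends on how many ads the real data loses this way.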

The final layout of the training set is as follows:

 

Code for rebuilding the test set:

# -*- coding: utf-8 -*-
# @Time    : 2019/5/3 8:53
# @Author  : YYLin
# @Email   : [email protected]
# @File    : Redo_Dataload_Sample_Data.py
import pandas as pd
import sys
import operator
from functools import reduce

Test_Sample_Data = []
Test_Sample_Data_columns = ['ad_id', 'ad_bid', 'num_click', 'Ad_material_size', 'Ad_Industry_Id', 'Commodity_type',
                            'Delivery_time', 'age', 'gender', 'area', 'education', 'device', 'consuptionAbility',
                            'status', 'connectionType', 'behavior']

# Add the column names as the first row; num_click is set to -3 for every test row
Test_Sample_Data.append(Test_Sample_Data_columns)
int_num_click = -3

# When a test sample's targeting is "all", fill each column from the distinct values in the raw user data
user_data = pd.read_csv('../Dataset/tencent-dataset-19/dataset-for-user/userFeature_1000.csv')
print("*************user_data************")
user_data.info()  # info() prints its own report and returns None
User_age = user_data['Age'].drop_duplicates(keep='first', inplace=False)
User_age = list(User_age)
User_age = [str(x) for x in User_age]
all_age = ' '.join(User_age)
# print("all_age looks like:\n", all_age, type(all_age))

User_Gender = user_data['Gender'].drop_duplicates(keep='first', inplace=False)
User_Gender = list(User_Gender)
User_Gender = [str(x) for x in User_Gender]
all_Gender = ' '.join(User_Gender)

# The area column is multi-valued, so flatten it into a one-dimensional list and then remove duplicates
User_Area = user_data['Area']
User_Area = list(User_Area)
for i, temp_line in enumerate(User_Area):
    User_Area[i] = temp_line.strip().split(',')
User_Area = reduce(operator.add, User_Area)
# print("First 20 values of the flattened User_Area:", User_Area[0:20], type(User_Area))
User_Area_set = list(set(User_Area))
# print("First 20 values of User_Area after deduplication:", User_Area_set[0:20], len(User_Area_set))
# Join the deduplicated values
User_Area = [str(x) for x in User_Area_set]
all_Area = ' '.join(User_Area)
print("Type of all_Area:\n", type(all_Area), type(all_Area[1]), )
# print("Area values found in the user file:\n", all_Area[0:10], len(all_Area))

User_Education = user_data['Education'].drop_duplicates(keep='first', inplace=False)
User_Education = list(User_Education)
User_Education = [str(x) for x in User_Education]
all_Education = ' '.join(User_Education)
print("Type of all_Education:\n", len(all_Education), type(all_Education), type(all_Education[1]))


User_Consuption_Ability = user_data['Consuption_Ability'].drop_duplicates(keep='first', inplace=False)
User_Consuption_Ability = list(User_Consuption_Ability)
User_Consuption_Ability = [str(x) for x in User_Consuption_Ability]
all_Consuption_Ability = ' '.join(User_Consuption_Ability)

User_Device = user_data['Device'].drop_duplicates(keep='first', inplace=False)
User_Device = list(User_Device)
User_Device = [str(x) for x in User_Device]
all_Device = ' '.join(User_Device)


# Work status may also be multi-valued, so it is handled the same way as area
User_Work_Status = user_data['Work_Status']
User_Work_Status = list(User_Work_Status)
for i, temp_line in enumerate(User_Work_Status):
    if ',' in temp_line:
        # print("temp_line is:", temp_line)
        User_Work_Status[i] = temp_line.strip().split(',')
    else:
        # Wrap a single value in a list; list(temp_line) would split it into characters
        User_Work_Status[i] = [temp_line.strip()]
# print("User_Work_Status after the rewrite:", User_Work_Status[0:10], type(User_Work_Status))
User_Work_Status = reduce(operator.add, User_Work_Status)
User_Work_Status = list(set(User_Work_Status))
User_Work_Status = [str(x) for x in User_Work_Status]
all_Work_Status = ' '.join(User_Work_Status)
# print("Final range of User_Work_Status values:\n", all_Work_Status)

User_Connection_Type = user_data['Connection_Type'].drop_duplicates(keep='first', inplace=False)
User_Connection_Type = list(User_Connection_Type)
User_Connection_Type = [str(x) for x in User_Connection_Type]
all_Connection_Type = ' '.join(User_Connection_Type)

# Collect every distinct Behavior value so it can be assigned to a record whose targeting is "all".
# Behavior has so many distinct values that this step was meant to be skipped for now, but the code below still runs it
User_Behavior = user_data['Behavior']
User_Behavior = list(User_Behavior)
print("Type of the raw user behavior data:\n", type(User_Behavior[0:-1]))
# Split each cell into a list of values
for i, temp_line in enumerate(User_Behavior):
    if ',' in temp_line:
        User_Behavior[i] = temp_line.strip().split(',')
    else:
        print("Behavior entry without a comma:", User_Behavior[i])
        # Wrap a single value in a list; list(temp_line) would split it into characters
        User_Behavior[i] = [temp_line.strip()]
        # del User_Behavior[i]
# User_Behavior.pop(0)
# Flatten to a one-dimensional list, then drop duplicate elements
User_Behavior = reduce(operator.add, User_Behavior)
User_Behavior = list(set(User_Behavior))
Str_User_Behavior = [str(x) for x in User_Behavior]
all_Behavior = ' '.join(Str_User_Behavior)
print("Number of distinct Behavior values in the user data:", len(User_Behavior))
print("All attribute values from the user data have been loaded")


# Rewrite the audience-targeting column of the test set
with open('../Dataset/tencent-dataset-19/test_sample.dat', 'r') as f:
    for i, line in enumerate(f):
        # Uncomment to limit the run while testing
        # if i >= 3:
            # break
            # sys.exit()

        # Columns of the raw data:
        # Sample_id ad_id Creation_time Ad_material_size Ad_Industry_Id Commodity_type Commerce_id Account_id
        # Delivery_time Chose_People ad_bid
        # Columns after the rewrite start with:
        # 'ad_id', 'ad_bid', 'num_click', 'Ad_material_size', 'Ad_Industry_Id', 'Commodity_type', 'Delivery_time',

        # Temporary list for the output row; first copy the attributes that come straight from the raw data
        save_line = []
        line = line.strip().split('\t')
        # print("line:", line, '\n', 'line[9]:', line[9], type(line))

        save_line.append(line[1])
        save_line.append(line[10])
        save_line.append(int_num_click)
        save_line.append(line[3])
        save_line.append(line[4])
        save_line.append(line[5])

        # Delivery_time is multi-valued: turn its commas into spaces
        tmp_line_6 = line[8].strip().split(',')
        line[8] = ' '.join(tmp_line_6)
        save_line.append(line[8])
        # print("Row to be saved so far:\n", save_line)

        # Split the audience-targeting field into its individual attributes
        tmp_line = line[9].strip().split('|')
        userFeature_dict = {}

        for each in tmp_line:
            each_list = each.split(':')
            userFeature_dict[each_list[0]] = ' '.join(each_list[1:])

        # print(result_of_line_9)
        value_age = ''
        value_gender = ''
        value_area = ''
        value_education = ''
        value_device = ''
        value_consuptionAbility = ''
        value_status = ''
        value_connectionType = ''
        value_behavior = ''

        # When the targeting is "all": Behavior is large and was meant to be replaced by -2, but the code below assigns all_Behavior
        value_all = 'all'

        if value_all in userFeature_dict.keys():
            value_age = all_age
            value_gender = all_Gender
            value_area = all_Area
            value_education = all_Education
            value_consuptionAbility = all_Consuption_Ability
            value_device = all_Device
            value_status = all_Work_Status
            value_connectionType = all_Connection_Type
            value_behavior = all_Behavior
        else:
            if 'age' in userFeature_dict.keys():
                value_age = userFeature_dict['age']
                # print(userFeature_dict['age'], type(userFeature_dict['age']))
            if 'gender' in userFeature_dict.keys():
                value_gender = userFeature_dict['gender']
                # print(userFeature_dict['gender'], type(userFeature_dict['gender']))
            if 'area' in userFeature_dict.keys():
                value_area = userFeature_dict['area']
                # print(userFeature_dict['area'], type(userFeature_dict['area']))
            if 'education' in userFeature_dict.keys():
                value_education = userFeature_dict['education']
                # print(userFeature_dict['education'], type(userFeature_dict['education']))
            if 'device' in userFeature_dict.keys():
                value_device = userFeature_dict['device']
                # print(userFeature_dict['device'], type(userFeature_dict['device']))
            if 'consuptionAbility' in userFeature_dict.keys():
                value_consuptionAbility = userFeature_dict['consuptionAbility']
                # print(userFeature_dict['consuptionAbility'], type(userFeature_dict['consuptionAbility']))
            if 'status' in userFeature_dict.keys():
                value_status = userFeature_dict['status']
                # print(userFeature_dict['value_status'], type(userFeature_dict['value_status']))
            if 'connectionType' in userFeature_dict.keys():
                value_connectionType = userFeature_dict['connectionType']
            if 'behavior' in userFeature_dict.keys():
                value_behavior = userFeature_dict['behavior']
        # The targeting columns are written in this order: 'age', 'gender', 'area', 'education',
        # 'device', 'consuptionAbility', 'status', 'connectionType', 'behavior'
        save_line.append(value_age)
        save_line.append(value_gender)
        save_line.append(value_area)
        save_line.append(value_education)
        save_line.append(value_device)
        save_line.append(value_consuptionAbility)
        save_line.append(value_status)
        save_line.append(value_connectionType)
        save_line.append(value_behavior)

        # Append the finished row to the result set
        Test_Sample_Data.append(save_line)
        if i == 2:
            print("******Sample value from the saved rows:\n", Test_Sample_Data[3][10], '\n', len(Test_Sample_Data[3][10]))

# Write the result; the header row is already the first element of Test_Sample_Data
user_feature = pd.DataFrame(Test_Sample_Data)
user_feature.to_csv('../Dataset/dataset_for_train/Test_Sample_Data_all.csv', index=False, header=False)
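The script above flattens the multi-valued columns with `reduce(operator.add, ...)`, which builds a new list at every step and therefore slows down quadratically as the user file grows. A possible alternative, shown here only as a sketch on invented data, is `itertools.chain.from_iterable`; the helper name `distinct_values` is hypothetical. Sorting the result also makes the output order reproducible, which a bare `set` does not guarantee.

```python
from itertools import chain


def distinct_values(cells, sep=','):
    """Return the sorted distinct tokens found in multi-valued cells."""
    tokens = chain.from_iterable(str(cell).strip().split(sep) for cell in cells)
    return sorted(set(tokens))


area_cells = ['101,202', '202', '303,101']   # toy data
all_area = ' '.join(distinct_values(area_cells))
print(all_area)
```

Because every cell goes through `split`, a cell with a single multi-digit value stays one token instead of being broken into characters.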



 

Code for joining the ad datasets (exposure log, static ad data and ad operation data):

# -*- coding: utf-8 -*-
# @Time    : 2019/5/1 15:55
# @Author  : YYLin
# @Email   : [email protected]
# @File    : Generator_Label_For_Train.py
import pandas as pd
import datetime
import numpy as np
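# Generate the click count and temporarily drop the attributes that the test set does not have
Total_Exposure_Log_Data = pd.read_csv('../Dataset/dataset_for_train/Total_Exposure_Log_Data.csv')
print("Layout of the raw dataset:")
Total_Exposure_Log_Data.info()  # info() prints its own report and returns None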
tfa = Total_Exposure_Log_Data.Ad_Request_Time.astype(str).apply(lambda x: datetime.datetime(int(x[:4]),
                                                                          int(x[5:7]),
                                                                          int(x[8:10]),
                                                                          int(x[11:13]),
                                                                          int(x[14:16]),
                                                                          int(x[17:])))

Total_Exposure_Log_Data['tfa_year'] = np.array([x.year for x in tfa])
Total_Exposure_Log_Data['tfa_month'] = np.array([x.month for x in tfa])
Total_Exposure_Log_Data['tfa_day'] = np.array([x.day for x in tfa])
print("Layout after adding separate year, month and day columns:")
Total_Exposure_Log_Data.info()

Group_Exposure_Data = Total_Exposure_Log_Data.groupby(['tfa_year', 'tfa_month', 'tfa_day', 'ad_id', 'Ad_bid']).size().reset_index()
Group_Exposure_Data = Group_Exposure_Data.rename(columns={0: 'num_click'})
print("Data grouped by year, month, day, ad id and ad bid:\n", Group_Exposure_Data.head(5))

# Keep one exposure row per (year, month, day, ad id, ad bid) and attach the count to it
Total_Exposure_Log_Data_one = Total_Exposure_Log_Data.drop_duplicates(subset=['tfa_year', 'tfa_month', 'tfa_day', 'ad_id', 'Ad_bid'], keep="first").reset_index(drop=True)
Clicks_of_Exposure_Data = pd.merge(Total_Exposure_Log_Data_one, Group_Exposure_Data, on=['tfa_year', 'tfa_month', 'tfa_day', 'ad_id', 'Ad_bid'])

# Drop the attributes that the test set does not have, then save the result
Clicks_of_Exposure_Data.drop('Ad_Request_id', axis=1, inplace=True)
Clicks_of_Exposure_Data.drop('Ad_Request_Time', axis=1, inplace=True)
Clicks_of_Exposure_Data.drop('user_id', axis=1, inplace=True)
Clicks_of_Exposure_Data.drop('Ad_material_size', axis=1, inplace=True)
Clicks_of_Exposure_Data.drop('Ad_pctr', axis=1, inplace=True)
Clicks_of_Exposure_Data.drop('Ad_quality_ecpm', axis=1, inplace=True)
Clicks_of_Exposure_Data.drop('Ad_total_Ecpm', axis=1, inplace=True)
Clicks_of_Exposure_Data.drop('tfa_year', axis=1, inplace=True)
Clicks_of_Exposure_Data.drop('tfa_month', axis=1, inplace=True)
Clicks_of_Exposure_Data.drop('tfa_day', axis=1, inplace=True)
Clicks_of_Exposure_Data.drop('Ad_pos_id', axis=1, inplace=True)
Clicks_of_Exposure_Data.drop('Ad_bid', axis=1, inplace=True)

print("Layout of the data kept from the exposure log:")
Clicks_of_Exposure_Data.info()
Clicks_of_Exposure_Data.to_csv('../Dataset/dataset_for_train/Clicks_of_Exposure_Data.csv', index=False)

# Join the exposure log with the static ad data on ad_id
Ad_Static_Data = pd.read_csv('../Dataset/dataset_for_train/Ad_Static_Feature_Data.csv')
Ad_Static_Data.drop('Commodity_id', axis=1, inplace=True)
Ad_Static_Data.drop('Ad_account_id', axis=1, inplace=True)
Ad_Static_Data.drop('Creation_time', axis=1, inplace=True)
print("*********Layout of the static ad data:")
Ad_Static_Data.info()
Merce_Ad_Static_and_Exposure_Data = pd.merge(Clicks_of_Exposure_Data, Ad_Static_Data, on=['ad_id'])

# Read the ad operation data and join it as well
Op_Ad_Data = pd.read_csv('../Dataset/dataset_for_train/Ad_Operation_Data.csv').drop_duplicates(['ad_id'])
Op_Ad_Data.drop('Create_modify_time', axis=1, inplace=True)

Dataset_For_Train = pd.merge(Op_Ad_Data, Merce_Ad_Static_and_Exposure_Data, on=['ad_id'])
print("Layout of the final dataset:")
Dataset_For_Train.info()
Dataset_For_Train.to_csv('../Dataset/dataset_for_train/Dataset_For_Train.csv', index=False)
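The per-day count in the script above is built by slicing the timestamp string and then grouping. The same idea can be sketched more compactly with `pd.to_datetime`; the toy log below is invented, and the `YYYY-MM-DD HH:MM:SS` timestamp format is an assumption inferred from the slice positions used in the script.

```python
import pandas as pd

log = pd.DataFrame({
    'Ad_Request_Time': ['2019-03-01 08:00:00', '2019-03-01 09:30:00',
                        '2019-03-02 10:00:00', '2019-03-01 11:00:00'],
    'ad_id': [1, 1, 1, 2],
    'Ad_bid': [100, 100, 100, 80],
})

# Parse once, then keep only the calendar day.
log['day'] = pd.to_datetime(log['Ad_Request_Time']).dt.date

# One row per (day, ad, bid) with the number of log rows in that group.
counts = (log.groupby(['day', 'ad_id', 'Ad_bid'])
             .size()
             .reset_index(name='num_click'))
print(counts)
```

Note that `size()` counts rows of the exposure log, so `num_click` here is really the number of exposures per day, ad and bid, exactly as in the script above.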

 

Code for generating the training samples:

# -*- coding: utf-8 -*-
# @Time    : 2019/5/3 14:33
# @Author  : YYLin
# @Email   : [email protected]
# @File    : Redo_Dataload_For_Train.py
# The test samples have this layout:
# ad_id ad_bid num_click Ad_material_size Ad_Industry_Id Commodity_type Delivery_time
# age gender area education device consuptionAbility status connectionType behavior
# First use pandas to split the dataset into three parts: the plain attributes, the audience-targeting column, and the delivery time
import pandas as pd
from functools import reduce
import operator
import sys

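# The file name matches the one written by Generator_Label_For_Train.py (paths are case-sensitive on Linux)
data_for_train = pd.read_csv('../Dataset/dataset_for_train/Dataset_For_Train.csv')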

data_for_chose_people = data_for_train['Chose_People']
data_for_chose_time = data_for_train['Delivery_time']

data_for_train.drop('Chose_People', axis=1, inplace=True)
data_for_train.drop('Delivery_time', axis=1, inplace=True)
data_for_train = data_for_train[
    ['ad_id', 'ad_bid', 'num_click', 'Ad_material_size', 'Ad_Industry_Id', 'Commodity_type']]

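# header=False: the loops below read these files line by line and add their own header row
data_for_chose_people.to_csv('../Dataset/data/Chose_people.csv', index=False, header=False)
data_for_chose_time.to_csv('../Dataset/data/Chose_time.csv', index=False, header=False)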
data_for_train.to_csv('../Dataset/data/data_for_train_test.csv', index=False)

time_line = []
count_line_of_time = 1
time_line.append('Delivery_time')

with open('../Dataset/data/Chose_time.csv', 'r') as f:
    for i, line in enumerate(f):
        # Uncomment to limit the run while testing
        # if i >= 2:
            # break
        count_line_of_time = count_line_of_time + 1
        tmp_line = line.strip().split(',')
        line = ' '.join(tmp_line)
        time_line.append(line)
chose_time = pd.DataFrame(time_line)
chose_time.to_csv('../Dataset/data/chose_time.csv', index=False, header=False)

# Delete variables that are no longer needed to save memory
del chose_time, time_line

people_line = []
count_line_of_people = 1
people_line_columns = ['age', 'gender', 'area', 'education', 'device', 'consuptionAbility', 'status', 'connectionType',
                       'behavior']
people_line.append(people_line_columns)

# Extract the attribute values from user_data
# When a sample's targeting is "all", fill each column from the distinct values in the raw user data
user_data = pd.read_csv('../Dataset/tencent-dataset-19/dataset-for-user/userFeature_1000.csv')
print("*************user_data************")
user_data.info()  # info() prints its own report and returns None
User_age = user_data['Age'].drop_duplicates(keep='first', inplace=False)
User_age = list(User_age)
User_age = [str(x) for x in User_age]
all_age = ' '.join(User_age)
# print("all_age looks like:\n", all_age, type(all_age))

User_Gender = user_data['Gender'].drop_duplicates(keep='first', inplace=False)
User_Gender = list(User_Gender)
User_Gender = [str(x) for x in User_Gender]
all_Gender = ' '.join(User_Gender)

# The area column is multi-valued, so flatten it into a one-dimensional list and then remove duplicates
User_Area = user_data['Area']
User_Area = list(User_Area)
for i, temp_line in enumerate(User_Area):
    User_Area[i] = temp_line.strip().split(',')
User_Area = reduce(operator.add, User_Area)
# print("First 20 values of the flattened User_Area:", User_Area[0:20], type(User_Area))
User_Area_set = list(set(User_Area))
# print("First 20 values of User_Area after deduplication:", User_Area_set[0:20], len(User_Area_set))
# Join the deduplicated values
User_Area = [str(x) for x in User_Area_set]
all_Area = ' '.join(User_Area)
print("Type of all_Area:\n", type(all_Area), type(all_Area[1]), )
# print("Area values found in the user file:\n", all_Area[0:10], len(all_Area))

User_Education = user_data['Education'].drop_duplicates(keep='first', inplace=False)
User_Education = list(User_Education)
User_Education = [str(x) for x in User_Education]
all_Education = ' '.join(User_Education)
print("Type of all_Education:\n", len(all_Education), type(all_Education), type(all_Education[1]))


User_Consuption_Ability = user_data['Consuption_Ability'].drop_duplicates(keep='first', inplace=False)
User_Consuption_Ability = list(User_Consuption_Ability)
User_Consuption_Ability = [str(x) for x in User_Consuption_Ability]
all_Consuption_Ability = ' '.join(User_Consuption_Ability)

User_Device = user_data['Device'].drop_duplicates(keep='first', inplace=False)
User_Device = list(User_Device)
User_Device = [str(x) for x in User_Device]
all_Device = ' '.join(User_Device)


# Work status may also be multi-valued, so it is handled the same way as area
User_Work_Status = user_data['Work_Status']
User_Work_Status = list(User_Work_Status)
for i, temp_line in enumerate(User_Work_Status):
    if ',' in temp_line:
        # print("temp_line is:", temp_line)
        User_Work_Status[i] = temp_line.strip().split(',')
    else:
        # Wrap a single value in a list; list(temp_line) would split it into characters
        User_Work_Status[i] = [temp_line.strip()]
# print("User_Work_Status after the rewrite:", User_Work_Status[0:10], type(User_Work_Status))
User_Work_Status = reduce(operator.add, User_Work_Status)
User_Work_Status = list(set(User_Work_Status))
User_Work_Status = [str(x) for x in User_Work_Status]
all_Work_Status = ' '.join(User_Work_Status)
# print("Final range of User_Work_Status values:\n", all_Work_Status)

User_Connection_Type = user_data['Connection_Type'].drop_duplicates(keep='first', inplace=False)
User_Connection_Type = list(User_Connection_Type)
User_Connection_Type = [str(x) for x in User_Connection_Type]
all_Connection_Type = ' '.join(User_Connection_Type)

# Collect every distinct Behavior value so it can be assigned to a record whose targeting is "all".
# Behavior has so many distinct values that this step was meant to be skipped for now, but the code below still runs it
User_Behavior = user_data['Behavior']
User_Behavior = list(User_Behavior)
print("Type of the raw user behavior data:\n", type(User_Behavior[0:-1]))
# Split each cell into a list of values
for i, temp_line in enumerate(User_Behavior):
    if ',' in temp_line:
        User_Behavior[i] = temp_line.strip().split(',')
    else:
        # print("Behavior entry without a comma:", User_Behavior[i])
        # Wrap a single value in a list; list(temp_line) would split it into characters
        User_Behavior[i] = [temp_line.strip()]
        # del User_Behavior[i]

# User_Behavior.pop(0)
# Flatten to a one-dimensional list, then drop duplicate elements
User_Behavior = reduce(operator.add, User_Behavior)
User_Behavior = list(set(User_Behavior))
Str_User_Behavior = [str(x) for x in User_Behavior]
all_Behavior = ' '.join(Str_User_Behavior)
print("Number of distinct Behavior values in the user data:", len(User_Behavior))

with open('../Dataset/data/Chose_people.csv', 'r') as f:
    for i, line in enumerate(f):
        # Uncomment to limit the run while testing
        # if i >= 2:
            # break
        count_line_of_people = count_line_of_people + 1

        if i % 10000 == 0:
            print("Processed %d records so far" % i)
        # Process the audience-targeting column; the output columns are:
        #  'age', 'gender', 'area', 'education', 'device',
        #  'consuptionAbility', 'status', 'connectionType', 'behavior'

        # Split the audience-targeting field into its individual attributes
        tmp_line = line.strip().split('|')
        userFeature_dict = {}

        for each in tmp_line:
            each_list = each.split(':')
            userFeature_dict[each_list[0]] = ' '.join(each_list[1:])

        # print(result_of_line_9)
        value_age = ''
        value_gender = ''
        value_area = ''
        value_education = ''
        value_device = ''
        value_consuptionAbility = ''
        value_status = ''
        value_connectionType = ''
        value_behavior = ''

        # The "all" targeting needs special handling
        value_all = 'all'
        save_line = []

        if value_all in userFeature_dict.keys():
            value_age = all_age
            value_gender = all_Gender
            value_area = all_Area
            value_education = all_Education
            value_consuptionAbility = all_Consuption_Ability
            value_device = all_Device
            value_status = all_Work_Status
            value_connectionType = all_Connection_Type
            value_behavior = all_Behavior
        else:
            if 'age' in userFeature_dict.keys():
                value_age = userFeature_dict['age']
                # print(userFeature_dict['age'], type(userFeature_dict['age']))
            if 'gender' in userFeature_dict.keys():
                value_gender = userFeature_dict['gender']
                # print(userFeature_dict['gender'], type(userFeature_dict['gender']))
            if 'area' in userFeature_dict.keys():
                value_area = userFeature_dict['area']
                # print(userFeature_dict['area'], type(userFeature_dict['area']))
            if 'education' in userFeature_dict.keys():
                value_education = userFeature_dict['education']
                # print(userFeature_dict['education'], type(userFeature_dict['education']))
            if 'device' in userFeature_dict.keys():
                value_device = userFeature_dict['device']
                # print(userFeature_dict['device'], type(userFeature_dict['device']))
            if 'consuptionAbility' in userFeature_dict.keys():
                value_consuptionAbility = userFeature_dict['consuptionAbility']
                # print(userFeature_dict['consuptionAbility'], type(userFeature_dict['consuptionAbility']))
            if 'status' in userFeature_dict.keys():
                value_status = userFeature_dict['status']
                # print(userFeature_dict['value_status'], type(userFeature_dict['value_status']))
            if 'connectionType' in userFeature_dict.keys():
                value_connectionType = userFeature_dict['connectionType']
            if 'behavior' in userFeature_dict.keys():
                value_behavior = userFeature_dict['behavior']
        # The targeting columns are written in this order: 'age', 'gender', 'area', 'education',
        # 'device', 'consuptionAbility', 'status', 'connectionType', 'behavior'
        save_line.append(value_age)
        save_line.append(value_gender)
        save_line.append(value_area)
        save_line.append(value_education)
        save_line.append(value_device)
        save_line.append(value_consuptionAbility)
        save_line.append(value_status)
        save_line.append(value_connectionType)
        save_line.append(value_behavior)
        people_line.append(save_line)

if count_line_of_people != count_line_of_time:
    print("The targeting file and the delivery-time file have different row counts; exiting")
    sys.exit()

print("***********Processing finished, saving the data***************")
chose_people = pd.DataFrame(people_line)
chose_people.to_csv('../Dataset/data/chose_people.csv', index=False, header=False)
print("Targeting data saved; concatenating the three parts")
# Finally, concatenate the three saved files column-wise
Test_Sample_Data_time = pd.read_csv('../Dataset/data/chose_time.csv')
Test_Sample_Data_people = pd.read_csv('../Dataset/data/chose_people.csv')
Test_Sample_Data_train = pd.read_csv('../Dataset/data/data_for_train_test.csv')

result = pd.concat([Test_Sample_Data_train, Test_Sample_Data_time, Test_Sample_Data_people], axis=1)
result.to_csv('../Dataset/dataset_for_train/result_for_train_all.csv', index=False)
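The final `pd.concat(..., axis=1)` pairs rows by index, so it only produces correct samples when the three parts have the same number of rows in the same order. A small sketch of that check on invented frames, with hypothetical variable names:

```python
import pandas as pd

base = pd.DataFrame({'ad_id': [1, 2], 'ad_bid': [100, 80]})
time_part = pd.DataFrame({'Delivery_time': ['1 2 3', '4 5 6']})
people_part = pd.DataFrame({'age': ['1 2', ''], 'gender': ['2', '1']})

parts = [base, time_part, people_part]
# concat(axis=1) aligns on the index, so equal lengths and a fresh
# 0..n-1 index on every part keep the rows paired up correctly.
assert len({len(p) for p in parts}) == 1, "row counts differ"
result = pd.concat([p.reset_index(drop=True) for p in parts], axis=1)
print(result.shape)
```

If the lengths differ, pandas does not raise an error but pads the shorter parts with NaN, which is why an explicit check before concatenating is worth the extra line.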


