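In practice, the raw data we receive usually contains a lot of dirty records and has to be preprocessed before it is in the format we want. The most common operations are filtering rows, converting date formats, filling missing values, sorting, and removing duplicates. The example below walks through the basic pandas usage for each of them.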
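Raw data:
What the script does:
- Read the raw data
- Remove rows whose column A (cookie) contains '||' -> filter data
- Remove rows that have three or more null values -> filter data
- Strip the '-' from the dates -> date format conversion
- Fill nulls in column E (num) with 1 -> fill missing values
- Sort by the date in column D (lead_time) in descending order -> sort
- Deduplicate on column B (phone), keeping the first row -> deduplicate
- Save the processed result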
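import pandas as pd

# Read the raw CSV; the file has no header row, so column names are supplied here.
data = pd.read_csv('buydata.csv', sep=',', header=None, names=['cookie', 'phone', 'deal_time', 'lead_time', 'num'])
print('raw data:\n', data)
# Drop rows whose cookie contains '||' (regex=False matches the two pipes literally).
data = data[~data.cookie.str.contains('||', regex=False, na=False)].copy()
# Keep rows with at least 3 non-null values; with five columns this drops rows with 3 or more nulls.
data = data.dropna(axis=0, thresh=3)
# Remove the '-' separators from both date columns.
data.deal_time = data.deal_time.str.replace('-', '', regex=False)
data.lead_time = data.lead_time.str.replace('-', '', regex=False)
# Fill missing values in num with 1.
data = data.fillna({'num': 1})
# Sort by lead_time, newest first, then keep the first row for each phone.
data = data.sort_values(by='lead_time', ascending=False).drop_duplicates(['phone'], keep='first')
print('preprocessing data:\n', data)
# Move the old row labels into an 'index' column and renumber the rows from 0.
data.reset_index(drop=False, inplace=True)
print('reset_index data:\n', data)
# Save the five data columns; the new 0-based index is written under the header 'index'.
data.to_csv('buydata_clear.csv', columns=['cookie', 'phone', 'deal_time', 'lead_time', 'num'], index_label='index')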
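The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-null values a row needs in order to be kept, not the number of nulls that triggers removal. The small sketch below illustrates this on a made-up five-column frame (the column names `a` to `e` and the values are hypothetical, chosen only so that the rows have 0, 2, 3 and 4 nulls).

```python
import numpy as np
import pandas as pd

# Hypothetical five-column frame; the rows have 0, 2, 3 and 4 nulls respectively.
df = pd.DataFrame(
    {
        'a': ['x', 'y', 'z', 'w'],
        'b': [1.0, 2.0, np.nan, np.nan],
        'c': [1.0, np.nan, np.nan, np.nan],
        'd': [1.0, np.nan, np.nan, np.nan],
        'e': [1.0, 5.0, 6.0, np.nan],
    }
)

# Count the nulls in each row for reference.
nulls_per_row = df.isna().sum(axis=1).tolist()

# thresh=3 keeps a row only if it has at least 3 non-null values.
kept = df.dropna(axis=0, thresh=3)

print(nulls_per_row)        # [0, 2, 3, 4]
print(kept.index.tolist())  # [0, 1]
```

With five columns, "at least 3 non-null values" is the same as "at most 2 nulls", which is why the rows with 3 and 4 nulls are the ones removed.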
Output:
raw data:
cookie phone deal_time lead_time num
0 asdfawef||asff 1123545.0 2018-10-10 2018-10-05 1.0
1 ghsdrg 4521665.0 2018-10-11 2018-10-06 2.0
2 dfag||adgh 544862.0 2018-10-12 2018-10-07 46.0
3 dfgtntsrg 5588662.0 2018-10-13 2018-10-08 7.0
4 aedfga 1123545.0 2018-10-14 2018-10-09 NaN
5 asdgh 4521665.0 2018-10-15 2018-10-10 2.0
6 ayjsdr 544862.0 2018-10-16 2018-10-11 7.0
7 kjfghjtd 5588662.0 2018-10-17 2018-10-12 3.0
8 kfghjtewert NaN NaN NaN NaN
9 uwrtywqeru 1123545.0 2018-10-11 2018-10-05 8.0
10 jsdfh||adfhs 4521665.0 2018-10-12 2018-10-06 3.0
11 iryuisfdh 544862.0 2018-10-13 2018-10-07 7.0
12 fhjulfy 5588662.0 2018-10-14 2018-10-08 1.0
preprocessing data:
cookie phone deal_time lead_time num
7 kjfghjtd 5588662.0 20181017 20181012 3.0
6 ayjsdr 544862.0 20181016 20181011 7.0
5 asdgh 4521665.0 20181015 20181010 2.0
4 aedfga 1123545.0 20181014 20181009 1.0
reset_index data:
index cookie phone deal_time lead_time num
0 7 kjfghjtd 5588662.0 20181017 20181012 3.0
1 6 ayjsdr 544862.0 20181016 20181011 7.0
2 5 asdgh 4521665.0 20181015 20181010 2.0
3 4 aedfga 1123545.0 20181014 20181009 1.0
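The dates in the output above are plain strings with the hyphens stripped, so nothing checks that they are valid dates. If that matters, one possible alternative (not what the script does) is to parse the column with `pd.to_datetime` and format it back with `strftime`. The sample values below are illustrative, and the `None` entry is included only to show how a missing date behaves.

```python
import pandas as pd

# Illustrative values: two valid dates and one missing entry.
s = pd.Series(['2018-10-12', '2018-10-05', None])

# Parse strictly as YYYY-MM-DD; a malformed date would raise an error here.
parsed = pd.to_datetime(s, format='%Y-%m-%d')

# Format back to compact YYYYMMDD strings; the missing entry stays missing.
compact = parsed.dt.strftime('%Y%m%d')

print(compact.tolist())
```

Either representation sorts correctly by date: YYYYMMDD strings order lexicographically in the same way the parsed datetimes order chronologically.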
Processed data:
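To look at the saved CSV without needing `buydata.csv` on disk, the whole pipeline can be sketched against a few inline rows and an in-memory buffer. The rows are copied from the raw data printed above; writing the phone and num values as integers in the inline text is an assumption about the file's formatting, and the names `raw`, `cols` and `out` are introduced here for illustration. Unlike the script, this sketch uses `reset_index(drop=True)`, since the old row labels are not written to the file anyway.

```python
import io
import pandas as pd

# A few rows copied from the raw data (no header row, as in the script).
raw = io.StringIO(
    "asdfawef||asff,1123545,2018-10-10,2018-10-05,1\n"
    "aedfga,1123545,2018-10-14,2018-10-09,\n"
    "ayjsdr,544862,2018-10-16,2018-10-11,7\n"
    "kfghjtewert,,,,\n"
    "uwrtywqeru,1123545,2018-10-11,2018-10-05,8\n"
    "iryuisfdh,544862,2018-10-13,2018-10-07,7\n"
)
cols = ['cookie', 'phone', 'deal_time', 'lead_time', 'num']
data = pd.read_csv(raw, header=None, names=cols)

# Same steps as the script: filter, drop sparse rows, strip '-', fill, sort, dedupe.
data = data[~data['cookie'].str.contains('||', regex=False)].copy()
data = data.dropna(axis=0, thresh=3)
for col in ('deal_time', 'lead_time'):
    data[col] = data[col].str.replace('-', '', regex=False)
data = data.fillna({'num': 1})
data = (
    data.sort_values(by='lead_time', ascending=False)
    .drop_duplicates(['phone'], keep='first')
    .reset_index(drop=True)
)

# Write to an in-memory buffer instead of buydata_clear.csv.
out = io.StringIO()
data.to_csv(out, columns=cols, index_label='index')
print(out.getvalue())
```

The first line of the printed text is the header `index,cookie,phone,deal_time,lead_time,num`, followed by one line per surviving phone number.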