This topic can be summed up in three words: split ---> apply ---> combine.
split: the data in a pandas object is split into groups along a particular axis, based on one or more keys you provide.
apply: a function is applied to each group, producing a new value.
combine: the results of all the apply steps are combined into the final result object.
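To make the three steps concrete, here is a minimal sketch (data values are illustrative) that performs split-apply-combine by hand and then checks it against the groupby one-liner:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 2.0, 4.0]})

# split: break the data into per-key groups
pieces = {k: g for k, g in df.groupby('key')}

# apply: compute one new value per group
means = {k: g['val'].mean() for k, g in pieces.items()}

# combine: assemble the per-group results into one object
result = pd.Series(means)

# groupby does all three steps at once
one_liner = df.groupby('key')['val'].mean()
```

The dict-of-pieces version is only for illustration; in practice groupby keeps the split lazy and fuses the three steps.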
The grouping key can take several forms, and the types need not all match:
- a list or array whose length matches the axis being grouped;
- a value naming one of the DataFrame's columns;
- a dict or Series giving a mapping from the values on the grouped axis to the group names;
- a function, to be invoked on the index or on each of its labels.
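The four key forms above can all be sketched on one small DataFrame (the data and group names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'k': ['a', 'a', 'b'], 'v': [1, 2, 3]},
                  index=['x', 'y', 'z'])

# 1. a list/array whose length matches the grouped axis
by_array = df['v'].groupby(['g1', 'g1', 'g2']).sum()

# 2. a column name of the DataFrame
by_column = df.groupby('k')['v'].sum()

# 3. a dict mapping axis labels to group names
by_dict = df['v'].groupby({'x': 'g1', 'y': 'g1', 'z': 'g2'}).sum()

# 4. a function called once per index label
by_func = df['v'].groupby(len).sum()  # every label here has length 1
```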
Grouping data
Let's look at a few examples:
In [81]: df =pd.DataFrame({'key1':list('aabba'),'key2':['one',
...: 'two','one','two','one'], 'data1':np.random.randn(5),
...: 'data2':np.random.randn(5)})
In [82]: grouped = df['data1'].groupby(df['key1'])  # equivalent to df.groupby('key1')['data1']; returns a SeriesGroupBy object
In [83]: grouped.mean()
Out[83]:
key1
a -0.130183
b 0.558940
Name: data1, dtype: float64
In [90]: grouped = df[['data1']].groupby(df['key1'])  # this instead returns a grouped DataFrame (DataFrameGroupBy) object
In [91]: grouped.mean()
Out[91]:
data1
key1
a -0.130183
b 0.558940
In [86]: grouped = df['data1'].groupby([df['key1'],df['key2']]  # equivalent to df.groupby(['key1','key2'])['data1'].mean()
...: ).mean()
In [87]: grouped
Out[87]:
key1 key2
a one -0.134323
two -0.121903
b one -0.222980
two 1.340860
Name: data1, dtype: float64
The grouping key can also be any array of the appropriate length:
In [94]: states = np.array(['Ohio','California','California','Ohio','Ohio'])
In [95]: years = np.array([2005,2005,2006,2005,2006])
In [96]: df['data1'].groupby([states,years]).mean()
Out[96]:
California  2005   -0.121903
            2006   -0.222980
Ohio        2005    0.522358
            2006    0.027499
Name: data1, dtype: float64
Grouping with dicts and Series
In [97]: people = pd.DataFrame(np.random.randn(5,5),
    ...:                       columns=['a','b','c','d','e'],
    ...:                       index=['Joe','Steve','Wes','Jim','Travis'])
In [99]: people.iloc[2:3, [1, 2]] = np.nan  # note: loc[2:3] would fail on a string index
In [100]: people
Out[100]:
               a         b  ...         d         e
Joe    -0.947326  0.239880  ...  0.872371 -0.735662
Steve  -0.347697 -0.770544  ... -0.065098  1.408378
Wes     0.090692       NaN  ...  0.514894  1.544698
Jim    -0.023477  0.392851  ... -0.730022  2.376998
Travis  1.760169 -0.671978  ... -1.363583  0.284736
[5 rows x 5 columns]
Suppose we know which group each column belongs to and want to sum the columns by group.
In [102]: mappings = {'a':'red','b':'red','c':'blue','d':'blue',
     ...:             'e':'red','f':'orange'}  # unused keys such as 'f' are fine
In [103]: by_column = people.groupby(mappings, axis=1)
In [104]: by_column
Out[104]: <pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000002225F554748>
In [105]: by_column.sum()
Out[105]:
            blue       red
Joe     1.365578 -1.443109
Steve  -0.234023  0.290137
Wes     0.514894  1.635390
Jim    -2.368476  2.746372
Travis -3.138280  1.372926
A Series works the same way; it can be thought of as a fixed-size mapping.
In [106]: map_series = pd.Series(mappings)
In [109]: people.groupby(map_series, axis=1).count()
Out[109]:
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
Grouping with functions
Any function passed as a grouping key is called once per index value, and its return values are used as the group names. Suppose we want to group by the length of each person's name: we could compute an array of string lengths, but it is simpler to just pass the len function:
In [112]: people.groupby(len).sum()
Out[112]:
          a         b         c         d         e
3 -0.880111  0.632731 -1.145247  0.657244  3.186034
5 -0.347697 -0.770544 -0.168926 -0.065098  1.408378
6  1.760169 -0.671978 -1.774697 -1.363583  0.284736
Functions can be mixed freely with arrays, dicts, lists, and Series:
In [113]: key_list = ['one','one','one','two','two']
In [115]: people.groupby([len, key_list]).sum()
Out[115]:
               a         b  ...         d         e
3 one  -0.856634  0.239880  ...  1.387265  0.809036
  two  -0.023477  0.392851  ... -0.730022  2.376998
5 one  -0.347697 -0.770544  ... -0.065098  1.408378
6 two   1.760169 -0.671978  ... -1.363583  0.284736
[4 rows x 5 columns]
Grouping by index level
In [117]: columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],
     ...:                                      [1,3,5,1,3]], names=['city','tenor'])
In [118]: hier_df = pd.DataFrame(np.random.randn(4,5), columns=columns)
In [119]: hier_df
Out[119]:
city         US            ...        JP
tenor         1         3  ...         1         3
0     -1.394733  0.729459  ... -0.596068 -0.603338
1     -1.781984 -1.218506  ...  1.466333 -1.819780
2      0.563109 -0.254265  ... -0.084187 -0.231257
3     -0.988175  1.948536  ...  0.661451  1.788520
In [120]: hier_df.groupby(level='city', axis=1).count()
Out[120]:
city  JP  US
0      2   3
1      2   3
2      2   3
3      2   3
Data aggregation
Aggregation is any data transformation that produces a scalar value from an array. You can use groupby's built-in methods or your own functions, applied one at a time or several at once.
Using groupby's built-in methods:
In [81]: df = pd.DataFrame({'key1':list('aabba'),'key2':['one',
    ...: 'two','one','two','one'], 'data1':np.random.randn(5),
    ...: 'data2':np.random.randn(5)})
In [82]: grouped = df['data1'].groupby(df['key1'])  # equivalent to df.groupby('key1')['data1']
In [83]: grouped.mean()
Out[83]:
key1
a   -0.130183
b    0.558940
Name: data1, dtype: float64
Applying a custom function:
In [121]: df
Out[121]:
      data1     data2 key1 key2
0 -0.296144  0.110057    a  one
1 -0.121903  0.139525    a  two
2 -0.222980  0.937050    b  one
3  1.340860  0.395284    b  two
4  0.027499 -0.668304    a  one
In [122]: gd = df.groupby('key1')
In [123]: def peak_to_peak(arr):
     ...:     return arr.max() - arr.min()
In [124]: gd.agg(peak_to_peak)
Out[124]:
         data1     data2
key1
a     0.323643  0.807829
b     1.563840  0.541766
Applying several functions at once:
The data used below can be downloaded from the address below; cleaning it up yourself is good practice:
https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv
In [174]: tips = pd.read_csv('D:\workspaces\datatest\in.csv')  # read the file; adjust the path for your machine
In [175]: tips['tip_pct'] = tips['tip']/tips['total_bill']  # compute tip_pct
In [179]: tips.to_csv('D:\workspaces\datatest\out.csv', index=False)  # write the file
In [130]: tips = pd.read_table('D:\workspaces\datatest\out.csv', sep='\s+')
In [131]: tips
Out[131]:
   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780
In [132]: grouped = tips.groupby(['sex','smoker'])
In [133]: g_pct = grouped['tip_pct']
In [134]: g_pct.agg('mean')
Out[134]:
sex     smoker
Female  No        0.059447
Male    No        0.155636
Name: tip_pct, dtype: float64
In [135]: g_pct.agg(['mean','std'])
Out[135]:
                   mean       std
sex    smoker
Female No      0.059447       NaN
Male   No      0.155636  0.014061
In [136]: g_pct.agg([('m','mean'),('s','std')])
Out[136]:
                      m         s
sex    smoker
Female No      0.059447       NaN
Male   No      0.155636  0.014061
In [137]: grouped['tip_pct','total_bill'].agg(['count','mean','max'])
Out[137]:
              tip_pct                     total_bill
                count      mean       max      count       mean    max
sex    smoker
Female No           1  0.059447  0.059447          1  16.990000  16.99
Male   No           3  0.155636  0.166587          3  18.343333  23.68
Returning aggregated data in "no-index" form
In [140]: grouped = tips.groupby(['sex','smoker'], as_index=False).mean()
In [141]: grouped
Out[141]:
      sex smoker  ...      size   tip_pct
0  Female     No  ...  2.000000  0.059447
1    Male     No  ...  2.666667  0.155636
[2 rows x 6 columns]
Aggregation as shown above is only a special case of data transformation; next we introduce the transform and apply methods.
The following adds a column holding each group's mean to the dataset, by aggregating first and then merging:
In [33]: df
Out[33]:
      data1     data2 key1 key2
0 -1.828647  0.329238    a  one
1 -0.639952  1.532362    a  two
2  0.617105  0.906281    b  one
3  1.443470  1.419738    b  two
4 -2.031339 -0.649743    a  one
In [34]: k1_means = df.groupby('key1').mean().add_prefix('mean_')
In [35]: k1_means
Out[35]:
      mean_data1  mean_data2
key1
a      -1.499979    0.403952
b       1.030287    1.163010
In [36]: pd.merge(df, k1_means, left_on='key1', right_index=True)
Out[36]:
      data1     data2 key1 key2  mean_data1  mean_data2
0 -1.828647  0.329238    a  one   -1.499979    0.403952
1 -0.639952  1.532362    a  two   -1.499979    0.403952
4 -2.031339 -0.649743    a  one   -1.499979    0.403952
2  0.617105  0.906281    b  one    1.030287    1.163010
3  1.443470  1.419738    b  two    1.030287    1.163010
This gets the job done, but it is clumsy. Let's look at another way:
In [37]: key = ['one','two','one','two','one']
In [41]: people = pd.DataFrame(np.random.randn(5,5), columns=['a','b','c','d','e'],
    ...:                       index=['Joe','Steve','Wes','Jim','Travis'])
In [42]: people
Out[42]:
               a         b         c         d         e
Joe    -1.264617 -1.385894  0.146627 -1.225148  0.627616
Steve   0.880528  0.530060  0.453235  1.160768 -0.053416
Wes     1.033023 -0.859791 -0.629231 -1.094454 -2.073512
Jim     1.777919 -0.864824 -1.940994 -0.806969  0.504503
Travis -1.260144 -0.486910  1.180371 -0.214743  0.629261
In [43]: people.groupby(key).mean()
Out[43]:
            a         b         c         d         e
one -0.497246 -0.910865  0.232589 -0.844782 -0.272212
two  1.329224 -0.167382 -0.743879  0.176899  0.225544
In [44]: people.groupby(key).transform(np.mean)
Out[44]:
               a         b         c         d         e
Joe    -0.497246 -0.910865  0.232589 -0.844782 -0.272212
Steve   1.329224 -0.167382 -0.743879  0.176899  0.225544
Wes    -0.497246 -0.910865  0.232589 -0.844782 -0.272212
Jim     1.329224 -0.167382 -0.743879  0.176899  0.225544
Travis -0.497246 -0.910865  0.232589 -0.844782 -0.272212
In [45]: def demean(arr):
    ...:     return arr - arr.mean()
In [46]: demeaned = people.groupby(key).transform(demean)
In [47]: demeaned
Out[47]:
               a         b         c         d         e
Joe    -0.767371 -0.475029 -0.085962 -0.380366  0.899828
Steve  -0.448695  0.697442  1.197114  0.983868 -0.278960
Wes     1.530269  0.051074 -0.861820 -0.249672 -1.801300
Jim     0.448695 -0.697442 -1.197114 -0.983868  0.278960
Travis -0.762898  0.423955  0.947782  0.630038  0.901473
That's simpler than the first approach, isn't it?
apply: a general split-apply-combine
transform is a special-purpose function with strict requirements: the function you pass may produce only one of two results, either a scalar value that can be broadcast, or a result array of the same size as its input. apply is more general: it splits the object being processed into pieces, invokes the passed function on each piece, and then attempts to glue the pieces back together.
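The contrast can be sketched directly (illustrative data; nlargest is just one example of a shape-changing function): transform always hands back a result with the original shape, while apply may return a result of any shape.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b'],
                   'val': [1.0, 2.0, 3.0, 10.0, 20.0]})

g = df.groupby('key')['val']

# transform: the per-group mean is broadcast back to all 5 original rows
same_shape = g.transform('mean')

# apply: each group may shrink; here each yields its 2 largest values
top2 = g.apply(lambda s: s.nlargest(2))
```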
Suppose we want to select the rows with the 5 highest tip_pct values from the dataset below.
   total_bill   tip     sex smoker  day    time  size   tip_pct   # a few values were modified
        16.99  1.01  Female     No  Sun  Dinner     2  0.059447
        10.34  1.66    Male    YES  Sun  Dinner     3  0.160542
        21.01  3.50  Female    YES  Sun  Dinner     3  0.166587
        23.68  3.31    Male     No  Sun  Dinner     2  0.139780
Define the function:
In [143]: def top(df, n=5, column='tip_pct'):
     ...:     return df.sort_values(by=column)[-n:]
In [150]: top(tips, n=3)
Out[150]:
   total_bill   tip     sex  ...    time  size   tip_pct
3       23.68  3.31    Male  ...  Dinner     2  0.139780
1       10.34  1.66    Male  ...  Dinner     3  0.160542
2       21.01  3.50  Female  ...  Dinner     3  0.166587
[3 rows x 8 columns]
In [151]: tips.groupby('smoker').apply(top)
Out[151]:
          total_bill   tip  ...  size   tip_pct
smoker                      ...
No     0       16.99  1.01  ...     2  0.059447
       3       23.68  3.31  ...     2  0.139780
YES    1       10.34  1.66  ...     3  0.160542
       2       21.01  3.50  ...     3  0.166587
[4 rows x 8 columns]
The top function is called on each piece of the DataFrame; the results are then glued together with pandas.concat and labeled with the group names.
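What apply does under the hood can be spelled out with pandas.concat; a sketch on a tiny stand-in for the tips data (the values are illustrative, not the real dataset):

```python
import pandas as pd

tips = pd.DataFrame({'smoker': ['No', 'Yes', 'No', 'Yes'],
                     'tip_pct': [0.059, 0.161, 0.140, 0.167]})

def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

# manual split-apply-combine: call top on each piece, concat with group labels
pieces = {k: top(g, n=1) for k, g in tips.groupby('smoker')}
manual = pd.concat(pieces, names=['smoker'])

# groupby.apply produces the same rows in one step
via_apply = tips.groupby('smoker').apply(top, n=1)
```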
Passing extra arguments to the function:
In [155]: tips.groupby(['smoker','day']).apply(top,n=1,column = 'total_bill')
Suppressing the group keys:
In [152]: tips.groupby('smoker', group_keys=False).apply(top)
Out[152]:
   total_bill   tip     sex  ...    time  size   tip_pct
0       16.99  1.01  Female  ...  Dinner     2  0.059447
3       23.68  3.31    Male  ...  Dinner     2  0.139780
1       10.34  1.66    Male  ...  Dinner     3  0.160542
2       21.01  3.50  Female  ...  Dinner     3  0.166587
[4 rows x 8 columns]
Quantile and bucket analysis
The way I understand it: use cut() or qcut() to chop the data into intervals; each interval is, roughly speaking, a bucket.
For more on these two functions, see the section on discretization and binning at:
https://blog.csdn.net/qq_42413820/article/details/80792530
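The difference between the two is easiest to see on a skewed sample (data chosen for illustration): cut makes bins of equal width, so an outlier leaves most values crowded into one bin, while qcut makes bins with (roughly) equal counts.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5, 6, 7, 100.0])

# cut: 2 bins of equal width -> 7 values land in the low bin, 1 in the high bin
equal_width = pd.cut(data, 2)

# qcut: 2 bins of equal frequency -> 4 values in each bin
equal_freq = pd.qcut(data, 2)
```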
An example:
In [157]: frame = pd.DataFrame({'data1':np.random.randn(1000),
     ...:                       'data2':np.random.randn(1000)})
In [158]: frame
Out[158]:
        data1     data2
0   -0.805422 -0.369037
1    1.197099 -0.749794
..        ...       ...
998 -1.361444 -0.456945
999  0.187000  0.214123
[1000 rows x 2 columns]
In [160]: quant = pd.cut(frame.data1, 4)  # split data1's range (min to max) into 4 equal-width intervals
In [162]: quant[:2]
Out[162]:
0    (-1.226, 0.664]
1     (0.664, 2.553]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.123, -1.226] < (-1.226, 0.664] < (0.664, 2.553] < (2.553, 4.443]]
In [163]: def get_stats(group):
     ...:     return {'min': group.min(),
     ...:             'max': group.max(),
     ...:             'count': group.count(),
     ...:             'mean': group.mean()}
In [164]: group1 = frame.data2.groupby(quant)
In [165]: group1.apply(get_stats).unstack()
Out[165]:
                  count       max      mean       min
data1
(-3.123, -1.226]  113.0  1.904771 -0.077147 -2.467027
(-1.226, 0.664]   621.0  2.959166 -0.007698 -3.102885
(0.664, 2.553]    261.0  2.622070  0.010818 -2.502551
(2.553, 4.443]      5.0  0.504759  0.020948 -0.919104
Cutting at sample quantiles instead gives buckets of equal size:
In [168]: quant = pd.qcut(frame.data1,4)
In [169]: group1 = frame.data2.groupby(quant)
In [170]: group1.apply(get_stats).unstack()
Out[170]:
count ... min
data1 ...
(-3.116, -0.672] 250.0 ... -2.467027
(-0.672, -0.0194] 250.0 ... -3.102885
(-0.0194, 0.73] 250.0 ... -2.815319
(0.73, 4.443] 250.0 ... -2.502551
[4 rows x 4 columns]
In [171]: quant = pd.qcut(frame.data1,4,labels=False) # labels=False returns integer bucket numbers instead of intervals
In [172]: group1 = frame.data2.groupby(quant)
In [173]: group1.apply(get_stats).unstack()
Out[173]:
count max mean min
data1
0 250.0 2.959166 -0.048357 -2.467027
1 250.0 2.401510 0.017249 -3.102885
2 250.0 2.386183 -0.033954 -2.815319
3 250.0 2.622070 0.022783 -2.502551
Pivot tables and cross-tabulation
A pivot table is a common data-summarization tool found in spreadsheet programs and other data-analysis software.
pivot_table parameters:
values     | Column name or names to aggregate; by default, aggregates all numeric columns
index      | Column names or other group keys to group on the rows of the resulting pivot table
columns    | Column names or other group keys to group on the columns of the resulting pivot table
aggfunc    | Aggregation function or list of functions ('mean' by default); can be any function valid in a groupby context
fill_value | Replace missing values in the result table
margins    | Add row/column subtotals and a grand total (False by default)
In [181]: tips.pivot_table(index=['sex','smoker'])
Out[181]:
                   size       tip   tip_pct  total_bill
sex    smoker
Female No      2.592593  2.773519  0.156921   18.105185
       Yes     2.242424  2.931515  0.182150   17.977879
Male   No      2.711340  3.113402  0.160669   19.791237
       Yes     2.500000  3.051167  0.152771   22.284500
# aggregate only tip_pct and size
In [182]: tips.pivot_table(['tip_pct','size'], index=['sex','day'], columns='smoker')
Out[182]:
                 size            tip_pct
smoker             No       Yes       No       Yes
sex    day
Female Fri   2.500000  2.000000  0.165296  0.209129
       Sat   2.307692  2.200000  0.147993  0.163817
       Sun   3.071429  2.500000  0.165710  0.237075
       Thur  2.480000  2.428571  0.155971  0.163073
Male   Fri   2.000000  2.125000  0.138005  0.144730
       Sat   2.656250  2.629630  0.162132  0.139067
       Sun   2.883721  2.600000  0.158291  0.173964
       Thur  2.500000  2.300000  0.165706  0.164417
# pass margins=True to add partial totals
In [183]: tips.pivot_table('tip_pct', index=['sex','smoker'], columns='day', margins=True)
Out[183]:
day                 Fri       Sat       Sun      Thur       All
sex    smoker
Female No      0.165296  0.147993  0.165710  0.155971  0.156921
       Yes     0.209129  0.163817  0.237075  0.163073  0.182150
Male   No      0.138005  0.162132  0.158291  0.165706  0.160669
       Yes     0.144730  0.139067  0.173964  0.164417  0.152771
All            0.169913  0.153152  0.166897  0.161276  0.160803
# pass a different aggregation function via aggfunc
In [184]: tips.pivot_table('tip_pct', index=['sex','smoker'], columns='day', aggfunc=len, margins=True)
Out[184]:
day             Fri   Sat   Sun  Thur    All
sex    smoker
Female No       2.0  13.0  14.0  25.0   54.0
       Yes      7.0  15.0   4.0   7.0   33.0
Male   No       2.0  32.0  43.0  20.0   97.0
       Yes     8.0   27.0  15.0  10.0   60.0
All            19.0  87.0  76.0  62.0  244.0
# use fill_value to fill NA values
In [185]: tips.pivot_table('size', index=['time','sex','smoker'], columns='day', aggfunc='sum', fill_value=0)
Out[185]:
day                   Fri  Sat  Sun  Thur
time   sex    smoker
Dinner Female No        2   30   43     2
              Yes       8   33   10     0
       Male   No        4   85  124     0
              Yes      12   71   39     0
Lunch  Female No        3    0    0    60
              Yes       6    0    0    17
       Male   No        0    0    0    50
              Yes       5    0    0    23
A cross-tabulation (crosstab) is a special pivot table that computes group frequencies.
In [186]: pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
Out[186]:
smoker        No  Yes  All
time   day
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244
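Since crosstab is just a frequency-counting pivot table, the same counts can be produced with groupby; a sketch on a tiny stand-in frame (values illustrative):

```python
import pandas as pd

tips = pd.DataFrame({'day': ['Sun', 'Sun', 'Sat', 'Sat'],
                     'smoker': ['No', 'Yes', 'No', 'No']})

# crosstab counts how often each (day, smoker) pair occurs
ct = pd.crosstab(tips.day, tips.smoker)

# the same table via groupby: count pair sizes, then pivot smoker into columns
via_groupby = tips.groupby(['day', 'smoker']).size().unstack(fill_value=0)
```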