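[Lesson 2.19] Grouping Data
Grouped statistics: the groupby feature
① Split the data into groups according to some criteria
② Apply a function to each group independently
③ Combine the results into a single data structure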
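The three steps above can be sketched by hand and compared with a single groupby call. This is only an illustration: the toy DataFrame and its column names `key` and `val` are invented here and do not come from the lesson.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})

# 1) split: one sub-DataFrame per distinct key
pieces = {k: df[df['key'] == k] for k in df['key'].unique()}

# 2) apply: compute the mean of 'val' within each piece independently
means = {k: piece['val'].mean() for k, piece in pieces.items()}

# 3) combine: assemble the per-group results into one Series
manual = pd.Series(means).sort_index()

# groupby performs the same three steps in one call
auto = df.groupby('key')['val'].mean()

print(manual)
print(auto)
```

Both results hold the same per-group means; groupby simply avoids writing the loop by hand.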
1. Grouping
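import numpy as np
import pandas as pd

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
print(df)
print(df.groupby('A'))
# Grouping alone returns a GroupBy object: an intermediate result on which nothing has been computed yet
a = df.groupby('A').mean(numeric_only=True)  # numeric_only=True skips the string column B (needed in pandas 2.0+)
b = df.groupby(['A','B']).mean()
c = df.groupby('A')['D'].mean()  # group by A, then take the mean of D
print(a)
print(b)
print(c)
# Computing on the grouped object yields a new DataFrame (or a Series when a single column is selected)
# The default is axis=0, i.e. the rows are grouped
# Group by a single column, or by several columns passed as a list ([])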
----------------------------------------------------------------------
A B C D
0 foo one -0.493902 0.618592
1 bar one 1.125378 -1.685569
2 foo two 2.891270 -0.979019
3 bar three -0.948411 0.047357
4 foo two 1.337867 -0.223610
5 bar two 0.111866 1.104062
6 foo one -0.317939 2.130371
7 foo three -1.447532 0.442768
----------------------------------------------------------------------
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016F51727D48>
----------------------------------------------------------------------
C D
A
bar 0.096278 -0.178050
foo 0.393953 0.397821
----------------------------------------------------------------------
C D
A B
bar one 1.125378 -1.685569
three -0.948411 0.047357
two 0.111866 1.104062
foo one -0.405921 1.374481
three -1.447532 0.442768
two 2.114569 -0.601314
----------------------------------------------------------------------
A
bar -0.178050
foo 0.397821
Name: D, dtype: float64
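As a follow-up to the calls above, here is a minimal sketch of two ways to keep a string column out of a group mean, plus `as_index=False` to get the group key back as an ordinary column. Fixed, made-up values replace the random data so the results are reproducible.

```python
import pandas as pd

# Made-up values (not the random data used in the lesson)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'two'],
                   'C': [1.0, 2.0, 3.0, 4.0],
                   'D': [10.0, 20.0, 30.0, 40.0]})

# Option 1: select the numeric columns before aggregating
by_selection = df.groupby('A')[['C', 'D']].mean()

# Option 2: let pandas skip non-numeric columns
by_flag = df.groupby('A').mean(numeric_only=True)

# as_index=False keeps A as a regular column instead of the index
flat = df.groupby('A', as_index=False)['D'].mean()

print(by_selection)
print(by_flag)
print(flat)
```

Selecting columns explicitly states which columns are intended, which is often easier to read than relying on the flag.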
2. Grouping: iterable objects
df = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
print(df)
print(df.groupby('X'))
----------------------------------------------------------------------
X Y
0 A 1
1 B 4
2 A 3
3 B 2
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F0B07E7A88>
print(list(df.groupby('X')), '→ iterable object; converts directly to a list\n')
print(list(df.groupby('X'))[0], '→ each element is a tuple\n')
for n,g in df.groupby('X'):
    print(n)
    print(g)
    print('###')
# n is the group name; g is the DataFrame holding that group's rows
----------------------------------------------------------------------
[('A', X Y
0 A 1
2 A 3), ('B', X Y
1 B 4
3 B 2)] → iterable object; converts directly to a list
----------------------------------------------------------------------
('A', X Y
0 A 1
2 A 3) → each element is a tuple
----------------------------------------------------------------------
A
X Y
0 A 1
2 A 3
###
B
X Y
1 B 4
3 B 2
###
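Because iteration yields (name, sub-DataFrame) pairs, the groups can also be collected into a plain dict. A minimal sketch using the same df as above; the variable name `pieces` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})

# Each iteration yields (group name, sub-DataFrame), so a dict comprehension works
pieces = {name: group for name, group in df.groupby('X')}

print(pieces['A'])                # the rows whose X is 'A'
print(pieces['B']['Y'].tolist())  # the Y values of group 'B'
```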
print(df.groupby('X').get_group('A'))
print(df.groupby('X').get_group('B'))
# .get_group() extracts a single group by name
-----------------------------------------------------------------------
X Y
0 A 1
2 A 3
X Y
1 B 4
3 B 2
grouped = df.groupby('X')
print(grouped.groups)
print(grouped.groups['A'])
# Equivalent: df.groupby('X').groups['A']
# .groups: a dict mapping each group name to the index labels of its rows
# Look up a group's labels with ordinary dict indexing
-----------------------------------------------------------------------
{'A': Int64Index([0, 2], dtype='int64'), 'B': Int64Index([1, 3], dtype='int64')}
Int64Index([0, 2], dtype='int64')
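The labels stored in .groups can be passed straight to df.loc to recover a group's rows, which gives the same rows as .get_group(). A minimal sketch with the same df; note that pandas 2.0 and later print these labels as Index([0, 2], dtype='int64'), since Int64Index was removed.

```python
import pandas as pd

df = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
grouped = df.groupby('X')

# Index labels of the rows that belong to group 'A'
labels_a = grouped.groups['A']

# Use the labels to pull those rows out of the original DataFrame
rows_a = df.loc[labels_a]

print(list(labels_a))
print(rows_a)
print(grouped.get_group('A'))
```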
sz = grouped.size()
print(sz)
# .size(): the number of rows in each group
-----------------------------------------------------------------------
X
A 2
B 2
dtype: int64
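.size() and .count() agree on the data above, but they differ once missing values appear: size counts rows, count counts non-NaN values. A small sketch, with a NaN inserted into Y purely for illustration:

```python
import numpy as np
import pandas as pd

# Same shape as the lesson's df, but with one Y value replaced by NaN (an assumption for this demo)
df = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1.0, np.nan, 3.0, 2.0]})
grouped = df.groupby('X')

sizes = grouped.size()          # rows per group, NaN rows included
counts = grouped['Y'].count()   # non-NaN Y values per group

print(sizes)
print(counts)
```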
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
grouped = df.groupby(['A','B']).groups
print(df)
print(grouped)
print(grouped[('foo', 'three')])
# Grouping by two columns: each key in .groups is a tuple of (A, B) values
-----------------------------------------------------------------------
A B C D
0 foo one -0.579087 0.232869
1 bar one -0.032279 0.843799
2 foo two -0.530994 -0.384497
3 bar three 0.207413 0.397429
4 foo two 0.032195 -0.168221
5 bar two 0.572647 -0.494428
6 foo one -1.887133 -1.031850
7 foo three -0.746258 -0.091591
----------------------------------------------------------------------
{('bar', 'one'): Int64Index([1], dtype='int64'), ('bar', 'three'): Int64Index([3], dtype='int64'), ('bar', 'two'): Int64Index([5], dtype='int64'), ('foo', 'one'): Int64Index([0, 6], dtype='int64'), ('foo', 'three'): Int64Index([7], dtype='int64'), ('foo', 'two'): Int64Index([2, 4], dtype='int64')}
----------------------------------------------------------------------
Int64Index([7], dtype='int64')
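With two grouping columns, .get_group() takes the same kind of tuple key that .groups uses. A minimal sketch on a smaller, made-up DataFrame (values invented for reproducibility):

```python
import pandas as pd

# Made-up data; column C holds fixed integers instead of random numbers
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'one'],
                   'C': [1, 2, 3, 4]})
grouped = df.groupby(['A', 'B'])

# The key is a tuple: (value of A, value of B)
foo_one = grouped.get_group(('foo', 'one'))

print(grouped.ngroups)  # number of distinct (A, B) combinations
print(foo_one)
```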
3. Grouping along the other axis
df = pd.DataFrame({'data1':np.random.rand(2),
'data2':np.random.rand(2),
'key1':['a','b'],
'key2':['one','two']})
print(df)
print(df.dtypes)
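print(df.groupby(df.dtypes, axis=1))
# axis=1 groups the columns (here by dtype) instead of the rows; the axis argument is deprecated as of pandas 2.1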
----------------------------------------------------------------------
data1 data2 key1 key2
0 0.786626 0.898570 a one
1 0.006811 0.971437 b two
data1 float64
data2 float64
key1 object
key2 object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016F4D91A688>
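When the goal is only to separate columns by type, `DataFrame.select_dtypes` is a different tool that avoids the axis argument altogether. The sketch below is an alternative technique, not the groupby call shown above, and uses fixed made-up values.

```python
import pandas as pd

# Same column layout as the lesson's df, with fixed values for reproducibility
df = pd.DataFrame({'data1': [0.1, 0.2],
                   'data2': [0.3, 0.4],
                   'key1': ['a', 'b'],
                   'key2': ['one', 'two']})

numeric = df.select_dtypes(include='number')  # the float columns
other = df.select_dtypes(exclude='number')    # everything else

print(numeric)
print(other)
```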
4. Grouping by a dict or Series
df = pd.DataFrame(np.arange(16).reshape(4,4),
columns = ['a','b','c','d'])
print(df)
mapping = {'a':'one','b':'one','c':'two','d':'two','e':'three'}
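by_column = df.groupby(mapping, axis = 1)
print(by_column.sum())
# Grouping by a dict: mapping sends columns a and b to 'one' and columns c and d to 'two'; the unused key 'e' is ignored
s = pd.Series(mapping)
print(s)
print(s.groupby(s).count())
# Grouping by a Series: index labels a and b map to 'one', c and d to 'two', and e to 'three'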
----------------------------------------------------------------------
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
-----
one two
0 1 5
1 9 13
2 17 21
3 25 29
-----
a one
b one
c two
d two
e three
one 2
three 1
two 2
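The same column sums can be obtained without the axis argument by transposing, grouping the rows of the transposed frame with the dict, and transposing back. This is a sketch of that workaround using the df and mapping from above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=['a', 'b', 'c', 'd'])
mapping = {'a': 'one', 'b': 'one', 'c': 'two', 'd': 'two', 'e': 'three'}

# After .T the column labels become the row index, so the dict groups rows;
# the final .T restores the original orientation
by_column = df.T.groupby(mapping).sum().T

print(by_column)
```

The transpose is cheap here because all columns share one integer dtype; with mixed dtypes the transposed frame would fall back to object dtype.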
5. Group aggregation methods
s = pd.Series([1, 2, 3, 10, 20, 30], index = [1, 2, 3, 1, 2, 3])
grouped = s.groupby(level=0)  # with a single-level index, .groupby(level=0) puts rows sharing an index label into one group
print(grouped)
print(grouped.first(),'→ first: first non-NaN value\n')
print(grouped.last(),'→ last: last non-NaN value\n')
print(grouped.sum(),'→ sum: sum of the non-NaN values\n')
print(grouped.mean(),'→ mean: mean of the non-NaN values\n')
print(grouped.median(),'→ median: median of the non-NaN values\n')
print(grouped.count(),'→ count: number of non-NaN values\n')
print(grouped.min(),'→ min, max: smallest and largest non-NaN value\n')
print(grouped.std(),'→ std, var: standard deviation and variance of the non-NaN values\n')
print(grouped.prod(),'→ prod: product of the non-NaN values\n')
----------------------------------------------------------------------
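No output is shown for this block in the notes, so here is a small sketch that makes the "non-NaN" wording concrete. It reuses the Series above but replaces one value with NaN, which is an assumption made only for this demonstration.

```python
import numpy as np
import pandas as pd

# The lesson's Series with the value at the first label 2 replaced by NaN
s = pd.Series([1, np.nan, 3, 10, 20, 30], index=[1, 2, 3, 1, 2, 3])
grouped = s.groupby(level=0)

firsts = grouped.first()   # first non-NaN value per label
counts = grouped.count()   # number of non-NaN values per label
sums = grouped.sum()       # NaN is skipped when summing

print(firsts)
print(counts)
print(sums)
```

For label 2 the NaN is skipped, so first and sum both fall on the remaining value 20 and count is 1.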
6. Applying multiple functions: agg()
df = pd.DataFrame({'a':[1,1,2,2],
'b':np.random.rand(4),
'c':np.random.rand(4),
'd':np.random.rand(4),})
print(df)
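print(df.groupby('a').agg(['mean',np.sum]))
print(df.groupby('a')['b'].agg(result1='mean',
                               result2='sum'))
# Functions can be given as strings or as NumPy functions (np.xxx)
# Pass several functions as a list; with keyword arguments (named aggregation), each keyword becomes an output column name
# Passing a renaming dict such as {'result1': np.mean} to a single column raises SpecificationError in pandas 1.0+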
----------------------------------------------------------------------
a b c d
0 1 0.357911 0.318324 0.627797
1 1 0.964829 0.500017 0.570063
2 2 0.116608 0.194164 0.049509
3 2 0.933123 0.542615 0.718640
b c d
mean sum mean sum mean sum
a
1 0.661370 1.322739 0.409171 0.818341 0.598930 1.19786
2 0.524865 1.049730 0.368390 0.736780 0.384075 0.76815
result1 result2
a
1 0.661370 1.322739
2 0.524865 1.049730
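When grouping a whole DataFrame, agg() can also apply different functions to different columns. The sketch below shows two forms, named aggregation with (column, function) tuples and a dict of lists, on made-up values; the output names `b_mean` and `c_sum` are invented for this example.

```python
import pandas as pd

# Made-up values so the results are reproducible
df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1.0, 3.0, 5.0, 7.0],
                   'c': [10.0, 20.0, 30.0, 40.0]})

# Named aggregation: new_column_name=(source column, function)
result = df.groupby('a').agg(b_mean=('b', 'mean'), c_sum=('c', 'sum'))

# Dict of lists: keys pick columns, values list the functions for each
per_column = df.groupby('a').agg({'b': ['mean'], 'c': ['min', 'max']})

print(result)
print(per_column)
```

Named aggregation gives flat column names, while the dict form produces two-level column labels such as ('c', 'max').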
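Exercise 1: Create the DataFrame df as specified, then use grouping to obtain the following results
① Group by A and compute the group means of C, D, and E
② Group by A and B and compute the group sums of D and E
③ Group by A and display all groups as a dict
④ Group the columns by dtype and sum
⑤ Put C and D into one group and compute its sum
⑥ Group by B and compute each group's mean, sum, maximum, and minimum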
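df = pd.DataFrame({'A' : ['one', 'two', 'three', 'one','two', 'three', 'one', 'two'],
                   'B' : ['h', 'h', 'h', 'h', 'f', 'f', 'f', 'f'],
                   'C' : np.arange(10,26,2),
                   'D' : np.random.randn(8),
                   'E':np.random.rand(8)})
print(df)
print(df.groupby('A').mean(numeric_only=True))  # numeric_only=True skips the string column B
print(df.groupby(['A','B']).sum())
print(df.groupby('A').groups)
print(df.groupby(df.dtypes,axis=1).sum())  # axis=1 is deprecated as of pandas 2.1
print(df.groupby({'C':'r','D':'r'},axis=1).sum())
# Aggregate the numeric columns only; string names label the result columns max and min
# (the output below came from np.max/np.min and shows amax and amin)
print(df.groupby('B')[['C','D','E']].agg(['mean','sum','max','min']))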
The created df:
A B C D E
0 one h 10 -0.085340 0.420645
1 two h 12 2.373044 0.664479
2 three h 14 0.553483 0.988042
3 one h 16 0.155289 0.184052
4 two f 18 1.942460 0.037124
5 three f 20 -0.085759 0.658828
6 one f 22 -1.368377 0.334869
7 two f 24 -1.101152 0.254488
------
Group by A; the group means of C, D, and E:
C D E
A
one 16 -0.432809 0.313189
three 17 0.233862 0.823435
two 18 1.071450 0.318697
------
Group by A and B; the group sums of D and E (C is summed as well):
C D E
A B
one f 22 -1.368377 0.334869
h 26 0.069949 0.604697
three f 20 -0.085759 0.658828
h 14 0.553483 0.988042
two f 42 0.841308 0.291611
h 12 2.373044 0.664479
------
Group by A; all groups shown as a dict:
{'three': [2, 5], 'two': [1, 4, 7], 'one': [0, 3, 6]}
------
Columns grouped by dtype and summed:
int32 float64 object
0 10 0.335305 oneh
1 12 3.037522 twoh
2 14 1.541526 threeh
3 16 0.339341 oneh
4 18 1.979583 twof
5 20 0.573069 threef
6 22 -1.033508 onef
7 24 -0.846665 twof
------
C and D placed in one group and summed:
r
0 9.914660
1 14.373044
2 14.553483
3 16.155289
4 19.942460
5 19.914241
6 20.631623
7 22.898848
------
Group by B; each group's mean, sum, maximum, and minimum:
C D E \
mean sum amax amin mean sum amax amin mean
B
f 21 84 24 18 -0.153207 -0.612828 1.942460 -1.368377 0.321327
h 13 52 16 10 0.749119 2.996477 2.373044 -0.085340 0.564304
sum amax amin
B
f 1.285308 0.658828 0.037124
h 2.257218 0.988042 0.184052
------