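【Lesson 2.19】 Data Grouping

Grouped statistics: the groupby feature

① Split the data into groups according to some criterion
② Apply a function to each group independently
③ Combine the results into a single data structure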
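The three steps above can be spelled out by hand and compared with the one-line groupby call. This is only a sketch: the column names `key` and `value` and the data are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not from the lesson
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})

# 1) split: one sub-DataFrame per distinct key
pieces = {name: df[df['key'] == name] for name in df['key'].unique()}

# 2) apply: compute on each piece independently
sums = {name: piece['value'].sum() for name, piece in pieces.items()}

# 3) combine: gather the per-group results into one Series
manual = pd.Series(sums, name='value').rename_axis('key')

# groupby performs the same three steps in a single expression
direct = df.groupby('key')['value'].sum()

print(manual)
print(direct)
```

Both Series hold the same per-key totals; groupby simply hides the bookkeeping.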

1. Grouping


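import numpy as np
import pandas as pd

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
print(df)

print(df.groupby('A'))
# Grouping alone returns a GroupBy object: an intermediate result, nothing is computed yet

a = df.groupby('A').mean(numeric_only=True)  # numeric_only skips the string column B
b = df.groupby(['A','B']).mean()
c = df.groupby('A')['D'].mean()  # group by A, then take the mean of D
print(a)
print(b)
print(c)
# Applying a calculation to the groups yields a new DataFrame (a Series for a single column)
# Default is axis=0: the rows are grouped
# Group by one column, or by several columns passed as a list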
----------------------------------------------------------------------
     A      B         C         D
0  foo    one -0.493902  0.618592
1  bar    one  1.125378 -1.685569
2  foo    two  2.891270 -0.979019
3  bar  three -0.948411  0.047357
4  foo    two  1.337867 -0.223610
5  bar    two  0.111866  1.104062
6  foo    one -0.317939  2.130371
7  foo  three -1.447532  0.442768
----------------------------------------------------------------------
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016F51727D48>
----------------------------------------------------------------------
            C         D
A                      
bar  0.096278 -0.178050
foo  0.393953  0.397821
----------------------------------------------------------------------
                  C         D
A   B                        
bar one    1.125378 -1.685569
    three -0.948411  0.047357
    two    0.111866  1.104062
foo one   -0.405921  1.374481
    three -1.447532  0.442768
    two    2.114569 -0.601314
----------------------------------------------------------------------
A
bar   -0.178050
foo    0.397821
Name: D, dtype: float64

2. Grouping - iterable object

df = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
print(df)
print(df.groupby('X'))
----------------------------------------------------------------------
   X  Y
0  A  1
1  B  4
2  A  3
3  B  2
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F0B07E7A88>

print(list(df.groupby('X')), '→ iterable object, converts directly to a list\n')
print(list(df.groupby('X'))[0], '→ each element is a tuple\n')
for n,g in df.groupby('X'):
    print(n)
    print(g)
    print('###')
# n is the group name, g is that group's DataFrame
----------------------------------------------------------------------
[('A',    X  Y
0  A  1
2  A  3), ('B',    X  Y
1  B  4
3  B  2)] → iterable object, converts directly to a list
----------------------------------------------------------------------
('A',    X  Y
0  A  1
2  A  3) → each element is a tuple
----------------------------------------------------------------------
A
   X  Y
0  A  1
2  A  3
###
B
   X  Y
1  B  4
3  B  2
###

print(df.groupby('X').get_group('A'))
print(df.groupby('X').get_group('B'))
# .get_group() extracts a single group as a DataFrame
-----------------------------------------------------------------------
   X  Y
0  A  1
2  A  3
   X  Y
1  B  4
3  B  2

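grouped = df.groupby('X')
print(grouped.groups)
print(grouped.groups['A'])
# Can also be written in one step: df.groupby('X').groups['A']
# .groups: a dict mapping each group name to the index labels of its rows
# Use normal dict indexing to look up the index labels of one group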
-----------------------------------------------------------------------
{'A': Int64Index([0, 2], dtype='int64'), 'B': Int64Index([1, 3], dtype='int64')}
Int64Index([0, 2], dtype='int64')

sz = grouped.size()
print(sz)
# .size(): number of rows in each group
-----------------------------------------------------------------------
X
A    2
B    2
dtype: int64
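`.size()` and `.count()` are easy to confuse: `size` counts rows, while `count` counts non-NaN values. A small sketch, with a NaN added to the lesson's X/Y data as an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Same shape as the X/Y example, but with one missing value (illustrative data)
df = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1.0, np.nan, 3.0, 2.0]})
grouped = df.groupby('X')

sizes = grouped.size()          # rows per group, NaN rows included
counts = grouped['Y'].count()   # non-NaN values of Y per group

print(sizes)
print(counts)
```

Group B has two rows but only one non-NaN value in Y, so the two results differ for that group.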

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
grouped = df.groupby(['A','B']).groups
print(df)
print(grouped)
print(grouped[('foo', 'three')])
# Group by two columns; the keys of the groups dict are (A, B) tuples
-----------------------------------------------------------------------
     A      B         C         D
0  foo    one -0.579087  0.232869
1  bar    one -0.032279  0.843799
2  foo    two -0.530994 -0.384497
3  bar  three  0.207413  0.397429
4  foo    two  0.032195 -0.168221
5  bar    two  0.572647 -0.494428
6  foo    one -1.887133 -1.031850
7  foo  three -0.746258 -0.091591
----------------------------------------------------------------------
{('bar', 'one'): Int64Index([1], dtype='int64'), ('bar', 'three'): Int64Index([3], dtype='int64'), ('bar', 'two'): Int64Index([5], dtype='int64'), ('foo', 'one'): Int64Index([0, 6], dtype='int64'), ('foo', 'three'): Int64Index([7], dtype='int64'), ('foo', 'two'): Int64Index([2, 4], dtype='int64')}
----------------------------------------------------------------------
Int64Index([7], dtype='int64')
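With two grouping columns, a single group is addressed by a tuple of the two key values. A minimal sketch on fixed data (the values are invented so the result is reproducible):

```python
import pandas as pd

# Illustrative data with fixed values instead of random numbers
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'one'],
                   'C': [1, 2, 3, 4]})
grouped = df.groupby(['A', 'B'])

# The group key is the tuple (value of A, value of B)
foo_one = grouped.get_group(('foo', 'one'))
sizes = grouped.size()   # Series indexed by a two-level MultiIndex

print(foo_one)
print(sizes)
```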

3. Grouping along the other axis


df = pd.DataFrame({'data1':np.random.rand(2),
                  'data2':np.random.rand(2),
                  'key1':['a','b'],
                  'key2':['one','two']})
print(df)
print(df.dtypes)
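print(df.groupby(df.dtypes, axis=1))
# axis=1 groups the columns instead of the rows; here the columns are grouped by dtype
# Note: the axis argument of groupby is deprecated in pandas 2.1 and later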
----------------------------------------------------------------------
      data1     data2 key1 key2
0  0.786626  0.898570    a  one
1  0.006811  0.971437    b  two

data1    float64
data2    float64
key1      object
key2      object


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016F4D91A688>
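One possible way to group columns by dtype without the `axis` argument is to group the column names themselves and then select each block of columns. This is a sketch under assumptions: the column names `n1` and `n2` and the fixed values are invented for the example.

```python
import pandas as pd

# Illustrative frame with two float columns and two integer columns
df = pd.DataFrame({'data1': [0.1, 0.2],
                   'data2': [0.3, 0.4],
                   'n1': [1, 2],
                   'n2': [3, 4]})

# Series indexed by column name, holding each column's dtype as a string
dtype_names = df.dtypes.astype(str)

# Group the column names by dtype, then sum each block of columns row-wise
row_sums = pd.DataFrame({
    name: df[cols.index].sum(axis=1)
    for name, cols in dtype_names.groupby(dtype_names)
})
print(row_sums)
```

The result has one column per dtype, which mirrors what a column-wise groupby followed by `sum()` produces.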

4. Grouping with a dict or a Series


df = pd.DataFrame(np.arange(16).reshape(4,4),
                  columns = ['a','b','c','d'])
print(df)

mapping = {'a':'one','b':'one','c':'two','d':'two','e':'three'}
by_column = df.groupby(mapping, axis = 1)
print(by_column.sum())

# In mapping, columns a and b map to 'one' and columns c and d map to 'two'; the dict defines the groups (the unused key 'e' is ignored)

s = pd.Series(mapping)
print(s)
print(s.groupby(s).count())
# In s, index labels a and b map to 'one' and c and d map to 'two'; a Series can serve as the grouping key in the same way
----------------------------------------------------------------------
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
-----
   one  two
0    1    5
1    9   13
2   17   21
3   25   29
-----
a      one
b      one
c      two
d      two
e    three

one      2
three    1
two      2
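The dict-based column grouping can also be written without `axis=1` by transposing first, so that the column labels become the row index. A sketch that should reproduce the `one`/`two` sums shown above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=['a', 'b', 'c', 'd'])
mapping = {'a': 'one', 'b': 'one', 'c': 'two', 'd': 'two', 'e': 'three'}

# Transpose so the column labels become the index, group those labels with the dict,
# aggregate, then transpose back to the original orientation
by_column = df.T.groupby(mapping).sum().T
print(by_column)
```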

5. Aggregation methods on groups


s = pd.Series([1, 2, 3, 10, 20, 30], index = [1, 2, 3, 1, 2, 3])
grouped = s.groupby(level=0)  # group by index level 0: values sharing an index label form one group
print(grouped)
print(grouped.first(),'→ first: first non-NaN value\n')
print(grouped.last(),'→ last: last non-NaN value\n')
print(grouped.sum(),'→ sum: sum of non-NaN values\n')
print(grouped.mean(),'→ mean: mean of non-NaN values\n')
print(grouped.median(),'→ median: median of non-NaN values\n')
print(grouped.count(),'→ count: number of non-NaN values\n')
print(grouped.min(),'→ min, max: smallest and largest non-NaN value\n')
print(grouped.std(),'→ std, var: standard deviation and variance of non-NaN values\n')
print(grouped.prod(),'→ prod: product of non-NaN values\n')
----------------------------------------------------------------------
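To see what "non-NaN" means in practice, the same index layout can be used with two missing values. The data below is an illustrative variation, not the lesson's series:

```python
import numpy as np
import pandas as pd

# Same index as the lesson's series, with two values replaced by NaN
s = pd.Series([np.nan, 2, 3, 10, np.nan, 30], index=[1, 2, 3, 1, 2, 3])
grouped = s.groupby(level=0)

first = grouped.first()   # label 1 gives 10.0: the leading NaN is skipped
last = grouped.last()     # label 2 gives 2.0: the trailing NaN is skipped
total = grouped.sum()     # NaN contributes nothing to the sum
count = grouped.count()   # number of non-NaN values per label

print(first, last, total, count, sep='\n')
```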

6. Applying several functions: agg()


df = pd.DataFrame({'a':[1,1,2,2],
                  'b':np.random.rand(4),
                  'c':np.random.rand(4),
                  'd':np.random.rand(4),})
print(df)
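print(df.groupby('a').agg(['mean','sum']))
print(df.groupby('a')['b'].agg(result1='mean',
                               result2='sum'))
# Functions can be given as strings or as NumPy functions (np.mean, np.sum);
# strings are the safer choice because recent pandas versions may warn about NumPy callables
# A list applies every function to every column
# For a single column, keyword arguments name the result columns (named aggregation);
# the older dict form {'result1': ...} on a single column was removed in pandas 1.0
----------------------------------------------------------------------
   a         b         c         d
0  1  0.357911  0.318324  0.627797
1  1  0.964829  0.500017  0.570063
2  2  0.116608  0.194164  0.049509
3  2  0.933123  0.542615  0.718640
          b                   c                   d         
       mean       sum      mean       sum      mean      sum
a                                                           
1  0.661370  1.322739  0.409171  0.818341  0.598930  1.19786
2  0.524865  1.049730  0.368390  0.736780  0.384075  0.76815
    result1   result2
a                    
1  0.661370  1.322739
2  0.524865  1.049730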
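On a grouped DataFrame, agg also accepts a dict keyed by column name, so each column can get its own function, and keyword arguments of the form `new_name=(column, function)` name the outputs explicitly. A sketch with fixed, made-up values so the results are easy to verify:

```python
import pandas as pd

# Illustrative data with fixed values instead of random numbers
df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1.0, 3.0, 5.0, 7.0],
                   'c': [10, 20, 30, 40]})

# Dict form: keys are existing column names, values are the functions to apply
per_column = df.groupby('a').agg({'b': 'mean', 'c': 'sum'})

# Named aggregation: each keyword becomes an output column
named = df.groupby('a').agg(b_mean=('b', 'mean'), c_max=('c', 'max'))

print(per_column)
print(named)
```

The dict form keeps the original column names, while named aggregation lets the output columns be renamed in the same call.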

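Exercise 1: create the DataFrame df as specified, then use grouping to obtain the following results
① Group by A and compute the group means of C, D, E
② Group by A and B and compute the group sums of D, E
③ Group by A and show all groups as a dict
④ Group the columns by dtype and sum
⑤ Put C and D into one group and sum them
⑥ Group by B and compute each group's mean, sum, max, and min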

df = pd.DataFrame({'A' : ['one', 'two', 'three', 'one','two', 'three', 'one', 'two'],
                   'B' : ['h', 'h', 'h', 'h', 'f', 'f', 'f', 'f'],
                   'C' : np.arange(10,26,2),
                   'D' : np.random.randn(8),
                   'E':np.random.rand(8)})


print(df)
print(df.groupby('A')[['C','D','E']].mean())  # select the numeric columns; the string column B cannot be averaged
print(df.groupby(['A','B']).sum())
print(df.groupby('A').groups)
print(df.groupby(df.dtypes,axis=1).sum())
print(df.groupby({'C':'r','D':'r'},axis=1).sum())
print(df.groupby('B')[['C','D','E']].agg([np.mean,np.sum,np.max,np.min]))  # numeric columns only; A holds strings

The DataFrame df created:
        A  B   C         D         E
0    one  h  10 -0.085340  0.420645
1    two  h  12  2.373044  0.664479
2  three  h  14  0.553483  0.988042
3    one  h  16  0.155289  0.184052
4    two  f  18  1.942460  0.037124
5  three  f  20 -0.085759  0.658828
6    one  f  22 -1.368377  0.334869
7    two  f  24 -1.101152  0.254488 
------
Grouped by A, the group means of C, D, E are:
         C         D         E
A                            
one    16 -0.432809  0.313189
three  17  0.233862  0.823435
two    18  1.071450  0.318697 
------
Grouped by A and B, the group sums are (D and E are the requested columns; C is summed as well):
           C         D         E
A     B                        
one   f  22 -1.368377  0.334869
      h  26  0.069949  0.604697
three f  20 -0.085759  0.658828
      h  14  0.553483  0.988042
two   f  42  0.841308  0.291611
      h  12  2.373044  0.664479 
------
Grouped by A, all groups shown as a dict:
 {'three': [2, 5], 'two': [1, 4, 7], 'one': [0, 3, 6]} 
------
Columns grouped by dtype and summed:
    int32   float64  object
0     10  0.335305    oneh
1     12  3.037522    twoh
2     14  1.541526  threeh
3     16  0.339341    oneh
4     18  1.979583    twof
5     20  0.573069  threef
6     22 -1.033508    onef
7     24 -0.846665    twof 
------
C and D taken as one group and summed:
            r
0   9.914660
1  14.373044
2  14.553483
3  16.155289
4  19.942460
5  19.914241
6  20.631623
7  22.898848 
------
Grouped by B, the mean, sum, max, and min of each group:
      C                       D                                       E  \
  mean sum amax amin      mean       sum      amax      amin      mean   
B                                                                        
f   21  84   24   18 -0.153207 -0.612828  1.942460 -1.368377  0.321327   
h   13  52   16   10  0.749119  2.996477  2.373044 -0.085340  0.564304   

                                 
        sum      amax      amin  
B                                
f  1.285308  0.658828  0.037124  
h  2.257218  0.988042  0.184052   
------
