Data Science Packages: pandas Basics (Reshaping, Pivoting, Time Series, Categoricals)

I. Reshaping data (the stack function)

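When rearranging data with pandas, two functions come up constantly: stack and unstack. To stack is to pile things up; to unstack is to undo the pile.
 A table is labelled in both directions, with row labels and column labels (like a DataFrame). The "curly-brace structure" is labelled in one direction only, down the rows (like a Series with a hierarchical index), so it is the more stacked of the two (mnemonic: Series = stacked). stack turns the table structure into the curly-brace structure by moving the column labels into the innermost level of the row index. unstack does the reverse: it turns the curly-brace structure back into a table by moving one level of the row index (the innermost by default) into the columns.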

import numpy as np
import pandas as pd

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
>>>
                     A         B
first second
bar   one     0.040744 -0.114534
      two     1.172007  0.314802
baz   one     2.609065  1.809331
      two     0.998182  0.630713
foo   one    -1.592057 -0.383721
      two     0.547086  0.964051
qux   one    -1.547477 -0.014280
      two     0.193672  1.180379
  • stack: the column labels become the innermost level of the row index
stacked = df.stack()
stacked
>>>
first  second
bar    one     A    0.040744
               B   -0.114534
       two     A    1.172007
               B    0.314802
baz    one     A    2.609065
               B    1.809331
       two     A    0.998182
               B    0.630713
foo    one     A   -1.592057
               B   -0.383721
       two     A    0.547086
               B    0.964051
qux    one     A   -1.547477
               B   -0.014280
       two     A    0.193672
               B    1.180379
dtype: float64
  • unstack: the innermost level of the row index becomes the column labels again
stacked.unstack()
>>>
                     A         B
first second
bar   one     0.040744 -0.114534
      two     1.172007  0.314802
baz   one     2.609065  1.809331
      two     0.998182  0.630713
foo   one    -1.592057 -0.383721
      two     0.547086  0.964051
qux   one    -1.547477 -0.014280
      two     0.193672  1.180379

stacked.unstack().unstack()
>>>
               A                   B
second       one       two       one       two
first
bar     0.040744  1.172007 -0.114534  0.314802
baz     2.609065  0.998182  1.809331  0.630713
foo    -1.592057  0.547086 -0.383721  0.964051
qux    -1.547477  0.193672 -0.014280  1.180379
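
unstack does not have to act on the innermost level. The sketch below uses a small made-up frame (the integer values and the shortened index are illustrative, not the random data above) to show unstacking a named level and a full round trip:

```python
import numpy as np
import pandas as pd

# Small deterministic frame with a two-level row index (values are illustrative).
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

stacked = df.stack()                        # columns A/B become the innermost index level
wide_by_second = stacked.unstack("second")  # move a named level into the columns instead
round_trip = stacked.unstack()              # default: the innermost level goes back to the columns

print(stacked.index.nlevels)         # 3
print(list(wide_by_second.columns))  # ['one', 'two']
print(round_trip.equals(df))         # True
```

Passing a level name (or its integer position) is usually easier to read than chaining several unstack() calls.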

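II. Pivot tables (pivot_table)

A pivot table, also called an axis-rotation table, summarizes the values of one column against row and column labels taken from other columns.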

>>> df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
...                     'B' : ['A', 'B', 'C'] * 4,
...                     'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
...                     'D' : np.random.randn(12),
...                     'E' : np.random.randn(12)})
>>> df
        A  B    C         D         E
0     one  A  foo  0.133675  0.008375
1     one  B  foo -0.184734 -0.175907
2     two  C  foo -0.380400  0.020585
3   three  A  bar  0.447330  0.030309
4     one  B  bar -0.242254  0.194685
5     one  C  bar  0.388497 -0.846569
6     two  A  foo  0.739428  0.737537
7   three  B  foo  0.171285 -1.558148
8     one  C  foo -0.458316 -1.048907
9     one  A  bar -1.319356  1.664459
10    two  B  bar  0.797775 -0.163211
11  three  C  bar  2.528107  1.471624
  • Take the values from column D, with columns A and B as the row labels and column C as the column labels
>>> pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C             bar       foo
A     B
one   A -1.319356  0.133675
      B -0.242254 -0.184734
      C  0.388497 -0.458316
three A  0.447330       NaN
      B       NaN  0.171285
      C  2.528107       NaN
two   A       NaN  0.739428
      B  0.797775       NaN
      C       NaN -0.380400
  • Select the rows where A is 'one', group them by column C, and average each numeric column
>>> df[df.A=='one'].groupby('C').mean(numeric_only=True)
            D         E
C
bar -0.391038  0.337525
foo -0.169792 -0.405480
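
The two operations above are closely related: a pivot table is a group-by followed by an unstack. The sketch below illustrates this on a tiny made-up frame (the column names mirror the example, the values are invented), and also shows a non-default aggregation:

```python
import pandas as pd

# Tiny deterministic frame; the values are made up for illustration.
df = pd.DataFrame({
    "A": ["one", "one", "two", "two", "one"],
    "C": ["foo", "bar", "foo", "foo", "foo"],
    "D": [1.0, 2.0, 3.0, 5.0, 3.0],
})

# pivot_table aggregates rows that share the same labels, using the mean by default.
by_pivot = pd.pivot_table(df, values="D", index="A", columns="C")

# The same table built by hand: group, average, then move C into the columns.
by_groupby = df.groupby(["A", "C"])["D"].mean().unstack("C")

# A different aggregation; combinations with no rows are filled with 0.
totals = pd.pivot_table(df, values="D", index="A", columns="C",
                        aggfunc="sum", fill_value=0)

print(by_pivot.loc["one", "foo"])    # 2.0 (mean of 1.0 and 3.0)
print(by_groupby.loc["one", "foo"])  # 2.0
print(totals)
```

Combinations that never occur (here 'two' with 'bar') come out as NaN unless fill_value is given.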

III. Time series

pandas has strong time series support, for example converting second-by-second stock data into 5-minute bars.

1. Time series

  • Starting at 2020-05-04, generate 600 timestamps one second apart
>>> rng=pd.date_range('20200504',periods=600,freq='s')
>>> rng
DatetimeIndex(['2020-05-04 00:00:00', '2020-05-04 00:00:01',
               '2020-05-04 00:00:02', '2020-05-04 00:00:03',
               '2020-05-04 00:00:04', '2020-05-04 00:00:05',
               '2020-05-04 00:00:06', '2020-05-04 00:00:07',
               '2020-05-04 00:00:08', '2020-05-04 00:00:09',
               ...
               '2020-05-04 00:09:50', '2020-05-04 00:09:51',
               '2020-05-04 00:09:52', '2020-05-04 00:09:53',
               '2020-05-04 00:09:54', '2020-05-04 00:09:55',
               '2020-05-04 00:09:56', '2020-05-04 00:09:57',
               '2020-05-04 00:09:58', '2020-05-04 00:09:59'],
              dtype='datetime64[ns]', length=600, freq='S')
  • Assign each timestamp a random integer from 0 to 499 (the upper bound of randint is exclusive)
>>> ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
>>> ts
2020-05-04 00:00:00    163
2020-05-04 00:00:01    380
2020-05-04 00:00:02    459
2020-05-04 00:00:03     73
                      ...
2020-05-04 00:09:56    189
2020-05-04 00:09:57    242
2020-05-04 00:09:58    398
2020-05-04 00:09:59    281
Freq: S, Length: 600, dtype: int32
  • Use two-minute bins and sum the values that fall into each bin
>>> ts.resample('2Min').sum()
2020-05-04 00:00:00    31323
2020-05-04 00:02:00    30679
2020-05-04 00:04:00    29742
2020-05-04 00:06:00    32315
2020-05-04 00:08:00    27771
Freq: 2T, dtype: int32
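
Because the data above is random, the sums cannot be checked by hand. The following sketch repeats the resampling on a series with known values (0 to 599, an illustrative stand-in for prices) and adds the 5-minute bars mentioned at the start of this part:

```python
import numpy as np
import pandas as pd

# 600 one-second ticks carrying the known values 0..599.
rng = pd.date_range("2020-05-04", periods=600, freq="s")
ts = pd.Series(np.arange(600), index=rng)

# Two-minute bins, each labelled by its left edge.
two_min_sum = ts.resample("2min").sum()

# Five-minute open/high/low/close bars, a common summary for price series.
bars = ts.resample("5min").ohlc()

print(two_min_sum.tolist())  # [7140, 21540, 35940, 50340, 64740]
print(bars)
```

Each two-minute bin holds 120 consecutive integers, so the first sum is 0 + 1 + ... + 119 = 7140 and every later bin is 14400 larger.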

2. Converting between time representations

>>> rng = pd.date_range('20160301', periods=5, freq='M')
>>> ts = pd.Series(np.random.randn(len(rng)), index=rng)
>>> ts
2016-03-31    0.608165
2016-04-30    0.371630
2016-05-31   -0.169328
2016-06-30    0.403367
2016-07-31    0.242917
Freq: M, dtype: float64

2.1 Timestamps to periods: to_period()

>>> ps = ts.to_period() # with no argument the frequency comes from the index, monthly ('M') here
>>> ps
2016-03    0.608165
2016-04    0.371630
2016-05   -0.169328
2016-06    0.403367
2016-07    0.242917
Freq: M, dtype: float64

2.2 Periods to timestamps: to_timestamp()

>>> ps.to_timestamp() # the default is how='start', i.e. ps.to_timestamp(how='start')
2016-03-01    0.608165
2016-04-01    0.371630
2016-05-01   -0.169328
2016-06-01    0.403367
2016-07-01    0.242917
Freq: MS, dtype: float64
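
The how argument decides which end of each period the timestamp is taken from. A minimal sketch, using three hand-written month-end dates rather than the random series above:

```python
import pandas as pd

# Three month-end timestamps, written out explicitly (illustrative values).
idx = pd.DatetimeIndex(["2016-03-31", "2016-04-30", "2016-05-31"])
ts = pd.Series([1.0, 2.0, 3.0], index=idx)

ps = ts.to_period("M")                 # timestamps -> monthly periods
starts = ps.to_timestamp(how="start")  # first instant of each month
ends = ps.to_timestamp(how="end")      # last instant of each month

print(ps.index[0])                # 2016-03
print(starts.index[0])            # 2016-03-01 00:00:00
print(ends.index[0].normalize())  # 2016-03-31 00:00:00
```

normalize() is applied to the end timestamp only to drop the time of day for display.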

2.3 A fixed-frequency period index: period_range()

  • Signature
pandas.period_range(start=None, end=None, periods=None, freq=None, name=None)
  • Example
>>> prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
>>> prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='period[Q-NOV]', freq='Q-NOV')

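A year has four quarters. With Q-NOV the year ends in November, so September, October and November make up one quarter (Q4), and the other quarters follow from that.
Q: quarter end frequency
(B)Q(S)-NOV: quarterly frequency, year ends in November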
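
To see which calendar months a Q-NOV quarter covers, one can inspect the boundaries of a single Period. A short sketch:

```python
import pandas as pd

# Single quarters under the Q-NOV convention (the year ends in November).
q4 = pd.Period("1990Q4", freq="Q-NOV")
q1 = pd.Period("1990Q1", freq="Q-NOV")

print(q4.start_time.date(), q4.end_time.date())  # 1990-09-01 1990-11-30
print(q1.start_time.date(), q1.end_time.date())  # 1989-12-01 1990-02-28
```

Note that 1990Q1 starts in December 1989: the quarter is numbered by the year in which the November year-end falls.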

2.4 The Series index

>>> ts = pd.Series(np.random.randn(len(prng)), prng)
>>> ts
1990Q1   -0.504704
1990Q2   -0.343573
1990Q3   -1.511160
1990Q4   -0.268746
1991Q1   -0.013206
1991Q2   -1.730078
1991Q3    0.193844
1991Q4    0.732594
1992Q1   -0.656539
1992Q2   -0.577248
1992Q3    1.942460
1992Q4   -1.112426
1993Q1    0.040957
1993Q2   -1.572008
1993Q3   -0.130710
1993Q4   -1.019854
1994Q1    0.817004
1994Q2    2.099237
1994Q3    0.574557
1994Q4   -0.775947
1995Q1   -1.034555
1995Q2   -1.449465
1995Q3   -0.472482
1995Q4    0.750370
1996Q1   -0.081299
1996Q2   -0.214759
1996Q3   -0.141590
1996Q4   -0.397613
1997Q1   -0.771929
1997Q2    0.823948
1997Q3   -0.479001
1997Q4    0.624560
1998Q1    0.956335
1998Q2    0.531748
1998Q3   -0.048368
1998Q4    0.183149
1999Q1    1.060150
1999Q2    1.207727
1999Q3    1.309845
1999Q4    0.817419
2000Q1    0.992689
2000Q2    0.187355
2000Q3    1.014541
2000Q4    0.431878
Freq: Q-NOV, dtype: float64

Converting between periods and timestamps makes some convenient arithmetic available. In the example below, the quarterly frequency with the year ending in November is converted to 9 a.m. on the first day of the month that follows each quarter's end:

>>> ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
  • series.index
>>> ts.index
PeriodIndex(['1990-03-01 09:00', '1990-06-01 09:00', '1990-09-01 09:00',
             '1990-12-01 09:00', '1991-03-01 09:00', '1991-06-01 09:00',
             '1991-09-01 09:00', '1991-12-01 09:00', '1992-03-01 09:00',
             '1992-06-01 09:00', '1992-09-01 09:00', '1992-12-01 09:00',
             '1993-03-01 09:00', '1993-06-01 09:00', '1993-09-01 09:00',
             '1993-12-01 09:00', '1994-03-01 09:00', '1994-06-01 09:00',
             '1994-09-01 09:00', '1994-12-01 09:00', '1995-03-01 09:00',
             '1995-06-01 09:00', '1995-09-01 09:00', '1995-12-01 09:00',
             '1996-03-01 09:00', '1996-06-01 09:00', '1996-09-01 09:00',
             '1996-12-01 09:00', '1997-03-01 09:00', '1997-06-01 09:00',
             '1997-09-01 09:00', '1997-12-01 09:00', '1998-03-01 09:00',
             '1998-06-01 09:00', '1998-09-01 09:00', '1998-12-01 09:00',
             '1999-03-01 09:00', '1999-06-01 09:00', '1999-09-01 09:00',
             '1999-12-01 09:00', '2000-03-01 09:00', '2000-06-01 09:00',
             '2000-09-01 09:00', '2000-12-01 09:00'],
            dtype='period[H]', freq='H')

>>> ts
1990-03-01 09:00   -0.504704
1990-06-01 09:00   -0.343573
1990-09-01 09:00   -1.511160
1990-12-01 09:00   -0.268746
1991-03-01 09:00   -0.013206
1991-06-01 09:00   -1.730078
1991-09-01 09:00    0.193844
1991-12-01 09:00    0.732594
1992-03-01 09:00   -0.656539
1992-06-01 09:00   -0.577248
1992-09-01 09:00    1.942460
1992-12-01 09:00   -1.112426
1993-03-01 09:00    0.040957
1993-06-01 09:00   -1.572008
1993-09-01 09:00   -0.130710
1993-12-01 09:00   -1.019854
1994-03-01 09:00    0.817004
1994-06-01 09:00    2.099237
1994-09-01 09:00    0.574557
1994-12-01 09:00   -0.775947
1995-03-01 09:00   -1.034555
1995-06-01 09:00   -1.449465
1995-09-01 09:00   -0.472482
1995-12-01 09:00    0.750370
1996-03-01 09:00   -0.081299
1996-06-01 09:00   -0.214759
1996-09-01 09:00   -0.141590
1996-12-01 09:00   -0.397613
1997-03-01 09:00   -0.771929
1997-06-01 09:00    0.823948
1997-09-01 09:00   -0.479001
1997-12-01 09:00    0.624560
1998-03-01 09:00    0.956335
1998-06-01 09:00    0.531748
1998-09-01 09:00   -0.048368
1998-12-01 09:00    0.183149
1999-03-01 09:00    1.060150
1999-06-01 09:00    1.207727
1999-09-01 09:00    1.309845
1999-12-01 09:00    0.817419
2000-03-01 09:00    0.992689
2000-06-01 09:00    0.187355
2000-09-01 09:00    1.014541
2000-12-01 09:00    0.431878
Freq: H, dtype: float64
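
The one-line index conversion packs four steps together. The sketch below splits them apart on a two-element index so each intermediate result can be printed. It uses the lowercase alias 'h', on the assumption that a recent pandas is installed, where the uppercase 'H' is deprecated; the intermediate variable names are illustrative.

```python
import pandas as pd

# Two quarters are enough to follow each step; Q-NOV means the year ends in November.
prng = pd.period_range("1990Q1", periods=2, freq="Q-NOV")

last_month = prng.asfreq("M", how="end")          # last month of each quarter
next_month = last_month + 1                       # the month after the quarter ends
first_hour = next_month.asfreq("h", how="start")  # the hour starting at midnight on the 1st
nine_am = first_hour + 9                          # nine hourly periods later

print(last_month[0])  # 1990-02
print(next_month[0])  # 1990-03
print(nine_am[0])     # 1990-03-01 09:00
```

Adding an integer to a PeriodIndex shifts it by that many units of its own frequency, which is why + 1 means one month in the second step and + 9 means nine hours in the last.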

IV. Categorical data

>>> df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
>>> df
   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         a
5   6         e

1. The astype function

  • Signature
DataFrame.astype(dtype, copy=True, errors='raise')
  • Purpose
    Cast a pandas object to the specified dtype.
  • Example
>>> df["grade"] = df["raw_grade"].astype("category")
>>> df
   id raw_grade grade
0   1         a     a
1   2         b     b
2   3         b     b
3   4         a     a
4   5         a     a
5   6         e     e
  • Inspect the categories
>>> df["grade"].cat.categories
Index(['a', 'b', 'e'], dtype='object')
  • Rename the categories
    ['a', 'b', 'e'] become the corresponding ["very good", "good", "very bad"]
>>> df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
>>> df
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
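
Renaming with a positional list relies on knowing the current order of the categories. A slightly more defensive sketch passes an explicit old-to-new mapping to Series.cat.rename_categories (the variable names here are illustrative):

```python
import pandas as pd

raw = pd.Series(["a", "b", "b", "a", "a", "e"])
grade = raw.astype("category")

# Map old labels to new ones explicitly, so the result does not depend on category order.
labels = {"a": "very good", "b": "good", "e": "very bad"}
grade = grade.cat.rename_categories(labels)

print(list(grade.cat.categories))  # ['very good', 'good', 'very bad']
print(grade.tolist())
# ['very good', 'good', 'good', 'very good', 'very good', 'very bad']
```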

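2. Sorting (sort_values())

Sort by the values of grade. For a categorical column the sort follows the order of the categories, not the alphabetical order of the labels, which is why 'very good' comes before 'good' here.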

>>> df.sort_values(by='grade', ascending=True)
   id raw_grade      grade
0   1         a  very good
3   4         a  very good
4   5         a  very good
1   2         b       good
2   3         b       good
5   6         e   very bad
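
The category order can be set deliberately. The sketch below assumes a five-level scale (the extra levels "bad" and "medium" are hypothetical and do not occur in the data) and marks it as ordered with Series.cat.set_categories:

```python
import pandas as pd

grade = pd.Series(["very good", "good", "good", "very good", "very bad"],
                  dtype="category")

# Without an explicit order, the categories are sorted alphabetically on creation.
print(list(grade.cat.categories))  # ['good', 'very bad', 'very good']

# Declare the full scale, including unused levels, and mark it as ordered.
scale = ["very bad", "bad", "medium", "good", "very good"]
grade = grade.cat.set_categories(scale, ordered=True)

print(grade.sort_values().tolist())
# ['very bad', 'good', 'good', 'very good', 'very good']
print(grade.min(), grade.max())    # very bad very good
```

With an ordered categorical, min(), max() and comparisons also follow the declared scale.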

3. Counting (groupby().size())

>>> df.groupby("grade").size()
grade
very good    3
good         2
very bad     1
dtype: int64
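
When a categorical column declares levels that never occur, the observed argument of groupby controls whether they appear in the counts. A small sketch, in which the level "c" is a hypothetical addition with no rows:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})
scale = ["a", "b", "c", "e"]  # "c" is a hypothetical level with no rows
df["grade"] = pd.Categorical(df["raw_grade"], categories=scale)

# observed=False keeps every declared category, so empty ones get a count of 0.
all_levels = df.groupby("grade", observed=False).size()
# observed=True reports only the categories that actually appear.
seen_only = df.groupby("grade", observed=True).size()

print(all_levels.to_dict())  # {'a': 3, 'b': 2, 'c': 0, 'e': 1}
print(seen_only.to_dict())   # {'a': 3, 'b': 2, 'e': 1}
```

Passing observed explicitly also avoids depending on the default, which differs between pandas versions.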