文章目錄
一、數據整形(stack函數)
在用pandas進行數據重排時,經常用到stack
和unstack
兩個函數。stack的意思是堆疊,堆積
,unstack即“不要堆疊”
。
表格在行列方向上均有索引(類似於DataFrame),花括號結構只有“列方向”上的索引
(類似於層次化的Series),結構更加偏向於堆疊(Series-stack,方便記憶)。stack函數會將數據從"表格結構"變成"花括號結構",即將其行索引變成列索引
,反之,unstack函數將數據從”花括號結構“變成”表格結構“,即要將其中一層的列索引變成行索引
。
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
>>>
A B
first second
bar one 0.040744 -0.114534
two 1.172007 0.314802
baz one 2.609065 1.809331
two 0.998182 0.630713
foo one -1.592057 -0.383721
two 0.547086 0.964051
qux one -1.547477 -0.014280
two 0.193672 1.180379
- stack行索引變成列索引
stacked = df.stack()
stacked
>>>
first second
bar one A 0.040744
B -0.114534
two A 1.172007
B 0.314802
baz one A 2.609065
B 1.809331
two A 0.998182
B 0.630713
foo one A -1.592057
B -0.383721
two A 0.547086
B 0.964051
qux one A -1.547477
B -0.014280
two A 0.193672
B 1.180379
dtype: float64
- unstack列索引變成行索引
stacked.unstack()
>>>
A B
first second
bar one 0.040744 -0.114534
two 1.172007 0.314802
baz one 2.609065 1.809331
two 0.998182 0.630713
foo one -1.592057 -0.383721
two 0.547086 0.964051
qux one -1.547477 -0.014280
two 0.193672 1.180379
stacked.unstack().unstack()
>>>
A B
second one two one two
first
bar 0.040744 1.172007 -0.114534 0.314802
baz 2.609065 0.998182 1.809331 0.630713
foo -1.592057 0.547086 -0.383721 0.964051
qux -1.547477 0.193672 -0.014280 1.180379
二、數據透視表(pivot table)
pivot table/軸向旋轉表
>>> df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
... 'B' : ['A', 'B', 'C'] * 4,
... 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
... 'D' : np.random.randn(12),
... 'E' : np.random.randn(12)})
>>> df
A B C D E
0 one A foo 0.133675 0.008375
1 one B foo -0.184734 -0.175907
2 two C foo -0.380400 0.020585
3 three A bar 0.447330 0.030309
4 one B bar -0.242254 0.194685
5 one C bar 0.388497 -0.846569
6 two A foo 0.739428 0.737537
7 three B foo 0.171285 -1.558148
8 one C foo -0.458316 -1.048907
9 one A bar -1.319356 1.664459
10 two B bar 0.797775 -0.163211
11 three C bar 2.528107 1.471624
- 選擇
D列
數據,以A、B列
爲行標籤,以C列
爲列標籤
>>> pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C bar foo
A B
one A -1.319356 0.133675
B -0.242254 -0.184734
C 0.388497 -0.458316
three A 0.447330 NaN
B NaN 0.171285
C 2.528107 NaN
two A NaN 0.739428
B 0.797775 NaN
C NaN -0.380400
- 選擇
'one'類
數據,按照C列分組
並且求各行數據的均值
>>> df[df.A=='one'].groupby('C').mean()
D E
C
bar -0.391038 0.337525
foo -0.169792 -0.405480
三、時間序列
pandas 提供了強大的時間序列功能,比如把秒級的股票數據轉換爲5分鐘週期數據等。
1.時間序列
- 從20200504開始,以秒爲單位記錄600個時間節點
>>> rng=pd.date_range('20200504',periods=600,freq='s')
>>> rng
DatetimeIndex(['2020-05-04 00:00:00', '2020-05-04 00:00:01',
'2020-05-04 00:00:02', '2020-05-04 00:00:03',
'2020-05-04 00:00:04', '2020-05-04 00:00:05',
'2020-05-04 00:00:06', '2020-05-04 00:00:07',
'2020-05-04 00:00:08', '2020-05-04 00:00:09',
...
'2020-05-04 00:09:50', '2020-05-04 00:09:51',
'2020-05-04 00:09:52', '2020-05-04 00:09:53',
'2020-05-04 00:09:54', '2020-05-04 00:09:55',
'2020-05-04 00:09:56', '2020-05-04 00:09:57',
'2020-05-04 00:09:58', '2020-05-04 00:09:59'],
dtype='datetime64[ns]', length=600, freq='S')
- 爲每一個時間節點匹配一個0-500的數字
>>> ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
>>> ts
2020-05-04 00:00:00 163
2020-05-04 00:00:01 380
2020-05-04 00:00:02 459
2020-05-04 00:00:03 73
...
2020-05-04 00:09:56 189
2020-05-04 00:09:57 242
2020-05-04 00:09:58 398
2020-05-04 00:09:59 281
Freq: S, Length: 600, dtype: int32
- 以兩分鐘爲一個時間週期,計算每兩分鐘內所有節點對應的數字的和
>>> ts.resample('2Min', how='sum')
__main__:1: FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).sum()
2020-05-04 00:00:00 31323
2020-05-04 00:02:00 30679
2020-05-04 00:04:00 29742
2020-05-04 00:06:00 32315
2020-05-04 00:08:00 27771
Freq: 2T, dtype: int32
2.在不同的時間表達方式間轉換
>>> rng = pd.date_range('20160301', periods=5, freq='M')
>>> ts = pd.Series(np.random.randn(len(rng)), index=rng)
>>> ts
2016-03-31 0.608165
2016-04-30 0.371630
2016-05-31 -0.169328
2016-06-30 0.403367
2016-07-31 0.242917
Freq: M, dtype: float64
2.1 時間戳轉爲時期 to_period()函數
>>> ps = ts.to_period() #默認按月('M')顯示
>>> ps
2016-03 0.608165
2016-04 0.371630
2016-05 -0.169328
2016-06 0.403367
2016-07 0.242917
Freq: M, dtype: float64
2.2 時期轉爲時間戳 to_timestamp()函數
>>> ps.to_timestamp() #默認how='start',即:ps.to_timestamp(how='start')
2016-03-01 0.608165
2016-04-01 0.371630
2016-05-01 -0.169328
2016-06-01 0.403367
2016-07-01 0.242917
Freq: MS, dtype: float64
2.3 返回固定頻率的週期索引 period_range()函數
- 格式
pandas.period_range(start=None, end=None, periods=None, freq=None, name=None)
- 實例
>>> prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
>>> prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
'1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
'1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
'1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
'1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
'1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
'1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
'2000Q3', '2000Q4'],
dtype='period[Q-NOV]', freq='Q-NOV')
一年有四個季度,每年都以11月份爲結束,即是9,10,11三個月爲一個季度,以此類推
Q:quarter end frequency
#季度結束頻率
(B)Q(S)-NOV :quarterly frequency, year ends in November
#季度頻率,年度結束於11月份
2.4 series索引
>>> ts = pd.Series(np.random.randn(len(prng)), prng)
>>> ts
1990Q1 -0.504704
1990Q2 -0.343573
1990Q3 -1.511160
1990Q4 -0.268746
1991Q1 -0.013206
1991Q2 -1.730078
1991Q3 0.193844
1991Q4 0.732594
1992Q1 -0.656539
1992Q2 -0.577248
1992Q3 1.942460
1992Q4 -1.112426
1993Q1 0.040957
1993Q2 -1.572008
1993Q3 -0.130710
1993Q4 -1.019854
1994Q1 0.817004
1994Q2 2.099237
1994Q3 0.574557
1994Q4 -0.775947
1995Q1 -1.034555
1995Q2 -1.449465
1995Q3 -0.472482
1995Q4 0.750370
1996Q1 -0.081299
1996Q2 -0.214759
1996Q3 -0.141590
1996Q4 -0.397613
1997Q1 -0.771929
1997Q2 0.823948
1997Q3 -0.479001
1997Q4 0.624560
1998Q1 0.956335
1998Q2 0.531748
1998Q3 -0.048368
1998Q4 0.183149
1999Q1 1.060150
1999Q2 1.207727
1999Q3 1.309845
1999Q4 0.817419
2000Q1 0.992689
2000Q2 0.187355
2000Q3 1.014541
2000Q4 0.431878
Freq: Q-NOV, dtype: float64
- series.index
>>> ts.index
PeriodIndex(['1990-03-01 09:00', '1990-06-01 09:00', '1990-09-01 09:00',
'1990-12-01 09:00', '1991-03-01 09:00', '1991-06-01 09:00',
'1991-09-01 09:00', '1991-12-01 09:00', '1992-03-01 09:00',
'1992-06-01 09:00', '1992-09-01 09:00', '1992-12-01 09:00',
'1993-03-01 09:00', '1993-06-01 09:00', '1993-09-01 09:00',
'1993-12-01 09:00', '1994-03-01 09:00', '1994-06-01 09:00',
'1994-09-01 09:00', '1994-12-01 09:00', '1995-03-01 09:00',
'1995-06-01 09:00', '1995-09-01 09:00', '1995-12-01 09:00',
'1996-03-01 09:00', '1996-06-01 09:00', '1996-09-01 09:00',
'1996-12-01 09:00', '1997-03-01 09:00', '1997-06-01 09:00',
'1997-09-01 09:00', '1997-12-01 09:00', '1998-03-01 09:00',
'1998-06-01 09:00', '1998-09-01 09:00', '1998-12-01 09:00',
'1999-03-01 09:00', '1999-06-01 09:00', '1999-09-01 09:00',
'1999-12-01 09:00', '2000-03-01 09:00', '2000-06-01 09:00',
'2000-09-01 09:00', '2000-12-01 09:00'],
dtype='period[H]', freq='H')
在週期和時間戳之間進行轉換,可以使用一些方便的算術函數。在下面的例子中,我們將截至11月的季度頻率轉換爲季度結束後的月初的上午9時:
>>> ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
>>> ts
1990-03-01 09:00 -0.504704
1990-06-01 09:00 -0.343573
1990-09-01 09:00 -1.511160
1990-12-01 09:00 -0.268746
1991-03-01 09:00 -0.013206
1991-06-01 09:00 -1.730078
1991-09-01 09:00 0.193844
1991-12-01 09:00 0.732594
1992-03-01 09:00 -0.656539
1992-06-01 09:00 -0.577248
1992-09-01 09:00 1.942460
1992-12-01 09:00 -1.112426
1993-03-01 09:00 0.040957
1993-06-01 09:00 -1.572008
1993-09-01 09:00 -0.130710
1993-12-01 09:00 -1.019854
1994-03-01 09:00 0.817004
1994-06-01 09:00 2.099237
1994-09-01 09:00 0.574557
1994-12-01 09:00 -0.775947
1995-03-01 09:00 -1.034555
1995-06-01 09:00 -1.449465
1995-09-01 09:00 -0.472482
1995-12-01 09:00 0.750370
1996-03-01 09:00 -0.081299
1996-06-01 09:00 -0.214759
1996-09-01 09:00 -0.141590
1996-12-01 09:00 -0.397613
1997-03-01 09:00 -0.771929
1997-06-01 09:00 0.823948
1997-09-01 09:00 -0.479001
1997-12-01 09:00 0.624560
1998-03-01 09:00 0.956335
1998-06-01 09:00 0.531748
1998-09-01 09:00 -0.048368
1998-12-01 09:00 0.183149
1999-03-01 09:00 1.060150
1999-06-01 09:00 1.207727
1999-09-01 09:00 1.309845
1999-12-01 09:00 0.817419
2000-03-01 09:00 0.992689
2000-06-01 09:00 0.187355
2000-09-01 09:00 1.014541
2000-12-01 09:00 0.431878
Freq: H, dtype: float64
四、類別數據
>>> df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']}
>>> df
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e
1.astype函數
- 格式
DataFrame.astype(self: ~FrameOrSeries, dtype, copy: bool = True, errors: str = 'raise')
- 功能
將pandas對象轉換爲指定的dtype類型。 - 實例
>>> df["grade"] = df["raw_grade"].astype("category")
>>> df
id raw_grade grade
0 1 a a
1 2 b b
2 3 b b
3 4 a a
4 5 a a
5 6 e e
- 查看類型
>>> df["grade"].cat.categories
Index(['a', 'b', 'e'], dtype='object')
- 替換數據
['a', 'b', 'e']
變成了對應的["very good", "good", "very bad"]
>>> df["grade"].cat.categories = ["very good", "good", "very bad"]
>>> df
id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad
2.排序(sort_values())
根據grade的值排序
>>> df.sort_values(by='grade', ascending=True)
id raw_grade grade
0 1 a very good
3 4 a very good
4 5 a very good
1 2 b good
2 3 b good
5 6 e very bad
3.統計個數(groupby().size())
>>> df.groupby("grade").size()
grade
very good 3
good 2
very bad 1
dtype: int64