點擊標題即可獲取文章源代碼和筆記
數據集:https://download.csdn.net/download/weixin_44827418/12548095
Pandas高級處理
缺失值處理
數據離散化
合併
交叉表與透視表
分組與聚合
綜合案例
4.6 高級處理- 缺失值處理
1 )如何進行缺失值處理
兩種思路:
1 )刪除含有缺失值的樣本
2 )替換/ 插補
4.6 .1 如何處理nan
1 )判斷數據中是否存在NaN
pd. isnull( df)
pd. notnull( df)
2 )刪除含有缺失值的樣本
df. dropna( inplace= False )
替換/ 插補
df. fillna( value, inplace= False )
4.6 .2 不是缺失值nan,有默認標記的
1 )替換 ?- > np. nan
df. replace( to_replace= "?" , value= np. nan)
2 )處理np. nan缺失值的步驟
2 )缺失值處理實例
4.7 高級處理- 數據離散化
性別 年齡
A 1 23
B 2 30
C 1 18
物種 毛髮
A 1
B 2
C 3
男 女 年齡
A 1 0 23
B 0 1 30
C 1 0 18
狗 豬 老鼠 毛髮
A 1 0 0 2
B 0 1 0 1
C 0 0 1 1
one- hot編碼& 啞變量
4.7 .1 什麼是數據的離散化
原始的身高數據:165 ,174 ,160 ,180 ,159 ,163 ,192 ,184
4.7 .2 爲什麼要離散化
4.7 .3 如何實現數據的離散化
1 )分組
自動分組sr= pd. qcut( data, bins)
自定義分組sr= pd. cut( data, [ ] )
2 )將分組好的結果轉換成one- hot編碼
pd. get_dummies( sr, prefix= )
4.8 高級處理- 合併
numpy
np. concatnate( ( a, b) , axis= )
水平拼接
np. hstack( )
豎直拼接
np. vstack( )
1 )按方向拼接
pd. concat( [ data1, data2] , axis= 1 )
2 )按索引拼接
pd. merge實現合併
pd. merge( left, right, how= "inner" , on= [ 索引] )
4.9 高級處理- 交叉表與透視表
找到、探索兩個變量之間的關係
4.9 .1 交叉表與透視表什麼作用
4.9 .2 使用crosstab( 交叉表) 實現
pd. crosstab( value1, value2)
4.9 .3 pivot_table
4.10 高級處理- 分組與聚合
4.10 .1 什麼是分組與聚合
4.10 .2 分組與聚合API
dataframe
sr
4.6.1如何處理nan
import pandas as pd
movie = pd. read_csv( "./datas/IMDB-Movie-Data.csv" )
movie
Rank
Title
Genre
Description
Director
Actors
Year
Runtime (Minutes)
Rating
Votes
Revenue (Millions)
Metascore
0
1
Guardians of the Galaxy
Action,Adventure,Sci-Fi
A group of intergalactic criminals are forced ...
James Gunn
Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
2014
121
8.1
757074
333.13
76.0
1
2
Prometheus
Adventure,Mystery,Sci-Fi
Following clues to the origin of mankind, a te...
Ridley Scott
Noomi Rapace, Logan Marshall-Green, Michael Fa...
2012
124
7.0
485820
126.46
65.0
2
3
Split
Horror,Thriller
Three girls are kidnapped by a man with a diag...
M. Night Shyamalan
James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
2016
117
7.3
157606
138.12
62.0
3
4
Sing
Animation,Comedy,Family
In a city of humanoid animals, a hustling thea...
Christophe Lourdelet
Matthew McConaughey,Reese Witherspoon, Seth Ma...
2016
108
7.2
60545
270.32
59.0
4
5
Suicide Squad
Action,Adventure,Fantasy
A secret government agency recruits some of th...
David Ayer
Will Smith, Jared Leto, Margot Robbie, Viola D...
2016
123
6.2
393727
325.02
40.0
...
...
...
...
...
...
...
...
...
...
...
...
...
995
996
Secret in Their Eyes
Crime,Drama,Mystery
A tight-knit team of rising investigators, alo...
Billy Ray
Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...
2015
111
6.2
27585
NaN
45.0
996
997
Hostel: Part II
Horror
Three American college students studying abroa...
Eli Roth
Lauren German, Heather Matarazzo, Bijou Philli...
2007
94
5.5
73152
17.54
46.0
997
998
Step Up 2: The Streets
Drama,Music,Romance
Romantic sparks occur between two dance studen...
Jon M. Chu
Robert Hoffman, Briana Evigan, Cassie Ventura,...
2008
98
6.2
70699
58.01
50.0
998
999
Search Party
Adventure,Comedy
A pair of friends embark on a mission to reuni...
Scot Armstrong
Adam Pally, T.J. Miller, Thomas Middleditch,Sh...
2014
93
5.6
4881
NaN
22.0
999
1000
Nine Lives
Comedy,Family,Fantasy
A stuffy businessman finds himself trapped ins...
Barry Sonnenfeld
Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...
2016
87
5.3
12435
19.64
11.0
1000 rows × 12 columns
movie. isnull( )
Rank
Title
Genre
Description
Director
Actors
Year
Runtime (Minutes)
Rating
Votes
Revenue (Millions)
Metascore
0
False
False
False
False
False
False
False
False
False
False
False
False
1
False
False
False
False
False
False
False
False
False
False
False
False
2
False
False
False
False
False
False
False
False
False
False
False
False
3
False
False
False
False
False
False
False
False
False
False
False
False
4
False
False
False
False
False
False
False
False
False
False
False
False
...
...
...
...
...
...
...
...
...
...
...
...
...
995
False
False
False
False
False
False
False
False
False
False
True
False
996
False
False
False
False
False
False
False
False
False
False
False
False
997
False
False
False
False
False
False
False
False
False
False
False
False
998
False
False
False
False
False
False
False
False
False
False
True
False
999
False
False
False
False
False
False
False
False
False
False
False
False
1000 rows × 12 columns
import numpy as np
np. any ( movie. isnull( ) )
True
pd. notnull( movie)
Rank
Title
Genre
Description
Director
Actors
Year
Runtime (Minutes)
Rating
Votes
Revenue (Millions)
Metascore
0
True
True
True
True
True
True
True
True
True
True
True
True
1
True
True
True
True
True
True
True
True
True
True
True
True
2
True
True
True
True
True
True
True
True
True
True
True
True
3
True
True
True
True
True
True
True
True
True
True
True
True
4
True
True
True
True
True
True
True
True
True
True
True
True
...
...
...
...
...
...
...
...
...
...
...
...
...
995
True
True
True
True
True
True
True
True
True
True
False
True
996
True
True
True
True
True
True
True
True
True
True
True
True
997
True
True
True
True
True
True
True
True
True
True
True
True
998
True
True
True
True
True
True
True
True
True
True
False
True
999
True
True
True
True
True
True
True
True
True
True
True
True
1000 rows × 12 columns
np. all ( pd. notnull( movie) )
False
pd. isnull( movie) . any ( )
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) True
Metascore True
dtype: bool
pd. notnull( movie) . all ( )
Rank True
Title True
Genre True
Description True
Director True
Actors True
Year True
Runtime (Minutes) True
Rating True
Votes True
Revenue (Millions) False
Metascore False
dtype: bool
movie_full = movie. dropna( )
movie_full. isnull( ) . any ( )
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) False
Metascore False
dtype: bool
movie. head( )
Rank
Title
Genre
Description
Director
Actors
Year
Runtime (Minutes)
Rating
Votes
Revenue (Millions)
Metascore
0
1
Guardians of the Galaxy
Action,Adventure,Sci-Fi
A group of intergalactic criminals are forced ...
James Gunn
Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
2014
121
8.1
757074
333.13
76.0
1
2
Prometheus
Adventure,Mystery,Sci-Fi
Following clues to the origin of mankind, a te...
Ridley Scott
Noomi Rapace, Logan Marshall-Green, Michael Fa...
2012
124
7.0
485820
126.46
65.0
2
3
Split
Horror,Thriller
Three girls are kidnapped by a man with a diag...
M. Night Shyamalan
James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
2016
117
7.3
157606
138.12
62.0
3
4
Sing
Animation,Comedy,Family
In a city of humanoid animals, a hustling thea...
Christophe Lourdelet
Matthew McConaughey,Reese Witherspoon, Seth Ma...
2016
108
7.2
60545
270.32
59.0
4
5
Suicide Squad
Action,Adventure,Fantasy
A secret government agency recruits some of th...
David Ayer
Will Smith, Jared Leto, Margot Robbie, Viola D...
2016
123
6.2
393727
325.02
40.0
movie[ "Revenue (Millions)" ] . mean( )
82.95637614678897
movie[ "Revenue (Millions)" ] . fillna( movie[ "Revenue (Millions)" ] . mean( ) , inplace= True )
movie[ "Revenue (Millions)" ] . isnull( ) . any ( )
False
movie[ "Metascore" ] . fillna( movie[ "Metascore" ] . mean( ) , inplace= True )
movie[ "Metascore" ] . isnull( ) . any ( )
False
movie. isnull( ) . any ( )
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) False
Metascore False
dtype: bool
不是缺失值nan,有默認標記的處理方法
data = pd. read_csv( "./datas/GBvideos.csv" , encoding= "GBK" )
data
video_id
title
channel_title
category_id
tags
views
likes
dislikes
comment_total
thumbnail_link
date
0
jt2OHQh0HoQ
Live Apple Event - Apple September Event 2017 ...
Apple Event
28
apple events|apple event|iphone 8|iphone x|iph...
7426393
78240
13548
705
https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv...
13.09
1
AqokkXoa7uE
Holly and Phillip Meet Samantha the Sex Robot ...
This Morning
24
this morning|interview|holly willoughby|philli...
494203
2651
1309
0
https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg
13.09
2
YPVcg45W0z4
My DNA Test Results? I'm WHAT??
emmablackery
24
emmablackery|emma blackery|emma|blackery|briti...
142819
13119
151
1141
https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg
13.09
3
T_PuZBdT2iM
getting into a conversation in a language you ...
ProZD
1
skit|korean|language|conversation|esl|japanese...
1580028
65729
1529
3598
https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg
13.09
4
NsjsmgmbCfc
Baby Name Challenge?
Sprinkleofglitter
26
sprinkleofglitter|sprinkle of glitter|baby gli...
40592
5019
57
490
https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg
13.09
...
...
...
...
...
...
...
...
...
...
...
...
1595
w8fAellnPns
Juicy Chicken Breast - You Suck at Cooking (ep...
You Suck At Cooking
26
how to|cooking|recipe|kitchen|chicken|chicken ...
788466
31945
945
2274
https://i.ytimg.com/vi/w8fAellnPns/default.jpg
20.09
1596
RsG37JcEQNw
Weezer - Beach Boys
weezer
10
weezer|pacific daydream|pacificdaydream|beach ...
107927
2435
412
641
https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg
20.09
1597
htSiIA2g7G8
Berry Frozen Yogurt Bark Recipe
SORTEDfood
26
frozen yogurt bark|frozen yoghurt bark|frozen ...
109222
4840
35
212
https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg
20.09
1598
ZQK1F0wz6z4
What Do You Want to Eat??
Wong Fu Productions
24
panda|what should we eat|buzzfeed|comedy|boyfr...
626223
22962
532
1559
https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg
20.09
1599
DuPXdnSWoLk
The Child in Time: Trailer - BBC One
BBC
24
BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi...
99228
1699
?
135
https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg
20.09
1600 rows × 11 columns
new_data = data. replace( to_replace= "?" , value= np. nan)
new_data
video_id
title
channel_title
category_id
tags
views
likes
dislikes
comment_total
thumbnail_link
date
0
jt2OHQh0HoQ
Live Apple Event - Apple September Event 2017 ...
Apple Event
28
apple events|apple event|iphone 8|iphone x|iph...
7426393
78240
13548
705
https://i.ytimg.com/vi/jt2OHQh0HoQ/default_liv...
13.09
1
AqokkXoa7uE
Holly and Phillip Meet Samantha the Sex Robot ...
This Morning
24
this morning|interview|holly willoughby|philli...
494203
2651
1309
0
https://i.ytimg.com/vi/AqokkXoa7uE/default.jpg
13.09
2
YPVcg45W0z4
My DNA Test Results? I'm WHAT??
emmablackery
24
emmablackery|emma blackery|emma|blackery|briti...
142819
13119
151
1141
https://i.ytimg.com/vi/YPVcg45W0z4/default.jpg
13.09
3
T_PuZBdT2iM
getting into a conversation in a language you ...
ProZD
1
skit|korean|language|conversation|esl|japanese...
1580028
65729
1529
3598
https://i.ytimg.com/vi/T_PuZBdT2iM/default.jpg
13.09
4
NsjsmgmbCfc
Baby Name Challenge?
Sprinkleofglitter
26
sprinkleofglitter|sprinkle of glitter|baby gli...
40592
5019
57
490
https://i.ytimg.com/vi/NsjsmgmbCfc/default.jpg
13.09
...
...
...
...
...
...
...
...
...
...
...
...
1595
w8fAellnPns
Juicy Chicken Breast - You Suck at Cooking (ep...
You Suck At Cooking
26
how to|cooking|recipe|kitchen|chicken|chicken ...
788466
31945
945
2274
https://i.ytimg.com/vi/w8fAellnPns/default.jpg
20.09
1596
RsG37JcEQNw
Weezer - Beach Boys
weezer
10
weezer|pacific daydream|pacificdaydream|beach ...
107927
2435
412
641
https://i.ytimg.com/vi/RsG37JcEQNw/default.jpg
20.09
1597
htSiIA2g7G8
Berry Frozen Yogurt Bark Recipe
SORTEDfood
26
frozen yogurt bark|frozen yoghurt bark|frozen ...
109222
4840
35
212
https://i.ytimg.com/vi/htSiIA2g7G8/default.jpg
20.09
1598
ZQK1F0wz6z4
What Do You Want to Eat??
Wong Fu Productions
24
panda|what should we eat|buzzfeed|comedy|boyfr...
626223
22962
532
1559
https://i.ytimg.com/vi/ZQK1F0wz6z4/default.jpg
20.09
1599
DuPXdnSWoLk
The Child in Time: Trailer - BBC One
BBC
24
BBC|iPlayer|bbc one|bbc 1|bbc1|trailer|the chi...
99228
1699
NaN
135
https://i.ytimg.com/vi/DuPXdnSWoLk/default.jpg
20.09
1600 rows × 11 columns
new_data. isnull( ) . any ( )
video_id False
title False
channel_title False
category_id False
tags False
views False
likes False
dislikes True
comment_total False
thumbnail_link False
date False
dtype: bool
new_data. dropna( inplace= True )
new_data. isnull( ) . any ( )
video_id False
title False
channel_title False
category_id False
tags False
views False
likes False
dislikes False
comment_total False
thumbnail_link False
date False
dtype: bool
4.7 高級處理-數據離散化
import pandas as pd
data = pd. Series( [ 165 , 174 , 160 , 180 , 159 , 163 , 192 , 184 ] , index= [ "No1:165" , "No2:174" , "No3:160" , "No4:180" , "No5:159" , "No6:163" , "No7:192" , "No8:184" ] )
data
No1:165 165
No2:174 174
No3:160 160
No4:180 180
No5:159 159
No6:163 163
No7:192 192
No8:184 184
dtype: int64
自動分組
sr = pd. qcut( data, 3 )
sr
No1:165 (163.667, 178.0]
No2:174 (163.667, 178.0]
No3:160 (158.999, 163.667]
No4:180 (178.0, 192.0]
No5:159 (158.999, 163.667]
No6:163 (158.999, 163.667]
No7:192 (178.0, 192.0]
No8:184 (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]
sr. value_counts( )
(178.0, 192.0] 3
(158.999, 163.667] 3
(163.667, 178.0] 2
dtype: int64
type ( sr)
pandas.core.series.Series
pd. get_dummies( sr, prefix= "height" )
height_(158.999, 163.667]
height_(163.667, 178.0]
height_(178.0, 192.0]
No1:165
0
1
0
No2:174
0
1
0
No3:160
1
0
0
No4:180
0
0
1
No5:159
1
0
0
No6:163
1
0
0
No7:192
0
0
1
No8:184
0
0
1
自定義分組
sr = pd. cut( data, [ 150 , 165 , 180 , 195 ] )
sr
No1:165 (150, 165]
No2:174 (165, 180]
No3:160 (150, 165]
No4:180 (165, 180]
No5:159 (150, 165]
No6:163 (150, 165]
No7:192 (180, 195]
No8:184 (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
sr. value_counts( )
(150, 165] 4
(180, 195] 2
(165, 180] 2
dtype: int64
pd. get_dummies( sr, prefix= "身高" )
身高_(150, 165]
身高_(165, 180]
身高_(180, 195]
No1:165
1
0
0
No2:174
0
1
0
No3:160
1
0
0
No4:180
0
1
0
No5:159
1
0
0
No6:163
1
0
0
No7:192
0
0
1
No8:184
0
0
1
4.8 高級處理-合併
4.8.1 pd.concat實現合併(按方向拼接)
data1 = np. arange( 0 , 20 , 1 ) . reshape( 4 , 5 )
data1 = pd. DataFrame( data1)
data1
0
1
2
3
4
0
0
1
2
3
4
1
5
6
7
8
9
2
10
11
12
13
14
3
15
16
17
18
19
data2 = np. arange( 100 , 120 , 1 ) . reshape( 4 , 5 )
data2 = pd. DataFrame( data2)
data2
0
1
2
3
4
0
100
101
102
103
104
1
105
106
107
108
109
2
110
111
112
113
114
3
115
116
117
118
119
data_concat = pd. concat( [ data1, data2] , axis= 1 )
data_concat
0
1
2
3
4
0
1
2
3
4
0
0
1
2
3
4
100
101
102
103
104
1
5
6
7
8
9
105
106
107
108
109
2
10
11
12
13
14
110
111
112
113
114
3
15
16
17
18
19
115
116
117
118
119
data2. T
0
1
2
3
0
100
105
110
115
1
101
106
111
116
2
102
107
112
117
3
103
108
113
118
4
104
109
114
119
data_concat1 = pd. concat( [ data1, data2. T] , axis= 0 )
data_concat1
0
1
2
3
4
0
0
1
2
3
4.0
1
5
6
7
8
9.0
2
10
11
12
13
14.0
3
15
16
17
18
19.0
0
100
105
110
115
NaN
1
101
106
111
116
NaN
2
102
107
112
117
NaN
3
103
108
113
118
NaN
4
104
109
114
119
NaN
4.8.2 pd.merge實現合併(按索引拼接)
left= pd. DataFrame( { 'key1' : [ 'K0' , 'K0' , 'K1' , 'K2' ] ,
'key2' : [ 'K0' , 'K1' , 'K0' , 'K1' ] ,
'A' : [ 'A0' , 'A1' , 'A2' , 'A3' ] ,
'B' : [ 'B0' , 'B1' , 'B2' , 'B3' ] } )
left
key1
key2
A
B
0
K0
K0
A0
B0
1
K0
K1
A1
B1
2
K1
K0
A2
B2
3
K2
K1
A3
B3
right= pd. DataFrame( { 'key1' : [ 'K0' , 'K1' , 'K1' , 'K2' ] ,
'key2' : [ 'K0' , 'K0' , 'K0' , 'K0' ] ,
'C' : [ 'Co' , 'C1' , 'C2' , 'C3' ] ,
'D' : [ 'DO' , 'D1' , 'D2' , 'D3' ] } )
right
key1
key2
C
D
0
K0
K0
Co
DO
1
K1
K0
C1
D1
2
K1
K0
C2
D2
3
K2
K0
C3
D3
result = pd. merge( left, right, on= [ 'key1' , 'key2' ] , how= "inner" )
result
key1
key2
A
B
C
D
0
K0
K0
A0
B0
Co
DO
1
K1
K0
A2
B2
C1
D1
2
K1
K0
A2
B2
C2
D2
result_left = pd. merge( left, right, on= [ 'key1' , 'key2' ] , how= "left" )
result_left
key1
key2
A
B
C
D
0
K0
K0
A0
B0
Co
DO
1
K0
K1
A1
B1
NaN
NaN
2
K1
K0
A2
B2
C1
D1
3
K1
K0
A2
B2
C2
D2
4
K2
K1
A3
B3
NaN
NaN
result_right = pd. merge( left, right, on= [ 'key1' , 'key2' ] , how= "right" )
result_right
key1
key2
A
B
C
D
0
K0
K0
A0
B0
Co
DO
1
K1
K0
A2
B2
C1
D1
2
K1
K0
A2
B2
C2
D2
3
K2
K0
NaN
NaN
C3
D3
result_outer = pd. merge( left, right, on= [ 'key1' , 'key2' ] , how= "outer" )
result_outer
key1
key2
A
B
C
D
0
K0
K0
A0
B0
Co
DO
1
K0
K1
A1
B1
NaN
NaN
2
K1
K0
A2
B2
C1
D1
3
K1
K0
A2
B2
C2
D2
4
K2
K1
A3
B3
NaN
NaN
5
K2
K0
NaN
NaN
C3
D3
4.9 高級處理-交叉表與透視表
4.9.2 使用crosstab(交叉表)實現
data = pd. read_excel( "./datas/szfj_baoan.xls" )
data
district
roomnum
hall
AREA
C_floor
floor_num
school
subway
per_price
0
baoan
3
2
89.3
middle
31
0
0
7.0773
1
baoan
4
2
127.0
high
31
0
0
6.9291
2
baoan
1
1
28.0
low
39
0
0
3.9286
3
baoan
1
1
28.0
middle
30
0
0
3.3568
4
baoan
2
2
78.0
middle
8
1
1
5.0769
...
...
...
...
...
...
...
...
...
...
1246
baoan
4
2
89.3
low
8
0
0
4.2553
1247
baoan
2
1
67.0
middle
30
0
0
3.8060
1248
baoan
2
2
67.4
middle
29
1
0
5.3412
1249
baoan
2
2
73.1
low
15
1
0
5.9508
1250
baoan
3
2
86.2
middle
32
0
1
4.5244
1251 rows × 9 columns
time = "2020-06-23"
date = pd. to_datetime( time)
date
Timestamp('2020-06-23 00:00:00')
type ( date)
pandas._libs.tslibs.timestamps.Timestamp
date. year
2020
date. month
6
data[ "week" ] = date. weekday
data. drop( "week" , axis= 1 , inplace= True )
data
district
roomnum
hall
AREA
C_floor
floor_num
school
subway
per_price
0
baoan
3
2
89.3
middle
31
0
0
7.0773
1
baoan
4
2
127.0
high
31
0
0
6.9291
2
baoan
1
1
28.0
low
39
0
0
3.9286
3
baoan
1
1
28.0
middle
30
0
0
3.3568
4
baoan
2
2
78.0
middle
8
1
1
5.0769
...
...
...
...
...
...
...
...
...
...
1246
baoan
4
2
89.3
low
8
0
0
4.2553
1247
baoan
2
1
67.0
middle
30
0
0
3.8060
1248
baoan
2
2
67.4
middle
29
1
0
5.3412
1249
baoan
2
2
73.1
low
15
1
0
5.9508
1250
baoan
3
2
86.2
middle
32
0
1
4.5244
1251 rows × 9 columns
data[ "feature" ] = np. where( data[ "per_price" ] > 5.0000 , 1 , 0 )
data
district
roomnum
hall
AREA
C_floor
floor_num
school
subway
per_price
feature
0
baoan
3
2
89.3
middle
31
0
0
7.0773
1
1
baoan
4
2
127.0
high
31
0
0
6.9291
1
2
baoan
1
1
28.0
low
39
0
0
3.9286
0
3
baoan
1
1
28.0
middle
30
0
0
3.3568
0
4
baoan
2
2
78.0
middle
8
1
1
5.0769
1
...
...
...
...
...
...
...
...
...
...
...
1246
baoan
4
2
89.3
low
8
0
0
4.2553
0
1247
baoan
2
1
67.0
middle
30
0
0
3.8060
0
1248
baoan
2
2
67.4
middle
29
1
0
5.3412
1
1249
baoan
2
2
73.1
low
15
1
0
5.9508
1
1250
baoan
3
2
86.2
middle
32
0
1
4.5244
0
1251 rows × 10 columns
data0 = pd. crosstab( data[ "floor_num" ] , data[ "feature" ] )
data0
feature
0
1
floor_num
1
6
8
3
0
1
4
0
10
6
3
7
7
16
25
8
19
32
9
2
11
10
4
9
11
8
11
12
1
3
13
4
20
14
0
5
15
8
33
16
9
19
17
20
21
18
17
35
19
11
5
20
2
4
21
1
6
22
0
1
23
4
8
24
10
26
25
4
37
26
9
57
27
5
38
28
6
35
29
26
68
30
30
78
31
4
151
32
21
126
33
34
20
34
1
5
35
1
2
36
0
4
37
1
1
38
0
1
39
5
10
40
1
3
43
0
1
44
0
6
45
0
7
47
0
1
50
0
1
51
0
3
52
0
2
53
0
1
data0. sum ( axis= 1 )
floor_num
1 14
3 1
4 10
6 10
7 41
8 51
9 13
10 13
11 19
12 4
13 24
14 5
15 41
16 28
17 41
18 52
19 16
20 6
21 7
22 1
23 12
24 36
25 41
26 66
27 43
28 41
29 94
30 108
31 155
32 147
33 54
34 6
35 3
36 4
37 2
38 1
39 15
40 4
43 1
44 6
45 7
47 1
50 1
51 3
52 2
53 1
dtype: int64
data0. div( data0. sum ( axis= 1 ) , axis= 0 )
feature
0
1
floor_num
1
0.428571
0.571429
3
0.000000
1.000000
4
0.000000
1.000000
6
0.300000
0.700000
7
0.390244
0.609756
8
0.372549
0.627451
9
0.153846
0.846154
10
0.307692
0.692308
11
0.421053
0.578947
12
0.250000
0.750000
13
0.166667
0.833333
14
0.000000
1.000000
15
0.195122
0.804878
16
0.321429
0.678571
17
0.487805
0.512195
18
0.326923
0.673077
19
0.687500
0.312500
20
0.333333
0.666667
21
0.142857
0.857143
22
0.000000
1.000000
23
0.333333
0.666667
24
0.277778
0.722222
25
0.097561
0.902439
26
0.136364
0.863636
27
0.116279
0.883721
28
0.146341
0.853659
29
0.276596
0.723404
30
0.277778
0.722222
31
0.025806
0.974194
32
0.142857
0.857143
33
0.629630
0.370370
34
0.166667
0.833333
35
0.333333
0.666667
36
0.000000
1.000000
37
0.500000
0.500000
38
0.000000
1.000000
39
0.333333
0.666667
40
0.250000
0.750000
43
0.000000
1.000000
44
0.000000
1.000000
45
0.000000
1.000000
47
0.000000
1.000000
50
0.000000
1.000000
51
0.000000
1.000000
52
0.000000
1.000000
53
0.000000
1.000000
data_percent = data0. div( data0. sum ( axis= 1 ) , axis= 0 )
data_percent
feature
0
1
floor_num
1
0.428571
0.571429
3
0.000000
1.000000
4
0.000000
1.000000
6
0.300000
0.700000
7
0.390244
0.609756
8
0.372549
0.627451
9
0.153846
0.846154
10
0.307692
0.692308
11
0.421053
0.578947
12
0.250000
0.750000
13
0.166667
0.833333
14
0.000000
1.000000
15
0.195122
0.804878
16
0.321429
0.678571
17
0.487805
0.512195
18
0.326923
0.673077
19
0.687500
0.312500
20
0.333333
0.666667
21
0.142857
0.857143
22
0.000000
1.000000
23
0.333333
0.666667
24
0.277778
0.722222
25
0.097561
0.902439
26
0.136364
0.863636
27
0.116279
0.883721
28
0.146341
0.853659
29
0.276596
0.723404
30
0.277778
0.722222
31
0.025806
0.974194
32
0.142857
0.857143
33
0.629630
0.370370
34
0.166667
0.833333
35
0.333333
0.666667
36
0.000000
1.000000
37
0.500000
0.500000
38
0.000000
1.000000
39
0.333333
0.666667
40
0.250000
0.750000
43
0.000000
1.000000
44
0.000000
1.000000
45
0.000000
1.000000
47
0.000000
1.000000
50
0.000000
1.000000
51
0.000000
1.000000
52
0.000000
1.000000
53
0.000000
1.000000
data_percent. plot( kind= "bar" , stacked= True )
<matplotlib.axes._subplots.AxesSubplot at 0x24719dd7488>
data_percent = data0. div( data0. sum ( axis= 1 ) , axis= 0 )
data_percent
<tr>
<th>50</th>
<td>0.000000</td>
<td>1.000000</td>
</tr>
<tr>
<th>51</th>
<td>0.000000</td>
<td>1.000000</td>
</tr>
<tr>
<th>52</th>
<td>0.000000</td>
<td>1.000000</td>
</tr>
<tr>
<th>53</th>
<td>0.000000</td>
<td>1.000000</td>
</tr>
feature
0
1
floor_num
1
0.428571
0.571429
3
0.000000
1.000000
4
0.000000
1.000000
6
0.300000
0.700000
7
0.390244
0.609756
8
0.372549
0.627451
9
0.153846
0.846154
10
0.307692
0.692308
11
0.421053
0.578947
12
0.250000
0.750000
13
0.166667
0.833333
14
0.000000
1.000000
15
0.195122
0.804878
16
0.321429
0.678571
17
0.487805
0.512195
18
0.326923
0.673077
19
0.687500
0.312500
20
0.333333
0.666667
21
0.142857
0.857143
22
0.000000
1.000000
23
0.333333
0.666667
24
0.277778
0.722222
25
0.097561
0.902439
26
0.136364
0.863636
27
0.116279
0.883721
28
0.146341
0.853659
29
0.276596
0.723404
30
0.277778
0.722222
4.9.3使用pivot_table(透視表)實現
data. pivot_table( [ "feature" ] , index= [ "floor_num" ] )
...
feature
floor_num
1
0.571429
3
1.000000
4
1.000000
6
0.700000
50
1.000000
51
1.000000
52
1.000000
53
1.000000
4.10 高級處理-分組與聚合
4.10.2 分組與聚合API
col = pd. DataFrame( { 'color' : [ 'white' , 'red' , 'green' , 'red' , 'green' ] ,
'object' : [ "pen" , "pencil" , "pencil" , "ashtray" , "pen" ] ,
'price1' : [ 4.56 , 4.20 , 1.30 , 0.56 , 2.75 ] ,
'price2' : [ 4.75 , 4.12 , 1.68 , 0.75 , 3.15 ] } )
col
color
object
price1
price2
0
white
pen
4.56
4.75
1
red
pencil
4.20
4.12
2
green
pencil
1.30
1.68
3
red
ashtray
0.56
0.75
4
green
pen
2.75
3.15
col. groupby( by= "color" ) [ "price1" ] . max ( )
color
green 2.75
red 4.20
white 4.56
Name: price1, dtype: float64
col[ 'price1' ] . groupby( col[ "color" ] )
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002471D178D08>
col[ 'price1' ] . groupby( col[ "color" ] ) . max ( )
color
green 2.75
red 4.20
white 4.56
Name: price1, dtype: float64
4.11 綜合案例
movie = pd. read_csv( "./datas/IMDB-Movie-Data.csv" )
movie
Rank
Title
Genre
Description
Director
Actors
Year
Runtime (Minutes)
Rating
Votes
Revenue (Millions)
Metascore
0
1
Guardians of the Galaxy
Action,Adventure,Sci-Fi
A group of intergalactic criminals are forced ...
James Gunn
Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
2014
121
8.1
757074
333.13
76.0
1
2
Prometheus
Adventure,Mystery,Sci-Fi
Following clues to the origin of mankind, a te...
Ridley Scott
Noomi Rapace, Logan Marshall-Green, Michael Fa...
2012
124
7.0
485820
126.46
65.0
2
3
Split
Horror,Thriller
Three girls are kidnapped by a man with a diag...
M. Night Shyamalan
James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
2016
117
7.3
157606
138.12
62.0
3
4
Sing
Animation,Comedy,Family
In a city of humanoid animals, a hustling thea...
Christophe Lourdelet
Matthew McConaughey,Reese Witherspoon, Seth Ma...
2016
108
7.2
60545
270.32
59.0
4
5
Suicide Squad
Action,Adventure,Fantasy
A secret government agency recruits some of th...
David Ayer
Will Smith, Jared Leto, Margot Robbie, Viola D...
2016
123
6.2
393727
325.02
40.0
...
...
...
...
...
...
...
...
...
...
...
...
...
995
996
Secret in Their Eyes
Crime,Drama,Mystery
A tight-knit team of rising investigators, alo...
Billy Ray
Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...
2015
111
6.2
27585
NaN
45.0
996
997
Hostel: Part II
Horror
Three American college students studying abroa...
Eli Roth
Lauren German, Heather Matarazzo, Bijou Philli...
2007
94
5.5
73152
17.54
46.0
997
998
Step Up 2: The Streets
Drama,Music,Romance
Romantic sparks occur between two dance studen...
Jon M. Chu
Robert Hoffman, Briana Evigan, Cassie Ventura,...
2008
98
6.2
70699
58.01
50.0
998
999
Search Party
Adventure,Comedy
A pair of friends embark on a mission to reuni...
Scot Armstrong
Adam Pally, T.J. Miller, Thomas Middleditch,Sh...
2014
93
5.6
4881
NaN
22.0
999
1000
Nine Lives
Comedy,Family,Fantasy
A stuffy businessman finds himself trapped ins...
Barry Sonnenfeld
Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...
2016
87
5.3
12435
19.64
11.0
1000 rows × 12 columns
movie[ "Rating" ] . mean( )
6.723200000000003
movie[ "Director" ]
0 James Gunn
1 Ridley Scott
2 M. Night Shyamalan
3 Christophe Lourdelet
4 David Ayer
...
995 Billy Ray
996 Eli Roth
997 Jon M. Chu
998 Scot Armstrong
999 Barry Sonnenfeld
Name: Director, Length: 1000, dtype: object
np. unique( movie[ "Director" ] )
array(['Aamir Khan', 'Abdellatif Kechiche', 'Adam Leon', 'Adam McKay',
'Adam Shankman', 'Adam Wingard', 'Afonso Poyart', 'Aisling Walsh',
'Akan Satayev', 'Akiva Schaffer', 'Alan Taylor', 'Albert Hughes',
'Alejandro Amenábar', 'Alejandro González Iñárritu',
...
'Tomas Alfredson', 'Tony Gilroy', 'Tony Scott', 'Travis Knight',
'Tyler Shields', 'Wally Pfister', 'Walt Dohrn', 'Walter Hill',
'Warren Beatty', 'Werner Herzog', 'Wes Anderson', 'Wes Ball',
'Wes Craven', 'Whit Stillman', 'Will Gluck', 'Will Slocombe',
'William Brent Bell', 'William Oldroyd', 'Woody Allen',
'Xavier Dolan', 'Yimou Zhang', 'Yorgos Lanthimos', 'Zack Snyder',
'Zackary Adler'], dtype=object)
np. unique( movie[ "Director" ] ) . size
644
movie[ "Rating" ] . plot( kind= "hist" , figsize= ( 20 , 8 ) , fontsize= 40 )
<matplotlib.axes._subplots.AxesSubplot at 0x2471ce18708>
import matplotlib. pyplot as plt
plt. figure( figsize= ( 20 , 8 ) , dpi= 100 )
plt. hist( movie[ "Rating" ] , 20 )
plt. xticks( np. linspace( movie[ "Rating" ] . min ( ) , movie[ "Rating" ] . max ( ) , 21 ) )
plt. grid( linestyle= "--" , alpha= 0.5 )
plt. show( )
movie[ "Rating" ]
0 8.1
1 7.0
2 7.3
3 7.2
4 6.2
...
995 6.2
996 5.5
997 6.2
998 5.6
999 5.3
Name: Rating, Length: 1000, dtype: float64
movie_genre = [ i. split( "," ) for i in movie[ "Genre" ] ]
movie_genre
[['Action', 'Adventure', 'Sci-Fi'],
['Adventure', 'Mystery', 'Sci-Fi'],
['Horror', 'Thriller'],
['Animation', 'Comedy', 'Family'],
['Action', 'Adventure', 'Fantasy'],
...
['Horror'],
['Drama', 'Music', 'Romance'],
['Adventure', 'Comedy'],
['Comedy', 'Family', 'Fantasy']]
[ j for i in movie_genre for j in i]
['Action',
'Adventure',
'Sci-Fi',
'Adventure',
'Mystery',
'Sci-Fi',
...
'Animation',
'Action',
'Adventure',
'Action',
'Adventure',
'Drama',
...]
movie_class = np. unique( [ j for i in movie_genre for j in i] )
movie_class
array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music',
'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller',
'War', 'Western'], dtype='<U9')
len ( movie_class)
20
count = pd. DataFrame( np. zeros( shape= [ 1000 , 20 ] , dtype= "int32" ) , columns= movie_class)
count. head( )
Action
Adventure
Animation
Biography
Comedy
Crime
Drama
Family
Fantasy
History
Horror
Music
Musical
Mystery
Romance
Sci-Fi
Sport
Thriller
War
Western
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
3
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
4
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
count. loc[ 0 , movie_genre[ 0 ] ]
Action 0
Adventure 0
Sci-Fi 0
Name: 0, dtype: int32
movie_genre[ 0 ]
['Action', 'Adventure', 'Sci-Fi']
for i in range ( 1000 ) :
count. loc[ i, movie_genre[ i] ] = 1
count
Action
Adventure
Animation
Biography
Comedy
Crime
Drama
Family
Fantasy
History
Horror
Music
Musical
Mystery
Romance
Sci-Fi
Sport
Thriller
War
Western
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
3
0
0
1
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
4
1
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
995
0
0
0
0
0
1
1
0
0
0
0
0
0
1
0
0
0
0
0
0
996
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
997
0
0
0
0
0
0
1
0
0
0
0
1
0
0
1
0
0
0
0
0
998
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
999
0
0
0
0
1
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
1000 rows × 20 columns
count. sum ( axis= 0 )
Action 303
Adventure 259
Animation 49
Biography 81
Comedy 279
Crime 150
Drama 513
Family 51
Fantasy 101
History 29
Horror 119
Music 16
Musical 5
Mystery 106
Romance 141
Sci-Fi 120
Sport 18
Thriller 195
War 13
Western 7
dtype: int64
count. sum ( axis= 0 ) . sort_values( ascending= False )
Drama 513
Action 303
Comedy 279
Adventure 259
Thriller 195
Crime 150
Romance 141
Sci-Fi 120
Horror 119
Mystery 106
Fantasy 101
Biography 81
Family 51
Animation 49
History 29
Sport 18
Music 16
War 13
Western 7
Musical 5
dtype: int64
count. sum ( axis= 0 ) . sort_values( ascending= False ) . plot( kind= "bar" , fontsize= 20 , figsize= ( 20 , 9 ) , colormap= "cool" )
<matplotlib.axes._subplots.AxesSubplot at 0x2472450c1c8>