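Python Pandas: A Summary of DataFrame Usage

DataFrame: a table-like data structure

Comparing DataFrame with arrays and Series as you study makes its usage and characteristics clearer.

This article summarizes the characteristics and usage of DataFrame, the two-dimensional (multi-dimensional) data structure in the pandas package.
See also: A Quick Tour of pandas Usage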

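3.1 Adding Data

3.1.1 Creating a DataFrame (Object Creation)

numpy.random.randn(m,n): returns an m-by-n array of samples drawn from the standard normal distribution;
numpy.random.rand(m,n): returns an m-by-n array of samples drawn uniformly from [0,1).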
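A minimal sketch of the difference between the two functions. The seed value and the sample size are arbitrary choices made here for illustration only.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, used only to make the run repeatable

uniform = np.random.rand(1000, 4)   # uniform samples in [0, 1)
normal = np.random.randn(1000, 4)   # standard normal samples, can be negative

print(uniform.shape, uniform.min() >= 0, uniform.max() < 1)
print(normal.shape, (normal < 0).any())
```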

import pandas as pd
import numpy as np
# Create a DataFrame from a NumPy array
dates=pd.date_range('2018-09-01',periods=7)
dF1=pd.DataFrame(np.random.rand(7,4),index=dates) # 7x4 array of random values drawn from [0,1)
>>>
		0		1		2		3
2018-09-01	0.445283	0.798458	0.818208	0.340356
2018-09-02	0.249172	0.535308	0.811825	0.224133
2018-09-03	0.466948	0.178802	0.997567	0.361670
2018-09-04	0.720670	0.407122	0.120310	0.180888
2018-09-05	0.545400	0.169919	0.171649	0.030347
2018-09-06	0.553405	0.013866	0.582740	0.030837
2018-09-07	0.185981	0.137448	0.817721	0.768875

# Create a DataFrame from a dict
dataDict={'A':1.,
          'B':pd.Timestamp('20180901'),
          'C':pd.Series(1,index=range(4),dtype='float'),
          'D':np.array([3]*4,dtype='int'),
          'E':pd.Categorical(['test','train','test','train']),
          'F':'foo'
         }
dF2=pd.DataFrame(dataDict)
>>>
	A	B		C	D	E	F
0	1.0	2018-09-01	1.0	3	test	foo
1	1.0	2018-09-01	1.0	3	train	foo
2	1.0	2018-09-01	1.0	3	test	foo
3	1.0	2018-09-01	1.0	3	train	foo

3.1.2 Combining Data

Concat/Merge/Append
Concat: joins DataFrames together (along rows or columns)
Merge: works like a JOIN in SQL
Append: appends rows to a DataFrame

# Concat: join DataFrames together (along rows or columns)
dF=pd.DataFrame(np.random.randn(10,4))
>>>

	0		1		2		3
0	-1.135930	-0.371505	0.349293	-2.788323
1	-0.505594	0.012753	0.539757	0.044460
2	1.208134	-0.436352	1.361564	-0.777053
3	-0.909025	0.929461	0.411863	0.866106
4	-0.300255	-0.023755	-1.382157	0.042096
5	0.335969	-0.176301	0.751841	-0.016906
6	0.545919	1.202155	0.705825	-2.305620
7	-0.820600	-2.588532	-0.475357	0.475708
8	-0.097844	0.141700	0.322873	0.586568
9	0.941772	0.789850	-1.017382	-0.762623

# Split the DataFrame into pieces, then concatenate them again
pieces1=dF[:3]
>>>
	0		1		2		3
0	-1.135930	-0.371505	0.349293	-2.788323
1	-0.505594	0.012753	0.539757	0.044460
2	1.208134	-0.436352	1.361564	-0.777053

pieces2=dF[3:7]
pieces3=dF[7:] 
pd.concat([pieces1,pieces2,pieces3],axis=0) # concatenate along rows
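The text mentions that concat can also work along columns. The sketch below shows both directions on a small random frame; the variable names `rebuilt`, `left_cols`, `right_cols` and `side_by_side` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

dF = pd.DataFrame(np.random.randn(10, 4))

# Row-wise: split into three pieces and stack them back together
rebuilt = pd.concat([dF[:3], dF[3:7], dF[7:]], axis=0)

# Column-wise: split the columns in two and place them side by side
left_cols = dF.iloc[:, :2]
right_cols = dF.iloc[:, 2:]
side_by_side = pd.concat([left_cols, right_cols], axis=1)

print(rebuilt.shape, side_by_side.shape)
```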

# Merge (works like a JOIN in SQL)
left=pd.DataFrame({'key':['foo','foo'],'value':[1,2]})
>>>
	key	value
0	foo	1
1	foo	2

right=pd.DataFrame({'key':['foo','foo'],'value':[4,5]})
>>>
	key	value
0	foo	4
1	foo	5

# Join on key
pd.merge(left,right,on='key')
>>>
	key	value_x	value_y
0	foo	1	4
1	foo	1	5
2	foo	2	4
3	foo	2	5
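Because merge behaves like a SQL join, the `how` parameter selects the join type. The following sketch uses keys that only partly overlap so the difference is visible; the frames and column names (`lval`, `rval`) are made up for this example.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')                    # keys present in both (default)
left_join = pd.merge(left, right, on='key', how='left')    # every key from left
outer = pd.merge(left, right, on='key', how='outer')       # keys from either side

print(inner)
print(left_join)
print(outer)
```

Rows without a match on the other side get NaN in the columns that come from that side.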

Related: Usage of the merge() function in Python

# Append: append rows to a DataFrame
df=pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D']
                ,index=range(1,9))
>>>
	A		B		C		D
1	0.048111	-0.973745	0.150854	1.839696
2	-0.718782	-0.858483	0.824083	-1.042301
3	-1.197431	1.129919	-0.041504	0.457233
4	-1.273139	2.535986	-0.173237	-0.504907
5	-0.210177	-1.958532	-0.076133	-0.569886
6	0.706548	-1.267755	0.908618	-0.142500
7	1.977968	-0.273628	0.160981	-0.574506
8	-0.034995	0.375605	0.105764	0.317471

s=df.iloc[0] # extract the first row
>>>
A    0.048111
B   -0.973745
C    0.150854
D    1.839696
Name: 1, dtype: float64

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
# If ignore_index is True the result gets a new 0..n-1 index, otherwise the original index values are kept
pd.concat([df,s.to_frame().T],ignore_index=False)
>>>
	A		B		C		D
1	0.048111	-0.973745	0.150854	1.839696
2	-0.718782	-0.858483	0.824083	-1.042301
3	-1.197431	1.129919	-0.041504	0.457233
4	-1.273139	2.535986	-0.173237	-0.504907
5	-0.210177	-1.958532	-0.076133	-0.569886
6	0.706548	-1.267755	0.908618	-0.142500
7	1.977968	-0.273628	0.160981	-0.574506
8	-0.034995	0.375605	0.105764	0.317471
1	0.048111	-0.973745	0.150854	1.839696

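3.1.3 Importing/Exporting Data (Getting Data In/Out)

Csv/Excel

# Csv
# Export to a Csv file; a bare file name is saved in the same directory as the notebook
# The index is written by default (index=True); pass index=False to leave it out
df.to_csv('foo.csv')
# Import the Csv file, using its first column as the index
fileDf=pd.read_csv('foo.csv',index_col=0)

# Excel
# Export to an xlsx file
df.to_excel('foo.xlsx',sheet_name='Sheet1')
# Import the data from a specific sheet
fileDf=pd.read_excel('Python數據/朝陽醫院2018年銷售數據.xlsx',sheet_name='Sheet1')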
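A small CSV round trip, sketched under the assumption that a temporary directory is acceptable for the demo. The frame contents and the variable names `restored` and `no_index` are illustrative only.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3.5, 4.5]}, index=['x', 'y'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'foo.csv')
    df.to_csv(path)                             # index is written as the first column
    restored = pd.read_csv(path, index_col=0)   # read that column back as the index
    df.to_csv(path, index=False)                # write again without the index
    no_index = pd.read_csv(path)                # gets a fresh 0..n-1 index

print(restored)
print(no_index)
```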

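3.2 Viewing Data

3.2.1 Viewing Data

Three steps for a first look at the data:
head(): shows the first few rows, to see what the data looks like
info(): shows the data types and how many values are missing
describe(): shows descriptive statistics, giving a rough picture of the distribution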

# Import the data
fileDf=pd.read_excel('Python數據/朝陽醫院2018年銷售數據.xlsx','Sheet1')
# fileDf.head(n) shows the first n rows (5 by default)
fileDf.head()
fileDf.info()
fileDf.describe()

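In addition, fileDf.shape and fileDf.dtypes show the dimensions of the data and the data type of each column. (Note that these are attributes, so they are used without parentheses.)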
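The attribute-versus-method distinction can be seen on a tiny frame. `sales` below is a hypothetical stand-in for the Excel data, with made-up English column names.

```python
import pandas as pd

# Hypothetical stand-in data; not the article's sales file
sales = pd.DataFrame({
    'item': ['a', 'b', 'a'],
    'quantity': [3, 1, 2],
    'amount': [30.0, 12.5, 20.0],
})

print(sales.shape)    # attribute: (number of rows, number of columns)
print(sales.dtypes)   # attribute: dtype of each column
sales.info()          # method: needs parentheses
```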

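# Transpose the data
fileDf.T
# Sort by a given column
fileDf.sort_values('銷售數量',ascending=False) # sort by '銷售數量' (sales quantity) in descending order
# Look at how the values of a column are distributed
fileDf['商品名稱'].value_counts()  # count of each '商品名稱' (product name)
fileDf['商品名稱'].value_counts(normalize=True)  # share of each value

Related: Usage of the sort_values() function in pandas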

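3.2.2 Selecting Data (Selection)

# Select directly with [] (rows by position, columns by name)
fileDf['商品名稱']  # get a column by its name
fileDf[0:3] # slice the rows at positions 0 to 2, same as fileDf.iloc[0:3]

# Use loc to select by label; rows and columns can be selected together
fileDf.loc[0:3,['商品名稱','銷售數量']] # '商品名稱' and '銷售數量' for index labels 0 through 3 (end label included)
fileDf.loc[1] # all data for the row whose index label is 1

# Use iloc to select by position
fileDf.iloc[1] # all data in the second row
fileDf.iloc[1:3,[0,3]] # second and third rows, first and fourth columns

# Select with a boolean condition
fileDf[fileDf['銷售數量']>30] # rows where '銷售數量' is greater than 30

# isin() works like IN in SQL
fileDf[fileDf['商品名稱'].isin(['感康'])] # all rows where '商品名稱' is 感康

# All rows where '商品名稱' is 感康 and '銷售數量' is greater than 5
fileDf[fileDf['商品名稱'].isin(['感康'])&(fileDf['銷售數量']>5)]

# Negate isin() by putting ~ in front of the condition, e.g. rows where '商品名稱' is not 感康
fileDf[~fileDf['商品名稱'].isin(['感康'])]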
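With the default 0..n-1 index, labels and positions coincide, which hides the difference between loc and iloc. The sketch below uses a made-up frame with a non-default index to make it visible.

```python
import pandas as pd

# Illustrative data with index labels that differ from positions
df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'], 'qty': [10, 40, 25, 60]},
                  index=[10, 20, 30, 40])

by_label = df.loc[20:30, ['name', 'qty']]   # labels 20 through 30, end label included
by_position = df.iloc[1:3, [0, 1]]          # positions 1 and 2, end position excluded

# Combine conditions with & and negate with ~; each condition goes in parentheses
mask = (df['qty'] > 20) & ~df['name'].isin(['d'])

print(by_label.equals(by_position))
print(df[mask])
```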

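3.2.3 Operating on Data (Operations)

stats/Apply
Apply: used on a DataFrame to operate on whole rows or columns; its element-level counterpart is map, which is used on a Series (like Python's built-in map)

# stats
fileDf.mean(numeric_only=True) # mean of each numeric column
fileDf['實收金額'].mean() # mean of one column ('實收金額', amount received)

# apply
df.apply(lambda x:x.max()-x.min()) # range (max minus min) of each column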
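To contrast apply with map, here is a short sketch on a made-up numeric frame: apply receives a whole column (or a whole row with axis=1), while Series.map receives one element at a time.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 3], 'B': [10, 20, 40]})

col_range = df.apply(lambda x: x.max() - x.min())           # one result per column
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)   # one result per row
doubled = df['A'].map(lambda v: v * 2)                      # one result per element

print(col_range)
print(row_range)
print(doubled)
```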

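3.2.4 Grouping

Works like GROUP BY in SQL

# Group by product name and sum '銷售數量' (sales quantity), '應收金額' (amount receivable) and '實收金額' (amount received) for each product
fileDf.groupby('商品名稱')[['銷售數量','應收金額','實收金額']].sum()
# Group by purchase time and product name to get the same totals per day and per product
fileDf.groupby(['購藥時間','商品名稱'])[['銷售數量','應收金額','實收金額']].sum()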
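A self-contained sketch of the same grouping pattern on hypothetical data (the English column names are stand-ins for the article's sales columns). It also shows named aggregation with agg, which is one way to compute different statistics per column.

```python
import pandas as pd

# Hypothetical stand-in for the sales data
sales = pd.DataFrame({
    'item': ['a', 'b', 'a', 'b'],
    'day': ['d1', 'd1', 'd2', 'd2'],
    'quantity': [3, 1, 2, 4],
    'amount': [30.0, 12.0, 20.0, 48.0],
})

per_item = sales.groupby('item')[['quantity', 'amount']].sum()
per_day_item = sales.groupby(['day', 'item'])['quantity'].sum()
summary = sales.groupby('item').agg(total_qty=('quantity', 'sum'),
                                    mean_amount=('amount', 'mean'))

print(per_item)
print(per_day_item)
print(summary)
```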

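3.3 Modifying Data

3.3.1 Handling Missing Values (Missing Data)

pandas mainly uses np.nan to represent missing values (NaN); missing values are generally left out of computations

# Drop rows that contain any missing value
fileDf.dropna(how='any')
# Fill missing values
fileDf.fillna(value=5) # fill with a specific value
# Find the rows where '商品名稱' is missing
fileDf[fileDf['商品名稱'].isnull()]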
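The following sketch, on a small made-up frame, shows how how='any' differs from how='all' and that aggregations such as mean skip NaN by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, np.nan, 6.0]})

any_dropped = df.dropna(how='any')   # drop rows with at least one NaN
all_dropped = df.dropna(how='all')   # drop rows only if every value is NaN
filled = df.fillna(value=5)          # replace each NaN with 5
col_means = df.mean()                # NaN values are skipped

print(any_dropped)
print(all_dropped)
print(filled)
print(col_means)
```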

3.3.2 Reshaping

Pivot Tables: like pivot tables in Excel, they rearrange rows and columns

# Create a DataFrame from a dict
df=pd.DataFrame({'A':['one','one','two','three']*3,
                'B':['A','B','C']*4,
                'C':['foo','foo','foo','bar','bar','bar']*2,
                'D':np.random.randn(12),
                'E':np.random.randint(0,5,12)}
)
print(df)
>>>
        A  B    C         D  E
0     one  A  foo -0.616543  4
1     one  B  foo -0.146041  1
2     two  C  foo -0.562578  4
3   three  A  bar -0.299173  0
4     one  B  bar -0.550978  1
5     one  C  bar -0.150658  1
6     two  A  foo  1.598310  3
7   three  B  foo  1.588566  2
8     one  C  foo  0.414795  0
9     one  A  bar  0.901496  3
10    two  B  bar  0.326600  0
11  three  C  bar -2.296521  0

# Look at the values of D for each combination of A, B and C
pd.pivot_table(df,values='D',index=['A','B'],columns=['C'])
>>>
C             bar       foo
A     B
one   A  0.901496 -0.616543
      B -0.550978 -0.146041
      C -0.150658  0.414795
three A -0.299173       NaN
      B       NaN  1.588566
      C -2.296521       NaN
two   A       NaN  1.598310
      B  0.326600       NaN
      C       NaN -0.562578

Reshaping and pivoting axes:
stack: "rotates" the columns of the data into rows
unstack: "rotates" the rows of the data into columns
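A minimal sketch of the two operations on a made-up 2x2 frame: stack moves the column labels into an inner level of the row index, and unstack moves that level back to the columns.

```python
import pandas as pd

wide = pd.DataFrame({'bar': [1, 3], 'foo': [2, 4]}, index=['one', 'two'])

long_form = wide.stack()      # Series indexed by (row label, column label)
back = long_form.unstack()    # inner index level becomes columns again

print(long_form)
print(back)
```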

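3.3.3 Modifying Data in Specific Columns

To change values, first select them with loc, then assign;
to change the index, assign directly to df.index;
to change the column names, assign directly to df.columns.

# Create a DataFrame from a dict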
df=pd.DataFrame({'A':['one','one','two','three']*3,
                'B':['A','B','C']*4,
                'C':['foo','foo','foo','bar','bar','bar']*2,
                'D':np.random.randn(12),
                'E':np.random.randint(0,5,12)}
)
print(df)
>>>
        A  B    C         D  E
0     one  A  foo -0.362885  0
1     one  B  foo  0.185367  2
2     two  C  foo  0.487103  4
3   three  A  bar -0.724691  4
4     one  B  bar -1.725003  4
5     one  C  bar -0.011432  3
6     two  A  foo -0.471946  3
7   three  B  foo -0.057156  0
8     one  C  foo -0.220240  2
9     one  A  bar  0.687409  0
10    two  B  bar -0.640526  2
11  three  C  bar  0.900257  4

# Set E to 110 in the rows where column A is 'one' and column C is 'bar'
df.loc[(df['A']=='one')&(df['C']=='bar'),'E']=110
print(df)
>>>
        A  B    C         D    E
0     one  A  foo -0.362885    0
1     one  B  foo  0.185367    2
2     two  C  foo  0.487103    4
3   three  A  bar -0.724691    4
4     one  B  bar -1.725003  110
5     one  C  bar -0.011432  110
6     two  A  foo -0.471946    3
7   three  B  foo -0.057156    0
8     one  C  foo -0.220240    2
9     one  A  bar  0.687409  110
10    two  B  bar -0.640526    2
11  three  C  bar  0.900257    4

# Change the index to 1-12
df.index=range(1,13)
print(df)
>>>
        A  B    C         D    E
1     one  A  foo -0.362885    0
2     one  B  foo  0.185367    2
3     two  C  foo  0.487103    4
4   three  A  bar -0.724691    4
5     one  B  bar -1.725003  110
6     one  C  bar -0.011432  110
7     two  A  foo -0.471946    3
8   three  B  foo -0.057156    0
9     one  C  foo -0.220240    2
10    one  A  bar  0.687409  110
11    two  B  bar -0.640526    2
12  three  C  bar  0.900257    4

# Change the column names to E, D, C, B, A
df.columns=['E','D','C','B','A']
print(df)
>>>
        E  D    C         B    A
1     one  A  foo -0.362885    0
2     one  B  foo  0.185367    2
3     two  C  foo  0.487103    4
4   three  A  bar -0.724691    4
5     one  B  bar -1.725003  110
6     one  C  bar -0.011432  110
7     two  A  foo -0.471946    3
8   three  B  foo -0.057156    0
9     one  C  foo -0.220240    2
10    one  A  bar  0.687409  110
11    two  B  bar -0.640526    2
12  three  C  bar  0.900257    4

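There are two ways to change column names:
1. Rename all columns by assigning directly to df.columns
2. Rename specific columns with df.rename(columns={'a':'A'})

import pandas as pd
df = pd.DataFrame({'a':[1,2,3],'b':[1,2,3]})
>>>
   a  b
0  1  1
1  2  2
2  3  3

# 1. Rename columns a and b to A and B
df.columns = ['A','B']
# 2. Alternatively, rename only column a to A (rename returns a new DataFrame, so assign the result)
df = df.rename(columns={'a':'A'})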
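One detail worth a quick check is that rename leaves the original frame untouched unless the result is assigned. A short sketch, with `renamed` as an illustrative variable name:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3]})

renamed = df.rename(columns={'a': 'A'})   # new DataFrame with the renamed column

print(list(df.columns))        # original column names are unchanged
print(list(renamed.columns))   # only 'a' was renamed
```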

3.4 Time Series

# Create a time index of 10 timestamps, 1 second apart (recent pandas versions prefer the lowercase alias 's')
rng=pd.date_range('20180901',periods=10,freq='S')
>>>
DatetimeIndex(['2018-09-01 00:00:00', '2018-09-01 00:00:01',
               '2018-09-01 00:00:02', '2018-09-01 00:00:03',
               '2018-09-01 00:00:04', '2018-09-01 00:00:05',
               '2018-09-01 00:00:06', '2018-09-01 00:00:07',
               '2018-09-01 00:00:08', '2018-09-01 00:00:09'],
              dtype='datetime64[ns]', freq='S')
              
# Create a Series with the time index as its index
ts=pd.Series(np.random.randint(0,500,len(rng)),index=rng)
>>>
2018-09-01 00:00:00    249
2018-09-01 00:00:01    409
2018-09-01 00:00:02     85
2018-09-01 00:00:03     40
2018-09-01 00:00:04    157
2018-09-01 00:00:05    113
2018-09-01 00:00:06    152
2018-09-01 00:00:07    107
2018-09-01 00:00:08    259
2018-09-01 00:00:09    110
Freq: S, dtype: int64

# Sum over 1-minute intervals
ts.resample('1Min').sum()
>>>
2018-09-01    1681
Freq: T, dtype: int64
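With only 10 seconds of data, resampling to one minute gives a single bin. The sketch below uses two minutes of made-up data (every value is 1) so that two bins appear; the lowercase aliases 's' and '1min' are used here as an assumption about what the installed pandas version accepts.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2018-09-01', periods=120, freq='s')   # 120 seconds of timestamps
ts = pd.Series(np.ones(len(rng), dtype=int), index=rng)    # every value is 1

per_minute = ts.resample('1min').sum()   # one row per minute, summing its 60 values

print(per_minute)
```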

# Create a time index of 5 timestamps, 1 day apart
rng=pd.date_range('9/1/2018 00:00',periods=5,freq='D')
>>>
DatetimeIndex(['2018-09-01', '2018-09-02', '2018-09-03', '2018-09-04',
               '2018-09-05'],
              dtype='datetime64[ns]', freq='D')
              
# Create a Series with the time index as its index
ts=pd.Series(np.random.randn(len(rng)),index=rng)
>>>
2018-09-01   -0.955735
2018-09-02    0.004711
2018-09-03   -2.177743
2018-09-04   -0.263494
2018-09-05   -1.760504
Freq: D, dtype: float64

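4. Miscellaneous

4.1 Too many columns or rows to display in full

# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
# Set the maximum displayed width of a value to 100 (default is 50)
pd.set_option('display.max_colwidth',100)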
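set_option changes the setting for the rest of the session. When the wider display is needed only once, pd.option_context can limit the change to a block, as in this sketch (the 30-column frame is made up for the demo).

```python
import pandas as pd

wide = pd.DataFrame({f'col{i}': range(3) for i in range(30)})

before = pd.get_option('display.max_columns')
with pd.option_context('display.max_columns', None):
    inside = pd.get_option('display.max_columns')   # None means no column limit
    print(wide)                                     # all 30 columns are shown
after = pd.get_option('display.max_columns')        # previous value is restored
```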

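If you have any questions about the code or data in this article, feel free to leave a comment or send a private message.
Related articles:
A Summary of Array Usage in NumPy
A Summary of Series Usage in pandas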
