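pandas is a data analysis package built on top of NumPy that provides higher-level data structures and tools.

Just as NumPy is centered on the ndarray, pandas is built around two core data structures, Series and DataFrame, which correspond to a one-dimensional sequence and a two-dimensional table respectively. The conventional way to import pandas is: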

from pandas import Series, DataFrame
import pandas as pd

Series

A Series can be thought of as a fixed-length, ordered dictionary. Almost any one-dimensional data can be used to construct a Series object:

>>> s = Series([1,2,3.0,'abc'])
>>> s
0      1
1      2
2      3
3    abc
dtype: object

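Although dtype: object can hold several basic data types at once, it stores generic Python objects and is likely to hurt performance, so it is best to keep a single, homogeneous dtype.

A Series object has two main attributes, index and values, which are the left and right columns in the example above. Because a list was passed to the constructor, the index values are integers counting up from 0. If a dict-like key-value structure is passed instead, a Series with matching index-value pairs is produced. Alternatively, an index can be given explicitly as a keyword argument at construction time: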


>>> s = Series(data=[1,3,5,7],index=['a','b','x','y'])
>>> s
a    1
b    3
x    5
y    7
dtype: int64
>>> s.index
Index(['a', 'b', 'x', 'y'], dtype='object')
>>> s.values
array([1, 3, 5, 7], dtype=int64)

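The elements of a Series are built strictly according to the given index. This means that if the data argument holds key-value pairs, only the keys that appear in index are used, and if a key in index is missing from data, that key is still added, with NaN as its value.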
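The index-driven construction described above can be sketched as follows. The dictionary and its keys are made up for illustration.

```python
import pandas as pd

# Hypothetical input: a dict with keys 'a', 'b' and 'c'
data = {'a': 1, 'b': 3, 'c': 5}

# Only keys listed in index are used ('c' is ignored);
# 'z' is not in data, so it is added with NaN as its value
s = pd.Series(data, index=['a', 'b', 'z'])
print(s)
```

Because NaN is a floating-point value, the resulting Series is float64 even though the input values were integers.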

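Note that although the elements of a Series' index and values correspond to each other, this is not the same as a dictionary mapping. index and values are still stored as separate arrays, so a Series performs perfectly well.

The biggest advantage of a key-value structure like Series is that the index is aligned automatically when arithmetic is performed between Series.

In addition, both a Series object and its index have a name attribute: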


>>> s.name = 'a_series'
>>> s.index.name = 'the_index'
>>> s
the_index
a            1
b            3
x            5
y            7
Name: a_series, dtype: int64

DataFrame

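A DataFrame is a tabular data structure. It contains an ordered set of columns (analogous to the index), and each column can hold a different value type (unlike an ndarray, which has a single dtype). A DataFrame can largely be viewed as a collection of Series that share the same index.

The DataFrame constructor works much like the Series constructor, except that it accepts several one-dimensional data sources at once, each of which becomes a separate column: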


>>> data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
...         'year':[2000,2001,2002,2001,2002],
...         'pop':[1.5,1.7,3.6,2.4,2.9]}
>>> df = DataFrame(data)
>>> df
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

[5 rows x 3 columns]

Although the data argument looks like a dictionary, its keys do not act as the DataFrame's index. They become the "name" attribute of each column Series. The index generated here is still 0, 1, 2, 3, 4.

A fuller form of the DataFrame constructor is DataFrame(data=None, index=None, columns=None), where columns corresponds to the "name" values:

>>> df = DataFrame(data,index=['one','two','three','four','five'],
...                columns=['year','state','pop','debt'])
>>> df
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

[5 rows x 4 columns]
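As a self-contained sketch of the constructor call above, the snippet below uses a shortened, made-up version of the data and the `pd` alias rather than the bare `DataFrame` name.

```python
import pandas as pd

# Shortened illustrative data: each dict key becomes a column name
data = {'state': ['Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2001],
        'pop': [1.5, 1.7, 2.4]}

# columns selects and orders the dict keys; 'debt' has no data,
# so the whole column is filled with NaN
df = pd.DataFrame(data,
                  index=['one', 'two', 'three'],
                  columns=['year', 'state', 'pop', 'debt'])
print(df)
```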

As before, missing values are filled with NaN. Now look at the index, the columns, and the type returned by indexing a column:

>>> df.index
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
>>> df.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
>>> type(df['debt'])
<class 'pandas.core.series.Series'>

Row-oriented and column-oriented operations on a DataFrame are roughly symmetric, and any single column pulled out of it is a Series.

Object attributes

Reindexing

A Series is reindexed with its .reindex(index=None, **kwargs) method. Two parameters are commonly passed through **kwargs: method=None and fill_value=np.nan:


>>> ser = Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
>>> a = ['a','b','c','d','e']
>>> ser.reindex(a)
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
>>> ser.reindex(a,fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
>>> ser.reindex(a,method='ffill')
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64
>>> ser.reindex(a,fill_value=0,method='ffill')
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64

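.reindex() returns a new object whose index strictly follows the given argument. The method parameter, one of {'backfill', 'bfill', 'pad', 'ffill', None}, specifies how missing entries are interpolated (filled). When it is not given, missing entries are filled with fill_value, which defaults to NaN. (ffill is the same as pad and bfill is the same as backfill; they fill a gap with the preceding value or the following value respectively.)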
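The three fill behaviors can be compared side by side in a short sketch. It assumes a Series whose index is already sorted, since recent pandas releases require a monotonic index when method is used; the values are illustrative.

```python
import pandas as pd

# Illustrative Series with a sorted (monotonic) index
ser = pd.Series([-5.3, 7.2, 3.6, 4.5], index=['a', 'b', 'c', 'd'])
new_index = ['a', 'b', 'c', 'd', 'e']

plain = ser.reindex(new_index)                    # 'e' is missing, so it becomes NaN
zeroed = ser.reindex(new_index, fill_value=0)     # 'e' is filled with 0
padded = ser.reindex(new_index, method='ffill')   # 'e' takes the value at 'd'

print(plain, zeroed, padded, sep='\n\n')
```

In every case ser itself keeps its original four entries, because reindex returns a new object.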

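For a DataFrame, the reindexing method is .reindex(index=None, columns=None, **kwargs). Compared with Series it only adds an optional columns parameter for reindexing the columns. Usage is similar to the example above, except that the interpolation method parameter applies only to rows, that is, axis 0.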


>>> state = ['Texas','Utah','California']
>>> df.reindex(columns=state,method='ffill')
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

[3 rows x 3 columns]
>>> df.reindex(index=['a','b','c','d'],columns=state,method='ffill')
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8

[4 rows x 3 columns]

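fill_value, however, still works on columns. You may already be wondering whether interpolation along the columns can be achieved with something like df.T.reindex(index, method='**').T, and the answer is yes. Also note that when calling reindex(index, method='**'), the index must be monotonic, otherwise a ValueError: Must be monotonic for forward fill is raised. For example, the last call above would fail with index=['a','b','d','c'].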
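The transpose trick mentioned above can be sketched as follows. This is an illustration under two assumptions: the column labels are sorted, so that the transposed index is monotonic, and the frame and the extra 'Utah' label are made up.

```python
import pandas as pd

# Illustrative frame whose column labels are already in sorted order
df = pd.DataFrame({'California': [2, 5, 8], 'Ohio': [0, 3, 6], 'Texas': [1, 4, 7]},
                  index=['a', 'c', 'd'])
cols = ['California', 'Ohio', 'Texas', 'Utah']

# Transpose so the columns become the row index, forward-fill there,
# then transpose back: 'Utah' takes the values of the preceding label 'Texas'
filled = df.T.reindex(cols, method='ffill').T
print(filled)
```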


Dropping entries from an axis

This means removing elements from a Series, or a row (or column) from a DataFrame, using the object's .drop(labels, axis=0) method:


>>> ser
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> ser.drop('c')
d    4.5
b    7.2
a   -5.3
dtype: float64
>>> df.drop('a')
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.drop(['Ohio','Texas'],axis=1)
   California
a           2
c           5
d           8

[3 rows x 1 columns]

.drop() returns a new object; the original object is not changed.
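A runnable sketch of the same calls, with the frame built explicitly (the session above assumes df already exists), shows that the original is left intact.

```python
import pandas as pd

df = pd.DataFrame({'Ohio': [0, 3, 6], 'Texas': [1, 4, 7], 'California': [2, 5, 8]},
                  index=['a', 'c', 'd'])

rows_dropped = df.drop('a')                         # drop a row label (axis=0 by default)
cols_dropped = df.drop(['Ohio', 'Texas'], axis=1)   # drop column labels

# df itself still has all three rows and all three columns
print(df.shape, rows_dropped.shape, cols_dropped.shape)
```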


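Indexing and slicing

Like NumPy, pandas supports indexing and slicing via obj[::], as well as filtering with boolean arrays.

Note, however, that because the index of a pandas object is not limited to integers, a slice that uses non-integer labels includes its endpoint.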


>>> foo
a    4.5
b    7.2
c   -5.3
d    3.6
dtype: float64
>>> bar
0    4.5
1    7.2
2   -5.3
3    3.6
dtype: float64
>>> foo[:2]
a    4.5
b    7.2
dtype: float64
>>> bar[:2]
0    4.5
1    7.2
dtype: float64
>>> foo[:'c']
a    4.5
b    7.2
c   -5.3
dtype: float64

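Here foo and bar differ only in their index: bar's index is a sequence of integers. Slicing with integers behaves as it does for Python lists or NumPy arrays, whereas slicing with a string label such as 'c' includes that boundary element.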
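The difference between the two kinds of slice can be made explicit with the .iloc and .loc accessors. They are not used in the text above and are brought in here only to separate positional from label-based slicing.

```python
import pandas as pd

foo = pd.Series([4.5, 7.2, -5.3, 3.6], index=['a', 'b', 'c', 'd'])

by_position = foo.iloc[:2]   # positions 0 and 1; the end position is excluded
by_label = foo.loc[:'c']     # labels 'a' through 'c'; the end label is included

print(by_position)
print(by_label)
```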

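Another peculiarity is how DataFrame objects are indexed, because they have two axes (a double index).

One way to think of it: the standard slicing syntax for a DataFrame is .ix[::,::]. The ix object accepts two sets of slices, one for the rows (axis=0) and one for the columns (axis=1). (.ix was removed in pandas 1.0; .loc and .iloc are its label-based and position-based replacements.)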


>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.ix[:2,:2]
   Ohio  Texas
a     0      1
c     3      4

[2 rows x 2 columns]
>>> df.ix['a','Ohio']
0

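Slicing directly, without ix, is a special case:

When indexing, columns are selected
When slicing, rows are selected

This looks a little illogical, but the author explains that "this syntax arose out of practice", so we will take his word for it.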


>>> df['Ohio']
a    0
c    3
d    6
Name: Ohio, dtype: int32
>>> df[:'c']
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]
>>> df[:2]
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]

With boolean arrays, note the different forms for rows and columns (for columns the : cannot be omitted):


>>> df['Texas']>=4
a    False
c     True
d     True
Name: Texas, dtype: bool
>>> df[df['Texas']>=4]
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.ix[:,df.ix['c']>=4]
   Texas  California
a      1           2
c      4           5
d      7           8

[3 rows x 2 columns]
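For readers on a pandas release without .ix, the selections in this section can be sketched with .iloc (by position) and .loc (by label). This is one possible translation of the calls above, not the original author's code.

```python
import pandas as pd

df = pd.DataFrame({'Ohio': [0, 3, 6], 'Texas': [1, 4, 7], 'California': [2, 5, 8]},
                  index=['a', 'c', 'd'])

top_left = df.iloc[:2, :2]            # first two rows and first two columns, by position
cell = df.loc['a', 'Ohio']            # a single value, by row and column label
rows = df[df['Texas'] >= 4]           # boolean mask over the rows
cols = df.loc[:, df.loc['c'] >= 4]    # boolean mask over the columns, built from row 'c'

print(top_left, cell, rows, cols, sep='\n\n')
```

The column mask needs the leading : because the first position of .loc always refers to rows.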

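Arithmetic and data alignment

One of the most important features of pandas is that it can perform arithmetic between objects with different indexes. When objects are added, the index of the result is the union of the two indexes. Automatic alignment introduces missing values, NaN by default, at the labels that do not overlap.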


>>> foo = Series({'a':1,'b':2})
>>> foo
a    1
b    2
dtype: int64
>>> bar = Series({'b':3,'d':4})
>>> bar
b    3
d    4
dtype: int64
>>> foo + bar
a   NaN
b     5
d   NaN
dtype: float64

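On a DataFrame, alignment happens on both the rows and the columns.

To avoid NA values in the result, use the fill_value parameter mentioned earlier under reindex. Passing it requires calling the object's method rather than using the operator: df1.add(df2, fill_value=0). The other arithmetic methods are sub(), div() and mul().

Arithmetic between a Series and a DataFrame involves broadcasting, which is left for later.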
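A minimal sketch of the operator versus the method form, reusing the two small Series from the session above:

```python
import pandas as pd

foo = pd.Series({'a': 1, 'b': 2})
bar = pd.Series({'b': 3, 'd': 4})

plain = foo + bar                      # labels present on one side only become NaN
filled = foo.add(bar, fill_value=0)    # a missing operand is treated as 0 instead

print(plain)
print(filled)
```

With fill_value=0, 'a' keeps its value from foo and 'd' keeps its value from bar, while 'b' is the ordinary sum.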


Function application and mapping

NumPy ufuncs (element-wise array methods) can also be used on pandas objects.

To apply a function to each row or column of a DataFrame, use the .apply(func, axis=0, args=(), **kwds) method.


>>> f = lambda x:x.max()-x.min()
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.apply(f)
Ohio          6
Texas         6
California    6
dtype: int64
>>> df.apply(f,axis=1)
a    2
c    2
d    2
dtype: int64
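The same computation as a self-contained sketch, with a named function in place of the lambda (the name value_range is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Ohio': [0, 3, 6], 'Texas': [1, 4, 7], 'California': [2, 5, 8]},
                  index=['a', 'c', 'd'])

def value_range(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series
    return x.max() - x.min()

per_column = df.apply(value_range)           # one result per column
per_row = df.apply(value_range, axis=1)      # one result per row

print(per_column)
print(per_row)
```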


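Sorting and ranking

The Series method sort_index(ascending=True) sorts by index. The ascending parameter controls ascending or descending order and defaults to ascending.

To sort a Series by value, use the .order() method (renamed sort_values() in later pandas releases). Missing values are placed at the end of the Series by default.

On a DataFrame, .sort_index(axis=0, by=None, ascending=True) adds an axis parameter and a by parameter. by sorts according to one or more columns and cannot be used on rows. (Later pandas releases moved this to sort_values(by=...).)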


>>> df.sort_index(by='Ohio')
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.sort_index(by=['California','Texas'])
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.sort_index(axis=1)
   California  Ohio  Texas
a           2     0      1
c           5     3      4
d           8     6      7

[3 rows x 3 columns]
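On recent pandas releases, where .order() and the by argument of sort_index are no longer available, the same operations can be sketched with sort_index and sort_values. The frame below is deliberately shuffled, and its values are illustrative.

```python
import pandas as pd

# Rows deliberately out of order so that sorting has a visible effect
df = pd.DataFrame({'Ohio': [6, 0, 3], 'Texas': [7, 1, 4], 'California': [8, 2, 5]},
                  index=['d', 'a', 'c'])

by_label = df.sort_index()                                # sort rows by index label
by_column = df.sort_values(by='Ohio', ascending=False)    # sort rows by a column's values
cols_sorted = df.sort_index(axis=1)                       # sort the column labels

ser = pd.Series([3.0, None, 1.0])
ser_sorted = ser.sort_values()                            # the missing value goes last

print(by_label, by_column, cols_sorted, ser_sorted, sep='\n\n')
```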

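Ranking (Series.rank(method='average', ascending=True)) differs from sorting in that it replaces the object's values with their ranks (from 1 to n). The only question is how to handle ties, which is what the method parameter is for. It accepts four values: average, min, max and first.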


>>> ser=Series([3,2,0,3],index=list('abcd'))
>>> ser
a    3
b    2
c    0
d    3
dtype: int64
>>> ser.rank()
a    3.5
b    2.0
c    1.0
d    3.5
dtype: float64
>>> ser.rank(method='min')
a    3
b    2
c    1
d    3
dtype: float64
>>> ser.rank(method='max')
a    4
b    2
c    1
d    4
dtype: float64
>>> ser.rank(method='first')
a    3
b    2
c    1
d    4
dtype: float64

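Note how the tied pair ser[0] = ser[3] receives different ranks under the different method values.

The DataFrame method .rank(axis=0, method='average', ascending=True) adds an axis parameter, so ranking can be done along rows or along columns. There does not currently seem to be a method for ranking across all elements at once.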
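The snippet below sketches ranking along each axis, followed by one possible workaround for ranking across all elements: flatten the values, rank them as a single Series, and reshape. The workaround is an assumption of this example, not a built-in DataFrame feature.

```python
import pandas as pd

df = pd.DataFrame({'Ohio': [0, 3, 6], 'Texas': [1, 4, 7], 'California': [2, 5, 8]},
                  index=['a', 'c', 'd'])

within_columns = df.rank()          # axis=0: rank the values inside each column
within_rows = df.rank(axis=1)       # axis=1: rank the values inside each row

# Workaround: rank the flattened values, then restore the original shape and labels
flat = pd.Series(df.to_numpy().ravel()).rank()
overall = pd.DataFrame(flat.to_numpy().reshape(df.shape),
                       index=df.index, columns=df.columns)

print(within_columns, within_rows, overall, sep='\n\n')
```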


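Statistical methods

pandas objects come with a set of statistical methods. Most are reductions or summary statistics that extract a single value from a Series, or a Series from the rows or columns of a DataFrame.

Take DataFrame.mean(axis=0, skipna=True) as an example: when the data contain NA values they are simply skipped, unless the entire slice (row or column) is NA. If that is not what you want, pass skipna=False to disable this behavior: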


>>> df
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

[4 rows x 2 columns]
>>> df.mean()
one    3.083333
two   -2.900000
dtype: float64
>>> df.mean(axis=1)
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
>>> df.mean(axis=1,skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

Other commonly used statistical methods:

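Method	Description
count	Number of non-NA values
describe	Summary statistics for a Series or for each DataFrame column
min , max	Minimum and maximum values
argmin , argmax	Index positions (integers) of the minimum and maximum values
idxmin , idxmax	Index labels of the minimum and maximum values
quantile	Sample quantile (0 to 1)
sum	Sum
mean	Mean
median	Median
mad	Mean absolute deviation from the mean
var	Variance
std	Standard deviation
skew	Sample skewness (third moment)
kurt	Sample kurtosis (fourth moment)
cumsum	Cumulative sum
cummin , cummax	Cumulative minimum and cumulative maximum
cumprod	Cumulative product
diff	First-order difference (useful for time series)
pct_change	Percent change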
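A few of the methods in the table can be tried on a frame with the same shape as the one in the mean example. The construction below is an assumption, since the original session does not show how df was built.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.40, 7.10, np.nan, 0.75],
                   'two': [np.nan, -4.5, np.nan, -1.3]},
                  index=['a', 'b', 'c', 'd'])

counts = df.count()           # non-NA values per column
labels = df.idxmax()          # index label of each column's maximum
running = df['one'].cumsum()  # NaN is skipped in the sum but stays NaN in place
summary = df.describe()       # count, mean, std, min, quartiles and max per column

print(counts, labels, running, summary, sep='\n\n')
```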

Handling missing data

In pandas, NA is mainly represented by np.nan. Python's built-in None is also treated as NA.

There are four methods for dealing with NA: dropna, fillna, isnull and notnull.


is(not)null

This pair of methods is applied element-wise to an object and returns a boolean array, which is typically used for boolean indexing.
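A short sketch of using the boolean result as a filter, with made-up values:

```python
import numpy as np
import pandas as pd

# Both np.nan and None are treated as NA
ser = pd.Series([1.0, np.nan, 3.5, None])

mask = ser.notnull()          # True where a value is present
clean = ser[mask]             # boolean indexing keeps only those entries
n_missing = ser.isnull().sum()

print(mask)
print(clean)
print(n_missing)
```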


dropna

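For a Series, dropna returns a Series containing only the non-null data and their index labels.

The difficulty lies in how a DataFrame is handled, because dropping anything means discarding at least a whole row (or column). The solution is similar to before: an extra parameter, dropna(axis=0, how='any', thresh=None). how can be any or all; all discards a row (or column) only when every element in the slice is NA. Another interesting parameter is thresh, an integer. With thresh=3, for example, a row is kept only if it has at least 3 non-NA values.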
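The three variants can be compared on a small made-up frame with one complete row, one partly missing row and one empty row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, np.nan],
                   'y': [2.0, 5.0, np.nan],
                   'z': [3.0, 6.0, np.nan]},
                  index=['a', 'b', 'c'])

any_na = df.dropna()                 # how='any': drop rows containing any NA
all_na = df.dropna(how='all')        # drop only rows that are entirely NA
at_least_two = df.dropna(thresh=2)   # keep rows with at least 2 non-NA values

print(any_na, all_na, at_least_two, sep='\n\n')
```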


fillna

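In fillna(value=None, method=None, axis=0), the value parameter accepts a dictionary in addition to basic types, which makes it possible to fill different columns with different values. method works the same way as in .reindex() above and is not repeated here.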
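A sketch of per-column filling with a dictionary. For forward filling it calls the .ffill() method, used here as a stand-in for method='ffill', which newer pandas releases deprecate in fillna. The frame is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, 5.0, np.nan]})

per_column = df.fillna({'one': 0, 'two': -1})   # a different fill value for each column
carried = df.ffill()                            # carry the previous value forward

print(per_column)
print(carried)
```

The first entry of 'two' stays NaN under forward filling because there is no earlier value to carry.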

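The inplace parameter

One point has gone unmentioned so far, and after writing all the examples it turns out to be quite important. Among the methods of Series and DataFrame, those that modify the data and return a new object usually take an optional inplace=False parameter. If it is set to True, the original object is modified in place instead.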
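The difference can be seen with dropna, one of the methods that accepts inplace. The frame is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0]})

copy = df.dropna()                     # default: a new object, df is untouched
rows_before = len(df)                  # still 3 at this point

returned = df.dropna(inplace=True)     # modifies df itself and returns None
rows_after = len(df)

print(rows_before, rows_after, returned)
```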

Reposted from: Python data analysis package: pandas basics
