An Introduction to Using pandas in Python

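2018/04/15 While writing these notes I searched for related material and found that some other bloggers have covered pandas very thoroughly. I only record what I have run into myself, so you may want to read their posts as well: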

https://blog.csdn.net/u011089523/article/details/60349591

http://blog.sina.com.cn/s/blog_154861eae0102xbsq.html

http://www.jb51.net/article/134615.htm

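I mainly write this to reinforce my own memory and to get more practice. If you want to discuss anything, feel free to get in touch. I have been studying machine learning lately, and there really is a lot to learn.

pandas has two very important data structures: Series and DataFrame. A Series is similar to a one-dimensional NumPy array. Besides supporting the functions and methods available to one-dimensional arrays, it can retrieve data by index label and aligns automatically on the index. A DataFrame is similar to a two-dimensional NumPy array. It can likewise use NumPy array functions and methods, and it offers other flexible features that are introduced later.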

Installing the package

Open a command line and run: pip install pandas

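Creating pandas data structures

1. Creating a Series

1) Creating a Series from a one-dimensional array

pd.Series(array)

2) Creating a Series from a dictionary

pd.Series(dictionary)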
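A minimal sketch of both ways of creating a Series, together with the automatic index alignment mentioned in the introduction. The variable names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Series from a one-dimensional NumPy array; the index defaults to 0, 1, 2
s_arr = pd.Series(np.arange(3))

# Series from a dict; the keys become the index labels
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})
s_other = pd.Series({'b': 1, 'c': 2, 'd': 3})

# Arithmetic aligns on index labels; a label present on only one side gives NaN
total = s_dict + s_other
print(total)
```

Only the labels 'b' and 'c' exist in both Series, so only those two positions of total hold a number.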

2. Creating a DataFrame

1) Creating a DataFrame from a two-dimensional array

pd.DataFrame(array)

2) Creating a DataFrame from a dictionary. Two kinds of dictionary are used below: a dictionary of lists and a nested dictionary.

dic2 = {'a':[1,2,3,4],'b':[5,6,7,8],
'c':[9,10,11,12],'d':[13,14,15,16]}

df2 = pd.DataFrame(dic2)

dic3 = {'one':{'a':1,'b':2,'c':3,'d':4},
'two':{'a':5,'b':6,'c':7,'d':8},
'three':{'a':9,'b':10,'c':11,'d':12}}

df3 = pd.DataFrame(dic3)

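3) Creating a DataFrame from an existing DataFrame

df4 = df3[['one','three']]  # a list of labels selects columns and returns a DataFrame

s3 = df3['one']  # a single label returns a Series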

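View the index: df.index

View the column labels: df.columns

Drop the row index and renumber from 0:
ser.reset_index(drop = True)
df.reset_index(drop = True)
------------------------------------------
Change the column labels directly:
df.columns = ['One','Two','Three']
Note: pd.DataFrame(df, columns = ['One','Two','Three']) does not rename anything. It selects columns by those labels, and any label that does not exist comes back as a column of NaN.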
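A small sketch of what reset_index(drop=True) and direct assignment to df.columns do. The frame and its labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=[10, 20, 30])

# Filtering keeps the original row labels (20 and 30 here)
subset = df[df['one'] > 1]

# drop=True discards the old labels instead of keeping them as a column
renumbered = subset.reset_index(drop=True)

# Assigning a list to .columns replaces the labels and leaves the data alone
renamed = df.copy()
renamed.columns = ['One', 'Two']
```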

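Usage of common methods:

import pandas as pd

1. Reading from and saving to a CSV file

df=pd.read_csv('cancer_data.csv')

df=pd.read_csv('cancer_data.csv',sep='delimiter')  # if the data read from the CSV is all fused into one column, pass the file's delimiter as sep

Tip: if the file is in the current working directory, the file name alone is enough. An absolute path is generally recommended; I use the bare name here only because I work in Jupyter and keep the file in the current directory.

df.to_csv('cancer_data.csv')  # to_csv is a DataFrame method, not a function on pd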
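To show why sep matters without needing a file on disk, here is a sketch that reads from an in-memory string. The semicolon-separated sample data is invented, and io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content
raw = "id;diagnosis;radius_mean\n1;M;17.99\n2;B;13.54\n"

# With the default sep=',' every line ends up in a single column
fused = pd.read_csv(io.StringIO(raw))

# Passing the real delimiter gives three proper columns
parsed = pd.read_csv(io.StringIO(raw), sep=';')

# to_csv is called on the DataFrame; index=False leaves out the row index
buffer = io.StringIO()
parsed.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```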

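Getting the number of rows and columns

Number of columns: df.columns.size

Number of rows: df.iloc[:,0].size

df.iloc[[0]].values[0][0]  # value in the first row, first column

df.iloc[[1]].values[0][1]  # value in the second row, second column

Note: older code used df.ix here. The .ix indexer was removed in pandas 1.0, so use iloc (by position) or loc (by label).

# Get the value at a specific row and column

df.iat[1,1]  # value in the second row, second column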
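A short sketch of the size and single-value lookups on a tiny made-up frame. It uses iloc and iat, which work in current pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows, n_cols = df.shape   # shape is (rows, columns)
first = df.iloc[0, 0]       # first row, first column, by position
second = df.iat[1, 1]       # second row, second column; iat is for single values
```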

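2. Displaying the loaded data (head() shows the first five rows when no argument is given)

df.head(20)  # show the first 20 rows

3. Returning the column labels (as an Index, which behaves like an array)

df.columns

Tip: looping with "for i, v in enumerate(df.columns): print(i, v)" prints each label next to its position, which is easier to read.

Return the dimensions of the DataFrame as a tuple (number of rows, number of columns)

df.shape

4. Returning the data type of each column

Tip: string columns show up in pandas with the object dtype

df.dtypes  # an attribute, so no parentheses

5. Displaying a concise summary of the DataFrame, including the number of non-null values in each column

df.info()

Example output: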

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
id                        569 non-null int64
diagnosis                 569 non-null object
radius_mean               569 non-null float64
texture_mean              548 non-null float64
perimeter_mean            569 non-null float64
area_mean                 569 non-null float64
smoothness_mean           521 non-null float64
compactness_mean          569 non-null float64
concavity_mean            569 non-null float64
concave_points_mean       569 non-null float64
symmetry_mean             504 non-null float64
fractal_dimension_mean    569 non-null float64
radius_SE                 569 non-null float64
texture_SE                548 non-null float64
perimeter_SE              569 non-null float64
area_SE                   569 non-null float64
smoothness_SE             521 non-null float64
compactness_SE            569 non-null float64
concavity_SE              569 non-null float64
concave_points_SE         569 non-null float64
symmetry_SE               504 non-null float64
fractal_dimension_SE      569 non-null float64
radius_max                569 non-null float64
texture_max               548 non-null float64
perimeter_max             569 non-null float64
area_max                  569 non-null float64
smoothness_max            521 non-null float64
compactness_max           569 non-null float64
concavity_max             569 non-null float64
concave_points_max        569 non-null float64
symmetry_max              504 non-null float64
fractal_dimension_max     569 non-null float64
dtypes: float64(30), int64(1), object(1)
memory usage: 142.3+ KB

6. Returning descriptive statistics for each numeric column (count, mean, standard deviation, minimum, percentiles, maximum)

df.describe()

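7. Selecting rows and columns with loc and iloc

Select all columns from 'id' through the last mean column (a label slice with loc includes both ends)
df_means = df.loc[:,'id':'fractal_dimension_mean']

df_means.head()

Repeat the selection with integer positions: all rows, columns 0 through 11, i.e. the half-open range [0, 12), because 'fractal_dimension_mean' is the twelfth column
df_means = df.iloc[:,:12]

df_means.head()

Tip: to pick non-contiguous rows and columns, build the positions with np.r_ (a slice such as 0:3 cannot be written inside a plain list)

import numpy as np
df_means = df.iloc[np.r_[0:3,8,15], np.r_[1,5:11]]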
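The difference between label slices and position slices, and one way to select scattered rows and columns, can be sketched as follows. The 20 x 12 frame and its column names c0 to c11 are hypothetical, and np.r_ is just one convenient way of combining slices with single positions.

```python
import numpy as np
import pandas as pd

# Hypothetical 20 x 12 frame with columns c0 ... c11
df = pd.DataFrame(np.arange(240).reshape(20, 12),
                  columns=[f'c{i}' for i in range(12)])

by_label = df.loc[:, 'c0':'c3']   # label slice: both ends are included
by_position = df.iloc[:, :4]      # position slice: the end is excluded

# np.r_ concatenates slices and single positions into one index array
picked = df.iloc[np.r_[0:3, 8, 15], np.r_[1, 5:11]]
```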

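8. Finding which positions are null: returns a DataFrame that is True where the value is NaN and False otherwise

df.isnull() or pd.isnull(df)

Count the null (NaN) values in each column

df.isnull().sum()

df.isnull().sum(axis=1)  # number of null values in each row

Check which columns contain missing values

df.isnull().any()

Return a two-dimensional array that is True where the value is NaN and False otherwise

df.isnull().values

Return every row that contains a missing value

df[df.isnull().any(axis=1)]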

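Dropping rows that contain NaN:
Drop rows in which every value is NaN
df.dropna(axis=0,how='all')
Drop rows that contain any NaN

df.dropna(axis=0,how='any') #drop all rows that have any NaN values

Dropping columns that contain NaN:
Drop columns in which every value is NaN
df.dropna(axis=1,how='all')
Drop columns that contain any NaN
df.dropna(axis=1,how='any')
dropna returns a new DataFrame, so assign the result to keep it, e.g. df = df.dropna(axis=1,how='any')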
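A compact sketch of the null counts and the axis/how combinations of dropna, on a made-up frame in which column 'c' is entirely NaN and the last row is entirely NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, 2.0, np.nan, np.nan],
    'c': [np.nan, np.nan, np.nan, np.nan],
})

per_column = df.isnull().sum()        # NaN count per column
per_row = df.isnull().sum(axis=1)     # NaN count per row

no_empty_rows = df.dropna(axis=0, how='all')  # drops only the all-NaN last row
no_empty_cols = df.dropna(axis=1, how='all')  # drops only the all-NaN column 'c'
complete_rows = df.dropna(axis=0, how='any')  # every row has a NaN, so none remain
```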

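9. Finding the unique values in a column; returns an array of the distinct values

df['header'].unique()

10. Returning descriptive statistics for a single column: select the column first, then call describe()

df['header'].describe()

Count how many times each unique value occurs in a single column

df['header'].value_counts()

Tip: df['header'].value_counts().index gives all the unique values in that column, which you can use as the labels when you draw a histogram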

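11. Checking whether rows are duplicated (redundant); returns a boolean Series in which True marks a row that repeats an earlier one

df.duplicated()

View the duplicated rows

df[df.duplicated()]

Remove the duplicated rows

df.drop_duplicates(inplace=True)

To judge duplicates by a single column label rather than by the whole row

See: https://blog.csdn.net/weixin_37226516/article/details/72846687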
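One way to judge duplicates by a single column is the subset parameter of duplicated and drop_duplicates. A minimal sketch, with invented column names and data:

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['ann', 'bob', 'ann', 'ann'],
    'city': ['Oslo', 'Rome', 'Oslo', 'Lima'],
})

# Whole row must match an earlier row
full_dupes = df.duplicated()

# Only the 'user' column is compared
user_dupes = df.duplicated(subset=['user'])

# Keep the first row seen for each user
unique_users = df.drop_duplicates(subset=['user'], keep='first')
```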

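12. When some values turn out to be missing, handle them according to the actual situation; for example, fill the nulls with the column mean

mean=df['view_duration'].mean()

df['view_duration']=df['view_duration'].fillna(mean)

fillna also accepts inplace=True, which modifies the object in place (the default is False). When filling a single selected column, assigning the result back as above is the more reliable form in recent pandas versions.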
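A tiny sketch of mean imputation, assuming a frame with one missing view_duration value. The numbers are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'view_duration': [10.0, np.nan, 30.0]})

# mean() skips NaN by default: (10 + 30) / 2
mean = df['view_duration'].mean()

# Assign the filled column back to the frame
df['view_duration'] = df['view_duration'].fillna(mean)
```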

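13. Concatenating two DataFrames

pd.concat([df1,df2])

If you do not need to keep df2's index

pd.concat([df1,df2],ignore_index=True)

Note: older code wrote this as df1.append(df2). DataFrame.append was removed in pandas 2.0.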
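The effect of ignore_index can be seen on two small hypothetical frames. This sketch uses pd.concat, which works in current pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2]})
df2 = pd.DataFrame({'x': [3, 4]})

# Each frame keeps its own row labels, so the index repeats
stacked = pd.concat([df1, df2])

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([df1, df2], ignore_index=True)
```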

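14. Renaming column labels

old_labels = list(df.columns)

old_labels[index]='new_label' # replace a single label

df.columns = old_labels

df.columns = new_labels # replace all labels at once

# Using the built-in rename method

df=df.rename(columns={'old_name':'new_name'})

# Replace spaces with underscores and convert the labels to lowercase

df.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)  # the lambda is applied to every label

Tip: when you pass a lambda you do not need the dict form; the function does the replacement directly. For details, see the official pandas documentation for DataFrame.rename.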
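A sketch contrasting the two forms of rename on made-up column labels: a callable touches every label, while a dict only touches the labels it lists.

```python
import pandas as pd

df = pd.DataFrame({' Radius Mean': [1.0], 'Area SE ': [2.0]})

# A callable is applied to every column label
clean = df.rename(columns=lambda x: x.strip().lower().replace(' ', '_'))

# A dict renames only the labels it mentions
renamed = clean.rename(columns={'area_se': 'area_std_err'})
```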

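15. Grouping with groupby

df.groupby(['col1','col2',...],as_index=False)

This groups by col1, col2, and so on. as_index defaults to True; with False, the grouping columns are not used as the index, and the result keeps a plain 0, 1, 2, ... index.

df.groupby(['col1','col2',...],as_index=False)[['col3','col4',...]]

The trailing selection restricts the result to the columns I want to see; selecting several columns needs a list, hence the double brackets

# Group by A and B, then take the mean of C

df = df.groupby(['A','B'],as_index = False)['C'].mean()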
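To make the effect of as_index concrete, here is a small sketch on invented data. The same grouping is computed twice, once with the group keys as ordinary columns and once with them as the index.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x', 'x', 'y', 'y'],
    'B': ['p', 'p', 'p', 'q'],
    'C': [1.0, 3.0, 5.0, 7.0],
})

# Group keys stay as ordinary columns; the index is 0, 1, 2
as_columns = df.groupby(['A', 'B'], as_index=False)['C'].mean()

# Default as_index=True: the group keys become a MultiIndex
as_index = df.groupby(['A', 'B'])['C'].mean()
```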

You can also take a look at these:

https://www.cnblogs.com/zhangzhangwhu/p/7219651.html

https://blog.csdn.net/qq_24753293/article/details/78338263

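16. The cut function

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3,include_lowest=False)

x: the data to be binned; it must be one-dimensional, for example a single DataFrame column

bins: int, sequence of scalars, or IntervalIndex

If bins is an int, it defines the number of equal-width bins over the range of x. In that case the range of x is extended by 0.1% on each side so that the minimum and maximum values of x are included. If bins is a sequence, it defines the bin edges, which allows bins of unequal width; the range of x is not extended in that case.

right: bool, optional

Indicates whether each bin includes its right edge. If right == True (the default), bins [1,2,3,4] mean (1,2], (2,3], (3,4].

labels: array or bool, default None
Used as the labels for the resulting bins. Must be the same length as the number of resulting bins. If False, only the integer indicators of the bins are returned.

retbins: bool, optional
Whether to return the bins. Useful when bins is given as a scalar. (My own note: retbins is short for "return bins"; with True the bin edges are returned together with the result, with False they are not.)

precision: int, optional
The precision at which the bin labels are stored and displayed

include_lowest: bool, optional

Whether the first interval should be left-inclusive, that is, closed on its left edge.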
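The parameters above are easier to follow on a concrete call. This sketch bins the values 1 to 4 with the edges [1, 2, 3, 4]; the variable names and labels are made up.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4])

# right=True (default): intervals (1, 2], (2, 3], (3, 4]; the value 1 falls outside
right_closed = pd.cut(values, bins=[1, 2, 3, 4])

# include_lowest=True makes the first interval contain its left edge as well
with_lowest = pd.cut(values, bins=[1, 2, 3, 4], include_lowest=True)

# labels=False returns integer bin codes; retbins=True also returns the edges
codes, edges = pd.cut(values, bins=[1, 2, 3, 4], labels=False,
                      include_lowest=True, retbins=True)

# One label per bin: three bins, three labels
named = pd.cut(values, bins=[1, 2, 3, 4], labels=['low', 'mid', 'high'],
               include_lowest=True)
```

Without include_lowest, the smallest value becomes NaN because it sits exactly on the open left edge of the first bin.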

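17. Dropping columns you do not need

df.drop(['label1','label2',...],axis=1, inplace=True)

With axis=0 the labels are looked up in the row index and rows are dropped; with axis=1 they are looked up among the columns and columns are dropped. If axis is omitted it defaults to 0.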
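A brief sketch of how axis changes what drop removes, on a hypothetical three-column frame. The columns= keyword is an equivalent way of writing axis=1.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

without_cols = df.drop(['b', 'c'], axis=1)  # labels looked up among the columns
without_row = df.drop([0])                  # axis defaults to 0: row label 0
same_cols = df.drop(columns=['b', 'c'])     # keyword form of axis=1
```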

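18. Selecting the data we need with query; in effect, the condition picks out the index entries of the rows that satisfy it, and those rows are returned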

# selecting malignant records in cancer data
df_m = df[df['diagnosis'] == 'M']
df_m = df.query('diagnosis == "M"')

# selecting records of people making over $50K
df_a = df[df['income'] == ' >50K']
df_a = df.query('income == " >50K"')
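The two spellings above give the same rows. A minimal check on an invented three-row frame:

```python
import pandas as pd

df = pd.DataFrame({'diagnosis': ['M', 'B', 'M'],
                   'radius_mean': [18.0, 12.5, 20.1]})

mask = df['diagnosis'] == 'M'            # boolean Series, one entry per row
by_mask = df[mask]                       # rows where the mask is True
by_query = df.query('diagnosis == "M"')  # same condition written as a string
```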

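19. Extracting an integer from a string

df['B'].str.extract(r'(\d+)', expand=False).astype(int)  # raw string for the regex; expand=False returns a Series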
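A sketch of the extraction on made-up strings. It assumes every value contains at least one digit, since astype(int) fails on the NaN produced by a non-matching row.

```python
import pandas as pd

df = pd.DataFrame({'B': ['item 12', 'no. 7 left', '305kg']})

# The first run of digits in each string is captured, then converted to int
numbers = df['B'].str.extract(r'(\d+)', expand=False).astype(int)
```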

20. Converting data types

1) Use astype('dtype') directly

df['header']= df['header'].astype('dtype')

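21. Selecting the rows in which a column contains what you are looking for, for example the string '/'

hb = df[df['label'].str.contains('/')]  # returns every row whose value in this column contains '/'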

22. Joining DataFrames with pandas' built-in merge

See the pandas documentation for merge (pd.merge and DataFrame.merge).
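Since the notes only point to the documentation, here is a minimal sketch of merge on a shared key column. The frames, the column names, and the choice of an inner and a left join are illustrative assumptions, not something taken from the original notes.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'score': [20, 30, 40]})

# Default how='inner': only ids present in both frames
inner = pd.merge(left, right, on='id')

# how='left': every row of left is kept; missing scores become NaN
left_join = pd.merge(left, right, on='id', how='left')
```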

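23. Removing or filtering out certain values or rows from a dataset

1) Remove the rows in which a column holds certain values

df[~df['appPlatform'].isin([2])]

Explanation: removes the rows whose 'appPlatform' value is 2

df[~df['appID'].isin([278,382])]

Explanation: removes the rows whose 'appID' value is 278 or 382

df[~df['appID'].isin([278,382]) & ~df['appPlatform'].isin([2])]

Explanation: removes the samples with appPlatform=2 as well as those with appID=278 or appID=382

Note: the ~ operator negates the boolean mask. The older spelling True-df[...].isin(...) relies on boolean subtraction, which recent NumPy versions no longer support.

2) Filter on a range of values

df[df['creativeID']<=10000]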
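The filters above can be tried on a small invented frame that reuses the same column names. The sketch negates the isin mask with ~.

```python
import pandas as pd

df = pd.DataFrame({
    'appID': [278, 100, 382, 100],
    'appPlatform': [1, 2, 1, 1],
    'creativeID': [5, 20000, 7, 9000],
})

# Keep rows whose appPlatform is not 2
not_platform_2 = df[~df['appPlatform'].isin([2])]

# Keep rows that pass both conditions; ~ binds more tightly than &
kept = df[~df['appID'].isin([278, 382]) & ~df['appPlatform'].isin([2])]

# Keep rows within a value range
small_ids = df[df['creativeID'] <= 10000]
```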

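24. Scenario: after grouping by year, find the largest count among the film genres

1. First group by year and film genre to get the number of films of each genre within each year

2. Take the maximum of those counts

25. For convenience in later calculations, count how many times a given string occurs in a column

26. Convert the values under the label column to 0/1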
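The last three items come without code in these notes, so the following is only one possible sketch. The films frame, its column names, the substring "dram", and the 'yes'/'no' label values are all hypothetical.

```python
import pandas as pd

# Hypothetical film data
films = pd.DataFrame({
    'year':  [2000, 2000, 2000, 2001, 2001],
    'genre': ['drama', 'drama', 'comedy', 'comedy', 'drama'],
    'label': ['yes', 'no', 'yes', 'yes', 'no'],
})

# Step 1: number of films per (year, genre)
counts = films.groupby(['year', 'genre']).size()

# Step 2: the largest of those counts within each year
max_per_year = counts.groupby(level='year').max()

# Number of rows whose genre contains the substring "dram"
n_drama = films['genre'].str.contains('dram').sum()

# Map the two label values to 1 and 0
films['label'] = films['label'].map({'yes': 1, 'no': 0})
```

str.contains counts rows that contain the substring at least once; if a value can contain it several times and each occurrence should count, str.count followed by sum would be the alternative.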
