Data transformation in pandas covers filtering, cleaning, and related operations.
Removing duplicate data
duplicated() returns a boolean Series indicating whether each row is a duplicate of an earlier one
drop_duplicates() removes duplicate rows (keeping the first occurrence)
There is not much to explain here; let's go straight to an example:
In [20]: s = pd.DataFrame({'key': ['a'] * 4 + ['b'] * 3,
    ...:                   'key0': [1, 1, 2, 3, 3, 4, 4]})

In [21]: s.duplicated()
Out[21]:
0    False
1     True
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [22]: s.drop_duplicates()
Out[22]:
  key  key0
0   a     1
2   a     2
3   a     3
4   b     3
5   b     4

In [23]: s.drop_duplicates('key')  # deduplicate based on a single column
Out[23]:
  key  key0
0   a     1
4   b     3

In [24]: s.drop_duplicates(['key', 'key0'])  # pass a list of columns to deduplicate on
Out[24]:
  key  key0
0   a     1
2   a     2
3   a     3
4   b     3
5   b     4
In [25]: s.key.drop_duplicates()  # yes, this works too
Out[25]:
0 a
4 b
Name: key, dtype: object
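By default drop_duplicates keeps the first occurrence of each duplicate; a small sketch of the keep='last' option, reusing the frame from above:

```python
import pandas as pd

s = pd.DataFrame({'key': ['a'] * 4 + ['b'] * 3,
                  'key0': [1, 1, 2, 3, 3, 4, 4]})

# keep='last' retains the final occurrence of each duplicate instead of the first
last = s.drop_duplicates('key', keep='last')
print(last)
```

keep=False would instead drop every row that has a duplicate.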
Transforming data with a function or mapping
In [61]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
    ...:                               'Pastrami', 'corned beef', 'Bacon',
    ...:                               'pastrami', 'honey ham', 'nova lox'],
    ...:                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [62]: data
Out[62]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

Suppose you want to add a column indicating the type of animal each meat comes from. First, write down a mapping from meat to animal.
In [63]: meat_to_animal = {
    ...:     'bacon': 'pig',
    ...:     'pulled pork': 'pig',
    ...:     'pastrami': 'cow',
    ...:     'corned beef': 'cow',
    ...:     'honey ham': 'pig',
    ...:     'nova lox': 'salmon'
    ...: }
The Series map method accepts a function or a dict-like object containing a mapping.
There is a catch here, though: some of the food names are capitalized and others are not, so we need to normalize the case first.
In [64]: data['animal'] = data['food'].map(str.lower).map(meat_to_animal)

Now let's use map to apply a function, passing each element of data['food'] through a lambda:
In [65]: data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[65]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
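The same transformation can also be written with the vectorized str accessor instead of a nested map; a sketch:

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
                              'corned beef', 'Bacon', 'pastrami', 'honey ham',
                              'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow',
                  'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon'}

# lower-case via the str accessor, then map each value through the dict
data['animal'] = data['food'].str.lower().map(meat_to_animal)
print(data['animal'])
```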
Replacing values
replace()
In [26]: re = pd.Series([1, -9999, -9999, 2, 3, 4, 5, -1000, 0])

In [27]: re
Out[27]:
0       1
1   -9999
2   -9999
3       2
4       3
5       4
6       5
7   -1000
8       0
dtype: int64

In [28]: re.replace(-9999, np.nan)  # replace a single value
Out[28]:
0       1.0
1       NaN
2       NaN
3       2.0
4       3.0
5       4.0
6       5.0
7   -1000.0
8       0.0
dtype: float64

In [29]: re.replace([-9999, -1000], np.nan)  # replace several values at once
Out[29]:
0    1.0
1    NaN
2    NaN
3    2.0
4    3.0
5    4.0
6    5.0
7    NaN
8    0.0
dtype: float64

In [30]: re.replace([-9999, -1000], [np.nan, 0])  # a list of replacements matching the values
Out[30]:
0    1.0
1    NaN
2    NaN
3    2.0
4    3.0
5    4.0
6    5.0
7    0.0
8    0.0
dtype: float64

In [32]: re.replace({-9999: np.nan, -1000: 0})  # the argument can also be a dict
Out[32]:
0    1.0
1    NaN
2    NaN
3    2.0
4    3.0
5    4.0
6    5.0
7    0.0
8    0.0
dtype: float64
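replace also works on a whole DataFrame, not just a Series; a quick sketch with made-up sentinel values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, -9999, 3], 'b': [-1000, 5, -9999]})

# a dict maps each sentinel to its replacement, applied across all columns
cleaned = df.replace({-9999: np.nan, -1000: 0})
print(cleaned)
```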
Renaming axis indexes
rename() creates a copy of the data; pass inplace=True to modify the data in place instead
In [41]: data = pd.DataFrame(np.arange(6).reshape((2, 3)),
    ...:                     index=pd.Index(['Oh', 'Co'], name='state'),
    ...:                     columns=pd.Index(['one', 'two', 'three'], name='number'))
In [42]: data.rename(index=str.title,columns=str.upper)
Out[42]:
number ONE TWO THREE
state
Oh 0 1 2
Co 3 4 5
In [43]: data.rename(index={'Co': 'sx'}, columns={'one': 'first'})  # pass dicts to rename only some labels
Out[43]:
number  first  two  three
state
Oh          0    1      2
sx          3    4      5
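Since rename returns a copy by default, here is a sketch of the inplace=True form mentioned above:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Oh', 'Co'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

# inplace=True modifies the frame directly instead of returning a renamed copy
data.rename(index={'Co': 'sx'}, inplace=True)
print(data.index.tolist())
```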
Discretization and binning
cut()
In [45]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [47]: bins = [18, 25, 35, 60, 100]

In [48]: cats = pd.cut(ages, bins)  # which side of each interval is open can be changed: pd.cut(ages, bins, right=False) makes the intervals closed on the left and open on the right

In [49]: cats  # the result is a special Categorical object
Out[49]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
You can also assign names to the bins:
In [56]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [57]: pd.cut(ages, bins, labels=group_names)
Out[57]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
If you pass a number of bins instead of explicit bin edges, cut will compute equal-length bins automatically:
In [58]: data = np.random.rand(20)
In [59]: pd.cut(data, 4, precision=2)  # four bins, with two decimal digits of precision
Out[59]:
[(0.29, 0.52], (0.75, 0.98], (0.75, 0.98], (0.057, 0.29], (0.29, 0.52], ...,
(0.75, 0.98], (0.75, 0.98], (0.75, 0.98], (0.057, 0.29], (0.29, 0.52]]
Length: 20
Categories (4, interval[float64]): [(0.057, 0.29] < (0.29, 0.52] < (0.52, 0.75] < (0.75, 0.98]]
qcut is similar to cut, but bins the data based on sample quantiles. Depending on how the data is distributed, cut will usually not put the same number of data points in each bin; since qcut uses sample quantiles, it yields bins of roughly equal size.
qcut is otherwise used the same way as cut.
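A quick qcut sketch to make the quantile behaviour concrete (the data and seed here are made up for reproducibility):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the run is reproducible
data = np.random.randn(1000)

# cut into quartiles: the bin edges come from the sample quantiles,
# so each bin holds (roughly) the same number of points
cats = pd.qcut(data, 4)
print(pd.Series(cats).value_counts())
```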
Permutation and random sampling
Below we randomly select some rows of a DataFrame: generate random row numbers, then use them to pick out rows.
The numpy.random.permutation() function produces a random reordering.

In [67]: df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

In [68]: ran = np.random.permutation(5)

In [70]: ran
Out[70]: array([2, 3, 0, 1, 4])

In [71]: df
Out[71]:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

In [72]: df.take(ran)
Out[72]:
    0   1   2   3
2   8   9  10  11
3  12  13  14  15
0   0   1   2   3
1   4   5   6   7
4  16  17  18  19
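pandas also provides this directly: DataFrame.sample selects random rows without building a permutation by hand; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

# n rows without replacement; pass replace=True to allow repeats
sampled = df.sample(n=3, random_state=0)
print(sampled)
```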
Computing indicator/dummy variables
Converting a categorical variable into a "dummy matrix" or "indicator matrix": if a DataFrame column has k distinct values, you can derive a k-column matrix or DataFrame of 1s and 0s.
In [74]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
    ...:                    'data1': range(6)})

In [75]: pd.get_dummies(df['key'])
Out[75]:
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
You can add a prefix to the columns of the indicator DataFrame:
In [76]: dummies = pd.get_dummies(df['key'], prefix='key')

In [77]: dummies
Out[77]:
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0
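A typical follow-up is to join the indicator columns back onto the rest of the frame; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
dummies = pd.get_dummies(df['key'], prefix='key')

# keep data1 and attach one indicator column per distinct key value
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy)
```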
While we are at it, have a look at the following example:
In [80]: df['data1']
Out[80]:
0    0
1    1
2    2
3    3
4    4
5    5
Name: data1, dtype: int64

In [81]: type(df['data1'])
Out[81]: pandas.core.series.Series

In [82]: df[['data1']]
Out[82]:
   data1
0      0
1      1
2      2
3      3
4      4
5      5

In [83]: type(df[['data1']])
Out[83]: pandas.core.frame.DataFrame
df['data1'] gives a Series, while df[['data1']] gives a DataFrame.
String manipulation
Python has simple, easy-to-use string and text processing features: most text operations are built-in methods on string objects, and regular expressions are available on top of that. pandas strengthens this further by letting you apply string and regular expression operations to whole arrays of data, while also handling the annoyance of missing values.
String object methods
A few simple examples:
In [87]: zifuchuan = ' i can be a can, i do not balabala'  # note the leading space
In [88]: sp = zifuchuan.split(',')
In [89]: sp
Out[89]: [' i can be a can', ' i do not balabala']
In [90]: ':::'.join(sp)
Out[90]: ' i can be a can::: i do not balabala'
In [91]: zifuchuan.index('can')
Out[91]: 3
In [92]: zifuchuan.index('i')
Out[92]: 1
In [94]: zifuchuan.count('can')
Out[94]: 2
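A few more built-in string methods worth knowing, sketched on the same string:

```python
zifuchuan = ' i can be a can, i do not balabala'

# strip removes surrounding whitespace
print(zifuchuan.strip())

# replace substitutes every occurrence of a substring
print(zifuchuan.replace('can', 'CAN'))

# the in operator is often clearer than index() for a membership test
print('can' in zifuchuan)
```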
Regular expressions
Regular expressions (regex) provide a flexible way to search for and match string patterns in text.
Regular expressions in Python are handled by the re module, whose functions fall into three categories: pattern matching, substitution, and splitting.
A summary of regular expressions (in Chinese): http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
Here are a few simple examples; a fuller write-up on regular expressions will have to wait for another time.
In [119]: import re  # first import Python's re module

In [120]: text = 'I love\t you'  # note: no trailing space here

In [121]: re.split(r'\s+', text)
Out[121]: ['I', 'love', 'you']

In [122]: text = 'I love\t you '  # this one ends with a space, so splitting also yields a trailing empty string

In [123]: re.split(r'\s+', text)
Out[123]: ['I', 'love', 'you', '']
In the example above, the regular expression is first compiled, and then its split method is called on text.
You can also write it like this:
In [125]: patten = re.compile(r'\s+')  # compile first, obtaining a reusable regex object
In [126]: patten.split(text)
Out[126]: ['I', 'love', 'you', '']
findall, search, match, sub
In [132]: text = """Dave [email protected]
     ...: Steve [email protected]
     ...: Rob [email protected]
     ...: Ryan [email protected]
     ...: """

In [133]: patten = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'  # matches an email address

In [134]: regex = re.compile(patten, flags=re.IGNORECASE)  # compile first, ignoring case

In [135]: regex.findall(text)  # returns every match of the pattern
Out[135]: ['[email protected]', '[email protected]', '[email protected]', '[email protected]']

In [137]: m = regex.search(text)  # returns only the first match

In [138]: m
Out[138]: <_sre.SRE_Match object; span=(5, 20), match='[email protected]'>

In [141]: text[m.start():m.end()]
Out[141]: '[email protected]'

In [144]: m.string  # the original string that was searched
Out[144]: 'Dave [email protected]\nSteve [email protected]\nRob [email protected]\nRyan [email protected]\n'

In [143]: print(regex.match(text))  # match only looks at the beginning of the string; here it starts with 'Dave', so nothing matches and None is returned
None
In [147]: print(regex.sub('replace', text))  # replace every match of the pattern
Dave replace
Steve replace
Rob replace
Ryan replace
Grouping the matched pattern:
In [148]: patten = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [149]: regex = re.compile(patten, flags=re.IGNORECASE)

In [153]: regex.findall(text)
Out[153]:
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [155]: s = regex.search(text)

In [156]: s
Out[156]: <_sre.SRE_Match object; span=(5, 20), match='[email protected]'>

In [157]: s.groups()
Out[157]: ('dave', 'google', 'com')
Giving names to the matched groups:
In [163]: regex = re.compile(r"""
     ...:     (?P<username>[A-Z0-9._%+-]+)
     ...:     @(?P<domain>[A-Z0-9.-]+)
     ...:     \.(?P<suffix>[A-Z]{2,4})""",
     ...:     flags=re.IGNORECASE | re.VERBOSE)

In [164]: m = regex.match('[email protected]')

In [165]: m.groupdict()
Out[165]: {'domain': 'bright', 'suffix': 'net', 'username': 'wesm'}

In [171]: f = regex.search(text)

In [172]: f.group('username')
Out[172]: 'dave'

In [173]: f = regex.findall(text)

In [174]: f
Out[174]:
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
Vectorized string functions in pandas
In [176]: series = pd.Series({'Dave': '[email protected]', 'Steve': '[email protected]',
     ...:                     'Rob': '[email protected]', 'Wes': np.nan})

In [177]: series
Out[177]:
Dave     [email protected]
Rob        [email protected]
Steve    [email protected]
Wes                  NaN
dtype: object
Through the Series str attribute you can operate on the contents of the Series:
In [178]: series.str.contains('rob')
Out[178]:
Dave     False
Rob       True
Steve    False
Wes        NaN
dtype: object

In [179]: patten = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [180]: series.str.findall(patten, re.IGNORECASE)
Out[180]:
Dave    [(dave, google, com)]
Rob       [(rob, gmail, com)]
...
The map function raises an error when it hits an NA value:
In [199]: matches = series.str.upper()

In [200]: matches
Out[200]:
Dave     [email protected]
Rob        [email protected]
Steve    [email protected]
Wes                  NaN
dtype: object

In [202]: series.map(str.upper)
---------------------------------------------------------------
TypeError                     Traceback (most recent call last)
<ipython-input-202-029ad3593723> in <module>()
----> 1 series.map(str.upper)
...
TypeError: descriptor 'upper' requires a 'str' object but received a 'float'