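[PYTHON] Core Programming Notes (15. Python Regular Expressions)

Core note: searching versus matching

15.1.1 Your first regular expression:

15.2 Special symbols and characters used in regular expressions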

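Common regular expression symbols and special characters:

Notation       Description       Example

literal        matches the literal string value       foo

re1|re2        matches regular expression re1 or re2       foo|bar

.              matches any character (except newline)       b.b

^              matches the start of the string       ^Dear

$              matches the end of the string       /bin/*sh$

*              matches zero or more occurrences of the preceding regex       [A-Za-z0-9]*

+              matches one or more occurrences of the preceding regex       [a-z]+\.com

?              matches zero or one occurrence of the preceding regex       goo?

{N}            matches exactly N occurrences of the preceding regex       [0-9]{3}

{M,N}          matches from M to N occurrences of the preceding regex       [0-9]{5,9}

[...]          matches any single character from the character class       [aeiou]

[..x-y..]      matches any single character in the range from x to y       [0-9], [A-Za-z]

[^...]         matches any single character NOT in the class, including any ranges it contains       [^aeiou], [^A-Za-z0-9_]

(*|+|?|{})?    "non-greedy" version of any of the repetition symbols above       .*?[a-z]

(...)          matches the enclosed regex and saves it as a subgroup       ([0-9]{3})?, f(oo|u)bar

\d             matches any decimal digit, same as [0-9] (\D is the inverse: any non-digit)       data\d+.txt

\w             matches any alphanumeric character, same as [A-Za-z0-9_] (\W is the inverse)       [A-Za-z_]\w+

\s             matches any whitespace character, same as [ \n\t\r\v\f] (\S is the inverse)       of\sthe

\b             matches a word boundary (\B is the inverse)       \bThe\b

\nn            matches the saved subgroup nn       price: \16

\c             matches the special character c literally (removes its special meaning)       \., \\, \*

\A (\Z)        matches the start (end) of the string       \ADear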

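15.2.1 Matching more than one regular expression pattern with the pipe symbol (|)

The pipe symbol selects one of the several regular expressions it separates.

Regex pattern      Strings matched

at|home            at, home

r2d2|c3po          r2d2, c3po

bat|bet|bit        bat, bet, bit

The pipe symbol is what lets a single regular expression match more than one string.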

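15.2.2 Matching any single character (.)

Regex pattern      Strings matched

f.o                any character between "f" and "o", such as fao, f0o, f#o

..                 any two characters

.end               any single character followed by the string "end"

Note: how do you match a literal dot or period?

Answer: escape it with a preceding backslash (\.).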

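15.2.3 Matching from the start or end of a string, or at word boundaries (^ / $ / \b / \B)

Regex pattern      Strings matched

^From              any string that starts with From

/bin/tcsh$         any string that ends with /bin/tcsh

^Subject: hi$      any string consisting solely of Subject: hi

.*\$$              any string that ends with a dollar sign ($)

the                any string containing "the"

\bthe              any word that starts with "the"

\bthe\b            only the word "the"

\Bthe              any string that contains "the" where it does not begin a word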

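15.2.4 Creating character classes ([])

Regex pattern      Strings matched

b[aeiu]t           bat, bet, bit, but

[cr][23][dp][o2]   a four-character string: the first character is "c" or "r", the second is "2" or "3", and so on

For example: c2do, r3p2, r2d2, c3po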

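15.2.5 Denoting ranges (-) and negation (^)

Regex pattern      Characters matched

z.[0-9]            "z", followed by any single character, followed by a single decimal digit

[r-u][env-y][us]   "r", "s", "t" or "u", followed by "e", "n", "v", "w", "x" or "y", followed by "u" or "s"

[^aeiou]           a single non-vowel character

[^\t\n]            any single character other than a tab or a newline

["-a]              any single character whose ordinal value lies between '"' and "a", that is, between 34 and 97 in ASCII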
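As a quick check of the range and negation rows above, here is a minimal Python 3 sketch. The variable names are made up for illustration, and re.fullmatch() is used only as a convenient way to test single characters.

```python
import re

# Every ASCII character accepted by the range ["-a]
in_range = [c for c in map(chr, range(128)) if re.fullmatch(r'["-a]', c)]
first, last = ord(in_range[0]), ord(in_range[-1])

# Negated class: keep only the non-vowel characters of a word
consonants = re.findall(r"[^aeiou]", "regex")

print(first, last, len(in_range))
print(consonants)
```

The ordinal values printed for the first and last characters should line up with the 34 to 97 range stated in the table.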

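15.2.6 Multiple occurrence / repetition using the closure operators (*, +, ?, {})

[dn]ot?            "d" or "n", followed by "o", followed by at most one "t": do, no, dot, not

0?[1-9]            any single digit from 1 to 9, optionally preceded by a "0"

[0-9]{15,16}       fifteen or sixteen digits, for example a credit card number

</?[^>]+>          any string that looks like an HTML tag, valid or not

[KQRBNP][a-h][1-8]-[a-h][1-8]

A chess move in "long algebraic" notation (legal or not): one of the letters K, Q, R, B, N or P, followed by two board coordinates from "a1" to "h8" joined by a hyphen. The first coordinate is the square the piece moves from, the second is the square it moves to.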
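The descriptions above can be checked with a short Python 3 sketch. The candidate strings and the helper full_matches() are hypothetical, chosen only to exercise three of the patterns; re.fullmatch() requires the whole candidate to match.

```python
import re

# Patterns from the table above, each with a few candidate strings
samples = {
    r"[dn]ot?": ["do", "no", "dot", "not", "dog"],
    r"0?[1-9]": ["7", "07", "0"],
    r"[KQRBNP][a-h][1-8]-[a-h][1-8]": ["Ke1-e2", "Xe1-e2"],
}

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

results = {p: full_matches(p, c) for p, c in samples.items()}
for pattern, matched in results.items():
    print(pattern, "->", matched)
```

Note that "0" alone is rejected by 0?[1-9]: the optional zero may be absent, but a digit from 1 to 9 is still required.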

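15.2.7 Special characters representing character sets

Regex pattern          Strings matched

\w+-\d+                an alphanumeric string and a string of at least one digit, joined by a hyphen

[A-Za-z]\w*            the first character is a letter; the remaining characters, if any, are alphanumeric

\d{3}-\d{3}-\d{4}      a telephone number with an area code prefix, for example 800-555-1212

\w+@\w+\.com           a simple e-mail address of the form [email protected]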

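15.2.8 Forming groups with parentheses (())

A pair of parentheses used with a regular expression can do either of the following:

group regular expressions

match subgroups

Regex pattern          Strings matched

\d+(\.\d*)?            one or more decimal digits, optionally followed by a decimal point and zero or more decimal digits

(Mr?s?\.)?[A-Z][a-z]* [A-Za-z-]+       an optional title, then a first name with a capital initial and lowercase letters for the rest, then a last name that may contain capital letters and hyphens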
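To see what "saving a subgroup" means in practice, here is a minimal Python 3 sketch built on the \d+(\.\d*)? pattern from the table. The variable names are illustrative only.

```python
import re

number = re.compile(r"\d+(\.\d*)?")

m = number.match("3.14")
whole, fraction = m.group(), m.group(1)   # entire match and saved subgroup

m2 = number.match("42")
no_fraction = m2.group(1)                 # the optional group did not take part

print(whole, fraction, no_fraction)
```

When an optional group does not participate in the match, group(1) returns None rather than an empty string, which is worth checking before using the value.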

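15.3 Regular expressions and the Python language

15.3.1 The re module: core functions and methods

We look at the two main functions/methods, match() and search(), together with the compile() function.

Common regular expression functions and methods:

Function/method       Description

re module function only

compile(pattern, flags=0)       compiles the regular expression pattern, with optional flags, and returns a regex object

re module functions and regex object methods

match(pattern, string, flags=0)       tries to match pattern at the start of string, with optional flags; returns a match object on success, None otherwise

search(pattern, string, flags=0)       looks for the first occurrence of pattern anywhere in string, with optional flags; returns a match object on success, None otherwise

findall(pattern, string[, flags])       finds all non-overlapping occurrences of pattern in string; returns a list of the matched strings

finditer(pattern, string[, flags])       same as findall(), but returns an iterator instead of a list; for each match the iterator yields a match object

split(pattern, string, maxsplit=0)       splits string into a list using the delimiter described by pattern, splitting at most maxsplit times (the default, 0, splits at every occurrence)

sub(pattern, repl, string, count=0)       replaces the occurrences of pattern in string with repl; if count is not given, every occurrence is replaced

Match object methods

group(num=0)       returns the entire match, or the subgroup numbered num

groups()       returns a tuple containing all matched subgroups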

15.3.2 Compiling regular expressions with compile()
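A minimal Python 3 sketch of the usual reason to call compile(): the pattern is compiled once and the resulting regex object is reused for many strings. The phone pattern comes from section 15.2.7; the sample lines are made up.

```python
import re

# Compile once, then reuse the regex object's own search() method
phone = re.compile(r"\d{3}-\d{3}-\d{4}")

lines = ["call 800-555-1212 now", "no number here", "fax: 555-123-4567"]

found = []
for line in lines:
    m = phone.search(line)
    if m is not None:          # search() returns None when nothing matches
        found.append(m.group())

print(found)
```

The module-level functions such as re.search() accept the same pattern as a string, so precompiling is a convenience and a possible speed-up rather than a requirement.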

15.3.3 Match objects and the group() and groups() methods

15.3.4 Matching strings with match()

>>> import re

>>> m = re.match('foo','foo')

>>> if m is not None:

...     m.group()

...

'foo'

m is an instance of a match object:

>>> m

<_sre.SRE_Match object at 0x7f407788a030>

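When a match fails, match() returns None:

>>> m = re.match('foo','bar')

>>> if m is not None:

...     m.group()

...

The match failed, so m is assigned None and nothing is printed.

The pattern 'foo' does find a match in the string "food on the table", because the pattern matches at the beginning of that string: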

>>> m = re.match('foo','food on the table')

>>> m.group()

'foo'

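Because Python is object-oriented, we can skip saving the intermediate match object and chain the calls to get the final result directly: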

>>> re.match('foo','food on the table').group()

'foo'

Note: if the match fails in the chained form above, re.match() returns None and the call to .group() raises an AttributeError.
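One way to keep the convenience of the chained form without risking the exception is a small wrapper. This is only a Python 3 sketch; first_match() is a hypothetical helper, not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None when re.match() fails."""
    m = re.match(pattern, text)
    return m.group() if m is not None else None

ok = first_match("foo", "food on the table")
missing = first_match("foo", "bar")

# The chained form on a failed match raises AttributeError
try:
    re.match("foo", "bar").group()
    raised = False
except AttributeError:
    raised = True

print(ok, missing, raised)
```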

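15.3.5 Looking for a pattern within a string with search() (searching versus matching)

The difference between match() and search() is that match() only attempts to match at the start of the string, whereas search() scans the string and returns the first place where the pattern matches.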

>>> m = re.match('foo','seafood')

>>> if m is not None: m.group()

...

>>> m = re.search('foo','seafood')

>>> if m is not None: m.group()

...

'foo'

15.3.6 Matching more than one string (|)

>>> bt = 'bat|bet|bit'

>>> m = re.match(bt,'bat')

>>> if m is not None: m.group()

...

'bat'

>>> m =re.match(bt,'blt')

>>> if m is not None: m.group()

...

>>> m = re.match(bt,'He bit me!')

>>> if m is not None:m.group()

...

>>> m = re.search(bt,'He bit me!')

>>> if m is not None: m.group()

...

'bit'

15.3.7 Matching any single character (.)

>>> anyend = '.end'

>>> m = re.match(anyend,'bend')  # dot matches 'b'

>>> if m is not None: m.group()

...

'bend'

>>> m = re.match(anyend,'end')  # no character for the dot to match

>>> if m is not None:m.group()

...

>>> m = re.match(anyend, '\nend')  # dot matches any character except \n

>>> if m is not None:m.group()

...

>>> m = re.search('.end','The end.')  # dot matches ' '

>>> if m is not None:m.group()

...

' end'

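To match a literal dot in a regular expression, escape it with a backslash so that it loses its special meaning:

>>> patt314 = '3.14'  # regex dot: matches any character

>>> pi_patt = '3\.14'  # literal dot (decimal point)

>>> m = re.match(pi_patt,'3.14')  # exact match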

>>> if m is not None: m.group()

...

'3.14'

>>> m = re.match(patt314,'3014')  # dot matches '0'

>>> if m is not None: m.group()

...

'3014'

>>> m = re.match(patt314,'3.14')  # dot matches '.'

>>> if m is not None:m.group()

...

'3.14'

15.3.8 Creating character classes ([])

>>> m = re.match('[cr][23][dp][o2]','c3po')

>>> if m is not None:m.group()

...

'c3po'

>>> m = re.match('[cr][23][dp][o2]','c2do')

>>> if m is not None: m.group()

...

'c2do'

>>> m = re.match('r2d2|c3po','c2do')

>>> if m is not None: m.group()

...

>>> m = re.match('r2d2|c3po','r2d2')

>>> if m is not None: m.group()

...

'r2d2'

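15.3.9 Repetition, special characters, and grouping

An e-mail address pattern that allows an optional hostname in front of the domain name: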

>>> patt = '\w+@(\w+\.)?\w+\.com'

>>> re.match(patt,'[email protected]').group()

'[email protected]'

>>> re.match(patt,'[email protected]').group()

'[email protected]'

Allowing any number of subdomains:

>>> patt = '\w+@(\w+\.)*\w+\.com'

>>> re.match(patt,'[email protected]').group()

'[email protected]'

>>> import re

>>> m = re.match('\w\w\w-\d\d\d','abc-123')

>>> if m is not None:m.group()

...

'abc-123'


>>> m = re.match('\w\w\w-\d\d\d','abc-xyz')

>>> if m is not None: m.group()

...

Add parentheses so that the alphanumeric part and the digits-only part can be extracted separately:

>>> m = re.match('(\w\w\w)-(\d\d\d)','abc-123')

>>> m.group()

'abc-123'

>>> m.group(1)

'abc'

>>> m.group(2)

'123'

>>> m.groups()

('abc', '123')

Different combinations of subgroups:

>>> m = re.match('ab','ab')  # no subgroups

>>> m.group()  # entire match

'ab'

>>> m.groups()  # all subgroups

()

>>> m = re.match('(ab)','ab')  # one subgroup

>>> m.group()  # entire match

'ab'

>>> m.group(1)  # subgroup 1

'ab'

>>> m.groups()  # all subgroups

('ab',)

>>> m = re.match('(a)(b)','ab')  # two subgroups

>>> m.group()  # entire match

'ab'

>>> m.group(1)  # subgroup 1

'a'

>>> m.group(2)  # subgroup 2

'b'

>>> m.groups()  # tuple of all subgroups

('a', 'b')

>>> m = re.match('(a(b))','ab')  # two subgroups, one nested in the other

>>> m.group()  # entire match

'ab'

>>> m.group(1)  # subgroup 1

'ab'

>>> m.group(2)  # subgroup 2

'b'

>>> m.groups()  # tuple of all subgroups

('ab', 'b')

15.3.10 Matching from the start or end of a string and on word boundaries

>>> m = re.search('The','The end.')

>>> if m is not None: m.group()

...

'The'

>>> m = re.search('The','end. The')

>>> if m is not None: m.group()

...

'The'

>>> m = re.search(r'\bthe','bite the dog')

>>> if m is not None: m.group()

...

'the'

>>> m = re.search(r'\bthe','bitethe dog')

>>> if m is not None: m.group()

...

>>> m = re.search(r'\Bthe','bitethe dog')

>>> if m is not None: m.group()

...

'the'

15.3.11 Finding every occurrence with findall()

findall() returns a list of all non-overlapping matches; if it finds no match, it returns an empty list.

>>> re.findall('car','car')

['car']

>>> re.findall('car','scary')

['car']

>>> re.findall('car','carry the barcardi to the car')

['car', 'car', 'car']
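Two related behaviours are easy to overlook: finditer() yields match objects, so positions are available, and findall() returns tuples when the pattern contains more than one group. A minimal Python 3 sketch, with the second sample string made up for illustration:

```python
import re

text = "carry the barcardi to the car"

plain = re.findall("car", text)                          # list of matched strings
starts = [m.start() for m in re.finditer("car", text)]   # where each match begins

# With two groups in the pattern, each result is a tuple of the groups
pairs = re.findall(r"(\w+)-(\d+)", "abc-123 xyz-789")

print(plain, starts, pairs)
```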

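15.3.12 Searching and replacing with sub() [and subn()]

sub() and subn() replace every part of a string that matches the regex pattern. The replacement is usually a string, but it can also be a function that returns the replacement string. subn() works like sub() but returns a tuple holding the new string and the number of substitutions made.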

>>> re.sub('X','Mr.Smith','attn: X\n\nDear X,\n')

'attn: Mr.Smith\n\nDear Mr.Smith,\n'

>>> re.subn('X','Mr.Smith','attn: X\n\nDear X,\n')

('attn: Mr.Smith\n\nDear Mr.Smith,\n', 2)

>>> print re.sub('X','Mr.Smith','attn:X\n\nDear X,\n')

attn:Mr.Smith

Dear Mr.Smith,

>>> re.sub('[ae]','X','abcdef')

'XbcdXf'

>>> re.subn('[ae]','X','abcdef')

('XbcdXf', 2)
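The examples above all use a string as the replacement. As a minimal Python 3 sketch of the function form, the hypothetical callback double() below receives each match object and returns the text to substitute; the count keyword limits how many replacements are made.

```python
import re

def double(match):
    """Replacement callback: takes a match object, returns a string."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# count=1 replaces only the first occurrence
limited = re.sub("[ae]", "X", "abcdef", count=1)

print(doubled)
print(limited)
```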

15.3.13 Splitting with split()

>>> re.split(':','str1:str2:str3')

['str1', 'str2', 'str3']

Example: saving users' login information (login name, the tty each user logged in on, login time and address)

# who > whodata.txt

# vi rewho.py

----------------------------------

#!/usr/bin/env python

import re

f = open('whodata.txt','r')

for eachLine in f.readlines():

    # split each line on runs of two or more whitespace characters
    print re.split(r'\s\s+',eachLine)

f.close()

-----------------------------------

# python rewho.py

['root', 'tty1', '2013-12-03 00:33\n']

['root', 'pts/1', '2013-12-04 04:06 (192.168.8.2)\n']

Example: splitting the output of the Unix who command (rewho1.py)

# vi rewho1.py

-------------------------------

#!/usr/bin/env python

from os import popen

from re import split

f = popen('who','r')

for eachLine in f.readlines():

    # split on two or more whitespace characters, or on a single tab
    print split(r'\s\s+|\t',eachLine.strip())

f.close()

-------------------------------
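The same splitting logic can be tried without running who. The Python 3 sketch below uses two hard-coded lines whose column spacing is an assumption modelled on the output shown earlier, and also shows the maxsplit argument.

```python
import re

# Assumed sample of `who` output; the exact column widths are made up
sample = [
    "root     tty1         2013-12-03 00:33\n",
    "root     pts/1        2013-12-04 04:06 (192.168.8.2)\n",
]

# Split on two or more whitespace characters, or on a single tab
fields = [re.split(r"\s\s+|\t", line.strip()) for line in sample]

# maxsplit=1 stops after the first delimiter
limited = re.split(":", "str1:str2:str3", maxsplit=1)

for row in fields:
    print(row)
print(limited)
```

Because single spaces are not delimiters, the date, the time and the address stay together in the last field.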

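The following example shows the difference between the backspace character "\b" and the regular expression special symbol "\b", with and without raw strings:

>>> import re

>>> m = re.match('\bblow','blow')  # backspace character, no match

>>> if m is not None: m.group()

...

>>> m = re.match('\\bblow','blow')  # backslash escaped, now it matches

>>> if m is not None: m.group()

...

'blow'

>>> m = re.match(r'\bblow','blow')  # use a raw string instead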

>>> if m is not None: m.group()

...

'blow'
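The reason the first attempt fails becomes clearer when the two string literals are inspected directly. A minimal Python 3 sketch, with illustrative variable names:

```python
import re

plain = "\bblow"    # \b is turned into the backspace character (0x08)
raw = r"\bblow"     # backslash followed by 'b': the regex word-boundary symbol

lengths = (len(plain), len(raw))
is_backspace = plain[0] == "\x08"

no_match = re.match(plain, "blow") is None
matches = re.match(raw, "blow") is not None

print(lengths, is_backspace, no_match, matches)
```

The plain literal is one character shorter because Python has already replaced the two-character escape with a single backspace before the re module ever sees the pattern.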

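15.4 Regular expression example:

Example: data generator for regular expression practice

Generates random data for practicing regular expressions and appends it to the file redata.txt

# vi gendata.py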

----------------------------------

#!/usr/bin/env python

from random import randint, choice, randrange

from string import lowercase

from sys import maxint

from time import ctime

doms = ("com","edu", "net","org","gov")

fobj = open("redata.txt","a+")

for i in range(randint(5,10)):

    # pick a random timestamp (seconds since the epoch)
    #dtint = randint(0, maxint - 1)

    dtint = randrange(2**32)

    dtstr = ctime(dtint)

    # login name: 4 to 7 random lowercase letters
    shorter = randint(4, 7)

    em = ""

    for j in range(shorter):

        em += choice(lowercase)

    # domain name: at least as long as the login name, at most 12 letters
    longer = randint(shorter, 12)

    dn = ""

    for j in range(longer):

        dn += choice(lowercase)

    fobj.write("%s::%s@%s.%s::%d-%d-%d\n" % (dtstr, em, dn, choice(doms), dtint,

        shorter, longer))

fobj.close()

----------------------------------
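The script above targets Python 2: string.lowercase and sys.maxint no longer exist in Python 3. A rough Python 3 equivalent might look like the sketch below. It differs from the original in three labelled ways: it builds the lines in memory instead of appending to redata.txt, it keeps timestamps below 2**31 as a portability assumption, and the helper name make_line() is made up.

```python
from random import choice, randint, randrange
from string import ascii_lowercase
from time import ctime

DOMS = ("com", "edu", "net", "org", "gov")

def make_line():
    """Build one record: timestamp::login@domain.tld::seconds-loginlen-domainlen."""
    dtint = randrange(2**31)      # assumption: stay below 2**31 for portability
    shorter = randint(4, 7)       # login name length
    em = "".join(choice(ascii_lowercase) for _ in range(shorter))
    longer = randint(shorter, 12)  # domain name length
    dn = "".join(choice(ascii_lowercase) for _ in range(longer))
    return "%s::%s@%s.%s::%d-%d-%d" % (
        ctime(dtint), em, dn, choice(DOMS), dtint, shorter, longer)

lines = [make_line() for _ in range(randint(5, 10))]
for line in lines:
    print(line)
```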

# python gendata.py

# cat redata.txt

-------------------------------------

Wed Dec 19 21:38:05 1984::[email protected]::472311485-5-9

Sun Nov 28 10:37:21 1971::[email protected]::60147441-5-9

Sat Mar 23 00:14:13 2080::[email protected]::3478349653-7-12

Fri Aug 9 18:53:53 1996::[email protected]::839588033-7-12

Thu Oct 3 12:32:07 2069::[email protected]::3148000327-5-10

Tue Sep 16 12:30:29 2070::[email protected]::3178067429-5-7

Tue May 10 23:45:35 2061::[email protected]::2882965535-5-7

Mon Dec 3 12:28:38 2007::[email protected]::1196656118-4-6

Fri May 29 04:24:41 2020::[email protected]::1590697481-6-7

Fri Sep 22 17:25:03 1972::[email protected]::86005503-4-11

Wed Nov 29 18:11:33 2006::[email protected]::1164795093-4-6

Sun Jun 19 13:34:50 1977::[email protected]::235550090-5-5

Sat Sep 12 11:44:45 2009::[email protected]::1252727085-5-6

Tue Dec 1 17:21:46 2093::[email protected]::3910497706-5-11

Thu Aug 19 14:08:40 2010::[email protected]::1282198120-4-6

Sun Nov 1 14:44:05 1998::[email protected]::909902645-4-8

Thu Sep 11 21:39:14 2059::[email protected]::2830513154-4-4

Wed Oct 22 13:01:19 2053::[email protected]::2644722079-4-5

Thu Apr 27 01:43:14 2051::[email protected]::2566143794-6-11

Sun Dec 14 00:15:46 2087::[email protected]::3722170546-6-11

Tue Sep 6 23:48:58 2061::[email protected]::2893247338-5-5

Sat Apr 24 14:28:42 2010::[email protected]::1272090522-4-10

Tue Apr 17 12:05:06 2096::[email protected]::3985473906-7-10

Mon Dec 27 13:32:20 2077::[email protected]::3407808740-4-6

Fri Jul 4 06:07:54 2098::[email protected]::4055263674-5-8

-------------------------------------

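15.4.1 Matching a string

In the first example we write a regular expression that extracts (only) the day-of-the-week field of the timestamp from each line of the file redata.txt. We use the following regex:

"^Mon|^Tue|^Wed|^Thu|^Fri|^Sat|^Sun"

or, with a single "^":

"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)"

Example: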

>>> import re

>>> data = 'Sat Apr 24 14:28:42 2010::[email protected]::1272090522-4-10'

>>> patt = "^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)"

>>> m = re.match(patt, data)

>>> m.group()

'Sat'

>>> m.group(1)

'Sat'

>>> m.groups()

('Sat',)

Example: require the string to start with three alphanumeric characters

>>> patt = '^(\w{3})'

>>> m =re.match(patt,data)

>>> if m is not None: m.group()

...

'Sat'

>>> m.group(1)

'Sat'

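If {3} is moved outside the parentheses, as in (\w){3}, the pattern means a single alphanumeric character repeated three times, and the subgroup keeps only the last character it matched: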

>>> patt = '^(\w){3}'

>>> m = re.match(patt, data)

>>> if m is not None:m.group()

...

'Sat'

>>> m.group(1)

't'
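The difference between the two placements shows up directly in groups(). A minimal Python 3 sketch; the shortened data string and the third pattern, which captures each character in its own group, are additions for illustration.

```python
import re

data = "Sat Apr 24 14:28:42 2010"

inside = re.match(r"^(\w{3})", data).groups()      # one group of three characters
outside = re.match(r"^(\w){3}", data).groups()     # one group, repeated three times
each = re.match(r"^(\w)(\w)(\w)", data).groups()   # three separate groups

print(inside, outside, each)
```

A repeated group is overwritten on every repetition, which is why only the final character survives in the second case.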

15.4.2 Searching versus matching, and "greedy" matching

Find three integers separated by hyphens (-):

>>> patt = '\d+-\d+-\d+'

>>> re.search(patt, data).group()

'1272090522-4-10'

A regular expression that matches the entire string, using ".+" to stand for any run of one or more characters:

>>> patt = '.+\d+-\d+-\d+'

>>> re.match(patt, data).group()

'Sat Apr 24 14:28:42 2010::[email protected]::1272090522-4-10'

Extract only the numeric field at the end of each line:

>>> patt = '.+(\d+-\d+-\d+)'

>>> re.match(patt, data).group(1)

'2-4-10'

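Regular expressions are greedy by default: evaluated from left to right, a wildcard pattern grabs the longest string that still lets the overall match succeed. Here ".+" consumes as many characters as possible from the start of the string, and because the first \d+ needs only one digit to succeed, it is left with just the rightmost "2" of 1272090522.

One solution is the "non-greedy" operator "?", which asks the regex engine to match as few characters as possible: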

>>> patt = '.+?(\d+-\d+-\d+)'

>>> re.match(patt, data).group(1)

'1272090522-4-10'
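The greedy and non-greedy results can be compared side by side. This Python 3 sketch uses a stand-in record whose e-mail address (abcd@efghijklmn.com) is hypothetical; the anchored third pattern is an alternative that avoids the wildcard altogether.

```python
import re

# Stand-in record; the e-mail address is made up for illustration
data = "Sat Apr 24 14:28:42 2010::abcd@efghijklmn.com::1272090522-4-10"

greedy = re.match(r".+(\d+-\d+-\d+)", data).group(1)    # .+ takes as much as it can
lazy = re.match(r".+?(\d+-\d+-\d+)", data).group(1)     # .+? takes as little as it can

# Alternative: anchor on the '::' separator and the end of the line
anchored = re.search(r"::(\d+-\d+-\d+)$", data).group(1)

print(greedy, lazy, anchored)
```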

Extract only the middle integer of the three integer fields:

>>> patt = '-(\d+)-'

>>> m = re.search(patt,data)

>>> m.group()

'-4-'

>>> m.group(1)

'4'
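If all three integers are needed rather than just the middle one, each can be given its own group. A minimal Python 3 sketch, again using a stand-in record with a hypothetical e-mail address; the variable names are illustrative.

```python
import re

# Stand-in record; the e-mail address is made up for illustration
data = "Sat Apr 24 14:28:42 2010::abcd@efghijklmn.com::1272090522-4-10"

m = re.search(r"(\d+)-(\d+)-(\d+)$", data)
seconds, login_len, domain_len = (int(g) for g in m.groups())

print(seconds, login_len, domain_len)
```

In the generated data the second and third integers record the lengths of the login name and the domain name, so they can double as a sanity check on the parsed address.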
