Chapter 7: Regular Expressions
Regex Substitution
import re
s = '100 BROAD ROAD APT. 3'
re.sub(r'\bROAD$', 'RD.', s) # '100 BROAD ROAD APT. 3'
re.sub(r'\bROAD\b', 'RD.', s) # '100 BROAD RD. APT. 3'
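- re.sub performs a substitution based on a regular expression.
- The r prefix marks the string as a raw string, so backslashes need no extra escaping; it is best to always use it when writing regular expressions.
- \b matches a word boundary, and $ matches the end of the string.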
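As a small sketch of the difference between the two anchors, the snippet below applies both patterns to two addresses. The variable names and the first address are illustrative assumptions, not part of the original notes.

```python
import re

# Illustrative inputs: one address ends in ROAD, the other has ROAD mid-string.
ends_with_road = '100 NORTH BROAD ROAD'
road_in_middle = '100 BROAD ROAD APT. 3'

# r'\bROAD$' only replaces ROAD as a whole word at the very end of the string.
end_only_1 = re.sub(r'\bROAD$', 'RD.', ends_with_road)
end_only_2 = re.sub(r'\bROAD$', 'RD.', road_in_middle)

# r'\bROAD\b' replaces ROAD as a whole word anywhere, but leaves BROAD alone.
anywhere = re.sub(r'\bROAD\b', 'RD.', road_in_middle)

print(end_only_1)
print(end_only_2)
print(anywhere)
```

In both cases BROAD is left untouched, because there is no word boundary between B and R.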
Regex Search
Validating the thousands place of a Roman numeral: M, MM, MMM, or empty
import re
pattern = '^M?M?M?$'
re.search(pattern, '') # <re.Match object; span=(0, 0), match=''>
re.search(pattern, 'MMMM') # None, so nothing is displayed
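Because re.search returns a match object on success and None on failure, the check can be wrapped in a small predicate. This is a minimal sketch; the function name is_valid_thousands is a hypothetical helper introduced only for illustration.

```python
import re

pattern = '^M?M?M?$'

def is_valid_thousands(text):
    """Hypothetical helper: True if text is '', 'M', 'MM', or 'MMM'."""
    # re.search returns a match object on success and None on failure.
    return re.search(pattern, text) is not None

for candidate in ['', 'M', 'MM', 'MMM', 'MMMM']:
    print(repr(candidate), is_valid_thousands(candidate))
```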
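Validating the hundreds place, which has the following possibilities:
- 100 = C
- 200 = CC
- 300 = CCC
- 400 = CD
- 500 = D
- 600 = DC
- 700 = DCC
- 800 = DCCC
- 900 = CM
So there are four possible patterns:
- CM
- CD
- Zero to three C characters (zero occurrences means the hundreds place is 0)
- D, followed by zero to three C characters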
pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
# pattern can be rewritten with the {n,m} syntax as
pattern2 = '^M{0,3}(CM|CD|D?C{0,3})$'
Adding the tens and ones places:
pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
pattern2 = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
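One way to gain confidence that the two spellings are equivalent is to run both against every numeral from 1 to 3999 plus a few malformed strings. The sketch below assumes a hypothetical helper, to_roman, written only for this check; the list of invalid samples is also an illustrative choice.

```python
import re

pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
pattern2 = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'

# Value/symbol pairs in descending order, including the subtractive forms.
ROMAN_PAIRS = [
    (1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
    (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
    (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I'),
]

def to_roman(n):
    """Hypothetical helper: convert an integer in 1..3999 to a Roman numeral."""
    parts = []
    for value, symbol in ROMAN_PAIRS:
        while n >= value:
            parts.append(symbol)
            n -= value
    return ''.join(parts)

# Every well-formed numeral should be accepted by both patterns.
numerals = [to_roman(n) for n in range(1, 4000)]
all_accepted = all(
    re.search(pattern, r) and re.search(pattern2, r) for r in numerals
)

# Malformed strings should be rejected by both patterns.
invalid = ['MMMM', 'IIII', 'VX', 'MCMC', 'IC']
none_accepted = not any(
    re.search(pattern, r) or re.search(pattern2, r) for r in invalid
)

print(all_accepted, none_accepted)
```

Note that both patterns also accept the empty string, since every component is optional.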
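Regular Expressions with Inline Comments
Use a verbose regular expression to add comments to a pattern; the whitespace, line breaks, and comments inside it are all ignored.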
pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
"""
When using it, pass re.VERBOSE to declare that it is a verbose regular expression.
import re
re.search(pattern, 'M', re.VERBOSE)
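To see why the flag matters, the following sketch runs the same pattern string with and without re.VERBOSE. The shortened comments, the compiled name roman, and the sample inputs are assumptions made for illustration.

```python
import re

pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 M's
    (CM|CD|D?C{0,3})    # hundreds
    (XC|XL|L?X{0,3})    # tens
    (IX|IV|V?I{0,3})    # ones
    $                   # end of string
    """

# Compile once with re.VERBOSE, then reuse the compiled object.
roman = re.compile(pattern, re.VERBOSE)
with_flag = [bool(roman.search(s)) for s in ['M', 'MCMLXXXIX', 'MMMM']]

# Without the flag, the whitespace and comments are treated as literal text,
# so the same pattern string no longer matches a plain numeral.
without_flag = re.search(pattern, 'M')

print(with_flag)
print(without_flag)
```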
Matching with a Regex and Extracting the Content
The formats to be recognized include:
- 800-555-1212
- 800 555 1212
- 800.555.1212
- (800) 555-1212
- 1-800-555-1212
- 800-555-1212-1234
- 800-555-1212x1234
- 800-555-1212 ext. 1234
- work 1-(800) 555.1212 #1234
First write a test loop, so that phonePattern can be revised repeatedly and the results re-checked.
import re
phone_numbers = [
    '800-555-1212', '800 555 1212', '800.555.1212', '(800) 555-1212',
    '1-800-555-1212', '800-555-1212-1234', '800-555-1212x1234',
    '800-555-1212 ext. 1234', 'work 1-(800) 555.1212 #1234',
]
phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')
for number in phone_numbers:
    match = phonePattern.search(number)  # search once and reuse the result
    # Print the captured groups on success, "failed" otherwise
    print(number, ":", match.groups() if match else "failed")
The result is
800-555-1212 : failed
800 555 1212 : failed
800.555.1212 : failed
(800) 555-1212 : failed
1-800-555-1212 : failed
800-555-1212-1234 : ('800', '555', '1212', '1234')
800-555-1212x1234 : ('800', '555', '1212', '1234')
800-555-1212 ext. 1234 : ('800', '555', '1212', '1234')
work 1-(800) 555.1212 #1234 : failed
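- Numbers without an extension cannot be handled
- Numbers without separators between the parts cannot be handled
Change the pattern to
phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
The result is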
800-555-1212 : ('800', '555', '1212', '')
800 555 1212 : ('800', '555', '1212', '')
800.555.1212 : ('800', '555', '1212', '')
(800) 555-1212 : failed
1-800-555-1212 : failed
800-555-1212-1234 : ('800', '555', '1212', '1234')
800-555-1212x1234 : ('800', '555', '1212', '1234')
800-555-1212 ext. 1234 : ('800', '555', '1212', '1234')
work 1-(800) 555.1212 #1234 : failed
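The effect of replacing + with * can be checked in isolation. A minimal sketch, where the names strict and relaxed and the separator-free input are illustrative assumptions:

```python
import re

# \D+ and \d+ require at least one character; \D* and \d* allow none.
strict = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')
relaxed = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

no_separators = '80055512121234'
no_extension = '800-555-1212'

results = {}
for text in (no_separators, no_extension):
    strict_match = strict.search(text)
    relaxed_match = relaxed.search(text)
    results[text] = (
        strict_match.groups() if strict_match else None,
        relaxed_match.groups() if relaxed_match else None,
    )
    print(text, results[text])
```

The strict pattern rejects both inputs, while the relaxed one accepts both and reports an empty string for the missing extension.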
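Numbers that begin with a non-digit character fail to match. Allow optional non-digits right after the start-of-string anchor by changing the pattern to
phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
The result is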
800-555-1212 : ('800', '555', '1212', '')
800 555 1212 : ('800', '555', '1212', '')
800.555.1212 : ('800', '555', '1212', '')
(800) 555-1212 : ('800', '555', '1212', '')
1-800-555-1212 : failed
800-555-1212-1234 : ('800', '555', '1212', '1234')
800-555-1212x1234 : ('800', '555', '1212', '1234')
800-555-1212 ext. 1234 : ('800', '555', '1212', '1234')
work 1-(800) 555.1212 #1234 : failed
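Because \D matches only non-digits, numbers with a leading 1 still fail to match. Whatever precedes the area code does not matter for the match, so drop the start-of-string anchor and the leading \D* altogether:
phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
The result is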
800-555-1212 : ('800', '555', '1212', '')
800 555 1212 : ('800', '555', '1212', '')
800.555.1212 : ('800', '555', '1212', '')
(800) 555-1212 : ('800', '555', '1212', '')
1-800-555-1212 : ('800', '555', '1212', '')
800-555-1212-1234 : ('800', '555', '1212', '1234')
800-555-1212x1234 : ('800', '555', '1212', '1234')
800-555-1212 ext. 1234 : ('800', '555', '1212', '1234')
work 1-(800) 555.1212 #1234 : ('800', '555', '1212', '1234')
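All of the formats now match.
Using the verbose regular expression style from earlier to add comments, the pattern can be written as: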
phonePattern = re.compile(r'''
            # don't match beginning of string, number can start anywhere
    (\d{3}) # area code is 3 digits (e.g. '800')
    \D*     # optional separator is any number of non-digits
    (\d{3}) # trunk is 3 digits (e.g. '555')
    \D*     # optional separator
    (\d{4}) # rest of number is 4 digits (e.g. '1212')
    \D*     # optional separator
    (\d*)   # extension is optional and can be any number of digits
    $       # end of string
''', re.VERBOSE)
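Once every format matches, the captured groups can be reassembled into one canonical form. A minimal sketch, assuming a hypothetical helper normalize_phone and an output format chosen only for illustration:

```python
import re

phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

def normalize_phone(text):
    """Hypothetical helper: return 'AAA-TTT-NNNN', plus ' x<ext>' if present."""
    match = phonePattern.search(text)
    if match is None:
        return None
    area, trunk, rest, extension = match.groups()
    number = f'{area}-{trunk}-{rest}'
    # The extension group is an empty string when no extension was given.
    return f'{number} x{extension}' if extension else number

print(normalize_phone('work 1-(800) 555.1212 #1234'))
print(normalize_phone('(800) 555-1212'))
print(normalize_phone('555-1212'))
```

The last input has too few digits for the pattern, so the helper returns None instead of a formatted number.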
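Summary of matching rules:
- \d matches a digit; \D matches any non-digit character
- + matches one or more; * matches zero or more; ? matches zero or one
- ^ matches the beginning of the string; $ matches the end of the string
- x{n,m} matches the character x at least n times and at most m times
- (a|b|c) matches exactly one of a, b, or c
- (x) in general denotes a remembered group. You can get its value by calling the groups() method of the object returned by re.search.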
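Several of these rules can be seen working together in one short pattern. The pattern and inputs below are made up purely for illustration and do not come from the examples above.

```python
import re

# (\d+)    remembered group: one or more digits
# \D?      an optional single non-digit
# (a|b|c)  remembered group: exactly one of a, b, or c
# x{2,3}   the character x, two or three times
# $        end of string
demo = re.compile(r'(\d+)\D?(a|b|c)x{2,3}$')

match = demo.search('42-bxx')
whole = match.group(0)      # the entire matched text
first = match.group(1)      # the first remembered group
captured = match.groups()   # all remembered groups as a tuple

too_few = demo.search('42-bx')      # only one x
too_many = demo.search('42-bxxxx')  # four x characters

print(whole, first, captured)
print(too_few, too_many)
```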