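I. Why learn regular expressions

Good: we can now write a program that fetches a website's source code, which brings us to our first problem: how do we find the information we need in that messy code? This is where regular expressions become necessary. There is a well-known joke that when you decide to solve a problem with regular expressions, you now have two problems. The joke shows how hard regular expressions can be to learn, but to write a good crawler we have to learn them.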

II. Syntax rules of regular expressions

Image reposted from http://cuiqingcai.com/tag/%E7%88%AC%E8%99%AB

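1. Some notes on regular expressions

(1) Greedy and non-greedy modes

Regular expressions are usually used to find matching strings in text. In Python, quantifiers are greedy by default (in a few languages the default may be non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and try to match as few characters as possible. For example, the regular expression "ab*" applied to "abbbc" finds "abbb", whereas the non-greedy form "ab*?" finds only "a".

Note: we generally use non-greedy mode for extraction.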
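The difference can be checked directly. The following is a minimal sketch; the HTML snippet in the second half is made up for illustration and is not taken from any real page.

```python
import re

text = "abbbc"

# Greedy: b* takes as many 'b' characters as it can.
greedy = re.search(r"ab*", text).group()

# Non-greedy: b*? takes as few 'b' characters as it can, here zero.
lazy = re.search(r"ab*?", text).group()

print(greedy)  # abbb
print(lazy)    # a

# A typical extraction case (hypothetical HTML fragment).
html = "<b>one</b><b>two</b>"
print(re.findall(r"<b>(.*)</b>", html))   # ['one</b><b>two']
print(re.findall(r"<b>(.*?)</b>", html))  # ['one', 'two']
```

The greedy version runs on to the last closing tag, which is why the non-greedy form is usually the one wanted when pulling fields out of page source.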

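(2) The backslash problem

As in most programming languages, regular expressions use "\" as the escape character, which can cause backslash trouble. Suppose you need to match a literal "\" in the text: the regular expression, written as an ordinary string in the programming language, needs four backslashes, "\\\\". Each pair is first turned into one backslash by the language's string escaping, and the resulting two backslashes are then interpreted by the regular expression engine as one literal backslash.

Python's raw strings solve this problem neatly: the regular expression in this example can be written as r"\\". Likewise, "\\d", which matches a digit, can be written as r"\d". With raw strings, the expressions you write are easier to read.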
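A small sketch of the two spellings side by side; the sample text and variable names are chosen only for illustration.

```python
import re

# The text to search contains one literal backslash: C:\temp
text = "C:\\temp"

# Ordinary string: four backslashes in the source become two in the pattern.
plain_pattern = "\\\\"
# Raw string: the two backslashes in the source are already the pattern.
raw_pattern = r"\\"

print(plain_pattern == raw_pattern)          # True
print(re.search(raw_pattern, text).group())  # \
print(re.findall(r"\d", "a1b22"))            # ['1', '2', '2']
```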

2. Python's re module

Python's re module provides regular expression support. The main functions are the following:

# returns a pattern object
re.compile(pattern[, flags])
# the following functions perform matching
re.match(pattern, string[, flags])
re.search(pattern, string[, flags])
re.split(pattern, string[, maxsplit])
re.findall(pattern, string[, flags])
re.finditer(pattern, string[, flags])
re.sub(pattern, repl, string[, count])
re.subn(pattern, repl, string[, count])

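Notice the pattern argument in the code above; let us introduce it. A pattern is a matching pattern, the most basic element of a regular expression: a sequence of characters that describes the features of a string and later serves as the basis for matching. How do we obtain a pattern? We can use the compile function in re, as in this code: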

pattern=re.compile(r'hello')

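We pass in the raw string 'hello' and compile it into a pattern object with compile, which we then use for matching. In other words, what follows compares the string to be matched against 'hello'.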
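As a rough sketch of how a compiled pattern is used: the object can be passed to the module-level functions, or its own methods can be called directly. The variable names and the case-insensitive example with the re.I flag are additions for illustration.

```python
import re

pattern = re.compile(r"hello")

# Passing the compiled pattern to the module-level function...
m1 = re.match(pattern, "hello world")
# ...gives the same result as calling the method on the pattern object.
m2 = pattern.match("hello world")

print(m1.group())  # hello
print(m2.group())  # hello

# Flags are given at compile time; re.I ignores case.
ci_pattern = re.compile(r"hello", re.I)
print(ci_pattern.match("HELLO world").group())  # HELLO
```

Compiling once is convenient when the same expression is applied to many strings, as is common in a crawler.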

Now let us look at the re module functions mentioned above.

1.re.match(pattern, string[, flags])

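This function tries to match pattern starting at the beginning of string. If the end of string is reached before pattern has been fully matched, it returns None;
if a character is encountered that pattern cannot match, it also returns None. Otherwise it returns a match object.
The following code makes this concrete: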

# -*- coding=utf-8 -*-
# Python 2 syntax: print is a statement here
import re
# compile the pattern once and reuse it for every call
pattern=re.compile(r'hello')
result1=re.match(pattern,'hello')
result2=re.match(pattern,'helloc')
result3=re.match(pattern,'helo')
result4=re.match(pattern,'hello world')
# a successful match returns a match object; a failed one returns None
if result1:
    print result1.group()
else:
    print '1 failed'
if result2:
    print result2.group()
else:
    print '2 failed'
if result3:
    print result3.group()
else:
    print '3 failed'
if result4:
    print result4.group()
else:
    print '4 failed'

Output:

hello
hello
3 failed
hello

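Analysis of the results:
1. string is identical to pattern, so the match succeeds.
2. string has one extra letter, but the match succeeds as soon as pattern has been fully matched; the trailing c is not examined.
3. At the last letter of string the pattern still cannot be completed (it expects a second 'l' but finds 'o'), so the match fails.
4. Same reasoning as 2.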
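Two consequences of this behaviour are worth seeing in code. The sketch below contrasts re.match with re.search (listed earlier among the module's functions) and shows one way, assumed here for illustration, to reject trailing characters by ending the pattern with $.

```python
import re

pattern = re.compile(r"hello")

# match only tries at the start of the string, so this fails...
print(re.match(pattern, "say hello"))           # None
# ...while search scans forward until it finds a match.
print(re.search(pattern, "say hello").group())  # hello

# To reject trailing characters such as the 'c' in 'helloc',
# anchor the end of the pattern with $.
strict = re.compile(r"hello$")
print(re.match(strict, "helloc"))         # None
print(re.match(strict, "hello").group())  # hello
```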

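Now for what result.group means. A match object is the result of one match and holds a lot of information about it, which we can read through the attributes and methods the match object provides.

The following example illustrates some of the attributes and methods of a match object: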

# -*- coding=utf-8 -*-
import re
# three groups: two words, then a group named 'char' that takes the rest
m=re.match(r'(\w+) (\w+)(?P<char>.*)','hello world!')
print 'm.string',m.string
# start, end and span take an optional group number (default 0, the whole match)
print 'm.start:',m.start()
print 'm.start:',m.start(2)
print 'm.end:',m.end()
print 'm.end:',m.end(2)
print 'm.pos:',m.pos
print 'm.endpos:',m.endpos
print 'm.group:',m.group(2)
print 'm.groupdict:',m.groupdict()
print 'm.lastgroup:',m.lastgroup
print 'm.lastindex',m.lastindex
print 'm.span:',m.span(2)
print 'm.span',m.span()
print 'm.re:',m.re
# expand fills the captured groups into a template string
print m.expand(r'\2 \1 \3')

Output:

m.string hello world!
m.start: 0
m.start: 6
m.end: 12
m.end: 11
m.pos: 0
m.endpos: 12
m.group: world
m.groupdict: {'char': '!'}
m.lastgroup: char
m.lastindex 3
m.span: (6, 11)
m.span (0, 12)
m.re: <_sre.SRE_Pattern object at 0x02669760>
world hello !

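Analysis of the results:
1. string is the text that was used for matching.
2. start returns the index in string where the substring captured by the given group begins; the group defaults to 0, meaning the whole match.
3. end returns the index in string where the substring captured by the given group ends (the index of its last character plus 1); the group defaults to 0.
4. pos and endpos are the indices in string where the search started and ended.
5. group returns the substring captured by the given group.
6. groupdict returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured; unnamed groups are not included.
7. lastgroup returns the name of the last captured group. If that group has no name, or no group was captured, it is None.
8. lastindex returns the index (group number) of the last captured group, or None if no group was captured.
9. span returns (start(group), end(group)), the start and end positions of the group.
10. expand substitutes the captured groups into a template string, so it can be used to reorder the groups.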
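A few of these points can be checked with a short sketch. It reuses the pattern from the example above; the second pattern, word, and the pos/endpos values 6 and 9 are chosen here only to show that pos and endpos reflect the search window passed to a compiled pattern's match method.

```python
import re

pattern = re.compile(r"(\w+) (\w+)(?P<char>.*)")
m = pattern.match("hello world!")

print(m.group(0))       # hello world!
print(m.group(1, 2))    # ('hello', 'world')
print(m.group("char"))  # !
print(m.groups())       # ('hello', 'world', '!')
print(m.span(1))        # (0, 5)

# Named groups can also be referenced in an expand template.
print(m.expand(r"\2 \1\g<char>"))  # world hello!

# pos and endpos limit the part of the string that is searched.
word = re.compile(r"\w+")
m2 = word.match("hello world!", 6, 9)
print(m2.group())  # wor
print(m2.pos)      # 6
print(m2.endpos)   # 9
```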
