Crawler Extraction Rules: Using Regular Expressions

Original

2020-06-28 15:52

The re module:

The re.match function

re.match(pattern, string, flags=0)

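pattern	The regular expression to match.
string	The string to match against.
flags	Flags that control how the regular expression is matched, for example whether matching is case-sensitive or whether multi-line matching is used.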

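The optional values for flags are as follows:

 • re.I (full name: IGNORECASE): ignore case (the full name is given in parentheses, here and below)
 • re.M (full name: MULTILINE): multi-line mode; '^' and '$' also match at the start and end of each line, not only at the start and end of the whole string
 • re.S (full name: DOTALL): dot-matches-all mode; '.' also matches the newline character
 • re.L (full name: LOCALE): makes \w \W \b \B and case-insensitive matching depend on the current locale; in Python 3 it is allowed only with bytes patterns
 • re.U (full name: UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D follow the Unicode character properties; in Python 3 this is already the default for str patterns
 • re.X (full name: VERBOSE): verbose mode; the regular expression can span several lines, whitespace in it is ignored, and comments can be added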
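The effect of the most common flags is easier to see on a small input. The following is a minimal sketch; the sample text and the variable names are made up for illustration.

```python
import re

text = "Hello\nhello world"

# re.I: case-insensitive matching finds both spellings.
ignore_case = re.findall(r"hello", text, re.I)

# re.M: '^' also matches right after each newline.
line_starts = re.findall(r"^\w+", text, re.M)

# re.S: '.' also matches the newline, so one match can span two lines.
across_lines = re.search(r"Hello.+world", text, re.S)

# Without re.S the same pattern finds nothing in this text.
no_dotall = re.search(r"Hello.+world", text)

# re.X: whitespace and comments inside the pattern are ignored.
date_pattern = re.compile(r"""
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
""", re.X)
date_parts = date_pattern.search("posted 2020-06-28").groups()

print(ignore_case, line_starts, date_parts)
```

Several flags can be combined with the | operator, for example re.I | re.M.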

re.match succeeds only if the pattern matches at the very start of the string. If a match could begin only at some later position, match() returns None.

>>> print(re.match('www', 'www.baidu.com'))
<re.Match object; span=(0, 3), match='www'>
>>> print(re.match('baidu', 'www.baidu.com'))
None

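The group() method of the returned match object extracts the matched text.
group() with no argument (the same as group(0)) returns the string matched by the whole expression, and group(n) returns the n-th group. group() also accepts several group numbers at once, in which case it returns a tuple of the corresponding values.
groups() returns a tuple containing the strings of all the groups, from group 1 up to the last group.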

>>> import re
>>> line = "Stay foolish,stay hungry"
>>> matchObj = re.match(r'(.*),(.*?) .*', line, re.S)
>>> print("matchObj.group(2) : ", matchObj.group(2))
matchObj.group(2) :  stay
>>> print("matchObj.group(1) : ", matchObj.group(1))
matchObj.group(1) :  Stay foolish
>>> # Note that group() and groups() are different: groups() returns every group as a tuple
>>> print("matchObj.groups() : ", matchObj.groups())
matchObj.groups() :  ('Stay foolish', 'stay')
>>> print("matchObj.group() : ", matchObj.group())
matchObj.group() :  Stay foolish,stay hungry
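The difference between group() and groups() can be spelled out a little further. The sketch below reuses the same sentence; the named groups and the second pattern are additions for illustration and do not come from the original example.

```python
import re

line = "Stay foolish,stay hungry"
m = re.match(r"(.*),(.*?) .*", line)

whole = m.group()        # same as m.group(0): the entire match
pair = m.group(1, 2)     # several group numbers give a tuple
all_groups = m.groups()  # every captured group, starting from group 1

# The argument of groups() is a default value for groups that did not take
# part in the match; it is not a group number.
optional = re.match(r"(\d+)(?:\.(\d+))?", "42")
with_default = optional.groups("0")

# Named groups can be read by name or collected into a dict.
named = re.match(r"(?P<first>.*),(?P<second>.*?) ", line)
by_name = named.groupdict()

print(pair, with_default, by_name)
```

Named groups tend to keep extraction rules readable once a pattern has more than two or three groups.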

The re.search function

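re.search scans the whole string and returns the first match, wherever it starts.
re.search(pattern, string, flags=0)
The parameters are the same as for re.match above. The only difference is where the match may begin, so the two functions can give different results for the same string.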

>>> print(re.search('www', 'www.baidu.com'))
<re.Match object; span=(0, 3), match='www'>
>>> print(re.search('baidu', 'www.baidu.com')) 
<re.Match object; span=(4, 9), match='baidu'>
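In a crawler the interesting text is rarely at the start of the page, so search() is usually the better fit, and its result has to be checked for None before group() is called. A minimal sketch, assuming a made-up HTML snippet and a hypothetical helper named extract_title:

```python
import re

# Made-up sample page, used only for illustration.
html = "<html><head><title> Example Domain </title></head></html>"


def extract_title(page):
    """Return the text inside <title>...</title>, or None if there is none."""
    found = re.search(r"<title>(.*?)</title>", page, re.I | re.S)
    # search() returns None when nothing matches, so check before using it.
    return found.group(1).strip() if found else None


title = extract_title(html)
missing = extract_title("<html></html>")

# match() is anchored at position 0, so it fails on the same page.
anchored = re.match(r"<title>", html)

print(title, missing, anchored)
```

For real pages an HTML parser is generally more robust than a regular expression; the pattern here only illustrates the search() call.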

The re.sub function

re.sub(pattern, repl, string, count=0, flags=0)
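The first three parameters are required; the last two are optional.

pattern : the regular expression pattern.
repl : the replacement string; it can also be a function.
string : the original string to search and replace in.
count : the maximum number of matches to replace; the default 0 means replace all matches.
flags : the matching flags (the re.* constants listed above).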

#!/usr/bin/python3
import re
phone = "2004-959-559 # This is a phone number"
# Remove the trailing comment
num = re.sub(r'#.*$', "", phone)
print("Phone number : ", num)
# Remove everything that is not a digit
num = re.sub(r'\D', "", phone)
print("Phone number : ", num)
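The count parameter and a function passed as repl are not shown above, so here is a minimal sketch of both. The helper name double and the sample strings are made up for illustration.

```python
import re

phone = "2004-959-559"

# count=1 replaces only the first match.
first_only = re.sub(r"-", " ", phone, count=1)


def double(match):
    """Replacement callback: receives a match object and returns a string."""
    return str(int(match.group()) * 2)


# When repl is a function, it is called once for every match.
doubled = re.sub(r"\d+", double, "A23G4HFD567")

# In a replacement string, \1 and \2 refer to the captured groups.
swapped = re.sub(r"(\w+),(\w+)", r"\2,\1", "foolish,hungry")

print(first_only, doubled, swapped)
```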

The re.compile function

re.compile(pattern[, flags])
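The compile function compiles a regular expression into a pattern object (re.Pattern). The pattern object has its own methods, such as match(), search(), findall() and sub(), so the same expression can be reused without passing it in every call.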
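A compiled pattern is typically created once and then reused. The following sketch shows the pattern object's methods on a made-up string; the variable names are illustrative only.

```python
import re

number = re.compile(r"\d+")
sample = "one12two34"

first = number.search(sample)        # first match anywhere in the string
at_start = number.match(sample)      # None: the string does not start with a digit
from_pos = number.match(sample, 3)   # pattern methods accept a start position
everything = number.findall(sample)  # all matches as a list

print(first.group(), at_start, from_pos.group(), everything)
```

The optional start position is available on the pattern object's methods, not on the module-level re.match and re.search functions.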

The re.findall function

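re.findall(pattern, string, flags=0)
Pattern.findall(string[, pos[, endpos]])

The first form is the module-level function. The second form is the method of a compiled pattern object, and only this form accepts pos and endpos.

string The string to search.
pos Optional; the index in the string where the search starts, default 0.
endpos Optional; the index in the string where the search stops, default the length of the string.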

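findall finds every non-overlapping substring that matches the regular expression and returns them as a list. If nothing matches, it returns an empty list.
Note: match and search return at most one match, while findall returns all of them.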

import re
pattern = re.compile(r'\d+')
# print is a function in Python 3, so the call needs parentheses
print(pattern.findall('1one2two3three'))

Output:

['1', '2', '3']
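When the pattern contains groups, findall returns the group contents instead of the whole match, which is often what an extraction rule needs. A minimal sketch, assuming a made-up HTML fragment:

```python
import re

# Made-up sample markup, used only for illustration.
html = '<a href="/page/1">First</a> <a href="/page/2">Second</a>'

# One group: findall returns the text of that group for each match.
links = re.findall(r'href="(.*?)"', html)

# Several groups: findall returns a list of tuples.
pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', html)

# pos and endpos limit the search to a slice of the string.
digits = re.compile(r"\d+")
window = digits.findall("1one2two3three", 1, 10)

# finditer yields match objects, which also carry the match positions.
spans = [m.span() for m in digits.finditer("1one2two3three")]

print(links, pairs, window, spans)
```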
