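Introduction

A regular expression (often shortened to regex) is a pattern used to search for, and replace, text that matches a given rule.

It works as a small logic language for string operations: a set of predefined special characters, alone and in combination, forms a "rule string", and that rule string expresses a filter over other strings.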

Installation

Nothing to install: Python's regex engine ships in the standard library as the re module.

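Concepts

Regex basics

How the re module differs from the methods of the built-in str type:

The re module can express many alternatives in a single pattern and handle them in one pass, which makes it considerably more powerful.
Most str methods handle one literal argument at a time, so covering several cases means repeated calls; this is more tedious and less capable than re.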
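To make the comparison concrete, here is a minimal sketch that splits one string on several separators, first with chained str calls and then with a single pattern. The sample string and variable names are made up for illustration.

```python
import re

text = "2019-05/06 12:30"

# With str methods, each separator needs its own call.
by_str = text.replace("-", " ").replace("/", " ").replace(":", " ").split()

# With re, one character class covers every separator at once.
by_re = re.split(r"[-/: ]", text)

print(by_str)  # ['2019', '05', '06', '12', '30']
print(by_re)   # ['2019', '05', '06', '12', '30']
```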

Ordinary characters

Predefined character classes

Quantifiers

Boundary matching

Grouping and alternation
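As a rough illustration of the five categories named above, the sketch below shows one small pattern for each. The sample strings are invented, and the selection is far from a complete reference.

```python
import re

# Ordinary characters match themselves.
ordinary = re.findall(r"cat", "cat concat")

# Predefined classes: \d is a digit, \w a word character, \s whitespace.
predefined = re.findall(r"\d", "a1b22")

# Quantifiers: + means one or more of the preceding item.
quantified = re.findall(r"\d+", "a1b22")

# Boundaries: ^ anchors at the start of the string, $ at the end.
bounded = bool(re.search(r"^\d+$", "2019"))

# Grouping and alternation: (...) groups, | chooses between alternatives.
grouped = re.findall(r"(jpg|png)$", "logo.png")

print(ordinary)    # ['cat', 'cat']
print(predefined)  # ['1', '2', '2']
print(quantified)  # ['1', '22']
print(bounded)     # True
print(grouped)     # ['png']
```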

Regex functions

1. re.match (compare str.startswith)

re.match tries to match the pattern at the very start of the string. If the pattern does not match there, match() returns None.

Syntax: re.match(pattern, string, flags=0)
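A short sketch of that behaviour, using invented sample strings: the same pattern succeeds when the digits open the string and yields None when they do not.

```python
import re

# The digits sit at the start, so match() returns a match object.
m = re.match(r"\d+", "2019 was a good year")
print(m.group())  # 2019

# The digits are not at the start, so match() returns None.
not_at_start = re.match(r"\d+", "year 2019")
print(not_at_start)  # None
```

Because the result may be None, it is usually safer to test it before calling group().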

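2. re.search (compare str.find)

re.search scans through the whole string and returns the first successful match, or None if nothing matches.

Syntax: re.search(pattern, string, flags=0)

re.match versus re.search

re.match only matches at the beginning of the string: if the start does not fit the pattern, the match fails and the function returns None. re.search looks through the entire string until it finds a match.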
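The difference shows up when both functions get the same pattern and string. This is a minimal sketch; the sample string is chosen only for illustration.

```python
import re

text = "www.baidu.com"

# match() only looks at position 0, where the text is "www", so it fails.
at_start = re.match(r"baidu", text)

# search() keeps scanning and finds the pattern further in.
anywhere = re.search(r"baidu", text)

print(at_start)                           # None
print(anywhere.group(), anywhere.span())  # baidu (4, 9)
```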

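3. re.findall

Finds every non-overlapping substring that matches the pattern and returns them as a list; if nothing matches, it returns an empty list.

Note: match and search stop at the first match, while findall collects all of them.

Syntax: re.findall(pattern, string, flags=0) at module level, or findall(string[, pos[, endpos]]) on a compiled pattern object.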
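Both call forms can be tried side by side. In this sketch the sample text is invented, and pos=2 is an arbitrary starting offset picked to show the optional argument of the compiled-pattern method.

```python
import re

text = "a1b22c333"

# Module-level form: the pattern is passed as the first argument.
all_numbers = re.findall(r"\d+", text)

# Compiled form: pos tells findall where in the string to start looking.
pattern = re.compile(r"\d+")
from_pos = pattern.findall(text, 2)

# No match at all gives an empty list, not None.
no_hits = re.findall(r"\d+", "abc")

print(all_numbers)  # ['1', '22', '333']
print(from_pos)     # ['22', '333']
print(no_hits)      # []
```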

4. re.sub (compare str.replace)

Search and replace

Python's re module provides re.sub for replacing the parts of a string that match a pattern.

Syntax: re.sub(pattern, repl, string, count=0, flags=0)
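A small sketch of re.sub on an invented phone-number string: it strips a trailing comment, removes every non-digit, and uses count to limit the number of replacements.

```python
import re

phone = "2004-959-559  # a phone number"

# Remove the trailing comment together with the spaces in front of it.
no_comment = re.sub(r"\s*#.*$", "", phone)

# \D matches any non-digit, so this keeps only the digits.
digits_only = re.sub(r"\D", "", no_comment)

# count=1 replaces only the first occurrence.
first_only = re.sub(r"-", "", no_comment, count=1)

print(no_comment)   # 2004-959-559
print(digits_only)  # 2004959559
print(first_only)   # 2004959-559
```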

5. re.split (compare str.split)

Splits the string wherever the pattern matches and returns a list.

Syntax: re.split(pattern, string, maxsplit=0, flags=0)
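As a quick sketch, the pattern below treats commas, semicolons, and runs of whitespace as one kind of separator. The sample string is invented for illustration.

```python
import re

text = "apple, banana;cherry  date"

# One or more commas, semicolons, or whitespace characters form a separator.
parts = re.split(r"[,;\s]+", text)

# maxsplit=1 stops after the first split and leaves the rest untouched.
limited = re.split(r"[,;\s]+", text, maxsplit=1)

print(parts)    # ['apple', 'banana', 'cherry', 'date']
print(limited)  # ['apple', 'banana;cherry  date']
```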

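Regex flags (modifiers)

A pattern can be given optional flags that change how matching behaves. They are passed through the optional flags argument, and several can be combined with bitwise OR (|); for example, re.I | re.M turns on both the I and M flags.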
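The three flags used later in this chapter can be sketched as follows: re.I ignores case, re.M makes ^ and $ work per line, and re.S lets the dot match a newline. The sample text is made up.

```python
import re

text = "Hello\nhello world"

# re.I: ignore case.
ignore_case = re.findall(r"hello", text, re.I)

# re.M: ^ matches at the start of every line, not only the start of the string.
with_m = re.findall(r"^hello", text, re.I | re.M)
without_m = re.findall(r"^hello", text, re.I)

# re.S: the dot also matches a newline character.
dot_all = re.search(r"Hello.hello", text, re.S)
dot_default = re.search(r"Hello.hello", text)

print(ignore_case)          # ['Hello', 'hello']
print(with_m)               # ['Hello', 'hello']
print(without_m)            # ['Hello']
print(dot_all is not None)  # True
print(dot_default)          # None
```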

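Greedy and non-greedy matching

Greedy: provided the overall expression still matches, a quantifier consumes as much as possible (for example *).

Non-greedy: provided the overall expression still matches, a quantifier consumes as little as possible (add ? after the quantifier, as in *?).
Quantifiers in Python are greedy by default.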

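Test 1

import re

# Named text rather than str, so the built-in str type is not shadowed.
text = "abbbbbbc"

# Greedy
# * matches as many b's as possible, so the result is abbbbbb
pattern = re.compile(r"ab*")
print(pattern.match(text).group())

# Non-greedy
# *? matches as few b's as possible, so the result is a
pattern = re.compile(r"ab*?")
print(pattern.match(text).group())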

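Test 2

import re

# Triple quotes let the string literal span several lines.
html = """<html>
              <div>aa</div>
              <div>bb</div>
              <div>cc</div>
          </html>"""

# Greedy
# Matches as much as possible: the last </div> becomes the right boundary
pattern = re.compile("<div>.*</div>", re.S)
print(pattern.findall(html))

# Non-greedy
# Matches as little as possible: the first </div> becomes the right boundary
pattern = re.compile("<div>.*?</div>", re.S)
print(pattern.findall(html))
# Extract the text inside each tag
for item in pattern.findall(html):
    print(re.search(">(.*?)<", item).group(1))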

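Regex symbols commonly used in crawlers

Quantifiers:

* zero or more (>= 0)
+ one or more (>= 1)
? zero or one

Matching symbols:
. wildcard (any character except a newline, unless re.S is set)
\s whitespace (\n, \r, \t, space)
\w word character (letter, digit, or underscore)
\d digit

Grouping symbols:
() capturing group
(.*?) non-greedy capturing group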

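Regex groups:

A group is a part of the regular expression enclosed in a pair of parentheses "()"; whatever that part matches is the content of the group. Groups are numbered by their opening parenthesis, reading the pattern from left to right: the first "(" starts group 1, the second starts group 2, and so on. There is also an implicit group 0, which is the whole match.

Once a pattern has groups, call group(num) on the match object to get one group, or groups() to get all of them as a tuple.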
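The numbering is easy to check on a date string. This is a minimal sketch with an invented sample sentence.

```python
import re

m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2019-05-06.")

whole = m.group(0)   # group 0 is the entire match
year = m.group(1)    # first "(" from the left
month = m.group(2)   # second "("
parts = m.groups()   # every numbered group, as a tuple

print(whole)   # 2019-05-06
print(year)    # 2019
print(month)   # 05
print(parts)   # ('2019', '05', '06')
```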

Extracting HTML tags and attributes

Matching the text between tags

Example: grab the content of the title tag on the Baidu home page

import requests
import re

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
}
html = requests.get("http://www.baidu.com",headers=headers).content.decode()
print(html)
# Match the text between the title tags; title carries no other attributes to get in the way

pattern = re.compile("<title>(.*?)</title>",re.M|re.S)

print(pattern.search(html).group(1))
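The snippet above raises AttributeError when the page has no title tag, because search() returns None in that case. The sketch below shows one possible guard; extract_title and the sample HTML are hypothetical names made up for this illustration, and it runs offline on a hard-coded string rather than on a live page.

```python
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S)

def extract_title(html):
    """Return the stripped title text, or None when no title tag is found."""
    m = TITLE_RE.search(html)
    return m.group(1).strip() if m else None

# Hypothetical sample pages, not real Baidu markup.
with_title = "<html><head><title> Example page </title></head></html>"
without_title = "<html><head></head></html>"

print(extract_title(with_title))     # Example page
print(extract_title(without_title))  # None
```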

Example: grab the text of the hyperlinks on the Baidu home page

import requests
import re

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
}
html = requests.get("http://www.baidu.com",headers=headers).content.decode()
print(html)

# Grab the text between every pair of a tags
pattern = re.compile("<a .*?>(.*?)</a>",re.M|re.S)
print(pattern.findall(html))


# Grab the text of every a tag whose class is mnav
pattern = re.compile('<a.*?class="mnav".*?>(.*?)</a>',re.M|re.S)
print(pattern.findall(html))


# For comparison, an XPath selecting the same tags: //a[@class="mnav"][@href]

# Grab both the href and the text of every a tag whose class is mnav
pattern = re.compile('<a.*?href="(.*?)".*?class="mnav".*?>(.*?)</a>',re.M|re.S)
print(pattern.findall(html))
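With two groups in the pattern, findall returns a list of tuples. The offline sketch below also shows a caveat of the loose .*? style: with re.S it can run across tag boundaries and pair an href from one tag with the text of another. The sample markup and the tighter pattern are assumptions made for illustration, not something taken from the Baidu page.

```python
import re

# Hypothetical markup: the first link has no class="mnav".
sample = (
    '<a href="/more">More</a>'
    '<a href="http://news.example.com" class="mnav">News</a>'
    '<a href="http://map.example.com" class="mnav">Maps</a>'
)

# Loose pattern: .*? may cross a ">" and spill into the next tag.
loose = re.compile(r'<a.*?href="(.*?)".*?class="mnav".*?>(.*?)</a>', re.S)

# Tighter pattern: [^>] and [^"] keep each piece inside a single opening tag.
tight = re.compile(r'<a[^>]*?href="([^"]*)"[^>]*?class="mnav"[^>]*>(.*?)</a>', re.S)

loose_hits = loose.findall(sample)
tight_hits = tight.findall(sample)

print(loose_hits)
# [('/more', 'News'), ('http://map.example.com', 'Maps')]
print(tight_hits)
# [('http://news.example.com', 'News'), ('http://map.example.com', 'Maps')]
```

The tighter form still assumes href comes before class inside the tag, just as the original pattern does.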

Matching tag attributes

Example: grab some of the href attributes on the hao123 portal

import requests
import re

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
}
html = requests.get("https://www.hao123.com/",headers=headers).content.decode()
# Grab the href attribute of every a tag whose class is g-gc
pattern = re.compile('<a class="g-gc".*?href="(.*?)"')
for a in pattern.findall(html):
    print(a)

Example: grab the data-title attribute of some links on the hao123 portal

import requests
import re

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
}
html = requests.get("https://www.hao123.com/",headers=headers).content.decode()

# Grab the data-title attribute of every a tag whose class is "sitelink sub-site"
pattern = re.compile('<a class="sitelink sub-site".*?data-title="(.*?)">.*?</a>',re.M|re.S)
for a in pattern.findall(html):
    print(a)

Exercises

Scrape the Chinese Super League news
http://sports.163.com/zc/

Scrape all job postings from 小站教育
https://jobs.51job.com/all/co2847909.html#syzw
