Fixing garbled HTML returned by requests in a Python crawler (gzip, deflate, br)

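1. The problem
  • Garbled-text bug, case 1
  • Premise: resp.encoding matches the encoding of the page source; in this example it is 'utf-8'.
  • Printing resp.text directly raises an exception: UnicodeEncodeError: 'gbk' codec can't encode character '\ufffd' in position 0: illegal multibyte sequence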
import requests

# url and proxy are assumed to be defined earlier in the script
headers = {
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,v=b3",
    'accept-Encoding': "gzip, deflate, br",
    'accept-Language': "zh-CN,zh;q=0.9",
    'connection': "close",
    'Upgrade-Insecure-Requests': '1',
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
}
resp = requests.get(url, headers=headers, proxies=proxy, timeout=20)
resp.encoding = 'utf-8'
print(resp.text)  # raises UnicodeEncodeError on a GBK console


  • Changing print(resp.text) as follows avoids the exception, but the output is garbled:
headers = {
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,v=b3",
    'accept-Encoding': "gzip, deflate, br",
    'accept-Language': "zh-CN,zh;q=0.9",
    'connection': "close",
    'Upgrade-Insecure-Requests': '1',
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
}
resp = requests.get(url, headers=headers, proxies=proxy, timeout=20)
resp.encoding = 'utf-8'
# Drop characters GBK cannot encode so that print() no longer raises
print(resp.text.encode('gbk', 'ignore').decode('gbk'))
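
The two symptoms above can be reproduced offline. The sketch below is only an illustration under stated assumptions: it uses zlib from the standard library as a stand-in for a Brotli-compressed body (Brotli is not in the standard library), and the sample HTML string is made up. It shows that decoding still-compressed bytes as UTF-8 with errors="replace", which is how requests builds resp.text, produces '\ufffd' characters that GBK cannot encode, and that 'ignore' only hides them.

```python
import zlib

# Hypothetical page body; zlib stands in for a body that is still compressed
original = "<html>中文網頁</html>".encode("utf-8")
compressed = zlib.compress(original)

# requests decodes resp.text with errors="replace", so invalid bytes become U+FFFD
text = compressed.decode("utf-8", errors="replace")
has_replacement = "\ufffd" in text

# Encoding to GBK (what print() does on a GBK console) fails on U+FFFD
try:
    text.encode("gbk")
    gbk_failed = False
except UnicodeEncodeError:
    gbk_failed = True

# 'ignore' silences the error but the result is still not the original page
cleaned = text.encode("gbk", "ignore").decode("gbk")

print(has_replacement, gbk_failed, cleaned == original.decode("utf-8"))
```

In other words, the encode/decode workaround treats the symptom only; the body was never decompressed in the first place.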


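2. The fix
  • Remove br from 'accept-Encoding': "gzip, deflate, br", or simply comment out that line. The server then no longer replies with a Brotli-compressed body, which requests cannot decompress unless a Brotli package is installed.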
headers = {
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,v=b3",
    'accept-Encoding': "gzip, deflate",  # br removed
    'accept-Language': "zh-CN,zh;q=0.9",
    'connection': "close",
    'Upgrade-Insecure-Requests': '1',
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
}
resp = requests.get(url, headers=headers, proxies=proxy, timeout=20)
resp.encoding = 'utf-8'
print(resp.text.encode('gbk', 'ignore').decode('gbk'))
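
As a variation on the fix above, the header value could be built at runtime so that br is advertised only when a decoder is available. This is a minimal sketch, not the original author's method: build_accept_encoding is a hypothetical helper, and it assumes that urllib3 (which requests uses underneath) decodes Brotli when either the brotli or the brotlicffi package is importable.

```python
import importlib.util


def build_accept_encoding():
    """Return an Accept-Encoding value, adding 'br' only if a Brotli decoder is importable."""
    encodings = ["gzip", "deflate"]
    # Assumption: urllib3 handles Brotli only when one of these packages is installed
    if any(importlib.util.find_spec(name) for name in ("brotli", "brotlicffi")):
        encodings.append("br")
    return ", ".join(encodings)


accept_encoding = build_accept_encoding()
headers = {"accept-Encoding": accept_encoding}
print(headers)
```

The resulting dictionary can be merged into the headers shown above in place of the hard-coded 'accept-Encoding' entry.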
