A Scraper for Online Jokes

This article draws on the following sources:

Basic usage of the urllib2 library
https://blog.csdn.net/kingov/article/details/80173251

Itcast Heima community forum (傳智播客黑馬社區)
http://bbs.itheima.com/thread-344264-1-1.html

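Basic usage of the urllib2 library
Web scraping means reading the network resource identified by a URL out of the network stream and saving it locally. In Python 2 we use the urllib2 module to fetch web pages.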

urllib2 is a module that ships with Python 2.7 (nothing to download) and is one of Python's main components for fetching URLs (Uniform Resource Locators).

urllib2 official documentation: https://docs.python.org/2/library/urllib2.html

urllib2 source code: https://hg.python.org/cpython/file/2.7/Lib/urllib2.py
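As a minimal sketch of the fetch described above, the snippet below builds a request that carries a browser-like User-Agent header. The helper names build_request and fetch, the timeout value, and the http://example.com/ URL are assumptions made for illustration; they do not appear in the sources. The import fallback is there only so the same calls also run on Python 3, where urllib2 was split into urllib.request and urllib.error.

```python
try:
    import urllib2  # Python 2.7, as used in this article
except ImportError:
    import urllib.request as urllib2  # Python 3 home of Request and urlopen


def build_request(url, user_agent):
    """Hypothetical helper: create a Request that sends the given User-Agent."""
    headers = {'User-Agent': user_agent}
    return urllib2.Request(url, headers=headers)


def fetch(url, user_agent, timeout=10):
    """Hypothetical helper: download the response body as raw bytes."""
    response = urllib2.urlopen(build_request(url, user_agent), timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()


# Building the request does not touch the network, so it is safe to inspect.
req = build_request(
    "http://example.com/",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
)
print(req.get_full_url())
# Request stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Calling fetch("http://example.com/", "...") would perform the actual download; it is left out of the demonstration so the sketch runs offline.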

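Most of my time went into learning regular expressions:
pattern = re.compile(r'(?<=\.html">)(.*?)(?=</a></h1>)')
Regular expressions are a skill that builds up gradually with practice.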
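To see what the lookbehind and lookahead assertions in this pattern do, here is a small sketch run against a made-up HTML fragment. The sample markup and its titles are assumptions, shaped only to resemble what the pattern expects; the real page may differ. The dot before html is escaped so that it matches a literal period.

```python
import re

# Hypothetical HTML fragment, not taken from the real site.
sample_html = (
    '<h1><a href="/post/101.html">First joke title</a></h1>\n'
    '<p>body text</p>\n'
    '<h1><a href="/post/102.html">Second joke title</a></h1>\n'
)

# (?<=...) requires the text before the match to end with .html">
# (?=...)  requires the text after the match to start with </a></h1>
# Neither assertion is part of the result; only the lazy group (.*?) is captured.
pattern = re.compile(r'(?<=\.html">)(.*?)(?=</a></h1>)')

titles = pattern.findall(sample_html)
print(titles)
```

Because the pattern has exactly one capturing group, findall returns a flat list of the captured strings, here the two link texts.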


# -*- coding: utf-8 -*-
import urllib2
import re


class Spider:
    """
    Joke scraper class
    """

    def __init__(self):
        self.enable = True
        self.page = 1  # page number to scrape next

    @staticmethod
    def load_page(page):
        """
        @brief Request the page with the given number and return the HTTP response body
        @param page page number used to build the URL
        @returns raw content of the site's response
        """

        # URL to request
        url = "http://duanziwang.com/category/經典段子/" + str(page)

        # User-Agent header used to pose as a browser
        user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
        # Build a dict so the request's 'User-Agent' header carries our user_agent string
        headers = {'User-Agent': user_agent}

        # Create the request, passing the headers dict we just built
        req = urllib2.Request(url, headers=headers)

        # Send the request to the server and get the response
        response = urllib2.urlopen(req)

        # Read the response body
        html_raw = response.read()

        # The page may need converting from GBK to UTF-8
        # html_raw = html_raw.decode('gbk').encode('utf-8')

        return html_raw

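    def page_to_list(self, html_raw):
        """
        @brief Extract the list of jokes from a page
        @param html_raw raw HTML of the page
        @returns list of matched strings
        """

        # Use lookbehind and lookahead assertions to pull out the content
        pattern = re.compile(r'(?<=\.html">)(.*?)(?=</a></h1>)')

        item_list = pattern.findall(html_raw)

        return item_list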

    def write_file(self, text):
        """
        @brief Append a string to a file on disk
        @param text string to write
        """

        # The with block closes the file even if a write fails
        with open("./DZ.txt", 'a') as my_file:
            my_file.write(text)
            my_file.write("\n\n")

    def print_page(self, item_list, page):
        """
        @brief Take each item from the list and write it to disk
        @param item_list list of jokes extracted from the page
        @param page number of the page being processed
        """

        print "******* Page %d scraped *******" % page
        for item in item_list:
            self.write_file("page:" + str(page) + ':' + item)


# main
if __name__ == '__main__':
    print '''
    ======================
    Joke scraper
    ======================
    '''

    print 'Press Enter to start'
    raw_input()

    # Create a Spider object
    mySpider = Spider()
    # Pages 1 through 72
    for i in range(1, 73):
        html = mySpider.load_page(i)
        items = mySpider.page_to_list(html)
        mySpider.print_page(items, i)

The result is shown in the figure.
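The commented-out GBK conversion in load_page hints that some pages may not arrive as UTF-8. One possible way to handle that, sketched here under assumptions, is to try a short list of encodings in order. The helper name decode_html, the encoding order, and the sample bytes are all hypothetical, and the approach is a heuristic: a GBK byte sequence can occasionally also be valid UTF-8.

```python
# -*- coding: utf-8 -*-


def decode_html(raw_bytes, encodings=("utf-8", "gbk")):
    """Hypothetical helper: try each encoding in turn and return unicode text."""
    for enc in encodings:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: use the first encoding and replace bad bytes
    return raw_bytes.decode(encodings[0], "replace")


# Stand-ins for response bodies in two different encodings (assumed data).
gbk_page = u"<h1>段子</h1>".encode("gbk")
utf8_page = u"<h1>段子</h1>".encode("utf-8")

# Both inputs should decode to the same unicode text.
same_text = decode_html(gbk_page) == decode_html(utf8_page)
print(same_text)
```

In the scraper, such a helper would be applied to the value returned by response.read() before the regular expression runs, with the text encoded back to UTF-8 before it is written to DZ.txt.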
