Tianya Forum Search Crawler

Requirements:

Get the id of the original poster and the ids of all commenters for every post on every page returned by a keyword search on the Tianya forum.

Get the correspondence between these ids, for use in a particle swarm algorithm.

(The script actually already extracts user ids, comment text, user names, and so on; if you need those, just tweak the return statements.)

 

Analysis:

All of Tianya's pages are static, and I found no anti-crawling mechanisms. There is no rate limit, but please crawl at a reasonable pace and don't hog the site's resources (when I crawled too fast, the server couldn't keep up; retrying a few times gets you through, see the small sketch below).

The search page seems to be capped at 75 pages by default (I didn't need much data, so I didn't look into it further).
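
For example, when the server does choke, a small retry helper can keep the crawl going. This is my own sketch, not part of the original script; the retry count and backoff values are arbitrary:

import requests
from time import sleep

def fetch(url, retries=3, timeout=30):
    """Fetch a page, retrying a few times when Tianya's server lags."""
    for attempt in range(retries):
        try:
            return requests.get(url, timeout=timeout).text
        except requests.RequestException:
            sleep(2 * (attempt + 1))       # back off a little before retrying
    raise RuntimeError('giving up on ' + url)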

 

Python libraries: pyquery, requests, urllib, time

For parsing I used pyquery instead of Beautiful Soup, because pyquery's CSS pseudo-class selectors are genuinely powerful and pleasant to work with.
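
For instance, the jQuery-style pseudo-class :gt(n), which the crawler below relies on, selects everything after index n in a single expression (the markup here is a made-up miniature, not Tianya's actual page):

from pyquery import PyQuery as pq

html = '<div class="atl-main"><div>ad</div><div>nav</div><div>comment 1</div><div>comment 2</div></div>'
doc = pq(html)
print(doc('.atl-main div:gt(1)').text())     # 'comment 1 comment 2'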

 

Notes:

1. Tianya pages load slowly, so pq(url) is unreliable against Tianya; fetching with requests.get and an explicit timeout is more stable (see the first sketch after this list).
2. Unlike Beautiful Soup, pyquery's items() returns a generator rather than a list.
3. remove() combined with pseudo-class selectors makes up for some of items()'s limitations.
4. set() deduplicates a list, but the data loses its order. When deduplicating a two-dimensional list, convert the inner elements to tuples first, otherwise set() raises an error (see the second sketch below).
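
A minimal sketch of notes 1-3; the HTML is a toy stand-in for Tianya's markup:

from pyquery import PyQuery as pq
import requests

# Note 1: pq(url) fetches the page itself and fails easily on slow pages;
# fetching explicitly with a generous timeout is more stable:
# doc = pq('http://bbs.tianya.cn/...')              # flaky on Tianya
# doc = pq(requests.get(url, timeout=30).text)      # more reliable

doc = pq('<ul><li>post A</li><li>post B</li><li>load more...</li></ul>')
doc('li:last-child').remove()              # note 3: pseudo-class + remove() drops the junk node
items = doc('li').items()                  # note 2: a generator, not a list
print([li.text() for li in items])         # ['post A', 'post B']
print([li.text() for li in items])         # [] -- a generator is consumed only once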
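
And a minimal sketch of note 4, with made-up data:

pairs = [['u1', 'u2'], ['u1', 'u2'], ['u2', 'u3']]
# set(pairs)                                  # TypeError: unhashable type: 'list'
unique = list(set(tuple(p) for p in pairs))   # convert inner lists to tuples first
print(unique)                                 # duplicates gone, but order is not preserved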

 

Full code:

# account id list
# 2D list of id relation pairs

from pyquery import PyQuery as pq
import requests
from urllib.parse import quote
from time import sleep

id_list_file = r'.\data\id_list.txt'
relation_file = r'.\data\re_list.txt'
page = 75
key_word = '糧食'

def parse_all_page(urls):
    """
    Parse every search-result page, collect the post urls, and skip
    posts that have no comments.
    :param urls:
    :return: content_urls
    """

    content_urls = []
    for url in urls:
        sleep(1)
        print('fetching:', url)
        doc = pq(requests.get(url=url, timeout=30).text)
        #print(doc)
        doc('.searchListOne li:last-child').remove()        # drop the last li node (not a result)
        lis = doc('.searchListOne li').items()      # generator over the result items
        for li in lis:
            reply_count = li('.source span:last-child').text()
            print('reply count:', reply_count)
            if int(reply_count) <= 0:
                continue

            a = li('a:first-child')
            content_url = a.attr('href')
            content_urls.append(content_url)

    return content_urls



def parse_all_content(urls):
    """
    Parse every post page and build the id list and comment relations.
    :param urls:
    :return: ids, relations
    """

    ids = []            # id list
    relations = []           # (id, id) relation 2-tuples
    for url in urls:
        print('parsing:', url)
        #sleep(1)
        doc = pq(requests.get(url=url, timeout=30).text)
        #main_id = doc('.atl-info span:first-child').attr('uid')
        main_id = doc('.atl-head .atl-menu').attr('_host')
        print('poster id:', main_id)
        #main_name = doc('.atl-info span:first-child').attr('uname')
        #title = doc('')
        if main_id and main_id not in ids:
            ids.append(main_id)

        comments = doc('.atl-main div:gt(1)').items()       # comment list after the banner ad
        for comment in comments:        # process each comment
            host_id = comment.attr('_hostid')
            #user_name = comment.attr('_host')
            comment_text = comment('.bbs-content').text()
            replies = comment('.item-reply-view li').items()     # replies to this comment
            if host_id and (host_id not in ids):
                print('commenter id:', host_id)
                print('comment text:', comment_text)
                ids.append(host_id)
                if host_id != main_id:
                    relations.append((main_id, host_id))     # add the poster-commenter relation
                    print('comment relation:', main_id, '\t', host_id)

            for reply in replies:        # items() returns a generator, never None
                rid = reply.attr('_rid')
                #ruser_name = reply.attr('_username')
                rtext = reply('.ir-content').text()
                if rid and (rid not in ids):
                    print('reply id:', rid)
                    print('reply text:', rtext)
                    ids.append(rid)
                    if rid != main_id and rid != host_id:
                        relations.append((host_id, rid))         # add the commenter-replier relation
                        print('reply relation:', host_id, '\t', rid)

    return ids, relations


def write_files(id_list, relations):
    """
    Write the id list and the relation pairs to their files; relations
    are stored as 1-based indices into the id list.
    :param id_list:
    :param relations:
    :return:
    """
    with open(id_list_file, 'w') as id_txt:        # note: the .\data directory must already exist
        for id in id_list:
            id_txt.write(str(id) + '\n')

    with open(relation_file, 'w') as re_txt:
        for relation in relations:
            a = id_list.index(relation[0]) + 1
            b = id_list.index(relation[1]) + 1
            print('writing relation:', a, '\t', b)
            re_txt.write(str(a) + '\t' + str(b) + '\n')


def run(key, page):
    """
    Build the search urls for the keyword, then crawl, dedupe, and save.
    :param key:
    :param page:
    :return:
    """
    start_urls = []
    for p in range(page):
        url = 'http://search.tianya.cn/bbs?q={}&pn={}'.format(quote(key), p)
        start_urls.append(url)
    content_urls = parse_all_page(start_urls)
    ids, relations = parse_all_content(content_urls)

    ids_set = set(ids)          # dedupe with set(); ordering is lost
    ids = list(ids_set)
    relations_set = set(relations)
    relations = list(relations_set)
    write_files(ids, relations)


if __name__ == '__main__':
    run(key_word, page)
    # urls = ['http://bbs.tianya.cn/post-worldlook-1886395-1.shtml',]
    # ids, relations = parse_all_content(urls)
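
For reference, this is the shape of the two output files that the particle swarm step consumes (the ids shown are made up; the relation columns are 1-based line numbers into id_list.txt):

id_list.txt (one user id per line):
    12345678
    87654321
    13572468

re_list.txt (tab-separated pairs of 1-based indices into id_list.txt):
    1	2
    1	3
    2	3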

 

Funny coincidence: I seem to have found a crawler write-up by a senior from my old lab, hahaha: https://blog.csdn.net/zeng_w_j123/article/details/76640147 (linking it here as a show of lab kinship?)

 
