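Preface: A classmate recently needed to collect information on some scenic spots from Qunar (去哪兒網), so it was time for a crawler. For pages this regular, I assumed the urllib and BeautifulSoup packages alone would do the job. In practice, I had underestimated the problem.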

1. Crawl analysis

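http://piao.qunar.com/ticket/detail_1.html and http://piao.qunar.com/ticket/detail_1774014993.html are the pages for two scenic spots, 齊廬山 (Qilu Shan) and 西海景區 (the Xihai scenic area), respectively. The URL pattern is evidently http://piao.qunar.com/ticket/detail_*.html, where * runs from 0 up to some very large number n. Once the URL pattern is known, urllib can fetch the pages.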
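
The URL pattern above can be sketched as a small generator. This is a minimal sketch written for Python 3 (the listings in this post are Python 2); the names `URL_TEMPLATE` and `detail_urls` are made up for illustration and do not come from any library.

```python
URL_TEMPLATE = "http://piao.qunar.com/ticket/detail_{}.html"


def detail_urls(start, stop):
    """Yield candidate detail-page URLs for ids in [start, stop), one at a time."""
    # range is lazy in Python 3 (like xrange in Python 2), so a huge stop is fine
    for spot_id in range(start, stop):
        yield URL_TEMPLATE.format(spot_id)


if __name__ == "__main__":
    for url in detail_urls(0, 3):
        print(url)
```

Because the URLs are produced one at a time, the upper bound can be very large without building a list in memory.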

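Next comes the page itself. My classmate wanted four main fields: the scenic spot's name, level, address, and price. I saved a page as HTML and inspected it; all four fields sit in span tags, so BeautifulSoup can extract them. There are many BeautifulSoup tutorials; I followed the blog post "python爬蟲入門八之Beautiful Soup的用法" (Python crawler primer, part 8: using Beautiful Soup). The descriptive text sits in <p> tags; it was not needed, so I did not extract it.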
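
The extraction idea can be tried without any third-party package. The sketch below is written for Python 3 and swaps BeautifulSoup for the standard library's `html.parser`, so it is a stand-in for the approach, not the code used in this post. The HTML fragment is hypothetical: only the class names are taken from the crawl script, and the real page markup may differ. `SpotParser` and `FIELDS` are illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical fragment that mimics the structure described above; not real page markup.
SAMPLE_HTML = """
<div class="mp-description-detail">
  <span class="mp-description-name">Sample Park</span>
  <span class="mp-description-level">4A</span>
  <span class="mp-description-address">1 Example Road</span>
  <span>from <span>35</span></span>
</div>
"""

FIELDS = {
    "mp-description-name": "name",
    "mp-description-level": "level",
    "mp-description-address": "address",
}


class SpotParser(HTMLParser):
    """Collect the text of spans that carry one of the wanted classes."""

    def __init__(self):
        super().__init__()
        self.info = {"name": "", "level": "", "address": "", "price": ""}
        # One entry per open span: a field name, "" for a class-less span, else None
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((FIELDS[c] for c in classes if c in FIELDS), None)
        if field is None:
            if self._stack and self._stack[-1] == "":
                field = "price"  # a span nested inside a class-less span holds the price
            elif not classes:
                field = ""       # class-less span: remember it for its child spans
        self._stack.append(field)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Keep text only while the innermost open span maps to a field
        if self._stack and self._stack[-1]:
            self.info[self._stack[-1]] += data.strip()


if __name__ == "__main__":
    parser = SpotParser()
    parser.feed(SAMPLE_HTML)
    print(parser.info)
```

The price rule roughly mirrors the script's logic: a span with no class attribute is scanned for child spans, and the child's text is taken as the price.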

2. Crawling

Figure 1-1: The main information

#!/usr/bin/env python
# coding=utf-8
import bs4
from bs4 import BeautifulSoup
import urllib
import xlwt
import time

def crawl(html_string):
#    respons = urllib.urlopen(url)
#    soup    = BeautifulSoup(respons.read())
    soup    = BeautifulSoup(html_string)
    name    = ""
    level   = ""
    address = ""
    money   = ""
    # Parse the page and extract the fields; this part requires some familiarity with BeautifulSoup
    for tag in soup.find_all(["span"]):  # use BeautifulSoup to collect every span tag
        if "class" in tag.attrs:
            if "mp-description-name" in tag.attrs["class"]:  # a class containing "name" marks the scenic spot's name; the fields below work the same way
                name = str(tag.string)
            if "mp-description-level" in tag.attrs["class"]:
                level = str(tag.string)
            if "mp-description-address" in tag.attrs["class"]:
                address = str(tag.string)
        else:
            for j in tag.children:  # the ticket price sits in a span nested inside a span
                if type(j) == bs4.element.Tag and j.name=="span":
                    money = str(j.string)
    information = (address, name, level, money)
    return information
# ====== Initialize the Excel workbook that receives the results
book = xlwt.Workbook(encoding="utf-8",style_compression=0)
sheet = book.add_sheet("where_we_go",cell_overwrite_ok=True)
sheet.write(0,0,"景區地址")  # row 0 holds the four column headers (address, name, level, listed price); the sheet is n rows by 4 columns, n being the number of scenic spots crawled
sheet.write(0,1,"景區名稱")
sheet.write(0,2,"景區級別")
sheet.write(0,3,"景區票面價")

start_time = time.time()
line_num = 1
# A full run would need an upper bound around 2500000000 (largest id seen: 2295597022).
# There are a startling number of scenic spots; for large n use xrange, because range
# builds the whole list in memory and crashes.
for i in xrange(0, 1000):
    url = "http://piao.qunar.com/ticket/detail_"+str(i)+".html"
    respons = urllib.urlopen(url)
    if respons.geturl()!=url:  # crawling too fast triggers a redirect to a page that says "尊敬的用戶，安全系統檢測到異常訪問，當前請求已經被攔截" (the security system detected abnormal access and blocked the request)
        print respons.geturl()
    html_string = respons.read()
    if "mp-description-detail" in html_string:  # does this id have a page? 0 does not; 1 is 齊廬山, 2 is 中國竹藝城, and so on
        information = crawl(html_string)  # extract and return the four main fields
        for j in range(len(information)):
            sheet.write(line_num, j, information[j])  # write them into the Excel sheet
        line_num+=1
        print "processing url:",i, information[1]
        if line_num%10==0:
            end_time = time.time()
            print "time:",end_time-start_time

book.save("where.xls")  # the workbook must be saved explicitly

3. Problems encountered while crawling

First, there are too many pages. With n set that large, the starting URLs have to be generated lazily with xrange.

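Second, I did not put any sleep between requests, so the crawl ran too fast, was detected, and got redirected, after which no scenic-spot information could be extracted.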
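
A pause between requests, combined with the redirect check, could look like the sketch below. It is written for Python 3 and is only an illustration: `crawl_politely` and `fetch_with_urllib` are made-up names, the 1-second default delay is a guess, and no particular delay is known to avoid the block.

```python
import time
from urllib.request import urlopen


def fetch_with_urllib(url):
    """Fetch url and return (final_url, body_text)."""
    with urlopen(url, timeout=10) as response:
        return response.geturl(), response.read().decode("utf-8", errors="replace")


def crawl_politely(urls, fetch=fetch_with_urllib, delay=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests.

    Pages whose final URL differs from the requested one are treated as
    redirected (for example to the "abnormal access" page) and skipped.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index:
            sleep(delay)  # wait before every request except the first
        final_url, body = fetch(url)
        if final_url != url:
            continue  # redirected, so there is nothing useful to parse
        pages[url] = body
    return pages
```

Passing `fetch` and `sleep` as parameters keeps the loop easy to try out with stand-in functions before pointing it at the real site.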

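Third, fetching is too slow: 20 pages took 5 seconds, an average of 0.25 s per page, which is unacceptable for a large-scale crawl. A colleague suggested crawling with multiple threads. Without a sleep, that got blocked as well.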
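
To see why that rate is a problem, here is the arithmetic as a short Python 3 sketch. It assumes every id costs the same 0.25 s, including ids that have no page, which is a simplification; `estimated_crawl_days` is an illustrative name.

```python
def estimated_crawl_days(page_count, seconds_per_page):
    """Return the wall-clock days needed to fetch page_count pages one by one."""
    return page_count * seconds_per_page / 86400.0  # 86400 seconds per day


seconds_per_page = 5.0 / 20   # 20 pages in 5 seconds, as measured above
largest_id = 2295597022       # largest id noted in the crawl script's comment
days = estimated_crawl_days(largest_id, seconds_per_page)
print("%.2f s per page, about %.0f days (%.1f years)"
      % (seconds_per_page, days, days / 365.0))
```

Under that assumption, walking every id sequentially would take on the order of 18 years, so either the id range has to shrink or the requests have to run in parallel.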

Finally, I planned to crawl with the scrapy framework, but even a test with scrapy shell http://piao.qunar.com/ticket/detail_1774014993.html was blocked, so I could not go any further with scrapy.

4. Multithreaded crawling

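The problem with multithreaded crawling was saving to Excel: each write needs a row number, so the worker would have to return a value. I later found that overriding the run() method is enough (the listing below appends each record to a text file instead). The redirect problem still occurs, though. The timing settings are fairly critical.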

Code:

#!/usr/bin/env python
# coding=utf-8
import bs4
from bs4 import BeautifulSoup
import urllib
import time
import threading

class MyThread(threading.Thread):
    def __init__(self, func, j):
        threading.Thread.__init__(self,target=func, args=(j,))  # initialize the base Thread class; args must be a tuple
        self.func = func
        self.j    = j

    def run(self):
        print "starting:",self.j,self.func
        self.func(self.j)  # run() is overridden to call the worker (get_infor) with this thread's block index

def crawl(html_string):
    soup    = BeautifulSoup(html_string)
    name    = ""
    level   = ""
    address = ""
    money   = ""

    for tag in soup.find_all(["span"]):
        if "class" in tag.attrs:
            if "mp-description-name" in tag.attrs["class"]:
                name = str(tag.string)
            if "mp-description-level" in tag.attrs["class"]:
                level = str(tag.string)
            if "mp-description-address" in tag.attrs["class"]:
                address = str(tag.string)
        else:
            for j in tag.children:
                if type(j) == bs4.element.Tag and j.name=="span":
                    money = str(j.string)
    information = (address, name, level, money)
    return information

def get_infor(j):
    n=50
    for i in xrange(n):
        url = "http://piao.qunar.com/ticket/detail_"+str(i+j*n)+".html"
        respons = urllib.urlopen(url)
        if respons.geturl()!=url:
            print respons.geturl()
            continue
        html_string = respons.read()
        if "mp-description-detail" in html_string:
            information = crawl(html_string)
            print information[1]
            f = open("t.txt","a")  # append each scenic spot's four fields to the file
            s = "\t".join(information)  # separate the fields with "\t"
            f.write(s+"\n")
            f.close()
        else:
            continue
#    return s

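def main():
    threads = []
    for j in range(50):  # 50 threads; j is the thread index
        t = MyThread(get_infor, j)  # one thread per block: the pages are split into 50 blocks of n pages each
        threads.append(t)

    for t in threads:  # start every thread
        t.setDaemon(True)
        t.start()
    for t in threads:  # then wait for all of them; joining inside the start loop would run the threads one at a time
        t.join()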

if __name__=="__main__":
    start_time = time.time()
    main()
    end_time = time.time()
    print "time:",end_time - start_time
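
The row-number problem described at the start of this section can also be avoided by letting each worker return its records and numbering the rows afterwards in a single thread. The sketch below is written for Python 3 and uses `concurrent.futures` from the standard library, which is a different mechanism from the Thread subclass above. `block_ids`, `crawl_blocks`, and `fake_scrape_block` are illustrative names, and the pool size of 8 is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def block_ids(j, n=50):
    """Ids handled by block j: j*n, j*n + 1, ..., j*n + n - 1."""
    return range(j * n, (j + 1) * n)


def crawl_blocks(scrape_block, block_count, max_workers=8):
    """Run scrape_block(j) for every block on a thread pool and number the rows.

    scrape_block(j) must return a list of records (for example 4-tuples).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands back each block's return value, in block order
        per_block = list(pool.map(scrape_block, range(block_count)))
    records = [record for block in per_block for record in block]
    # Row numbers are assigned here, in one thread, starting at 1 (row 0 is the header)
    return list(enumerate(records, start=1))


if __name__ == "__main__":
    def fake_scrape_block(j):
        # Stand-in for real fetching: pretend the first id of each block has a page
        first = block_ids(j)[0]
        return [("address-%d" % first, "name-%d" % first, "4A", "35")]

    for row, record in crawl_blocks(fake_scrape_block, block_count=3):
        print(row, record)
```

With the rows numbered this way, the main thread could write each record into the sheet (or a text file) itself, so no two threads ever write to the output at the same time.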

5. Trying scrapy

Running scrapy shell http://piao.qunar.com/ticket/detail_1774014993.html opens the interactive shell, but nothing can be extracted.

Figure 1-2: Attempted analysis in scrapy shell

6. Follow-up

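Crawling raises plenty of deeper questions: how to crawl faster without getting blocked, how to extract the main information quickly, and so on. My classmate then said to stop, since a crawl this slow was not worth running, so after trying multithreading and scrapy I did not continue.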

References:

A blog post on urllib: http://www.cnblogs.com/yuxc/archive/2011/08/01/2123995.html

Building a website crawler with scrapy: http://www.oschina.net/translate/build-website-crawler-based-upon-scrapy

Crawling Logdown blog post data with scrapy: http://dabing1022.github.io/2014/07/17/scrapy-crawling-logdown-blog-related-data/

Please credit the source when reposting: http://blog.csdn.net/u010454729/article/details/49328681

