Building a campus search engine with MongoDB: crawling web page text

Goal:
Read the list of URLs stored in an Excel file and crawl the text content of every page in that list.

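Overview:
This step builds on the URL list that the crawler has already collected and fetches the content of each page. The code uses requests (fetches the page source), BeautifulSoup (parses the HTML and extracts the plain text), lxml (the HTML parser; it has to be installed before BeautifulSoup can use it), and xlrd (reads the contents of the Excel file).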

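My thinking:
At first I wanted to use regular expressions to match and extract content from the page source. I had tried them when I first started working with crawlers, and when the requirement is to extract a title or some other single, specific piece of content, regular expressions are workable. My requirement here, however, is to extract all of the plain text. For that, regular expressions are too limiting: there are many cases to consider and some will inevitably be missed. So I searched Google for an existing Python module that can do this.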
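
The trade-off described above can be sketched with the standard library alone. The sample page and both helper functions below are hypothetical and exist only for illustration. The first helper extracts a single field and works well; the second tries to get "all text" by deleting tags and ends up including script source as if it were page text.

```python
import re

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """<html><head><title>Campus News</title>
<script>var tracker = "not page text";</script></head>
<body><h1>Welcome</h1><p>Library hours have <b>changed</b>.</p></body></html>"""


def extract_title(html):
    """Pull out one specific field: the contents of the <title> tag."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def strip_tags_naively(html):
    """Delete anything that looks like a tag, then collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return " ".join(without_tags.split())


print(extract_title(SAMPLE_HTML))
print(strip_tags_naively(SAMPLE_HTML))
```

Each leak of this kind (scripts, styles, comments, entities) would need another pattern, which is the "many cases to consider" problem.
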
Sure enough, I found the following:

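Here is a link to the Chinese BeautifulSoup documentation:
BeautifulSoup 4.2.0 Chinese documentation
I have only used the simplest approach here. BeautifulSoup not only met my needs, it is also far more powerful than I had imagined.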
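
As a minimal sketch of that "simplest approach", the snippet below parses a small hypothetical page and shows the two BeautifulSoup ways of getting plain text: get_text(), a method that has to be called and returns one string, and stripped_strings, which yields each text fragment with surrounding whitespace removed. It uses Python's built-in "html.parser" backend so that it runs without lxml; that choice is an assumption made for this example only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """<html><body>
<h1>  Welcome  </h1>
<p>Library hours have <b>changed</b>.</p>
</body></html>"""

# "html.parser" ships with Python; pass "lxml" instead if lxml is installed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# get_text() returns all text in the document as a single string.
all_text = soup.get_text()

# stripped_strings skips whitespace-only strings and strips the rest.
fragments = list(soup.stripped_strings)

print(repr(all_text))
for row_number, text in enumerate(fragments):
    print(str(row_number) + ":" + text)
```

Numbering the fragments gives the same "index:text" layout that the crawler writes to result.txt.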

Code:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import xlrd
import time


def get_text(url):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"}
    html = requests.get(url, headers=headers)  # fetch the page source
    html_text = html.text

    soup = BeautifulSoup(html_text, "lxml")  # parse the HTML with BeautifulSoup

    # append the results to a UTF-8 encoded text file
    with open("result.txt", "a", encoding="utf-8") as fp_result:
        fp_result.write(url)
        fp_result.write("\n------------------------------------------\n")

        # soup.stripped_strings yields the page's text fragments one by one
        for row_number, text in enumerate(soup.stripped_strings):
            fp_result.write(str(row_number) + ":")
            fp_result.write(text)
            fp_result.write("\n")


def get_xls(path):  # open the URL list collected by the crawler and read the URLs one by one
    # open the Excel file (xlrd 2.0 and later read only .xls; an .xlsx file needs xlrd 1.x)
    data = xlrd.open_workbook(path)
    table = data.sheets()[0]  # first worksheet
    nrows = table.nrows  # number of rows
    finished_line = 0

    for i in range(nrows):  # go through the sheet row by row
        ss = table.row_values(i)  # values of all cells in row i
        for j in range(len(ss)):  # go through the cells of the row
            finished_line += 1  # count processed cells
            try:
                get_text(ss[j])
                # progress as a fraction of rows; assumes one URL per row
                process = 1.0 * finished_line / nrows
                print("have finished %.3f" % process)
                print(finished_line)
            except requests.exceptions.ConnectionError:
                # on a ConnectionError, record the failing link in error.txt;
                # the cause of these errors still needs further investigation
                fp_error.write(ss[j] + "\n")

    print("target finished " + str(nrows))  # total number of rows


fp_error = open("error.txt", "a", encoding="utf-8")  # open the error log
fp_error.write(time.strftime('%Y-%m-%d', time.localtime(time.time())))  # record the date of this run
fp_error.write("\n------------------------------------------\n")
get_xls("cs.xlsx")  # run the crawl
fp_error.close()  # close the error log
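
The script passes every cell of the sheet to get_text, and only ConnectionError is caught, so a cell that is empty or does not hold a URL can stop the run with a different exception. One possible safeguard, sketched here as an assumption and not as part of the original script, is to filter the cell values before fetching. The names is_fetchable_url, clean_urls and sample_rows are hypothetical.

```python
from urllib.parse import urlparse


def is_fetchable_url(cell):
    """Return True only for text cells holding an absolute http(s) URL."""
    if not isinstance(cell, str):
        return False  # xlrd returns numeric cells as float
    parsed = urlparse(cell.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def clean_urls(rows):
    """Flatten rows of cell values into an ordered list of unique URLs."""
    seen = set()
    urls = []
    for row in rows:
        for cell in row:
            if not is_fetchable_url(cell):
                continue
            url = cell.strip()
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Hypothetical rows, shaped like the lists returned by table.row_values(i).
sample_rows = [
    ["http://www.example.edu/news", ""],
    [" https://www.example.edu/about ", 42.0],
    ["http://www.example.edu/news", "not a url"],
]
print(clean_urls(sample_rows))
```

Dividing the number of finished URLs by len(clean_urls(...)) instead of the row count would also keep the progress figure between 0 and 1 when a sheet has more than one column.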
