[python] Lab 9 (Web Scraping)

Original

2020-07-04 08:42

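Task: Use the standard library urllib to scrape the images on the Pingdingshan University news page "http://news.pdsu.edu.cn/info/1005/31269.htm". Requirements: save them to the pic directory on drive F, and name each file "your name" + "_image number". For example, the first image for someone named 張三 is saved as "張三_1.jpg".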

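import os
from re import findall
from urllib.parse import urljoin
from urllib.request import urlopen

url = "http://news.pdsu.edu.cn/info/1005/31269.htm"
# Download the page and decode it as UTF-8
with urlopen(url) as fp:
    content = fp.read().decode("utf-8")
# Capture the src attribute of every 500-pixel-wide <img> tag
pattern = '<img width="500" src="(.+?)"'
result = findall(pattern, content)
path = 'D:/pic/'  # the task asks for F:/pic/; I used drive D here
os.makedirs(path, exist_ok=True)  # open() fails if the directory is missing
xm = "趙琦"  # name used as the file-name prefix
for index, item in enumerate(result):
    # urljoin avoids a double slash when src already starts with "/"
    img_url = urljoin("http://news.pdsu.edu.cn/", item)
    with urlopen(img_url) as fp:
        with open(path + xm + "_" + str(index + 1) + ".jpg", "wb") as fp1:
            fp1.write(fp.read())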

There is not much to stress about the first task. urllib is part of the standard library, so nothing needs to be installed.
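The regex match and the file-naming rule can be checked offline before touching the site. The following is a minimal sketch rather than part of the original script: the helper names extract_image_urls and build_filename, the sample HTML and the image paths in it are all made up for illustration.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the script above: capture the src of 500px-wide images
IMG_PATTERN = r'<img width="500" src="(.+?)"'


def extract_image_urls(html, base_url):
    """Return absolute URLs for every image matched by IMG_PATTERN."""
    return [urljoin(base_url, src) for src in re.findall(IMG_PATTERN, html)]


def build_filename(name, index):
    """Apply the naming rule from the task: name + '_' + 1-based number."""
    return "{}_{}.jpg".format(name, index + 1)


# Hypothetical page fragment, invented for this example
sample_html = (
    '<p><img width="500" src="/__local/a/1.jpg" alt=""></p>'
    '<p><img width="300" src="/__local/a/small.jpg"></p>'
    '<p><img width="500" src="/__local/b/2.jpg"></p>'
)

urls = extract_image_urls(sample_html, "http://news.pdsu.edu.cn/")
for i, u in enumerate(urls):
    print(build_filename("張三", i), "<-", u)
```

Note that the pattern only matches tags written exactly as width="500" followed by src, so images with different markup (such as the 300-pixel one in the sample) are skipped.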

Result: [screenshot]

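2. Use the scrapy framework to scrape content from the Pingdingshan University news site (http://news.pdsu.edu.cn/). Specifically: scrape the news column names and write the result to lm.txt.

First we need to install the scrapy library: open cmd and run pip install scrapy

(I simply installed BeautifulSoup and requests at the same time, in the same way: pip install beautifulsoup4 requests)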
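To confirm the installs worked before creating the project, a quick check like the one below can help. It is only a sketch: the helper name missing_packages is hypothetical, and the mapping simply records that the module imported as bs4 is installed under the pip name beautifulsoup4.

```python
from importlib.util import find_spec

# Import name -> pip name for the packages used in this post
REQUIRED = {"scrapy": "scrapy", "bs4": "beautifulsoup4", "requests": "requests"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be imported."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    # Print the command that would install whatever is still missing
    print("pip install " + " ".join(missing))
else:
    print("all packages available")
```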

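Steps: open cmd and create the project. You can cd to whatever path you like first; I put mine on drive D [screenshot].

Create the project: scrapy startproject <project name>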

scrapy startproject MyTest

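Create the spider: scrapy genspider <spider name> <domain allowed for crawling>

For example: scrapy genspider xiaoshuo bbs.tianya.cn/post-16-1126849-1.shtml

For the spider used in this task, the matching command would be scrapy genspider mywarm pdsu.edu.cn

Once it is created, I suggest keeping the cmd window open for now.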

Go to the directory you just created and open the spiders folder (mine is D:\Python36\MyTest\MyTest\spiders).

Open the generated spider file and edit it as follows:

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup


class MywarmSpider(scrapy.Spider):
    name = 'mywarm'
    allowed_domains = ['pdsu.edu.cn']
    start_urls = ['http://news.pdsu.edu.cn/']

    def parse(self, response):
        html_doc = response.text
        soup = BeautifulSoup(html_doc, 'html.parser')
        # Each news column heading is an <h2 class="fl"> element
        columns = soup.find_all('h2', class_='fl')
        content = ''
        for lm in columns:
            print(lm.text)
            content += lm.text + '\n'
        # "a+" appends to the file; UTF-8 keeps the Chinese text intact
        with open('D:\\lm.txt', "a+", encoding='utf-8') as fp:
            fp.write(content)

You can change where the scraped data is saved (mine goes to d:\lm.txt).
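One detail worth knowing about the "a+" mode used in the spider: it appends, so running the crawl twice leaves two copies of every column name in lm.txt, whereas "w" overwrites the file each time. The sketch below shows the difference on temporary files. The helper names save_columns and count_lines and the sample column names are hypothetical, not data from the site.

```python
import os
import tempfile


def save_columns(path, names, mode):
    """Write one column name per line using the given file mode."""
    with open(path, mode, encoding="utf-8") as fp:
        fp.write("".join(name + "\n" for name in names))


def count_lines(path):
    """Return how many lines the file currently holds."""
    with open(path, encoding="utf-8") as fp:
        return len(fp.read().splitlines())


names = ["Column A", "Column B"]  # placeholder data, not real site content
tmp_dir = tempfile.mkdtemp()
append_path = os.path.join(tmp_dir, "lm_append.txt")
write_path = os.path.join(tmp_dir, "lm_write.txt")

for _ in range(2):  # simulate running the crawl twice
    save_columns(append_path, names, "a+")
    save_columns(write_path, names, "w")

append_count = count_lines(append_path)
write_count = count_lines(write_path)
print(append_count, write_count)
```

If you want lm.txt to hold only the latest crawl, either switch the mode to "w" or delete the file before each run.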

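If copying the code above produces the following error:

[error screenshot]

you can close PyCharm, install response (pip install response),

and go through the steps above again.

Once the file is edited, go back to cmd and run scrapy crawl <spider name> from inside the project directory (for the spider above: scrapy crawl mywarm).

Result: [screenshot]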

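3. Use the requests module to scrape the title of every chapter of the "Python Language and Applications" course on the Pingdingshan University online teaching platform (http://mooc1.chaoxing.com/course/206046270.html).

First, run the following in cmd:

Create the project: scrapy startproject yy (yy is the project name)

Create the spider: scrapy genspider zq mooc1.chaoxing.com/course/206046270.html

(zq is the spider name; mooc1.chaoxing.com/course/206046270.html is the crawl starting point)

Analysis:

Write the right query to filter out the information we need.
The key markup is: <div class="f16 chapterText">第一章 python概述</div> (the first chapter title, "Chapter 1: Python Overview")
This is matched with a BeautifulSoup query rather than a regular expression: soup.find_all('div',class_='f16 chapterText') (findAll is an older alias of the same method)
Find zq.py, the spider file created above.
Edit it: copy and paste the code below. The script only uses requests and bs4, so it can also be saved and run as a standalone .py file.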

import requests
import bs4

# Send a browser-like User-Agent instead of the default requests one
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}

url = 'http://mooc1.chaoxing.com/course/206046270.html'
html = requests.get(url, headers=headers).text
soup = bs4.BeautifulSoup(html, 'html.parser')
# Every chapter title sits in a <div class="f16 chapterText">
titles = soup.find_all('div', class_='f16 chapterText')
for ml in titles:
    print(ml.text)

Result: [screenshot]
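To test the chapter-title selection without network access or bs4, the same idea can be sketched with the standard library's html.parser. This is a swapped-in technique, not the approach used above: the class name ChapterTitleParser is hypothetical, the second sample title is invented, and the parser matches any div carrying the chapterText class, which is slightly looser than the exact "f16 chapterText" string that the BeautifulSoup query compares against.

```python
from html.parser import HTMLParser


class ChapterTitleParser(HTMLParser):
    """Collect the text of every <div> whose class list includes chapterText."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0  # > 0 while inside a matching <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or "chapterText" in classes:
            if not self._depth:
                self.titles.append("")  # start a new title
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.titles[-1] += data


# First div is the markup quoted in the analysis; the rest is made up
sample_html = (
    '<div class="f16 chapterText">第一章 python概述</div>'
    '<div class="f14">not a chapter</div>'
    '<div class="f16 chapterText">Chapter 2 (made up)</div>'
)

parser = ChapterTitleParser()
parser.feed(sample_html)
titles = [t.strip() for t in parser.titles]
print(titles)
```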
