Web-scraping practice: crawling the Douban Movie Top 250

Preface:

Crawl the Douban Movie Top 250 and store the scraped data in a MySQL database.

This post organizes the code, walks through the approach, and verifies that the code works. (2020.1.4)


Environment:
Python 3 (Anaconda3)
PyCharm
Chrome browser

Main modules (the command in parentheses is what you run in a cmd window to install the package):
requests (pip install requests)
lxml (pip install lxml)
re
pymysql (pip install pymysql)
time

1.

Create a table in a MySQL database named mydb. The CREATE TABLE statement:

CREATE TABLE doubanmovie (
    name TEXT,
    director TEXT,
    actor TEXT,
    style TEXT,
    country TEXT,
    release_time TEXT,
    time TEXT,
    score TEXT
) ENGINE=INNODB DEFAULT CHARSET=utf8;

2.

Analyze the structure of the pages to be crawled

https://movie.douban.com/top250
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=
...

As with the Douban Music and Douban Books Top 250 lists, the page URLs follow a fixed pattern,
so we can build all of them with a list comprehension:

urls = ['https://movie.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]
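A quick sanity check on the generated list (ten pages, in steps of 25):

```python
urls = ['https://movie.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]
print(len(urls))   # 10
print(urls[0])     # https://movie.douban.com/top250?start=0
print(urls[-1])    # https://movie.douban.com/top250?start=225
```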

3.

Analyze the HTML structure to find the detail-page links.
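The link extraction can be tried on a small stand-in for the list-page markup (the HTML below is a made-up sample, not a live response):

```python
from lxml import etree

# Hypothetical fragment mimicking the list page's structure:
# each movie sits in a div with class "hd" whose <a> carries the detail URL
sample = '''
<ol class="grid_view">
  <li><div class="hd"><a href="https://movie.douban.com/subject/1292052/">肖申克的救贖</a></div></li>
  <li><div class="hd"><a href="https://movie.douban.com/subject/1291546/">霸王別姬</a></div></li>
</ol>
'''

selector = etree.HTML(sample)
movie_hrefs = selector.xpath('//div[@class="hd"]/a/@href')
print(movie_hrefs)
```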

4.

On a detail page, open the developer tools (F12) and inspect the HTML to locate each field.
The code is as follows:

Take the first five listed actors as the leads; if there are fewer than five, take them all.

# Title
name = selector.xpath('//*[@id="content"]/h1/span[1]/text()')[0]

# Director
director = selector.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')[0]

# Actors: first five as the leads; all of them if fewer than five
actors_s = selector.xpath('//span[@class="attrs"]/a/text()')
actors = '/'.join(actors_s[:5])
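The "first five as leads" rule can be checked in isolation with made-up names:

```python
def lead_actors(names, limit=5):
    # First `limit` names joined with '/'; all of them if the list is shorter
    return '/'.join(names[:limit])

print(lead_actors(['A', 'B', 'C', 'D', 'E', 'F']))  # A/B/C/D/E
print(lead_actors(['A', 'B']))                      # A/B
```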

# Genres, joined with '/'
styles = selector.xpath('//*[@id="info"]/span[@property="v:genre"]/text()')
style = '/'.join(styles)

# Country/region
country = re.findall('製片國家/地區:</span>(.*?)<br', html.text, re.S)[0].strip()

# Release date (raw string so that \( is a literal parenthesis)
release_time = re.findall(r'上映日期:</span>.*?>(.*?)\(', html.text, re.S)[0]

# Runtime (a plain local string, distinct from the time module)
time = re.findall('片長:</span>.*?>(.*?)</sp', html.text, re.S)[0]

# Score
score = selector.xpath('//*[@id="interest_sectl"]/div[1]/div[2]/strong/text()')[0]
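The three regexes can be exercised against a small made-up stand-in for the info block (sample HTML, not a live response):

```python
import re

# Hypothetical fragment mimicking the detail page's info block
sample = ('<span class="pl">製片國家/地區:</span> 美國<br/>'
          '<span class="pl">上映日期:</span>'
          '<span property="v:initialReleaseDate" content="1994-09-10">1994-09-10(加拿大)</span><br/>'
          '<span class="pl">片長:</span><span property="v:runtime" content="142">142分鐘</span><br/>')

country = re.findall('製片國家/地區:</span>(.*?)<br', sample, re.S)[0].strip()
release_time = re.findall(r'上映日期:</span>.*?>(.*?)\(', sample, re.S)[0]
time = re.findall('片長:</span>.*?>(.*?)</sp', sample, re.S)[0]
print(country, release_time, time)
```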

5.

Saving the data to MySQL follows the classic "put the elephant in the fridge" three steps:

  1. "Open the fridge": connect to the database and get a cursor
conn = pymysql.connect(host='localhost', user='root', passwd='123456', db='mydb', port=3306, charset='utf8')
cursor = conn.cursor()
  2. "Put the elephant in the fridge": insert the scraped fields
cursor.execute("insert into doubanmovie (name, director, actor, style, country, release_time, time, score) "
               "values(%s, %s, %s, %s, %s, %s, %s, %s)",
               (str(name), str(director), str(actors), str(style), str(country),
                str(release_time), str(time), str(score)))
  3. "Close the fridge": commit the transaction
conn.commit()
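To see the open/insert/commit flow end to end without a running MySQL server, the same three steps can be replayed against an in-memory SQLite database (a stand-in: pymysql exposes the same DB-API shape, but uses %s placeholders instead of ?):

```python
import sqlite3

# 1. "Open the fridge": connect and get a cursor (in-memory database)
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE doubanmovie (name TEXT, director TEXT, score TEXT)')

# 2. "Put the elephant in the fridge": insert one row with placeholders
cursor.execute('INSERT INTO doubanmovie (name, director, score) VALUES (?, ?, ?)',
               ('肖申克的救贖', 'Frank Darabont', '9.7'))

# 3. "Close the fridge": commit the transaction
conn.commit()

rows = cursor.execute('SELECT name, score FROM doubanmovie').fetchall()
print(rows)  # [('肖申克的救贖', '9.7')]
```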

Complete code

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Import the required libraries
import requests
from lxml import etree
import re
import pymysql
import time

# Connect to the database and create a cursor
conn = pymysql.connect(host='localhost', user='root', passwd='123456', db='mydb', port=3306, charset='utf8')
cursor = conn.cursor()

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}

# Fetch a list page and collect the URLs of the detail pages
def get_movie_url(url):
    html = requests.get(url, headers=headers)
    print(url, html.status_code)
    selector = etree.HTML(html.text)
    movie_hrefs = selector.xpath('//div[@class="hd"]/a/@href')
    for movie_href in movie_hrefs:
        # Call the detail-page scraper for each link
        get_movie_info(movie_href)


# Scrape one detail page and write a row to the database
def get_movie_info(url):
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)

    print(url, html.status_code)

    name = selector.xpath('//*[@id="content"]/h1/span[1]/text()')[0]

    director = selector.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')[0]

    # First five actors as the leads; all of them if fewer than five
    actors_s = selector.xpath('//span[@class="attrs"]/a/text()')
    actors = '/'.join(actors_s[:5])

    # Genres joined with '/'
    styles = selector.xpath('//*[@id="info"]/span[@property="v:genre"]/text()')
    style = '/'.join(styles)

    country = re.findall('製片國家/地區:</span>(.*?)<br', html.text, re.S)[0].strip()

    release_time = re.findall(r'上映日期:</span>.*?>(.*?)\(', html.text, re.S)[0]

    # Local string; shadows the time module only inside this function
    time = re.findall('片長:</span>.*?>(.*?)</sp', html.text, re.S)[0]

    score = selector.xpath('//*[@id="interest_sectl"]/div[1]/div[2]/strong/text()')[0]

    # Insert the scraped fields into the database
    cursor.execute("insert into doubanmovie (name, director, actor, style, country, release_time, time, score) "
                   "values(%s, %s, %s, %s, %s, %s, %s, %s)",
                   (str(name), str(director), str(actors), str(style), str(country),
                    str(release_time), str(time), str(score)))

    print((name, director, actors, style, country, release_time, time, score))


# Entry point
if __name__ == '__main__':
    # Build the ten list-page URLs and crawl them one by one
    urls = ['https://movie.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]
    for url in urls:
        get_movie_url(url)
        # Pause 2 seconds between list pages to go easy on the server
        time.sleep(2)
        # Commit once per list page
        conn.commit()
    print("Done scraping!")

Data screenshot

(screenshot of the resulting rows in the doubanmovie table)
