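Preface

I saw a video on Bilibili that scraped a dating website, so I wrote my own version following it. It is just a simple use of the requests library plus writing the scraped information to files.

Before scraping a site, first analyze its structure; this makes the later steps much easier.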

www.7799520.com/api/user/pc/list/search?startage=21&endage=30&gender=2&cityid=221&startheight=161&endheight=170&marry=1&salary=2&page=1
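From the URL you can see that the site filters the data by the conditions we choose and loads the matching records; the page parameter then selects which page of results is returned (on the site itself, all pages of results are displayed within a single web page).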
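
To make the URL structure concrete, here is a minimal sketch that splits the example URL into its query parameters using only the standard library. The http:// scheme is prepended so that the string parses as a full URL, and the name split_search_url is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

# Example URL from the text, with a scheme added so it parses as a full URL.
EXAMPLE_URL = (
    "http://www.7799520.com/api/user/pc/list/search"
    "?startage=21&endage=30&gender=2&cityid=221"
    "&startheight=161&endheight=170&marry=1&salary=2&page=1"
)


def split_search_url(url):
    """Return (endpoint, params), where params is a dict of the query string."""
    parts = urlsplit(url)
    endpoint = "{}://{}{}".format(parts.scheme, parts.netloc, parts.path)
    params = dict(parse_qsl(parts.query))
    return endpoint, params


if __name__ == "__main__":
    endpoint, params = split_search_url(EXAMPLE_URL)
    print(endpoint)
    for name, value in params.items():
        print(name, "=", value)
```

Seen this way, only page changes between requests for the same search, while the other parameters carry the filter conditions.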

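Importing packages

After analyzing the site, import the packages we need. The scraping is done with the requests library, so import requests first; the file operations later need the os module. Other modules (such as json) can be imported as they are needed.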

import requests
import os
import json

Setting the crawl conditions

def set_age():
    # Ask for the preferred age
    age = int(input("Enter the preferred age (e.g. 25): "))
    # Map the age to the 10-year range used in the query
    if 21 <= age <= 30:
        startage = 21
        endage = 30
    elif 31 <= age <= 40:
        startage = 31
        endage = 40
    elif 41 <= age <= 50:
        startage = 41
        endage = 50
    elif 51 <= age <= 60:
        startage = 51
        endage = 60
    else:
        # Outside 21-60: fall back to 0 for both bounds
        startage = 0
        endage = 0
    return startage, endage
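
def set_sex():
    # Ask for the preferred gender
    sex = input("Enter the preferred gender (e.g. female): ").strip().lower()
    # Gender code: 1 = male, anything else is treated as 2 (female)
    if sex in ('男', 'male'):
        gender = 1
    else:
        gender = 2
    return gender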

def set_heigth():
    # Ask for the preferred height
    height = int(input("Enter the preferred height (e.g. 162): "))
    # <= 150 so that a height of exactly 150 does not fall through to the else branch
    if 0 <= height <= 150:
        startheight = 0
        endheight = 150
    elif 151 <= height < 161:
        startheight = 151
        endheight = 160
    elif 161 <= height < 171:
        startheight = 161
        endheight = 170
    elif 171 <= height < 181:
        startheight = 171
        endheight = 180
    elif 181 <= height < 191:
        startheight = 181
        endheight = 190
    else:
        # Negative or above 190: fall back to 0 for both bounds
        startheight = 0
        endheight = 0
    return startheight, endheight

def set_salary():
    # Ask for the preferred salary
    money = int(input("Enter the preferred salary: "))
    # Map the amount to the salary code used in the query
    if 2000 <= money < 5000:
        salary = 2
    elif 5000 <= money < 10000:
        salary = 3
    elif 10000 <= money < 20000:
        salary = 4
    elif 20000 <= money < 50000:
        salary = 5
    elif 50000 <= money < 100000:
        salary = 6
    elif 100000 <= money:
        salary = 7
    else:
        # Below 2000: fall back to 0
        salary = 0
    return salary

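Only some of the conditions are configurable here; if you are interested you can add more. The remaining conditions are simply hard-coded.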
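
The if/elif chains can also be written as a lookup table, which makes a mistyped range easier to spot and new conditions easier to add. The following is only a sketch: the names AGE_RANGES, HEIGHT_RANGES and pick_range are hypothetical, the ranges follow the conditions tested in the functions above, and (0, 0) is kept as the fallback.

```python
# Inclusive (low, high) ranges following the conditions tested above.
AGE_RANGES = [(21, 30), (31, 40), (41, 50), (51, 60)]
HEIGHT_RANGES = [(0, 150), (151, 160), (161, 170), (171, 180), (181, 190)]


def pick_range(value, ranges, default=(0, 0)):
    """Return the first (low, high) pair that contains value, else default."""
    for low, high in ranges:
        if low <= value <= high:
            return low, high
    return default


if __name__ == "__main__":
    print(pick_range(25, AGE_RANGES))
    print(pick_range(162, HEIGHT_RANGES))
```

With this layout, supporting another numeric condition only means adding one more list of ranges.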

Parsing the page

The parameters needed to request the page are passed in from the query conditions.

def get_one_page(page,startage,endage,gender,startheight,endheight,salary):
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}

    base_url = 'http://www.7799520.com/api/user/pc/list/search?startage={}&endage={}&gender={}&cityid=221&startheight={}' \
               '&endheight={}&marry=1&salary={}&page={}'.format(startage,endage,gender,startheight,endheight,salary,page)
    try:
        # The timeout keeps a stalled connection from hanging the script
        response = requests.get(base_url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.json()
    except (requests.RequestException, ValueError):
        # Network error, or a response body that is not valid JSON
        pass
    # Return None rather than looping forever on a non-200 response
    return None
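
If a request fails only occasionally, a bounded retry is a safer pattern than an endless loop. Below is a minimal sketch; fetch_with_retries is a hypothetical helper, not part of requests, and it treats a None return value as a failure, which matches how get_one_page signals an error.

```python
import time


def fetch_with_retries(fetch, max_attempts=3, delay=1.0):
    """Call fetch() until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts:
            # Wait a little before the next attempt
            time.sleep(delay)
    return None


if __name__ == "__main__":
    # Stand-in for get_one_page: fails twice, then returns a made-up response.
    replies = iter([None, None, {"data": {"list": []}}])
    print(fetch_with_retries(lambda: next(replies), delay=0))
```

In the crawler this could be called as fetch_with_retries(lambda: get_one_page(1, 21, 30, 2, 161, 170, 2)).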

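The filter conditions are collected and passed into the URL; a for loop then fetches several pages of data and saves each person's picture and information.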

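def query_data():
    print("Enter your filter conditions to start searching")
    # Age
    startage,endage = set_age()
    # Gender
    gender = set_sex()
    # Height
    startheight,endheight = set_heigth()
    # Salary
    salary = set_salary()

    for i in range(1,5):
        # Named result rather than json so the json module is not shadowed
        result = get_one_page(i,startage,endage,gender,startheight,endheight,salary)
        if result is None:
            # get_one_page returns None when the request fails; skip this page
            continue
        #print(result['data']['list'])
        for item in result['data']['list']:
            # Save the avatar
            #save_image(item)
            # Save the personal information
            save_info(item)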
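
The loop above always requests pages 1 to 4. A possible alternative is to keep requesting pages until one comes back empty. The sketch below assumes, without having verified it against the site, that a page past the end returns an empty data.list; iter_items and the fake pages in the demo are hypothetical.

```python
def iter_items(fetch_page, first_page=1, max_pages=50):
    """Yield items page by page until a page is missing or empty."""
    for page in range(first_page, first_page + max_pages):
        result = fetch_page(page)
        if not result:
            break
        # Tolerate a missing or null 'data' / 'list' field
        items = (result.get('data') or {}).get('list') or []
        if not items:
            break
        for item in items:
            yield item


if __name__ == "__main__":
    # Fake responses standing in for get_one_page
    pages = {
        1: {'data': {'list': [{'username': 'a'}, {'username': 'b'}]}},
        2: {'data': {'list': [{'username': 'c'}]}},
        3: {'data': {'list': []}},
    }
    for item in iter_items(lambda page: pages.get(page)):
        print(item['username'])
```

The max_pages cap is there so that an unexpected response shape cannot turn the loop into an endless crawl.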

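Saving the avatar

The avatar URL is stored in the avatar field of each item in the JSON response, so the image address can be read directly from there.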

def save_image(item):
    if not os.path.exists('images'):
        os.mkdir('images')
    response = requests.get(item['avatar'], timeout=10)
    if response.status_code == 200:
        file_path = 'images/{}.jpg'.format(item['username'])
        if not os.path.exists(file_path):
            print("Fetching the avatar of %s" % (item['username']))
            # Save the image in binary mode
            with open(file_path,'wb') as f:
                # response.content holds the raw bytes of the image
                f.write(response.content)
        else:
            print("This image has already been saved")
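
The username is used directly as the file name, so a name containing characters such as / or ? could make open fail. The following sketch shows one way to guard against that; safe_filename is a hypothetical helper, and the set of replaced characters (those not allowed in Windows file names) is an assumption about what needs filtering.

```python
import re


def safe_filename(name, fallback="unnamed"):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", name).strip(" .")
    # An empty result (for example a name made only of dots) gets a fallback
    return cleaned or fallback


if __name__ == "__main__":
    print(safe_filename('a/b:c'))
    print(safe_filename('小明'))
```

The file path could then be built as 'images/{}.jpg'.format(safe_filename(item['username'])), and likewise for the text files.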

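Saving the basic information

Each person's information is saved to its own text file; the fields under item can be read directly and written to the file.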

def save_info(item):
    if not os.path.exists('message'):
        os.mkdir('message')
    file_path = 'message/{}.text'.format(item['username'])
    with open(file_path,'w',encoding='utf-8') as f:
        # format() also copes with a numeric birthdayyear; newlines separate the fields
        f.write("username: {}\nbirth: {}\n".format(item['username'], item['birthdayyear']))

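I also tried writing all of the scraped information into a single file, but when I opened the file the formatting was quite messy. Writing one JSON object per line with ensure_ascii=False, as in the version below, keeps the records separate and the Chinese text readable.

def save_info(item):
    if not os.path.exists('message'):
        os.mkdir('message')
    data = {
        'username' : item['username'],
        'birth' : item['birthdayyear'],
        'gender':item['gender'],
        'height':item['height'],
        'education': item['education'],
        'monolog' : item['monolog'],
        'city': item['city']
    }
    #print(type(data)) dict
    #print(data)
    # ensure_ascii=False keeps Chinese characters as-is, with no encode/decode round trip
    jsondata = json.dumps(data, ensure_ascii=False)
    print(jsondata)
    #print(type(jsondata))
    # Append every record to a single text file, one JSON object per line
    file_path = 'message/{}.text'.format('meizi')
    with open(file_path,'a',encoding='utf-8') as f:
        f.write(jsondata + '\n')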
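
When every record is written as one JSON object per line, the file can be read back line by line. The sketch below is a self-contained round trip under that assumption; append_record and load_records are hypothetical names, and the two sample records are made up.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


def load_records(path):
    """Read back every non-empty line as a dict."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Write to a temporary directory so the demo leaves the project untouched
    path = os.path.join(tempfile.mkdtemp(), 'meizi.text')
    append_record(path, {'username': '小明', 'height': '170'})
    append_record(path, {'username': '小紅', 'height': '162'})
    for record in load_records(path):
        print(record['username'], record['height'])
```

Because json.dumps escapes line breaks inside strings, a multi-line monolog still occupies a single line in the file.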

Calling the function to run the crawl

query_data()

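I have only just started learning web scraping and there is still a lot I don't know; I simply want to use CSDN to record my learning process.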
