Scraping Toutiao street-snap images via Ajax

Analyzing the requests

The requests whose URLs begin with ?aid carry the content data.
Scrolling the page triggers a series of consecutive ?aid requests:
- https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1582600289707
- https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1582601046539

Comparing the links shows that only offset and timestamp differ. In testing, the value of timestamp did not appear to affect the returned content.
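The comparison above can also be done mechanically. The following is a minimal sketch, not part of the original code: it parses the two captured URLs with the standard library and reports which query parameters differ. The helper name `diff_params` is made up for this illustration.

```python
from urllib.parse import urlsplit, parse_qs

# The two request URLs captured above
URL_A = ('https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search'
         '&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20'
         '&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1582600289707')
URL_B = ('https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search'
         '&offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20'
         '&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1582601046539')


def diff_params(url_a, url_b):
    """Return {name: (value_a, value_b)} for query parameters that differ."""
    qa = parse_qs(urlsplit(url_a).query)
    qb = parse_qs(urlsplit(url_b).query)
    return {
        name: (qa.get(name), qb.get(name))
        for name in sorted(set(qa) | set(qb))
        if qa.get(name) != qb.get(name)
    }


changed = diff_params(URL_A, URL_B)
print(sorted(changed))
```

Since offset advances by the same amount as count (20), paging through results should only require stepping offset.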

Fetching one batch of results

import requests
headers = {
    'cookie': 
    'tt_webid=6797200619561698823; s_v_web_id=verify_k7195l9r_uXMR9eu7_6yoD_4gkg_BOXR_MKFTGKfMqteU; ttcid=258a3cc32ee8498599686a745574cf7b28; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6797200619561698823; csrftoken=c44d5abf445176f703e9994d9aea0b16; tt_scid=WBDUqNCX24zV0vnk7GkqwcTaUwgHDmOuOTC4cg8N.K2fPREnRW.D6XVshWxiaxPAb9ed',
    'accept': 
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'User-Agent':
    'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Mobile Safari/537.36'
}
def get_info(url):
    # url: the Ajax request URL
    # Returns the parsed JSON, or None if the request fails
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.json()
    except requests.RequestException as e:
        print('Error:', e.args)
    return None

test_json = get_info('https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1582600289707')

Parsing the JSON

for info in test_json.get('data'):
    # Only entries that have an abstract are treated as articles
    if info.get('abstract') is not None:
        print('Title:', info.get('title'))
        print('Author:', info.get('source'))
        print('Type:', info.get('display_type_self'))
        print('URL:', info.get('article_url'))

Title: 北京街拍：三里屯潮拍凸顯與衆不同的時尚街拍，與衆不同最時尚
Author: 皇城根五爺原創街拍
Type: self_gallery
URL: http://toutiao.com/group/6428768938196206081/
Title: 隨手街拍 殿堂級美女
Author: 寬城地出溜
Type: self_article
URL: http://toutiao.com/group/6777243413981954575/
Title: 街拍：有人說這樣拍，纔算時尚街拍
Author: 皇城根五爺原創街拍
Type: self_gallery
URL: http://toutiao.com/group/6609426526892982798/
Title: 路人街拍，“肥而不膩”的難得女神
Author: 西貝時尚
Type: self_article
URL: http://toutiao.com/group/6773219752874607108/
Title: 街拍：性感就是如此簡單，豐滿韻味包臀裙
Author: 感遇街拍
Type: self_article
URL: http://toutiao.com/group/6795774593480000011/
Title: 三里屯街拍
Author: 艾絲
Type: self_gallery
URL: http://toutiao.com/group/6760980286252515853/
Title: 街拍：人間極品，完美的身材，天使的臉龐，我去哪裏找
Author: 感遇街拍
Type: self_article
URL: http://toutiao.com/group/6795524234970923531/
Title: 街拍：好身材高顏值的美女們
Author: 秋水一手諮詢
Type: self_gallery
URL: http://toutiao.com/group/6797002678686712323/
Title: 街拍：冬季穿搭，三里屯潮人穿搭，永遠都是少女們的時尚風向標
Author: 皇城根五爺原創街拍
Type: self_gallery
URL: http://toutiao.com/group/6774540258336834056/
Title: 街拍：美女姐姐上街頭，美到發亮，100%回頭率
Author: 感遇街拍
Type: self_article
URL: http://toutiao.com/group/6796140055858512398/
Title: 街拍：好好了解時尚，不斷提升自己的穿搭技巧，彰顯自身的個性
Author: 皇城根五爺原創街拍
Type: self_gallery
URL: http://toutiao.com/group/6796847662701216268/
Title: 街拍：你想和圖幾有故事
Author: 秋水一手諮詢
Type: self_gallery
URL: http://toutiao.com/group/6796633637992268301/
Title: 50位街頭攝影大師，50張經典街拍作品
Author: 寧影紀
Type: self_article
URL: http://toutiao.com/group/6793614971650441731/
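The filtering step can be tried offline without hitting the site. The sketch below runs the same abstract check on a small hand-written payload. The sample data and the helper name `extract_articles` are hypothetical, and the guess that entries without an abstract are not ordinary articles is an assumption based on the filter used above, not something confirmed by the API.

```python
# Hypothetical response fragment; a real response carries many more fields.
sample = {
    'data': [
        {'abstract': 'a', 'title': 'Gallery post', 'source': 'author1',
         'display_type_self': 'self_gallery',
         'article_url': 'http://toutiao.com/group/1/'},
        {'title': 'Entry without an abstract'},  # skipped by the filter
        {'abstract': 'b', 'title': 'Article post', 'source': 'author2',
         'display_type_self': 'self_article',
         'article_url': 'http://toutiao.com/group/2/'},
    ]
}

FIELDS = ('title', 'source', 'display_type_self', 'article_url')


def extract_articles(payload):
    """Keep only entries that have an abstract, reduced to the fields of interest."""
    items = payload.get('data') or []  # tolerate a missing or null 'data'
    return [
        {key: item.get(key) for key in FIELDS}
        for item in items
        if item.get('abstract') is not None
    ]


articles = extract_articles(sample)
galleries = [a for a in articles if a['display_type_self'] == 'self_gallery']
print(len(articles), len(galleries))
```

Collecting the results into a list of dicts, rather than printing them directly, makes it easy to select only the self_gallery entries later on.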

Getting the image list

At this point the image URLs are no longer in the Ajax response; they come from JavaScript embedded in the article page.

import re
import json
# Note: a different set of headers is needed here
newheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}
def get_imgs(url):
    # url: the article detail page URL
    # Returns a list of image URLs (empty if nothing could be extracted)
    try:
        response = requests.get(url, headers=newheaders, timeout=10)
        if response.status_code == 200:
            content = response.text
            # Locate the gallery data embedded in the page's JavaScript
            match = re.search(r'gallery: JSON\.parse\((.*?)\),', content, re.S)
            if match is None:
                return []
            # The payload is a JSON string that itself contains JSON,
            # so json.loads has to be applied twice
            images = json.loads(json.loads(match.group(1)))
            images = images['sub_images']
            # Strip any leftover backslashes from the URLs
            return [img.get('url').replace('\\', '') for img in images]
    except requests.RequestException as e:
        print('Error:', e.args)
    return []

get_imgs('http://toutiao.com/group/6428768938196206081/')

['http://p3.pstatp.com/origin/243b0002e5ef87385601',
 'http://p1.pstatp.com/origin/243b0002e5f15c188892',
 'http://p1.pstatp.com/origin/26e300004740640ff18a',
 'http://p1.pstatp.com/origin/24390003cc898fb29a19',
 'http://p1.pstatp.com/origin/24380000f50c5c50cd76',
 'http://p3.pstatp.com/origin/24340002e8685c2bdd8f',
 'http://p3.pstatp.com/origin/243b0002e5f3a9124820',
 'http://p1.pstatp.com/origin/243a00030149960659dc',
 'http://p1.pstatp.com/origin/243b0002e5f44b3d8cdc',
 'http://p3.pstatp.com/origin/24380000f50fa79e73e6',
 'http://p1.pstatp.com/origin/24390003cc8b1daf17a6']
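The double decoding in get_imgs can be reproduced offline. The sketch below uses a made-up page fragment (the `example.com` URL and the surrounding text are assumptions, not captured from Toutiao) to show why one json.loads call is not enough: the argument of JSON.parse is a JSON string literal whose content is itself a JSON document.

```python
import json
import re

# Hypothetical fragment of an article page, written by hand for illustration.
page = r'''
    gallery: JSON.parse("{\"sub_images\":[{\"url\":\"http:\\/\\/example.com\\/origin\\/abc\"}]}"),
    siblingList: [],
'''

match = re.search(r'gallery: JSON\.parse\((.*?)\),', page, re.S)
raw = match.group(1)

# First pass: decode the quoted JSON string literal into plain text
inner_text = json.loads(raw)
# Second pass: decode that text into a Python dict
gallery = json.loads(inner_text)

urls = [img['url'] for img in gallery['sub_images']]
print(type(inner_text).__name__, type(gallery).__name__)
print(urls)
```

In this sample the second pass already turns the escaped slashes into plain ones, so no backslashes remain. The extra backslash removal in get_imgs is a safeguard for pages where the escaping differs.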

Saving locally

import os
from hashlib import md5
def save_img(filename, imgs):
    # filename: target folder name; imgs: list of image URLs
    # Create the folder if it does not exist
    os.makedirs(filename, exist_ok=True)
    for img in imgs:
        try:
            response = requests.get(img, timeout=10)
            if response.status_code == 200:
                # Name each file after the MD5 hash of its content
                file_path = '{0}/{1}.{2}'.format(filename, md5(response.content).hexdigest(), 'jpg')
                if not os.path.exists(file_path):
                    with open(file_path, "wb") as f:
                        f.write(response.content)
        except requests.RequestException as e:
            print('Error:', e.args)

save_img('hhh', get_imgs('http://toutiao.com/group/6428768938196206081/'))
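Naming files after the MD5 hash of their content means that an image downloaded twice ends up under the same name, so duplicates are skipped. A minimal offline sketch of that idea follows. It writes byte strings into a temporary folder instead of downloading anything, and the helper name `save_bytes` is made up for this example.

```python
import os
import tempfile
from hashlib import md5


def save_bytes(folder, data, ext='jpg'):
    """Write data to folder/<md5 of data>.<ext>; return (path, newly_written)."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, '{0}.{1}'.format(md5(data).hexdigest(), ext))
    if os.path.exists(path):
        return path, False  # identical content was already saved
    with open(path, 'wb') as f:
        f.write(data)
    return path, True


with tempfile.TemporaryDirectory() as tmp:
    first = save_bytes(tmp, b'fake image bytes')
    second = save_bytes(tmp, b'fake image bytes')   # same content, same name
    other = save_bytes(tmp, b'different bytes')     # new content, new name
    files = sorted(os.listdir(tmp))

print(first[1], second[1], other[1], len(files))
```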

Putting it all together

from urllib.parse import quote
import time
headers = {
    'cookie':
    'tt_webid=6797200619561698823; s_v_web_id=verify_k7195l9r_uXMR9eu7_6yoD_4gkg_BOXR_MKFTGKfMqteU; ttcid=258a3cc32ee8498599686a745574cf7b28; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6797200619561698823; csrftoken=c44d5abf445176f703e9994d9aea0b16; tt_scid=WBDUqNCX24zV0vnk7GkqwcTaUwgHDmOuOTC4cg8N.K2fPREnRW.D6XVshWxiaxPAb9ed',
    'accept':
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'User-Agent':
    'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Mobile Safari/537.36'
}
newheaders = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}


def get_url(keyword, offset):
    # Arguments: the search keyword and the result offset
    return 'https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=' + str(
        offset
    ) + '&format=json&keyword=' + quote(
        keyword
    ) + '&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=' + str(
        int(time.time() * 1000))


def toutiao(keyword, number):
    # Arguments: the search keyword and the number of galleries to download
    offset = 0
    total = 0
    os.makedirs(keyword, exist_ok=True)
    while True:
        infos = get_info(get_url(keyword, offset))
        data = infos.get('data') if infos else None
        # Stop when the request fails or there are no more results
        if not data:
            return
        for info in data:
            # This approach only works for gallery-type entries
            if info.get('abstract') is not None and info.get('display_type_self') == "self_gallery":
                filename = keyword + "/【" + info.get('source') + '】' + info.get('title')
                total += 1
                save_img(filename, get_imgs(info.get('article_url')))
                if total == number:
                    return
        # Move on to the next page of results
        offset += 20

toutiao('街拍', 4)
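As an alternative to the string concatenation in get_url, the query string can be assembled from a dict with urllib.parse.urlencode, which also handles the percent-encoding of the keyword. This is only a sketch: the parameter values are copied from the captured URLs, while the names `build_url` and `page_urls` are made up for this example.

```python
import time
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = 'https://www.toutiao.com/api/search/content/?'


def build_url(keyword, offset, count=20):
    """Build one search URL; parameter values mirror the captured requests."""
    params = {
        'aid': 24,
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': keyword,          # urlencode percent-encodes this
        'autoload': 'true',
        'count': count,
        'en_qc': 1,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': int(time.time() * 1000),
    }
    return BASE + urlencode(params)


def page_urls(keyword, pages, count=20):
    """URLs for the first `pages` result pages, stepping offset by `count`."""
    return [build_url(keyword, page * count, count) for page in range(pages)]


urls = page_urls('街拍', 3)
offsets = [parse_qs(urlsplit(u).query)['offset'][0] for u in urls]
print(offsets)
```

Keeping the parameters in a dict makes it easier to see which values change between pages, namely offset and timestamp.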
