Scraping an idol's Instagram posts with Python (including images and videos)

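Disclaimer: this post is for technical exchange only. Do not use it for illegal purposes; this blog is not responsible for any loss caused by such use.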

This crawler uses: `requests`

1. Environment

Python 3.7
PyCharm
requests
Windows 10
pymysql

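2. Analyzing the page

After opening the profile page, you can see that it loads dynamically via AJAX. Moreover, as you scroll and later posts load, the earlier ones are removed from the page.

Press F12 to open the developer tools and look at the XHR requests. You can find the AJAX request URL, and the pattern is that the `after` parameter keeps changing to fetch each batch of data. The value looks random, with no visible pattern, so where does the parameter for the next page come from? The answer is the returned JSON itself (I spent a long time trying to work out how the parameter was generated before realizing it was sitting right there in the response).

That raises another question: where does the very first `after` value come from? It is in the source of the initially requested page; view the source of Jay Chou's profile page and you will find it.
You might then ask how to pull this string out of such a large, messy page source. The answer is a regular expression, plain and simple.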

import re
import requests

s=requests.session()
s.headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
r=s.get('https://www.instagram.com/jaychou/')
# Grab the first `after` cursor from the page source
qvf=re.findall('"has_next_page":true,"end_cursor":"(.*?)"},"edges"',r.text)[0]
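The `[0]` index above raises `IndexError` when the pattern is absent, for example if the profile has no further page or the page markup has changed. A minimal sketch of a more defensive extraction, assuming the page source still embeds the cursor in the same form; `extract_first_cursor` and the sample string are hypothetical and only illustrate the idea:

```python
import re

# Pattern taken from the article; it assumes the cursor is embedded in the
# page source exactly in this form.
CURSOR_RE = re.compile(r'"has_next_page":true,"end_cursor":"(.*?)"},"edges"')


def extract_first_cursor(html):
    """Return the first end_cursor found in html, or None if it is absent."""
    match = CURSOR_RE.search(html)
    return match.group(1) if match else None


# Made-up fragment that mimics the shape the pattern expects.
sample = '{"page_info":{"has_next_page":true,"end_cursor":"QVFEabc123=="},"edges":[]}'
print(extract_first_cursor(sample))
print(extract_first_cursor("<html>no cursor here</html>"))
```

Returning `None` lets the caller decide whether a missing cursor means "nothing more to fetch" or "the page layout changed".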

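With this initial key parameter in hand, what remains is writing the code that extracts the wanted information from the JSON data. First, construct the request URL: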

import urllib.request  # provides quote(), used to URL-encode the variables

qvf='{"id":"5951385086","first":12,"after":"%s"}'%qvf
link='https://www.instagram.com/graphql/query/?query_hash=2c5d4d8b70cad329c4a6ebe3abb6eedd&variables='+urllib.request.quote(qvf)
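Formatting the JSON by hand works, but it breaks if a value ever contains a quote character. A sketch of the same URL built with `json.dumps` and `urllib.parse.quote`; the `id` and `query_hash` values are the ones from this post, while `build_query_url` is a hypothetical helper name:

```python
import json
from urllib.parse import quote

QUERY_HASH = "2c5d4d8b70cad329c4a6ebe3abb6eedd"  # value used in this post
BASE_URL = "https://www.instagram.com/graphql/query/"


def build_query_url(user_id, after, first=12):
    """Build the GraphQL URL for the page that follows cursor `after`."""
    variables = json.dumps(
        {"id": user_id, "first": first, "after": after},
        separators=(",", ":"),  # compact form, like the hand-written string
    )
    return f"{BASE_URL}?query_hash={QUERY_HASH}&variables={quote(variables)}"


url = build_query_url("5951385086", "QVFEabc123==")
print(url)
```

For a cursor without special characters this should produce the same URL as the string-formatting version.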

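The information I extract:

Unique id of each post
Caption
Like count
Comment count
Publish time
Location
Images and videos

See the code comments for details.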
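The metadata fields listed above can be pulled out of one `edge` with a small function. This is only a sketch: the key names are the ones the full script later in this post reads, `sample_edge` is fabricated for illustration and is not real Instagram data, and `parse_edge` is a hypothetical helper. Image and video URLs depend on the post type and are handled in the full script.

```python
import time


def parse_edge(edge):
    """Flatten one timeline edge into a plain dict of the metadata we keep."""
    node = edge["node"]
    captions = node["edge_media_to_caption"]["edges"]
    location = node.get("location")
    return {
        "id": node["id"],
        "text": captions[0]["node"]["text"] if captions else "",
        # taken_at_timestamp is a Unix timestamp; format it in UTC
        "public_time": time.strftime(
            "%Y-%m-%d %H:%M:%S", time.gmtime(node["taken_at_timestamp"])
        ),
        "location": location["name"] if location else None,
        "good_count": node["edge_media_preview_like"]["count"],
        "comment_count": node["edge_media_to_comment"]["count"],
    }


# Fabricated example with the same nesting the script expects.
sample_edge = {
    "node": {
        "id": "1",
        "edge_media_to_caption": {"edges": [{"node": {"text": "hello"}}]},
        "taken_at_timestamp": 0,
        "edge_media_preview_like": {"count": 10},
        "edge_media_to_comment": {"count": 2},
        "location": None,
    }
}
print(parse_edge(sample_edge))
```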

3. Database design

Create the table

import pymysql
conn = pymysql.connect(host='localhost', user='youruser', password='yourpassword', database='yourdatabase')
cursor = conn.cursor()
sql='''
CREATE TABLE IF NOT EXISTS jaychou(
id VARCHAR(255) PRIMARY KEY,
text longtext,
public_time datetime,
location VARCHAR(255),
good_count int,
comment_count int,
imgage_url longtext,
video_url longtext,
video_view_count VARCHAR(255)
)
ENGINE=innodb DEFAULT CHARSET=utf8;'''
cursor.execute(sql)

Alter the table (so that emoji can be inserted)

sql='ALTER TABLE jaychou CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci'
cursor.execute(sql)

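After several rounds of debugging, I ran into a few problems:

Emoji could not be inserted into the table
String columns are best declared as longtext (to be safe)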
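The emoji problem comes from MySQL's `utf8` character set storing at most 3 bytes per character, while most emoji need 4 bytes in UTF-8; `utf8mb4` lifts that limit. A small sketch that shows the byte lengths involved (the helper name `needs_utf8mb4` is hypothetical):

```python
def needs_utf8mb4(text):
    """True if any character takes 4 bytes in UTF-8 (too wide for MySQL utf8)."""
    return any(len(ch.encode("utf-8")) > 3 for ch in text)


print(len("周".encode("utf-8")))   # CJK character: 3 bytes
print(len("😀".encode("utf-8")))  # emoji: 4 bytes
print(needs_utf8mb4("哎呦不錯哦"))
print(needs_utf8mb4("哎呦不錯哦 😀"))
```

Depending on the server defaults, it may also be necessary to pass `charset='utf8mb4'` to `pymysql.connect` so that the connection uses the same character set as the converted table.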

4. Full code

import requests
import re
import time
import random
import pymysql
import os
import json
import urllib.request  # needed for urllib.request.quote() below

conn = pymysql.connect(host='localhost', user='youruser', password='yourpassword', database='yourdatabase')
cursor = conn.cursor()
sql='''
CREATE TABLE IF NOT EXISTS jaychou(
id VARCHAR(255) PRIMARY KEY,
text longtext,
public_time datetime,
location VARCHAR(255),
good_count int,
comment_count int,
imgage_url longtext,
video_url longtext,
video_view_count VARCHAR(255)
)
ENGINE=innodb DEFAULT CHARSET=utf8;'''
cursor.execute(sql)

sql='ALTER TABLE jaychou CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci'
cursor.execute(sql)

cursor.close()
conn.close()

s=requests.session()
s.headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
r=s.get('https://www.instagram.com/jaychou/')
# Get the first `after` cursor
qvf=re.findall('"has_next_page":true,"end_cursor":"(.*?)"},"edges"',r.text)[0]
k=1  # loop counter
while True:
    conn = pymysql.connect(host='localhost', user='youruser', password='yourpassword', database='yourdatabase')
    cursor = conn.cursor()
    sql='select * from jaychou'
    res=cursor.execute(sql)
    res=cursor.fetchall()
    temp=[]  # ids already stored, used to avoid duplicate inserts
    if len(res) != 0:
        for i in range(len(res)):
            temp.append(list(res[i])[0])
    qvf='{"id":"5951385086","first":12,"after":"%s"}'%qvf
    link='https://www.instagram.com/graphql/query/?query_hash=2c5d4d8b70cad329c4a6ebe3abb6eedd&variables='+urllib.request.quote(qvf)
    response=s.get(link).json()
    time.sleep(random.uniform(1,1.5))
    qvf=response['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
    edges=response['data']['user']['edge_owner_to_timeline_media']['edges']
    print(f'Iteration {k}: {len(edges)} edges')
    for edge in edges:
        imgage_url = []  # image URLs
        video_url = []   # video URLs
        video_view_count = []   # video view counts
        id=edge['node']['id']  # unique post id
        typename=edge['node']['__typename']
        text = 'N/A'
        if edge['node']['edge_media_to_caption']['edges'] != []:
            text = edge['node']['edge_media_to_caption']['edges'][0]['node']['text']  # caption
        public_time=time.strftime('%Y-%m-%d %H:%M:%S',time.gmtime(edge['node']['taken_at_timestamp']))  # publish time (UTC)
        good_count = edge['node']['edge_media_preview_like']['count']  # like count
        comment_count = edge['node']['edge_media_to_comment']['count']  # comment count
        location='N/A' if edge['node']['location'] is None else edge['node']['location']['name']  # location
        if typename == 'GraphImage':  # a single image
            imgage_url.append(edge['node']['display_url'])  # image URL
        elif typename == 'GraphVideo': # a single video
            video_url.append(edge['node']['video_url'])  # video URL
            video_view_count.append(str(edge['node']['video_view_count']))  # video view count
        elif typename == 'GraphSidecar':  # an album: several images, several videos, or a mix
            childrens = edge['node']['edge_sidecar_to_children']['edges']
            for children in childrens:
                if children['node']['is_video'] is False:
                    imgage_url.append(children['node']['display_url'])
                else:
                    video_url.append(children['node']['video_url'])
                    video_view_count.append(str(children['node']['video_view_count']))
        if id not in temp:
            name=public_time.replace(':','')  # file names cannot contain colons
            path=f'./jaychou/{name}/'
            if not os.path.exists(path):
                os.makedirs(path)
            with open(path+'caption.txt','w',encoding='utf-8') as f:
                f.write(text)
            for i,each in enumerate(imgage_url):  # download images
                with open(path+f'{i+1}.jpg','wb') as f:
                    res=s.get(each)
                    f.write(res.content)
                    time.sleep(random.uniform(1,1.5))
            for i,each in enumerate(video_url):   # download videos
                with open(path+f'{i+1}.mp4','wb') as f:
                    res=s.get(each)
                    f.write(res.content)
                    time.sleep(random.uniform(1,1.5))
            sql='insert into jaychou(id,text,public_time,location,good_count,comment_count,imgage_url,video_url,video_view_count) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)'
            cursor.execute(sql,[id,text,public_time,location,good_count,comment_count,','.join(imgage_url),','.join(video_url),','.join(video_view_count)])
            conn.commit()
    k+=1
    cursor.close()
    conn.close()
    if qvf is None:  # no next page: leave the loop
        break
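The loop above mixes paging, parsing, downloading, and database access. The paging part on its own can be sketched as a generator that follows `end_cursor` until it is `None`. Here `fetch_page` is a hypothetical callable standing in for the `s.get(link).json()` request, and `fake_pages` is made-up data so the sketch runs offline:

```python
def iter_edges(fetch_page, first_cursor):
    """Yield edges page by page, following end_cursor until it is None."""
    cursor = first_cursor
    seen = set()  # ids already yielded, to skip duplicates
    while cursor is not None:
        media = fetch_page(cursor)["data"]["user"]["edge_owner_to_timeline_media"]
        for edge in media["edges"]:
            post_id = edge["node"]["id"]
            if post_id not in seen:
                seen.add(post_id)
                yield edge
        cursor = media["page_info"]["end_cursor"]


def _page(ids, end_cursor):
    """Build a fake response with the same nesting as the one the script reads."""
    return {"data": {"user": {"edge_owner_to_timeline_media": {
        "page_info": {"end_cursor": end_cursor},
        "edges": [{"node": {"id": i}} for i in ids],
    }}}}


# Made-up pages keyed by the cursor that requests them.
fake_pages = {"c1": _page(["a", "b"], "c2"), "c2": _page(["b", "c"], None)}

ids = [edge["node"]["id"] for edge in iter_edges(fake_pages.__getitem__, "c1")]
print(ids)
```

Keeping the seen ids in a set avoids re-reading the whole table on every iteration; in a real run the set could be seeded once from the ids already stored in the database.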

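5. Results

Closing words

As you can tell, I am a Jay Chou fan. It goes back to fifth grade. I still remember a bright, sunny noon after school: each class lined up and walked together from the classroom to the school gate, and a good friend next to me was singing, "I still remember you said home is the only castle...". I asked him what song it was, and he proudly answered, "Dao Xiang" (稻香). When I got home I went online, found the song, downloaded it to my MP3 player, and listened to it over and over, singing along. One song was not enough, so I downloaded all of Jay's earlier songs too and played them on loop again and again. Later I bought an MP4 player that could display lyrics. Copying lyrics was popular back then, so I copied them into my textbooks; stylized "non-mainstream" signatures were also in fashion, and I filled my textbooks with the three characters 周杰倫 in every hand-drawn signature style I could manage. Pretty slick, haha. Jay Chou, a.k.a. 周春春: his songs have been with me through every happy and sad moment. These days, whenever I am on a bus, writing code, or taking a shower, I open my music app, go to my favorites, and listen to Jay Chou. There are so many more moments tied to Jay, but the most important one is still to come: going to a Jay Chou concert together with my other half.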
