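Multithreaded scraping of ts files with Python and merging them into an mp4 video

Disclaimer: this post is for technical exchange only. Do not use it for illegal purposes; this blog is not responsible for any loss caused by illegal use.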

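Preface

As I see it, scraping videos falls into three difficulty levels: easy, medium, and hard.

Easy: the page exposes a direct link to an mp4 file, so a single request fetches the video, just like downloading an image.
Medium: the page serves ts files. The ts segments are listed in an m3u8 playlist, so requesting that m3u8 file gives the request URLs of all the ts files; you then download every segment and merge them into one mp4 video.
Hard: the same as medium, except that the m3u8 file no longer shows the ts files in plain text; the site encrypts them with some algorithm.

The video scraped in this post is at the medium level. The target page is the one whose URL appears in the code further down.
Without further ado, let's get started.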

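I. Analyzing the page

First open the developer tools: the URL of each episode sits in a list of li elements.

Then open the playback page of episode 1, open the developer tools again, switch to the Network tab, and refresh the page. The second m3u8 file lists all the ts files, so that is what we are after, except that the ts URLs in it are incomplete (they are relative).

The response of the first m3u8 file contains the string 1024k/hls/index.m3u8, which is the tail of the second m3u8 file's URL; each ts URL, in turn, is the second m3u8 URL with only its last component replaced. So at this point all the ts URLs are known as long as we have the URL of the first m3u8 file.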

First m3u8: https://mojing.huoyanzuida.com/20200424/2487_d0fc7191/index.m3u8
Second m3u8: https://mojing.huoyanzuida.com/20200424/2487_d0fc7191/1024k/hls/index.m3u8
First ts: https://mojing.huoyanzuida.com/20200424/2487_d0fc7191/1024k/hls/33a92401b72000000.ts
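
The relationship between these three URLs is ordinary relative-URL resolution. Here is a minimal sketch of it using urljoin from the standard library; the variable names are mine, and the two relative paths are the ones observed above.

```python
from urllib.parse import urljoin

first_m3u8 = "https://mojing.huoyanzuida.com/20200424/2487_d0fc7191/index.m3u8"

# Relative paths as they appear inside the two playlists
variant_path = "1024k/hls/index.m3u8"   # found in the first m3u8's response
segment_name = "33a92401b72000000.ts"   # one entry of the second m3u8

# Each relative path replaces the last component of the URL it was found in
second_m3u8 = urljoin(first_m3u8, variant_path)
first_ts = urljoin(second_m3u8, segment_name)

print(second_m3u8)
print(first_ts)
```

The code later in this post gets the same result with str.replace('index.m3u8', ...); urljoin is just a more general way to do it.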

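Next, we need to find out how the first m3u8 URL relates to the pages we started from. A global search for "m3u8" turns up a Base64-encoded string in a file named 5014.js. (Base64 is an encoding rather than encryption, so no key is needed.)

Decoding it gives: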

%u7b2c01%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2487_d0fc7191%2Findex.m3u8%23%u7b2c02%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2484_640df7e0%2Findex.m3u8%23%u7b2c03%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2490_0b2ee7ab%2Findex.m3u8%23%u7b2c04%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2485_029c4007%2Findex.m3u8%23%u7b2c05%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2486_957bb1f3%2Findex.m3u8%23%u7b2c06%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2488_06dae5ae%2Findex.m3u8%23%u7b2c07%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2497_4350d451%2Findex.m3u8%23%u7b2c08%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2489_677b9744%2Findex.m3u8%23%u7b2c09%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2495_3e03853a%2Findex.m3u8%23%u7b2c10%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2491_de7cb550%2Findex.m3u8%23%u7b2c11%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2492_e8221393%2Findex.m3u8%23%u7b2c12%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2493_5b52e7e5%2Findex.m3u8%23%u7b2c13%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2494_8ebe1863%2Findex.m3u8%23%u7b2c14%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2496_a814c3b3%2Findex.m3u8%23%u7b2c15%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2500_cafb68ab%2Findex.m3u8%23%u7b2c16%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2498_9e696bf2%2Findex.m3u8%23%u7b2c17%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2499_0015700c%2Findex.m3u8%23%u7b2c18%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2502_c39cb88d%2Findex.m3u8%23%u7b2c19%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2501_c12a81f8%2Findex.m3u8%23%u7b2c20%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2503_5fd7c956%2Findex.m3u8%23%u7b2c21%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2553_5efba16b
%2Findex.m3u8%23%u7b2c22%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2510_41b6e254%2Findex.m3u8%23%u7b2c23%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2508_92bd89a2%2Findex.m3u8%23%u7b2c24%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2504_02863479%2Findex.m3u8%23%u7b2c25%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2505_45f36385%2Findex.m3u8%23%u7b2c26%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2506_307718a8%2Findex.m3u8%23%u7b2c27%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2507_2d365300%2Findex.m3u8%23%u7b2c28%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2509_2c9d20a5%2Findex.m3u8%23%u7b2c29%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2512_47a6b558%2Findex.m3u8%23%u7b2c30%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com%2F20200424%2F2511_da5c4e6f%2Findex.m3u8

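Passing that through urllib.parse.unquote then gives (the b'...' wrapper is there because the decoded bytes were converted with str()):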

b'%u7b2c01%u96c6$https://mojing.huoyanzuida.com/20200424/2487_d0fc7191/index.m3u8#%u7b2c02%u96c6$https://mojing.huoyanzuida.com/20200424/2484_640df7e0/index.m3u8#%u7b2c03%u96c6$https://mojing.huoyanzuida.com/20200424/2490_0b2ee7ab/index.m3u8#%u7b2c04%u96c6$https://mojing.huoyanzuida.com/20200424/2485_029c4007/index.m3u8#%u7b2c05%u96c6$https://mojing.huoyanzuida.com/20200424/2486_957bb1f3/index.m3u8#%u7b2c06%u96c6$https://mojing.huoyanzuida.com/20200424/2488_06dae5ae/index.m3u8#%u7b2c07%u96c6$https://mojing.huoyanzuida.com/20200424/2497_4350d451/index.m3u8#%u7b2c08%u96c6$https://mojing.huoyanzuida.com/20200424/2489_677b9744/index.m3u8#%u7b2c09%u96c6$https://mojing.huoyanzuida.com/20200424/2495_3e03853a/index.m3u8#%u7b2c10%u96c6$https://mojing.huoyanzuida.com/20200424/2491_de7cb550/index.m3u8#%u7b2c11%u96c6$https://mojing.huoyanzuida.com/20200424/2492_e8221393/index.m3u8#%u7b2c12%u96c6$https://mojing.huoyanzuida.com/20200424/2493_5b52e7e5/index.m3u8#%u7b2c13%u96c6$https://mojing.huoyanzuida.com/20200424/2494_8ebe1863/index.m3u8#%u7b2c14%u96c6$https://mojing.huoyanzuida.com/20200424/2496_a814c3b3/index.m3u8#%u7b2c15%u96c6$https://mojing.huoyanzuida.com/20200424/2500_cafb68ab/index.m3u8#%u7b2c16%u96c6$https://mojing.huoyanzuida.com/20200424/2498_9e696bf2/index.m3u8#%u7b2c17%u96c6$https://mojing.huoyanzuida.com/20200424/2499_0015700c/index.m3u8#%u7b2c18%u96c6$https://mojing.huoyanzuida.com/20200424/2502_c39cb88d/index.m3u8#%u7b2c19%u96c6$https://mojing.huoyanzuida.com/20200424/2501_c12a81f8/index.m3u8#%u7b2c20%u96c6$https://mojing.huoyanzuida.com/20200424/2503_5fd7c956/index.m3u8#%u7b2c21%u96c6$https://mojing.huoyanzuida.com/20200424/2553_5efba16b/index.m3u8#%u7b2c22%u96c6$https://mojing.huoyanzuida.com/20200424/2510_41b6e254/index.m3u8#%u7b2c23%u96c6$https://mojing.huoyanzuida.com/20200424/2508_92bd89a2/index.m3u8#%u7b2c24%u96c6$https://mojing.huoyanzuida.com/20200424/2504_02863479/index.m3u8#%u7b2c25%u96c6$https://mojing.huoyanzuida.com/20200424/2505_45f36385/index.m3u
8#%u7b2c26%u96c6$https://mojing.huoyanzuida.com/20200424/2506_307718a8/index.m3u8#%u7b2c27%u96c6$https://mojing.huoyanzuida.com/20200424/2507_2d365300/index.m3u8#%u7b2c28%u96c6$https://mojing.huoyanzuida.com/20200424/2509_2c9d20a5/index.m3u8#%u7b2c29%u96c6$https://mojing.huoyanzuida.com/20200424/2512_47a6b558/index.m3u8#%u7b2c30%u96c6$https://mojing.huoyanzuida.com/20200424/2511_da5c4e6f/index.m3u8'

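Now the actual URLs are clearly visible. The first one is exactly the URL of our first m3u8 file, and the string contains the m3u8 URLs of every episode of the series, which is great: there is no need to request each episode's page to get its m3u8 file. One question remains: where does the URL of 5014.js come from? It is in the HTML source of the playback page.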
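
The decoding steps described above can be sketched as follows. This is only an illustration under a few assumptions: the function names are hypothetical, the sample payload is rebuilt from the first two entries shown above rather than fetched from the site, and the %uXXXX sequences are treated as JavaScript escape()-style code points, which urllib.parse.unquote leaves untouched.

```python
import base64
import re
from urllib.parse import unquote


def unescape_js(text):
    """Decode escape()-style text: %uXXXX code points first, then ordinary %XX."""
    text = re.sub(r"%u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), text)
    return unquote(text)


def parse_episode_list(encoded):
    """Return [(episode_name, m3u8_url), ...] from the Base64 payload."""
    raw = base64.b64decode(encoded).decode("ascii")
    episodes = []
    # Entries have the form name$url and are separated by '#'
    for item in unescape_js(raw).split("#"):
        name, _, url = item.partition("$")
        if url:
            episodes.append((name, url))
    return episodes


# Hypothetical payload: the first two entries from the decoded string above
sample = (
    "%u7b2c01%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com"
    "%2F20200424%2F2487_d0fc7191%2Findex.m3u8"
    "%23%u7b2c02%u96c6%24https%3A%2F%2Fmojing.huoyanzuida.com"
    "%2F20200424%2F2484_640df7e0%2Findex.m3u8"
)
encoded = base64.b64encode(sample.encode("ascii")).decode("ascii")

episodes = parse_episode_list(encoded)
for name, url in episodes:
    print(name, url)
```

Splitting on '#' instead of matching the text between '$' and '#' also keeps the last entry, which has no trailing '#'.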

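II. Overall approach

1. Get the js file's URL from the HTML source of the playback page, request that js file, and take the Base64-encoded string from its response.
2. Decode the string to get the first m3u8 URL of every episode, then use the relationship between the two m3u8 files to get every episode's second m3u8 URL, that is, the m3u8 that lists all the ts files.
3. Use the relationship between the m3u8 URL and the ts URLs to get all the ts URLs of each episode.
4. Finally, download them with Python multithreading and merge them into an mp4 video.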
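
Step 3 can be tried out without touching the network. Below is a minimal sketch, assuming a playlist in the usual HLS layout where lines starting with '#' are tags and the remaining lines are segment paths. The function name and the sample playlist text are hypothetical; only the first segment name comes from the analysis above.

```python
from urllib.parse import urljoin


def segment_urls(playlist_url, playlist_text):
    """Return absolute URLs for every .ts entry of a media playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments, not segment paths
        if line and not line.startswith("#") and line.endswith(".ts"):
            urls.append(urljoin(playlist_url, line))
    return urls


# Hypothetical playlist content used only for this demonstration
sample_playlist = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
33a92401b72000000.ts
#EXTINF:10.0,
33a92401b72000001.ts
#EXT-X-ENDLIST
"""

playlist_url = "https://mojing.huoyanzuida.com/20200424/2487_d0fc7191/1024k/hls/index.m3u8"
urls = segment_urls(playlist_url, sample_playlist)
print(urls)
```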

III. Writing the code

# Import the required packages and modules
import threading, queue
import time, os, subprocess
import requests, urllib.parse, parsel
import re, base64

# Get the URL of the playback page
def get_bofangye_url(url):
    r = requests.get(url, headers=headers)
    response = parsel.Selector(r.text)
    # href of the first episode link in the episode list
    bofangye_url = 'https://www.dsm8.cc' + response.xpath('//div[@id="vlink_1"]/ul/li/a/@href').get()
    return bofangye_url

# Get the URL of the js file
def get_js_url(bofangye_url):
    r = requests.get(bofangye_url, headers=headers)
    response = parsel.Selector(r.text)
    # The js file is referenced by a script tag inside the player container
    js_url = 'https://www.dsm8.cc' + response.xpath('//div[@id="flash"]/script/@src').get()
    return js_url

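# Get the second m3u8 URL of every episode
def get_all_url(js_url):
    r = requests.get(js_url, headers=headers)
    # Base64 string passed to base64decode('...') in the js file
    a = re.findall(r"base64decode\('(.*?)'\)", r.text)[0]
    decoded = urllib.parse.unquote(base64.b64decode(a).decode('utf-8'))
    # Entries look like name$url and are separated by '#'; splitting keeps the last entry too
    temp_url = [item.split('$', 1)[1] for item in decoded.split('#') if '$' in item]
    # The last non-empty line of the first m3u8 is the relative path of the second m3u8
    # (assumed to be the same for every episode)
    r = requests.get(temp_url[0], headers=headers)
    suffix = r.text.strip().split('\n')[-1].strip()
    all_url = [i.replace('index.m3u8', suffix) for i in temp_url]
    return all_url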

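# Download ts files (worker function run by each thread)
def download_ts(urlQueue):
    while True:
        try:
            # Non-blocking read from the queue
            url = urlQueue.get_nowait()
        except queue.Empty:
            # Queue drained: this worker is done
            break
        # Assumes the file name ends in a three-digit decimal index
        n = int(url[-6:-3])
        response = requests.get(url, stream=True, headers=headers)
        ts_path = "./ts/%03d.ts" % n  # note the zero-padded naming rule for ts files
        with open(ts_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:
                    file.write(chunk)
        print("%03d.ts OK..." % n)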

if __name__ == '__main__':
    url = 'https://www.dsm8.cc/TVB/wanshuiqianshanzongshiqingyueyu.html'  # 萬水千山總是情 (Cantonese version)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'}
    bofangye_url = get_bofangye_url(url)
    js_url = get_js_url(bofangye_url)
    all_url = get_all_url(js_url)

    # Loop over every episode and download it
    for num, url in enumerate(all_url):
        r = requests.get(url, headers=headers)
        urlQueue = queue.Queue()
        for i in r.text.split('\n'):
            i = i.strip()  # drop a trailing '\r' if the playlist uses CRLF line endings
            if i.endswith('.ts'):
                urlQueue.put(url.replace('index.m3u8', i))

        # Multithreaded download
        startTime = time.time()
        threads = []
        # Adjust the number of threads to control the download speed
        threadNum = 4
        for i in range(threadNum):
            t = threading.Thread(target=download_ts, args=(urlQueue,))
            threads.append(t)
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        endTime = time.time()
        print('Done, Time cost: %s ' % (endTime - startTime))

        # Run a cmd command to merge the segments into an mp4 video
        # (the script is assumed to run from D:\python3.7\HEHE\爬蟲, so ./ts is the folder below)
        command = r'copy/b D:\python3.7\HEHE\爬蟲\ts\*.ts D:\python3.7\HEHE\爬蟲\mp4\萬水千山總是情-第{0}集.mp4'.format(num+1)
        output = subprocess.getoutput(command)
        print('萬水千山總是情-第{0}集.mp4  OK...'.format(num+1))

        # Delete all ts files of this episode before starting the next one
        for root, dirs, files in os.walk('D:/python3.7/HEHE/爬蟲/ts'):
            for fn in files:
                os.remove(os.path.join(root, fn))

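IV. Some tips

cmd command to merge ts files into an mp4 (run it in the folder that holds the ts files): copy/b *.ts xxx.mp4
Naming rule for ts files: use zero-padded names such as 000.ts, 001.ts, ... so that the segments are merged in the right order
Delete the ts files promptly after each episode has been downloaded and merged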
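
The copy/b command only works in Windows cmd. As a portable alternative (not what the code in this post does), here is a minimal sketch that concatenates the segments in Python. The function name merge_ts and the demo files are hypothetical, and the demo writes placeholder bytes instead of real video data so that it can run anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def merge_ts(ts_dir, output_path):
    """Concatenate all .ts files in ts_dir, in file-name order, into output_path."""
    parts = sorted(Path(ts_dir).glob("*.ts"))
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(parts)


# Tiny demonstration: segments written out of order, merged in name order
with tempfile.TemporaryDirectory() as tmp:
    ts_dir = Path(tmp) / "ts"
    ts_dir.mkdir()
    for n in (2, 0, 10, 1):
        (ts_dir / ("%03d.ts" % n)).write_bytes(b"segment%d;" % n)
    merged_path = Path(tmp) / "demo.mp4"
    count = merge_ts(ts_dir, merged_path)
    merged = merged_path.read_bytes()

print(count, merged)
```

The zero padding is what makes the sort work: without it, 10.ts would sort before 2.ts. As with copy/b, the result is still an MPEG-TS stream under an .mp4 name, which most players accept.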

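Final words

Three months on, I'm back to blogging. I had been busy with my graduation thesis and finally have some free time.
So why this post? Mainly because I have been scraping videos from this site lately, all of them very old Cantonese TV dramas, uploading them to Tianyi Cloud (天翼雲盤) and playing them on the TV for my dad. Tianyi Cloud recently gave away three months of gold membership, and the storage that comes with it is practically impossible to use up, so I feel like filling it with TV dramas, haha.
Some may ask: why not just find the files and download them? I was really left with no choice. Copies of these old dramas are scarce, Cantonese versions even more so, and the ones I did manage to find were on Baidu Netdisk, where the download speed is painfully slow. That is why I turned to a scraper and then found this site, which has loads of Cantonese dramas. Lovely.
Finally, if you run into one of those hard-level sites, feel free to share it with me.

I recently started a WeChat official account and will post articles there as well; follow it if you are interested. Thanks!
PS: reply 20200526 in the official account to get the source code for this post.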
