Download large file in Python with requests

This article is translated from: Download large file in python with requests

Requests is a really nice library. I'd like to use it to download big files (>1 GB). The problem is that it's not possible to keep the whole file in memory; I need to read it in chunks. And this is a problem with the following code:

import requests

def DownloadFile(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    f = open(local_filename, 'wb')
    for chunk in r.iter_content(chunk_size=512 * 1024): 
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)
    f.close()
    return 

For some reason it doesn't work this way; it still loads the whole response into memory before saving it to a file.

UPDATE

If you need a small client (Python 2.x/3.x) which can download big files from FTP, you can find it here. It supports multithreading & reconnects (it does monitor connections), and it also tunes socket params for the download task.


#1

Reference: https://stackoom.com/question/1836h/使用請求在python中下載大文件


#2

Your chunk size could be too large; have you tried dropping that, maybe 1024 bytes at a time? (Also, you could use with to tidy up the syntax.)

def DownloadFile(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
    return 

Incidentally, how are you deducing that the response has been loaded into memory?

It sounds as if Python isn't flushing the data to file; based on other SO questions, you could try f.flush() and os.fsync() to force the file write and free memory:

    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
                os.fsync(f.fileno())  # requires "import os" at the top of the module
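
As for how to check whether the response really ends up in memory, one option (a sketch, not part of the original answer; it uses the standard-library resource module, which is Unix-only, and a placeholder URL) is to compare the process's peak RSS before and after calling the question's DownloadFile:

import resource  # Unix-only; reports resource usage for the current process

def peak_rss_kb():
    # ru_maxrss is the peak resident set size: kilobytes on Linux, bytes on macOS
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print('peak RSS before:', peak_rss_kb())
DownloadFile('http://example.com/big.file')  # placeholder URL
print('peak RSS after:', peak_rss_kb())

If the peak RSS grows by roughly the size of the file, the response was buffered in memory rather than streamed.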

#3

With the following streaming code, the Python memory usage is restricted regardless of the size of the downloaded file:

def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter below
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192): 
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)
                    # f.flush()
    return local_filename

Note that the number of bytes returned using iter_content is not exactly the chunk_size; it's expected to be a random number that is often far bigger, and is expected to be different in every iteration.
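
If you want to observe this yourself, a quick sketch (the URL below is just a placeholder) that logs the size of each chunk as it arrives:

import requests

url = 'http://example.com/big.file'  # placeholder URL
with requests.get(url, stream=True) as r:
    for i, chunk in enumerate(r.iter_content(chunk_size=8192)):
        print('chunk %d: %d bytes' % (i, len(chunk)))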

See http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow for further reference.


#4

It's much easier if you use Response.raw and shutil.copyfileobj():

import requests
import shutil

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

    return local_filename

This streams the file to disk without using excessive memory, and the code is simple.
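
One caveat (not part of the original answer): Response.raw exposes the body exactly as it came off the wire, so if the server applied gzip or deflate content encoding, the bytes written to disk stay compressed. A possible workaround, sketched below, is to patch raw.read so urllib3 decodes while streaming:

import functools
import shutil
import requests

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # Ask urllib3 to undo gzip/deflate content-encoding while streaming
        r.raw.read = functools.partial(r.raw.read, decode_content=True)
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename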


#5

Not exactly what the OP was asking, but... it's ridiculously easy to do that with urllib:

from urllib.request import urlretrieve
url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
dst = 'ubuntu-16.04.2-desktop-amd64.iso'
urlretrieve(url, dst)

Or this way, if you want to save it to a temporary file:

from urllib.request import urlopen
from shutil import copyfileobj
from tempfile import NamedTemporaryFile
url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
with urlopen(url) as fsrc, NamedTemporaryFile(delete=False) as fdst:
    copyfileobj(fsrc, fdst)

I watched the process:

watch 'ps -p 18647 -o pid,ppid,pmem,rsz,vsz,comm,args; ls -al *.iso'

And I saw the file growing, but memory usage stayed at 17 MB. Am I missing something?
