In Python, a directory tree is usually traversed with code like this:
import os
from os.path import join, getsize
for root, dirs, files in os.walk('python/Lib/email'):
    print root, "consumes",
    print sum([getsize(join(root, name)) for name in files]),
    print "bytes in", len(files), "non-directory files"
    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories
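root is the path of the directory currently being visited (the top-level directory on the first iteration), dirs is the list of subdirectory names directly under root, and files is the list of non-directory file names directly under root.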
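To make the three values concrete, here is a small sketch that builds a throwaway tree and records what os.walk yields for it. It is written in Python 3 syntax (the snippets in this post are Python 2), and the helper name describe_tree and the file names a.txt and sub/b.txt are made up for the illustration.

```python
import os
import shutil
import tempfile


def describe_tree(top):
    """Collect the (root, dirs, files) tuples yielded by os.walk.

    root is stored relative to top so the result does not depend on
    where the temporary directory happens to live.
    """
    rows = []
    for root, dirs, files in os.walk(top):
        rows.append((os.path.relpath(root, top), sorted(dirs), sorted(files)))
    return rows


# Hypothetical layout: <base>/a.txt and <base>/sub/b.txt
base = tempfile.mkdtemp()
try:
    os.mkdir(os.path.join(base, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(base, rel), "w") as fh:
            fh.write("x")
    rows = describe_tree(base)
finally:
    shutil.rmtree(base)

for row in rows:
    print(row)
```

Each tuple describes exactly one directory: the first one is the top directory itself, and every subdirectory gets its own tuple later in the walk.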
While reading the source today I got a bit lost, because it combines a generator (yield) with recursion:
def walk(top, topdown=True, onerror=None, followlinks=False):
    import pdb       # I added these two lines to enable single-step debugging.
    pdb.set_trace()  # n: next step, p xxx: print xxx, l: list the current source lines
    islink, join, isdir = path.islink, path.join, path.isdir
    try:
        # Note that listdir and error are globals in this module due
        # to earlier import-*.
        names = listdir(top)
    except error, err:
        if onerror is not None:
            onerror(err)
        return
    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)
    if topdown:
        yield top, dirs, nondirs
    for name in dirs:
        new_path = join(top, name)
        if followlinks or not islink(new_path):
            for x in walk(new_path, topdown, onerror, followlinks):
                yield x
    if not topdown:
        yield top, dirs, nondirs
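For readers on Python 3, the same control flow can be sketched more compactly. This is not the standard library's current implementation, only a simplified re-creation of the logic quoted above: simple_walk is a made-up name, it drops onerror and followlinks (symlinked directories are never followed), and it sorts the names so the order is predictable.

```python
import os
import shutil
import tempfile


def simple_walk(top, topdown=True):
    """Simplified sketch of the walk logic (no onerror, no followlinks)."""
    try:
        names = sorted(os.listdir(top))
    except OSError:
        return  # unreadable directory: yield nothing for it

    # Split the entries into subdirectories and everything else.
    dirs, nondirs = [], []
    for name in names:
        if os.path.isdir(os.path.join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)

    if topdown:
        yield top, dirs, nondirs  # parent before its children
    for name in dirs:
        new_path = os.path.join(top, name)
        if not os.path.islink(new_path):
            # "yield from" replaces the Python 2 loop "for x in walk(...): yield x"
            yield from simple_walk(new_path, topdown)
    if not topdown:
        yield top, dirs, nondirs  # parent after its children


# Compare against os.walk on a tiny tree shaped like the one used in this post.
base = tempfile.mkdtemp()
try:
    os.mkdir(os.path.join(base, "b"))
    for rel in ("a.jnt", os.path.join("b", "c.txt")):
        with open(os.path.join(base, rel), "w") as fh:
            fh.write("x")
    mine = [(r, sorted(d), sorted(f)) for r, d, f in simple_walk(base)]
    reference = [(r, sorted(d), sorted(f)) for r, d, f in os.walk(base)]
finally:
    shutil.rmtree(base)

print(mine == reference)
```

The recursion is easier to follow in this form: each level of simple_walk hands its children's tuples straight through to the caller, either after or before yielding its own.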
My main question was about the topdown parameter:
When topdown is true, the caller can modify the dirnames list in-place
(e.g., via del or slice assignment), and walk will only recurse into the
subdirectories whose names remain in dirnames; this can be used to prune the
search, or to impose a specific order of visiting.
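So according to the docs, when topdown=True you can modify the list of subdirectories in place, and walk will then recurse only into the ones still in the list, which cuts down on the searching??
Well, that left me baffled, so I had no choice but to single-step through it.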
    try:
        # Note that listdir and error are globals in this module due
        # to earlier import-*.
        names = listdir(top)
    except error, err:
        if onerror is not None:
            onerror(err)
        return
    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)
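This part lists the entries of the directory top and splits them into subdirectories (dirs) and non-directory files (nondirs). Nothing difficult here.
My directory layout is: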
E:\projects\myApp_emits\myApp
E:\projects\myApp_emits\myApp\a.jnt
E:\projects\myApp_emits\myApp\b
E:\projects\myApp_emits\myApp\b\c.txt
My calling code is:
import os
des_folder = 'e:/projects/myApp_emits/myApp'
a = os.walk(des_folder, topdown=True)
parent, dir, files = a.next()
print parent, dir, files
parent, dir, files = a.next()
print parent, dir, files
For the next part, look at the topdown=True case first. The source simplifies to:
    if topdown:
        yield top, dirs, nondirs
    for name in dirs:
        new_path = join(top, name)
        if followlinks or not islink(new_path):
            for x in walk(new_path, topdown, onerror, followlinks):
                yield x
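A quick explanation:
Because os.walk contains a yield, it is no longer an ordinary function: calling it returns a generator (call it a), and each a.next() call returns the values that follow the next yield it reaches.
E.g. with yield top, dirs, nondirs, each a.next() returns top, dirs, nondirs and then the whole generator is suspended until the next next() call, which resumes execution at the statement after that yield. It keeps running until it hits another yield; if it never does, the generator is finished and next() raises StopIteration. (Odd, how did I end up giving a yield tutorial...)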
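The suspend-and-resume behaviour is easy to observe with a toy generator. The sketch below is unrelated to os.walk itself: countdown and its log list are made up for illustration, and it uses the built-in next(gen), which is the Python 3 spelling of the a.next() call used in this post.

```python
def countdown(n, log):
    """Yield n, n-1, ..., 1 and record in log when the body actually runs."""
    log.append("started")
    while n > 0:
        yield n  # execution is suspended here until the next next() call
        log.append("resumed after yielding %d" % n)
        n -= 1
    log.append("finished")


log = []
gen = countdown(2, log)
ran_before_first_next = list(log)  # still empty: creating the generator runs nothing

first = next(gen)   # runs the body up to the first yield
second = next(gen)  # resumes after the first yield, stops at the second

try:
    next(gen)       # resumes again, falls off the end of the function
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, exhausted)
print(log)
```

Nothing in the body executes until the first next() call, and each later call picks up exactly where the previous yield left off.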
With the calling code above, the output is:
e:/projects/myApp_emits/myApp ['b'] ['a.jnt']
e:/projects/myApp_emits/myApp\b [] ['c.txt']
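With topdown=False, the source simplifies to:
    for name in dirs:
        new_path = join(top, name)
        if followlinks or not islink(new_path):
            for x in walk(new_path, topdown, onerror, followlinks):
                yield x
    if not topdown:
        yield top, dirs, nondirs
Running the same calling code with topdown=False gives: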
e:/projects/myApp_emits/myApp\b [] ['c.txt']
e:/projects/myApp_emits/myApp ['b'] ['a.jnt']
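Comparing the two results, the topdown parameter turns out to be simple: with True the top-level directory is yielded first; with False the walk starts from the subdirectories and yields the top-level directory last.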
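This also explains the pruning remark in the docs. In the top-down branch, yield top, dirs, nondirs runs before the for name in dirs loop, so whatever the caller does to dirs while the generator is suspended is what the loop sees when it resumes. In the bottom-up branch the recursion has already happened by the time the tuple is yielded, so editing dirs changes nothing. Here is a rough sketch in Python 3 syntax; the helper visited and the directory names b, deep, CVS and inner are made up for the demonstration.

```python
import os
import shutil
import tempfile


def visited(top, topdown, skip=()):
    """Return the directories os.walk visits (relative to top), removing
    every name listed in skip from dirs in place along the way."""
    seen = []
    for root, dirs, files in os.walk(top, topdown=topdown):
        # Slice assignment mutates the very list that walk is holding on to.
        dirs[:] = sorted(d for d in dirs if d not in skip)
        seen.append(os.path.relpath(root, top).replace(os.sep, "/"))
    return seen


# Hypothetical layout: <base>/b/deep and <base>/CVS/inner (directories only).
base = tempfile.mkdtemp()
try:
    os.makedirs(os.path.join(base, "b", "deep"))
    os.makedirs(os.path.join(base, "CVS", "inner"))
    top_down = visited(base, topdown=True)
    bottom_up = visited(base, topdown=False)
    pruned_top_down = visited(base, topdown=True, skip=("CVS",))
    pruned_bottom_up = visited(base, topdown=False, skip=("CVS",))
finally:
    shutil.rmtree(base)

print("top-down:          ", top_down)
print("bottom-up:         ", bottom_up)
print("pruned, top-down:  ", pruned_top_down)
print("pruned, bottom-up: ", pruned_bottom_up)
```

With topdown=True the CVS subtree is never entered once its name is removed from dirs; with topdown=False the same edit comes too late and every directory is still visited.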
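P.S.
Go slowly when single-stepping through recursion, or you will get dizzy. This example is a fairly gentle one; try the yield + decorator code in tornado and you will lose your way in life within minutes.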