Environment: Anaconda3, Chrome
The most common workflow of a web crawler:
- visit the site;
- locate the information you need;
- retrieve and process that information.
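The three steps above can be sketched as three small functions. This is only a minimal illustration, not a general-purpose crawler: the names `fetch`, `locate`, and `process` and the sample HTML are made up for this example, and `fetch` assumes the `requests` library is installed.

```python
def fetch(url):
    """Step 1: visit the site and return the page source as a string."""
    import requests  # imported here so the helpers below work without it

    res = requests.get(url, timeout=10)
    res.raise_for_status()  # fail loudly on 4xx/5xx responses
    return res.text


def locate(html, start_tag, end_tag):
    """Step 2: return the text between start_tag and end_tag, or None."""
    start = html.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)
    end = html.find(end_tag, start)
    if end == -1:
        return None
    return html[start:end]


def process(fragment):
    """Step 3: clean up the fragment (here: strip surrounding whitespace)."""
    return fragment.strip()


# Offline demonstration on a hypothetical snippet of HTML.
sample = "<html><title>  Example page </title></html>"
title = process(locate(sample, "<title>", "</title>"))
print(title)
```

With a live page, `fetch(url)` would supply the string that is passed to `locate`.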
Example 1: Scraping "The Zen of Python" from the official Python website
import requests
url = 'https://www.python.org/dev/peps/pep-0020/'
res = requests.get(url)
text = res.text
text
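As you can see, what comes back is essentially the content shown under Elements in the developer tools, only as a string. Next, we use Python's built-in string method find to locate the index of "The Zen of Python" and then slice it out of the string.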
# Slice out the text between the opening <pre ...> tag and </pre> (+28 skips the opening tag)
zen = text[text.find('<pre') + 28:text.find('</pre>') - 1]
with open('zon_of_python.txt', 'w') as f:
    f.write(zen)
print(zen)
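The fixed offset `+28` only works while the opening `<pre ...>` tag has exactly that length. A slightly more robust variant, sketched below, uses the optional start argument of `str.find` to look for the `>` that closes the opening tag. The helper name `between_pre` and the sample string are hypothetical; the real page markup may differ.

```python
def between_pre(html):
    """Return the contents of the first <pre ...>...</pre> element, or None."""
    open_tag = html.find('<pre')
    if open_tag == -1:  # str.find returns -1 when nothing matches
        return None
    # Search for the end of the opening tag, starting from where it begins.
    body_start = html.find('>', open_tag)
    if body_start == -1:
        return None
    body_start += 1
    body_end = html.find('</pre>', body_start)
    if body_end == -1:
        return None
    return html[body_start:body_end].strip()


# Hypothetical markup in the same shape as the page used above.
sample = '<div><pre class="literal-block">\nBeautiful is better than ugly.\n</pre></div>'
print(between_pre(sample))
```

Checking for `-1` matters because slicing with a failed `find` result would silently return the wrong text instead of raising an error.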
Example 2: Scraping Douban movies
import requests
import os
if not os.path.exists('image'):
    os.mkdir('image')
def parse_html(url):
    # Send a browser-like User-Agent header with the request
    headers={"User-Agent":"Mozilla/5.0(Windows NT 10.0;Win64; x64)AppleWebKit/537.36(KHTML,like Gecko)Chrome/74.0.3729.169 Safari/537.36"}
    res = requests.get(url, headers=headers)
    text = res.text
    item = []
    # Each page lists 25 movies; jump to the next 'alt' attribute each time
    for i in range(25):
        text = text[text.find('alt') + 3:]
        item.append(extract(text))
    return item
def extract(text):
    text = text.split('"')
    name = text[1]
    image = text[3]
    return name, image
def write_movies_file(item, stars):
    print(item)
    # Append the rank and title to a text file
    with open('douban_film.txt', 'a', encoding='utf-8') as f:
        f.write('Rank: %d\tTitle: %s\n' % (stars, item[0]))
    # Download the poster image and save it under image/
    r = requests.get(item[1])
    with open('image/' + str(item[0]) + '.jpg', 'wb') as f:
        f.write(r.content)
def main():
    stars = 1
    for offset in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start=' + str(offset) + '&filter='
        for item in parse_html(url):
            write_movies_file(item, stars)
            stars += 1
if __name__ == '__main__':
    main()
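To see why extract reads text[1] and text[3], it helps to run the same slicing on a single tag. The snippet below is an offline sketch: the `<img>` tag is a made-up sample in the shape the script expects (an `alt` attribute followed directly by `src`), not markup copied from Douban.

```python
def extract(text):
    """Split on double quotes; the 2nd and 4th pieces are the alt and src values."""
    parts = text.split('"')
    return parts[1], parts[3]


# Hypothetical tag: alt holds the movie name, src holds the poster URL.
sample = '<img width="100" alt="Movie A" src="https://example.com/a.jpg" class="">'

# Same step as in parse_html: drop everything up to and including 'alt'.
rest = sample[sample.find('alt') + 3:]
name, image = extract(rest)
print(name, image)
```

This works only while `alt` comes immediately before `src` and the string 'alt' does not appear earlier in the remaining text, so the script may need adjusting if the page layout changes.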