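EasyDL Custom Image Recognition: Crawling and Data Cleaning

The Baidu Brain Industry Application Innovation Challenge is now open, with a 10,000-yuan grand prize up for grabs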

https://juejin.im/post/5bbd97c2e51d45021147dc98

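Note on "splitting the loot":

If we place and win prize money, the initiator will only distribute it among the people who did the crawling and data cleaning, and will not take a share.

Crawling and data-cleaning steps:
1. Crawl bare-faced (no-makeup) face photos and bare-faced headshots;
2. Multi-stage detection:
㈠ Call the Baidu face recognition API (detect) and keep only the images in which a face is detected;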

https://aip.baidubce.com/rest/2.0/face/v3/detect

㈡ Call the Face++ skin-problem detection API (Face Analyze API) and save the images by category;

https://api-cn.faceplusplus.com/facepp/v3/face/analyze

㈢ Call the skin-problem classification API just trained with Baidu EasyDL as a final machine filtering pass;

https://aip.baidubce.com/rpc/2.0/ai_custom/v1/classification/anchuang

Deduplicate?

3. Manual screening: remove the wrong images.
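
The "Deduplicate?" question above is left open in these notes. One simple option, sketched here as an assumption rather than the method actually used, is to drop byte-identical files by comparing SHA-256 digests. The function names and the folder layout are hypothetical, and this approach will not catch resized or re-encoded copies of the same photo.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def remove_exact_duplicates(folder: str) -> list:
    """Delete byte-identical files in `folder`, keeping the first of each.

    Files are visited in sorted name order, so the copy that is kept is
    deterministic. Returns the names of the files that were removed.
    """
    seen = set()
    removed = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        digest = file_digest(path)
        if digest in seen:
            path.unlink()  # an identical file was already kept
            removed.append(path.name)
        else:
            seen.add(digest)
    return removed
```

Running it once on the download folder before the manual pass would shrink the set that has to be reviewed by hand.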

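Data sources: crawlers

1. Baidu, keywords

臉 (face), 人臉 (human face), 素顏 (bare face), 素顏大頭照 (bare-faced headshot)

暗瘡, 痘痘, 青春痘, 痤瘡, 痘, 粉刺 (all terms for acne or pimples);

Combine any one term from each group;

黑眼圈 (dark circles)
+
臉, 人臉, 素顏, 素顏大頭照, 貼吧 (Tieba)

色斑 (pigmentation spots) + 貼吧

2. Baidu Tieba

青春痘吧 (the acne forum)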

http://tieba.baidu.com/f?ie=utf-8&kw=%E9%9D%92%E6%98%A5%E7%97%98&fr=search&red_tag=v3468036147

3. Sogou

https://pic.sogou.com/pics?ie=utf8&p=40230504&interV=kKIOkrELjboMmLkElbkTkKIJl7ELjboImLkEk74TkKIMkrELjbkRmLkEmrELjbgRmLkEkLY=_1258035508&query=%E6%9A%97%E7%96%AE&
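
The keyword combinations listed under source 1 can be generated rather than typed by hand. The sketch below pairs one acne term with one face term and percent-encodes the result for use in a search URL. Joining the two terms with a space is an assumption, and `build_queries` and `encode_query` are hypothetical helper names.

```python
from itertools import product
from urllib.parse import quote

# Keyword groups copied from the list above.
FACE_TERMS = ["臉", "人臉", "素顏", "素顏大頭照"]
ACNE_TERMS = ["暗瘡", "痘痘", "青春痘", "痤瘡", "痘", "粉刺"]


def build_queries(left, right, sep=" "):
    """Combine every term in `left` with every term in `right`."""
    return [a + sep + b for a, b in product(left, right)]


def encode_query(query):
    """Percent-encode a UTF-8 query string for use in a URL."""
    return quote(query)


queries = build_queries(ACNE_TERMS, FACE_TERMS)
```

With six acne terms and four face terms this yields 24 queries. `encode_query("青春痘")` produces the same `%E9%9D%92%E6%98%A5%E7%97%98` that appears in the Tieba URL above.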

Baidu Images reference:

#!/usr/bin/python
# -*- coding:utf-8 -*-
import http.client
import json
import os
import re
import urllib.request


class BaiduImage(object):
    def __init__(self):
        super(BaiduImage, self).__init__()
        print('Fetching images, press CTRL+C to quit...')
        self.page = 60                    # current result offset (pn)
        if not os.path.exists(r'./image'):
            os.mkdir(r'./image')

    def request(self):
        # Create the connection once so that finally can always close it
        conn = http.client.HTTPSConnection('image.baidu.com')
        try:
            while 1:
                request_url = '/search/avatarjson?tn=resultjsonavatarnew&ie=utf-8&word=%E7%BE%8E%E5%A5%B3&cg=girl&rn=60&pn=' + str(self.page)
                headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0', 'Content-type': 'sinanews/html'}
                conn.request('GET', request_url, headers=headers)
                r = conn.getresponse()
                # Always drain the response so the connection can be reused
                data = r.read()
                if r.status == 200:
                    # Python 3: decode the bytes instead of calling unicode()
                    decode = json.loads(data.decode('utf-8', errors='ignore'))
                    self.download(decode['imgs'])

                self.page += 60
        except Exception as e:
            print(e)
        finally:
            conn.close()

    def download(self, data):
        for d in data:
            # url = d['thumbURL']   thumbnail, size 200
            # url = d['hoverURL']   size 360
            url = d.get('objURL')
            if not url:
                continue  # skip entries without an image URL

            # Name the file after the last path component of the .jpg URL
            item = re.findall(r'.*/(.*?)\.jpg', url, re.S)
            if not item:
                continue  # skip URLs that do not contain a .jpg file name

            # urllib3 has no urlopen(); use the standard library instead
            content = urllib.request.urlopen(url).read()
            file_name = 'image/' + item[0] + '.jpg'

            with open(file_name, 'wb') as f:
                f.write(content)


if __name__ == '__main__':
    bi = BaiduImage()
    bi.request()

Baidu Tieba:

Test case notebook (測試案例.ipynb)

https://colab.research.google.com/drive/1XXbWCGBNdJdH2F4mdjSiivcuVCxAI-2N#scrollTo=C10gA4EeMwQI&uniqifier=13

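# -*- coding:utf-8 -*-

import os
import re
from urllib import request


# Fetch the raw HTML of a page
def getHtml(url):
    page = request.urlopen(url)
    html = page.read()
    return html

# Extract the image URLs from the HTML
def getImg(html):
    # Regex match
    reg = r'src="([.*\S]*\.jpg)" size="\d+" changedsize="true"'
    imgre = re.compile(reg)
    img_list = re.findall(imgre, html)

    # Return the list of image URLs
    return img_list


if __name__ == '__main__':
    # Thread URL
    url = 'http://tieba.baidu.com/p/5944770997?pn='

    # List that collects the image URLs of every page
    imgListSum = []

    # Walk through pages 1 to 11 and collect the image URLs of each page
    for i in range(1, 12):
        # Build the paginated URL and fetch the page source
        html = getHtml(url + str(i)).decode('utf-8')

        # Extract the image URLs from the page
        imgList = getImg(html)

        # Append this page's URLs to the overall list
        imgListSum.append(imgList)

    # Make sure the output folder exists before writing into it
    os.makedirs('pic', exist_ok=True)

    # Download the images, naming them with an incrementing counter
    imgName = 0

    for i in imgListSum:
        for j in i:
            # Sanity check: print the image URL
            print(j)

            # Build the save path and file name, then download
            with open('pic/' + str(imgName) + '.jpg', 'wb') as f:
                f.write(request.urlopen(j).read())

            # Next file name
            imgName += 1

    # Done marker
    print('Finish')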

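Crawling by name (bare-faced celebrity photos): intended for look-alike similarity and attractiveness comparison, for example how many celebrities a face outscores.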

#!/usr/bin/env python
# encoding: utf-8
import os
import re
import urllib.parse
import urllib.request


def img_spider(name_file):
    user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
    headers = {'User-Agent': user_agent}
    # Read the name list txt and build a list of everyone in it
    with open(name_file) as f:
        name_list = [name.rstrip() for name in f.readlines()]
    # For each person, crawl 30 images and save them in a folder named after that person
    for name in name_list:
        # Create the folder if it does not exist (names that already have a folder are skipped)
        if not os.path.exists('D:/celebrity/img_data/' + name):
            os.makedirs('D:/celebrity/img_data/' + name)
            try:
                # Percent-encode the name: non-ASCII characters, and the spaces in some
                # foreign names (encoded as %20), would otherwise break the request.
                url = "http://image.baidu.com/search/avatarjson?tn=resultjsonavatarnew&ie=utf-8&word=" + urllib.parse.quote(name) + "&cg=girl&rn=60&pn=60"
                req = urllib.request.Request(url, headers=headers)
                res = urllib.request.urlopen(req)
                # Decode the bytes so the str regex below can be applied
                page = res.read().decode('utf-8', errors='ignore')
                # Because the response is JSON, what F12 shows in the browser differs from
                # what gets printed here, so match on objURL. Compare it with the other
                # ...URL fields in the page and use whichever one is reachable.
                img_srcs = re.findall('"objURL":"(.*?)"', page, re.S)
                print(name, len(img_srcs))
            except Exception:
                # If the request fails, move on to the next name instead of terminating
                print(name, " error:")
                continue
            j = 1
            src_txt = ''

            # Fetch each image URL found above and save it locally
            for src in img_srcs:
                with open('D:/celebrity/img_data/' + name + '/' + str(j) + '.jpg', 'wb') as p:
                    try:
                        print("downloading No.%d" % j)
                        req = urllib.request.Request(src, headers=headers)
                        # Set a urlopen timeout: if there is no response within 3 seconds,
                        # move on to the next URL so the program does not hang.
                        img = urllib.request.urlopen(req, timeout=3)
                        p.write(img.read())
                    except Exception:
                        print("No.%d error:" % j)
                        continue
                src_txt = src_txt + src + '\n'
                if j == 30:
                    break
                j = j + 1
            # Save the src URLs of the 30 images as txt, one per line
            with open('D:/celebrity/img_data/' + name + '/' + name + '.txt', 'w', encoding='utf-8') as p2:
                p2.write(src_txt)
                print("save %s txt done" % name)

# Main program: read the txt file and start crawling
if __name__ == '__main__':
    name_file = "name_lists1.txt"
    img_spider(name_file)

Data cleaning:

Baidu:

Service name:  Skin problem classification (皮膚問題分類)
Model version:  V2
Endpoint:  https://aip.baidubce.com/rpc/2.0/ai_custom/v1/classification/anchuang
Service status:  Published

㈢ Image classification API

http://ai.baidu.com/docs#/EasyDL_VIS_API/6d673ae4

Calling our API service from Python:

https://blog.csdn.net/weixin_36512652/article/details/80706971

Reference:

#!/usr/bin/python3.6
import requests
import base64

'''
client_id is the AK (API Key) obtained from the official site; client_secret is the SK (Secret Key)
https://aip.baidubce.com/rpc/2.0/ai_custom/v1/classification/anchuang

App name: 人臉識別暗瘡檢測 (face recognition acne detection)
AppID: 14777381
API Key: oEWnhIQ3EquDNrmBAGxwDEXU
Secret Key: PIZnpNWGKQkbtGAaOBBIYGi3y6G2KFx7
'''

# Exchange the AK/SK for an access_token
host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=oEWnhIQ3EquDNrmBAGxwDEXU' \
       '&client_secret=PIZnpNWGKQkbtGAaOBBIYGi3y6G2KFx7'
response = requests.get(host)
content = response.json()
access_token = content["access_token"]

#image = open(r'C:\\Users\\pain\\Desktop\\plastic.jpg', 'rb').read()
#D:\1.jpg
# Read the image and base64-encode it for the request body
with open(r'timg (20).jpg', 'rb') as f:
    image = f.read()
data = {'image': base64.b64encode(image).decode()}

request_url = "https://aip.baidubce.com/rpc/2.0/ai_custom/v1/classification/anchuang" + "?access_token=" + access_token
# json=data serializes the body and sets Content-Type: application/json
response = requests.post(request_url, json=data)
content = response.json()

print(content)

For now, look only at the anchuang label and keep the images whose score is above 0.8;
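
The 0.8 rule above can be applied to the JSON returned by the classification call. The following is a minimal sketch that assumes the response has the `results` list of `name`/`score` pairs shown in the sample output later in this post. `anchuang_score` and `should_keep` are hypothetical helper names.

```python
def anchuang_score(response: dict) -> float:
    """Return the score of the 'anchuang' label, or 0.0 if it is absent."""
    for item in response.get("results", []):
        if item.get("name") == "anchuang":
            return float(item.get("score", 0.0))
    return 0.0


def should_keep(response: dict, threshold: float = 0.8) -> bool:
    """Keep an image only when its anchuang score is above the threshold."""
    return anchuang_score(response) > threshold
```

An error response without a `results` key scores 0.0 and is therefore dropped rather than raising an exception.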

㈠ Baidu face detection and attribute analysis:

http://ai.baidu.com/docs#/Face-Detect-V3/top

#!/usr/bin/python3.6
# encoding:utf-8
import requests
import base64

'''
client_id is the AK (API Key) obtained from the official site; client_secret is the SK (Secret Key)
https://aip.baidubce.com/rpc/2.0/ai_custom/v1/classification/anchuang

App name: 人臉識別暗瘡檢測 (face recognition acne detection)
AppID: 14777381
API Key: oEWnhIQ3EquDNrmBAGxwDEXU
Secret Key: PIZnpNWGKQkbtGAaOBBIYGi3y6G2KFx7
'''
# Exchange the AK/SK for an access_token
host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=oEWnhIQ3EquDNrmBAGxwDEXU' \
       '&client_secret=PIZnpNWGKQkbtGAaOBBIYGi3y6G2KFx7'
response = requests.get(host)
content = response.json()
access_token = content["access_token"]


'''Skin-problem classification API'''

#image = open(r'C:\\Users\\pain\\Desktop\\plastic.jpg', 'rb').read()
#D:\1.jpg
with open(r'1.jpg', 'rb') as f:
    image = f.read()
data = {'image': base64.b64encode(image).decode()}

request_url = "https://aip.baidubce.com/rpc/2.0/ai_custom/v1/classification/anchuang" + "?access_token=" + access_token
# json=data serializes the body and sets Content-Type: application/json
response = requests.post(request_url, json=data)
content = response.json()
print(content)


'''
Face detection and attribute analysis
'''
request_url = "https://aip.baidubce.com/rest/2.0/face/v3/detect"
"""
image_type	required	string	Image type
BASE64: the base64 value of the image; the base64-encoded image data must not exceed 2M;
URL: the URL of the image;
FACE_TOKEN: the unique identifier of a face image. The face detection API assigns a unique FACE_TOKEN to every face image, and detecting the same image several times returns the same FACE_TOKEN.
"""
image1 = base64.b64encode(image).decode()
params = {"image": image1, "image_type": 'BASE64', "face_field": 'facetype'}

#access_token = '[token obtained from the auth endpoint]'
request_url = request_url + "?access_token=" + access_token

response = requests.post(request_url, json=params)
content2 = response.json()

if content2:
    print(content2)

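Detecting a face:

+face_probability	Yes	double	Face confidence in the range [0, 1]: the probability that this is a face, where 0 is the lowest and 1 the highest.

Detecting whether the face is a real person or a cartoon:

+face_type	No	array	Real face / cartoon face; returned when face_field includes face_type
++type	No	string	human: real face; cartoon: cartoon face
++probability	No	double	Confidence that the face-type judgment is correct, in the range [0, 1], where 0 is the lowest probability and 1 the highest.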

{'log_id': 2099415301428326923, 'results': [{'name': '[default]', 'score': 0.995880126953125}, {'name': 'anchuang', 'score': 0.004119818564504385}]}
{'error_code': 0, 'error_msg': 'SUCCESS', 'log_id': 747956921634075431, 'timestamp': 1542163407, 'cached': 0, 'result': {'face_num': 1, 'face_list': [{'face_token': '41fa0d8f809e783149256daa3f671c9a', 'location': {'left': 59.25, 'top': 78.97, 'width': 94, 'height': 96, 'rotation': -7}, 'face_probability': 0.67, 'angle': {'yaw': 7.13, 'pitch': 0.44, 'roll': -11.28}, 'face_shape': {'type': 'oval', 'probability': 0.51}, 'face_type': {'type': 'cartoon', 'probability': 1}}]}}

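First check whether 'face_type' indicates a real person: filter out 'type': 'cartoon' and keep only 'type': 'human';

Then use 'face_probability' to decide whether the image really contains a face;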
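
The two checks above can be combined into one predicate over the detect response. This is a sketch that assumes the response shape shown in the sample output above. The 0.8 defaults mirror the "face0.8 human0.8" run reported in this post, and `is_real_face` is a hypothetical helper name.

```python
def is_real_face(response: dict,
                 min_face_prob: float = 0.8,
                 min_human_prob: float = 0.8) -> bool:
    """Return True if any detected face is a confident, real human face."""
    if response.get("error_code") != 0:
        return False  # the call failed, so nothing can be trusted
    result = response.get("result") or {}
    for face in result.get("face_list") or []:
        face_type = face.get("face_type") or {}
        # Step 1: drop cartoons, keep only 'human' with enough confidence
        if face_type.get("type") != "human":
            continue
        if face_type.get("probability", 0) < min_human_prob:
            continue
        # Step 2: require a high enough face probability
        if face.get("face_probability", 0) >= min_face_prob:
            return True
    return False
```

Applied to the sample response above, the function returns False, because the only detected face is typed 'cartoon'.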

Face++:

Detecting skin problems with Face++

https://blog.csdn.net/jacka654321/article/details/82709346

App name: 美噠
API Key: o9Gya1IK095laM5GXxykVWctQyKrf06M
API Secret: KtCHz_QbjtWlh6NYwhe1PC7Nw8ql_6Wz
Type: trial
Status: enabled

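㈠ Detect API (can detect faces and skin status, skinstatus, in the same call)

Request URL

https://api-cn.faceplusplus.com/facepp/v3/detect

Description

Pass in an image to run face detection and face analysis.

It detects all faces in the image and returns a unique identifier, face_token, for each detected face, which can be used in later operations such as face analysis and face comparison. With a production API Key, detection can be restricted to a specified region of the image.

This API can also analyze the detected faces directly and return their landmarks and various attributes. With a trial API Key, only the 5 faces with the largest bounding-box area are analyzed; the other detected faces can be analyzed with the Face Analyze API. With a production API Key, all detected faces can be analyzed.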

https://console.faceplusplus.com.cn/documents/4888373

㈡ Face Analyze API

Pass in the face_token identifiers returned by the Detect API to get the landmarks and attributes of those faces. A single call can analyze at most 5 faces.

Request URL

https://api-cn.faceplusplus.com/facepp/v3/face/analyze

return_attributes=skinstatus
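
For the Detect API described above, a request that asks for skinstatus could look like the sketch below. It assumes the endpoint accepts a form-encoded POST with `api_key`, `api_secret`, `image_base64`, and `return_attributes`. The helper names are hypothetical, and the network call has not been verified against the live service from these notes.

```python
import base64
import json
import urllib.parse
import urllib.request

DETECT_URL = "https://api-cn.faceplusplus.com/facepp/v3/detect"


def build_detect_form(api_key: str, api_secret: str, image_bytes: bytes) -> dict:
    """Build the form fields for a Detect call that returns skinstatus."""
    return {
        "api_key": api_key,
        "api_secret": api_secret,
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        "return_attributes": "skinstatus",
    }


def detect_skinstatus(api_key: str, api_secret: str, image_bytes: bytes,
                      timeout: float = 10.0) -> dict:
    """POST the image to the Detect endpoint and return the parsed JSON."""
    form = build_detect_form(api_key, api_secret, image_bytes)
    body = urllib.parse.urlencode(form).encode("ascii")
    # Passing data= makes urllib send a POST request
    req = urllib.request.Request(DETECT_URL, data=body)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Sending the image to Detect directly avoids a second round trip to Face Analyze when at most five faces per image need analyzing.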

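Filtering and saving:

Directly crawled images must be filtered before manual processing, otherwise the crawler's advantage is lost;

The skin-problem classification API only allows 500 calls and its real-world accuracy is not high, so step ㈢ is skipped for now. Step ㈠ is kept, and for step ㈡ the Face++ Detect API attribute
skinstatus
acne: acne (pimples)
is used with a threshold of 80: only images scoring above 80 are downloaded, and downloading stops at 500 images, which keeps the workload manageable for the manual pass and the final API pass;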
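
The threshold-and-stop rule above can be sketched as follows. It assumes the Face++ Detect response carries a `faces` list whose items hold `attributes.skinstatus.acne`, and that candidates arrive as hypothetical `(url, response)` pairs. The function names are illustrative only.

```python
def max_acne_score(detect_response: dict) -> float:
    """Return the highest acne score among the analyzed faces, or 0.0."""
    scores = []
    for face in detect_response.get("faces") or []:
        # attributes can be missing for faces that were detected but not analyzed
        skin = (face.get("attributes") or {}).get("skinstatus") or {}
        if "acne" in skin:
            scores.append(float(skin["acne"]))
    return max(scores) if scores else 0.0


def select_for_download(candidates, threshold: float = 80.0, limit: int = 500) -> list:
    """Pick URLs whose acne score exceeds `threshold`, stopping at `limit`."""
    selected = []
    for url, response in candidates:
        if len(selected) >= limit:
            break  # enough images for the manual pass and the final API pass
        if max_acne_score(response) > threshold:
            selected.append(url)
    return selected
```

Capping the selection at 500 matches the call quota of the classification API, so every downloaded image can still go through the final pass.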

Face probability, human probability, acne score: manually selected images / crawled images
face0.9 human0.9 acne80: 24/72

face0.8 human0.8 acne80: 78/262
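
To compare the two threshold settings, the counts above can be turned into keep rates (manually selected images divided by crawled images). This small sketch only reuses the numbers reported here; `keep_rate` is a hypothetical helper name.

```python
# (manually selected, crawled) counts copied from the two runs above
runs = {
    "face0.9 human0.9 acne80": (24, 72),
    "face0.8 human0.8 acne80": (78, 262),
}


def keep_rate(kept: int, crawled: int) -> float:
    """Fraction of crawled images that survived the manual pass."""
    return kept / crawled


for label, (kept, crawled) in runs.items():
    print(f"{label}: {kept}/{crawled} = {keep_rate(kept, crawled):.1%}")
```

The rates come out at about 33% and 30%, so relaxing the face and human thresholds from 0.9 to 0.8 more than tripled the number of usable images while lowering the keep rate only slightly.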

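Manual cleaning:

1. Select face images in which acne can be clearly identified;

2. If the face takes up too little of the image or the background is cluttered, crop the head region and save it;

3. Save the images grouped by category name, compress them into a zip archive, and upload it to the platform for training;

4. After training finishes, analyze the causes of misclassification: review the images that were recognized incorrectly, and remove both those that are hard to tell apart even by eye and those misclassified because of a cluttered background;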
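
Step 3 above can be scripted. The sketch below assumes the cleaned images sit in one sub-folder per category under a root folder, which is a hypothetical layout, and writes them into a zip archive that keeps the folder names. The exact structure the platform expects should be checked against its documentation.

```python
import zipfile
from pathlib import Path


def zip_dataset(root: str, archive: str) -> int:
    """Zip every file under `root`, preserving the category sub-folders.

    `archive` should live outside `root` so the archive does not try to
    include itself. Returns the number of files written.
    """
    root_path = Path(root)
    count = 0
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root_path.rglob("*")):
            if path.is_file():
                # Store paths relative to root, e.g. 'anchuang/1.jpg'
                zf.write(path, path.relative_to(root_path).as_posix())
                count += 1
    return count
```

A call such as `zip_dataset("cleaned", "dataset.zip")` would produce one archive ready for upload.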

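The model's reported accuracy does not always match human judgment;

The model reporting 100% accuracy differed most from human judgment, while models at around 90% were closer to it;

Submitting data:

Use Python, preferably in Jupyter;

Pack the crawled image data into a zip archive, upload it to Baidu Cloud (百度雲), and share the link;

Upload the code to GitHub;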

https://github.com/jacka654321/EasyDL_face_analyzer

Contact the initiator:

JackA

Phone | WeChat: 13244829625

Designing and implementing a lightweight crawler framework

https://mp.weixin.qq.com/s?__biz=MzI5ODI5NDkxMw==&mid=2247488788&idx=1&sn=49f88dfd7bd85748e845a03a1aaa7eda&chksm=eca95efadbded7ec549ae39e1ddbe74798db580cb5a327e1c76f2a109e3c58f24e20b6286d79&mpshare=1&scene=1&srcid=#rd
