Fixing Unicode Character Output in Scrapy Exports

Original

2018-08-30 17:03

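When the command scrapy crawl spider_name -o filename writes the specified item fields to a file, Scrapy picks a default feed exporter based on the file type: JsonItemExporter for json, JsonLinesItemExporter for jsonlines, and XmlItemExporter for xml. If the form or content of the export is not what you want, you can subclass one of these classes and define your own exporter.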

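By default, Chinese text in the JSON output is written as hard-to-read Unicode escape sequences (\uXXXX). To get the original characters back, define a subclass in the settings file: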

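from scrapy.exporters import JsonLinesItemExporter


class CustomJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        # ensure_ascii=False is forwarded to the underlying JSON encoder,
        # so non-ASCII characters are written as-is instead of \uXXXX.
        super(CustomJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)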

# All that is needed here is to pass ensure_ascii=False to the superclass.
# The new exporter class must also be enabled in the settings file:

FEED_EXPORTERS = {
    'json': 'project.settings.CustomJsonLinesItemExporter',
}

After that, running scrapy crawl spider_name -o filename.json writes normal, readable characters to the file.
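The effect of ensure_ascii can be checked without Scrapy at all, since it is an option of Python's standard json module. The following is a minimal sketch, not the exporter's actual code: the sample item and the helper write_json_line are made up for illustration. It also shows the one-object-per-line layout that a JSON Lines exporter produces, which is what filename.json will contain once the json format is mapped to a JsonLinesItemExporter subclass.

```python
import io
import json

# Hypothetical sample item; the field names are invented for this sketch.
item = {"title": "霸王别姬", "score": 9.6}


def write_json_line(file, obj, ensure_ascii):
    """Write obj to a binary file as one JSON Lines record (illustrative helper)."""
    line = json.dumps(obj, ensure_ascii=ensure_ascii) + "\n"
    file.write(line.encode("utf-8"))


# In-memory stand-ins for files opened in 'wb' mode.
escaped_buf = io.BytesIO()
readable_buf = io.BytesIO()

write_json_line(escaped_buf, item, ensure_ascii=True)    # \uXXXX escapes
write_json_line(readable_buf, item, ensure_ascii=False)  # original characters

escaped = escaped_buf.getvalue().decode("utf-8")
readable = readable_buf.getvalue().decode("utf-8")

# Both lines decode to the same data; only the text stored on disk differs.
assert json.loads(escaped) == json.loads(readable) == item
```

Scrapy also has a FEED_EXPORT_ENCODING setting. Setting it to 'utf-8' in settings.py makes the built-in JSON exporters write UTF-8 instead of \uXXXX escapes, which may be enough when no other customization of the exporter is needed.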

The Scrapy documentation describes the Item Pipeline as follows:

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially

So we can also write the output directly from a pipeline class:

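from project.settings import CustomJsonLinesItemExporter


class DoubanPipeline(object):
    # Called when the spider is opened: open the output file and start exporting.
    def open_spider(self, spider):
        # Binary mode, because the exporter writes encoded bytes.
        self.file = open('xxx.json', 'wb')
        self.exporter = CustomJsonLinesItemExporter(self.file)
        self.exporter.start_exporting()

    # Called when the spider is closed: finish exporting and close the file.
    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    # Called for every scraped item: export it, then pass it on to later pipelines.
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item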
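
A pipeline only runs once it is enabled in the ITEM_PIPELINES setting. The fragment below is a sketch of that step. The module path project.pipelines is an assumption (the article names only the class and project.settings), and 300 is an arbitrary order value; Scrapy runs pipelines from the lowest number to the highest.

```python
# settings.py (sketch): enable the pipeline so Scrapy calls it for every item.
# 'project.pipelines' is an assumed module path; adjust it to wherever
# DoubanPipeline is actually defined.
ITEM_PIPELINES = {
    'project.pipelines.DoubanPipeline': 300,
}
```

With the pipeline enabled, the -o option is no longer needed: running scrapy crawl spider_name writes the items to xxx.json through the exporter opened in open_spider.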
