Processing CSV files with pandas

Original

tan_qian

2018-08-26 15:10

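I recently had to convert a file of IP address ranges. After finishing the conversion in plain Python, I tried the data analysis library pandas, and it did turn out to be faster.

These are my notes on using pandas to process CSV files.

A sample of the source file (170 MB):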

000.000.000.000,000.255.255.255,保留地址,保留地址,*,*,*,*
001.000.000.000,001.000.000.255,APNIC,APNIC,*,apnic.net,*,*
001.000.001.000,001.000.003.255,中國,福建,*,*,*,*
001.000.004.000,001.000.007.255,澳大利亞,維多利亞州,墨爾本,*,*,*
001.000.008.000,001.000.015.255,中國,廣東,*,*,*,*

A sample of the target file:

0000000000,0016777215,保留地址,保留地址,
0016777216,0016777471,APNIC,APNIC,
0016777472,0016778239,中國,福建,
0016778240,0016779263,澳大利亞,維多利亞州,墨爾本
0016779264,0016781311,中國,廣東,
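
To make the mapping between the two samples concrete, here is a minimal sketch of the conversion for a single address. The function name `ip_to_padded_int` and its `width` parameter are my own, chosen for illustration; the scripts below use `ip2decStr` for the same job.

```python
def ip_to_padded_int(addr: str, width: int = 10) -> str:
    """Convert a dotted-decimal IPv4 string to a zero-padded decimal string."""
    value = 0
    for octet in addr.split("."):
        # int() ignores the leading zeros used in the source file ("001" -> 1)
        value = (value << 8) | int(octet)
    return str(value).zfill(width)


print(ip_to_padded_int("001.000.001.000"))  # 0016777472
print(ip_to_padded_int("001.000.003.255"))  # 0016778239
```

The two printed values are the start and end of the third row in the target sample.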

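The plain Python version:

import datetime

# Convert a standard IPv4 address to an integer
def ip2dec(addr):
    # Convert a dotted-decimal IP address to a decimal integer
    items = [int(x) for x in addr.split(".")]
    return sum([items[i] << [24, 16, 8, 0][i] for i in range(4)])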

def ip2decStr(addr):
    return str(ip2dec(addr)).zfill(10)

def saveFile(outfile, data):
    # Write the converted lines; GBK matches the encoding of the source file
    with open(outfile, 'w', encoding='gbk') as fl:
        fl.writelines(data)

def dealFile(infile, outfile):
    data = []

    # The source file is GBK-encoded, so state the encoding explicitly
    with open(infile, encoding='gbk') as lines:
        for line in lines:
            sep = ','
            # Only the first five fields are used; the rest stay in word[5]
            word = line.split(sep, 5)
            ipStart = ip2decStr(word[0])
            ipStop = ip2decStr(word[1])
            country = word[2]
            province = word[3]
            # '*' marks a missing city
            city = '' if word[4] == '*' else word[4]
            temp = ipStart+sep+ipStop+sep+country+sep+province+sep+city+'\n'
            data.append(temp)
    saveFile(outfile, data)


if __name__ == '__main__':
    startTime = datetime.datetime.now()
    dealFile('D:\\ipfile.txt', 'D:\\ipResult01.txt')
    stopTime = datetime.datetime.now()
    # total_seconds() gives the whole elapsed time; .microseconds is only the sub-second part
    print((stopTime-startTime).total_seconds())
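
The loop above collects every output line in a list before writing. As a possible alternative, here is a minimal sketch that converts one row at a time using the standard `csv` module. The names `convert_stream` and `ip_to_padded_int`, and the rule of skipping rows with fewer than five fields, are my own assumptions for illustration and are not part of the script above. The sample rows are the ones shown earlier.

```python
import csv
import io


def ip_to_padded_int(addr, width=10):
    """Dotted-decimal IPv4 string -> zero-padded decimal string."""
    value = 0
    for octet in addr.split("."):
        value = (value << 8) | int(octet)
    return str(value).zfill(width)


def convert_stream(src, dst):
    """Read rows from src and write converted rows to dst, one row at a time."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        if len(row) < 5:
            continue  # assumption: skip rows too short to convert
        city = "" if row[4] == "*" else row[4]
        writer.writerow([ip_to_padded_int(row[0]), ip_to_padded_int(row[1]),
                         row[2], row[3], city])


# In-memory stand-ins for the real input and output files
sample = io.StringIO(
    "001.000.001.000,001.000.003.255,中國,福建,*,*,*,*\n"
    "001.000.004.000,001.000.007.255,澳大利亞,維多利亞州,墨爾本,*,*,*\n"
)
out = io.StringIO()
convert_stream(sample, out)
print(out.getvalue(), end="")
```

With real files, `src` and `dst` would be opened with `encoding='gbk'` and `newline=''`, as the `csv` module documentation recommends.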

The pandas version:

# coding:utf-8
import datetime
import pandas as pd

# Convert a standard IPv4 address to an integer
def ip2dec(addr):
    # Convert a dotted-decimal IP address to a decimal integer
    items = [int(x) for x in addr.split(".")]
    return sum([items[i] << [24, 16, 8, 0][i] for i in range(4)])

def ip2decStr(addr):
    return str(ip2dec(addr)).zfill(10)

def converts(value):
    # '*' marks a missing value; replace it with an empty string
    if value == '*':
        value = ''
    return value

if __name__ == '__main__':
    # Read the file with pandas
    startTime = datetime.datetime.now()
    data = pd.read_csv('D:\\ipfile.txt', encoding='gbk', usecols=[0, 1, 2, 3, 4], header=None)
    data[0] = data[0].apply(ip2decStr)
    data[1] = data[1].apply(ip2decStr)
    data[4] = data[4].apply(converts)
    data.to_csv('D:\\ipResult02.csv', sep=',', index=False, header=False, encoding='gbk')
    stopTime = datetime.datetime.now()
    # total_seconds() gives the whole elapsed time; .microseconds is only the sub-second part
    print((stopTime-startTime).total_seconds())
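
`apply` still calls a Python function once per row. A possible variation, sketched below, does the conversion with column-wise operations instead. The helper name `ip_column_to_padded` is hypothetical, and I have not benchmarked this against the `apply` version on the 170 MB file, so treat any speed benefit as an assumption to verify.

```python
import io

import pandas as pd


def ip_column_to_padded(col: pd.Series, width: int = 10) -> pd.Series:
    """Column-wise dotted-decimal -> zero-padded decimal string."""
    # One column per octet, converted to integers ("001" -> 1)
    octets = col.str.split(".", expand=True).astype("int64")
    value = octets[0] * 256**3 + octets[1] * 256**2 + octets[2] * 256 + octets[3]
    return value.astype(str).str.zfill(width)


# In-memory stand-in for the real file, using rows from the sample above
sample = io.StringIO(
    "001.000.001.000,001.000.003.255,中國,福建,*,*,*,*\n"
    "001.000.004.000,001.000.007.255,澳大利亞,維多利亞州,墨爾本,*,*,*\n"
)
data = pd.read_csv(sample, usecols=[0, 1, 2, 3, 4], header=None, dtype=str)
data[0] = ip_column_to_padded(data[0])
data[1] = ip_column_to_padded(data[1])
data[4] = data[4].replace("*", "")  # exact-match replacement, not a pattern
print(data.to_csv(index=False, header=False), end="")
```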

Timing comparison:

Plain processing: 900000 microseconds vs. pandas: 618000 microseconds (as reported by `timedelta.microseconds`)
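
A note on measuring: `timedelta.microseconds` is only the sub-second component of a duration, whereas `total_seconds()` returns the whole duration. The sketch below illustrates the difference with a made-up duration (the 12.618 s value is hypothetical, not a measurement from this post) and shows `time.perf_counter()` as an alternative timer.

```python
import datetime
import time

# Hypothetical elapsed time of 12.618 s, chosen only to illustrate the point
elapsed = datetime.timedelta(seconds=12, microseconds=618000)
print(elapsed.microseconds)     # 618000 (sub-second part only)
print(elapsed.total_seconds())  # 12.618

# time.perf_counter() is a monotonic clock intended for measuring intervals
start = time.perf_counter()
sum(range(100_000))  # stand-in for the real workload
duration = time.perf_counter() - start
print(f"{duration:.6f} s")
```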

PS: problems I ran into:

1. The source file is GBK-encoded, so the encoding has to be specified; otherwise pandas assumes UTF-8.

data = pd.read_csv('D:\\ipfile.txt', encoding='gbk',  header=None)

2. Reading the file with the code above fails because the file is irregular: some rows have more columns than the others.

pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 73, saw 10

Since only the first five columns are needed, I restricted the read to them:

data = pd.read_csv('D:\\ipfile.txt', encoding='gbk', usecols=[0, 1, 2, 3, 4], header=None)
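
Both problems can be reproduced on a small in-memory sample. In the sketch below the two rows are made up in the same layout as the source file, and the extra fields `x,y` on the second row are my own addition to imitate an irregular line.

```python
import io

import pandas as pd

# Made-up rows; the second one has two extra fields
raw = (
    "001.000.001.000,001.000.003.255,中國,福建,*,*,*,*\n"
    "001.000.004.000,001.000.007.255,澳大利亞,維多利亞州,墨爾本,*,*,*,x,y\n"
).encode("gbk")

# Problem 1: these GBK bytes are not valid UTF-8
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 decode failed:", exc.reason)

# Problem 2: without usecols, the longer second row triggers a ParserError
try:
    pd.read_csv(io.BytesIO(raw), encoding="gbk", header=None)
except pd.errors.ParserError as exc:
    print("parse failed:", exc)

# Restricting the read to the first five columns avoids the error
data = pd.read_csv(io.BytesIO(raw), encoding="gbk",
                   usecols=[0, 1, 2, 3, 4], header=None)
print(data.shape)  # (2, 5)
```

If whole irregular rows should be dropped rather than truncated, recent pandas versions also accept `on_bad_lines='skip'` in `read_csv`; I did not need it here.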
