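I recently had to convert a file of IP address ranges. After finishing the conversion in plain Python, I tried the same job with the data analysis library pandas, and it was indeed noticeably faster.
These are notes on using pandas to process a CSV file.
A sample of the source file (170 MB):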
000.000.000.000,000.255.255.255,保留地址,保留地址,*,*,*,*
001.000.000.000,001.000.000.255,APNIC,APNIC,*,apnic.net,*,*
001.000.001.000,001.000.003.255,中國,福建,*,*,*,*
001.000.004.000,001.000.007.255,澳大利亞,維多利亞州,墨爾本,*,*,*
001.000.008.000,001.000.015.255,中國,廣東,*,*,*,*
A sample of the target file:
0000000000,0016777215,保留地址,保留地址,
0016777216,0016777471,APNIC,APNIC,
0016777472,0016778239,中國,福建,
0016778240,0016779263,澳大利亞,維多利亞州,墨爾本
0016779264,0016781311,中國,廣東,
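The mapping between the two samples can be checked by hand: each of the four octets is one byte of a 32-bit integer, so 001.000.001.000 becomes 1*2^24 + 0*2^16 + 1*2^8 + 0 = 16777472, which is then left-padded with zeros to 10 digits. Here is a minimal sketch of that arithmetic; the helper name dotted_to_padded is made up for illustration and is not part of the scripts in this post.

```python
def dotted_to_padded(addr: str) -> str:
    """Convert a zero-padded dotted-quad string to a 10-digit decimal string."""
    # int("001") == 1, so the zero-padded octets parse cleanly.
    octets = bytes(int(part) for part in addr.split("."))
    # Interpret the four bytes as one big-endian unsigned integer.
    return str(int.from_bytes(octets, "big")).zfill(10)


# Address pairs taken from the source and target samples.
samples = {
    "000.000.000.000": "0000000000",
    "000.255.255.255": "0016777215",
    "001.000.001.000": "0016777472",
    "001.000.004.000": "0016778240",
    "001.000.015.255": "0016781311",
}

for addr, expected in samples.items():
    assert dotted_to_padded(addr) == expected
```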
Plain Python processing:
# coding:utf-8
import datetime


# Convert a standard dotted-decimal IP address to an integer
def ip2dec(addr):
    items = [int(x) for x in addr.split(".")]
    return sum(items[i] << [24, 16, 8, 0][i] for i in range(4))


def ip2decStr(addr):
    # Left-pad with zeros to 10 digits
    return str(ip2dec(addr)).zfill(10)


def saveFile(outfile, data):
    # The source file is GBK-encoded, so the output is written as GBK too
    with open(outfile, 'w', encoding='gbk') as fl:
        fl.writelines(data)


def dealFile(infile, outfile):
    data = []
    sep = ','
    with open(infile, encoding='gbk') as lines:
        for line in lines:
            word = line.split(sep, 5)
            ipStart = ip2decStr(word[0])
            ipStop = ip2decStr(word[1])
            country = word[2]
            province = word[3]
            # '*' marks a missing city; write it out as an empty field
            city = '' if word[4] == '*' else word[4]
            data.append(sep.join([ipStart, ipStop, country, province, city]) + '\n')
    saveFile(outfile, data)


if __name__ == '__main__':
    startTime = datetime.datetime.now()
    dealFile('D:\\ipfile.txt', 'D:\\ipResult01.txt')
    stopTime = datetime.datetime.now()
    # total_seconds() is the full elapsed time; .microseconds is only the sub-second part
    print((stopTime - startTime).total_seconds())

Processing with pandas:
# coding:utf-8
import datetime

import pandas as pd


# Convert a standard dotted-decimal IP address to an integer
def ip2dec(addr):
    items = [int(x) for x in addr.split(".")]
    return sum(items[i] << [24, 16, 8, 0][i] for i in range(4))


def ip2decStr(addr):
    # Left-pad with zeros to 10 digits
    return str(ip2dec(addr)).zfill(10)


def converts(value):
    # '*' marks a missing field; write it out as an empty string
    if value == '*':
        value = ''
    return value


if __name__ == '__main__':
    # Read the file with pandas
    startTime = datetime.datetime.now()
    data = pd.read_csv('D:\\ipfile.txt', encoding='gbk', usecols=[0, 1, 2, 3, 4], header=None)
    data[0] = data[0].apply(ip2decStr)
    data[1] = data[1].apply(ip2decStr)
    data[4] = data[4].apply(converts)
    data.to_csv('D:\\ipResult02.csv', sep=',', index=False, header=False, encoding='gbk')
    stopTime = datetime.datetime.now()
    # total_seconds() is the full elapsed time; .microseconds is only the sub-second part
    print((stopTime - startTime).total_seconds())

Timing comparison:
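Plain processing: 900000 microseconds vs. pandas: 618000 microseconds
Note that both figures were printed from timedelta.microseconds, which holds only the sub-second component of the elapsed time, so any whole seconds are not included in them.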
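To illustrate why the choice of attribute matters when timing a job like this, here is a minimal sketch. The 12.618-second duration is an invented value, not a measurement from this job, and the timed helper is a hypothetical name used only for this example.

```python
import datetime
import time

# Hypothetical elapsed time of 12.618 seconds.
elapsed = datetime.timedelta(seconds=12, microseconds=618000)
sub_second_part = elapsed.microseconds   # only the fractional part, in microseconds
full_seconds = elapsed.total_seconds()   # the whole duration, in seconds


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


total, seconds = timed(sum, range(1_000_000))
print(f"sub-second part: {sub_second_part} us, full duration: {full_seconds} s")
print(f"sum took {seconds:.6f} s")
```

time.perf_counter is used here instead of datetime.now because it is monotonic and intended for measuring intervals.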
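PS: problems encountered:
1. The source file is GBK-encoded, so the encoding has to be specified explicitly; otherwise the default encoding, UTF-8, is used: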
data = pd.read_csv('D:\\ipfile.txt', encoding='gbk', header=None)
2. Reading the file with the code above fails, because the file is irregular and some rows have extra columns:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 73, saw 10
Since only the first 5 columns are needed for this job, the read was changed to:
data = pd.read_csv('D:\\ipfile.txt', encoding='gbk', usecols=[0, 1, 2, 3, 4], header=None)
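Both problems can be reproduced on a small in-memory sample. The sketch below is illustrative only: the third row, with 10 fields, is invented to mimic the irregular rows in the real file, and the behaviour it shows (extra trailing fields being tolerated once usecols is given) is what the fix above relies on with pandas' default C parser.

```python
import io

import pandas as pd

# Two rows from the source sample plus one invented row with 10 fields,
# encoded as GBK to match the real file.
raw = (
    "001.000.000.000,001.000.000.255,APNIC,APNIC,*,apnic.net,*,*\n"
    "001.000.001.000,001.000.003.255,中國,福建,*,*,*,*\n"
    "001.000.008.000,001.000.015.255,中國,廣東,*,*,*,*,extra1,extra2\n"
).encode("gbk")

# Without usecols, the ragged third row is rejected by the parser.
try:
    pd.read_csv(io.BytesIO(raw), encoding="gbk", header=None)
    error = None
except pd.errors.ParserError as exc:
    error = str(exc)

# Restricting the read to the first five columns lets the file load.
data = pd.read_csv(io.BytesIO(raw), encoding="gbk", usecols=[0, 1, 2, 3, 4], header=None)

print(error)
print(data)
```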