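MongoDB: a course on data preprocessing (data wrangling)
A horse-drawn cart example
- We should not trust data blindly; think about where it came from: a human or a machine
Examples of data errors
- Google Street View misjudging a house
- Editing errors in online documents
- Wrong data types in tabular data
We need to assess our data to: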
- Test assumptions about
  - values
  - data types
  - shape
- Identify errors or outliers
- Find missing values
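The checks in this list can be sketched in a few lines of plain Python. This is only a minimal illustration, not course material: the sample `rows`, the column names, the valid range, and the helper name `assess` are all assumptions invented for the example.

```python
# Hypothetical sample rows; every name and value here is invented.
rows = [
    {"name": "A", "year": "1999"},
    {"name": "B", "year": ""},
    {"name": "C", "year": "20x1"},
    {"name": "D", "year": "3013"},
    {"name": "E"},
]

def assess(rows, expected_columns, numeric_column, low, high):
    """Return the row indexes that fail each kind of check."""
    report = {"bad_shape": [], "missing": [], "bad_type": [], "outliers": []}
    for i, row in enumerate(rows):
        # Shape: each row should have exactly the expected columns.
        if set(row) != set(expected_columns):
            report["bad_shape"].append(i)
        # Missing values: the column is absent or blank.
        raw = row.get(numeric_column)
        if raw is None or raw.strip() == "":
            report["missing"].append(i)
            continue
        # Data type: the value should parse as a number.
        try:
            number = float(raw)
        except ValueError:
            report["bad_type"].append(i)
            continue
        # Values: anything outside the assumed range is flagged as an outlier.
        if not low <= number <= high:
            report["outliers"].append(i)
    return report

report = assess(rows, ["name", "year"], "year", 1900, 2100)
print(report)
```

Each list in the report holds row indexes, so a flagged row can be inspected by hand before deciding whether to fix or drop it.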
Tabular Data
Tabular data: spreadsheets such as Microsoft Office Excel or OpenOffice
What people care about:
- Fields / labels
- Contents
- Rows / Columns / Values
CSV is lightweight
- Each line of text is a single row
- Fields are separated by a delimiter
- It contains just the data itself
- No special software is needed
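To make these properties concrete, here is a small sketch that reads CSV text from an in-memory string with Python's standard `csv` module. The sample text and the choice of a semicolon as delimiter are invented for illustration.

```python
import csv
import io

# Hypothetical CSV text: one header line, then one line per row.
text = "name;quantity\napple;3\npear;5\n"

# csv.reader splits each line into fields at the chosen delimiter.
reader = csv.reader(io.StringIO(text), delimiter=";")
header = next(reader)
rows = list(reader)

print(header)  # ['name', 'quantity']
print(rows)    # [['apple', '3'], ['pear', '5']]
```

Note that every field comes back as a string; converting `quantity` to a number is left to the caller.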
Parsing CSV Files in Python (in this case, CSV to dict)
Parsing a CSV file
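def parse_file(datafile):
    """Read up to the first 10 data rows of a CSV file into a list of dicts."""
    data = []
    # Open in text mode so each line is a str that can be split on ','.
    with open(datafile, 'r') as f:
        header = f.readline().split(',')
        counter = 0
        for line in f:
            if counter == 10:
                break
            fields = line.split(',')
            entry = {}
            for i, value in enumerate(fields):
                entry[header[i].strip()] = value.strip()
            data.append(entry)
            counter += 1
    return data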
Splitting lines this way breaks as soon as a field itself contains a comma.
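A small sketch of this failure, using an invented line whose first field is quoted because it contains a comma. It compares a plain `split(',')` with the standard library's `csv.reader`.

```python
import csv
import io

# Hypothetical line: the first field is quoted because it contains a comma.
line = '"Smith, Jane",42\n'

# Naive approach: split on every comma, including the one inside the quotes.
naive = line.strip().split(",")

# csv.reader understands the quoting and keeps the field intact.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # ['"Smith', ' Jane"', '42']
print(parsed)  # ['Smith, Jane', '42']
```

The naive split yields three fields instead of two and leaves the quote characters in the data.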
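The csv module solves many of these CSV problems:
import csv
import os
import pprint

# Placeholder values: point these at your own data directory and CSV file.
DATADIR = ""
DATAFILE = "data.csv"

def parse_csv(datafile):
    """Read every row of a CSV file into a list of dicts keyed by the header."""
    data = []
    # newline='' lets the csv module handle line endings itself (Python 3).
    with open(datafile, 'r', newline='') as sd:
        r = csv.DictReader(sd)
        for line in r:
            data.append(line)
    return data

if __name__ == '__main__':
    datafile = os.path.join(DATADIR, DATAFILE)
    d = parse_csv(datafile)
    pprint.pprint(d)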
Introduction to xlrd
import xlrd

datafile = "2013_ERCOT_Hourly_Load_Data.xls"

def parse_file(datafile):
    workbook = xlrd.open_workbook(datafile)
    sheet = workbook.sheet_by_index(0)

    # Read the whole sheet into a list of rows with a nested comprehension.
    data = [[sheet.cell_value(r, col)
             for col in range(sheet.ncols)]
            for r in range(sheet.nrows)]

    print("\nList Comprehension")
    print("data[3][2]:", data[3][2])

    print("\nCells in a nested loop:")
    for row in range(sheet.nrows):
        for col in range(sheet.ncols):
            if row == 50:
                print(sheet.cell_value(row, col), end=" ")

    # Other useful methods:
    print("\nROWS, COLUMNS, and CELLS:")
    print("Number of rows in the sheet:", sheet.nrows)
    print("Type of data in cell (row 3, col 2):", sheet.cell_type(3, 2))
    print("Value in cell (row 3, col 2):", sheet.cell_value(3, 2))
    print("Get a slice of values in column 3, from rows 1-3:")
    print(sheet.col_values(3, start_rowx=1, end_rowx=4))

    print("\nDATES:")
    print("Type of data in cell (row 1, col 0):", sheet.cell_type(1, 0))
    exceltime = sheet.cell_value(1, 0)
    print("Time in Excel format:", exceltime)
    print("Convert time to a Python datetime tuple, from the Excel float:",
          xlrd.xldate_as_tuple(exceltime, 0))

    return data

data = parse_file(datafile)
The code above is a worked example of another way to process data: reading the .xls format.
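The Excel time printed in the DATES part of that example is a float that counts days. As a rough, standard-library-only sketch of the idea behind `xlrd.xldate_as_tuple` with datemode 0, the conversion can be approximated as below. The helper name `excel_serial_to_datetime` is hypothetical. The sketch assumes the 1900 date system and serial values of 61 or more (1 March 1900 onward), where an epoch of 1899-12-30 gives the right answer. The serial values used are illustrative and not taken from the ERCOT file.

```python
from datetime import datetime, timedelta

# Assumed epoch for the Excel 1900 date system (valid for serials >= 61).
EXCEL_EPOCH_1900 = datetime(1899, 12, 30)

def excel_serial_to_datetime(serial):
    """Convert an Excel serial day count (float) to a datetime."""
    if serial < 61:
        # Earlier serials are affected by Excel's fictitious 29 Feb 1900.
        raise ValueError("this sketch only handles serial values >= 61")
    # The integer part is days; the fractional part is the time of day.
    return EXCEL_EPOCH_1900 + timedelta(days=serial)

midnight = excel_serial_to_datetime(41275.0)
noon = excel_serial_to_datetime(41275.5)
print(midnight)  # 2013-01-01 00:00:00
print(noon)      # 2013-01-01 12:00:00
```

For real workbooks, prefer the library's own conversion functions, since they also handle the 1904 date system and the early-1900 edge cases.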
Introduction to JSON
(JavaScript Object Notation.)
JSON is a syntax for storing and exchanging data.
JSON is text, written with JavaScript object notation.
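Some structures that CSV or XLS cannot store, such as several data rows inside a single cell, can be represented in JSON (in Python, a JSON object corresponds to a dict).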
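As a brief sketch of that correspondence, the standard `json` module converts between a Python dict and JSON text. The record below is invented; its `members` field holds several values in what would be one spreadsheet cell.

```python
import json

# Hypothetical record: the "members" field holds several values at once.
record = {"band": "Example Band", "members": ["Ann", "Bob", "Cy"]}

text = json.dumps(record)    # Python dict -> JSON text
restored = json.loads(text)  # JSON text -> Python dict

print(text)
print(type(restored).__name__, restored["members"][1])
```

The round trip gives back an equal dict, and the list inside it is still a list.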
Data modeling in JSON
- Items may have different fields
- Items may have nested objects
- Items may have nested arrays
- Two links:
  - w3school-json
  - json org
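The three modeling points above (different fields per item, nested objects, nested arrays) affect how the parsed data is read. The sketch below uses invented JSON text and shows one defensive way to handle it, using `dict.get` with defaults. The field names `area` and `tags` are assumptions for the example.

```python
import json

# Hypothetical JSON text: the two items do not share the same fields.
text = """
[
  {"name": "Band A", "area": {"name": "City X"}, "tags": ["rock", "pop"]},
  {"name": "Band B", "tags": []}
]
"""

items = json.loads(text)

summaries = []
for item in items:
    # .get() with a default copes with fields that some items lack.
    area_name = item.get("area", {}).get("name", "unknown")
    # A nested array is an ordinary Python list after parsing.
    tags = ", ".join(item.get("tags", [])) or "none"
    summaries.append((item["name"], area_name, tags))

# The union of keys shows which fields occur anywhere in the data.
all_fields = sorted({key for item in items for key in item})

print(summaries)
print(all_fields)
```

Indexing directly with `item["area"]["name"]` would raise `KeyError` on the second item, which is why the defaults are used here.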
JSON exercise
Quiz: Exploring JSON Data (a JSON exercise: find these answers in the provided data)
- How many bands are named 'First Aid Kit'?
- Begin-area name for Queen?
- Spanish alias for Beatles?
- Nirvana disambiguation?
- When was One Direction formed?
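Questions of this kind come down to filtering and indexing into parsed JSON. The sketch below works on a small invented list. The field names (`name`, `begin-area`, `aliases`, `locale`, `disambiguation`, `life-span`) are assumptions about the shape of the data, and every value is made up, so it does not answer the quiz.

```python
# Hypothetical records; field names are assumed and all values are invented.
artists = [
    {"name": "Example Band",
     "disambiguation": "placeholder note",
     "begin-area": {"name": "Sample City"},
     "life-span": {"begin": "2000"},
     "aliases": [{"locale": "es", "name": "Banda de Ejemplo"},
                 {"locale": "fr", "name": "Groupe Exemple"}]},
    {"name": "Example Band"},
    {"name": "Other Band", "aliases": []},
]

def count_named(artists, name):
    """Count the records whose name matches exactly."""
    return sum(1 for artist in artists if artist.get("name") == name)

def alias_for(artist, locale):
    """Return the first alias with the given locale, or None."""
    for alias in artist.get("aliases", []):
        if alias.get("locale") == locale:
            return alias.get("name")
    return None

first = artists[0]
print(count_named(artists, "Example Band"))     # 2
print(first.get("begin-area", {}).get("name"))  # Sample City
print(alias_for(first, "es"))                   # Banda de Ejemplo
print(first.get("disambiguation"))              # placeholder note
print(first.get("life-span", {}).get("begin"))  # 2000
```

With real data, check the actual field names in one record first, for example with `pprint`, before writing the lookups.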