Convert list of dictionaries to a pandas DataFrame

Source: Convert list of dictionaries to a pandas DataFrame

I have a list of dictionaries like this:

[{'points': 50, 'time': '5:00', 'year': 2010}, 
{'points': 25, 'time': '6:00', 'month': "february"}, 
{'points':90, 'time': '9:00', 'month': 'january'}, 
{'points_h1':20, 'month': 'june'}]

And I want to turn this into a pandas DataFrame like this:

      month  points  points_h1  time  year
0       NaN      50        NaN  5:00  2010
1  february      25        NaN  6:00   NaN
2   january      90        NaN  9:00   NaN
3      june     NaN         20   NaN   NaN

Note: Order of the columns does not matter.

How can I turn the list of dictionaries into a pandas DataFrame as shown above?


#1

Reference: https://stackoom.com/question/1Oat4/將字典列表轉換爲Pandas-DataFrame


#2

Supposing d is your list of dicts, simply:

pd.DataFrame(d)

#3

In pandas 0.16.2, I had to do pd.DataFrame.from_records(d) to get this to work.


#4

You can also use pd.DataFrame.from_dict(d):

In [8]: d = [{'points': 50, 'time': '5:00', 'year': 2010}, 
   ...: {'points': 25, 'time': '6:00', 'month': "february"}, 
   ...: {'points':90, 'time': '9:00', 'month': 'january'}, 
   ...: {'points_h1':20, 'month': 'june'}]

In [12]: pd.DataFrame.from_dict(d)
Out[12]: 
      month  points  points_h1  time    year
0       NaN    50.0        NaN  5:00  2010.0
1  february    25.0        NaN  6:00     NaN
2   january    90.0        NaN  9:00     NaN
3      june     NaN       20.0   NaN     NaN

#5

How do I convert a list of dictionaries to a pandas DataFrame?

The other answers are correct, but they say little about the advantages and limitations of these methods. The aim of this post is to show examples of these methods in different situations, discuss when to use them (and when not to), and suggest alternatives.


DataFrame(), DataFrame.from_records(), and .from_dict()

Depending on the structure and format of your data, there are situations where either all three methods work, or some work better than others, or some don't work at all.

Consider a very contrived example.

import numpy as np
import pandas as pd

np.random.seed(0)
data = pd.DataFrame(
    np.random.choice(10, (3, 4)), columns=list('ABCD')).to_dict('records')

print(data)
[{'A': 5, 'B': 0, 'C': 3, 'D': 3},
 {'A': 7, 'B': 9, 'C': 3, 'D': 5},
 {'A': 2, 'B': 4, 'C': 7, 'D': 6}]

This list consists of "records" with every key present. This is the simplest case you could encounter.

# The following methods all produce the same output.
pd.DataFrame(data)
pd.DataFrame.from_dict(data)
pd.DataFrame.from_records(data)

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
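One way to check that claim is to compare the three results directly. The following is only a sketch: data is written out literally so the snippet runs on its own, and the variable names are illustrative. On a recent pandas the comparison is expected to report that the frames match.

```python
import pandas as pd

# Same records as above, written out so the snippet is self-contained.
data = [
    {'A': 5, 'B': 0, 'C': 3, 'D': 3},
    {'A': 7, 'B': 9, 'C': 3, 'D': 5},
    {'A': 2, 'B': 4, 'C': 7, 'D': 6},
]

df_ctor = pd.DataFrame(data)
df_dict = pd.DataFrame.from_dict(data)
df_recs = pd.DataFrame.from_records(data)

# DataFrame.equals compares shape, labels, dtypes and values.
same = df_ctor.equals(df_dict) and df_ctor.equals(df_recs)
print(same)
```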

A Word on Dictionary Orientations: orient='index' / 'columns'

Before continuing, it is important to distinguish between the different types of dictionary orientations and how pandas supports them. There are two primary types: "columns" and "index".

orient='columns'
Dictionaries with the "columns" orientation will have their keys correspond to columns in the equivalent DataFrame.

For example, data above is in the "columns" orient.

data_c = [
 {'A': 5, 'B': 0, 'C': 3, 'D': 3},
 {'A': 7, 'B': 9, 'C': 3, 'D': 5},
 {'A': 2, 'B': 4, 'C': 7, 'D': 6}]

pd.DataFrame.from_dict(data_c, orient='columns')

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6

Note: If you are using pd.DataFrame.from_records, the orientation is assumed to be "columns" (you cannot specify otherwise), and the dictionaries will be loaded accordingly.

orient='index'
With this orient, keys are assumed to correspond to index values. This kind of data is best suited for pd.DataFrame.from_dict.

data_i ={
 0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
 1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
 2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}

pd.DataFrame.from_dict(data_i, orient='index')

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6

This case is not considered in the OP, but is still useful to know.

Setting Custom Index

If you need a custom index on the resultant DataFrame, you can set it using the index=... argument.

pd.DataFrame(data, index=['a', 'b', 'c'])
# pd.DataFrame.from_records(data, index=['a', 'b', 'c'])

   A  B  C  D
a  5  0  3  3
b  7  9  3  5
c  2  4  7  6

This is not supported by pd.DataFrame.from_dict.

Dealing with Missing Keys/Columns

All methods work out-of-the-box when handling dictionaries with missing keys/column values. For example,
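If you still want to build the frame with from_dict, one workaround is to assign the index after construction. A minimal sketch, assuming the same three records as data above (repeated here so the snippet runs on its own):

```python
import pandas as pd

data = [
    {'A': 5, 'B': 0, 'C': 3, 'D': 3},
    {'A': 7, 'B': 9, 'C': 3, 'D': 5},
    {'A': 2, 'B': 4, 'C': 7, 'D': 6},
]

# from_dict has no index parameter, so build first and relabel afterwards.
df = pd.DataFrame.from_dict(data)
df.index = ['a', 'b', 'c']
print(df)
```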

data2 = [
     {'A': 5, 'C': 3, 'D': 3},
     {'A': 7, 'B': 9, 'F': 5},
     {'B': 4, 'C': 7, 'E': 6}]

# The methods below all produce the same output.
pd.DataFrame(data2)
pd.DataFrame.from_dict(data2)
pd.DataFrame.from_records(data2)

     A    B    C    D    E    F
0  5.0  NaN  3.0  3.0  NaN  NaN
1  7.0  9.0  NaN  NaN  NaN  5.0
2  NaN  4.0  7.0  NaN  6.0  NaN
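Note that the missing entries become NaN, which is why the integer values are shown as floats. If you would rather keep integers, one option (an assumption about your needs, not part of the original answer) is pandas' nullable Int64 dtype, available in pandas 0.24 and later:

```python
import pandas as pd

data2 = [
    {'A': 5, 'C': 3, 'D': 3},
    {'A': 7, 'B': 9, 'F': 5},
    {'B': 4, 'C': 7, 'E': 6},
]

df = pd.DataFrame(data2)
print(df.dtypes)       # columns are upcast to float64 to hold NaN

# Nullable integer dtype: missing entries become <NA> instead of NaN.
df_int = df.astype('Int64')
print(df_int)
```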

Reading Subset of Columns

"What if I don't want to read in every single column?" You can easily specify this using the columns=... parameter.

For example, from the example dictionary of data2 above, if you wanted to read only columns 'A', 'D', and 'F', you can do so by passing a list:

pd.DataFrame(data2, columns=['A', 'D', 'F'])
# pd.DataFrame.from_records(data2, columns=['A', 'D', 'F'])

     A    D    F
0  5.0  3.0  NaN
1  7.0  NaN  5.0
2  NaN  NaN  NaN

This is not supported by pd.DataFrame.from_dict with the default orient "columns".

pd.DataFrame.from_dict(data2, orient='columns', columns=['A', 'B'])

ValueError: cannot use columns parameter with orient='columns'
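If you need from_dict anyway, a simple workaround is to build the full frame and select the columns afterwards. A minimal sketch, with data2 repeated so the snippet runs on its own:

```python
import pandas as pd

data2 = [
    {'A': 5, 'C': 3, 'D': 3},
    {'A': 7, 'B': 9, 'F': 5},
    {'B': 4, 'C': 7, 'E': 6},
]

# Passing columns= together with orient='columns' is rejected.
try:
    pd.DataFrame.from_dict(data2, orient='columns', columns=['A', 'D', 'F'])
except ValueError as exc:
    print(exc)

# Workaround: build everything, then keep only the columns of interest.
subset = pd.DataFrame.from_dict(data2)[['A', 'D', 'F']]
print(subset)
```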

Reading Subset of Rows

Not supported by any of these methods directly. You will have to iterate over your data and perform a reverse delete in-place as you iterate. For example, to extract only the 0th and 2nd rows from data2 above, you can use:

rows_to_select = {0, 2}
for i in reversed(range(len(data2))):
    if i not in rows_to_select:
        del data2[i]

pd.DataFrame(data2)
# pd.DataFrame.from_dict(data2)
# pd.DataFrame.from_records(data2)

     A    B  C    D    E
0  5.0  NaN  3  3.0  NaN
1  NaN  4.0  7  NaN  6.0
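The in-place delete modifies the original list. If you would rather leave the input untouched, a list comprehension is one alternative. This is a sketch only; records and selected are illustrative names, and records is a fresh copy of the original three dictionaries:

```python
import pandas as pd

records = [
    {'A': 5, 'C': 3, 'D': 3},
    {'A': 7, 'B': 9, 'F': 5},
    {'B': 4, 'C': 7, 'E': 6},
]

rows_to_select = {0, 2}

# Build a new list with only the wanted positions; records is not modified.
selected = [row for i, row in enumerate(records) if i in rows_to_select]

df = pd.DataFrame(selected)
print(df)
```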

The Panacea: json_normalize for Nested Data

A robust alternative to the methods outlined above is the json_normalize function, which works with lists of dictionaries (records) and can also handle nested dictionaries.

pd.json_normalize(data)  # pd.io.json.json_normalize(data) in pandas < 1.0

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6

pd.json_normalize(data2)  # pd.io.json.json_normalize(data2) in pandas < 1.0

     A    B  C    D    E
0  5.0  NaN  3  3.0  NaN
1  NaN  4.0  7  NaN  6.0

Again, keep in mind that the data passed to json_normalize needs to be in the list-of-dictionaries (records) format.

As mentioned, json_normalize can also handle nested dictionaries. Here's an example taken from the documentation.

data_nested = [
  {'counties': [{'name': 'Dade', 'population': 12345},
                {'name': 'Broward', 'population': 40000},
                {'name': 'Palm Beach', 'population': 60000}],
   'info': {'governor': 'Rick Scott'},
   'shortname': 'FL',
   'state': 'Florida'},
  {'counties': [{'name': 'Summit', 'population': 1234},
                {'name': 'Cuyahoga', 'population': 1337}],
   'info': {'governor': 'John Kasich'},
   'shortname': 'OH',
   'state': 'Ohio'}
]

# pd.io.json.json_normalize in pandas < 1.0
pd.json_normalize(data_nested,
                  record_path='counties',
                  meta=['state', 'shortname', ['info', 'governor']])

         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

For more information on the meta and record_path arguments, check out the documentation.


Summarising

Here's a table of all the methods discussed above, along with supported features/functionality.
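Without record_path, json_normalize simply flattens nested dictionaries into dotted column names. A small sketch, assuming pandas 1.0 or later (where pd.json_normalize is available at the top level); the nested list below is a trimmed-down, illustrative version of the data above:

```python
import pandas as pd

nested = [
    {'state': 'Florida', 'info': {'governor': 'Rick Scott'}},
    {'state': 'Ohio', 'info': {'governor': 'John Kasich'}},
]

# Nested keys are joined with '.' by default.
flat = pd.json_normalize(nested)
print(flat.columns.tolist())

# The separator can be changed with the sep parameter.
flat_us = pd.json_normalize(nested, sep='_')
print(flat_us.columns.tolist())
```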

[Image: summary table of the methods and their supported features; not reproduced here]

* Use orient='columns' and then transpose to get the same effect as orient='index'.
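The transpose trick from the footnote can be sketched as follows, with data_i repeated from the orient='index' example so the snippet runs on its own (via_index and via_transpose are illustrative names):

```python
import pandas as pd

data_i = {
    0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
    1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
    2: {'A': 2, 'B': 4, 'C': 7, 'D': 6},
}

# Outer keys become the index.
via_index = pd.DataFrame.from_dict(data_i, orient='index')

# Outer keys become columns first; transposing moves them to the index.
via_transpose = pd.DataFrame.from_dict(data_i, orient='columns').T

print(via_index)
print(via_transpose)
```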


#6

I know a few people will come across this and find that nothing here helps. The easiest way I have found to do it is like this:

# DataFrame.append was removed in pandas 2.0, so build one-row frames
# and concatenate them. (The original loop also stopped one item early.)
frames = [pd.DataFrame(d, index=[0]) for d in dict_list]
df = pd.concat(frames, ignore_index=True)

Hope this helps someone!
