Change data type of columns in Pandas

This article is translated from: Change data type of columns in Pandas

I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:

import pandas as pd
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.


Reply #1

Reference: https://stackoom.com/question/14fz4/更改Pandas中列的數據類型


Reply #2

How about this?

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df
Out[16]: 
  one  two three
0   a  1.2   4.2
1   b   70  0.03
2   x    5     0

df.dtypes
Out[17]: 
one      object
two      object
three    object

df[['two', 'three']] = df[['two', 'three']].astype(float)

df.dtypes
Out[19]: 
one       object
two      float64
three    float64

Reply #3

You have three main options for converting types in pandas:

  1. to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)

  2. astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).

  3. infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.

Read on for more detailed explanations and usage of each of these methods.
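
As a quick orientation before the details, here is a minimal sketch that applies all three options to the table from the question. The column names 'one', 'two' and 'three' are borrowed from Reply #2 and are only illustrative, as are the variable names.

```python
import pandas as pd

# The question's table: every value arrives as a string.
rows = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(rows, columns=['one', 'two', 'three'])

# Option 1: to_numeric() parses the strings and picks a numeric dtype itself.
via_to_numeric = pd.to_numeric(df['two'])

# Option 2: astype() converts to exactly the dtype you name.
via_astype = df['three'].astype(float)

# Option 3: infer_objects() does not parse strings, so the string
# columns of this table keep their string values.
via_infer = df.infer_objects()

print(via_to_numeric.dtype)
print(via_astype.dtype)
print(via_infer.dtypes)
```

For the table in the question, options 1 and 2 are the ones that actually produce floats; option 3 is aimed at object columns that already hold numbers.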


1. to_numeric()

The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

Basic usage

The input to to_numeric() is a Series or a single column of a DataFrame.

>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])

You can also use it to convert multiple columns of a DataFrame via the apply() method:

# convert all columns of DataFrame
df = df.apply(pd.to_numeric)

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

As long as your values can all be converted, that's probably all you need.
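
To see "as appropriate" in action, the sketch below converts two string columns of a made-up DataFrame (the column names 'count', 'price' and 'label' are assumptions for illustration). Each column gets its own dtype: whole numbers become integers and decimals become floats.

```python
import pandas as pd

# Hypothetical data: all three columns start out holding strings.
df = pd.DataFrame({
    'count': ['1', '2', '3'],
    'price': ['0.5', '1.5', '2.5'],
    'label': ['x', 'y', 'z'],
})

# Convert only the columns known to be numeric; 'label' is left untouched.
numeric_cols = ['count', 'price']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df.dtypes)
```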

Error handling

But what if some values can't be converted to a numeric type?

to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.

Here's an example using a Series of strings s which has the object dtype:

>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

The default behaviour is to raise an error if it can't convert a value. In this case, it can't cope with the string 'pandas':

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string

Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:

>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64
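
After coercing, it is often worth checking which inputs were turned into NaN. One way to do that, sketched here with the same Series, is to compare missing values before and after the conversion (the variable names are illustrative).

```python
import pandas as pd

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
converted = pd.to_numeric(s, errors='coerce')

# Values that were present in the input but are NaN after conversion
# are the ones to_numeric() could not parse.
failed = s[converted.isna() & s.notna()]
print(failed)
```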

The third option for errors is just to ignore the operation if an invalid value is encountered:

>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched

This last option is particularly useful when you want to convert your entire DataFrame, but don't know which of your columns can be converted reliably to a numeric type. In that case just write:

df.apply(pd.to_numeric, errors='ignore')

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
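
Recent pandas releases (2.2 onward, to my knowledge) deprecate errors='ignore' for to_numeric() and recommend catching the exception yourself. A minimal sketch of the same "convert what you can" behaviour, using a hypothetical helper named to_numeric_where_possible and the column names from Reply #2:

```python
import pandas as pd

def to_numeric_where_possible(column):
    """Return the column as numeric, or unchanged if any value fails to parse."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

rows = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(rows, columns=['one', 'two', 'three'])

# apply() calls the helper once per column.
converted = df.apply(to_numeric_where_possible)
print(converted.dtypes)
```

This addresses the "dynamic" case from the question: no column has to be named, and columns that are not fully numeric keep their original values.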

Downcasting

By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).

That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?

to_numeric() gives you the option to downcast to 'integer', 'signed', 'unsigned', or 'float'. Here's an example for a simple series s of integer type:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

Downcasting to 'integer' uses the smallest possible integer that can hold the values:

>>> pd.to_numeric(s, downcast='integer')
0    1
1    2
2   -7
dtype: int8

Downcasting to 'float' similarly picks a smaller than normal floating type:

>>> pd.to_numeric(s, downcast='float')
0    1.0
1    2.0
2   -7.0
dtype: float32
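
The same approach works for the other downcast targets. As a small sketch with a made-up Series of non-negative values, downcast='unsigned' selects the smallest unsigned integer type that fits, and memory_usage() shows the saving (variable names are illustrative).

```python
import pandas as pd

s = pd.Series([1, 2, 250])  # all values fit in an unsigned 8-bit integer
small = pd.to_numeric(s, downcast='unsigned')

# Compare the bytes used by the values alone (index excluded).
before = s.memory_usage(index=False)
after = small.memory_usage(index=False)
print(small.dtype, before, after)
```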

2. astype()

The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.

Basic usage

Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and astype() will try and convert it for you:

# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype('category')

Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example if you have a NaN or inf value you'll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.
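
To make the NaN case concrete, the sketch below triggers the error and then shows one possible way around it. The nullable 'Int64' dtype (capital I) is my own addition, not part of the original answer, and it assumes a pandas version new enough to provide it.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

# A NaN cannot be represented in a plain NumPy integer dtype.
conversion_failed = False
try:
    s.astype(int)
except ValueError as exc:
    conversion_failed = True
    print("astype(int) failed:", exc)

# Assumption: the pandas nullable integer dtype is available.
# It stores the missing value as <NA> instead of raising.
nullable = s.astype("Int64")
print(nullable)
```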

Be careful

astype() is powerful, but it will sometimes convert values "incorrectly". For example:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2^8 - 7)!

Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
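
A short sketch of why that helps: to_numeric() only downcasts when every value fits the target type, so a Series containing a negative number keeps its original dtype and values. The variable names here are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, -7])

# -7 cannot be stored in an unsigned type, so no downcast happens.
kept = pd.to_numeric(s, downcast='unsigned')
print(kept.dtype, kept.tolist())

# With only non-negative values the downcast goes ahead.
non_negative = pd.Series([1, 2, 7])
shrunk = pd.to_numeric(non_negative, downcast='unsigned')
print(shrunk.dtype, shrunk.tolist())
```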


3. infer_objects()

Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a    object
b    object
dtype: object

Using infer_objects(), you can change the type of column 'a' to int64:

>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object

Column 'b' has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use df.astype(int) instead.
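
The difference between the soft and the forced conversion can be sketched side by side on the same DataFrame (the names soft and hard are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3', '2', '1']}, dtype='object')

# Soft conversion: only column 'a', which already holds integers, is re-typed.
soft = df.infer_objects()

# Forced conversion: the strings in column 'b' are parsed into integers too.
hard = df.astype(int)

print(soft.dtypes)
print(hard.dtypes)
```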


Reply #4

Here is a function that takes as its arguments a DataFrame and a list of columns and coerces all data in the columns to numbers.

# df is the DataFrame, and column_list is a list of columns as strings (e.g. ["col1","col2","col3"])
# dependencies: pandas

def coerce_df_columns_to_numeric(df, column_list):
    df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')

So, for your example:

import pandas as pd

def coerce_df_columns_to_numeric(df, column_list):
    df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col1','col2','col3'])

coerce_df_columns_to_numeric(df, ['col2','col3'])

Reply #5

How about creating two dataframes, each with different data types for their columns, and then appending them together?

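# DataFrame.append() was removed in pandas 2.0; pd.concat() joins the two empty frames instead
d1 = pd.DataFrame(columns=[ 'float_column' ], dtype=float)
d1 = pd.concat([d1, pd.DataFrame(columns=[ 'string_column' ], dtype=str)], axis=1)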

Results

In[8]: d1.dtypes
Out[8]: 
float_column     float64
string_column     object
dtype: object

After the dataframe is created, you can populate it with floating point variables in the 1st column, and strings (or any data type you desire) in the 2nd column.


Reply #6

The code below will change the data type of the listed columns.

df[['col.name1', 'col.name2', ...]] = df[['col.name1', 'col.name2', ...]].astype('data_type')

In place of 'data_type', pass the dtype you want, such as str, float or int.
