Pandas Practice

The exercises below come from "100 pandas data-analysis practice problems". They are just about enough to get familiar with the various pandas operations; for some of the problems I still do not fully understand the functions they use.
I have written all of the problems into a single .ipynb file and uploaded it to CSDN (0 points); link

I also want to share two sites for practice, where you can drill pandas through hands-on work (each has about ten practical problems):
I have not read both sets in full, only the Heywhale (Kesci) one, but judging from the problem titles and the exercise format on GitHub they are essentially identical.
Heywhale (Kesci): pandas data-analysis exercises \ pandas data-analysis exercises: GitHub link
Zhihu: 100 pandas practice problems \ 100 pandas practice problems: GitHub link




How to import pandas and check its version

import pandas as pd
print(pd.__version__)
print(pd.show_versions(as_json=True))
0.25.1
{'system': {'commit': None, 'python': '3.7.4.final.0', 'python-bits': 64, 'OS': 'Windows', 'OS-release': '10', 'machine': 'AMD64', 'processor': 'Intel64 Family 6 Model 142 Stepping 9, GenuineIntel', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'None', 'LOCALE': 'None.None'}, 'dependencies': {'pandas': '0.25.1', 'numpy': '1.16.5', 'pytz': '2019.3', 'dateutil': '2.8.0', 'pip': '19.2.3', 'setuptools': '41.4.0', 'Cython': '0.29.13', 'pytest': '5.2.1', 'hypothesis': None, 'sphinx': '2.2.0', 'blosc': None, 'feather': None, 'xlsxwriter': '1.2.1', 'lxml.etree': '4.4.1', 'html5lib': '1.0.1', 'pymysql': '0.9.3', 'psycopg2': None, 'jinja2': '2.10.3', 'IPython': '7.8.0', 'pandas_datareader': None, 'bs4': '4.8.0', 'bottleneck': '1.2.1', 'fastparquet': None, 'gcsfs': None, 'matplotlib': '3.1.1', 'numexpr': '2.7.0', 'odfpy': None, 'openpyxl': '3.0.0', 'pandas_gbq': None, 'pyarrow': None, 'pytables': None, 's3fs': None, 'scipy': '1.3.1', 'sqlalchemy': '1.3.9', 'tables': '3.5.2', 'xarray': None, 'xlrd': '1.2.0', 'xlwt': '1.3.0'}}
None

Convert a list, numpy array, or dict to a pd.Series

import numpy as np
mylist = list('abcdefghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = zip(mylist, myarr)  # note: this is an iterator of (letter, number) tuples, not a dict; dict(zip(...))
                             # would give a real dict, which is why ser3 below holds tuples rather than a letter-indexed Series
ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)
print(ser3.head())
0    (a, 0)
1    (b, 1)
2    (c, 2)
3    (d, 3)
4    (e, 4)
dtype: object

Turn a Series index into a DataFrame column

df = ser3.to_frame().reset_index()
df.head()
   index       0
0      0  (a, 0)
1      1  (b, 1)
2      2  (c, 2)
3      3  (d, 3)
4      4  (e, 4)

Combine multiple Series into one DataFrame

df = pd.DataFrame({'col1': ser1, 'col2':ser2})
df.head()
col1 col2
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4

Combine multiple Series into a DataFrame, aligned by index

s1 = ser1[:16]
s2 = ser2[14:]
pd.concat([s1,s2], axis=1)
0 1
0 a NaN
1 b NaN
2 c NaN
3 d NaN
4 e NaN
5 f NaN
6 g NaN
7 h NaN
8 i NaN
9 j NaN
10 k NaN
11 l NaN
12 m NaN
13 n NaN
14 o 14.0
15 p 15.0
16 NaN 16.0
17 NaN 17.0
18 NaN 18.0
19 NaN 19.0
20 NaN 20.0
21 NaN 21.0
22 NaN 22.0
23 NaN 23.0
24 NaN 24.0
25 NaN 25.0

Concatenate two Series end to end

pd.concat([s1,s2],axis=0)
0      a
1      b
2      c
3      d
4      e
5      f
6      g
7      h
8      i
9      j
10     k
11     l
12     m
13     n
14     o
15     p
14    14
15    15
16    16
17    17
18    18
19    19
20    20
21    21
22    22
23    23
24    24
25    25
dtype: object

Find the elements that are in Series A but not in Series B

ser1 = pd.Series([1,2,3,4,5])
ser2 = pd.Series([4,5,6,7,8])
ser1[~ser1.isin(ser2)]
0    1
1    2
2    3
dtype: int64

Union of two Series

np.union1d(ser1,ser2)
array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int64)

Intersection of two Series

np.intersect1d(ser1,ser2)
array([4, 5], dtype=int64)

Elements not shared by the two Series

u = pd.Series(np.union1d(ser1,ser2))
i = pd.Series(np.intersect1d(ser1,ser2))
u[~u.isin(i)]
0    1
1    2
2    3
5    6
6    7
7    8
dtype: int64

How to get the minimum, 25th percentile, median, 75th percentile, and maximum of a Series?

ser = pd.Series(np.random.normal(10, 5, 25))
# note: the original also created np.random.RandomState(100) here, but an unused RandomState
# does not seed np.random, so the draw above is not reproducible (see the seeded sketch below)
np.percentile(ser, q=[0, 25, 50, 75, 100])
array([-2.54523372,  7.86187042, 10.16123596, 15.60337005, 23.12409334])
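
The numbers above come from an unseeded draw, so they change on every run. A reproducible sketch of the same computation, seeding a RandomState and drawing from it:

state = np.random.RandomState(100)               # seeded generator
ser_seeded = pd.Series(state.normal(10, 5, 25))
np.percentile(ser_seeded, q=[0, 25, 50, 75, 100])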

How to get the frequency counts of the unique items in a Series?

ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))
ser.value_counts()
h    8
b    5
c    4
g    4
f    4
a    3
d    2
dtype: int64

Elements of a Series whose counts rank in the top 2

v_cnt = ser.value_counts()
print(v_cnt)
cnt_cnt = v_cnt.value_counts().index[:2]
print(cnt_cnt)
cnt_cnt
h    8
b    5
c    4
g    4
f    4
a    3
d    2
dtype: int64
Int64Index([4, 5], dtype='int64')

Int64Index([4, 5], dtype='int64')
v_cnt[v_cnt.isin(cnt_cnt)].index
Index(['b', 'c', 'g', 'f'], dtype='object')
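
Note that the code above keeps every element whose count is one of the two most common count values (4 and 5 here), which is why it returns four elements. If the goal is simply the two most frequent elements themselves, a shorter sketch is:

top2 = ser.value_counts().index[:2]   # value_counts is already sorted by count, descending
print(top2)                           # Index(['h', 'b'], dtype='object') for the counts shown above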

How to bin a numeric Series into 10 equal-sized groups

ser = pd.Series(np.random.random(20))
ser.head()
0    0.588945
1    0.356710
2    0.798986
3    0.170943
4    0.076717
dtype: float64
groups = pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1], 
        labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th'])
groups.head()
0    5th
1    3rd
2    9th
3    3rd
4    1st
dtype: category
Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]

How to convert a numpy array into a DataFrame of a given shape

ser = pd.Series(np.random.randint(1,10,35))
df = pd.DataFrame(ser.values.reshape(7,5))
df
0 1 2 3 4
0 8 6 5 4 1
1 1 7 8 1 4
2 5 3 5 7 5
3 8 6 3 4 6
4 5 9 2 4 3
5 3 7 6 8 7
6 6 2 2 7 5

How to find the positions of the numbers in a Series that are multiples of 2

ser = pd.Series(np.random.randint(1,10,7))
ser
0    1
1    8
2    7
3    9
4    9
5    2
6    6
dtype: int32
# ser[ser.map(lambda x: x%2 == 0)].index
np.argwhere(ser % 2 == 0)
array([[1],
       [5],
       [6]], dtype=int64)

How to extract items at given positions from a Series

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0,4,8,14,20]
ser.take(pos)
0     a
4     e
8     i
14    o
20    u
dtype: object

Get the positions of given elements

aims = list('adhz')
[pd.Index(ser).get_loc(i) for i in aims]
[0, 3, 7, 25]

How to compute the mean squared error between a truth Series and a prediction Series

truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)
np.mean((truth - pred)**2)
0.32466394194250286

How to capitalize the first character of each element in a Series

ser = pd.Series(['how','are','you'])
ser.map(lambda x: x.title())
0    How
1    Are
2    You
dtype: object

How to count the number of characters in each word of a Series

ser.map(lambda x:len(x))
0    3
1    3
2    3
dtype: int64

How to compute differences of time-series data

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])
# first-order difference
ser.diff()
0    NaN
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    6.0
7    8.0
dtype: float64
# second-order difference
ser.diff().diff()
0    NaN
1    NaN
2    1.0
3    1.0
4    1.0
5    1.0
6    0.0
7    2.0
dtype: float64

How to convert date strings in a Series to datetimes

import pandas as pd
ser = pd.Series(
    ['01 Jan 2010', 
     '02-02-2011', 
     '20120303', 
     '2013/04/04', 
     '2014-05-05', 
     '2015-06-06T12:20']
)
data = pd.to_datetime(ser)
data
0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]

How to extract the year/month/day/hour/minute/second from a datetime Series

data.dt.year
0    2010
1    2011
2    2012
3    2013
4    2014
5    2015
dtype: int64
data.dt.month
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
data.dt.day
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
data.dt.hour
0     0
1     0
2     0
3     0
4     0
5    12
dtype: int64
data.dt.minute
0     0
1     0
2     0
3     0
4     0
5    20
dtype: int64
data.dt.second
0    0
1    0
2    0
3    0
4    0
5    0
dtype: int64

Find the words in a Series that contain at least two vowels

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

def count(x):
    aims = 'aeiou'
    c= 0
    for i in x:
        if i in aims:
            c += 1
    return c

ser[ser.map(lambda x: count(x)) >= 2]
1    Orange
4     Money
dtype: object

How to filter valid email addresses from a Series

import re
emails = pd.Series(['buying books at amazom.com', 
                    '[email protected]', 
                    '[email protected]',
                    '[email protected]'])
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'
valid = emails.str.findall(pattern)
[x[0] for x in valid if len(x)]
['[email protected]', '[email protected]', '[email protected]']

Group Series A by Series B and compute each group's mean

import numpy as np
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))

weights.groupby(fruit).mean()
apple     6.5
banana    6.0
carrot    2.5
dtype: float64

How to compute the Euclidean distance between two Series

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

sum((p - q)**2)**.5
18.16590212458495
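
The same distance can be computed with numpy's norm, which reads a little more directly (a one-line sketch):

np.linalg.norm(p - q)   # Euclidean (L2) norm of the element-wise differences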

How to find all local maxima (peaks) in a numeric Series

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])
# the sign of the first difference flips from +1 to -1 at a peak, so its second diff equals -2 there
dd = np.diff(np.sign(np.diff(ser)))
peak_locs = np.where(dd == -2)[0] + 1
peak_locs
array([1, 5, 7], dtype=int64)

How to create a TimeSeries that starts from '2000-01-02' and contains 10 Saturdays

pd.Series(np.random.randint(1, 10, 10),
          pd.date_range('2000-01-02',
                        periods=10,
                        freq='W-SAT'
                       )
         )
2000-01-08    3
2000-01-15    5
2000-01-22    2
2000-01-29    4
2000-02-05    7
2000-02-12    9
2000-02-19    1
2000-02-26    4
2000-03-04    8
2000-03-11    9
Freq: W-SAT, dtype: int32

How to fill in the missing dates of a TimeSeries

ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01',
                                                       '2000-01-03', 
                                                       '2000-01-06', 
                                                       '2000-01-08']))
ser.resample('D').ffill()
2000-01-01     1.0
2000-01-02     1.0
2000-01-03    10.0
2000-01-04    10.0
2000-01-05    10.0
2000-01-06     3.0
2000-01-07     3.0
2000-01-08     NaN
Freq: D, dtype: float64

How to compute the autocorrelation of a Series

ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))
autocorrelations = [ser.autocorr(i).round(2) for i in range(11)]

autocorrelations
[1.0, 0.33, -0.03, -0.49, -0.48, -0.25, -0.25, 0.52, 0.49, 0.39, -0.26]
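
As a small follow-up sketch, the lag with the strongest (absolute) autocorrelation can be read off the same Series.autocorr calls:

lags = range(1, 11)
best_lag = max(lags, key=lambda k: abs(ser.autocorr(k)))
print('lag with the largest absolute autocorrelation:', best_lag)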

Read a csv taking only every nth row

# generate a csv for testing
fpath = 'testt.csv'
df = pd.DataFrame({'a': range(100), 
                   'b':np.random.choice(['apple', 'banana', 'carrot'], 100)})
df.to_csv(fpath, index=None)

# read every 20th row of the csv
import csv

with open(fpath, 'r') as f:
    reader = csv.reader(f)
    out = []
    for i, row in enumerate(reader):
        if i%20 ==0:
            out.append(row)
            
print(out)            
pd.DataFrame(out[1:], columns=out[0])
[['a', 'b'], ['19', 'carrot'], ['39', 'apple'], ['59', 'apple'], ['79', 'apple'], ['99', 'carrot']]
a b
0 19 carrot
1 39 apple
2 59 apple
3 79 apple
4 99 carrot
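
The same sampling can also be pushed into read_csv itself through its callable skiprows argument; a sketch that keeps the header line and every 20th data line after it:

df_sampled = pd.read_csv(fpath, skiprows=lambda i: i != 0 and i % 20 != 0)
print(df_sampled)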

Convert data while reading a csv

pd.read_csv(fpath, 
            converters={
                'a':lambda x: 'low' if int(x) < 3 else 'high'
            }).head()
a b
0 low banana
1 low apple
2 low apple
3 high banana
4 high carrot

Read only certain columns from a csv

pd.read_csv(fpath, usecols=['a']).head()
a
0 0
1 1
2 2
3 3
4 4

Get the data type of each DataFrame column

df=pd.DataFrame(
    {
        'a':range(100),
        'b':np.random.rand(100),
        'c':[1,2,3,4]*25,
        'd':['apple', 'banana', 'carrot']*33 + ['apple']
    }
)

df.dtypes
a      int64
b    float64
c      int64
d     object
dtype: object

Get the number of rows and columns of a DataFrame

df.shape
(100, 4)

Get basic descriptive statistics for each DataFrame column

df.describe()
a b c
count 100.000000 100.000000 100.000000
mean 49.500000 0.511879 2.500000
std 29.011492 0.293060 1.123666
min 0.000000 0.012651 1.000000
25% 24.750000 0.232498 1.750000
50% 49.500000 0.523401 2.500000
75% 74.250000 0.777009 3.250000
max 99.000000 0.993780 4.000000

Find the row of a DataFrame where column a takes its maximum value

df[df.a == max(df.a)]
a b c d
99 99 0.365934 4 apple

Get the row numbers where column c takes its maximum value

# df[df.c == max(df.c)].index
np.where(df.c == max(df.c))
(array([ 3,  7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67,
        71, 75, 79, 83, 87, 91, 95, 99], dtype=int64),)

Read a value from a DataFrame by row and column number

row = 4
col = 0
print(f'Value at row {row}, col {col}: {df.iat[row, col]}')
row = 4
col = 2
print(f'Value at row {row}, col {col}: {df.iat[row, col]}')
row = 0
col = 0
print(f'Value at row {row}, col {col}: {df.iat[row, col]}')
row = 33
col = 3
print(f'Value at row {row}, col {col}: {df.iat[row, col]}')
Value at row 4, col 0: 4
Value at row 4, col 2: 1
Value at row 0, col 0: 0
Value at row 33, col 3: apple

Read a value from a DataFrame by index and column name

index = 0
col = 'd'
print(f'index={index}, col={col} : {df.at[index, col]}')
index = 2
col = 'd'
print(f'index={index}, col={col} : {df.at[index, col]}')
index = 4
col = 'd'
print(f'index={index}, col={col} : {df.at[index, col]}')
index = 5
col = 'c'
print(f'index={index}, col={col} : {df.at[index, col]}')
index=0, col=d : apple
index=2, col=d : carrot
index=4, col=d : banana
index=5, col=c : 2

Rename a column of a DataFrame

df.rename(columns={'d':'e'}).head()
a b c e
0 0 0.233596 1 apple
1 1 0.457918 2 banana
2 2 0.866008 3 carrot
3 3 0.724397 4 apple
4 4 0.161981 1 banana

Check whether a DataFrame has any missing values

df = pd.DataFrame({
    'a':[1.2,2,3,4],
    'b':list('abcd')
})

print('has missing values:', df.isnull().values.any())
df.iat[0,0] = np.nan
print('has missing values:', df.isnull().values.any())
has missing values: False
has missing values: True

Count the missing values in each DataFrame column

df.apply(lambda x: x.isnull().sum())
a    1
b    0
dtype: int64

Replace missing values in a DataFrame with each column's mean

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# before the replacement
df[['Min.Price', 'Max.Price']].head()
Min.Price Max.Price
0 12.9 18.8
1 29.2 38.7
2 25.9 32.3
3 NaN 44.6
4 NaN NaN
# after the replacement
# only the ['Min.Price', 'Max.Price'] columns are used for this demo
df[['Min.Price', 'Max.Price']].apply(lambda x: x.fillna(x.mean())).head()
Min.Price Max.Price
0 12.900000 18.800000
1 29.200000 38.700000
2 25.900000 32.300000
3 17.118605 44.600000
4 17.118605 21.459091
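
To apply the same idea to every numeric column at once, a sketch working on a copy (so the df used in the next sections is untouched):

df_filled = df.copy()
num_cols = df_filled.select_dtypes(include='number').columns
df_filled[num_cols] = df_filled[num_cols].fillna(df_filled[num_cols].mean())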

Get one column from a DataFrame and return it as a DataFrame

df[['Manufacturer']].head()
Manufacturer
0 Acura
1 NaN
2 Audi
3 Audi
4 BMW

How to change the order of a DataFrame's columns

df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
df.head()
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
df[list('cbdae')]
c b d a e
0 2 1 3 0 4
1 7 6 8 5 9
2 12 11 13 10 14
3 17 16 18 15 19

Set how many rows and columns are shown when a DataFrame is printed

# before changing the settings
print(df)
pd.set_option('display.max_columns', 3)
pd.set_option('display.max_rows', 3)
# after changing the settings
print(df)
    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
     a  ...   e
0    0  ...   4
..  ..  ...  ..
3   15  ...  19

[4 rows x 5 columns]
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

Print a DataFrame without scientific notation

pd.DataFrame(np.random.random(4)**10, columns=['random'])
random
0 2.241557e-06
1 4.480715e-01
2 3.953685e-05
3 8.765943e-08
# change the setting
pd.set_option('display.float_format', lambda x: '%.4f' % x)
pd.DataFrame(np.random.random(4)**10, columns=['random'])
random
0 0.0000
1 0.8627
2 0.0000
3 0.0000
# restore the default
pd.set_option('display.float_format', None)
pd.DataFrame(np.random.random(4)**10, columns=['random'])
random
0 1.264273e-01
1 4.025861e-02
2 1.266545e-05
3 6.658184e-07

Display DataFrame values as percentages

df = pd.DataFrame(np.random.random(4), columns=['random'])
df
random
0 0.485751
1 0.043867
2 0.028074
3 0.649918
df.style.format({'random':'{0:.2%}'.format})
     random
0    48.58%
1     4.39%
2     2.81%
3    64.99%

Create a unique index from multiple columns

df = pd.read_csv(
    'https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', 
    usecols=[0,1,2,3,5])
df[['Manufacturer', 'Model', 'Type']] = df[['Manufacturer', 'Model', 'Type']].fillna('missing')
df.index = df.Manufacturer + '_' + df.Model + '_' + df.Type
df.head()
Manufacturer Model Type Min.Price Max.Price
Acura_Integra_Small Acura Integra Small 12.9 18.8
missing_Legend_Midsize missing Legend Midsize 29.2 38.7
Audi_90_Compact Audi 90 Compact 25.9 32.3
Audi_100_Midsize Audi 100 Midsize NaN 44.6
BMW_535i_Midsize BMW 535i Midsize NaN NaN

Get the row of the nth largest value

df = pd.DataFrame(
    np.random.randint(1, 30, 30).reshape(10,-1), 
    columns=list('abc'))
df['a']
0    18
1     8
2    21
3     2
4    14
5    14
6    28
7    11
8    26
9    27
Name: a, dtype: int32
# argsort gives the row positions ordered by ascending value
df['a'].argsort()
0    3
1    1
2    7
3    4
4    5
5    0
6    2
7    8
8    9
9    6
Name: a, dtype: int64
df['a'][df['a'].argsort()]
3     2
1     8
7    11
4    14
5    14
0    18
2    21
8    26
9    27
6    28
Name: a, dtype: int32
n = 5
df['a'].argsort()[::-1][n]
0
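
A note on that last step: with the default integer index, [n] selects by label rather than by position, so the [::-1] reversal does not change which element is returned; argsort()[n] is the row of the (n+1)-th smallest value, which only coincides with the 5th largest here because the column has 10 values. A more explicit sketch for "row label of the n-th largest value in column a":

n = 5
df['a'].sort_values(ascending=False).index[n - 1]   # 0 for the data shown above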

Get the rows whose row sum is greater than 100, and return the last two of them

df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))
rowsums = df.apply(np.sum, axis=1)
last_two_rows = df.iloc[np.where(rowsums > 100)[0][-2:], :]
last_two_rows
0 1 2 3
13 17 34 15 37
14 34 39 27 11

How to find and cap outliers in a Series or DataFrame column

Replace every value below the 5th percentile with the 5th-percentile value, and every value above the 95th percentile with the 95th-percentile value.

# Input
ser = pd.Series(np.logspace(-2, 2, 30))

# Solution
def cap_outliers(ser, low_perc, high_perc):
    low, high = ser.quantile([low_perc, high_perc])
    print(low_perc, '%ile: ', low, '|', high_perc, '%ile: ', high)
    ser[ser < low] = low
    ser[ser > high] = high
    return(ser)

capped_ser = cap_outliers(ser, .05, .95)
0.05 %ile:  0.016049294076965887 | 0.95 %ile:  63.876672220183934

How to reshape a DataFrame into the largest possible square after removing negative values

Reshape df into the largest possible square, dropping the negative values. If needed, also drop the smallest positive values. The remaining positive numbers should keep their original order.

df = pd.DataFrame(np.random.randint(-20, 50, 100).reshape(10,-1))
print(df)
    0   1   2   3   4   5   6   7   8   9
0  32  20   1  42  43  12 -11   4   4  17
1   7  30 -17 -12   4   2  -5  27 -11 -16
2 -15 -15 -18   7 -16  18   1  10   3  15
3  33  -9  28  27  11   4   6  -7  42   2
4  -3  49  17  25  14   4  39  23  41  33
5   6  25 -13   9 -18  17  -3  48   5   0
6  14  13  -8 -15  27  10 -12  11   4  28
7   7  28  32  19  28 -19  26  19  38  -4
8   0  43 -13  45  12  21  40  29  -1   6
9  40  10  18  25  20   9   7  35  13  -2
# Step 1: drop the negative numbers

arr = df[df > 0].values.flatten()
arr_qualified = arr[~np.isnan(arr)]
arr_qualified
array([32., 20.,  1., 42., 43., 12.,  4.,  4., 17.,  7., 30.,  4.,  2.,
       27.,  7., 18.,  1., 10.,  3., 15., 33., 28., 27., 11.,  4.,  6.,
       42.,  2., 49., 17., 25., 14.,  4., 39., 23., 41., 33.,  6., 25.,
        9., 17., 48.,  5., 14., 13., 27., 10., 11.,  4., 28.,  7., 28.,
       32., 19., 28., 26., 19., 38., 43., 45., 12., 21., 40., 29.,  6.,
       40., 10., 18., 25., 20.,  9.,  7., 35., 13.])
# Step 2: compute the side length of the square

n = int(np.floor(arr_qualified.shape[0]**.5))
n
8
# Step 3: reshape into the required square
top_indexes = np.argsort(arr_qualified)[::-1]
output = np.take(arr_qualified, sorted(top_indexes[:n**2])).reshape(n, -1)
print(output)
[[32. 20. 42. 43. 12. 17.  7. 30.]
 [ 4. 27.  7. 18. 10. 15. 33. 28.]
 [27. 11.  6. 42. 49. 17. 25. 14.]
 [39. 23. 41. 33.  6. 25.  9. 17.]
 [48.  5. 14. 13. 27. 10. 11. 28.]
 [ 7. 28. 32. 19. 28. 26. 19. 38.]
 [43. 45. 12. 21. 40. 29.  6. 40.]
 [10. 18. 25. 20.  9.  7. 35. 13.]]

Swap two rows of a DataFrame

Swap the two rows at positions 1 and 2 (the second and third rows).

df = pd.DataFrame(np.arange(25).reshape(5, -1))
df
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
a, b = df.iloc[1, :].copy(), df.iloc[2, :].copy()
df.iloc[1, :], df.iloc[2, :] = b, a
df
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 5 6 7 8 9
3 15 16 17 18 19
4 20 21 22 23 24

Reverse the row order of a DataFrame

df.iloc[::-1, :]
0 1 2 3 4
4 20 21 22 23 24
3 15 16 17 18 19
2 5 6 7 8 9
1 10 11 12 13 14
0 0 1 2 3 4

One-hot encode a categorical column

df = pd.DataFrame(np.arange(25).reshape(5,-1), columns=list('abcde'))
df
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
pd.concat([pd.get_dummies(df['a']), df[list('bcde')]], axis=1)
0 5 10 15 20 b c d e
0 1 0 0 0 0 1 2 3 4
1 0 1 0 0 0 6 7 8 9
2 0 0 1 0 0 11 12 13 14
3 0 0 0 1 0 16 17 18 19
4 0 0 0 0 1 21 22 23 24

Which column holds each row's maximum value

Find, for each row, the column in which the row maximum occurs.

df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1))
df.apply(np.argmax, axis=1)
0    1
1    1
2    1
3    2
4    2
5    2
6    0
7    1
8    3
9    3
dtype: int64

Compute each row's nearest row (by Euclidean distance)

nearest = {}
for i, row in df.iterrows():
    c = ((df - row)**2).sum(axis = 1).argsort()
    for j in c:
        if j != i:
            break
    nearest[i] = j
print(nearest)
{0: 2, 1: 7, 2: 7, 3: 5, 4: 0, 5: 3, 6: 2, 7: 2, 8: 9, 9: 8}

How to compute, for each column, its maximum correlation with the other columns

df = pd.DataFrame(
    np.random.randint(1,100, 80).reshape(8, -1), 
    columns=list('pqrstuvwxy'),
    index=list('abcdefgh')
)

abs_corrmat = np.abs(df.corr())
print(abs_corrmat)
max_corr = abs_corrmat.apply(lambda x: sorted(x)[-2])
print('Maximum Correlation possible for each column: ', np.round(max_corr.tolist(), 2))
          p         q         r         s         t         u         v  \
p  1.000000  0.664683  0.230041  0.017442  0.219024  0.548010  0.201785   
q  0.664683  1.000000  0.084993  0.145618  0.050637  0.846209  0.122756   
r  0.230041  0.084993  1.000000  0.527469  0.329950  0.251885  0.179597   
s  0.017442  0.145618  0.527469  1.000000  0.217383  0.157896  0.122085   
t  0.219024  0.050637  0.329950  0.217383  1.000000  0.265879  0.650415   
u  0.548010  0.846209  0.251885  0.157896  0.265879  1.000000  0.156130   
v  0.201785  0.122756  0.179597  0.122085  0.650415  0.156130  1.000000   
w  0.534019  0.097979  0.122874  0.459503  0.100161  0.353358  0.177244   
x  0.211146  0.376211  0.473825  0.176471  0.181969  0.338606  0.153986   
y  0.258789  0.088951  0.518529  0.261935  0.090374  0.318080  0.295910   

          w         x         y  
p  0.534019  0.211146  0.258789  
q  0.097979  0.376211  0.088951  
r  0.122874  0.473825  0.518529  
s  0.459503  0.176471  0.261935  
t  0.100161  0.181969  0.090374  
u  0.353358  0.338606  0.318080  
v  0.177244  0.153986  0.295910  
w  1.000000  0.008501  0.045387  
x  0.008501  1.000000  0.385229  
y  0.045387  0.385229  1.000000  
Maximum Correlation possible for each column:  [0.66 0.85 0.53 0.53 0.65 0.85 0.65 0.53 0.47 0.52]

Compute the ratio of the minimum to the maximum of each row

df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution 1
min_by_max = df.apply(lambda x: np.min(x)/np.max(x), axis=1)
min_by_max
0    0.094737
1    0.092784
2    0.223684
3    0.134021
4    0.012346
5    0.074074
6    0.046512
7    0.173469
dtype: float64

Find the second largest value in each row

# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)
    0   1   2   3   4   5   6   7   8   9  penultimate
0  43  42  77  25   1  12  25  53  58  64           64
1  13  46  77  37  33  88  67  81  36   2           81
2  31  77  95   8  20  39   7  38  71  75           77
3  23  41  66   8  77  68  11  51  70  70           70
4   3  93  83  27  66  72  24  18  92   8           92
5  26  52  62  39  12   5  71  78   2  62           71
6  22  29  83  29  37  49  22  32  90  45           83
7  69  46  75  60  74  83  33  37  79  47           79

How to normalize all the columns of a DataFrame

# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution Q1
out1 = df.apply(lambda x: ((x - x.mean())/x.std()).round(2))
print('Solution Q1\n',out1)

# Solution Q2
out2 = df.apply(lambda x: ((x.max() - x)/(x.max() - x.min())).round(2))
print('Solution Q2\n', out2)
Solution Q1
       0     1     2     3     4     5     6     7     8     9
0 -0.68 -0.76 -0.97  1.56  0.85 -0.17  1.16 -0.78  1.17 -1.22
1  1.93 -0.38  1.44  0.27 -0.96 -0.64 -0.65 -1.59 -0.98 -1.83
2  0.79  0.72 -0.18  0.57 -0.17  1.34  1.22  0.69 -1.16 -0.16
3  0.49 -0.62  0.91 -0.03 -1.30  0.66 -0.50  0.98  0.34  0.73
4 -0.19 -0.64 -0.85 -0.38  0.99 -1.08 -0.18  1.24  0.86  0.49
5 -0.65  1.79  1.14  0.27  1.27  0.66  1.08 -0.87  0.56  0.80
6 -0.72 -0.99 -0.63 -1.97 -0.99 -1.49 -1.14  0.36  0.56  0.63
7 -0.98  0.89 -0.85 -0.28  0.31  0.72 -1.00 -0.03 -1.35  0.56
Solution Q2
       0     1     2     3     4     5     6     7     8     9
0  0.90  0.92  1.00  0.00  0.16  0.53  0.02  0.71  0.00  0.77
1  0.00  0.78  0.00  0.37  0.87  0.70  0.79  1.00  0.85  1.00
2  0.39  0.39  0.67  0.28  0.56  0.00  0.00  0.20  0.93  0.36
3  0.49  0.86  0.22  0.45  1.00  0.24  0.73  0.09  0.33  0.03
4  0.73  0.88  0.95  0.55  0.11  0.85  0.59  0.00  0.12  0.12
5  0.89  0.00  0.12  0.37  0.00  0.24  0.06  0.75  0.24  0.00
6  0.91  1.00  0.86  1.00  0.88  1.00  1.00  0.31  0.24  0.06
7  1.00  0.32  0.95  0.52  0.37  0.22  0.94  0.45  1.00  0.09

How to compute the correlation of each row with the following row?

# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution
[df.iloc[i].corr(df.iloc[i+1]).round(2) for i in range(df.shape[0])[:-1]]
[-0.29, 0.25, 0.66, -0.26, -0.47, -0.53, -0.01]

How to set the diagonal entries of a DataFrame to 0 (both the main and the anti-diagonal here)

df = pd.DataFrame(np.random.randint(1,100, 100).reshape(10, -1))

# Solution
for i in range(df.shape[0]):
    df.iat[i, i] = 0
    df.iat[df.shape[0]-i-1, i] = 0
df
0 1 2 3 4 5 6 7 8 9
0 0 3 88 65 97 92 55 83 60 0
1 88 0 64 41 97 51 98 67 0 50
2 50 80 0 38 73 55 7 0 49 91
3 30 33 24 0 84 42 0 99 84 67
4 9 68 56 16 0 0 98 43 24 6
5 24 50 19 6 0 0 48 52 73 68
6 39 61 1 0 94 68 0 4 77 70
7 88 23 0 87 76 67 42 0 40 54
8 85 0 37 63 44 27 16 39 0 17
9 0 71 54 73 89 15 20 20 19 0
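
An equivalent sketch without the explicit loop, using numpy's fill_diagonal (np.fliplr returns a view, so writing into it hits the anti-diagonal of the same array):

vals = df.values
np.fill_diagonal(vals, 0)             # main diagonal
np.fill_diagonal(np.fliplr(vals), 0)  # anti-diagonal
df = pd.DataFrame(vals)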

Get one group's data after grouping a DataFrame

df = pd.DataFrame({'col1': ['apple', 'banana', 'orange'] * 3,
                   'col2': np.random.rand(9),
                   'col3': np.random.randint(0, 15, 9)})

df_grouped = df.groupby(['col1'])

# Solution 1
df_grouped.get_group('apple')
col1 col2 col3
0 apple 0.183578 1
3 apple 0.359194 12
6 apple 0.062158 6

Get the nth largest value within one group after grouping

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'taste': np.random.rand(9),
                   'price': np.random.randint(0, 15, 9)})

n=2
# Solution
df_grpd = df['taste'].groupby(df.fruit)
df_grpd.get_group('banana').sort_values().iloc[-n]
0.5900557522953728

Get each group's mean after grouping, keeping the grouping column as a regular column rather than the index

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'rating': np.random.rand(9),
                   'price': np.random.randint(0, 15, 9)})

# Solution
out = df.groupby('fruit', as_index=False)['price'].mean()
print(out)
    fruit     price
0   apple  9.666667
1  banana  8.666667
2  orange  3.666667

Merge two DataFrames on two columns, keeping only the rows present in both

df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.random.randint(0, 15, 9)})

df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
                    'kilo': ['high', 'low'] * 3,
                    'price': np.random.randint(0, 15, 6)})

# Solution
pd.merge(df1, df2, how='inner', left_on=['fruit', 'weight'], right_on=['pazham', 'kilo'], suffixes=['_left', '_right'])
fruit weight price_left pazham kilo price_right
0 apple high 14 apple high 9
1 apple high 10 apple high 9
2 apple high 7 apple high 9
3 orange low 5 orange low 8
4 orange low 5 orange low 8
5 orange low 0 orange low 8

How to remove from a DataFrame the rows that exist in another DataFrame

df1 = pd.DataFrame({'fruit': ['apple', 'orange', 'banana'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.arange(9)})

df2 = pd.DataFrame({'fruit': ['apple', 'orange', 'pine'] * 2,
                    'weight': ['high', 'medium'] * 3,
                    'price': np.arange(6)})


# Solution
print(df1[~df1.isin(df2).all(1)])

df1.isin(df2)
    fruit  weight  price
2  banana     low      2
3   apple    high      3
4  orange  medium      4
5  banana     low      5
6   apple    high      6
7  orange  medium      7
8  banana     low      8
fruit weight price
0 True True True
1 True True True
2 False False True
3 True False True
4 True False True
5 False False True
6 False False False
7 False False False
8 False False False
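
Worth noting: DataFrame.isin with another DataFrame compares element-wise on aligned labels rather than testing whole-row membership, so the result above depends on how the rows of df2 happen to line up with df1. A membership-based sketch using a left merge with an indicator column:

merged = df1.merge(df2, how='left', indicator=True)
only_in_df1 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_df1)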

How to get the positions where the values of two columns match

df = pd.DataFrame({'fruit1': np.random.choice(['apple', 'orange', 'banana'], 10),
                    'fruit2': np.random.choice(['apple', 'orange', 'banana'], 10)})

# Solution
np.where(df.fruit1 == df.fruit2)
(array([0, 1, 2, 4, 6, 7], dtype=int64),)

How to shift a time series forward or backward

Create new columns that are lagged or lead versions of existing columns.

df = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns = list('abcd'))

# Solution
df['a_lag1'] = df['a'].shift(1)
df['b_lead1'] = df['b'].shift(-1)
print(df)
    a   b   c   d  a_lag1  b_lead1
0  53  17  45  76     NaN     44.0
1  72  44  52  42    53.0     54.0
2  55  54  87  69    72.0     42.0
3  31  42  75  35    55.0     53.0
4  79  53  27  14    31.0      NaN

Get value counts over an entire DataFrame

df = pd.DataFrame(np.random.randint(1, 10, 20).reshape(-1, 4), columns = list('abcd'))
# Solution

pd.value_counts(df.values.ravel())
7    5
9    4
1    3
8    2
6    2
5    2
3    1
2    1
dtype: int64

Split a string column

df = pd.DataFrame(["STD, City    State",
                    "33, Kolkata    West Bengal",
                    "44, Chennai    Tamil Nadu",
                    "40, Hyderabad    Telengana",
                    "80, Bangalore    Karnataka"], columns=['row'])

# Solution
df.row.str.split(',|\t', expand=True)
     0                         1
0  STD            City    State
1   33        Kolkata    West Bengal
2   44        Chennai    Tamil Nadu
3   40      Hyderabad    Telengana
4   80      Bangalore    Karnataka
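
A possible follow-up (my sketch, not in the original post) is to promote the first split row to the column headers:

df_out = df.row.str.split(',|\t', expand=True)
new_header = df_out.iloc[0]     # the row that came from "STD, City    State"
df_out = df_out[1:]             # drop that row from the data
df_out.columns = new_header
print(df_out)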