Pandas Practice

The exercises below come from "100 pandas data-analysis practice problems". They are just about enough to get familiar with the various pandas operations; for some of the problems I still do not fully understand the functions they use.
I have written all of the problems into a single .ipynb file and uploaded it to CSDN (0 points); link

I also want to share two sites for practice, where you can drill pandas through hands-on work (each has about ten practical problems):
I have not read both sets in full, only the Heywhale (Kesci) one, but judging from the problem titles and the exercise format on GitHub they are essentially identical.
Heywhale (Kesci): pandas data-analysis exercises \ pandas data-analysis exercises: GitHub link
Zhihu: 100 pandas practice problems \ 100 pandas practice problems: GitHub link




How to import pandas and check its version

import pandas as pd
print(pd.__version__)
print(pd.show_versions(as_json=True))
0.25.1
{'system': {'commit': None, 'python': '3.7.4.final.0', 'python-bits': 64, 'OS': 'Windows', 'OS-release': '10', 'machine': 'AMD64', 'processor': 'Intel64 Family 6 Model 142 Stepping 9, GenuineIntel', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'None', 'LOCALE': 'None.None'}, 'dependencies': {'pandas': '0.25.1', 'numpy': '1.16.5', 'pytz': '2019.3', 'dateutil': '2.8.0', 'pip': '19.2.3', 'setuptools': '41.4.0', 'Cython': '0.29.13', 'pytest': '5.2.1', 'hypothesis': None, 'sphinx': '2.2.0', 'blosc': None, 'feather': None, 'xlsxwriter': '1.2.1', 'lxml.etree': '4.4.1', 'html5lib': '1.0.1', 'pymysql': '0.9.3', 'psycopg2': None, 'jinja2': '2.10.3', 'IPython': '7.8.0', 'pandas_datareader': None, 'bs4': '4.8.0', 'bottleneck': '1.2.1', 'fastparquet': None, 'gcsfs': None, 'matplotlib': '3.1.1', 'numexpr': '2.7.0', 'odfpy': None, 'openpyxl': '3.0.0', 'pandas_gbq': None, 'pyarrow': None, 'pytables': None, 's3fs': None, 'scipy': '1.3.1', 'sqlalchemy': '1.3.9', 'tables': '3.5.2', 'xarray': None, 'xlrd': '1.2.0', 'xlwt': '1.3.0'}}
None

Convert a list, numpy array, or dict to a pd.Series

import numpy as np
mylist = list('abcdefghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = zip(mylist, myarr)  # note: this is an iterator of (letter, number) tuples, not a dict; dict(zip(...))
                             # would give a real dict, which is why ser3 below holds tuples rather than a letter-indexed Series
ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)
print(ser3.head())
0    (a, 0)
1    (b, 1)
2    (c, 2)
3    (d, 3)
4    (e, 4)
dtype: object

Turn a Series index into a DataFrame column

df = ser3.to_frame().reset_index()
df.head()
   index       0
0      0  (a, 0)
1      1  (b, 1)
2      2  (c, 2)
3      3  (d, 3)
4      4  (e, 4)

Combine multiple Series into one DataFrame

df = pd.DataFrame({'col1': ser1, 'col2':ser2})
df.head()
col1 col2
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4

Combine multiple Series into a DataFrame, aligned by index

s1 = ser1[:16]
s2 = ser2[14:]
pd.concat([s1,s2], axis=1)
0 1
0 a NaN
1 b NaN
2 c NaN
3 d NaN
4 e NaN
5 f NaN
6 g NaN
7 h NaN
8 i NaN
9 j NaN
10 k NaN
11 l NaN
12 m NaN
13 n NaN
14 o 14.0
15 p 15.0
16 NaN 16.0
17 NaN 17.0
18 NaN 18.0
19 NaN 19.0
20 NaN 20.0
21 NaN 21.0
22 NaN 22.0
23 NaN 23.0
24 NaN 24.0
25 NaN 25.0

Concatenate two Series end to end

pd.concat([s1,s2],axis=0)
0      a
1      b
2      c
3      d
4      e
5      f
6      g
7      h
8      i
9      j
10     k
11     l
12     m
13     n
14     o
15     p
14    14
15    15
16    16
17    17
18    18
19    19
20    20
21    21
22    22
23    23
24    24
25    25
dtype: object

Find the elements that are in Series A but not in Series B

ser1 = pd.Series([1,2,3,4,5])
ser2 = pd.Series([4,5,6,7,8])
ser1[~ser1.isin(ser2)]
0    1
1    2
2    3
dtype: int64

Union of two Series

np.union1d(ser1,ser2)
array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int64)

Intersection of two Series

np.intersect1d(ser1,ser2)
array([4, 5], dtype=int64)

Elements not shared by the two Series

u = pd.Series(np.union1d(ser1,ser2))
i = pd.Series(np.intersect1d(ser1,ser2))
u[~u.isin(i)]
0    1
1    2
2    3
5    6
6    7
7    8
dtype: int64

How to get the minimum, 25th percentile, median, 75th percentile, and maximum of a Series?

ser = pd.Series(np.random.normal(10, 5, 25))
# note: the original also created np.random.RandomState(100) here, but an unused RandomState
# does not seed np.random, so the draw above is not reproducible (see the seeded sketch below)
np.percentile(ser, q=[0, 25, 50, 75, 100])
array([-2.54523372,  7.86187042, 10.16123596, 15.60337005, 23.12409334])
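
The numbers above come from an unseeded draw, so they change on every run. A reproducible sketch of the same computation, seeding a RandomState and drawing from it:

state = np.random.RandomState(100)               # seeded generator
ser_seeded = pd.Series(state.normal(10, 5, 25))
np.percentile(ser_seeded, q=[0, 25, 50, 75, 100])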

How to get the frequency counts of the unique items in a Series?

ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))
ser.value_counts()
h    8
b    5
c    4
g    4
f    4
a    3
d    2
dtype: int64

Elements of a Series whose counts rank in the top 2

v_cnt = ser.value_counts()
print(v_cnt)
cnt_cnt = v_cnt.value_counts().index[:2]
print(cnt_cnt)
cnt_cnt
h    8
b    5
c    4
g    4
f    4
a    3
d    2
dtype: int64
Int64Index([4, 5], dtype='int64')

Int64Index([4, 5], dtype='int64')
v_cnt[v_cnt.isin(cnt_cnt)].index
Index(['b', 'c', 'g', 'f'], dtype='object')
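
Note that the code above keeps every element whose count is one of the two most common count values (4 and 5 here), which is why it returns four elements. If the goal is simply the two most frequent elements themselves, a shorter sketch is:

top2 = ser.value_counts().index[:2]   # value_counts is already sorted by count, descending
print(top2)                           # Index(['h', 'b'], dtype='object') for the counts shown above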

How to bin a numeric Series into 10 equal-sized groups

ser = pd.Series(np.random.random(20))
ser.head()
0    0.588945
1    0.356710
2    0.798986
3    0.170943
4    0.076717
dtype: float64
groups = pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1], 
        labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th'])
groups.head()
0    5th
1    3rd
2    9th
3    3rd
4    1st
dtype: category
Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]

How to convert a numpy array into a DataFrame of a given shape

ser = pd.Series(np.random.randint(1,10,35))
df = pd.DataFrame(ser.values.reshape(7,5))
df
0 1 2 3 4
0 8 6 5 4 1
1 1 7 8 1 4
2 5 3 5 7 5
3 8 6 3 4 6
4 5 9 2 4 3
5 3 7 6 8 7
6 6 2 2 7 5

How to find the positions of the numbers in a Series that are multiples of 2

ser = pd.Series(np.random.randint(1,10,7))
ser
0    1
1    8
2    7
3    9
4    9
5    2
6    6
dtype: int32
# ser[ser.map(lambda x: x%2 == 0)].index
np.argwhere(ser % 2 == 0)
array([[1],
       [5],
       [6]], dtype=int64)

How to extract items at given positions from a Series

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0,4,8,14,20]
ser.take(pos)
0     a
4     e
8     i
14    o
20    u
dtype: object

Get the positions of given elements

aims = list('adhz')
[pd.Index(ser).get_loc(i) for i in aims]
[0, 3, 7, 25]

How to compute the mean squared error between a truth Series and a prediction Series

truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)
np.mean((truth - pred)**2)
0.32466394194250286

How to capitalize the first character of each element in a Series

ser = pd.Series(['how','are','you'])
ser.map(lambda x: x.title())
0    How
1    Are
2    You
dtype: object

How to count the number of characters in each word of a Series

ser.map(lambda x:len(x))
0    3
1    3
2    3
dtype: int64

How to compute differences of time-series data

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])
# first-order difference
ser.diff()
0    NaN
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    6.0
7    8.0
dtype: float64
# second-order difference
ser.diff().diff()
0    NaN
1    NaN
2    1.0
3    1.0
4    1.0
5    1.0
6    0.0
7    2.0
dtype: float64

How to convert date strings in a Series to datetimes

import pandas as pd
ser = pd.Series(
    ['01 Jan 2010', 
     '02-02-2011', 
     '20120303', 
     '2013/04/04', 
     '2014-05-05', 
     '2015-06-06T12:20']
)
data = pd.to_datetime(ser)
data
0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]

How to extract the year/month/day/hour/minute/second from a datetime Series

data.dt.year
0    2010
1    2011
2    2012
3    2013
4    2014
5    2015
dtype: int64
data.dt.month
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
data.dt.day
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
data.dt.hour
0     0
1     0
2     0
3     0
4     0
5    12
dtype: int64
data.dt.minute
0     0
1     0
2     0
3     0
4     0
5    20
dtype: int64
data.dt.second
0    0
1    0
2    0
3    0
4    0
5    0
dtype: int64

Find the words in a Series that contain at least two vowels

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

def count(x):
    aims = 'aeiou'
    c= 0
    for i in x:
        if i in aims:
            c += 1
    return c

ser[ser.map(lambda x: count(x)) >= 2]
1    Orange
4     Money
dtype: object

How to filter valid email addresses from a Series

import re
emails = pd.Series(['buying books at amazom.com', 
                    '[email protected]', 
                    '[email protected]',
                    '[email protected]'])
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'
valid = emails.str.findall(pattern)
[x[0] for x in valid if len(x)]
['[email protected]', '[email protected]', '[email protected]']

Group Series A by Series B and compute each group's mean

import numpy as np
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))

weights.groupby(fruit).mean()
apple     6.5
banana    6.0
carrot    2.5
dtype: float64

How to compute the Euclidean distance between two Series

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

sum((p - q)**2)**.5
18.16590212458495
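
The same distance can be computed with numpy's norm, which reads a little more directly (a one-line sketch):

np.linalg.norm(p - q)   # Euclidean (L2) norm of the element-wise differences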

How to find all local maxima (peaks) in a numeric Series

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])
# the sign of the first difference flips from +1 to -1 at a peak, so its second diff equals -2 there
dd = np.diff(np.sign(np.diff(ser)))
peak_locs = np.where(dd == -2)[0] + 1
peak_locs
array([1, 5, 7], dtype=int64)

How to create a TimeSeries that starts from '2000-01-02' and contains 10 Saturdays

pd.Series(np.random.randint(1, 10, 10),
          pd.date_range('2000-01-02',
                        periods=10,
                        freq='W-SAT'
                       )
         )
2000-01-08    3
2000-01-15    5
2000-01-22    2
2000-01-29    4
2000-02-05    7
2000-02-12    9
2000-02-19    1
2000-02-26    4
2000-03-04    8
2000-03-11    9
Freq: W-SAT, dtype: int32

How to fill in the missing dates of a TimeSeries

ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01',
                                                       '2000-01-03', 
                                                       '2000-01-06', 
                                                       '2000-01-08']))
ser.resample('D').ffill()
2000-01-01     1.0
2000-01-02     1.0
2000-01-03    10.0
2000-01-04    10.0
2000-01-05    10.0
2000-01-06     3.0
2000-01-07     3.0
2000-01-08     NaN
Freq: D, dtype: float64

How to compute the autocorrelation of a Series

ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))
autocorrelations = [ser.autocorr(i).round(2) for i in range(11)]

autocorrelations
[1.0, 0.33, -0.03, -0.49, -0.48, -0.25, -0.25, 0.52, 0.49, 0.39, -0.26]
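
As a small follow-up sketch, the lag with the strongest (absolute) autocorrelation can be read off the same Series.autocorr calls:

lags = range(1, 11)
best_lag = max(lags, key=lambda k: abs(ser.autocorr(k)))
print('lag with the largest absolute autocorrelation:', best_lag)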

Read a csv taking only every nth row

# generate a csv for testing
fpath = 'testt.csv'
df = pd.DataFrame({'a': range(100), 
                   'b':np.random.choice(['apple', 'banana', 'carrot'], 100)})
df.to_csv(fpath, index=None)

# read every 20th row of the csv
import csv

with open(fpath, 'r') as f:
    reader = csv.reader(f)
    out = []
    for i, row in enumerate(reader):
        if i%20 ==0:
            out.append(row)
            
print(out)            
pd.DataFrame(out[1:], columns=out[0])
[['a', 'b'], ['19', 'carrot'], ['39', 'apple'], ['59', 'apple'], ['79', 'apple'], ['99', 'carrot']]
a b
0 19 carrot
1 39 apple
2 59 apple
3 79 apple
4 99 carrot
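
The same sampling can also be pushed into read_csv itself through its callable skiprows argument; a sketch that keeps the header line and every 20th data line after it:

df_sampled = pd.read_csv(fpath, skiprows=lambda i: i != 0 and i % 20 != 0)
print(df_sampled)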

Convert data while reading a csv

pd.read_csv(fpath, 
            converters={
                'a':lambda x: 'low' if int(x) < 3 else 'high'
            }).head()
a b
0 low banana
1 low apple
2 low apple
3 high banana
4 high carrot

Read only certain columns from a csv

pd.read_csv(fpath, usecols=['a']).head()
a
0 0
1 1
2 2
3 3
4 4

Get the data type of each DataFrame column

df=pd.DataFrame(
    {
        'a':range(100),
        'b':np.random.rand(100),
        'c':[1,2,3,4]*25,
        'd':['apple', 'banana', 'carrot']*33 + ['apple']
    }
)

df.dtypes
a      int64
b    float64
c      int64
d     object
dtype: object

Get the number of rows and columns of a DataFrame

df.shape
(100, 4)

Get basic descriptive statistics for each DataFrame column

df.describe()
a b c
count 100.000000 100.000000 100.000000
mean 49.500000 0.511879 2.500000
std 29.011492 0.293060 1.123666
min 0.000000 0.012651 1.000000
25% 24.750000 0.232498 1.750000
50% 49.500000 0.523401 2.500000
75% 74.250000 0.777009 3.250000
max 99.000000 0.993780 4.000000

Find the row of a DataFrame where column a takes its maximum value

df[df.a == max(df.a)]
a b c d
99 99 0.365934 4 apple

Get the row numbers where column c takes its maximum value

# df[df.c == max(df.c)].index
np.where(df.c == max(df.c))
(array([ 3,  7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67,
        71, 75, 79, 83, 87, 91, 95, 99], dtype=int64),)

Read a value from a DataFrame by row and column number

row = 4
col = 0
print(f'Value at row {row}, col {col}: {df.iat[row, col]}')
row = 4
col = 2
print(f'Value at row {row}, col {col}: {df.iat[row, col]}')
row = 0
col = 0
print(f'Value at row {row}, col {col}: {df.iat[row, col]}')
row = 33
col = 3
print(f'Value at row {row}, col {col}: {df.iat[row, col]}')
Value at row 4, col 0: 4
Value at row 4, col 2: 1
Value at row 0, col 0: 0
Value at row 33, col 3: apple

Read a value from a DataFrame by index and column name

index = 0
col = 'd'
print(f'index={index}, col={col} : {df.at[index, col]}')
index = 2
col = 'd'
print(f'index={index}, col={col} : {df.at[index, col]}')
index = 4
col = 'd'
print(f'index={index}, col={col} : {df.at[index, col]}')
index = 5
col = 'c'
print(f'index={index}, col={col} : {df.at[index, col]}')
index=0, col=d : apple
index=2, col=d : carrot
index=4, col=d : banana
index=5, col=c : 2

Rename a column of a DataFrame

df.rename(columns={'d':'e'}).head()
a b c e
0 0 0.233596 1 apple
1 1 0.457918 2 banana
2 2 0.866008 3 carrot
3 3 0.724397 4 apple
4 4 0.161981 1 banana

Check whether a DataFrame has any missing values

df = pd.DataFrame({
    'a':[1.2,2,3,4],
    'b':list('abcd')
})

print('has missing values:', df.isnull().values.any())
df.iat[0,0] = np.nan
print('has missing values:', df.isnull().values.any())
has missing values: False
has missing values: True

Count the missing values in each DataFrame column

df.apply(lambda x: x.isnull().sum())
a    1
b    0
dtype: int64

Replace missing values in a DataFrame with each column's mean

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# before the replacement
df[['Min.Price', 'Max.Price']].head()
Min.Price Max.Price
0 12.9 18.8
1 29.2 38.7
2 25.9 32.3
3 NaN 44.6
4 NaN NaN
# after the replacement
# only the ['Min.Price', 'Max.Price'] columns are used for this demo
df[['Min.Price', 'Max.Price']].apply(lambda x: x.fillna(x.mean())).head()
Min.Price Max.Price
0 12.900000 18.800000
1 29.200000 38.700000
2 25.900000 32.300000
3 17.118605 44.600000
4 17.118605 21.459091
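
To apply the same idea to every numeric column at once, a sketch working on a copy (so the df used in the next sections is untouched):

df_filled = df.copy()
num_cols = df_filled.select_dtypes(include='number').columns
df_filled[num_cols] = df_filled[num_cols].fillna(df_filled[num_cols].mean())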

Get one column from a DataFrame and return it as a DataFrame

df[['Manufacturer']].head()
Manufacturer
0 Acura
1 NaN
2 Audi
3 Audi
4 BMW

How to change the order of a DataFrame's columns

df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
df.head()
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
df[list('cbdae')]
c b d a e
0 2 1 3 0 4
1 7 6 8 5 9
2 12 11 13 10 14
3 17 16 18 15 19

Set how many rows and columns are shown when a DataFrame is printed

# before changing the settings
print(df)
pd.set_option('display.max_columns', 3)
pd.set_option('display.max_rows', 3)
# after changing the settings
print(df)
    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
     a  ...   e
0    0  ...   4
..  ..  ...  ..
3   15  ...  19

[4 rows x 5 columns]
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

Print a DataFrame without scientific notation

pd.DataFrame(np.random.random(4)**10, columns=['random'])
random
0 2.241557e-06
1 4.480715e-01
2 3.953685e-05
3 8.765943e-08
# change the setting
pd.set_option('display.float_format', lambda x: '%.4f' % x)
pd.DataFrame(np.random.random(4)**10, columns=['random'])
random
0 0.0000
1 0.8627
2 0.0000
3 0.0000
# restore the default
pd.set_option('display.float_format', None)
pd.DataFrame(np.random.random(4)**10, columns=['random'])
random
0 1.264273e-01
1 4.025861e-02
2 1.266545e-05
3 6.658184e-07

Display DataFrame values as percentages

df = pd.DataFrame(np.random.random(4), columns=['random'])
df
random
0 0.485751
1 0.043867
2 0.028074
3 0.649918
df.style.format({'random':'{0:.2%}'.format})
     random
0    48.58%
1     4.39%
2     2.81%
3    64.99%

Create a unique index from multiple columns

df = pd.read_csv(
    'https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', 
    usecols=[0,1,2,3,5])
df[['Manufacturer', 'Model', 'Type']] = df[['Manufacturer', 'Model', 'Type']].fillna('missing')
df.index = df.Manufacturer + '_' + df.Model + '_' + df.Type
df.head()
Manufacturer Model Type Min.Price Max.Price
Acura_Integra_Small Acura Integra Small 12.9 18.8
missing_Legend_Midsize missing Legend Midsize 29.2 38.7
Audi_90_Compact Audi 90 Compact 25.9 32.3
Audi_100_Midsize Audi 100 Midsize NaN 44.6
BMW_535i_Midsize BMW 535i Midsize NaN NaN

Get the row of the nth largest value

df = pd.DataFrame(
    np.random.randint(1, 30, 30).reshape(10,-1), 
    columns=list('abc'))
df['a']
0    18
1     8
2    21
3     2
4    14
5    14
6    28
7    11
8    26
9    27
Name: a, dtype: int32
# argsort gives the row positions ordered by ascending value
df['a'].argsort()
0    3
1    1
2    7
3    4
4    5
5    0
6    2
7    8
8    9
9    6
Name: a, dtype: int64
df['a'][df['a'].argsort()]
3     2
1     8
7    11
4    14
5    14
0    18
2    21
8    26
9    27
6    28
Name: a, dtype: int32
n = 5
df['a'].argsort()[::-1][n]
0
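
A note on that last step: with the default integer index, [n] selects by label rather than by position, so the [::-1] reversal does not change which element is returned; argsort()[n] is the row of the (n+1)-th smallest value, which only coincides with the 5th largest here because the column has 10 values. A more explicit sketch for "row label of the n-th largest value in column a":

n = 5
df['a'].sort_values(ascending=False).index[n - 1]   # 0 for the data shown above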

Get the rows whose row sum is greater than 100, and return the last two of them

df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))
rowsums = df.apply(np.sum, axis=1)
last_two_rows = df.iloc[np.where(rowsums > 100)[0][-2:], :]
last_two_rows
0 1 2 3
13 17 34 15 37
14 34 39 27 11

How to find and cap outliers in a Series or DataFrame column

Replace every value below the 5th percentile with the 5th-percentile value, and every value above the 95th percentile with the 95th-percentile value.

# Input
ser = pd.Series(np.logspace(-2, 2, 30))

# Solution
def cap_outliers(ser, low_perc, high_perc):
    low, high = ser.quantile([low_perc, high_perc])
    print(low_perc, '%ile: ', low, '|', high_perc, '%ile: ', high)
    ser[ser < low] = low
    ser[ser > high] = high
    return(ser)

capped_ser = cap_outliers(ser, .05, .95)
0.05 %ile:  0.016049294076965887 | 0.95 %ile:  63.876672220183934

How to reshape a DataFrame into the largest possible square after removing negative values

Reshape df into the largest possible square, dropping the negative values. If needed, also drop the smallest positive values. The remaining positive numbers should keep their original order.

df = pd.DataFrame(np.random.randint(-20, 50, 100).reshape(10,-1))
print(df)
    0   1   2   3   4   5   6   7   8   9
0  32  20   1  42  43  12 -11   4   4  17
1   7  30 -17 -12   4   2  -5  27 -11 -16
2 -15 -15 -18   7 -16  18   1  10   3  15
3  33  -9  28  27  11   4   6  -7  42   2
4  -3  49  17  25  14   4  39  23  41  33
5   6  25 -13   9 -18  17  -3  48   5   0
6  14  13  -8 -15  27  10 -12  11   4  28
7   7  28  32  19  28 -19  26  19  38  -4
8   0  43 -13  45  12  21  40  29  -1   6
9  40  10  18  25  20   9   7  35  13  -2
# Step 1: drop the negative numbers

arr = df[df > 0].values.flatten()
arr_qualified = arr[~np.isnan(arr)]
arr_qualified
array([32., 20.,  1., 42., 43., 12.,  4.,  4., 17.,  7., 30.,  4.,  2.,
       27.,  7., 18.,  1., 10.,  3., 15., 33., 28., 27., 11.,  4.,  6.,
       42.,  2., 49., 17., 25., 14.,  4., 39., 23., 41., 33.,  6., 25.,
        9., 17., 48.,  5., 14., 13., 27., 10., 11.,  4., 28.,  7., 28.,
       32., 19., 28., 26., 19., 38., 43., 45., 12., 21., 40., 29.,  6.,
       40., 10., 18., 25., 20.,  9.,  7., 35., 13.])
# Step 2: compute the side length of the square

n = int(np.floor(arr_qualified.shape[0]**.5))
n
8
# Step 3: reshape into the required square
top_indexes = np.argsort(arr_qualified)[::-1]
output = np.take(arr_qualified, sorted(top_indexes[:n**2])).reshape(n, -1)
print(output)
[[32. 20. 42. 43. 12. 17.  7. 30.]
 [ 4. 27.  7. 18. 10. 15. 33. 28.]
 [27. 11.  6. 42. 49. 17. 25. 14.]
 [39. 23. 41. 33.  6. 25.  9. 17.]
 [48.  5. 14. 13. 27. 10. 11. 28.]
 [ 7. 28. 32. 19. 28. 26. 19. 38.]
 [43. 45. 12. 21. 40. 29.  6. 40.]
 [10. 18. 25. 20.  9.  7. 35. 13.]]

Swap two rows of a DataFrame

Swap the two rows at positions 1 and 2 (the second and third rows).

df = pd.DataFrame(np.arange(25).reshape(5, -1))
df
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
a, b = df.iloc[1, :].copy(), df.iloc[2, :].copy()
df.iloc[1, :], df.iloc[2, :] = b, a
df
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 5 6 7 8 9
3 15 16 17 18 19
4 20 21 22 23 24

Reverse the row order of a DataFrame

df.iloc[::-1, :]
0 1 2 3 4
4 20 21 22 23 24
3 15 16 17 18 19
2 5 6 7 8 9
1 10 11 12 13 14
0 0 1 2 3 4

One-hot encode a categorical column

df = pd.DataFrame(np.arange(25).reshape(5,-1), columns=list('abcde'))
df
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
pd.concat([pd.get_dummies(df['a']), df[list('bcde')]], axis=1)
0 5 10 15 20 b c d e
0 1 0 0 0 0 1 2 3 4
1 0 1 0 0 0 6 7 8 9
2 0 0 1 0 0 11 12 13 14
3 0 0 0 1 0 16 17 18 19
4 0 0 0 0 1 21 22 23 24

Which column holds each row's maximum value

Find, for each row, the column in which the row maximum occurs.

df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1))
df.apply(np.argmax, axis=1)
0    1
1    1
2    1
3    2
4    2
5    2
6    0
7    1
8    3
9    3
dtype: int64

Compute each row's nearest row (by Euclidean distance)

nearest = {}
for i, row in df.iterrows():
    c = ((df - row)**2).sum(axis = 1).argsort()
    for j in c:
        if j != i:
            break
    nearest[i] = j
print(nearest)
{0: 2, 1: 7, 2: 7, 3: 5, 4: 0, 5: 3, 6: 2, 7: 2, 8: 9, 9: 8}

How to compute, for each column, its maximum correlation with the other columns

df = pd.DataFrame(
    np.random.randint(1,100, 80).reshape(8, -1), 
    columns=list('pqrstuvwxy'),
    index=list('abcdefgh')
)

abs_corrmat = np.abs(df.corr())
print(abs_corrmat)
max_corr = abs_corrmat.apply(lambda x: sorted(x)[-2])
print('Maximum Correlation possible for each column: ', np.round(max_corr.tolist(), 2))
          p         q         r         s         t         u         v  \
p  1.000000  0.664683  0.230041  0.017442  0.219024  0.548010  0.201785   
q  0.664683  1.000000  0.084993  0.145618  0.050637  0.846209  0.122756   
r  0.230041  0.084993  1.000000  0.527469  0.329950  0.251885  0.179597   
s  0.017442  0.145618  0.527469  1.000000  0.217383  0.157896  0.122085   
t  0.219024  0.050637  0.329950  0.217383  1.000000  0.265879  0.650415   
u  0.548010  0.846209  0.251885  0.157896  0.265879  1.000000  0.156130   
v  0.201785  0.122756  0.179597  0.122085  0.650415  0.156130  1.000000   
w  0.534019  0.097979  0.122874  0.459503  0.100161  0.353358  0.177244   
x  0.211146  0.376211  0.473825  0.176471  0.181969  0.338606  0.153986   
y  0.258789  0.088951  0.518529  0.261935  0.090374  0.318080  0.295910   

          w         x         y  
p  0.534019  0.211146  0.258789  
q  0.097979  0.376211  0.088951  
r  0.122874  0.473825  0.518529  
s  0.459503  0.176471  0.261935  
t  0.100161  0.181969  0.090374  
u  0.353358  0.338606  0.318080  
v  0.177244  0.153986  0.295910  
w  1.000000  0.008501  0.045387  
x  0.008501  1.000000  0.385229  
y  0.045387  0.385229  1.000000  
Maximum Correlation possible for each column:  [0.66 0.85 0.53 0.53 0.65 0.85 0.65 0.53 0.47 0.52]

Compute the ratio of the minimum to the maximum of each row

df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution 1
min_by_max = df.apply(lambda x: np.min(x)/np.max(x), axis=1)
min_by_max
0    0.094737
1    0.092784
2    0.223684
3    0.134021
4    0.012346
5    0.074074
6    0.046512
7    0.173469
dtype: float64

Find the second largest value in each row

# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)
    0   1   2   3   4   5   6   7   8   9  penultimate
0  43  42  77  25   1  12  25  53  58  64           64
1  13  46  77  37  33  88  67  81  36   2           81
2  31  77  95   8  20  39   7  38  71  75           77
3  23  41  66   8  77  68  11  51  70  70           70
4   3  93  83  27  66  72  24  18  92   8           92
5  26  52  62  39  12   5  71  78   2  62           71
6  22  29  83  29  37  49  22  32  90  45           83
7  69  46  75  60  74  83  33  37  79  47           79

How to normalize all the columns of a DataFrame

# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution Q1
out1 = df.apply(lambda x: ((x - x.mean())/x.std()).round(2))
print('Solution Q1\n',out1)

# Solution Q2
out2 = df.apply(lambda x: ((x.max() - x)/(x.max() - x.min())).round(2))
print('Solution Q2\n', out2)
Solution Q1
       0     1     2     3     4     5     6     7     8     9
0 -0.68 -0.76 -0.97  1.56  0.85 -0.17  1.16 -0.78  1.17 -1.22
1  1.93 -0.38  1.44  0.27 -0.96 -0.64 -0.65 -1.59 -0.98 -1.83
2  0.79  0.72 -0.18  0.57 -0.17  1.34  1.22  0.69 -1.16 -0.16
3  0.49 -0.62  0.91 -0.03 -1.30  0.66 -0.50  0.98  0.34  0.73
4 -0.19 -0.64 -0.85 -0.38  0.99 -1.08 -0.18  1.24  0.86  0.49
5 -0.65  1.79  1.14  0.27  1.27  0.66  1.08 -0.87  0.56  0.80
6 -0.72 -0.99 -0.63 -1.97 -0.99 -1.49 -1.14  0.36  0.56  0.63
7 -0.98  0.89 -0.85 -0.28  0.31  0.72 -1.00 -0.03 -1.35  0.56
Solution Q2
       0     1     2     3     4     5     6     7     8     9
0  0.90  0.92  1.00  0.00  0.16  0.53  0.02  0.71  0.00  0.77
1  0.00  0.78  0.00  0.37  0.87  0.70  0.79  1.00  0.85  1.00
2  0.39  0.39  0.67  0.28  0.56  0.00  0.00  0.20  0.93  0.36
3  0.49  0.86  0.22  0.45  1.00  0.24  0.73  0.09  0.33  0.03
4  0.73  0.88  0.95  0.55  0.11  0.85  0.59  0.00  0.12  0.12
5  0.89  0.00  0.12  0.37  0.00  0.24  0.06  0.75  0.24  0.00
6  0.91  1.00  0.86  1.00  0.88  1.00  1.00  0.31  0.24  0.06
7  1.00  0.32  0.95  0.52  0.37  0.22  0.94  0.45  1.00  0.09

How to compute the correlation of each row with the following row?

# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution
[df.iloc[i].corr(df.iloc[i+1]).round(2) for i in range(df.shape[0])[:-1]]
[-0.29, 0.25, 0.66, -0.26, -0.47, -0.53, -0.01]

How to set the diagonal entries of a DataFrame to 0 (both the main and the anti-diagonal here)

df = pd.DataFrame(np.random.randint(1,100, 100).reshape(10, -1))

# Solution
for i in range(df.shape[0]):
    df.iat[i, i] = 0
    df.iat[df.shape[0]-i-1, i] = 0
df
0 1 2 3 4 5 6 7 8 9
0 0 3 88 65 97 92 55 83 60 0
1 88 0 64 41 97 51 98 67 0 50
2 50 80 0 38 73 55 7 0 49 91
3 30 33 24 0 84 42 0 99 84 67
4 9 68 56 16 0 0 98 43 24 6
5 24 50 19 6 0 0 48 52 73 68
6 39 61 1 0 94 68 0 4 77 70
7 88 23 0 87 76 67 42 0 40 54
8 85 0 37 63 44 27 16 39 0 17
9 0 71 54 73 89 15 20 20 19 0
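
An equivalent sketch without the explicit loop, using numpy's fill_diagonal (np.fliplr returns a view, so writing into it hits the anti-diagonal of the same array):

vals = df.values
np.fill_diagonal(vals, 0)             # main diagonal
np.fill_diagonal(np.fliplr(vals), 0)  # anti-diagonal
df = pd.DataFrame(vals)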

Get one group's data after grouping a DataFrame

df = pd.DataFrame({'col1': ['apple', 'banana', 'orange'] * 3,
                   'col2': np.random.rand(9),
                   'col3': np.random.randint(0, 15, 9)})

df_grouped = df.groupby(['col1'])

# Solution 1
df_grouped.get_group('apple')
col1 col2 col3
0 apple 0.183578 1
3 apple 0.359194 12
6 apple 0.062158 6

Get the nth largest value within one group after grouping

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'taste': np.random.rand(9),
                   'price': np.random.randint(0, 15, 9)})

n=2
# Solution
df_grpd = df['taste'].groupby(df.fruit)
df_grpd.get_group('banana').sort_values().iloc[-n]
0.5900557522953728

Get each group's mean after grouping, keeping the grouping column as a regular column rather than the index

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'rating': np.random.rand(9),
                   'price': np.random.randint(0, 15, 9)})

# Solution
out = df.groupby('fruit', as_index=False)['price'].mean()
print(out)
    fruit     price
0   apple  9.666667
1  banana  8.666667
2  orange  3.666667

Merge two DataFrames on two columns, keeping only the rows present in both

df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.random.randint(0, 15, 9)})

df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
                    'kilo': ['high', 'low'] * 3,
                    'price': np.random.randint(0, 15, 6)})

# Solution
pd.merge(df1, df2, how='inner', left_on=['fruit', 'weight'], right_on=['pazham', 'kilo'], suffixes=['_left', '_right'])
fruit weight price_left pazham kilo price_right
0 apple high 14 apple high 9
1 apple high 10 apple high 9
2 apple high 7 apple high 9
3 orange low 5 orange low 8
4 orange low 5 orange low 8
5 orange low 0 orange low 8

How to remove from a DataFrame the rows that exist in another DataFrame

df1 = pd.DataFrame({'fruit': ['apple', 'orange', 'banana'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.arange(9)})

df2 = pd.DataFrame({'fruit': ['apple', 'orange', 'pine'] * 2,
                    'weight': ['high', 'medium'] * 3,
                    'price': np.arange(6)})


# Solution
print(df1[~df1.isin(df2).all(1)])

df1.isin(df2)
    fruit  weight  price
2  banana     low      2
3   apple    high      3
4  orange  medium      4
5  banana     low      5
6   apple    high      6
7  orange  medium      7
8  banana     low      8
fruit weight price
0 True True True
1 True True True
2 False False True
3 True False True
4 True False True
5 False False True
6 False False False
7 False False False
8 False False False
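
Worth noting: DataFrame.isin with another DataFrame compares element-wise on aligned labels rather than testing whole-row membership, so the result above depends on how the rows of df2 happen to line up with df1. A membership-based sketch using a left merge with an indicator column:

merged = df1.merge(df2, how='left', indicator=True)
only_in_df1 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_df1)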

How to get the positions where the values of two columns match

df = pd.DataFrame({'fruit1': np.random.choice(['apple', 'orange', 'banana'], 10),
                    'fruit2': np.random.choice(['apple', 'orange', 'banana'], 10)})

# Solution
np.where(df.fruit1 == df.fruit2)
(array([0, 1, 2, 4, 6, 7], dtype=int64),)

How to shift a time series forward or backward

Create new columns that are lagged or lead versions of existing columns.

df = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns = list('abcd'))

# Solution
df['a_lag1'] = df['a'].shift(1)
df['b_lead1'] = df['b'].shift(-1)
print(df)
    a   b   c   d  a_lag1  b_lead1
0  53  17  45  76     NaN     44.0
1  72  44  52  42    53.0     54.0
2  55  54  87  69    72.0     42.0
3  31  42  75  35    55.0     53.0
4  79  53  27  14    31.0      NaN

Get value counts over an entire DataFrame

df = pd.DataFrame(np.random.randint(1, 10, 20).reshape(-1, 4), columns = list('abcd'))
# Solution

pd.value_counts(df.values.ravel())
7    5
9    4
1    3
8    2
6    2
5    2
3    1
2    1
dtype: int64

Split a string column

df = pd.DataFrame(["STD, City    State",
                    "33, Kolkata    West Bengal",
                    "44, Chennai    Tamil Nadu",
                    "40, Hyderabad    Telengana",
                    "80, Bangalore    Karnataka"], columns=['row'])

# Solution
df.row.str.split(',|\t', expand=True)
     0                         1
0  STD            City    State
1   33        Kolkata    West Bengal
2   44        Chennai    Tamil Nadu
3   40      Hyderabad    Telengana
4   80      Bangalore    Karnataka
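
A possible follow-up (my sketch, not in the original post) is to promote the first split row to the column headers:

df_out = df.row.str.split(',|\t', expand=True)
new_header = df_out.iloc[0]     # the row that came from "STD, City    State"
df_out = df_out[1:]             # drop that row from the data
df_out.columns = new_header
print(df_out)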