Python Notes (2): NumPy and pandas

1. NumPy

NumPy slicing

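data = data_original.iloc[:, 1:2]  # the result is a DataFrame (a slice keeps two dimensions)

data = data_original.iloc[:, 2]  # the result is a Series (a single position drops to one dimension)

data_25 = data[0]

arr[1:2, 1:3]  # NumPy slice: row 1, columns 1 and 2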

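np.r_ and np.c_

np.r_ joins two matrices row-wise, i.e. it stacks one on top of the other. The number of columns must match. This is similar to concat() in pandas.

np.c_ joins two matrices column-wise, i.e. it places them side by side. The number of rows must match. This is roughly what concat() with axis=1 does in pandas (merge() is the key-based relative).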
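The difference between the two helpers is easiest to see from the resulting shapes. The following is a minimal sketch; the arrays a and b are made-up illustration data, not part of the original notes.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])     # shape (2, 3)
b = np.array([[7, 8, 9], [10, 11, 12]])  # shape (2, 3)

# np.r_ stacks along rows (top to bottom): column counts must match.
stacked_rows = np.r_[a, b]
# np.c_ stacks along columns (left to right): row counts must match.
stacked_cols = np.c_[a, b]

print(stacked_rows.shape)  # (4, 3)
print(stacked_cols.shape)  # (2, 6)
```

For 2-D inputs these give the same results as np.vstack((a, b)) and np.hstack((a, b)).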

NumPy evenly spaced values (np.linspace)

x = np.linspace(start, stop, num)  (num defaults to 50; stop is included by default)
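A short sketch of the default and an explicit num; the endpoints 0.0 and 1.0 are arbitrary values chosen for illustration.

```python
import numpy as np

# With num omitted, linspace returns 50 evenly spaced points.
x_default = np.linspace(0.0, 1.0)
# With num=5 the spacing is (stop - start) / (num - 1) = 0.25.
x_five = np.linspace(0.0, 1.0, 5)

print(len(x_default))  # 50
print(x_five)          # [0.   0.25 0.5  0.75 1.  ]
```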

NumPy's ones_like function / np.zeros

Returns an array filled with ones that has the same shape and dtype as the input.

>>> x = np.arange(6)
>>> x = x.reshape((2, 3))
>>> x
array([[0, 1, 2],
       [3, 4, 5]])
>>> np.ones_like(x)
array([[1, 1, 1],
       [1, 1, 1]])
>>> y = np.arange(3, dtype=float)  # np.float was removed in NumPy 1.24; use the built-in float
>>> y
array([0., 1., 2.])
>>> np.ones_like(y)
array([1., 1., 1.])

Likewise, zeros_like returns an array filled with zeros that has the same shape and dtype as the input array.

range() vs. np.arange()

>>> range(1, 5)
range(1, 5)
>>> tuple(range(1, 5))
(1, 2, 3, 4)
>>> list(range(1, 5))
[1, 2, 3, 4]
>>> r = range(1, 5)
>>> type(r)
<class 'range'>
>>> for i in range(1, 5):
...     print(i)
...
1
2
3
4
>>> np.arange(1, 5)
array([1, 2, 3, 4])
>>> range(1, 5, .1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'float' object cannot be interpreted as an integer
>>> np.arange(1, 5, .5)
array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
>>> range(1, 5, 2)
range(1, 5, 2)
>>> for i in range(1, 5, 2):
...     print(i)
...
1
3
>>> for i in np.arange(1, 5):
...     print(i)
...
1
2
3
4

shape and reshape() in Python

Reference post: https://blog.csdn.net/qq_28618765/article/details/78083895
shape is an attribute of a NumPy array, and reshape() is a method.

shape

import numpy as np

a = np.array([1,2,3,4,5,6,7,8])  # 1-D array

print(a.shape[0])  # 8, because there are 8 elements

print(a.shape[1])  # IndexError: tuple index out of range

a = np.array([[1,2,3,4],[5,6,7,8]])  # 2-D array

print(a.shape[0])  # 2: the outer level has 2 elements, each of which is itself an array

print(a.shape[1])  # 4: each inner array has 4 elements

print(a.shape[2])  # IndexError: tuple index out of range

reshape()

Whenever possible, reshape returns a view that shares memory with the original array, so modifying either one also changes the other.
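The shared-memory behaviour can be checked directly. This is a small sketch with made-up data; np.shares_memory is used here only to confirm the view, and .copy() is shown as one way to get an independent array.

```python
import numpy as np

a = np.arange(6)
b = a.reshape(2, 3)   # for a contiguous array this is a view, not a copy

b[0, 0] = 100         # writing through the view...
print(a[0])           # ...changes the original: 100
print(np.shares_memory(a, b))  # True

# An explicit copy breaks the link between the two arrays.
c = a.reshape(2, 3).copy()
c[0, 0] = -1
print(a[0])           # still 100
```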

A detailed walkthrough of numpy.random

https://blog.csdn.net/vicdd/article/details/52667709

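NumPy shape usage

Create a 4×2 matrix c: c.shape[0] is the length of the first dimension (the rows) and c.shape[1] is the length of the second dimension (the columns).

>>> c = np.array([[1,1],[1,2],[1,3],[1,4]])
>>> c.shape
(4, 2)
>>> c.shape[0]
4
>>> c.shape[1]
2

For a 1-D array, shape gives the number of elements. For a 2-D array, it gives the number of rows and columns. For a 3-D array, a.shape[0] is the number of blocks, and a.shape[1] and a.shape[2] are the number of rows and columns of each block (each block is 2-D).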
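The 1-D, 2-D and 3-D cases described above can be compared side by side. A minimal sketch; the array names a1, a2 and a3 are illustrative.

```python
import numpy as np

a1 = np.arange(8)                    # 1-D: 8 elements
a2 = np.arange(8).reshape(2, 4)      # 2-D: 2 rows, 4 columns
a3 = np.arange(24).reshape(2, 3, 4)  # 3-D: 2 blocks, each 3 rows by 4 columns

print(a1.shape)     # (8,)
print(a2.shape)     # (2, 4)
print(a3.shape)     # (2, 3, 4)
print(a3[0].shape)  # (3, 4), one 2-D block
```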

numpy.argmin

Returns the position (index) of the minimum value in the array.
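For a multi-dimensional array, argmin without an axis returns an index into the flattened array. The sketch below uses made-up data and np.unravel_index to turn that flat index back into a (row, column) pair.

```python
import numpy as np

a = np.array([[4, 2, 9],
              [1, 7, 3]])

flat_index = np.argmin(a)                        # index into the flattened array
row_col = np.unravel_index(flat_index, a.shape)  # convert to (row, column)
per_column = np.argmin(a, axis=0)                # row index of the minimum in each column
per_row = np.argmin(a, axis=1)                   # column index of the minimum in each row

print(flat_index)   # 3
print(per_column)   # [1 0 1]
print(per_row)      # [1 0]
```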

hstack(tup): joining arrays horizontally

https://blog.csdn.net/garfielder007/article/details/51378296

>>> a = np.array((1,2,3))
>>> b = np.array((2,3,4))
>>> np.hstack((a,b))
array([1, 2, 3, 2, 3, 4])
>>> a = np.array([[1],[2],[3]])
>>> b = np.array([[2],[3],[4]])
>>> np.hstack((a,b))
array([[1, 2],
       [2, 3],
       [3, 4]])

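numpy.diff

numpy.diff(a, n=1, axis=-1)

Computes the n-th discrete difference along the given axis; n is the number of times the difference is applied.

The first difference along the given axis is out[i] = a[i+1] - a[i]; higher-order differences are computed by applying diff recursively.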
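A small sketch of the recursion, using a made-up array: the second difference is the first difference of the first difference.

```python
import numpy as np

a = np.array([1, 2, 4, 7, 0])

first = np.diff(a)        # out[i] = a[i+1] - a[i]
second = np.diff(a, n=2)  # same as np.diff(np.diff(a))

print(first)   # [ 1  2  3 -7]
print(second)  # [  1   1 -10]
```

Each application shortens the result by one element along the chosen axis.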

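2. pandas

DataFrame slicing with iloc

iloc selects by integer position; loc selects by label.

data = data.to_numpy()  # convert the table to a NumPy array (as_matrix() was removed in pandas 1.0)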
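The following sketch contrasts position-based and label-based selection on a made-up DataFrame. It also shows the DataFrame-versus-Series distinction from the slicing notes in the NumPy part. The frame and its labels are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["r1", "r2", "r3"],
)

# iloc: integer positions, end of the slice is excluded.
by_position = df.iloc[0:2, 1:3]
# loc: labels, end of the slice is included.
by_label = df.loc["r1":"r2", "B":"C"]

one_col_frame = df.iloc[:, 1:2]  # slice -> DataFrame
one_col_series = df.iloc[:, 2]   # single position -> Series

values = df.to_numpy()           # plain NumPy array of the data
print(by_position.equals(by_label))  # True
```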

https://jingyan.baidu.com/article/a17d52853f879e8099c8f240.html

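pandas descriptive statistics functions

These are mainly available as methods of the pandas DataFrame and Series objects.

sum(): total of the samples (computed per column by default)

mean(): arithmetic mean of the samples

var(): variance of the samples

std(): standard deviation of the samples

corr(): correlation matrix of the samples (Pearson by default; pass method='spearman' for Spearman)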
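A minimal sketch of these methods on made-up data. The column z is x cubed, which is monotonic but not linear, so the Pearson and Spearman coefficients differ.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "z": [1, 8, 27, 64]})

totals = df.sum()     # per-column sums
means = df.mean()     # per-column means
variances = df.var()  # sample variance (ddof=1 by default)
stds = df.std()       # sample standard deviation

pearson = df.corr()                    # default method is Pearson
spearman = df.corr(method="spearman")  # rank-based correlation

print(totals["x"], means["x"])  # 10 2.5
print(pearson.loc["x", "z"] < 1.0)
print(spearman.loc["x", "z"])
```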

Common pandas functions

http://www.cnblogs.com/hotsnow/p/9480849.html

Parameters of pd.read_excel in pandas

https://blog.csdn.net/brucewong0516/article/details/79096633

pd.read_excel(io, sheetname=0, header=0, skiprows=None, index_col=None, names=None,
              parse_cols=None, date_parser=None, na_values=None, thousands=None,
              convert_float=True, has_index_names=None, converters=None, dtype=None,
              true_values=None, false_values=None, engine=None, squeeze=False, **kwds)

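io: path to the Excel file.
sheetname: defaults to 0 (the first sheet). Pass sheetname=[0, 1] to read several sheets, or sheetname=None to read all sheets. Note: an int or a string returns a DataFrame, while None or a list returns a dict of DataFrames. Newer pandas versions name this parameter sheet_name.
header: the row to use as column names, default 0 (the first row); the data are the rows below it. If the data have no header row, set header=None.
skiprows: rows to skip at the start of the sheet.
skip_footer: number of rows to skip at the end of the sheet.
index_col: the column to use as the index; a string such as u'string' also works.
names: the column names to use, passed as a list.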

Resetting the index

data = data.reset_index(drop=True)

Using a DataFrame column as the index

df.set_index(["Column"], inplace=True)
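A short sketch of both operations on made-up data. It uses the non-inplace form of set_index, which returns a new DataFrame; the column name "Column" follows the line above and the values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Column": ["x", "y", "z"], "value": [10, 20, 30]})

# Filtering leaves gaps in the index (here: 1, 2).
filtered = df[df["value"] > 10]
# drop=True discards the old index instead of keeping it as a new column.
renumbered = filtered.reset_index(drop=True)

# Promote a column to the index; the column leaves the regular columns.
indexed = df.set_index("Column")

print(list(filtered.index))      # [1, 2]
print(list(renumbered.index))    # [0, 1]
print(indexed.loc["y", "value"])  # 20
```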

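pandas.read_csv parameters in detail

from datetime import datetime
from pandas import read_csv

def parse(x):
    # x is the combined "year month day hour" string produced by parse_dates
    return datetime.strptime(x, '%Y %m %d %H')

dataset = read_csv('C:/Users/kyle/Desktop/raw.csv', parse_dates = [['year', 'month', 'day', 'hour']], index_col=0, date_parser=parse)

The first step is to combine the separate date and time fields into a single datetime so that it can be used as the index in pandas.

parse_dates merges the year, month, day and hour columns into one date column, and index_col=0 uses the first column as the index.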
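Recent pandas releases have deprecated date_parser and the nested-list form of parse_dates, so the same result can also be obtained by reading the columns as they are and combining them afterwards. The sketch below is one such alternative, not the method used in the notes above. The in-memory CSV text and the column names No and value are made-up assumptions standing in for raw.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of raw.csv.
csv_text = """No,year,month,day,hour,value
1,2010,1,1,0,10.5
2,2010,1,1,1,11.0
3,2010,1,2,0,9.75
"""

raw = pd.read_csv(io.StringIO(csv_text))

# pd.to_datetime can assemble a datetime from columns named year/month/day/hour.
raw["date"] = pd.to_datetime(raw[["year", "month", "day", "hour"]])

# Drop the now-redundant parts and use the combined datetime as the index.
dataset = raw.drop(columns=["No", "year", "month", "day", "hour"]).set_index("date")
print(dataset)
```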

pandas.DataFrame.dropna

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Removes missing values.
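A minimal sketch of how the main parameters change what is dropped; the DataFrame is made-up illustration data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7, 8, 9],
})

any_missing = df.dropna()                # drop rows with at least one NaN
all_missing = df.dropna(how="all")       # drop rows only if every value is NaN
complete_cols = df.dropna(axis=1)        # drop columns that contain a NaN
at_least_two = df.dropna(thresh=2)       # keep rows with >= 2 non-NaN values
a_present = df.dropna(subset=["a"])      # only look at column "a"

print(list(any_missing.index))      # [2]
print(list(complete_cols.columns))  # ['c']
```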

shift and diff in pandas

(1) When the DataFrame's row index is a date index and freq is given, the row index is shifted and the column data stay unchanged

In [2]: import pandas as pd
   ...: import numpy as np
   ...: df = pd.DataFrame(np.arange(24).reshape(6,4),index=pd.date_range(start='20170101',periods=6),columns=['A','B','C','D'])
   ...: df
   ...:
Out[2]:
             A   B   C   D
2017-01-01   0   1   2   3
2017-01-02   4   5   6   7
2017-01-03   8   9  10  11
2017-01-04  12  13  14  15
2017-01-05  16  17  18  19
2017-01-06  20  21  22  23

In [3]: df.shift(2,axis=0,freq='2D')
Out[3]:
             A   B   C   D
2017-01-05   0   1   2   3
2017-01-06   4   5   6   7
2017-01-07   8   9  10  11
2017-01-08  12  13  14  15
2017-01-09  16  17  18  19
2017-01-10  20  21  22  23

In [4]: df.shift(2,axis=1,freq='2D')
Out[4]:
             A   B   C   D
2017-01-05   0   1   2   3
2017-01-06   4   5   6   7
2017-01-07   8   9  10  11
2017-01-08  12  13  14  15
2017-01-09  16  17  18  19
2017-01-10  20  21  22  23

In [5]: df.shift(2,freq='2D')
Out[5]:
             A   B   C   D
2017-01-05   0   1   2   3
2017-01-06   4   5   6   7
2017-01-07   8   9  10  11
2017-01-08  12  13  14  15
2017-01-09  16  17  18  19
2017-01-10  20  21  22  23

Conclusion: with a time index and freq given, shift moves the time index (here by 2 × 2D = 4 days) and leaves the data as they are; in these runs the axis setting made no difference.

(2) When the DataFrame's row index is not a time series, the row index stays unchanged and the data are shifted

In [6]: import pandas as pd
   ...: import numpy as np
   ...: df = pd.DataFrame(np.arange(24).reshape(6,4),index=['r1','r2','r3','r4','r5','r6'],columns=['A','B','C','D'])
   ...: df
   ...:
Out[6]:
     A   B   C   D
r1   0   1   2   3
r2   4   5   6   7
r3   8   9  10  11
r4  12  13  14  15
r5  16  17  18  19
r6  20  21  22  23

In [7]: df.shift(periods=2,axis=0)
Out[7]:
       A     B     C     D
r1   NaN   NaN   NaN   NaN
r2   NaN   NaN   NaN   NaN
r3   0.0   1.0   2.0   3.0
r4   4.0   5.0   6.0   7.0
r5   8.0   9.0  10.0  11.0
r6  12.0  13.0  14.0  15.0

In [8]: df.shift(periods=-2,axis=0)
Out[8]:
       A     B     C     D
r1   8.0   9.0  10.0  11.0
r2  12.0  13.0  14.0  15.0
r3  16.0  17.0  18.0  19.0
r4  20.0  21.0  22.0  23.0
r5   NaN   NaN   NaN   NaN
r6   NaN   NaN   NaN   NaN

In [9]: df.shift(periods=2,axis=1)
Out[9]:
      A    B     C     D
r1  NaN  NaN   0.0   1.0
r2  NaN  NaN   4.0   5.0
r3  NaN  NaN   8.0   9.0
r4  NaN  NaN  12.0  13.0
r5  NaN  NaN  16.0  17.0
r6  NaN  NaN  20.0  21.0

In [10]: df.shift(periods=-2,axis=1)
Out[10]:
       A     B    C    D
r1   2.0   3.0  NaN  NaN
r2   6.0   7.0  NaN  NaN
r3  10.0  11.0  NaN  NaN
r4  14.0  15.0  NaN  NaN
r5  18.0  19.0  NaN  NaN
r6  22.0  23.0  NaN  NaN

Now look at how diff relates to shift: df.diff(n) equals df - df.shift(n), so df - df.diff(n) recovers df.shift(n).

In [13]: df.diff(periods=2,axis=0)
Out[13]:
      A    B    C    D
r1  NaN  NaN  NaN  NaN
r2  NaN  NaN  NaN  NaN
r3  8.0  8.0  8.0  8.0
r4  8.0  8.0  8.0  8.0
r5  8.0  8.0  8.0  8.0
r6  8.0  8.0  8.0  8.0

In [14]: df -df.diff(periods=2,axis=0)
Out[14]:
       A     B     C     D
r1   NaN   NaN   NaN   NaN
r2   NaN   NaN   NaN   NaN
r3   0.0   1.0   2.0   3.0
r4   4.0   5.0   6.0   7.0
r5   8.0   9.0  10.0  11.0
r6  12.0  13.0  14.0  15.0

In [15]: df.shift(periods=2,axis=0)
Out[15]:
       A     B     C     D
r1   NaN   NaN   NaN   NaN
r2   NaN   NaN   NaN   NaN
r3   0.0   1.0   2.0   3.0
r4   4.0   5.0   6.0   7.0
r5   8.0   9.0  10.0  11.0
r6  12.0  13.0  14.0  15.0
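The relationship in the three outputs above can be checked programmatically. This sketch rebuilds the same 6×4 frame and compares diff against an explicit subtraction of the shifted frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
    columns=["A", "B", "C", "D"],
)

via_diff = df.diff(periods=2, axis=0)
via_shift = df - df.shift(periods=2, axis=0)  # element minus the element 2 rows earlier

# DataFrame.equals treats NaN values in the same position as equal.
print(via_diff.equals(via_shift))  # True
```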

pandas.DataFrame.drop

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Removes the specified labels from the rows or the columns.
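A brief sketch of the two equivalent ways to name what is dropped (labels plus axis, or the index/columns keywords). The DataFrame is made-up illustration data.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["r1", "r2", "r3"],
)

no_row = df.drop(index="r2")             # drop a row by label
no_cols = df.drop(columns=["B", "C"])    # drop columns by label
same_cols = df.drop(["B", "C"], axis=1)  # equivalent labels + axis form

# errors="ignore" silently skips labels that do not exist.
unchanged = df.drop(columns=["missing"], errors="ignore")

print(list(no_row.index))     # ['r1', 'r3']
print(list(no_cols.columns))  # ['A']
```

None of these calls modify df itself, because inplace defaults to False.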

pandas groupby usage explained

groupby collects the rows that share the same value in a column into groups and then applies an operation to each group.

In [4]: df = pd.DataFrame({'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'] , 'B': [2, 8, 1, 4, 3, 2, 5, 9], 'C': [102, 98, 107, 104, 115, 87, 92, 123]})

In [5]: df
Out[5]:
   A  B    C
0  a  2  102
1  b  8   98
2  a  1  107
3  c  4  104
4  a  3  115
5  c  2   87
6  b  5   92
7  c  9  123

In [6]: df.groupby('A').mean()
Out[6]:
     B           C
A
a  2.0  108.000000
b  6.5   95.000000
c  5.0  104.666667

In [7]: df.groupby(['A','B']).mean()
Out[7]:
       C
A B
a 1  107
  2  102
  3  115
b 5   92
  8   98
c 2   87
  4  104
  9  123

DataFrame type conversion: astype()

df['col2'] = df['col2'].astype('int')

df['col2'] = df['col2'].astype('float64')

data_2=data_1.values

data_2=np.array(data_2,dtype=float) # convert int to float (np.float was removed in NumPy 1.24)
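A small end-to-end sketch of both conversions: per-column casting with astype and casting the whole table after converting it to a NumPy array. The column contents are made-up, and int64 is spelled out explicitly to avoid platform-dependent integer widths.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["1", "2", "3"], "col2": [1.0, 2.0, 3.0]})

df["col1"] = df["col1"].astype("int64")  # strings of digits -> integers
df["col2"] = df["col2"].astype("int64")  # floats -> integers (truncates any fraction)

# Convert the whole table to a float NumPy array.
arr = df.to_numpy().astype(float)

print(df.dtypes)
print(arr.dtype)  # float64
```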

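DataFrame.rolling

When building models we often need to organize data that has a time relationship. For example, we may want the sales (or output, speed, consumption, etc.) over the 30 minutes before a given moment. Hand-written approaches are complicated and resource-hungry, whereas the rolling method provided by pandas is simple to use and fast.

DataFrame.rolling(window, min_periods=None, freq=None, center=False, win_type=None, on=None, axis=0, closed=None)

window: the size of the moving window, in one of two forms (int or offset). With an int, the value is the number of observations used to compute each statistic, i.e. how many rows to look back. With an offset, it is the time span of the window. See the pandas documentation on offsets for the available values.

min_periods: the minimum number of observations in the window that must have a value. For an int window it defaults to the window size; for an offset window it defaults to 1.

freq: deprecated since version 0.18.

center: whether to place the label at the center of the window, default False. Only usable when window is an int.

win_type: the window type, default None; usually left unspecified. See the pandas documentation for the other supported window types.

on: for a DataFrame, if the index is not the column to roll over, use on to name the column to use.

closed: which ends of the interval are closed. In the pandas version these notes were written against, it was no longer supported for int windows. For offset windows the default is 'right', i.e. left-open and right-closed; 'left', 'both' and so on can be given as needed.

axis: the direction (axis), usually 0.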
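The difference between an int window and an offset window, and the different min_periods defaults, can be seen on a small daily series. This is a minimal sketch with made-up data; the series name s is illustrative.

```python
import pandas as pd

s = pd.Series(
    [1, 2, 3, 4, 5],
    index=pd.date_range("2018-01-01", periods=5, freq="D"),
)

# int window: 3 observations; min_periods defaults to 3, so the first two are NaN.
by_count = s.rolling(3).sum()
# Lowering min_periods lets partial windows produce a value.
by_count_partial = s.rolling(3, min_periods=1).sum()
# offset window: everything in the last 2 days, interval (t - 2D, t]; min_periods defaults to 1.
by_time = s.rolling("2D").sum()

print(by_count.tolist())          # [nan, nan, 6.0, 9.0, 12.0]
print(by_count_partial.tolist())  # [1.0, 3.0, 6.0, 9.0, 12.0]
print(by_time.tolist())           # [1.0, 3.0, 5.0, 7.0, 9.0]
```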

https://blog.csdn.net/wj1066/article/details/78853717

pandas resample()

Appendix: common time frequency aliases
A  year
M  month
W  week
D  day
H  hour
T  minute
S  second

ts_5d_leftclosed = ts.resample('5D', closed='right').sum()

print(ts_5d_leftclosed)

First create a time series that starts on 2018/01/01, spans 12 days, and holds the values 1 to 12, one per day:
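The code that builds the series is not included in these notes. A minimal sketch that matches the description, assuming pd.date_range with daily frequency and the name ts used in the surrounding lines:

```python
import pandas as pd

# 12 consecutive days starting 2018-01-01, with values 1..12.
ts = pd.Series(
    range(1, 13),
    index=pd.date_range("2018-01-01", periods=12, freq="D"),
)
print(ts)
```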

Use the resample method to downsample with a frequency of 5 days, leaving the two parameters closed and label at their defaults:

ts_5d = ts.resample('5D').sum()
print(ts_5d)

#### Outputs ####
2018-01-01    15
2018-01-06    40
2018-01-11    23
Freq: 5D, dtype: int32

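closed = 'left': left-closed, right-open

The three 5-day bins above correspond to the following three left-closed, right-open intervals:

Interval 1: [1, 6)
Interval 2: [6, 11)
Interval 3: [11, 16)  The data in the example only run to the 12th, but the interval is padded out to a full 5 days

closed = 'right': left-open, right-closed

With closed='right', the same data fall into the following four left-open, right-closed intervals. Note that the first 5-day bin would run from the 1st to the 6th, but because the interval is open on the left, the 1st cannot fall into it, so an extra interval has to be added in front:

Interval 1: (27, 1]  (the 27th of the previous month)
Interval 2: (1, 6]
Interval 3: (6, 11]
Interval 4: (11, 16]

closed: determines how the intervals are formed. left produces left-closed, right-open intervals; right produces left-open, right-closed intervals. In general, closed='right' yields one more interval than closed='left'. Once the intervals are set, the aggregation is carried out within each interval.

label: once the intervals are set, label decides which date is used as the index of each interval. With label='left' the date at the left edge of the interval is used; with label='right' the date at the right edge is used.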
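The effect of closed and label described above can be reproduced on the 12-day series. This is a sketch under the same assumptions as the description (a daily series starting 2018-01-01 with values 1 to 12); the variable names are illustrative.

```python
import pandas as pd

ts = pd.Series(
    range(1, 13),
    index=pd.date_range("2018-01-01", periods=12, freq="D"),
)

# Default for '5D': closed='left', label='left' -> three bins.
left_closed = ts.resample("5D").sum()
# closed='right' -> an extra bin is added in front to hold the 1st.
right_closed = ts.resample("5D", closed="right").sum()
# label='right' names each bin by its right edge instead of its left edge.
right_labeled = ts.resample("5D", closed="right", label="right").sum()

print(left_closed)
print(right_closed)
print(right_labeled)
```

The aggregated values are the same for right_closed and right_labeled; only the index dates differ.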
