Hands-On with the dplyr Package in R

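1. Introduction to dplyr

dplyr is a data analysis package for R. Much like pandas in Python, it makes it convenient to process and analyze data stored in data frames. I was puzzled by the odd name at first; one explanation I found is:

  - d stands for data frame
  - plyr is a play on "plier", the English word for the gripping tool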

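Like most R packages, dplyr follows a functional style, which differs considerably from the object-oriented style common in Python. One advantage is that beginners tend to pick up this way of thinking easily: it works like an assembly line, where each function is a workshop and several workshops together complete one production (data analysis) task.

To connect these workshops, dplyr offers the pipe operator %>% (re-exported from the magrittr package). The data on the left of the operator is the input, and the function on the right is the next processing step downstream.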
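To make the left-to-right flow concrete, here is a minimal sketch comparing a nested call with its piped equivalent. The vector x and the functions chosen are illustrative assumptions, not part of the original data.

```r
library(dplyr)  # loading dplyr also makes the %>% pipe available

x <- c(1, 4, 9)  # toy input, assumed for illustration

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: read left to right, one processing step per stage
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)  # TRUE
```

Both forms compute the same value; the pipe only changes how the steps are written down.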


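2. Installing and loading dplyr

The p_load function from the pacman package combines two steps:

  1. install.packages("dplyr") (run only if the package is not yet installed)

  2. library(dplyr)

so a single call is simpler to use. Note that pacman itself must already be installed, for example with install.packages("pacman").

pacman::p_load("dplyr")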
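For readers who prefer not to depend on pacman, the two steps can be sketched with base R alone. This is a rough equivalent under the assumption that a CRAN mirror is configured; it is not pacman's actual implementation.

```r
# Rough base-R equivalent of pacman::p_load("dplyr") (a sketch, not pacman's code)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")  # runs only when dplyr is missing
}
library(dplyr)
```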

3. Reading the data

# Set the working directory
setwd("/Users/thunderhit/Desktop/dplyr_learn")

# Import the CSV data and convert it to a tibble
aapl <- read.csv('aapl.csv',
                 header = TRUE,
                 sep = ',',
                 stringsAsFactors = FALSE) %>% as_tibble()
head(aapl)
# A tibble: 6 × 6
Date         Open    High     Low   Close     Volume
<chr>       <dbl>   <dbl>   <dbl>   <dbl>      <int>
7-Jul-17   142.90  144.75  142.90  144.18   19201712
6-Jul-17   143.02  143.50  142.41  142.73   24128782
5-Jul-17   143.69  144.79  142.72  144.09   21569557
3-Jul-17   144.88  145.30  143.10  143.50   14277848
30-Jun-17  144.45  144.96  143.78  144.02   23024107
29-Jun-17  144.71  145.13  142.28  143.68   31499368
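If aapl.csv is not at hand, the six rows printed above can be rebuilt by hand to follow along. The name aapl_sample is an assumption introduced here; the values are copied from the output above, and the full file has more rows than this.

```r
library(dplyr)

# Hand-built copy of the six rows shown above (aapl_sample is a hypothetical name)
aapl_sample <- tibble::tibble(
  Date   = c("7-Jul-17", "6-Jul-17", "5-Jul-17",
             "3-Jul-17", "30-Jun-17", "29-Jun-17"),
  Open   = c(142.90, 143.02, 143.69, 144.88, 144.45, 144.71),
  High   = c(144.75, 143.50, 144.79, 145.30, 144.96, 145.13),
  Low    = c(142.90, 142.41, 142.72, 143.10, 143.78, 142.28),
  Close  = c(144.18, 142.73, 144.09, 143.50, 144.02, 143.68),
  Volume = c(19201712L, 24128782L, 21569557L,
             14277848L, 23024107L, 31499368L)
)

dim(aapl_sample)  # 6 rows, 6 columns
```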

Check the data type

class(aapl)
[1] "tbl_df"     "tbl"        "data.frame"

Check the column names

colnames(aapl)
[1] "Date"   "Open"   "High"   "Low"    "Close"  "Volume"

Check the number of rows and columns

dim(aapl)
[1] 251   6


4. Common dplyr functions

4.1 Arrange

Sort the aapl data in descending order by the Volume column

arrange(aapl, -Volume)
# A tibble: 6 × 6
Date         Open    High     Low   Close     Volume
<chr>       <dbl>   <dbl>   <dbl>   <dbl>      <int>
14-Sep-16  108.73  113.03  108.60  111.77  112340318
1-Feb-17   127.03  130.49  127.01  128.75  111985040
27-Jul-16  104.26  104.35  102.75  102.95   92344820
15-Sep-16  113.86  115.73  113.49  115.57   90613177
16-Sep-16  115.12  116.13  114.04  114.92   79886911
12-Jun-17  145.74  146.09  142.51  145.42   72307330

We can also use the pipe %>%. Both forms give the same result. After using it for a while you may find the piped form more readable, so the rest of this article uses %>%.

aapl %>% arrange(-Volume)
# A tibble: 6 × 6
Date         Open    High     Low   Close     Volume
<chr>       <dbl>   <dbl>   <dbl>   <dbl>      <int>
14-Sep-16  108.73  113.03  108.60  111.77  112340318
1-Feb-17   127.03  130.49  127.01  128.75  111985040
27-Jul-16  104.26  104.35  102.75  102.95   92344820
15-Sep-16  113.86  115.73  113.49  115.57   90613177
16-Sep-16  115.12  116.13  114.04  114.92   79886911
12-Jun-17  145.74  146.09  142.51  145.42   72307330
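The minus sign works because Volume is numeric. dplyr also provides desc(), which expresses the same intent and additionally works on character columns. The sketch below uses three rows copied from the output above; the tibble name prices is an assumption.

```r
library(dplyr)

# Three rows from the output above, deliberately out of order
prices <- tibble::tibble(
  Date   = c("27-Jul-16", "14-Sep-16", "1-Feb-17"),
  Volume = c(92344820L, 112340318L, 111985040L)
)

by_minus <- prices %>% arrange(-Volume)       # negate the numeric column
by_desc  <- prices %>% arrange(desc(Volume))  # explicit descending helper

identical(by_minus, by_desc)  # TRUE

# desc() also accepts character columns, where unary minus would raise an error
prices %>% arrange(desc(Date))
```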

4.2 Select

Select the three columns Date, Close, and Volume

aapl %>% select(Date, Close, Volume)
# A tibble: 6 × 3
Date        Close     Volume
<chr>       <dbl>      <int>
7-Jul-17   144.18   19201712
6-Jul-17   142.73   24128782
5-Jul-17   144.09   21569557
3-Jul-17   143.50   14277848
30-Jun-17  144.02   23024107
29-Jun-17  143.68   31499368

Selecting only Date, Close, and Volume can also be expressed the other way round: exclude Open, High, and Low, and keep the remaining columns.

aapl %>% select(-c("Open", "High", "Low"))
# A tibble: 6 × 3
Date        Close     Volume
<chr>       <dbl>      <int>
7-Jul-17   144.18   19201712
6-Jul-17   142.73   24128782
5-Jul-17   144.09   21569557
3-Jul-17   143.50   14277848
30-Jun-17  144.02   23024107
29-Jun-17  143.68   31499368
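A few other spellings give the same three columns. The sketch below checks this on two rows copied from the output above; the tibble name df is an assumption.

```r
library(dplyr)

# Two rows from the aapl output, enough to compare the column selections
df <- tibble::tibble(
  Date   = c("7-Jul-17", "6-Jul-17"),
  Open   = c(142.90, 143.02),
  High   = c(144.75, 143.50),
  Low    = c(142.90, 142.41),
  Close  = c(144.18, 142.73),
  Volume = c(19201712L, 24128782L)
)

keep_named  <- df %>% select(Date, Close, Volume)   # name what to keep
drop_each   <- df %>% select(-Open, -High, -Low)    # drop columns one by one
drop_range  <- df %>% select(-(Open:Low))           # drop a contiguous range

identical(keep_named, drop_each)   # TRUE
identical(keep_named, drop_range)  # TRUE
```

Dropping by range is handy only when the unwanted columns sit next to each other, as Open, High, and Low do here.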

4.3 Filter

Select rows that satisfy a condition

# Select the trading days on which the AAPL closing price was at or above 150 US dollars
aapl %>% filter(Close >= 150)
# A tibble: 6 × 6
Date         Open    High     Low   Close     Volume
<chr>       <dbl>   <dbl>   <dbl>   <dbl>      <int>
8-Jun-17   155.25  155.54  154.40  154.99   21250798
7-Jun-17   155.02  155.98  154.48  155.37   21069647
6-Jun-17   153.90  155.81  153.78  154.45   26624926
5-Jun-17   154.34  154.45  153.46  153.93   25331662
2-Jun-17   153.58  155.45  152.89  155.45   27770715
1-Jun-17   153.17  153.33  152.22  153.18   16404088

Select the trading days on which the AAPL closing price was at or above 150 US dollars and the close was higher than the open

aapl %>% filter((Close >= 150) & (Close > Open))
# A tibble: 11 × 6
Date         Open    High     Low   Close     Volume
<chr>       <dbl>   <dbl>   <dbl>   <dbl>      <int>
7-Jun-17   155.02  155.98  154.48  155.37   21069647
6-Jun-17   153.90  155.81  153.78  154.45   26624926
2-Jun-17   153.58  155.45  152.89  155.45   27770715
1-Jun-17   153.17  153.33  152.22  153.18   16404088
30-May-17  153.42  154.43  153.33  153.67   20126851
25-May-17  153.73  154.35  153.03  153.87   19235598
18-May-17  151.27  153.34  151.13  152.54   33568215
12-May-17  154.70  156.42  154.67  156.10   32527017
11-May-17  152.45  154.07  152.31  153.95   27255058
9-May-17   153.87  154.88  153.45  153.99   39130363
8-May-17   149.03  153.70  149.03  153.01   48752413
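filter() also accepts several conditions separated by commas, which are combined with AND. The sketch below compares the two spellings on three rows copied from the outputs in this article; the tibble name df is an assumption.

```r
library(dplyr)

# Three rows chosen so that each condition excludes a different one
df <- tibble::tibble(
  Date  = c("8-Jun-17", "7-Jun-17", "7-Jul-17"),
  Open  = c(155.25, 155.02, 142.90),
  Close = c(154.99, 155.37, 144.18)
)

with_amp   <- df %>% filter(Close >= 150 & Close > Open)  # explicit &
with_comma <- df %>% filter(Close >= 150, Close > Open)   # commas mean AND

identical(with_amp, with_comma)  # TRUE
with_comma$Date                  # "7-Jun-17"
```

8-Jun-17 fails the second condition (it closed below its open) and 7-Jul-17 fails the first (it closed below 150), so only 7-Jun-17 remains.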

4.4 Mutate

Compute new columns from existing ones.

# Define maxDif as the daily high (High) minus the daily low (Low), then take its natural log
aapl %>% mutate(maxDif = High - Low,
                log_maxDif = log(maxDif))
# A tibble: 6 × 8
Date         Open    High     Low   Close     Volume  maxDif  log_maxDif
<chr>       <dbl>   <dbl>   <dbl>   <dbl>      <int>   <dbl>       <dbl>
7-Jul-17   142.90  144.75  142.90  144.18   19201712    1.85   0.6151856
6-Jul-17   143.02  143.50  142.41  142.73   24128782    1.09   0.0861777
5-Jul-17   143.69  144.79  142.72  144.09   21569557    2.07   0.7275486
3-Jul-17   144.88  145.30  143.10  143.50   14277848    2.20   0.7884574
30-Jun-17  144.45  144.96  143.78  144.02   23024107    1.18   0.1655144
29-Jun-17  144.71  145.13  142.28  143.68   31499368    2.85   1.0473190
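The call above relies on mutate() evaluating its arguments in order, so log_maxDif can refer to maxDif created one line earlier. A minimal sketch on two rows copied from the output above (the tibble name prices is an assumption):

```r
library(dplyr)

# Two rows from the aapl output above
prices <- tibble::tibble(
  Date = c("7-Jul-17", "6-Jul-17"),
  High = c(144.75, 143.50),
  Low  = c(142.90, 142.41)
)

ranges <- prices %>%
  mutate(maxDif     = High - Low,    # created first
         log_maxDif = log(maxDif))   # may already refer to maxDif

ranges
```

log() is the natural logarithm. On a day where High equals Low, maxDif would be 0 and log_maxDif would be -Inf, which is worth checking for before further analysis.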

Get the position (row number) of each record

aapl %>% mutate(n = row_number())
# A tibble: 6 × 7
Date         Open    High     Low   Close     Volume      n
<chr>       <dbl>   <dbl>   <dbl>   <dbl>      <int>  <int>
7-Jul-17   142.90  144.75  142.90  144.18   19201712      1
6-Jul-17   143.02  143.50  142.41  142.73   24128782      2
5-Jul-17   143.69  144.79  142.72  144.09   21569557      3
3-Jul-17   144.88  145.30  143.10  143.50   14277848      4
30-Jun-17  144.45  144.96  143.78  144.02   23024107      5
29-Jun-17  144.71  145.13  142.28  143.68   31499368      6
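Because row_number() numbers rows in their current order, combining it with arrange() gives a simple ranking. The sketch below ranks three rows from the output above by trading volume; the names prices and volume_rank are assumptions.

```r
library(dplyr)

# Three rows from the aapl output above
prices <- tibble::tibble(
  Date   = c("7-Jul-17", "6-Jul-17", "5-Jul-17"),
  Volume = c(19201712L, 24128782L, 21569557L)
)

ranked <- prices %>%
  arrange(desc(Volume)) %>%           # busiest day first
  mutate(volume_rank = row_number())  # 1, 2, 3 in the sorted order

ranked
```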

4.5 Group_By

Group the data. Here we load a new dataset, weather.

# Import the CSV data and convert it to a tibble
weather <- read.csv('weather.csv',
                    header = TRUE,
                    sep = ',',
                    stringsAsFactors = FALSE) %>% as_tibble()
weather
# A tibble: 6 × 5
Date      city      temperature  windspeed  event
<chr>     <chr>           <int>      <int>  <chr>
1/1/2017  new york           32          6  Rain
1/1/2017  mumbai             90          5  Sunny
1/1/2017  paris              45         20  Sunny
1/2/2017  new york           36          7  Sunny
1/2/2017  mumbai             85         12  Fog
1/2/2017  paris              50         13  Cloudy

Group by city. The rows look unchanged; group_by() only records the grouping for later steps.

weather %>% group_by(city)
# A grouped_df: 6 × 5
Date      city      temperature  windspeed  event
<chr>     <chr>           <int>      <int>  <chr>
1/1/2017  new york           32          6  Rain
1/1/2017  mumbai             90          5  Sunny
1/1/2017  paris              45         20  Sunny
1/2/2017  new york           36          7  Sunny
1/2/2017  mumbai             85         12  Fog
1/2/2017  paris              50         13  Cloudy
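Since the printed rows are the same before and after grouping, it can help to inspect the grouping metadata directly. A minimal sketch using a hand-built copy of the weather rows shown above (the name weather_sample is an assumption):

```r
library(dplyr)

# Hand-built copy of the six weather rows shown above
weather_sample <- tibble::tibble(
  Date        = c("1/1/2017", "1/1/2017", "1/1/2017",
                  "1/2/2017", "1/2/2017", "1/2/2017"),
  city        = c("new york", "mumbai", "paris",
                  "new york", "mumbai", "paris"),
  temperature = c(32L, 90L, 45L, 36L, 85L, 50L),
  windspeed   = c(6L, 5L, 20L, 7L, 12L, 13L),
  event       = c("Rain", "Sunny", "Sunny", "Sunny", "Fog", "Cloudy")
)

grouped <- weather_sample %>% group_by(city)

group_vars(grouped)  # "city"
n_groups(grouped)    # 3
nrow(grouped)        # 6, no rows were added or removed

plain <- grouped %>% ungroup()  # drops the grouping again
```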

To show what grouping does, let's compute the average temperature for each city

weather %>% group_by(city) %>% summarise(mean_temperature = mean(temperature))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 × 2
city      mean_temperature
<chr>                <dbl>
mumbai                87.5
new york              34.0
paris                 47.5

For comparison, without group_by() the same summarise() call returns a single overall mean

weather %>% summarise(mean_temperature = mean(temperature))
# A tibble: 1 × 1
mean_temperature
           <dbl>
        56.33333
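summarise() can compute several statistics per group in one call, and passing .groups = "drop" states the ungrouping explicitly, which also silences the message shown above. The sketch below uses a hand-built copy of the weather rows; the names weather_sample, days, and max_windspeed are assumptions.

```r
library(dplyr)

# Hand-built copy of the relevant weather columns shown earlier
weather_sample <- tibble::tibble(
  city        = c("new york", "mumbai", "paris",
                  "new york", "mumbai", "paris"),
  temperature = c(32L, 90L, 45L, 36L, 85L, 50L),
  windspeed   = c(6L, 5L, 20L, 7L, 12L, 13L)
)

city_stats <- weather_sample %>%
  group_by(city) %>%
  summarise(
    days             = n(),                # number of rows per city
    mean_temperature = mean(temperature),  # same statistic as above
    max_windspeed    = max(windspeed),
    .groups          = "drop"              # return an ungrouped tibble
  )

city_stats
```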