data.table vs dplyr: can one do something well the other can't or does poorly?

Translated from: data.table vs dplyr: can one do something well the other can't or does poorly?

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

  1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
  2. dplyr has more accessible syntax
  3. dplyr abstracts (or will) potential DB interactions
  4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested about here).

Question

What I want to know is:

  1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
  2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored a single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage

  • dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5; also, @beginneR shows a potential work-around using do in the answer to @eddi's question).
  • data.table supports rolling joins (thanks @dholstius) as well as overlap joins
  • data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while using the same base R syntax. See here for some more details and a tiny benchmark.
  • dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note that programmatic use of data.table is definitely possible, it just requires some careful thought, substitution/quoting, etc., at least to my knowledge)
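The rolling-join bullet above can be made concrete with a small sketch. The price/query tables here are invented purely for illustration:

```r
library(data.table)

# hypothetical data: sparse price observations and query timestamps
prices  <- data.table(time = c(1L, 5L, 10L), price = c(100, 105, 110), key = "time")
queries <- data.table(time = c(4L, 11L), key = "time")

# roll = TRUE carries the last observation forward (LOCF):
# the query at time 4 picks up the price from time 1,
# and the query at time 11 picks up the price from time 10.
res <- prices[queries, roll = TRUE]
res$price  # 100, 110
```

There is no single-call dplyr equivalent; you would typically need a full join plus a fill/filter step to express the same thing.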
Benchmarks

Data

This is for the first example I showed in the question section.

 dat <- structure(list(
   id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
   name = c("Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane",
            "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob"),
   year = c(1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L,
            1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 1992L),
   job = c("Manager", "Manager", "Manager", "Manager", "Manager", "Manager",
           "Boss", "Boss", "Manager", "Manager", "Manager", "Boss", "Boss",
           "Boss", "Boss", "Boss"),
   job2 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)
 ), .Names = c("id", "name", "year", "job", "job2"),
 class = "data.frame", row.names = c(NA, -16L))

#1

Reference: https://stackoom.com/question/1RwJH/data-table-vs-dplyr-一個人能做得好嗎-另一個做不好或做得不好


#2

In direct response to the Question Title...

dplyr definitely does things that data.table can not.

Your point #3

dplyr abstracts (or will) potential DB interactions

is a direct answer to your own question but isn't elevated to a high enough level. dplyr is truly an extendable front-end to multiple data storage mechanisms, whereas data.table is an extension to a single one.

Look at dplyr as a back-end agnostic interface, with all of the targets using the same grammar, where you can extend the targets and handlers at will. data.table is, from the dplyr perspective, one of those targets.

You will never (I hope) see a day when data.table attempts to translate your queries to create SQL statements that operate on on-disk or networked data stores.

dplyr can possibly do things data.table will not or might not do as well.

Based on the design of working in-memory, data.table could have a much more difficult time extending itself into parallel processing of queries than dplyr.


In response to the in-body questions...

Usage

Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).

This may seem like a punt but the real answer is no. People familiar with tools seem to use either the one most familiar to them or the one that is actually the right one for the job at hand. With that being said, sometimes you want to present a particular readability, sometimes a level of performance, and when you have need for a high enough level of both you may just need another tool to go along with what you already have to make clearer abstractions.

Performance

Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.

Again, no. data.table excels at being efficient in everything it does, whereas dplyr bears the burden of being limited in some respects to the underlying data store and registered handlers.

This means that when you run into a performance issue with data.table you can be pretty sure it is in your query function, and if it actually is a bottleneck in data.table then you've won yourself the joy of filing a report. This is also true when dplyr is using data.table as the back-end; you may see some overhead from dplyr, but odds are it is your query.

When dplyr has performance issues with back-ends you can get around them by registering a function for hybrid evaluation or (in the case of databases) manipulating the generated query prior to execution.

Also see the accepted answer to when is plyr better than data.table?


#3

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from a data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr we refer to dplyr's data.frame interface, whose internals are in C++ using Rcpp.


The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, it allows one to easily optimise operations for speed and, more importantly, memory usage, and also to provide some powerful features, all while maintaining consistency in syntax.

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing that data.table gets faster than dplyr as the number of groups and/or rows to group by increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 to 10 million groups and varying grouping columns, which also compare pandas. See also the updated benchmarks, which include Spark and pydatatable as well.

On benchmarks, it would be great to cover these remaining aspects as well:

  • Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations.

  • Benchmark other operations such as update and joins.

  • Also benchmark memory footprint for each operation in addition to runtime.
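A tiny sketch of what the first of those remaining benchmarks (grouped aggregation on a row subset) could look like; the data size here is deliberately small and purely illustrative, and a real benchmark would use far more rows and also measure memory:

```r
library(data.table)
library(dplyr)

set.seed(1)
n  <- 1e5L  # illustrative only; real benchmarks would use 1e7+ rows
DT <- data.table(x = runif(n), y = runif(n), z = sample(1000L, n, TRUE))
DF <- as.data.frame(DT)

# grouped sum restricted to a subset of rows
t_dt <- system.time(res_dt <- DT[x > 0.5, sum(y), by = z])
t_dp <- system.time(
  res_dp <- DF %>% filter(x > 0.5) %>% group_by(z) %>% summarise(s = sum(y))
)
```

At this size the timings are in the noise; the point is only that the data.table version filters and groups in one expression, while the dplyr version materialises the filtered data frame first.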

2. Memory usage

  1. Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

    Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.

  2. The data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

     # sub-assign by reference, updates 'y' in-place
     DT[x >= 1L, y := NA]

    But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):

     # copies the entire 'y' column
     ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))

    A concern here is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this post for interesting cases. And we want to keep it.

    Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable not to modify the input data.table within a function, one can then do:

     foo <- function(DT) {
       DT = shallow(DT)          ## shallow copy DT
       DT[, newcol := 1L]        ## does not affect the original DT
       DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
       DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise the original DT will
                                 ## also get modified.
     }

    By not using shallow(), the old functionality is retained:

     bar <- function(DT) {
       DT[, newcol := 1L]   ## old behaviour, original DT gets updated by reference
       DT[x > 2L, x := 3L]  ## old behaviour, update column x in the original DT.
     }

    By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that, while also ensuring that the columns you modify are copied only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.

    Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

    But it will still lack many features that data.table provides, including (sub)-assignment by reference.

  3. Aggregate while joining:

    Suppose you have two data.tables as follows:

     DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
     #    x y z
     # 1: 1 a 1
     # 2: 1 a 2
     # 3: 1 b 3
     # 4: 1 b 4
     # 5: 2 a 5
     # 6: 2 a 6
     # 7: 2 b 7
     # 8: 2 b 8
     DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
     #    x y mul
     # 1: 1 a   4
     # 2: 2 b   3

    And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

    • 1) aggregate DT1 to get sum(z), 2) perform a join and 3) multiply, or

       # data.table way
       DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]

       # dplyr equivalent
       DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
         right_join(DF2) %>% mutate(z = z * mul)
    • 2) do it all in one go (using the by = .EACHI feature):

       DT1[DT2, list(z=sum(z) * mul), by = .EACHI] 

    What is the advantage?

    • We don't have to allocate memory for the intermediate result.

    • We don't have to group/hash twice (once for aggregation and once for joining).

    • And more importantly, the operation we want to perform is clear by looking at j in (2).

    Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

    Have a look at this, this and this post for real usage scenarios.

    In dplyr you would have to join and aggregate or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).

  4. Update and joins:

    Consider the data.table code shown below:

     DT1[DT2, col := i.mul] 

    which adds/updates DT1's column col with mul from DT2 on those rows where DT2's key column matches DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., without resorting to a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.

    Check this post for a real usage scenario.

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
  1. Basic aggregation/update operations.

     # case (a)
     DT[, sum(y), by = z]                                ## data.table syntax
     DF %>% group_by(z) %>% summarise(sum(y))            ## dplyr syntax
     DT[, y := cumsum(y), by = z]
     ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))

     # case (b)
     DT[x > 2, sum(y), by = z]
     DF %>% filter(x > 2) %>% group_by(z) %>% summarise(sum(y))
     DT[x > 2, y := cumsum(y), by = z]
     ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))

     # case (c)
     DT[, if(any(x > 5L)) y[1L] - y[2L] else y[2L], by = z]
     DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
     DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
     DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
    • data.table syntax is compact while dplyr's is quite verbose. Things are more or less equivalent in case (a).

    • In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table, however, we express both operations with the same logic - operate on rows where x > 2, but in the first case get sum(y), whereas in the second case update those rows of y with its cumulative sum.

      This is what we mean when we say the DT[i, j, by] form is consistent.

    • Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip them otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.

      While it returns the same result, using filter() here makes the actual operation less obvious.

      It might very well be possible to use filter() in the first case as well (it does not seem obvious to me), but my point is that we should not have to.

  2. Aggregation / update on multiple columns

     # case (a)
     DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
     DF %>% group_by(z) %>% summarise_each(funs(sum))   ## dplyr syntax
     DT[, (cols) := lapply(.SD, sum), by = z]
     ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))

     # case (b)
     DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
     DF %>% group_by(z) %>% summarise_each(funs(sum, mean))

     # case (c)
     DT[, c(.N, lapply(.SD, sum)), by = z]
     DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
    • In case (a), the code is more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().

    • data.table's := requires column names to be provided, whereas dplyr generates them automatically.

    • In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

    • In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So we can use, once again, the familiar base function c() to concatenate .N to a list, which returns a list.

    Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions.

    You will need to learn just the special variables - .N and .SD, at least. The equivalents in dplyr are n() and ..

  3. Joins

    dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

     setkey(DT1, x, y)

     # 1. normal join
     DT1[DT2]             ## data.table syntax
     left_join(DT2, DT1)  ## dplyr syntax

     # 2. select columns while joining
     DT1[DT2, .(z, i.mul)]
     left_join(select(DT2, x, y, mul), select(DT1, x, y, z))

     # 3. aggregate while joining
     DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
     DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
       inner_join(DF2) %>% mutate(z = z * mul) %>% select(-mul)

     # 4. update while joining
     DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
     ??

     # 5. rolling join
     DT1[DT2, roll = -Inf]
     ??

     # 6. other arguments to control output
     DT1[DT2, mult = "first"]
     ??
    • Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc.), whereas others might like data.table's DT[i, j, by], or merge(), which is similar to base R.

    • However, dplyr joins do just that. Nothing more. Nothing less.

    • data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.

    • data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result to add/update just a few columns?

    • data.table is capable of rolling joins (5) - roll forward (LOCF), roll backward (NOCB), nearest.

    • data.table also has the mult = argument, which selects first, last or all matches (6).

    • data.table has the allow.cartesian = TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by], with additional arguments allowing further control of the output.

  4. do()...

    dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions return.

     DT[, list(x[1], y[1]), by = z]                ## data.table syntax
     DF %>% group_by(z) %>% summarise(x[1], y[1])  ## dplyr syntax

     DT[, list(x[1:2], y[1]), by = z]
     DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))

     DT[, quantile(x, 0.25), by = z]
     DF %>% group_by(z) %>% summarise(quantile(x, 0.25))

     DT[, quantile(x, c(0.25, 0.75)), by = z]
     DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))

     DT[, as.list(summary(x)), by = z]
     DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
    • .SD's equivalent is ..

    • In data.table, you can throw pretty much anything in j - the only thing to remember is that it should return a list, so that each element of the list gets converted to a column.

    • In dplyr, you cannot do that. You have to resort to do() depending on how sure you are as to whether your function will always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about its "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins, as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

  • fread - a fast file reader has been available for a long time now.

  • fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation of the implementation and #1664 for keeping track of further developments.

  • Automatic indexing - another handy feature to optimise base R syntax as-is, internally.

  • Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not always be desirable.

  • Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.

  • Non-equi joins: allows joins using other operators <=, <, >, >=, along with all the other advantages of data.table joins.

  • Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.

  • setorder() - a function in data.table that allows really fast reordering of data.tables by reference.

  • dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.

  • data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal, with an additional all argument (as in SQL).

  • data.table loads cleanly with no masking warnings, and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes the base functions filter, lag and [, which can cause problems; e.g. here and here.
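To make the non-equi join bullet concrete, here is a small sketch; the events/bands tables and their labels are invented for illustration (this requires data.table >= 1.9.8):

```r
library(data.table)

# hypothetical data: match each event to the band whose [low, high] range contains its value
events <- data.table(id = 1:3, val = c(2L, 7L, 15L))
bands  <- data.table(low = c(0L, 5L, 10L), high = c(4L, 9L, 20L),
                     label = c("low", "mid", "high"))

# non-equi join: one row per event, carrying over the matching band's label
res <- bands[events, on = .(low <= val, high >= val)]
res$label  # "low", "mid", "high"
```

The same DT[i, j, by] form used for equi joins handles the range conditions; only the on= argument changes.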


Finally:

  • On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature... not sure.

  • On parallelism - everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).

    • Progress is being made currently (in v1.9.7 devel) towards parallelising known time-consuming parts for incremental performance gains using OpenMP.

#4

Here's my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities).

Syntax

There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!

One really important thing that dplyr does for you is that it constrains your options. I claim that most single table problems can be solved with just five key verbs - filter, select, mutate, arrange and summarise - along with a "by group" adverb. That constraint is a big help when you're learning data manipulation, because it helps order your thinking about the problem. In dplyr, each of these verbs is mapped to a single function. Each function does one job, and is easy to understand in isolation.

You create complexity by piping these simple operations together with %>%. Here's an example from one of the posts Arun linked to:

diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))

Even if you've never seen dplyr before (or even R!), you can still get the gist of what's happening because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than [, but I think that can be largely mitigated by better autocomplete.

Here's the equivalent data.table code:

diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair", 
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ), 
  by = cut
][ 
  order(-Count) 
]

It's harder to follow this code unless you're already familiar with data.table. (I also couldn't figure out how to indent the repeated [ in a way that looks good to my eye.) Personally, when I look at code I wrote 6 months ago, it's like looking at code written by a stranger, so I've come to prefer straightforward, if verbose, code.

Two other minor factors that I think slightly decrease readability:

  • Since almost every data.table operation uses [, you need additional context to figure out what's happening. For example, is x[y] joining two data tables or extracting columns from a data frame? This is only a small issue, because in well-written code the variable names should suggest what's happening.

  • I like that group_by() is a separate operation in dplyr. It fundamentally changes the computation, so I think it should be obvious when skimming the code, and it's easier to spot group_by() than the by argument to [.data.table.
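
To make the point about [ concrete, here is a minimal sketch (toy tables with hypothetical names) showing how the same bracket syntax does two very different jobs:

```r
library(data.table)

x <- data.table(id = c("a", "b"), val = 1:2, key = "id")
y <- data.table(id = "a")

x[y]          # a join: rows of x matching the key column of y
x[, .(val)]   # column extraction: select the val column
```

Without knowing that y is itself a data.table, nothing in x[y] signals that a join is happening.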

I also like that the pipe isn't limited to just one package. You can start by tidying your data with tidyr, and finish up with a plot in ggvis. And you're not limited to the packages that I write - anyone can write a function that forms a seamless part of a data manipulation pipe. In fact, I rather prefer the previous data.table code rewritten with %>%:

diamonds %>% 
  data.table() %>% 
  .[cut != "Fair", 
    .(AvgPrice = mean(price),
      MedianPrice = as.numeric(median(price)),
      Count = .N
    ), 
    by = cut
  ] %>% 
  .[order(-Count)]

And the idea of piping with %>% is not limited to just data frames and is easily generalised to other contexts: interactive web graphics, web scraping, gists, run-time contracts, ...

Memory and performance

I've lumped these together because, to me, they're not that important. Most R users work with well under 1 million rows of data, and dplyr is fast enough for that size of data that you're not aware of processing time. We optimise dplyr for expressiveness on medium data; feel free to use data.table for raw speed on bigger data.

The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality). If the data you're working with doesn't fit in memory, then you can use a database backend.
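
As a sketch of what swapping backends looks like: today the data.table backend lives in the dtplyr package (at the time this answer was written it was part of dplyr itself), and the same verbs drive it. Assuming dtplyr is installed:

```r
library(dplyr)
library(dtplyr)  # data.table backend for dplyr verbs

# lazy_dt() wraps a data frame in a data.table and translates
# the dplyr pipeline into data.table code before executing it.
mtcars %>%
  lazy_dt() %>%
  filter(cyl > 4) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  as_tibble()
```

The pipeline reads identically to the data-frame version; only the wrapping and collecting steps change.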

All that said, dplyr performance will get better in the long term. We'll definitely implement some of the great ideas of data.table, like radix ordering and using the same index for joins & filters. We're also working on parallelisation so we can take advantage of multiple cores.

Features

A few things that we're planning to work on in 2015:

  • the readr package, to make it easy to get files off disk and into memory, analogous to fread().

  • More flexible joins, including support for non-equi-joins.

  • More flexible grouping, like bootstrap samples, rollups and more.

I'm also investing time into improving R's database connectors, the ability to talk to web APIs, and making it easier to scrape HTML pages.
