Julia DataFrames: the groupby/map/combine/aggregate functions

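1. Supported statistical functions

The functions described below all accept statistical (reduction) functions to apply to each group. For the list of supported statistical functions, see: https://blog.csdn.net/weixin_41715077/article/details/103747504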

 

2. Function reference

2.1 groupby

#  groupby(df::AbstractDataFrame, cols; sort=false, skipmissing=false)
#  Return type: GroupedDataFrame
Return a `GroupedDataFrame` representing a view of an `AbstractDataFrame` split into row groups.

# Arguments
- `df` : an `AbstractDataFrame` to split
- `cols` : the column or columns to group by
- `sort` : whether to sort the groups by the values of the grouping columns
- `skipmissing` : whether to skip rows with `missing` values in one of the grouping columns

Functions that can be applied to the result:
* [`by`](@ref) : split-apply-combine using functions
* [`aggregate`](@ref) : split-apply-combine; applies functions in the form of a cross product
* [`map`](@ref) : apply a function to each group of a `GroupedDataFrame` (without combining)
* [`combine`](@ref) : combine a `GroupedDataFrame`, optionally applying a function to each group
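
A minimal sketch of what `sort` and `skipmissing` do, on a made-up table (the `team` and `score` columns are hypothetical and not part of the iris data used later):

```julia
using DataFrames

# Hypothetical toy data; the `missing` key exercises `skipmissing`.
df = DataFrame(team = ["b", "a", missing, "a", "b"],
               score = [1, 2, 3, 4, 5])

# sort=true orders the groups by key; skipmissing=true drops the row whose key is missing.
gd = groupby(df, :team; sort = true, skipmissing = true)

n_groups = length(gd)        # number of groups that remain
first_scores = gd[1].score   # scores of the first group (team "a")
println(n_groups, " groups; first group scores: ", first_scores)
```

Each group is a view into `df`, so no rows are copied when the data frame is split.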

 

2.2 map

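    map(cols => f, gd::GroupedDataFrame)
    map(f, gd::GroupedDataFrame)
    In the first form, `cols` must be a column name or column index, or a vector of them (the examples in section 3 pass a vector).
    `f` must be callable. It is not restricted to a fixed list; the reductions `sum`, `prod`, `minimum`, `maximum`, `mean`, `var`, `std`, `first`, `last` and `length` are the ones the DataFrames documentation names as optimized for the single-column `cols => f` form.
    Apply a function to each group of rows and return a [`GroupedDataFrame`](@ref).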
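
The `map` methods above belong to the DataFrames release this post was written against. As far as I know, releases from 1.0 onward no longer accept `map` on a `GroupedDataFrame`. A version-independent sketch of the same idea (apply a function to each group without combining) is to iterate over the groups directly. The table and its column names are made up for illustration:

```julia
using DataFrames

# Hypothetical toy data.
df = DataFrame(g = ["a", "b", "a", "b"], x = [1, 2, 3, 4])
gd = groupby(df, :g; sort = true)

# Each element of gd is a SubDataFrame, so any function of a data frame can be applied to it.
sums = [sum(sdf.x) for sdf in gd]
println(sums)
```

Unlike `map`, the comprehension returns a plain vector rather than a `GroupedDataFrame`.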

2.3  combine

  Apply and combine
    combine(gd::GroupedDataFrame, cols => f...)
    combine(gd::GroupedDataFrame; (colname = cols => f)...)
    combine(gd::GroupedDataFrame, f)
    combine(f, gd::GroupedDataFrame)
    Transform a [`GroupedDataFrame`](@ref) into a single `DataFrame`, optionally applying a function to each group.

    See also
    - [`by(f, df, cols)`](@ref) is a shorthand for `combine(f, groupby(df, cols))`.
    - [`map`](@ref): `combine(f, groupby(df, cols))` is a more efficient equivalent to `combine(map(f, groupby(df, cols)))`.
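
A small sketch of two of the `combine` forms listed above, on a made-up table (column names are hypothetical). Both forms should give one row per group:

```julia
using DataFrames, Statistics

# Hypothetical toy data.
df = DataFrame(g = ["a", "b", "a", "b"], x = [1.0, 2.0, 3.0, 4.0])
gd = groupby(df, :g; sort = true)

# Pair form: the result column is named after the source column and the function.
by_pair = combine(gd, :x => mean)

# Function form: f receives each group as a SubDataFrame; a NamedTuple names the output column.
by_func = combine(sdf -> (x_mean = mean(sdf.x),), gd)

println(by_pair)
println(by_func)
```

The pair form only passes the requested column to the function, which is usually the cheaper choice when a single column is needed.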

2.4 aggregate

  Aggregation
    aggregate(df::AbstractDataFrame, fs)
    aggregate(df::AbstractDataFrame, cols, fs; sort=false, skipmissing=false)
    aggregate(gd::GroupedDataFrame, fs; sort=false)
    Split-apply-combine that applies a set of functions over the columns of an `AbstractDataFrame` or [`GroupedDataFrame`](@ref). Return an aggregated data frame.

    # Arguments
    - `df` : an `AbstractDataFrame`
    - `gd` : a `GroupedDataFrame`
    - `cols` : a column indicator (`Symbol`, `Int`, `Vector{Symbol}`, etc.)
    - `fs` : a function or vector of functions to be applied to vectors within groups; expects each argument to be a column vector
    - `sort` : whether to sort rows according to the values of the grouping columns
    - `skipmissing` : whether to skip rows with `missing` values in one of the grouping columns `cols`
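
`aggregate` applies every function in `fs` to every non-grouping column (a cross product). To my knowledge it was deprecated and then removed in releases after the one this post targets, so the sketch below builds the same cross product with `combine`. The table, the column list `cols` and the function list `fs` are made up for illustration:

```julia
using DataFrames, Statistics

# Hypothetical toy data.
df = DataFrame(g = ["a", "b", "a", "b"],
               x = [1, 2, 3, 4],
               y = [10, 20, 30, 40])
gd = groupby(df, :g; sort = true)

cols = [:x, :y]      # the non-grouping columns
fs = (sum, mean)     # the functions to apply to each of them

# One `column => function` pair per combination, splatted into combine.
agg = combine(gd, [c => f for f in fs for c in cols]...)
println(agg)
```

The result has one row per group and one column per (column, function) pair: `x_sum`, `y_sum`, `x_mean`, `y_mean`.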

 

3. Code examples

using DataFrames, CSV, Statistics

# 2.1 groupby: split the data frame into row groups (signature and arguments in section 2.1)

# Load the iris data set that ships with the DataFrames.jl documentation.
iris = DataFrame(CSV.File(joinpath(dirname(pathof(DataFrames)),
                                   "../docs/src/assets/iris.csv")));

# Group by species
gd = groupby(iris, :Species, sort=true, skipmissing=true)

# Select a group by position
gd[1]

last(gd)

first(gd)

# Iterate over the groups; each g is a SubDataFrame
for g in gd
    long_petals = filter(len -> len > 2, g.PetalLength)
    println(long_petals)
end

# Select a group by key
k = first(keys(gd))
gd[k]
# A group can also be looked up by the values of the grouping columns.
# "Iris-setosa" assumes the species labels used in this copy of iris.csv.
gd[(Species="Iris-setosa",)]
gd[("Iris-setosa",)]

# 2.2 map: apply a function to each group (signatures in section 2.2)
# All groups
map(sdf -> sum(sdf.PetalLength), gd)

# A range of groups
map(sdf -> sum(sdf.PetalLength), gd[1:2])

map([:PetalLength, :SepalLength] =>
              x -> (a=mean(x.PetalLength)/mean(x.SepalLength), b=std(x.PetalLength)), gd)

map([:PetalLength, :SepalLength] =>
              x -> (a=mean(x.PetalLength)/mean(x.SepalLength), b=var(x.PetalLength)), gd)

map(:PetalLength => sum, gd)

# 2.3 combine: apply a function to each group and collect the results in one DataFrame (signatures in section 2.3)
combine(gd, :PetalLength => sum)

combine(:PetalLength => sum, gd)

combine(sdf -> sum(sdf.PetalLength), gd)
iris.PetalLength   # the ungrouped column, for comparison

combine([:PetalLength, :SepalLength] =>
              x -> (a=mean(filter(len -> len > 1, x.PetalLength))/mean(x.SepalLength), b=var(x.PetalLength)), gd)

# Combine only a range of groups
combine(:PetalLength => sum, gd[1:2])


# 2.4 aggregate: apply each function to every non-grouping column (signatures and arguments in section 2.4)
aggregate(iris, :Species, maximum)

aggregate(iris, :Species, [sum, x->mean(skipmissing(x))])

aggregate(groupby(iris, :Species)[1:2], [sum, x->mean(skipmissing(x))])


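# Other useful functions
parent(gd)    # get the parent DataFrame
vcat(gd...)   # rebuild a DataFrame from all groups; the row order may differ from the original
DataFrame(gd) # return a new DataFrame built from the groups
groupvars(gd) # the column name(s) used for grouping, as a vector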


eachcol(iris[!,1:end-1], true) # iterate over the columns as name => vector pairs (a DataFrameColumns object)
# :SepalLength => [4.6, 5.3, 7.0, 6.9, 6.3, 7.1]
# :SepalWidth => [3.2, 3.7, 3.2, 3.1, 3.3, 3.0]
# :PetalLength => [1.4, 1.5, 4.7, 4.9, 6.0, 5.9]
# :PetalWidth => [0.2, 0.2, 1.4, 1.5, 2.5, 2.1]
foreach(c -> println(c[1], ": ", mean(c[2])), eachcol(iris[!,1:end-1], true)) # an iteration returns a Pair with column name and values
foreach(c -> println(mean(c[2])), eachcol(iris[!,1:end-1], true)) # an iteration returns a Pair with column name and values
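map(mean, eachcol(iris[!,1:end-1], false)) # map on groups applies several statistics to the same column; with eachcol, one statistic is applied to several columns
mapcols(mean, iris[!,1:end-1]) # like map combined with eachcol: applies one function to every column, but returns a DataFrame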



eachrow(iris[!,1:end-1]) # iterate over the rows (a DataFrameRows object)
foreach(r -> println(r), eachrow(iris[!,1:end-1])) # each iteration yields a DataFrameRow
map(r -> r.SepalLength/r.SepalWidth, eachrow(iris[!,1:end-1]))
 # each r is a DataFrameRow, which works similarly to a one-row DataFrame
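
The column-wise and row-wise helpers at the end of the listing can be tried without the iris file. The sketch below uses a made-up two-column table and only the one-argument form of `eachcol`, which I believe is accepted by both older and current DataFrames releases:

```julia
using DataFrames, Statistics

# Hypothetical toy data.
df = DataFrame(a = [1.0, 2.0, 3.0], b = [4.0, 6.0, 8.0])

col_means = map(mean, eachcol(df))          # one value per column, as a vector
col_means_df = mapcols(mean, df)            # the same means, kept as a one-row DataFrame
ratios = map(r -> r.b / r.a, eachrow(df))   # each r is a DataFrameRow

println(col_means)
println(col_means_df)
println(ratios)
```

`mapcols` keeps the column names, which is convenient when the result feeds into further data frame operations.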

 
