Follow the WeChat official account: 小程在線
Follow the CSDN blog: 程志偉的博客
julia> using StatsKit
[ Info: Precompiling StatsKit [2cb19f9e-ec4d-5c53-8573-a4542a68d3f0]
julia> using Queryverse
[ Info: Precompiling Queryverse [612083be-0b0f-5412-89c1-4e7c75506a58]
[ Info: Installing xlrd via the Conda xlrd package...
[ Info: Running `conda install -y xlrd` in root environment
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
environment location: C:\Users\cheng\.julia\conda\3
added / updated specs:
- xlrd
The following packages will be downloaded:
package | build
---------------------------|-----------------
xlrd-1.2.0 | py37_0 180 KB
------------------------------------------------------------
Total: 180 KB
The following NEW packages will be INSTALLED:
xlrd pkgs/main/win-64::xlrd-1.2.0-py37_0
Downloading and Extracting Packages
xlrd-1.2.0 | 180 KB | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
julia> using CSV
julia> using Random
julia> using RDatasets
[ Info: Precompiling RDatasets [ce6b1742-4840-55fa-b093-852dadbb1d8b]
julia> iris = dataset("datasets","iris")
150×5 DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Cat… │
├─────┼─────────────┼────────────┼─────────────┼────────────┼───────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │
│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │
│ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │
│ 6 │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │
│ 7 │ 4.6 │ 3.4 │ 1.4 │ 0.3 │ setosa │
│ 8 │ 5.0 │ 3.4 │ 1.5 │ 0.2 │ setosa │
│ 9 │ 4.4 │ 2.9 │ 1.4 │ 0.2 │ setosa │
│ 10 │ 4.9 │ 3.1 │ 1.5 │ 0.1 │ setosa │
│ 11 │ 5.4 │ 3.7 │ 1.5 │ 0.2 │ setosa │
│ 12 │ 4.8 │ 3.4 │ 1.6 │ 0.2 │ setosa │
│ 13 │ 4.8 │ 3.0 │ 1.4 │ 0.1 │ setosa │
⋮
│ 137 │ 6.3 │ 3.4 │ 5.6 │ 2.4 │ virginica │
│ 138 │ 6.4 │ 3.1 │ 5.5 │ 1.8 │ virginica │
│ 139 │ 6.0 │ 3.0 │ 4.8 │ 1.8 │ virginica │
│ 140 │ 6.9 │ 3.1 │ 5.4 │ 2.1 │ virginica │
│ 141 │ 6.7 │ 3.1 │ 5.6 │ 2.4 │ virginica │
│ 142 │ 6.9 │ 3.1 │ 5.1 │ 2.3 │ virginica │
│ 143 │ 5.8 │ 2.7 │ 5.1 │ 1.9 │ virginica │
│ 144 │ 6.8 │ 3.2 │ 5.9 │ 2.3 │ virginica │
│ 145 │ 6.7 │ 3.3 │ 5.7 │ 2.5 │ virginica │
│ 146 │ 6.7 │ 3.0 │ 5.2 │ 2.3 │ virginica │
│ 147 │ 6.3 │ 2.5 │ 5.0 │ 1.9 │ virginica │
│ 148 │ 6.5 │ 3.0 │ 5.2 │ 2.0 │ virginica │
│ 149 │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ virginica │
│ 150 │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ virginica │
julia> Random.seed!(123)
MersenneTwister(UInt32[0x0000007b], Random.DSFMT.DSFMT_state(Int32[1464307935, 1073116007, 222134151, 1073120226, -290652630, 1072956456, -580276323, 1073476387, 1332671753, 1073438661 … 138346874, 1073030449, 1049893279, 1073166535, -1999907543, 1597138926, -775229811, 32947490, 382, 0]), [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], UInt128[0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000 … 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000], 1002, 0)
Mean
julia> mean(iris.SepalWidth)
3.0573333333333337
Variance
julia> var(iris.SepalWidth)
0.18997941834451904
Standard deviation
julia> std(iris.SepalWidth)
0.43586628493669827
Excess kurtosis
julia> kurtosis(iris.SepalWidth)
0.180976317522465
Coefficient of variation
julia> variation(iris.SepalWidth)
0.1425642013530413
Skewness
julia> skewness(iris.SepalWidth)
0.31576710633893534
Standard error of the mean
julia> sem(iris.SepalWidth)
0.03558833313924838
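The quantities above are related by simple formulas: the coefficient of variation is the standard deviation divided by the mean (0.43587 / 3.05733 ≈ 0.14256), and the standard error is the standard deviation divided by √n (0.43587 / √150 ≈ 0.03559). The sketch below recomputes them with only the Statistics standard library on a small made-up sample. The data and variable names are illustrative assumptions, and the skewness and excess kurtosis use population (1/n) moments, which is, to my understanding, the convention behind the values printed above.

```julia
using Statistics

# Small illustrative sample (hypothetical values, not the iris data).
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = length(x)
m = mean(x)            # arithmetic mean
v = var(x)             # sample variance (divides by n - 1)
s = std(x)             # sample standard deviation, sqrt(v)

cv = s / m             # coefficient of variation
se = s / sqrt(n)       # standard error of the mean

# Moment-based skewness and excess kurtosis from population (1/n) moments.
d = x .- m
m2 = sum(d .^ 2) / n
skew = (sum(d .^ 3) / n) / m2^1.5
exkurt = (sum(d .^ 4) / n) / m2^2 - 3
```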
Z-score standardization
julia> zscore(iris.SepalWidth)
150-element Array{Float64,1}:
1.015601990713633
-0.1315388120502606
0.3273175090552973
0.09788934850251833
1.245030151266412
1.9333146329247477
0.7861738301608541
0.7861738301608541
-0.3609669726030395
0.09788934850251833
1.474458311819191
0.7861738301608541
-0.1315388120502606
-0.1315388120502606
2.1627427934775265
⋮
-0.1315388120502606
0.7861738301608541
0.09788934850251833
-0.1315388120502606
0.09788934850251833
0.09788934850251833
0.09788934850251833
-0.8198232937085964
0.3273175090552973
0.5567456696080751
-0.1315388120502606
-1.2786796148141542
-0.1315388120502606
0.7861738301608541
-0.1315388120502606
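Each z-score is the value minus the mean, divided by the standard deviation; for the first row, (3.5 − 3.05733) / 0.43587 ≈ 1.0156. A minimal sketch of that calculation, using a hypothetical helper named standardize (not part of any package) and made-up data:

```julia
using Statistics

# Hypothetical helper; named so it does not clash with StatsBase.zscore.
standardize(x) = (x .- mean(x)) ./ std(x)

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(x)

# After standardization the sample has mean 0 and standard deviation 1.
mean(z), std(z)
```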
Show the first six rows
julia> first(iris, 6)
6×5 DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Cat… │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │
│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │
│ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │
│ 6 │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │
Standardize several columns of the data at once
julia> combine(iris[:, 1:4], names(iris)[1:4] .=> zscore)
150×4 DataFrame
│ Row │ SepalLength_zscore │ SepalWidth_zscore │ PetalLength_zscore │ PetalWidth_zscore │
│ │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼────────────────────┼───────────────────┼────────────────────┼───────────────────┤
│ 1 │ -0.897674 │ 1.0156 │ -1.33575 │ -1.31105 │
│ 2 │ -1.1392 │ -0.131539 │ -1.33575 │ -1.31105 │
│ 3 │ -1.38073 │ 0.327318 │ -1.3924 │ -1.31105 │
│ 4 │ -1.50149 │ 0.0978893 │ -1.2791 │ -1.31105 │
│ 5 │ -1.01844 │ 1.24503 │ -1.33575 │ -1.31105 │
│ 6 │ -0.535384 │ 1.93331 │ -1.16581 │ -1.04867 │
│ 7 │ -1.50149 │ 0.786174 │ -1.33575 │ -1.17986 │
│ 8 │ -1.01844 │ 0.786174 │ -1.2791 │ -1.31105 │
│ 9 │ -1.74302 │ -0.360967 │ -1.33575 │ -1.31105 │
│ 10 │ -1.1392 │ 0.0978893 │ -1.2791 │ -1.44224 │
│ 11 │ -0.535384 │ 1.47446 │ -1.2791 │ -1.31105 │
│ 12 │ -1.25996 │ 0.786174 │ -1.22246 │ -1.31105 │
│ 13 │ -1.25996 │ -0.131539 │ -1.33575 │ -1.44224 │
⋮
│ 137 │ 0.551486 │ 0.786174 │ 1.04345 │ 1.57519 │
│ 138 │ 0.672249 │ 0.0978893 │ 0.986802 │ 0.788031 │
│ 139 │ 0.189196 │ -0.131539 │ 0.590269 │ 0.788031 │
│ 140 │ 1.27607 │ 0.0978893 │ 0.930154 │ 1.18161 │
│ 141 │ 1.03454 │ 0.0978893 │ 1.04345 │ 1.57519 │
│ 142 │ 1.27607 │ 0.0978893 │ 0.760211 │ 1.44399 │
│ 143 │ -0.0523308 │ -0.819823 │ 0.760211 │ 0.919223 │
│ 144 │ 1.1553 │ 0.327318 │ 1.21339 │ 1.44399 │
│ 145 │ 1.03454 │ 0.556746 │ 1.1001 │ 1.70638 │
│ 146 │ 1.03454 │ -0.131539 │ 0.816859 │ 1.44399 │
│ 147 │ 0.551486 │ -1.27868 │ 0.703564 │ 0.919223 │
│ 148 │ 0.793012 │ -0.131539 │ 0.816859 │ 1.05042 │
│ 149 │ 0.430722 │ 0.786174 │ 0.930154 │ 1.44399 │
│ 150 │ 0.0684325 │ -0.131539 │ 0.760211 │ 0.788031 │
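The same column-wise standardization can be sketched without DataFrames by broadcasting over a plain matrix. The matrix X below is a made-up stand-in for the four numeric iris columns.

```julia
using Statistics

# Hypothetical 3×2 matrix; each column plays the role of one numeric feature.
X = [1.0 10.0;
     2.0 20.0;
     3.0 60.0]

# Broadcasting subtracts each column's mean and divides by its standard deviation.
Z = (X .- mean(X, dims = 1)) ./ std(X, dims = 1)
```

Every column of Z then has mean 0 and standard deviation 1, independently of the other columns.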
Interquartile range
julia> iqr(iris.SepalWidth)
0.5
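The interquartile range is the third quartile minus the first quartile; for SepalWidth that is 3.3 − 2.8 = 0.5. A small sketch with the Statistics standard library, on made-up data:

```julia
using Statistics

x = collect(1.0:9.0)                    # illustrative data: 1.0, 2.0, ..., 9.0
q1, q3 = quantile(x, [0.25, 0.75])      # first and third quartiles
spread = q3 - q1                        # interquartile range
```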
Summary statistics including quartiles, similar to describe
julia> summarystats(iris.SepalWidth)
Summary Stats:
Length: 150
Missing Count: 0
Mean: 3.057333
Minimum: 2.000000
1st Quartile: 2.800000
Median: 3.000000
3rd Quartile: 3.300000
Maximum: 4.400000
julia> describe(iris)
5×8 DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼─────────────┼─────────┼────────┼────────┼───────────┼─────────┼──────────┼────────────────────────────────┤
│ 1 │ SepalLength │ 5.84333 │ 4.3 │ 5.8 │ 7.9 │ │ │ Float64 │
│ 2 │ SepalWidth │ 3.05733 │ 2.0 │ 3.0 │ 4.4 │ │ │ Float64 │
│ 3 │ PetalLength │ 3.758 │ 1.0 │ 4.35 │ 6.9 │ │ │ Float64 │
│ 4 │ PetalWidth │ 1.19933 │ 0.1 │ 1.3 │ 2.5 │ │ │ Float64 │
│ 5 │ Species │ │ setosa │ │ virginica │ 3 │ │ CategoricalValue{String,UInt8} │
Count the positions where two vectors are equal (counteq) or differ (countne)
julia> counteq(iris.SepalWidth,iris.SepalLength)
0
julia> countne(iris.SepalWidth,iris.SepalLength)
150
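Both counts are element-wise comparisons, so they always add up to the vector length (0 + 150 = 150 above). An equivalent sketch in plain Julia, with made-up vectors a and b:

```julia
a = [1, 2, 3, 4]
b = [1, 0, 3, 5]

n_equal = count(a .== b)     # positions where both vectors agree
n_diff  = count(a .!= b)     # positions where they differ
```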
Count how many times each value occurs
julia> countmap(iris.SepalWidth)
Dict{Float64,Int64} with 23 entries:
4.1 => 1
2.8 => 14
2.0 => 1
2.2 => 3
4.4 => 1
3.5 => 6
2.6 => 5
2.3 => 4
2.9 => 10
3.7 => 3
4.0 => 1
3.0 => 26
2.7 => 9
4.2 => 1
3.9 => 2
2.5 => 8
2.4 => 3
3.1 => 11
3.4 => 12
3.2 => 13
3.3 => 6
3.6 => 4
3.8 => 6
Proportion of each value: its count divided by the total number of observations
julia> proportionmap(iris.SepalWidth)
Dict{Float64,Float64} with 23 entries:
4.1 => 0.00666667
2.8 => 0.0933333
2.0 => 0.00666667
2.2 => 0.02
4.4 => 0.00666667
3.5 => 0.04
2.6 => 0.0333333
2.3 => 0.0266667
2.9 => 0.0666667
3.7 => 0.02
4.0 => 0.00666667
3.0 => 0.173333
2.7 => 0.06
4.2 => 0.00666667
3.9 => 0.0133333
2.5 => 0.0533333
2.4 => 0.02
3.1 => 0.0733333
3.4 => 0.08
3.2 => 0.0866667
3.3 => 0.04
3.6 => 0.0266667
3.8 => 0.04
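The two maps are linked by a single division: 3.0 occurs 26 times, and 26 / 150 ≈ 0.173333. A minimal sketch of the same bookkeeping with a plain Dict; count_values is a hypothetical helper and the data is made up.

```julia
# Hypothetical helper: tally how often each value appears.
function count_values(xs)
    counts = Dict{eltype(xs), Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

xs = [3.0, 2.8, 3.0, 3.2, 3.0, 2.8]
counts = count_values(xs)

# Proportion = count / total number of observations.
props = Dict(k => v / length(xs) for (k, v) in counts)
```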
Ranking: ordinal ranking (1 2 3 4), where tied values get distinct ranks in order of appearance
julia> ordinalrank(iris.SepalWidth)
150-element Array{Int64,1}:
126
58
95
84
132
145
114
115
48
85
136
116
59
60
147
⋮
79
124
91
80
92
93
94
33
107
113
81
19
82
125
83
Ranking: standard competition ranking (1 2 2 4)
julia> competerank(iris.SepalWidth)
150-element Array{Int64,1}:
126
58
95
84
132
145
114
114
48
84
136
114
58
58
147
⋮
58
114
84
58
84
84
84
25
95
108
58
12
58
114
58
Ranking: dense ranking (1 2 2 3)
julia> denserank(iris.SepalWidth)
150-element Array{Int64,1}:
15
10
12
11
16
19
14
14
9
11
17
14
10
10
20
⋮
10
14
11
10
11
11
11
7
12
13
10
5
10
14
10
Ranking: tied (fractional) ranking (1 2.5 2.5 4), where tied values share the average of their ordinal ranks
julia> tiedrank(iris.SepalWidth)
150-element Array{Float64,1}:
128.5
70.5
101.0
89.0
133.5
145.5
119.5
119.5
52.5
89.0
137.0
119.5
70.5
70.5
147.0
⋮
70.5
119.5
89.0
70.5
89.0
89.0
89.0
29.0
101.0
110.5
70.5
15.5
70.5
119.5
70.5
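The four ranking schemes differ only in how they treat ties. The sketch below computes all four in one pass over the sorted order; rank_variants is a hypothetical helper written for illustration and is not how StatsBase implements them.

```julia
# Hypothetical helper computing the four tie-handling schemes side by side.
function rank_variants(x)
    n = length(x)
    p = sortperm(x)                 # stable sort: ties keep their original order
    ordinal = zeros(Int, n)
    compete = zeros(Int, n)
    dense   = zeros(Int, n)
    tied    = zeros(Float64, n)
    i = 1                           # first sorted position of the current tie block
    d = 0                           # number of distinct values seen so far
    while i <= n
        j = i                       # extend j to the last position of the tie block
        while j < n && x[p[j + 1]] == x[p[i]]
            j += 1
        end
        d += 1
        for k in i:j
            ordinal[p[k]] = k             # 1 2 3 4
            compete[p[k]] = i             # 1 2 2 4
            dense[p[k]]   = d             # 1 2 2 3
            tied[p[k]]    = (i + j) / 2   # 1 2.5 2.5 4
        end
        i = j + 1
    end
    return (ordinal = ordinal, compete = compete, dense = dense, tied = tied)
end

r = rank_variants([10, 20, 20, 30])
```

In the output above, rows 7 and 8 both have SepalWidth 3.4: they receive 114 and 115 under ordinal ranking, 114 and 114 under competition ranking, and 119.5 under tied ranking, the average of the twelve positions 114 through 125 occupied by that value.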
Create a normal distribution with mean 10 and standard deviation 2 (so the variance is 4)
julia> n1 = Normal(10,2)
Normal{Float64}(μ=10.0, σ=2.0)
Inspect the parameters: mean and standard deviation
julia> params(n1)
(10.0, 2.0)
julia> fieldnames(typeof(n1))
(:μ, :σ)
julia> typeof(Normal)
UnionAll
Draw 50 random numbers from n1
julia> rand(n1,50)
50-element Array{Float64,1}:
10.75252842248695
9.189456419757317
12.671706963707475
13.201521418484015
7.084217959084374
11.601178185151417
11.791756909048322
8.616132804493297
6.982470144914277
8.49095442018933
10.231243743909562
10.485190171875951
9.55357840686529
9.119296072267389
6.456312157863158
⋮
11.705057817124151
11.176904946401194
11.949705701909465
9.404278381567762
11.663841458180475
10.63762730322479
9.39660780780916
7.328480885141753
10.001410312828868
14.292303719124995
10.01686385349854
14.532511665913692
10.439431478457106
10.977465688620121
7.612032453765822
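Since the printed distribution shows σ=2.0, the second argument of Normal is the standard deviation, and the variance of the samples is about 4. A rough check using only standard libraries, scaling standard normal draws instead of calling Distributions; the sample size and seed are arbitrary choices.

```julia
using Random, Statistics

rng = MersenneTwister(123)      # explicit generator so the draw is repeatable
μ, σ = 10.0, 2.0

# A Normal(μ, σ) draw is μ + σ * z, where z is a standard normal draw.
samples = μ .+ σ .* randn(rng, 100_000)

mean(samples), std(samples), var(samples)   # roughly 10, 2 and 4
```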
Binomial distribution (the default is n=1, p=0.5)
julia> x = Binomial()
Binomial{Float64}(n=1, p=0.5)
julia> rand(x,50)
50-element Array{Int64,1}:
1
1
1
0
0
0
0
1
0
1
1
1
1
1
0
⋮
1
0
0
1
0
1
0
1
0
0
1
0
1
0
1
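With n=1 the binomial distribution reduces to a single coin flip, which is why every draw above is 0 or 1. A sketch of the same sampling with only the Random standard library; the seed and sample size are arbitrary assumptions.

```julia
using Random, Statistics

rng = MersenneTwister(123)
p = 0.5

# A uniform draw falls below p with probability p, giving a 0/1 outcome.
draws = Int.(rand(rng, 10_000) .< p)

mean(draws)     # fraction of ones, close to p
```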
Probability density and cumulative distribution
julia> n = Normal()
Normal{Float64}(μ=0.0, σ=1.0)
julia> pdf(n,0)
0.3989422804014327
julia> cdf(n,0)
0.5
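For the standard normal distribution the density is exp(−x²/2) / √(2π), which gives 1/√(2π) ≈ 0.39894 at x = 0, and the cumulative distribution is the area under that density to the left of x. The sketch below writes both out by hand; the function names, the trapezoidal rule, and the cutoff at −10 are illustrative choices, not what Distributions does internally.

```julia
# Density of the standard normal distribution.
stdnormal_pdf(x) = exp(-x^2 / 2) / sqrt(2π)

# Crude CDF approximation: trapezoidal rule from a far-left cutoff up to x.
function stdnormal_cdf(x; lower = -10.0, steps = 100_000)
    h = (x - lower) / steps
    total = (stdnormal_pdf(lower) + stdnormal_pdf(x)) / 2
    for k in 1:steps-1
        total += stdnormal_pdf(lower + k * h)
    end
    return total * h
end

stdnormal_pdf(0), stdnormal_cdf(0.0)
```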
Quantiles
julia> quantile.(n,[0.25,0.5,0.75,0.99])
4-element Array{Float64,1}:
-0.6744897501960818
0.0
0.6744897501960818
2.326347874040846
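A quantile is the inverse of the cumulative distribution: quantile(n, 0.75) is the x for which cdf(n, x) equals 0.75. The sketch below illustrates that relationship by bisection on a hand-written approximate CDF. All helper names are hypothetical, and the numerical method is chosen for clarity only; it is not what Distributions uses.

```julia
# Standard normal density and a trapezoidal approximation of its CDF.
pdf0(x) = exp(-x^2 / 2) / sqrt(2π)

function cdf0(x; lower = -10.0, steps = 20_000)
    h = (x - lower) / steps
    inner = sum(pdf0(lower + k * h) for k in 1:steps-1)
    return h * ((pdf0(lower) + pdf0(x)) / 2 + inner)
end

# Bisection: shrink [lo, hi] until it brackets the point where cdf0 equals p.
function quantile0(p; lo = -10.0, hi = 10.0, tol = 1e-8)
    while hi - lo > tol
        mid = (lo + hi) / 2
        if cdf0(mid) < p
            lo = mid
        else
            hi = mid
        end
    end
    return (lo + hi) / 2
end

quantile0(0.75)     # close to the 0.6744897501960818 shown above
```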