Understanding Some Concepts from a Statistics Perspective

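While working through Khan Academy's Probability and Statistics course (mainland China mirror; I recommend the international version of Probability and Statistics instead), I came to a new understanding of several concepts I first met in Probability, and this cleared up questions that had puzzled me for a while. These notes are not a continuous body of knowledge, just the pieces that helped me straighten out my own reasoning. I hope they help you as well.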

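0. What is statistics?

In the broadest sense, statistics is the discipline that deals with data. It is usually divided into two branches: Inferential Statistics and Descriptive Statistics.

Inferential Statistics: draws conclusions about a population from sample data.
Descriptive Statistics: uses a few simple numbers (e.g. arithmetic mean, variance, median) to describe what the data (ideally the whole population) looks like.

In particular, statistics is fond of the term central tendency. It is one dimension along which we describe what data looks like; the other is dispersion. Concretely, we quantify central tendency with values such as the mean, median, or mode, and we quantify dispersion with the variance.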
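As a small illustration of these two dimensions, the sketch below computes the usual summaries with Python's standard `statistics` module. The data set is made up for the example and does not come from the course.

```python
import statistics

# Hypothetical data set, invented for illustration only.
data = [2, 3, 3, 5, 7, 10]

# Central tendency: three different ways to name a "typical" value.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Dispersion: population variance (divides by the number of data points).
pop_variance = statistics.pvariance(data)

print(mean, median, mode, pop_variance)
```

The three central-tendency values differ here (5, 4 and 3), which is a reminder that "typical value" is not a single well-defined number.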

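1. Conventional notation

Being familiar with the notation of statistics lightens the load of understanding and, more importantly, removes ambiguity.

Population mean: $\mu$
Population variance: $\sigma^{2}$
Sample mean: $\bar{X}$
Sample variance: $S^2$

In addition, a quantity computed from a sample is conventionally called a statistic, while the corresponding quantity of the population is called a parameter. The unknown quantity we want to find is usually called the parameter, and the process of estimating a parameter from statistics is what is commonly called inference (inferring).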

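2. Why is the mean also called the expected value? (Mean and Expectation)

Note that the mean $\mu$ refers to the mean of the population. In other words, to compute $\mu$ in principle we need every data point in the population.

In practice, however, we usually only have data from a sample, not from the whole population. With a continuous random variable (continuous r.v.) the situation is even worse: we can never collect the whole population, because the set of possible values is uncountably infinite.

In the end, we let "frequencies" stand in for a whole "batch" of data. If someone tells us how the population is distributed (what share each value takes), we can compute the mean without knowing the value of every single experiment. Statistics can stop here. Probability adds one more assumption: as the number of experiments goes to infinity, the relative frequency of a value approaches its probability.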
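A minimal sketch of this idea: the mean computed from the raw observations equals the mean computed only from the relative frequency of each value. The observations below are hypothetical.

```python
from collections import Counter

# Hypothetical observations, invented for illustration only.
observations = [1, 2, 2, 3, 3, 3, 6, 6]

# Mean computed directly from every single observation.
mean_direct = sum(observations) / len(observations)

# Relative frequency of each distinct value (its share of the data).
counts = Counter(observations)
frequencies = {value: count / len(observations) for value, count in counts.items()}

# Mean computed from the frequencies alone: sum of value * share.
mean_from_frequencies = sum(value * share for value, share in frequencies.items())

print(mean_direct, mean_from_frequencies)
```

Replacing each share by a probability turns the second formula into the usual definition of the expected value of a discrete random variable.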

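As an aside, doesn't a "mean" computed this way carry a flavour of "expectation"? Experience tells us that with this much data some observations always slip through the net, so we can only expect the mean to be the value we computed.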

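3. Why does the unbiased sample variance look like this? (Unbiased Sample Variance)

Sample variance:
$S_{n}^{2} = \frac{\sum\limits_{i=1}^{n} (x_{i}-\bar{X})^{2}}{n}$

Unbiased sample variance:
$S_{n-1}^{2} = \frac{\sum\limits_{i=1}^{n} (x_{i}-\bar{X})^{2}}{n-1}$
The key point is that this is the variance of a sample, not of the population: the deviations are measured from the sample mean $\bar{X}$, because $\mu$ is unknown. For the variance of the population itself, the denominator is still the number of data points. Note also that the sample standard deviation $\sqrt{S_{n-1}^{2}}$ is not unbiased.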
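The effect of the two denominators can be checked with a small simulation. The sketch below assumes a standard normal population (true variance 1), a sample size of 5 and a fixed random seed. These choices are mine, made only so that the bias is easy to see.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

TRUE_VARIANCE = 1.0   # variance of the assumed standard normal population
n = 5                 # small sample size makes the bias clearly visible
trials = 20000

biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    x_bar = sum(sample) / n
    # Sum of squared deviations from the SAMPLE mean, not from mu.
    squared_deviations = sum((x - x_bar) ** 2 for x in sample)
    biased_total += squared_deviations / n
    unbiased_total += squared_deviations / (n - 1)

avg_biased = biased_total / trials      # tends towards (n - 1) / n * sigma^2
avg_unbiased = unbiased_total / trials  # tends towards sigma^2

print(avg_biased, avg_unbiased)
```

Averaged over many samples, dividing by $n$ settles near $\frac{n-1}{n}\sigma^{2}$ (0.8 here), while dividing by $n-1$ settles near $\sigma^{2}$.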

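(Intuition) The formula is not intuitive at first sight, but the following scenario makes the result easier to accept: (to be continued)

One can imagine (to be continued)

(Rigorous proof): (to be continued)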

4. Understanding the Poisson distribution

To be continued

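5. The Central Limit Theorem

0) Significance

It "fuses" distributions of almost any shape (they only need to be i.i.d. with a finite variance) into the well-behaved normal distribution, which makes modelling convenient.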

1) Definition

Let $X_1,X_2,\dots$ be a sequence of independent identically distributed random variables with common mean $\mu$ and variance $\sigma^2$, and define:
$Z_n = \frac{X_1+X_2+\dots+X_n - n\mu}{\sigma\sqrt{n}}$
Then the CDF of $Z_n$ converges to the standard normal CDF:
$\Phi(z)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z} e^{-\frac{x^2}{2}}\,dx$
in the sense that, for every $z$:
$\lim_{n\to\infty} P(Z_n \leq z) = \Phi(z)$
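The convergence can be checked numerically. The sketch below is only an illustration under my own assumptions: the $X_i$ are uniform on $[0, 1)$ (so $\mu = 1/2$ and $\sigma^2 = 1/12$), $n = 50$, and the sum is centred by $n\mu$ before dividing by $\sigma\sqrt{n}$. It compares the empirical value of $P(Z_n \leq 1)$ with $\Phi(1)$.

```python
import math
import random

random.seed(1)  # fixed seed so the run is reproducible


def standard_normal_cdf(z):
    """Phi(z), written with the error function from the math module."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Assumed population: uniform on [0, 1), which is clearly not normal.
mu = 0.5
sigma = math.sqrt(1.0 / 12.0)

n = 50          # number of terms in each sum
trials = 20000  # number of simulated values of Z_n
z = 1.0

hits = 0
for _ in range(trials):
    total = sum(random.random() for _ in range(n))
    z_n = (total - n * mu) / (sigma * math.sqrt(n))  # standardized sum
    if z_n <= z:
        hits += 1

empirical = hits / trials
theoretical = standard_normal_cdf(z)
print(empirical, theoretical)
```

Even though a single uniform variable looks nothing like a bell curve, the two printed numbers should be close.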

2) Intuition

To be continued

3) Mathematical proof

To be continued

6. Confidence Intervals

0) Significance

A confidence interval is a way of quantifying something, namely how good an estimate is.

1) Definition

To be continued

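2) Intuition

Two equivalent statements help here:

We are confident that there is 95% chance that the sample mean $\bar {X}$ is within some number of standard deviations of the population mean $\mu$.
We are confident that there is 95% chance that the population mean $\mu$ is within some interval around the sample mean $\bar {X}$.

In general, the Central Limit Theorem gives us the sampling distribution of the sample mean. This sampling distribution has the following properties:
$\mu_{\bar{X}} =\mu$

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

The first statement therefore follows most directly from the sampling distribution. Rearranging the same inequality so that it is centred on $\bar{X}$ instead of $\mu$ gives the second.

Be careful: the wording is "We are confident that ...". It does not mean that a particular computed interval really has a 95% probability of containing $\mu$. Once the interval is computed, a fixed $\mu$ either lies in it or does not; the 95% describes how often intervals built this way capture $\mu$ over repeated sampling.

Also, as a rule of thumb, when we need a confidence interval for a small sample (sample size $n\leq 30$), we can use the t distribution to model the sampling distribution.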
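The "confidence" reading can be made concrete with a simulation: build the interval $\bar{X} \pm z^{*} \cdot s/\sqrt{n}$ many times and count how often it contains $\mu$. Everything specific below is an assumption for illustration: a normal population with $\mu = 10$ and $\sigma = 2$, $n = 50$ (large enough for the normal approximation), and the sample standard deviation $s$ standing in for the unknown $\sigma$.

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(2)  # fixed seed so the run is reproducible

MU, SIGMA = 10.0, 2.0  # assumed population parameters (unknown in practice)
n = 50
trials = 2000

# Critical value for a two-sided 95% interval, roughly 1.96.
z_star = NormalDist().inv_cdf(0.975)


def confidence_interval(sample):
    """Normal-approximation 95% interval around the sample mean."""
    x_bar = mean(sample)
    half_width = z_star * stdev(sample) / math.sqrt(len(sample))
    return x_bar - half_width, x_bar + half_width


covered = 0
for _ in range(trials):
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    low, high = confidence_interval(sample)
    if low <= MU <= high:
        covered += 1

coverage = covered / trials
print(z_star, coverage)
```

Each individual interval either contains $\mu$ or it does not; it is the procedure that succeeds in roughly 95% of the repetitions. For small $n$, replacing `z_star` by the matching t critical value would widen the interval.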

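7. Miscellaneous

1) Why is the binomial distribution called "binomial"?

I often mixed up the Bernoulli distribution and the Binomial distribution, simply because they sound and look alike. Once you know where the name Binomial comes from, I believe you will never mix them up again.

Binomial here comes from the binomial coefficient (the coefficient in Newton's binomial expansion), namely:
$C_{n}^{k} = \frac{n!}{k!(n-k)!}$
The Binomial distribution describes the number of successes in $n$ Bernoulli trials that are i.i.d. (independent and identically distributed). Put simply, picture this: toss the same coin $n$ times; the distribution gives the probability that it lands heads exactly $k$ times, where $p$ is the probability of heads on a single toss.
$P(X = k) = C_{n}^{k}p^{k}(1-p)^{n-k}$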
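The formula translates almost literally into code. This is a minimal sketch using `math.comb` for the binomial coefficient; the function name `binomial_pmf` and the fair-coin example are my own choices.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n i.i.d. Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Example: a fair coin (p = 0.5) tossed n = 10 times.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(pmf[5])    # probability of exactly 5 heads
print(sum(pmf))  # the probabilities over k = 0..n add up to 1
```

With $n = 1$ the same function gives $P(X=1) = p$ and $P(X=0) = 1-p$, i.e. a single Bernoulli trial.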

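2) Which distributions are important, and why?

Many distributions that one rarely hears about (e.g. the Dirichlet distribution, the beta distribution, ...) are in fact important. So the importance of those mentioned over and over in an undergraduate Probability and Statistics course (the Bernoulli, Binomial, Poisson, Gaussian, and Chi-square distributions) goes without saying.

As for why they are important, my personal understanding is that these distributions have nice mathematical properties and form a system (they can be converted into one another).

Gaussian distribution: it is important because the Central Limit Theorem gives it a central position among all distributions.
Bernoulli distribution and Binomial distribution: the Bernoulli distribution is the special case of the Binomial distribution with $n = 1$.
Poisson distribution and Binomial distribution: the Poisson distribution can be derived from the Binomial distribution. Note, however, that the derivation is less direct than one might expect. Both random variables are discrete (they count events), but the Binomial counts successes in a finite number $n$ of discrete trials, whereas the Poisson counts events over a continuous interval (e.g. of time). The derivation takes the limit $n\to\infty$, $p\to 0$ with $np=\lambda$ held fixed; in other words, it blurs the line between discrete trials and a continuous interval.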
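The relationship between the Binomial and Poisson distributions can be checked numerically: keep $\lambda = np$ fixed, let $n$ grow, and compare the probability of exactly $k$ events under both. The values $\lambda = 3$ and $k = 2$ below are arbitrary choices for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return lam**k * exp(-lam) / factorial(k)


lam = 3.0  # expected number of events, held fixed
k = 2

gaps = []
for n in (10, 100, 1000, 10000):
    p = lam / n  # more and more trials, each less and less likely to succeed
    gap = abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
    gaps.append(gap)
    print(n, gap)
```

The printed gap shrinks as $n$ grows, which is the numerical counterpart of the limit used in the derivation.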

3) A few notes on the Gaussian distribution

Another form of the Gaussian distribution:

$f(x) = \frac{1}{\sqrt{2\pi \sigma ^2}}e^{-\frac{(x-\mu)^2}{2\sigma ^2}} =\frac{1}{\sqrt{2\pi \sigma ^{2} e^{\left(\frac{x-\mu}{\sigma}\right)^{2}}}}$

z-score:
- Defined as the number of standard deviations a value lies away from the mean.
- Can be used with any distribution.
skew: a measure of how asymmetric a distribution is. Positive (right) skew means a long right tail; negative (left) skew means a long left tail.
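A short sketch tying the z-score to the Gaussian density: the alternative form simply moves the exponential under the square root as $e^{z^{2}}$ with $z = (x-\mu)/\sigma$, so both forms should return the same number. The function names and the example values (mean 100, standard deviation 15) are hypothetical.

```python
import math


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mu) / sigma


def gaussian_pdf(x, mu, sigma):
    """Usual form of the Gaussian density."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * sigma**2)
    return coefficient * math.exp(-((x - mu) ** 2) / (2.0 * sigma**2))


def gaussian_pdf_alt(x, mu, sigma):
    """Same density with exp(z^2) written inside the square root."""
    z = z_score(x, mu, sigma)
    return 1.0 / math.sqrt(2.0 * math.pi * sigma**2 * math.exp(z**2))


# Hypothetical example: a value of 130 when the mean is 100 and sigma is 15.
z = z_score(130.0, 100.0, 15.0)
print(z)
print(gaussian_pdf(130.0, 100.0, 15.0), gaussian_pdf_alt(130.0, 100.0, 15.0))
```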

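4) What is a sampling distribution?

First, look at the terms related to sampling distribution:

sampling distribution of the sample mean
sampling distribution of the sample variance
sampling distribution of the sample median

The simplest reading is: distribution of the mean, distribution of the variance, distribution of the median. The word "sample" only indicates that the values come directly from samples.

Moreover, a distribution can be plotted with any of these statistics as the independent variable (horizontal axis), because each statistic is itself a random variable.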
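To see a statistic behave as a random variable, one can draw many samples and record the sample mean of each; the recorded values form the sampling distribution of the sample mean. The sketch below assumes a uniform population on $[0, 1)$ (so $\mu = 1/2$, $\sigma = \sqrt{1/12}$) and samples of size 25, both chosen only for illustration.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the run is reproducible

# Assumed population: uniform on [0, 1).
mu = 0.5
sigma = math.sqrt(1.0 / 12.0)

n = 25         # size of each sample
trials = 5000  # number of samples drawn

# One sample mean per sample: each one is a realisation of the statistic.
sample_means = []
for _ in range(trials):
    sample = [random.random() for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.pstdev(sample_means)
expected_std = sigma / math.sqrt(n)  # sigma / sqrt(n)

print(mean_of_means, mu)
print(std_of_means, expected_std)
```

The mean of the sample means lands near $\mu$ and their spread near $\sigma/\sqrt{n}$. Swapping `statistics.fmean(sample)` for `statistics.median(sample)` would give the sampling distribution of the sample median instead.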

5) What is an estimator?

To be continued: example exercise of biased and unbiased estimators

What happens as the sample size approaches infinity (very important)
