Jupyter assignment

Anscombe's quartet

Anscombe's quartet comprises four datasets and is rather famous. Why? You'll find out in this exercise.


All modules:
%matplotlib inline
import random
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
sns.set_context("talk")

Part 1

For each of the four datasets...

  • Compute the mean and variance of both x and y
  • Compute the correlation coefficient between x and y
  • Compute the linear regression line: y = β0 + β1x + ϵ (hint: use statsmodels and look at the Statsmodels notebook)

Computing the means:

anscombe=pd.read_csv('data/anscombe.csv')
print('mean of x:')
print(anscombe.groupby("dataset").x.mean(),'\n')
print('mean of y:')
print(anscombe.groupby("dataset").y.mean(),'\n')

Result:


Computing the variances:
print('variance of x:')
print(anscombe.groupby("dataset").x.var(),'\n')
print('variance of y:')
print(anscombe.groupby("dataset").y.var(),'\n')

Result:


Correlation coefficient:

print('correlation coefficient between x and y:')
print(anscombe.groupby("dataset").x.corr(anscombe.y))
# print(anscombe.groupby("dataset").y.corr(anscombe.x)) # gives the same result as the line above

Result:

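The three statistics above can also be gathered into one table, which makes the near-identical values easier to compare. The following is a minimal sketch, not the original notebook's code: it types in Anscombe's published values instead of reading data/anscombe.csv, and the labels I to IV and the names quartet and summary are assumptions made for illustration.

```python
import pandas as pd

# Anscombe's published values, typed in so the sketch does not depend on
# data/anscombe.csv. The labels "I".."IV" are an assumption; the CSV may
# label the datasets differently.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = {
    "I": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}
quartet = pd.concat(
    [
        pd.DataFrame({"dataset": name, "x": x4 if name == "IV" else x123, "y": y})
        for name, y in ys.items()
    ],
    ignore_index=True,
)

# One row per dataset: means, sample variances (ddof=1, the pandas default)
# and the Pearson correlation between x and y.
rows = {}
for name, group in quartet.groupby("dataset"):
    rows[name] = {
        "mean_x": group.x.mean(),
        "mean_y": group.y.mean(),
        "var_x": group.x.var(),
        "var_y": group.y.var(),
        "corr": group.x.corr(group.y),
    }
summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary.round(3))
```

Computing the correlation inside each group keeps the pairing of x and y explicit, rather than relying on index alignment between a grouped column and the full y column.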

Linear regression equation:

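def regression(X,Y,num):
    print("dataset "+str(num)+':')
    # add_constant turns the Series x into a DataFrame with columns 'const' and 'x'
    X=sm.add_constant(X)
    est=sm.OLS(Y,X).fit()
    # look the coefficients up by label; positional access such as params[1] is deprecated in recent pandas
    slope,intercept=est.params['x'],est.params['const']
    print('y='+str(slope)+'x+'+str(intercept))
    # evaluate the fitted line over the observed range of x
    x=np.linspace(X.x.min(), X.x.max(),100)
    y=slope*x+intercept
    plt.figure()
    plt.scatter(X.x, Y, alpha=0.3)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.plot(x,y,color='r')
# assumes the CSV stores the four datasets as consecutive blocks of 11 rows
for i in range(4):
    subset=anscombe.iloc[i*11:(i+1)*11]
    regression(subset.x,subset.y,i+1)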

Results and fitted lines:

dataset 1:
y=0.5000909090909089x+3.0000909090909085


dataset 2:
y=0.4999999999999999x+3.000909090909091

dataset 3:
y=0.4997272727272726x+3.002454545454545

dataset 4:
y=0.49990909090909114x+3.0017272727272735
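The fitted coefficients can be cross-checked without statsmodels, because for a single predictor the least-squares slope equals cov(x, y) / var(x) and the intercept equals mean(y) minus slope times mean(x). The sketch below applies this closed form with NumPy to the first dataset only. The values are Anscombe's published ones, typed in rather than read from data/anscombe.csv, and the variable names are chosen for illustration.

```python
import numpy as np

# First dataset of the quartet, typed in from Anscombe's published values
# (an assumption that data/anscombe.csv holds the same numbers).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Sample covariance and variance, both with ddof=1 so the factor cancels.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)

slope = cov_xy / var_x                    # beta_1
intercept = y.mean() - slope * x.mean()   # beta_0

print(f"y={slope:.4f}x+{intercept:.4f}")  # y=0.5001x+3.0001
```

The result agrees with the dataset 1 line reported above to the printed precision, which also shows why four datasets with matching means, variances and correlations end up with nearly the same regression line.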

Part 2

Using Seaborn, visualize all four datasets.

hint: use sns.FacetGrid combined with plt.scatter

Code:

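def visualize(subset):
    # FacetGrid opens its own figure, so a separate plt.figure() would only leave an empty one behind
    g=sns.FacetGrid(subset)
    g.map(plt.scatter,'x','y')
# assumes the CSV stores the four datasets as consecutive blocks of 11 rows
for i in range(4):
    visualize(anscombe.iloc[i*11:(i+1)*11])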

Results:

dataset1:


dataset2:


dataset3:


dataset4:


