Curve Fitting and Regression Analysis


Time         10:00  11:00  12:00  13:00  14:00  15:00
Temperature  12℃    15℃    17℃    20℃    25℃    18℃


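Fitting: the function used to represent the relationship in the tabulated data only has to minimize the error under some chosen measure; it is not required to pass exactly through the data points.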

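Geometrically, fitting means: given a set of points in space, find a continuous surface of known form but unknown parameters that approximates those points as closely as possible. Interpolation instead finds one continuous surface (or several piecewise-smooth ones) that passes through the points. Fitting can therefore be used for extrapolation (for example, predicting the temperature at 17:00), whereas interpolation is generally used to obtain unknown function values inside the interpolated range (any moment between 10:00 and 15:00).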
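The contrast can be sketched on the temperature readings above. This is only an illustration: the straight-line model and the helper names `fit_line` and `interpolate` are assumptions made for the example, not something the data dictates.

```python
# Hourly readings from the table above (hours as plain numbers).
hours = [10, 11, 12, 13, 14, 15]
temps = [12, 15, 17, 20, 25, 18]

def fit_line(xs, ys):
    """Least-squares straight line y = a + b*x (the line model is an assumption)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def interpolate(xs, ys, x):
    """Piecewise-linear interpolation; only defined inside [xs[0], xs[-1]]."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the interpolation range")
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

intercept, slope = fit_line(hours, temps)

fitted_1230 = intercept + slope * 12.5        # fitted line does not pass through the data
interp_1230 = interpolate(hours, temps, 12.5) # interpolant passes through every reading
forecast_17 = intercept + slope * 17          # extrapolation: only the fit can do this
```

With these readings the fitted slope is 1.8 ℃ per hour. At 12:30 the fitted line gives about 17.8 ℃ while the interpolant gives 18.5 ℃, and the line extrapolates to about 25.9 ℃ at 17:00. That forecast ignores the drop already visible at 15:00, which shows how much an extrapolated value depends on the model chosen.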




Curve Fitting

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints. Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a “smooth” function is constructed that approximately fits the data.


A related topic is regression analysis, which focuses more on questions of statistical inference such as how much uncertainty is present in a curve that is fit to data observed with random errors. Fitted curves can be used as an aid for data visualization, to infer values of a function where no data are available, and to summarize the relationships among two or more variables. Extrapolation refers to the use of a fitted curve beyond the range of the observed data, and is subject to a degree of uncertainty since it may reflect the method used to construct the curve as much as it reflects the observed data.


Regression Analysis

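In statistics, regression analysis is a method for determining the quantitative relationship of interdependence between two or more variables. By the number of variables involved, it is divided into univariate and multivariate regression; by the number of dependent variables, into simple and multiple regression; and by the type of relationship between the independent and dependent variables, into linear and nonlinear regression.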

Regression is a statistical measurement used in finance, investing and other disciplines that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables).

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or ‘criterion variable’) changes when any one of the independent variables is varied, while the other independent variables are held fixed.


Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusions or false relationships, so caution is advisable.

In a narrower sense, regression may refer specifically to the estimation of continuous response (dependent) variables, as opposed to the discrete response variables used in classification. The case of a continuous dependent variable may be more specifically referred to as metric regression to distinguish it from related problems.




A regression problem is one where the output variable is a real or continuous value, such as “salary” or “weight”. Many different models can be used; the simplest is linear regression, which fits the data with the hyperplane that lies as close as possible to the points rather than passing exactly through them.
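If the best hyperplane is understood in the ordinary least-squares sense (an assumption here, since the text does not name a criterion), linear regression with p inputs chooses the coefficients that minimize the sum of squared residuals. The symbols n, p, x_ij, y_i, and beta_j below are introduced only for this illustration:

```latex
\hat{\beta} = \arg\min_{\beta_0,\dots,\beta_p}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2},
\qquad
\hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_j
```

The residuals are generally nonzero at the minimum, so the fitted hyperplane runs near the observations without touching most of them.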

A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease”. A classification model attempts to draw some conclusion from observed values. Given one or more inputs a classification model will try to predict the value of one or more outcomes.

Examples include labelling emails as “spam” or “not spam”, or labelling transactions as “fraudulent” or “authorized”. In short, classification builds a model from a training set whose records carry known class labels, then uses that model to predict the categorical label of new data. Classification models include logistic regression, decision tree, random forest, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
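As a rough sketch of the train-then-classify idea, the snippet below learns a single threshold, which is the simplest form of decision tree (a one-split "stump"). The training data, the link-count feature, and the function names are all hypothetical, and counting misclassifications is just one possible split criterion.

```python
# Hypothetical training set: number of links in each email and its known label.
link_counts = [0, 1, 1, 2, 6, 7, 8, 9]
labels = ["not spam", "not spam", "not spam", "not spam",
          "spam", "spam", "spam", "spam"]

def train_stump(xs, ys, positive="spam"):
    """Return the threshold t that misclassifies the fewest training points
    under the rule: predict `positive` when x > t."""
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between adjacent distinct feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((x > t) != (y == positive) for x, y in zip(xs, ys))

    return min(candidates, key=errors)

def classify(x, threshold, positive="spam", negative="not spam"):
    """Apply the learned rule to a new, unlabelled value."""
    return positive if x > threshold else negative

threshold = train_stump(link_counts, labels)
new_email_label = classify(10, threshold)
```

The output here is a category, not a number, which is the difference from the regression case above.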


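Francis Galton (1822-1911) studied medicine at Cambridge University in his early years, but a medical career held no appeal for him. He later received an inheritance that allowed him to give up medicine, and from 1850 to 1852 he went on an expedition to Africa; his achievements there earned him the gold medal of the Royal Geographical Society in 1853. He went on to study many subjects (meteorology, psychology, sociology, education, fingerprinting, and others). After 1865 his main interest turned to heredity, perhaps under the influence of his cousin Charles Darwin.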
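From the 1880s on, Galton began to think about the resemblance between parents and offspring in traits such as height, temperament, and various other characteristics. He chose as his object of study the relationship between the parents' average height X and the height Y of one of their sons. He observed 1074 pairs of parents and one son of each pair, plotted the results as a scatter plot, and found that the trend was close to a straight line. On the whole, as the parents' average height X increases, the son's height Y also tends to increase, which is what one would expect. The interesting part was this: the mean of the 1074 parental average heights was 68 inches (1 inch = 2.54 cm), while the mean height of the 1074 sons was 69 inches, 1 inch more than the parents. Galton therefore conjectured that when the parents' average height is 64 inches, their sons' average height should be 64 + 1 = 65 inches, and when the parents' average height is 72 inches, their sons' average height should be 72 + 1 = 73 inches. The observations, however, did not agree with this. In the first case the sons' average height was 67 inches, 3 inches above the parents' average; in the second it was 71 inches, 1 inch below the parents' average.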
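Galton's explanation, after studying this, was that nature exerts a constraining force that keeps human height relatively stable over a given period. If tall (or short) parents had children even taller (or shorter) than themselves, human stature would split toward the two extremes. Nature does not do this; instead it makes height regress toward the center. For example, parents with an average height of 72 inches exceed the parental mean of 68 inches, so they belong to the tall group. Their sons also tend to be tall (their average of 71 inches exceeds the sons' mean of 69 inches), but they are not as far from their own generation's mean as their parents are (71 - 69 < 72 - 68). Conversely, parents with an average height of 64 inches belong to the short group, and their sons also tend to be short (their average of 67 inches is below the sons' mean of 69 inches), but not as far from the center as their parents (69 - 67 < 68 - 64).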
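The figures quoted above are enough to sketch the regression line. Assuming the relationship is a straight line, as the scatter plot suggested, the line through the two group averages can be computed and compared with the naive "add one inch" rule. The variable and function names are illustrative only.

```python
# Group averages quoted above, in inches: (parents' average height, sons' average height).
short_parents = (64, 67)
tall_parents = (72, 71)
overall_means = (68, 69)

# Straight line through the two group averages (linearity is an assumption).
slope = (tall_parents[1] - short_parents[1]) / (tall_parents[0] - short_parents[0])
intercept = short_parents[1] - slope * short_parents[0]

def predicted_son_height(parent_height):
    """Sons' expected height on the line Y = intercept + slope * X."""
    return intercept + slope * parent_height

def naive_son_height(parent_height):
    """The rejected conjecture: sons are simply 1 inch taller than their parents."""
    return parent_height + 1

at_overall_mean = predicted_son_height(overall_means[0])
```

The slope comes out as 0.5 and the intercept as 35, so the line is Y = 35 + 0.5 X, and it passes through the overall means (68, 69). A slope below 1 is the regression effect itself: each inch by which the parents deviate from 68 shows up as only half an inch of deviation from 69 in their sons, whereas the naive rule corresponds to a slope of exactly 1.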


