Machine Learning - Linear Regression with Multiple Variables

This series of articles is the study notes of "Machine Learning", by Prof. Andrew Ng, Stanford University. This article is the notes of week 2, Linear Regression with Multiple Variables. It covers linear regression with multiple variables, gradient descent for multiple variables, how to choose features, polynomial regression, and the normal equation solution for linear regression.


Linear Regression with Multiple Variables


1. Multiple features


One feature (variable)

Here we will start to talk about a new version of linear regression that's more powerful, one that works with multiple variables or with multiple features. Here's what I mean. In the original version of linear regression that we developed, we had a single feature x, the size of the house, and we wanted to use that to predict y, the price of the house, and this was the form of our hypothesis.


Hypothesis: hθ(x) = θ0 + θ1x

Multiple features (variables)

But now imagine, what if we had not only the size of the house as a feature or as a variable with which to try to predict the price, but we also knew the number of bedrooms, the number of floors, and the age of the home in years. It seems like this would give us a lot more information with which to predict the price.


Notation:


Now that we have four features, we are going to use lowercase "n" to denote the number of features, so in this example n = 4. "m" is still the number of rows in the table, or the number of training examples, so if the table has 47 rows, m = 47. I'm also going to use x(i) to denote the input features of the ith training example. As a concrete example, x(2) is the vector of features for my second training example, so x(2) = (1416, 3, 2, 40)T, since those are the four features I have with which to try to predict the price of the second house. We're also going to use x(i) subscript j to denote the value of feature number j in the ith training example. So concretely, x(2) subscript 3 refers to feature number 3 in the vector x(2), which is equal to 2.
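As a minimal sketch of this indexing in Octave/Matlab: the second row matches x(2) = (1416, 3, 2, 40)T from the text, while the other rows and all variable names here are just illustrative.

    % Hypothetical 4-example, 4-feature housing table (size, bedrooms, floors, age)
    X = [2104 5 1 45;
         1416 3 2 40;
         1534 3 2 30;
          852 2 1 36];

    m = size(X, 1);      % number of training examples, here 4
    n = size(X, 2);      % number of features, here 4
    x_2   = X(2, :)';    % x(2): feature vector of the 2nd example, (1416, 3, 2, 40)'
    x_2_3 = X(2, 3);     % x(2) subscript 3: its 3rd feature, equal to 2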

For example


Hypothesis: hθ(x) = θ0 + θ1x1 + θ2x2 + ... + θnxn

For convenience of notation, let me define x0 = 1. Concretely, this means that for every example i I have a feature vector x(i), and x(i) subscript 0 is going to be equal to 1. You can think of this as defining an additional zeroth feature.


In this case, the hypothesis can be written compactly as hθ(x) = θ0x0 + θ1x1 + ... + θnxn = θTx, the inner product of the parameter vector θ and the feature vector x.
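A minimal sketch of this convention in Octave/Matlab; the parameter values are made up purely for illustration.

    theta = [80; 0.1; 20; 5; -2];     % example (n+1)-dimensional parameter vector, n = 4
    x     = [1; 1416; 3; 2; 40];      % feature vector with x0 = 1 prepended
    h     = theta' * x;               % hypothesis h_theta(x) = theta' * x (a single number)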


2. Gradient descent for multiple variables

Let's talk about how to fit the parameters of that hypothesis. In particular, let's talk about how to use gradient descent for linear regression with multiple features. To quickly summarize our notation, this is the form of the hypothesis in multivariate linear regression, where we've adopted the convention that x0 = 1. The parameters of this model are θ0 through θn, but instead of thinking of these as n separate parameters, which is valid, I'm instead going to think of the parameters as θ, where θ here is an (n+1)-dimensional vector. So I'm just going to think of the parameters of this model as themselves being a vector.

Hypothesis: hθ(x) = θTx = θ0x0 + θ1x1 + ... + θnxn (with x0 = 1)

Parameters: θ = (θ0, θ1, ..., θn), an (n+1)-dimensional vector


Cost function: J(θ) = (1/2m) Σ(i=1..m) (hθ(x(i)) − y(i))²


Gradient descent algorithm: repeat { θj := θj − α (1/m) Σ(i=1..m) (hθ(x(i)) − y(i)) xj(i) }, simultaneously updating θj for j = 0, 1, ..., n.


Note that, when n = 1, this is exactly the gradient descent update we had for linear regression with a single feature; the new algorithm simply repeats the same kind of update for every θj.
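Here is a minimal vectorized sketch of this in Octave/Matlab. It assumes X is the m×(n+1) matrix whose first column is all ones, y is the m×1 vector of targets, and alpha and num_iters are chosen by hand; the function names are hypothetical.

    % Cost function J(theta) for multivariate linear regression
    function J = computeCostMulti(X, y, theta)
      m = length(y);
      J = (1 / (2 * m)) * sum((X * theta - y) .^ 2);
    end

    % Batch gradient descent: simultaneously updates every theta_j on each iteration
    function theta = gradientDescentMulti(X, y, theta, alpha, num_iters)
      m = length(y);
      for iter = 1:num_iters
        h = X * theta;                                 % predictions, m x 1
        theta = theta - (alpha / m) * (X' * (h - y));  % gradient step on all parameters at once
      end
    end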


3. Features and Polynomial Regression

We now know about linear regression with multiple variables. In this section, we will learn a bit about the choice of features and how you can get different learning algorithms, sometimes very powerful ones, by choosing appropriate features. In particular, I also want to tell you about polynomial regression, which allows you to use the machinery of linear regression to fit very complicated, even very non-linear, functions.

Choose your features

Let's take the example of predicting the price of a house. Suppose you have two features, the frontage of the house and the depth of the house. Here's the picture of the house we're trying to sell: the frontage is basically the width of the lot you own, how wide your property is, and the depth is how deep your property is. So there's a frontage and there's a depth.


You might build a linear regression model like this, where frontage is your first feature x1 and depth is your second feature x2. But when you're applying linear regression, you don't necessarily have to use just the features x1 and x2 that you're given.


What you can do is actually create new features by yourself. So, if I want to predict the price of a house, what I might do instead is decide that what really determines the size of the house is the area, or the land area, that I own. So I might create a new feature: I'll just call this feature x, which is frontage × depth.

New feature: Area
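A one-line sketch in Octave/Matlab, assuming frontage and depth are column vectors with one entry per training example (the variable names are illustrative):

    x = frontage .* depth;   % new single feature: the land area of each lot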

Polynomial Regression

Closely related to the idea of choosing your features is an idea called polynomial regression. Let's say you have a housing price data set that looks like this. There are a few different models you might fit to it. It doesn't look like a straight line fits this data very well, so one thing you could do is fit a quadratic model, where you think the price is a quadratic function of the size; maybe that will give you a fit to the data that looks like that.

But then you may decide that the quadratic model doesn't make sense, because a quadratic function eventually comes back down, and we don't think housing prices should go down when the size gets too large. So maybe we might choose a different polynomial model and instead use a cubic function, where we now have a third-order term; if we fit that, maybe we get this sort of model, and maybe the green line is a somewhat better fit to the data because it doesn't eventually come back down. So how do we actually fit a model like this to our data? Using the machinery of multivariate linear regression, we can do this with a pretty simple modification to our algorithm. The form of the hypothesis we know how to fit looks like this:


And if we want to fit this cubic model, what we're saying is that to predict the price of a house, it's θ0 plus θ1 times the size of the house, plus θ2 times the square of the size of the house (so this term is equal to that term), and then plus θ3 times the cube of the size of the house, that is, the size raised to the third power.


In order to map these two definitions to each other, the natural way to do it is to set the first feature x1 to be the size of the house, the second feature x2 to be the square of the size of the house, and the third feature x3 to be the cube of the size of the house.


And, just by choosing my three features this way and applying the machinery of linear regression, I can fit this model and end up with a cubic fit to my data. I just want to point out one more thing: if you choose your features like this, then feature scaling becomes increasingly important (which we will talk about later).
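A minimal sketch in Octave/Matlab of building and scaling these cubic features; house_size is an assumed m×1 column vector of house sizes.

    x1 = house_size;         % size
    x2 = house_size .^ 2;    % size squared
    x3 = house_size .^ 3;    % size cubed
    X  = [ones(length(house_size), 1), x1, x2, x3];   % design matrix with x0 = 1

    % Because size^2 and size^3 have very different ranges, normalize each feature column
    mu    = mean(X(:, 2:end));
    sigma = std(X(:, 2:end));
    X(:, 2:end) = (X(:, 2:end) - mu) ./ sigma;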

4. Normal Equation

Next, we'll talk about the normal equation, which for some linear regression problems will give us a much better way to solve for the optimal value of the parameters θ. Concretely, so far the algorithm we've been using for linear regression is gradient descent: in order to minimize the cost function J(θ), we take this iterative algorithm, which requires many steps, multiple iterations of gradient descent, to converge to the global minimum.

Gradient descent 

Normal equation: Method to solve for θ analytically.

In contrast, the normal equation would give us a method to solve for θ analytically, so that rather than needing to run this iterative algorithm, we can instead just solve for the optimal value for theta all at one go, so that in basically one step you get to the optimal value right there.

It turns out the normal equation has some advantages and some disadvantages, but before we get to that and talk about when you should use it, let's get some intuition about what this method does.


The cost function J(θ) looks like that. The way to minimize a function is to take derivatives and set them equal to zero. So you take the partial derivative of J with respect to each parameter θj, get some formula, and set that derivative equal to zero.


Then solve for the values of θ0, θ1, ..., θn that minimize J(θ).

Example: m=4


The matrix X is called the design matrix.


Normal equation: θ = (XTX)-1XTy

where (XTX)-1 is the inverse of the matrix XTX.

Octave or Matlab code to calculate θ: pinv(X'*X)*X'*y
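Putting the pieces together, here is a minimal Octave/Matlab sketch on the m = 4 example above; the prices in y and the variable names are illustrative.

    data = [2104 5 1 45;
            1416 3 2 40;
            1534 3 2 30;
             852 2 1 36];
    y = [460; 232; 315; 178];          % prices (illustrative, in $1000s)

    m = size(data, 1);
    X = [ones(m, 1), data];            % design matrix: prepend the x0 = 1 column
    theta = pinv(X' * X) * X' * y;     % normal equation: theta = (X'X)^(-1) X' y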

What if XTX is non-invertible?

Reasons why the matrix may be non-invertible:

  • Redundant features (features that are linearly dependent).

           E.g.  x1 = size in feet2

             x2 = size in m2

  • Too many features (e.g. m ≤ n).

           In this case, delete some features, or use regularization.

If A is a non-invertible matrix, you can use the Octave or Matlab function pinv(A) to compute the pseudo-inverse of A.

5. Gradient Descent vs. Normal Equation


Here are some advantages and disadvantages of gradient descent and the normal equation. Gradient descent works pretty well even when you have a very large number of features. The normal equation solves for θ directly and is fast when the number of features is not too large.


If the number of features is not very large, the normal equation is better. Exactly how large the set of features has to be before you switch to gradient descent is hard to pin down with a strict number, but for me it is usually around ten thousand features that I might start to consider switching over to gradient descent, or maybe some other algorithms that we'll talk about later in this class.

To summarize, as long as the number of features is not too large, the normal equation gives us a great alternative method to solve for the parameter θ. Concretely, as long as the number of features is less than about 1000, I would usually use the normal equation method rather than gradient descent. But as we get to more complex learning algorithms, for example when we talk about classification algorithms like logistic regression, we'll see that the normal equation method actually does not work for those more sophisticated learning algorithms, and we will have to resort to gradient descent for them. So gradient descent is a very useful algorithm to know.
