what is supervised ml?
ML systems learn how to combine inputs to produce useful predictions on never-before-seen data.
feature
x, the input variable
label
y in linear regression; the observation or output, i.e. what we are predicting. must be an observable and quantifiable metric.
model
a model defines the relationship between features and label.
regression vs classification
- a regression model predicts continuous values.
- a classification model predicts discrete values.
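a minimal sketch of the contrast (the functions and all numbers are invented for illustration):

```python
def regression_model(sqft):
    """predicts a continuous value, e.g. a house price in dollars."""
    return 150.0 * sqft + 20000.0

def classification_model(sqft):
    """predicts a discrete value, e.g. one of a fixed set of price tiers."""
    return "expensive" if regression_model(sqft) > 500000 else "affordable"

print(regression_model(2000))      # a continuous number
print(classification_model(2000))  # one of a fixed set of classes
```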
bias
the y-intercept b in the traditional equation of a line
weight
the same concept as the slope m in the traditional equation of a line, y = mx + b
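putting bias and weight together gives the linear model; a minimal sketch with invented values:

```python
def predict(x, weight, bias):
    """linear model: y' = weight * x + bias (weight is the slope, bias the y-intercept)."""
    return weight * x + bias

# with slope 2.0 and y-intercept 0.5, an input of 3.0 predicts 6.5
print(predict(3.0, weight=2.0, bias=0.5))
```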
inference
prediction
mse
mean squared error
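mse can be sketched in a few lines (toy numbers, not from any real dataset):

```python
def mse(labels, predictions):
    """mean squared error: the average squared difference between label and prediction."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)

# errors are 0, 0, and -2, so mse = (0 + 0 + 4) / 3
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```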
gradient
a vector of partial derivatives with respect to the weights. intuitively, a partial derivative tells you how much the function changes when you perturb one variable a bit.
in f(x,y) = e^(2y)sin(x),
f'x = e^(2y)cos(x)
f'x(0,1) = e^2 ≈ 7.4
so when we start at (0,1), hold y constant, and move x a little, f changes by about 7.4 times the amount that you changed x.
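the partial derivative above can be checked numerically; a minimal sketch, assuming f(x,y) = e^(2y)sin(x):

```python
import math

def f(x, y):
    return math.exp(2 * y) * math.sin(x)

# finite-difference estimate of the partial derivative with respect to x at (0, 1):
# hold y fixed, nudge x, and see how much f changes per unit of x
h = 1e-6
df_dx = (f(0 + h, 1) - f(0, 1)) / h

# analytically, f'x(0,1) = e^(2*1) * cos(0) = e^2 ≈ 7.389
print(df_dx)
```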
the gradient points in the direction of greatest increase of the function.
the negative of the gradient moves you in the direction of maximum decrease in height.
when performing gradient descent, we generalize the above process to tune all the model parameters simultaneously. for example, to find the optimal values of both w1 and bias b, we calculate the gradients with respect to both w1 and b. next, we modify the values of w1 and b based on their respective gradients. then we repeat these steps until we reach minimum loss.
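the loop above can be sketched for a one-feature linear model; a minimal sketch, with the toy dataset and learning rate invented for illustration:

```python
# toy dataset generated by y = 2x + 1, so the optimal parameters are w1 = 2, b = 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w1, b = 0.0, 0.0
learning_rate = 0.05
n = len(xs)

for step in range(2000):
    # gradients of mse with respect to w1 and b
    grad_w1 = sum(2 * (w1 * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w1 * x + b - y) for x, y in zip(xs, ys)) / n
    # move both parameters against their gradients, scaled by the learning rate
    w1 -= learning_rate * grad_w1
    b -= learning_rate * grad_b

print(w1, b)  # w1 approaches 2.0 and b approaches 1.0
```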
learning rate
aka step size.
the gradient vector has both a direction and a magnitude. gradient descent algorithms multiply the gradient by a scalar known as the learning rate to determine the next point. for example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.
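the worked example above is just a multiplication; as a sketch:

```python
gradient_magnitude = 2.5
learning_rate = 0.01

# the next point lies this far from the previous point
step_size = gradient_magnitude * learning_rate
print(step_size)  # 0.025
```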
hyperparameters
the knobs that programmers tweak in machine learning algorithms, like learning rate, batch size, and steps/epochs/iterations. most machine learning programmers spend a fair amount of time tuning the learning rate.
batch
in gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration.
stochastic gradient descent (sgd): batch size of 1, only a single example per iteration. sgd works but is very noisy.
mini-batch stochastic gradient descent (mini-batch sgd): a mini-batch is typically between 10 and 1,000 examples, chosen at random. mini-batch sgd reduces the amount of noise in sgd but is still more efficient than full-batch gradient descent.
total number of trained examples = batch size * steps
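for example, with an invented batch size and step count:

```python
batch_size = 100   # examples used per gradient computation (invented value)
steps = 500        # iterations of gradient descent (invented value)

# total number of trained examples (individual examples may be reused across steps)
total_trained_examples = batch_size * steps
print(total_trained_examples)  # 50000
```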
periods
controls the granularity of reporting. for example, if periods is set to 7 and steps is set to 70, then the exercise will output the loss value every 10 steps, i.e. 7 times in total.
number of training examples in each period = batch size * steps / periods
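plugging in the numbers from the periods example above (steps = 70, periods = 7, with an invented batch size):

```python
batch_size = 100  # invented for illustration
steps = 70
periods = 7

examples_per_period = batch_size * steps // periods
steps_per_report = steps // periods

print(examples_per_period)  # 1000
print(steps_per_report)     # 10, i.e. loss is reported every 10 steps
```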