key terminology in supervised machine learning

what is supervised ml?

supervised ML systems learn from labeled examples how to combine input features to produce useful predictions on never-before-seen data.

feature

x, the input variable

label

y, the output we are predicting (the observation); in linear regression, the label is the y value. a label must be an observable and quantifiable metric.

model

a model defines the relationship between features and label.

regression vs classification

  • a regression model predicts continuous values.
  • a classification model predicts discrete values.
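for example (illustrative cases): predicting the price of a house is a regression task, while predicting whether an email is spam or not spam is a classification task.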

bias

y-intercept

weight

is the same concept as the slope m in the traditional equation of a line, y = mx + b
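putting bias and weight together, a minimal sketch of a one-feature linear model (the function name, parameter values, and sample input below are illustrative, not from the text):

```python
# one-feature linear model: y' = b + w1 * x1
# the bias b plays the role of the y-intercept, the weight w1 the role of the slope
def predict(x1, w1, b):
    return b + w1 * x1

print(predict(3.0, 2.0, 0.5))  # 0.5 + 2.0 * 3.0 = 6.5
```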

inference

prediction

mse

mean squared error
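a minimal sketch of the computation (the labels and predictions below are made-up values):

```python
# mean squared error: the average of the squared differences
# between the labels and the model's predictions
def mse(labels, predictions):
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)

print(mse([2.0, 4.0], [3.0, 4.0]))  # ((2 - 3)^2 + (4 - 4)^2) / 2 = 0.5
```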

gradient

a vector of partial derivatives of the loss with respect to the weights. intuitively, a partial derivative tells you how much the function changes when you perturb one variable a bit.
in f(x, y) = e^(2y) * sin(x),
f'x = e^(2y) * cos(x)
f'x(0, 1) = e^2 * cos(0) ≈ 7.4
so when we start at (0, 1), hold y constant, and move x a little, f changes by about 7.4 times the amount that you changed x.
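a quick numerical check of that partial derivative (the finite-difference step h is an arbitrary small value):

```python
import math

# f(x, y) = e^(2y) * sin(x); its partial derivative with respect to x is e^(2y) * cos(x)
def f(x, y):
    return math.exp(2 * y) * math.sin(x)

h = 1e-6  # perturb x a little, holding y constant
finite_difference = (f(0 + h, 1) - f(0, 1)) / h
exact = math.exp(2) * math.cos(0)   # e^2 ≈ 7.389
print(finite_difference, exact)     # both are roughly 7.4
```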

the gradient points in the direction of greatest increase of the function.
the negative of the gradient moves you in the direction of maximum decrease in height.

when performing gradient descent, we generalize this process to tune all the model parameters simultaneously. for example, to find the optimal values of both the weight w1 and the bias b, we calculate the gradients of the loss with respect to both w1 and b, modify the values of w1 and b based on their respective gradients, then repeat these steps until we reach minimum loss.
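a minimal sketch of that loop for a one-feature linear model trained with mean squared error (the toy data, learning rate, and step count are illustrative choices):

```python
# toy data generated by y = 1 + 2x, so the optimum is w1 = 2, b = 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w1, b = 0.0, 0.0
learning_rate = 0.05

for step in range(2000):
    # gradients of the MSE loss with respect to w1 and b
    grad_w1 = sum(2 * (b + w1 * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (b + w1 * x - y) for x, y in zip(xs, ys)) / len(xs)
    # move both parameters a small step against their gradients
    w1 -= learning_rate * grad_w1
    b -= learning_rate * grad_b

print(w1, b)  # approaches (2.0, 1.0)
```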

learning rate

aka step size.
the gradient vector has both a direction and a magnitude. gradient descent algorithms multiply the gradient by a scalar known as the learning rate to determine the next point. for example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.
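the arithmetic from that example, spelled out (the starting point below is an arbitrary made-up value):

```python
learning_rate = 0.01
gradient = 2.5                       # gradient magnitude from the example above
step = learning_rate * gradient      # 0.025

current_point = 1.0                  # illustrative starting value
next_point = current_point - step    # move against the gradient: 0.975
print(step, next_point)
```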

hyperparameters

the knobs that programmers tweak in machine learning algorithms, like the learning rate, the batch size, and the number of steps/epochs/iterations. most machine learning programmers spend a fair amount of time tuning the learning rate.

batch

in gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration.

  • stochastic gradient descent (sgd): a batch size of 1, i.e. only a single example per iteration. sgd works but is very noisy.
  • mini-batch stochastic gradient descent (mini-batch sgd): a mini-batch is typically between 10 and 1,000 examples, chosen at random. mini-batch sgd reduces the amount of noise in sgd but is still more efficient than full-batch gradient descent (see the sketch below).
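a minimal sketch of one mini-batch sgd iteration (the toy data, parameter values, and batch size are illustrative; a batch_size of 1 would make it plain sgd):

```python
import random

# (x, label) pairs standing in for the full training set
examples = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
w1, b = 0.0, 0.0
batch_size = 2

# draw a random mini-batch and compute the gradient from that sample only
batch = random.sample(examples, batch_size)
grad_w1 = sum(2 * (b + w1 * x - y) * x for x, y in batch) / batch_size
grad_b = sum(2 * (b + w1 * x - y) for x, y in batch) / batch_size
print(grad_w1, grad_b)
```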

total number of trained examples = batch size * steps

periods

controls the granularity of reporting. for example, if periods is set to 7 and steps is set to 70, then the exercise will output the loss value every 10 steps or 7 times.
number of training examples in each period = batch size * steps / periods
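for example (numbers illustrative): with a batch size of 50, 70 steps, and 7 periods, training processes 50 * 70 = 3,500 examples in total and 50 * 70 / 7 = 500 examples per period.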
