Hinton's lectures (Neural Networks for Machine Learning), lecture 5 to lecture 9

NN

perceptron

pretty limited
has no way to handle data that is not linearly separable
hidden units are needed

soft-max loss function

$$\frac{\partial C}{\partial z_i} = y_i - t_i$$

where

$$C = -\sum_j t_j \log y_j$$

where $t_j$ is the target value and $\sum_j t_j = 1$.

$$\frac{\partial C}{\partial z_i} = \sum_j \frac{\partial C}{\partial y_j}\frac{\partial y_j}{\partial z_i}$$

$$y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

so
when $i = j$

$$\frac{\partial y_j}{\partial z_i} = y_i(1 - y_i)$$

else

$$\frac{\partial y_j}{\partial z_i} = -y_i y_j$$

Besides,

$$\frac{\partial C}{\partial y_j} = -\frac{t_j}{y_j}$$

thus

$$\sum_j \frac{\partial C}{\partial y_j}\frac{\partial y_j}{\partial z_i} = -t_i(1-y_i) + \sum_{j \neq i}\left(-\frac{t_j}{y_j}\right)(-y_i y_j) = -t_i(1-y_i) + \sum_{j \neq i} t_j y_i = -t_i + t_i y_i + \sum_{j \neq i} t_j y_i = -t_i + y_i\sum_j t_j = y_i - t_i$$
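The result $\partial C/\partial z_i = y_i - t_i$ can be verified numerically with a finite-difference check; a minimal NumPy sketch (the values of `z` and `t` are arbitrary illustrations):

```python
import numpy as np

# Numerical check: for C = -sum_j t_j * log(y_j) with softmax outputs y,
# the gradient with respect to the logits z is simply y - t.
def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0])
t = np.array([0.0, 0.0, 1.0])        # one-hot target, so sum_j t_j = 1

analytic = softmax(z) - t

# central-difference gradient, one logit at a time
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[i], t) -
     cross_entropy(z - eps * np.eye(3)[i], t)) / (2 * eps)
    for i in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```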

5a Things that make it hard to recognize objects

  • segmentation: real scenes are cluttered with other objects
    • it is hard to tell which pieces go together as parts of the same object
    • parts of an object can be hidden behind other objects
  • lighting: the intensities of the pixels are determined as much by the lighting as by the objects.
  • deformation: objects can deform in a variety of non-affine ways
  • affordances: object classes are often defined by how they are used.
  • viewpoint: changes in viewpoint cause changes in images that standard learning methods cannot cope with.

5b how to achieve viewpoint invariance

  • use redundant invariant features
    • but for recognition, we must avoid forming features from parts of different objects
  • put a box around the object and use normalized pixels
    • but choosing such a box is very difficult and we need to recognize the shape to get the box right!
    • the brute force normalization approach: try all possible boxes in a range of positions and scales
  • use replicated features with pooling (CNNs)
  • use a hierarchy of parts that have explicit poses relative to the camera

5c cnn for hand-written digit recognition

BP for CNNs

if we need $w_1 = w_2$ (because of weight sharing),
we need $\Delta w_1 = \Delta w_2$,
and thus we compute $\frac{\partial E}{\partial w_1}$ and $\frac{\partial E}{\partial w_2}$ and use $\frac{\partial E}{\partial w_1} + \frac{\partial E}{\partial w_2}$ to update both $w_1$ and $w_2$
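This sum-of-gradients rule for tied weights can be sketched on a toy linear unit (the numbers here are made up for illustration):

```python
import numpy as np

# Weight sharing: a tiny net y = w1*x1 + w2*x2 with the constraint w1 == w2.
# We compute dE/dw1 and dE/dw2 separately, then apply their sum to both
# weights, so they stay equal after every update.
w = np.array([0.5, 0.5])            # w1 == w2 initially
x = np.array([1.0, 2.0])
t = 4.0                              # target
lr = 0.1

y = w @ x
grad = (y - t) * x                   # [dE/dw1, dE/dw2] for E = 0.5*(y - t)^2
shared = grad.sum()                  # dE/dw1 + dE/dw2
w -= lr * shared                     # same update applied to both weights

print(w[0] == w[1])                  # True: the constraint is preserved
```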

what does the replicating the feature detectors achieve?

  • equivariant activities
  • invariant knowledge

pooling

  • translational invariance
  • reducing the number of inputs to the next layer
  • problem: lose information about the precise position.
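A toy 1-D max-pooling example makes this trade-off concrete (the `max_pool_1d` helper is mine, not from the lecture):

```python
import numpy as np

# Max pooling over non-overlapping windows: a small translation of the
# input can leave the pooled output unchanged (translational invariance),
# but the precise position inside each window is lost.
def max_pool_1d(x, size=2):
    return x.reshape(-1, size).max(axis=1)

a = np.array([0.0, 9.0, 0.0, 0.0])   # strong activation at position 1
b = np.array([9.0, 0.0, 0.0, 0.0])   # same activation shifted to position 0

print(max_pool_1d(a))                 # [9. 0.]
print(max_pool_1d(b))                 # [9. 0.]  -- identical: position is lost
```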

LeNet

Here is the architecture of LeNet-5.


5d CNNs for object recognition

from hand-written digits to 3-D objects


6a overview of mini-batch gradient descent

  • online learning: update the weights after each training case; however, mini-batches are usually better than fully online updates.
  • stochastic gradient descent

6b tricks of stochastic gradient descent

  • initializing weights with small random values
  • shifting the inputs: e.g. the cases (101, 101) → 2 and (101, 99) → 0 become (1, 1) → 2 and (1, −1) → 0 after subtracting 100 from each component
  • scaling the inputs: e.g. the cases (0.1, 10) → 2 and (0.1, −10) → 0 become (1, 1) → 2 and (1, −1) → 0 after rescaling each component
  • decorrelating the input components: PCA (Principal Components Analysis)
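Decorrelation via PCA can be sketched as follows with NumPy (the data here is synthetic):

```python
import numpy as np

# Decorrelate input components with PCA: project centered data onto the
# eigenvectors of its covariance matrix; the transformed components then
# have a diagonal covariance (zero correlation between components).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X[:, 1] += 0.8 * X[:, 0]             # introduce correlation between components

Xc = X - X.mean(axis=0)              # center the data
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)
Z = Xc @ eigvecs                     # rotate onto the principal components

cov_z = Z.T @ Z / len(Z)
print(np.round(cov_z, 6))            # off-diagonal entries are ~0
```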

Four ways to speed up mini-batch learning

  1. Use “momentum”
  2. Use separate adaptive learning rates for each parameter
  3. rmsprop
  4. Take a fancy method from the optimization literature that makes use of curvature information.

6c The momentum method

$$v(t) = \alpha v(t-1) - \varepsilon \frac{\partial E}{\partial w}(t)$$

where $\alpha$ is slightly less than 1.

$$\Delta w(t) = v(t) = \alpha v(t-1) - \varepsilon \frac{\partial E}{\partial w}(t) = \alpha \Delta w(t-1) - \varepsilon \frac{\partial E}{\partial w}(t)$$

A better type of momentum

Nesterov 1983

  • The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient.
  • Ilya Sutskever(2012): First make a big jump in the direction of the previous accumulated gradient. Then measure the gradient where you end up and make a correction.

It’s better to correct a mistake after you have made it!
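Both variants can be sketched on a 1-D quadratic (the parameter values below are illustrative, not from the lecture):

```python
# Standard vs. Nesterov momentum on the quadratic E(w) = 0.5*w^2, so
# dE/dw = w. Standard momentum evaluates the gradient at the current
# point; Nesterov first jumps along the accumulated velocity, measures
# the gradient at the landing point, and then corrects.
def grad(w):
    return w

alpha, eps = 0.9, 0.1
w_std, v_std = 5.0, 0.0
w_nes, v_nes = 5.0, 0.0

for _ in range(300):
    # standard momentum: v(t) = alpha*v(t-1) - eps*dE/dw(t)
    v_std = alpha * v_std - eps * grad(w_std)
    w_std += v_std
    # Nesterov: gradient measured after the jump alpha*v
    v_nes = alpha * v_nes - eps * grad(w_nes + alpha * v_nes)
    w_nes += v_nes

print(abs(w_std) < 1e-3, abs(w_nes) < 1e-3)  # both converge toward 0
```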


6d A separate, adaptive learning rate for each connection

each connection in the NN should have its own adaptive learning rate.
The magnitudes of the gradients are often very different for different layers

One way to determine the individual learning rates

  • Start with a local gain of 1 for every weight.
  • Increase the local gain if the gradient for that weight does not change sign

$$\Delta w_{ij} = -\varepsilon\, g_{ij} \frac{\partial E}{\partial w_{ij}}$$

if

$$\frac{\partial E}{\partial w_{ij}}(t)\,\frac{\partial E}{\partial w_{ij}}(t-1) > 0$$

then

$$g_{ij}(t) = g_{ij}(t-1) + \delta$$

else

$$g_{ij}(t) = g_{ij}(t-1) \times (1 - \delta)$$

for example $\delta = 0.05$
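A sketch of this gain-update rule (the `update_gains` helper and the gradient values are mine, for illustration):

```python
import numpy as np

# Adaptive per-weight gains: increase a weight's gain additively while its
# gradient keeps the same sign, decrease it multiplicatively on a sign flip.
# delta = 0.05 as in the lecture.
def update_gains(gains, grad, prev_grad, delta=0.05):
    same_sign = grad * prev_grad > 0
    return np.where(same_sign, gains + delta, gains * (1 - delta))

gains = np.ones(3)                    # start with a local gain of 1
prev_grad = np.array([0.2, -0.1, 0.3])
grad = np.array([0.1, 0.2, 0.25])     # sign flipped only for the 2nd weight

gains = update_gains(gains, grad, prev_grad)
print(gains)                           # [1.05 0.95 1.05]

eps = 0.01
delta_w = -eps * gains * grad          # the actual weight update
```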


7a Modeling sequences: A brief overview

targets

  • turn an input sequence into an output sequence that lives in a different domain.
  • predict the next term in the input sequence.
  • memoryless models for sequences

    • Autoregressive models
    • Feed-forward neural nets: generalizing autoregressive models by using one or more layers of non-linear hidden units.
  • Linear Dynamical Systems: have hidden units that store information

  • Hidden Markov Models(HMM): have a discrete one-of-N hidden state. Transitions between states are stochastic and controlled by a transition matrix. The outputs produced by a state are stochastic. More detailed information about HMM
  • Recurrent neural networks (I will provide a clear picture of RNNs in my new blog)

7b Training RNNs with backpropagation


7c A toy example of training an RNN


7d Why is it difficult to train RNNs?

  • The backward pass is linear
  • The problem of exploding or vanishing gradients
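The exploding/vanishing effect can be illustrated with a toy linear backward pass (the matrices and step count are arbitrary):

```python
import numpy as np

# Because the backward pass of an RNN is linear, the error signal is
# repeatedly multiplied by the transpose of the recurrent weight matrix.
# Its largest singular value s makes gradients shrink (s < 1) or blow up
# (s > 1) exponentially with the number of time steps.
T = 50                                   # number of time steps

norms = {}
for scale in (0.5, 1.5):                 # contractive vs. expansive weights
    W = scale * np.eye(4)                # toy recurrent weight matrix
    g = np.ones(4)                       # error signal at the last time step
    for _ in range(T):
        g = W.T @ g                      # one step of the linear backward pass
    norms[scale] = np.linalg.norm(g)
    print(scale, norms[scale])           # tiny for 0.5, enormous for 1.5
```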

Four effective ways to learn an RNN

  • Long Short Term Memory
    • Hochreiter & Schmidhuber (1997)
  • Hessian Free Optimization
  • Echo State Networks
  • Good initialization with momentum

8a HF Optimization

I will come back later.


8b modeling character strings with multiplicative connections

why model character strings?

  • The web is composed of character strings.
  • pre-processing text to get words is a big hassle.

9a overview of ways to improve generalization

  • overfitting: the model cannot tell which regularities are real and which are caused by sampling error.

how to prevent overfitting

  • more data
  • use a model with the right capacity
  • average many different models
  • a single NN architecture, but make predictions with many different weight vectors

how to limit the capacity of a NN

  • architecture: limit the number of hidden layers and units per layer
  • early stopping
  • weight-decay
  • add noise to the weights or the activities.

cross-validation

  • training set
  • validation set
  • test set
    the N error estimates from N-fold cross-validation are not independent.

early stopping

however, it’s hard to decide when performance is getting worse.


9b limiting the size of the weights

The standard L2 weight penalty involves adding an extra term to the cost function that penalizes the squared weights.

This keeps the weights small unless they have big error derivatives. It prevents the network from using weights that it doesn't need.

$$C = E + \frac{\lambda}{2}\sum_i w_i^2$$

$$\frac{\partial C}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda w_i$$

when

$$\frac{\partial C}{\partial w_i} = 0,$$

$$w_i = -\frac{1}{\lambda}\frac{\partial E}{\partial w_i}$$
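This fixed-point relation can be checked with a toy quadratic error (illustrative values):

```python
import numpy as np

# Gradient descent on C = E + (lambda/2) * w^2 with the quadratic error
# E(w) = 0.5 * (w - 3)^2. At the minimum, dC/dw = 0 implies
# w = -(1/lambda) * dE/dw, which we verify after convergence.
lam, lr = 0.5, 0.1
w = 0.0
for _ in range(1000):
    dE = w - 3.0                       # dE/dw
    w -= lr * (dE + lam * w)           # dC/dw = dE/dw + lambda*w

dE = w - 3.0
print(np.isclose(w, -dE / lam))        # True at the minimum (w = 2, dE = -1)
```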

9c Using noise as a regularizer

add noise to the inputs

Suppose we add Gaussian noise to the inputs.
Then input $x_i$ becomes

$$x_i + N(0, \sigma_i^2)$$

and, for a linear unit, the output becomes

$$y + N\!\left(0, \sum_i w_i^2 \sigma_i^2\right)$$

so if we try to minimize the squared error, we tend to minimize the squared weights.

how does it work?

$$y^{noisy} = \sum_i w_i x_i + \sum_i w_i \varepsilon_i$$

where $\varepsilon_i$ is sampled from $N(0, \sigma_i^2)$.

$$E\!\left[(y^{noisy} - t)^2\right] = E\!\left[\Big(y + \sum_i w_i\varepsilon_i - t\Big)^2\right] = E\!\left[\Big((y-t) + \sum_i w_i\varepsilon_i\Big)^2\right] = (y-t)^2 + E\!\left[2(y-t)\sum_i w_i\varepsilon_i\right] + E\!\left[\Big(\sum_i w_i\varepsilon_i\Big)^2\right] = (y-t)^2 + E\!\left[\sum_i w_i^2\varepsilon_i^2\right] = (y-t)^2 + \sum_i w_i^2\sigma_i^2$$

because $\varepsilon_i$ has zero mean and is independent of $\varepsilon_j$ for $i \neq j$, so the cross terms vanish.

Thus, adding input noise with variance $\sigma_i^2$ is equivalent to an L2 penalty with coefficient $\sigma_i^2$.
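A Monte Carlo sketch confirming this identity for a linear unit (weights, inputs, and noise scales chosen arbitrarily):

```python
import numpy as np

# For a linear unit, the expected squared error under Gaussian input noise
# equals the noise-free squared error plus sum_i w_i^2 * sigma_i^2,
# i.e. an L2 penalty on the weights.
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.7, -1.2])
sigma = np.array([0.5, 0.2, 0.8])
t = 1.0

y = w @ x                              # noise-free output
n_samples = 200_000
eps = rng.normal(0.0, sigma, size=(n_samples, 3))
y_noisy = y + eps @ w                  # noisy outputs

empirical = np.mean((y_noisy - t) ** 2)
predicted = (y - t) ** 2 + np.sum(w ** 2 * sigma ** 2)
print(empirical, predicted)            # the two agree closely
```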

add noise to the weights

Adding noise to the weights of a multilayer non-linear neural net is not exactly equivalent to an L2 penalty. However, it may work better, especially in RNNs.

Alex Graves's RNN that recognizes handwriting.

Using noise in the activities as a regularizer

It does worse on the training set and trains considerably slower. Nevertheless, it does significantly better on the test set!!! (~(≧▽≦)/~)


9d introduction to Bayesian Approach

Assumption: we always have a prior distribution for everything

  • Prior may be vague.
  • When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution
  • It favors parameter settings that make the data likely

9e the Bayesian interpretation of weight decay

This explains what's really going on when we use weight decay to control the NN's capacity.

Supervised Maximum Likelihood Learning

output of the net:

$$y_c = f(\text{input}_c, W)$$

the probability density of the target value, given the output plus Gaussian noise:

$$p(t_c \mid y_c) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(t_c - y_c)^2}{2\sigma^2}}$$

$$-\log p(t_c \mid y_c) = k + \frac{(t_c - y_c)^2}{2\sigma^2}$$

Thus, if we minimize the squared error, we maximize the log probability under a Gaussian.

Why log? Because it turns products into sums.

MAP: maximum a posteriori

$$p(W \mid D) = \frac{p(W)\,p(D \mid W)}{p(D)}$$

$$\text{Cost} = -\log p(W \mid D) = -\log p(W) - \log p(D \mid W) + \log p(D)$$

where $\log p(D)$ is constant. Thus,

$$C = \frac{1}{2\sigma_D^2}\sum_c (y_c - t_c)^2 + \frac{1}{2\sigma_W^2}\sum_i w_i^2$$

$$C' = E + \frac{\sigma_D^2}{\sigma_W^2}\sum_i w_i^2$$

This is the weight penalty.
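For a linear model this MAP cost is ridge regression with $\lambda = \sigma_D^2/\sigma_W^2$; a sketch comparing gradient descent on the penalized cost with the closed-form ridge solution (synthetic data, illustrative variances):

```python
import numpy as np

# MAP with a Gaussian likelihood (variance sigma_D^2) and a zero-mean
# Gaussian prior on the weights (variance sigma_W^2) gives the L2-penalized
# cost above. For a linear model this is ridge regression with
# lambda = sigma_D^2 / sigma_W^2, which has a closed-form solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + rng.normal(0.0, 0.3, size=50)

sigma_D2, sigma_W2 = 0.09, 1.0
lam = sigma_D2 / sigma_W2

# minimize 0.5*||Xw - t||^2 + (lam/2)*||w||^2 by gradient descent
# (the overall factor 0.5 just rescales the cost, not its minimizer)
w = np.zeros(3)
for _ in range(5000):
    grad = X.T @ (X @ w - t) + lam * w
    w -= 0.005 * grad

closed_form = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ t)
print(np.allclose(w, closed_form, atol=1e-6))  # True
```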


9f MacKay's quick and dirty method of fixing weight costs


10a why it helps to combine models
