Hinton's lectures (Neural Networks for Machine Learning), lecture 5 to lecture 9

NN

perceptron

pretty limited
has no way to handle data that is not linearly separable
hidden units are needed

soft-max loss function

$$\frac{\partial C}{\partial z_i} = y_i - t_i$$

where

$$C = -\sum_j t_j \log y_j$$

where $t_j$ is the target value and $\sum_j t_j = 1$.

$$\frac{\partial C}{\partial z_i} = \sum_j \frac{\partial C}{\partial y_j}\frac{\partial y_j}{\partial z_i}$$

$$y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

so
when $i = j$

$$\frac{\partial y_j}{\partial z_i} = y_i(1 - y_i)$$

else

$$\frac{\partial y_j}{\partial z_i} = -y_i y_j$$

Besides,

$$\frac{\partial C}{\partial y_j} = -\frac{t_j}{y_j}$$

thus

$$\sum_j \frac{\partial C}{\partial y_j}\frac{\partial y_j}{\partial z_i} = -t_i(1-y_i) + \sum_{j \neq i}\left(-\frac{t_j}{y_j}\right)(-y_i y_j) = -t_i(1-y_i) + \sum_{j \neq i} t_j y_i = -t_i + t_i y_i + \sum_{j \neq i} t_j y_i = -t_i + y_i\sum_j t_j = y_i - t_i$$
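The result $\partial C/\partial z_i = y_i - t_i$ can be verified numerically with a finite-difference check; a minimal NumPy sketch (the values of `z` and `t` are arbitrary illustrations):

```python
import numpy as np

# Numerical check: for C = -sum_j t_j * log(y_j) with softmax outputs y,
# the gradient with respect to the logits z is simply y - t.
def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0])
t = np.array([0.0, 0.0, 1.0])        # one-hot target, so sum_j t_j = 1

analytic = softmax(z) - t

# central-difference gradient, one logit at a time
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[i], t) -
     cross_entropy(z - eps * np.eye(3)[i], t)) / (2 * eps)
    for i in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```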

5a Things that make it hard to recognize objects

  • segmentation: real scenes are cluttered with other objects
    • it is hard to tell which pieces go together as parts of the same object
    • parts of an object can be hidden behind other objects
  • lighting: the intensities of the pixels are determined as much by the lighting as by the objects.
  • deformation: objects can deform in a variety of non-affine ways
  • affordances: object classes are often defined by how they are used.
  • viewpoint: changes in viewpoint cause changes in images that standard learning methods cannot cope with.

5b how to achieve viewpoint invariance

  • use redundant invariant features
    • but for recognition, we must avoid forming features from parts of different objects
  • put a box around the object and use normalized pixels
    • but choosing such a box is very difficult and we need to recognize the shape to get the box right!
    • the brute force normalization approach: try all possible boxes in a range of positions and scales
  • use replicated features with pooling (CNNs)
  • use a hierarchy of parts that have explicit poses relative to the camera

5c cnn for hand-written digit recognition

BP for CNNs

if we need $w_1 = w_2$ (because of weight sharing),
we need $\Delta w_1 = \Delta w_2$,
and thus we compute $\frac{\partial E}{\partial w_1}$ and $\frac{\partial E}{\partial w_2}$ and use $\frac{\partial E}{\partial w_1} + \frac{\partial E}{\partial w_2}$ to update both $w_1$ and $w_2$
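This sum-of-gradients rule for tied weights can be sketched on a toy linear unit (the numbers here are made up for illustration):

```python
import numpy as np

# Weight sharing: a tiny net y = w1*x1 + w2*x2 with the constraint w1 == w2.
# We compute dE/dw1 and dE/dw2 separately, then apply their sum to both
# weights, so they stay equal after every update.
w = np.array([0.5, 0.5])            # w1 == w2 initially
x = np.array([1.0, 2.0])
t = 4.0                              # target
lr = 0.1

y = w @ x
grad = (y - t) * x                   # [dE/dw1, dE/dw2] for E = 0.5*(y - t)^2
shared = grad.sum()                  # dE/dw1 + dE/dw2
w -= lr * shared                     # same update applied to both weights

print(w[0] == w[1])                  # True: the constraint is preserved
```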

what does the replicating the feature detectors achieve?

  • equivariant activities
  • invariant knowledge

pooling

  • translational invariance
  • reducing the number of inputs to the next layer
  • problem: lose information about the precise position.
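A toy 1-D max-pooling example makes this trade-off concrete (the `max_pool_1d` helper is mine, not from the lecture):

```python
import numpy as np

# Max pooling over non-overlapping windows: a small translation of the
# input can leave the pooled output unchanged (translational invariance),
# but the precise position inside each window is lost.
def max_pool_1d(x, size=2):
    return x.reshape(-1, size).max(axis=1)

a = np.array([0.0, 9.0, 0.0, 0.0])   # strong activation at position 1
b = np.array([9.0, 0.0, 0.0, 0.0])   # same activation shifted to position 0

print(max_pool_1d(a))                 # [9. 0.]
print(max_pool_1d(b))                 # [9. 0.]  -- identical: position is lost
```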

LeNet

Here is the architecture of LeNet-5.


5d CNNs for object recognition

from hand-written digits to 3-D objects


6a overview of mini-batch gradient descent

  • online learning: update the weights after each training case; however, mini-batches are usually better than fully online updates.
  • stochastic gradient descent

6b tricks of stochastic gradient descent

  • initializing weights with small random values
  • shifting the inputs: e.g. the cases (101, 101) → 2 and (101, 99) → 0 become (1, 1) → 2 and (1, −1) → 0 after subtracting 100 from each component
  • scaling the inputs: e.g. the cases (0.1, 10) → 2 and (0.1, −10) → 0 become (1, 1) → 2 and (1, −1) → 0 after rescaling each component
  • decorrelating the input components: PCA (Principal Components Analysis)
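Decorrelation via PCA can be sketched as follows with NumPy (the data here is synthetic):

```python
import numpy as np

# Decorrelate input components with PCA: project centered data onto the
# eigenvectors of its covariance matrix; the transformed components then
# have a diagonal covariance (zero correlation between components).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X[:, 1] += 0.8 * X[:, 0]             # introduce correlation between components

Xc = X - X.mean(axis=0)              # center the data
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)
Z = Xc @ eigvecs                     # rotate onto the principal components

cov_z = Z.T @ Z / len(Z)
print(np.round(cov_z, 6))            # off-diagonal entries are ~0
```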

Four ways to speed up mini-batch learning

  1. Use “momentum”
  2. Use separate adaptive learning rates for each parameter
  3. rmsprop
  4. Take a fancy method from the optimization literature that makes use of curvature information.

6c The momentum method

$$v(t) = \alpha v(t-1) - \varepsilon \frac{\partial E}{\partial w}(t)$$

where $\alpha$ is slightly less than 1.

$$\Delta w(t) = v(t) = \alpha v(t-1) - \varepsilon \frac{\partial E}{\partial w}(t) = \alpha \Delta w(t-1) - \varepsilon \frac{\partial E}{\partial w}(t)$$

A better type of momentum

Nesterov 1983

  • The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient.
  • Ilya Sutskever(2012): First make a big jump in the direction of the previous accumulated gradient. Then measure the gradient where you end up and make a correction.

It’s better to correct a mistake after you have made it!
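Both variants can be sketched on a 1-D quadratic (the parameter values below are illustrative, not from the lecture):

```python
# Standard vs. Nesterov momentum on the quadratic E(w) = 0.5*w^2, so
# dE/dw = w. Standard momentum evaluates the gradient at the current
# point; Nesterov first jumps along the accumulated velocity, measures
# the gradient at the landing point, and then corrects.
def grad(w):
    return w

alpha, eps = 0.9, 0.1
w_std, v_std = 5.0, 0.0
w_nes, v_nes = 5.0, 0.0

for _ in range(300):
    # standard momentum: v(t) = alpha*v(t-1) - eps*dE/dw(t)
    v_std = alpha * v_std - eps * grad(w_std)
    w_std += v_std
    # Nesterov: gradient measured after the jump alpha*v
    v_nes = alpha * v_nes - eps * grad(w_nes + alpha * v_nes)
    w_nes += v_nes

print(abs(w_std) < 1e-3, abs(w_nes) < 1e-3)  # both converge toward 0
```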


6d A separate, adaptive learning rate for each connection

each connection in the NN should have its own adaptive learning rate.
The magnitudes of the gradients are often very different for different layers

One way to determine the individual learning rates

  • Start with a local gain of 1 for every weight.
  • Increase the local gain if the gradient for that weight does not change sign

$$\Delta w_{ij} = -\varepsilon\, g_{ij} \frac{\partial E}{\partial w_{ij}}$$

if

$$\frac{\partial E}{\partial w_{ij}}(t)\,\frac{\partial E}{\partial w_{ij}}(t-1) > 0$$

then

$$g_{ij}(t) = g_{ij}(t-1) + \delta$$

else

$$g_{ij}(t) = g_{ij}(t-1) \times (1 - \delta)$$

for example $\delta = 0.05$
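A sketch of this gain-update rule (the `update_gains` helper and the gradient values are mine, for illustration):

```python
import numpy as np

# Adaptive per-weight gains: increase a weight's gain additively while its
# gradient keeps the same sign, decrease it multiplicatively on a sign flip.
# delta = 0.05 as in the lecture.
def update_gains(gains, grad, prev_grad, delta=0.05):
    same_sign = grad * prev_grad > 0
    return np.where(same_sign, gains + delta, gains * (1 - delta))

gains = np.ones(3)                    # start with a local gain of 1
prev_grad = np.array([0.2, -0.1, 0.3])
grad = np.array([0.1, 0.2, 0.25])     # sign flipped only for the 2nd weight

gains = update_gains(gains, grad, prev_grad)
print(gains)                           # [1.05 0.95 1.05]

eps = 0.01
delta_w = -eps * gains * grad          # the actual weight update
```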


7a Modeling sequences: A brief overview

targets

  • turn an input sequence into an output sequence that lives in a different domain.
  • predict the next term in the input sequence.
  • memoryless models for sequences

    • Autoregressive models
    • Feed-forward neural nets: generalizing autoregressive models by using one or more layers of non-linear hidden units.
  • Linear Dynamical Systems: have hidden units that store information

  • Hidden Markov Models(HMM): have a discrete one-of-N hidden state. Transitions between states are stochastic and controlled by a transition matrix. The outputs produced by a state are stochastic. More detailed information about HMM
  • Recurrent neural networks (I will provide a clear picture of RNNs in my new blog)

7b Training RNNs with backpropagation


7c A toy example of training an RNN


7d Why is it difficult to train RNNs?

  • The backward pass is linear
  • The problem of exploding or vanishing gradients
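The exploding/vanishing effect can be illustrated with a toy linear backward pass (the matrices and step count are arbitrary):

```python
import numpy as np

# Because the backward pass of an RNN is linear, the error signal is
# repeatedly multiplied by the transpose of the recurrent weight matrix.
# Its largest singular value s makes gradients shrink (s < 1) or blow up
# (s > 1) exponentially with the number of time steps.
T = 50                                   # number of time steps

norms = {}
for scale in (0.5, 1.5):                 # contractive vs. expansive weights
    W = scale * np.eye(4)                # toy recurrent weight matrix
    g = np.ones(4)                       # error signal at the last time step
    for _ in range(T):
        g = W.T @ g                      # one step of the linear backward pass
    norms[scale] = np.linalg.norm(g)
    print(scale, norms[scale])           # tiny for 0.5, enormous for 1.5
```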

Four effective ways to learn an RNN

  • Long Short Term Memory
    • Hochreiter & Schmidhuber (1997)
  • Hessian Free Optimization
  • Echo State Networks
  • Good initialization with momentum

8a HF Optimization

I will come back later.


8b modeling character strings with multiplicative connections

why model character strings?

  • The web is composed of character strings.
  • pre-processing text to get words is a big hassle.

9a overview of ways to improve generalization

  • overfitting: the model cannot tell which regularities are real and which are caused by sampling error.

how to prevent overfitting

  • more data
  • use a model with the right capacity
  • average many different models
  • a single NN architecture, but make predictions with many different weight vectors

how to limit the capacity of a NN

  • architecture: limit the number of hidden layers and units per layer
  • early stopping
  • weight-decay
  • add noise to the weights or the activities.

cross-validation

  • training set
  • validation set
  • test set
    the N error estimates from N-fold cross-validation are not independent.

early stopping

however, it’s hard to decide when performance is getting worse.


9b limiting the size of the weights

The standard L2 weight penalty involves adding an extra term to the cost function that penalizes the squared weights.

This keeps the weights small unless they have big error derivatives. It prevents the network from using weights that it doesn't need.

$$C = E + \frac{\lambda}{2}\sum_i w_i^2$$

$$\frac{\partial C}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda w_i$$

when

$$\frac{\partial C}{\partial w_i} = 0,$$

$$w_i = -\frac{1}{\lambda}\frac{\partial E}{\partial w_i}$$
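This fixed-point relation can be checked with a toy quadratic error (illustrative values):

```python
import numpy as np

# Gradient descent on C = E + (lambda/2) * w^2 with the quadratic error
# E(w) = 0.5 * (w - 3)^2. At the minimum, dC/dw = 0 implies
# w = -(1/lambda) * dE/dw, which we verify after convergence.
lam, lr = 0.5, 0.1
w = 0.0
for _ in range(1000):
    dE = w - 3.0                       # dE/dw
    w -= lr * (dE + lam * w)           # dC/dw = dE/dw + lambda*w

dE = w - 3.0
print(np.isclose(w, -dE / lam))        # True at the minimum (w = 2, dE = -1)
```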

9c Using noise as a regularizer

add noise to the inputs

Suppose we add Gaussian noise to the inputs.
Then input $x_i$ becomes

$$x_i + N(0, \sigma_i^2)$$

and, for a linear unit, the output becomes

$$y + N\!\left(0, \sum_i w_i^2 \sigma_i^2\right)$$

so if we try to minimize the squared error, we tend to minimize the squared weights.

how does it work?

$$y^{noisy} = \sum_i w_i x_i + \sum_i w_i \varepsilon_i$$

where $\varepsilon_i$ is sampled from $N(0, \sigma_i^2)$.

$$E\!\left[(y^{noisy} - t)^2\right] = E\!\left[\Big(y + \sum_i w_i\varepsilon_i - t\Big)^2\right] = E\!\left[\Big((y-t) + \sum_i w_i\varepsilon_i\Big)^2\right] = (y-t)^2 + E\!\left[2(y-t)\sum_i w_i\varepsilon_i\right] + E\!\left[\Big(\sum_i w_i\varepsilon_i\Big)^2\right] = (y-t)^2 + E\!\left[\sum_i w_i^2\varepsilon_i^2\right] = (y-t)^2 + \sum_i w_i^2\sigma_i^2$$

because $\varepsilon_i$ has zero mean and is independent of $\varepsilon_j$ for $i \neq j$, so the cross terms vanish.

Thus, adding input noise with variance $\sigma_i^2$ is equivalent to an L2 penalty with coefficient $\sigma_i^2$.
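A Monte Carlo sketch confirming this identity for a linear unit (weights, inputs, and noise scales chosen arbitrarily):

```python
import numpy as np

# For a linear unit, the expected squared error under Gaussian input noise
# equals the noise-free squared error plus sum_i w_i^2 * sigma_i^2,
# i.e. an L2 penalty on the weights.
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.7, -1.2])
sigma = np.array([0.5, 0.2, 0.8])
t = 1.0

y = w @ x                              # noise-free output
n_samples = 200_000
eps = rng.normal(0.0, sigma, size=(n_samples, 3))
y_noisy = y + eps @ w                  # noisy outputs

empirical = np.mean((y_noisy - t) ** 2)
predicted = (y - t) ** 2 + np.sum(w ** 2 * sigma ** 2)
print(empirical, predicted)            # the two agree closely
```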

add noise to the weights

Adding noise to the weights of a multilayer non-linear neural net is not exactly equivalent to an L2 penalty. However, it may work better, especially in RNNs.

Alex Graves's RNN that recognizes handwriting.

Using noise in the activities as a regularizer

It does worse on the training set and trains considerably slower. Nevertheless, it does significantly better on the test set!!! (~(≧▽≦)/~)


9d introduction to Bayesian Approach

Assumption: we always have a prior distribution for everything

  • Prior may be vague.
  • When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution
  • It favors parameter settings that make the data likely

9e the Bayesian interpretation of weight decay

This explains what's really going on when we use weight decay to control the NN's capacity.

Supervised Maximum Likelihood Learning

output of the net:

$$y_c = f(\text{input}_c, W)$$

the probability density of the target value, given the output plus Gaussian noise:

$$p(t_c \mid y_c) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(t_c - y_c)^2}{2\sigma^2}}$$

$$-\log p(t_c \mid y_c) = k + \frac{(t_c - y_c)^2}{2\sigma^2}$$

Thus, if we minimize the squared error, we maximize the log probability under a Gaussian.

Why log? Because it turns products into sums.

MAP: maximum a posteriori

$$p(W \mid D) = \frac{p(W)\,p(D \mid W)}{p(D)}$$

$$\text{Cost} = -\log p(W \mid D) = -\log p(W) - \log p(D \mid W) + \log p(D)$$

where $\log p(D)$ is constant. Thus,

$$C = \frac{1}{2\sigma_D^2}\sum_c (y_c - t_c)^2 + \frac{1}{2\sigma_W^2}\sum_i w_i^2$$

$$C' = E + \frac{\sigma_D^2}{\sigma_W^2}\sum_i w_i^2$$

This is the weight penalty.
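For a linear model this MAP cost is ridge regression with $\lambda = \sigma_D^2/\sigma_W^2$; a sketch comparing gradient descent on the penalized cost with the closed-form ridge solution (synthetic data, illustrative variances):

```python
import numpy as np

# MAP with a Gaussian likelihood (variance sigma_D^2) and a zero-mean
# Gaussian prior on the weights (variance sigma_W^2) gives the L2-penalized
# cost above. For a linear model this is ridge regression with
# lambda = sigma_D^2 / sigma_W^2, which has a closed-form solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + rng.normal(0.0, 0.3, size=50)

sigma_D2, sigma_W2 = 0.09, 1.0
lam = sigma_D2 / sigma_W2

# minimize 0.5*||Xw - t||^2 + (lam/2)*||w||^2 by gradient descent
# (the overall factor 0.5 just rescales the cost, not its minimizer)
w = np.zeros(3)
for _ in range(5000):
    grad = X.T @ (X @ w - t) + lam * w
    w -= 0.005 * grad

closed_form = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ t)
print(np.allclose(w, closed_form, atol=1e-6))  # True
```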


9f MacKay's quick and dirty method of fixing weight costs


10a why it helps to combine models
