• Index of Introduction to Deep Learning

    • CNN

      1. Basic components
      2. Deconvolution methods
    • RNN

      1. Basic structure, components, properties
      2. BPTT
    • NN training tricks

      1. Activation functions:
        1. Sigmoid, ReLU
      2. Regularization methods:
        1. Dropout
        2. Maxout
      3. Adaptive learning rate
        1. Weight Decay
    • ANN

      1. Structure, components
      2. Feedforward of MLP
      3. Back propagation
      4. Vanishing gradient
      5. Batch Normalization:

        Batch normalization potentially helps in two ways: faster learning and higher overall accuracy. It also allows you to use a higher learning rate, potentially providing another boost in speed.

        Why does this work? Well, we know that normalization (shifting inputs to zero-mean and unit variance) is often used as a pre-processing step to make the data comparable across features. As the data flows through a deep network, the weights and parameters adjust those values, sometimes making the data too big or too small again - a problem the authors refer to as "internal covariate shift". By normalizing the data in each mini-batch, this problem is largely avoided.
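
      To make the per-mini-batch normalization concrete, here is a minimal NumPy sketch of the training-time forward pass (the function name batch_norm_forward and the toy shapes are illustrative assumptions, not taken from the source):

      ```python
      import numpy as np

      def batch_norm_forward(x, gamma, beta, eps=1e-5):
          """Training-time batch norm for x of shape (batch_size, num_features):
          normalize each feature over the mini-batch, then scale and shift."""
          mu = x.mean(axis=0)                     # per-feature mean over the mini-batch
          var = x.var(axis=0)                     # per-feature variance over the mini-batch
          x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
          return gamma * x_hat + beta             # learned scale (gamma) and shift (beta)

      # toy usage: 4 examples, 3 features, deliberately far from zero mean / unit variance
      x = 10.0 * np.random.randn(4, 3) + 5.0
      y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
      print(y.mean(axis=0), y.var(axis=0))        # roughly 0 and 1 per feature
      ```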

      Another explanation:

      Internal Covariate Shift

      Covariate shift refers to a change in the input distribution to a learning system. In the case of deep networks, the input to each layer is affected by the parameters of all the preceding layers, so even small changes to those parameters get amplified as they propagate through the network. This changes the input distribution to the internal layers of the deep network, and is known as internal covariate shift.

      It is well established that networks converge faster if their inputs have been whitened (i.e., zero mean, unit variance) and decorrelated; internal covariate shift leads to just the opposite.

      Vanishing Gradient

      Saturating nonlinearities (like tanh or sigmoid) cannot be used in deep networks, as they tend to get stuck in the saturation region as the network grows deeper (see the short sketch after the list below). Some ways around this are to use:

      • Nonlinearities like ReLU which do not saturate
      • Smaller learning rates
      • Careful initializations
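
      A quick numeric sketch (illustrative, not from the source) of why a saturating unit like sigmoid shrinks gradients while ReLU does not:

      ```python
      import numpy as np

      xs = np.array([1.0, 5.0, 10.0])       # pre-activations of growing magnitude

      sig = 1.0 / (1.0 + np.exp(-xs))
      sigmoid_grad = sig * (1.0 - sig)      # sigmoid derivative: at most 0.25
      relu_grad = (xs > 0).astype(float)    # ReLU derivative: 1 for any positive input

      print(sigmoid_grad)   # ~[0.197, 0.0066, 0.000045] -- shrinks as the unit saturates
      print(relu_grad)      # [1. 1. 1.] -- the gradient passes through unchanged
      ```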

      Batch Normalized Convolutional Networks

      Let us say that z = g(Wu + b) is the operation performed by the layer, where W and b are the parameters to be learned, g is a nonlinearity, and u is the input from the previous layer.

      The BN transform is added just before the nonlinearity, by normalizing x = Wu + b. An alternative would have been to normalize u itself, but constraining just the first and second moments would not eliminate the covariate shift from u.

      When normalizing Wu+b, we can ignore the b term as it would be canceled during the normalization step (b's role is subsumed by β) and we have

      z = g( BN(Wu) )
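
      A minimal sketch of that ordering, assuming ReLU as the nonlinearity g (the function names and shapes are illustrative): the pre-activation Wu is normalized over the mini-batch, scaled and shifted by γ and β, and only then passed through g, with no bias term.

      ```python
      import numpy as np

      def bn(x, gamma, beta, eps=1e-5):
          # normalize over the mini-batch dimension, then scale and shift
          x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
          return gamma * x_hat + beta

      def layer_forward(u, W, gamma, beta):
          # z = g(BN(Wu)): no bias term b, since its role is taken over by beta
          x = u @ W
          return np.maximum(bn(x, gamma, beta), 0.0)   # g = ReLU

      # toy usage: mini-batch of 8 inputs, mapping 16 features to 32 hidden units
      u = np.random.randn(8, 16)
      W = 0.1 * np.random.randn(16, 32)
      z = layer_forward(u, W, gamma=np.ones(32), beta=np.zeros(32))
      ```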

      For convolutional layers, normalization should follow the convolution property as well - i.e., different elements of the same feature map, at different locations, are normalized in the same way. So all the activations in a mini-batch are jointly normalized over all the locations, and the parameters (γ and β) are learned per feature map instead of per activation.
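
      A hedged sketch of the per-feature-map variant, assuming an NCHW activation tensor and illustrative names: statistics are shared over the batch and both spatial dimensions, with one γ and one β per channel.

      ```python
      import numpy as np

      def batch_norm_conv(x, gamma, beta, eps=1e-5):
          """Per-feature-map batch norm for a conv activation x of shape (N, C, H, W)."""
          mu = x.mean(axis=(0, 2, 3), keepdims=True)    # one mean per channel
          var = x.var(axis=(0, 2, 3), keepdims=True)    # one variance per channel
          x_hat = (x - mu) / np.sqrt(var + eps)
          # gamma and beta are learned per feature map (per channel)
          return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

      # toy usage: batch of 2 activations with 4 feature maps of size 5x5
      x = np.random.randn(2, 4, 5, 5)
      y = batch_norm_conv(x, gamma=np.ones(4), beta=np.zeros(4))
      ```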

      Advantages Of Batch Normalization

      • Reduces internal covariate shift.
      • Reduces the dependence of gradients on the scale of the parameters or their initial values.
      • Regularizes the model and reduces the need for dropout, photometric distortions, local response normalization and other regularization techniques.
      • Allows use of saturating nonlinearities and higher learning rates.

      Batch Normalization was applied to models trained on MNIST and to the Inception network trained on ImageNet. All the above-mentioned advantages were validated in the experiments. Interestingly, Batch Normalization with a sigmoid nonlinearity achieved an accuracy of 69.8% (the overall best, using any nonlinearity, was 74.8%), while the Inception model with sigmoid but without Batch Normalization performed no better than random guessing.
