Dr. Hua Zhou’s slides
Neural networks are not a fully automatic tool, as they are sometimes advertised; as with all statistical models, subject matter knowledge should be, and often is, used to improve their performance.
Starting values: usually starting values for weights are chosen to be random values near zero; hence the model starts out nearly linear (for sigmoid), and becomes nonlinear as the weights increase.
Hanin and Rolnick argue that a proper choice of the network and the initial random weights has to meet two requirements:
He uniform in Keras makes the choice \(\sigma^2 = 2 / \text{fan-in}\), where the fan-in is the maximum number of inputs to neurons.
Scaling of inputs: mean 0 and standard deviation 1. With standardized inputs, it is typical to take random uniform weights over the range [−0.7,+0.7].
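A minimal sketch of the two ideas above, in plain Python rather than Keras. It assumes the weights are drawn from a symmetric uniform range \([-a, a]\), whose variance is \(a^2/3\), so \(a = \sqrt{6/\text{fan-in}}\) gives \(\sigma^2 = 2/\text{fan-in}\). The function names and the layer sizes (256 inputs, 12 units) are hypothetical and chosen only for illustration.

```python
import math
import random

def he_uniform_limit(fan_in):
    """Half-width a of U[-a, a] such that the variance a^2 / 3 equals 2 / fan_in."""
    return math.sqrt(6.0 / fan_in)

def he_uniform_weights(fan_in, fan_out, rng):
    """Draw a fan_in x fan_out weight matrix as a list of rows."""
    a = he_uniform_limit(fan_in)
    return [[rng.uniform(-a, a) for _ in range(fan_out)] for _ in range(fan_in)]

def standardize(column):
    """Rescale one input feature to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / sd for x in column]

rng = random.Random(0)
fan_in = 256  # hypothetical layer: 256 inputs feeding 12 hidden units
weights = he_uniform_weights(fan_in, 12, rng)
flat = [w for row in weights for w in row]
sample_var = sum(w * w for w in flat) / len(flat)  # the mean is 0 by symmetry

print("target variance:", 2 / fan_in)
print("sample variance:", sample_var)
print(standardize([1.0, 2.0, 3.0, 4.0]))
```

The sample variance only approximates the target because it is computed from a finite draw.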
Overfitting (too many parameters):
Figure from Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov (2014).
How many hidden units and how many hidden layers: guided by domain knowledge and experimentation.
Multiple minima: try with different starting values.
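To illustrate why different starting values matter, here is a small sketch on a hypothetical one-dimensional non-convex loss (not a neural network). Gradient descent is run from several fixed starting values and the solution with the lowest loss is kept; the loss function, learning rate and starting values are all assumptions made for the example.

```python
def loss(w):
    """A hypothetical non-convex loss with two local minima."""
    return w ** 4 - 3 * w ** 2 + w

def grad(w):
    """Derivative of the loss above."""
    return 4 * w ** 3 - 6 * w + 1

def gradient_descent(w, lr=0.01, steps=2000):
    """Plain gradient descent from the starting value w."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

starts = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
solutions = [gradient_descent(w0) for w0 in starts]
best = min(solutions, key=loss)

for w0, w in zip(starts, solutions):
    print(f"start {w0:+.1f} -> w = {w:+.4f}, loss = {loss(w):+.4f}")
print("best:", round(best, 4))
```

Starts on the left end in the deeper minimum and starts on the right end in the shallower one, so a single run can easily report the worse solution.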
Fully connected networks don’t scale well with the dimension of the input images. E.g., \(1000 \times 1000\) images have about \(10^6\) input units, and assuming you want to learn 1 million features (hidden units), you have about \(10^{12}\) parameters to learn!
In locally connected networks, each hidden unit only connects to a small contiguous region of pixels in the input, e.g., a patch of image or a time span of the input audio.
Convolutions. Natural images have the property of being stationary, meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applied to other parts of the image, and we can use the same features at all locations by weight sharing.
Consider \(96 \times 96\) images. For each feature, first learn an \(8 \times 8\) feature detector (or filter or kernel) from (possibly randomly sampled) \(8 \times 8\) patches from the larger image. Then apply the learned detector to all \(8 \times 8\) regions of the \(96 \times 96\) image to obtain one \(89 \times 89\) convolved feature for that feature.
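The \(89 \times 89\) size follows from \(96 - 8 + 1 = 89\) positions along each axis. A naive sketch of applying one detector at every location, assuming stride 1 and no padding; the image contents and the averaging kernel are hypothetical stand-ins for real data and a learned filter. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Apply `kernel` to every full patch of `image` (stride 1, no padding)."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows, out_cols = n_rows - k_rows + 1, n_cols - k_cols + 1
    out = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # The same kernel weights are reused at every location (weight sharing).
            s = 0.0
            for a in range(k_rows):
                for b in range(k_cols):
                    s += image[i + a][j + b] * kernel[a][b]
            row.append(s)
        out.append(row)
    return out

# Hypothetical 96 x 96 image and an 8 x 8 averaging filter.
image = [[float((r * 96 + c) % 7) for c in range(96)] for r in range(96)]
kernel = [[1.0 / 64] * 8 for _ in range(8)]

feature = conv2d_valid(image, kernel)
print(len(feature), len(feature[0]))  # 89 89
```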
From Wang and Raj (2017):
Pooling. For a neural network with 100 hidden units, we have \(89^2 \times 100 = 792,100\) convolved features. This can be reduced by calculating the mean (or max) value of a particular feature over a region of the image. These summary statistics are much lower in dimension (compared to using all of the extracted features) and can also improve results (less over-fitting). We call this aggregation operation pooling, or sometimes mean pooling or max pooling (depending on the pooling operation applied).
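A small sketch of mean and max pooling over non-overlapping regions. The \(2 \times 2\) region size and the toy \(4 \times 4\) feature map are assumptions for illustration; rows and columns that do not fill a complete region are simply dropped here, which is one of several possible conventions.

```python
def pool2d(feature, size, op=max):
    """Summarise each non-overlapping size x size block with `op`."""
    out_rows = len(feature) // size
    out_cols = len(feature[0]) // size
    pooled = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            block = [feature[i * size + a][j * size + b]
                     for a in range(size) for b in range(size)]
            row.append(op(block))
        pooled.append(row)
    return pooled

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

feature = [[1, 2, 5, 6],
           [3, 4, 7, 8],
           [0, 0, 1, 1],
           [0, 4, 1, 1]]

print(pool2d(feature, 2, max))   # [[4, 8], [4, 1]]
print(pool2d(feature, 2, mean))  # [[2.5, 6.5], [1.0, 1.0]]
```

Each \(2 \times 2\) pooling step reduces the number of values by a factor of 4.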
Convolutional neural network (CNN). Convolution + pooling + multi-layer neural networks.
Input: 256 pixel values from \(16 \times 16\) grayscale images. Output: the digit 0, 1, …, 9 (10-class classification).
A modest experiment subset: 320 training digits and 160 testing digits.
net-1: no hidden layer, equivalent to multinomial logistic regression. Number of parameters is \((16 \times 16 + 1) \times 10 = 2570\).
net-2: one hidden layer, 12 hidden units fully connected. Number of parameters is \((16 \times 16 + 1) \times 12 + (13 \times 10) = 3214\).
net-3: two hidden layers locally connected. Each unit of the first hidden layer takes input from a \(3 \times 3\) patch; neighboring patches overlap by one row or column. Each unit of the second hidden layer takes input from a \(5 \times 5\) patch; neighboring patches are two units apart. Number of parameters is \((3 \times 3 + 1) \times 64 + (5 \times 5 + 1) \times 16 + (16 + 1) \times 10 = 1226\).
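net-4: two hidden layers, locally connected with weight sharing. The first hidden layer has two \(8 \times 8\) feature maps, each sharing one \(3 \times 3\) kernel but keeping one bias per unit, and each second-layer unit takes a \(5 \times 5\) patch from both maps. Number of parameters is \((3 \times 3 + 64) \times 2 + (5 \times 5 \times 2 + 1) \times 16 + (16 + 1) \times 10 = 1132\).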
net-5: two hidden layers, locally connected, two levels of weight sharing (was the result of many person-years of experimentation).
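The parameter counts for the first four networks can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: every unit has one bias, and for net-4 each second-layer unit is assumed to read a \(5 \times 5\) patch from both first-layer feature maps (two maps of \(8 \times 8\) units, each map sharing one \(3 \times 3\) kernel). That reading of net-4 is an assumption, not something spelled out in the notes; net-5 is omitted because its architecture is not described here. The helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Fully connected layer: n_in weights plus one bias per output unit."""
    return (n_in + 1) * n_out

def local_params(patch_size, n_units):
    """Locally connected layer without sharing: each unit owns its patch weights and a bias."""
    return (patch_size + 1) * n_units

net1 = dense_params(16 * 16, 10)
net2 = dense_params(16 * 16, 12) + dense_params(12, 10)
net3 = local_params(3 * 3, 64) + local_params(5 * 5, 16) + dense_params(16, 10)

# net-4 (assumed): per feature map, 9 shared weights plus 64 unit biases;
# the second layer sees 5 x 5 patches from both maps.
net4 = (3 * 3 + 64) * 2 + local_params(5 * 5 * 2, 16) + dense_params(16, 10)

print(net1, net2, net3, net4)  # 2570 3214 1226 1132
```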
Results (320 training cases, 160 test cases):
network | links | weights | accuracy |
---|---|---|---|
net 1 | 2570 | 2570 | 80.0% |
net 2 | 3214 | 3214 | 87.0% |
net 3 | 1226 | 1226 | 88.5% |
net 4 | 2266 | 1132 | 94.0% |
net 5 | 5194 | 1060 | 98.4% |
Net-5 and similar networks were state-of-the-art in the early 1990s.
On the larger benchmark dataset MNIST (60,000 training images, 10,000 testing images), the following error rates were reported:
Method | Error rate |
---|---|
tangent distance with 1-nearest neighbor classifier | 1.1% |
degree-9 polynomial SVM | 0.8% |
LeNet-5 | 0.8% |
boosted LeNet-4 | 0.7% |
Source: http://cs231n.github.io/convolutional-networks/
AlexNet: Krizhevsky, Sutskever, Hinton (2012)
ImageNet dataset. Classify 1.2 million high-resolution images (\(224 \times 224 \times 3\)) into 1000 classes.
A combination of techniques: GPU, ReLU, DropOut (0.5), SGD + Momentum with 0.9, initial learning rate 0.01, reduced by a factor of 10 whenever validation accuracy becomes flat.
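A rough sketch of two of the ingredients listed above: the SGD-with-momentum update (momentum 0.9, learning rate 0.01) and a rule that divides the learning rate by 10 when validation accuracy stops improving. The toy quadratic objective, the `patience` window and the function names are hypothetical; this is not how AlexNet itself was implemented.

```python
def sgd_momentum_step(w, velocity, grad, lr, momentum=0.9):
    """One SGD-with-momentum update for a single scalar weight."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def reduce_on_plateau(lr, history, patience=3, factor=10.0):
    """Divide lr by `factor` if the last `patience` validation accuracies
    are no better than the best accuracy seen before them."""
    if len(history) > patience and max(history[-patience:]) <= max(history[:-patience]):
        return lr / factor
    return lr

# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 (w - 3).
w, v, lr = 0.0, 0.0, 0.01
for _ in range(500):
    w, v = sgd_momentum_step(w, v, 2 * (w - 3), lr)
print(round(w, 4))  # 3.0

# Validation accuracy has been flat for three epochs, so the rate is cut.
new_lr = reduce_on_plateau(0.01, [0.50, 0.58, 0.61, 0.61, 0.60, 0.61])
print(new_lr)
```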
5 convolutional layers, pooling interspersed, 3 fully connected layers. 60 million parameters, 650,000 neurons.
AlexNet was the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) classification benchmark in 2012.
Achieved 62.5% accuracy:
96 learnt filters:
Source: Architecture comparison of AlexNet, VGGNet, ResNet, Inception, DenseNet
Sources:
MLP (multi-layer perceptron) and CNN (convolutional neural network) are examples of feed-forward neural networks, where connections between the units do not form a cycle.
MLP and CNN accept a fixed-sized vector as input (e.g. an image) and produce a fixed-sized vector as output (e.g. probabilities of different classes).
Recurrent neural networks (RNN) instead have loops, which can be unrolled into a sequence of MLPs.
RNNs allow us to operate over sequences of vectors: sequences in the input, the output, or in the most general case both.
Applications of RNN:
Above: generated (fake) LaTeX on algebraic geometry; see http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
RNNs accept an input vector \(x\) and give you an output vector \(y\). However, crucially this output vector’s contents are influenced not only by the input you just fed in, but also by the entire history of inputs you’ve fed in the past.
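The dependence on history comes from a hidden state that is carried from one step to the next. A minimal sketch with a single scalar hidden unit; the weights `w_xh`, `w_hh`, `w_hy` and the tanh nonlinearity are assumptions chosen for illustration, not values from any trained model.

```python
import math

def rnn_step(x, h, w_xh=0.5, w_hh=0.8, w_hy=1.0):
    """One step of a scalar recurrent unit: update the hidden state, then emit the output."""
    h_new = math.tanh(w_xh * x + w_hh * h)
    y = w_hy * h_new
    return y, h_new

def run(sequence):
    """Feed a sequence through the unit, starting from a zero hidden state."""
    h, outputs = 0.0, []
    for x in sequence:
        y, h = rnn_step(x, h)
        outputs.append(y)
    return outputs

# Both sequences end with the same input 1.0 but have different histories.
y_a = run([0.0, 0.0, 1.0])[-1]
y_b = run([2.0, 2.0, 1.0])[-1]
print(y_a, y_b)
```

The two final outputs differ even though the last input is identical, because the hidden state summarises the earlier inputs.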
Short-term dependencies: to predict the last word in “the clouds are in the sky”:
Long-term dependencies: to predict the last word in “I grew up in France… I speak fluent French”:
Typical RNNs have trouble learning long-term dependencies.
Long Short-Term Memory networks (LSTM) are a special kind of RNN capable of learning long-term dependencies.
The cell state allows information to flow along it unchanged.
The gates give the ability to remove or add information to the cell state.
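A scalar sketch of one LSTM step, using the commonly cited formulation with forget, input and output gates; the notes do not give the equations, so this particular form and all parameter values are assumptions. The parameters are chosen so that the forget gate is almost fully open and the input gate almost fully shut, which makes the cell state pass through essentially unchanged.

```python
import math

def sigmoid(z):
    """Logistic function, used for the gates."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, p):
    """One step of a scalar LSTM cell; `p` maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h + b
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # the new information itself
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new

# Hypothetical parameters: forget gate saturated open, input gate saturated shut.
params = {"forget": (0.0, 0.0, 20.0), "input": (0.0, 0.0, -20.0),
          "candidate": (1.0, 1.0, 0.0), "output": (1.0, 1.0, 0.0)}

h, c = 0.0, 0.7
for x in [1.0, -2.0, 0.5, 3.0]:
    h, c = lstm_step(x, h, c, params)
print(c)  # stays very close to the initial 0.7
```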
The coolest idea in deep learning in the last 20 years.
- Yann LeCun on GANs.
Sources:
Applications:
AI-generated celebrity photos: https://www.youtube.com/watch?v=G06dEcZ-QTg
Self play
Value function of GAN \[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log (1 - D(G(z)))]. \]
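A small numerical sketch of the value function, estimating both expectations by Monte Carlo. The one-dimensional Gaussian data, the two generators and the two discriminators are hypothetical toy choices, not trained networks. When the generator reproduces the data distribution and the discriminator outputs \(1/2\) everywhere, the value is \(2 \log \tfrac{1}{2} = -\log 4\).

```python
import math
import random

def value_function(discriminator, generator, real_samples, noise_samples):
    """Monte Carlo estimate of V(D, G) from samples of p_data and p_z."""
    real_term = sum(math.log(discriminator(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - discriminator(generator(z)))
                    for z in noise_samples) / len(noise_samples)
    return real_term + fake_term

rng = random.Random(1)
real = [rng.gauss(4.0, 1.0) for _ in range(1000)]   # hypothetical data distribution
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # samples from p_z

def perfect_generator(z):
    """Shifts the noise onto the data distribution."""
    return z + 4.0

def poor_generator(z):
    """Leaves the noise where it is, far from the data."""
    return z

def undecided(x):
    """Discriminator that cannot tell real from fake."""
    return 0.5

def separating(x):
    """Discriminator that calls points above 2 real."""
    return 1.0 / (1.0 + math.exp(-2.0 * (x - 2.0)))

v_equilibrium = value_function(undecided, perfect_generator, real, noise)
v_easy = value_function(separating, poor_generator, real, noise)
print(v_equilibrium, -math.log(4.0))
print(v_easy)
```

The discriminator pushes the value up and the generator pushes it down, so the poor generator facing a separating discriminator gives a larger value than the equilibrium case.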
Training GAN
High-level software focuses on a user-friendly interface for specifying and training models.
Keras, PyTorch, scikit-learn, …
Lower-level software focuses on developer tools for implementing deep learning models.
TensorFlow, Theano, CNTK, Caffe, Torch, …
Most tools are developed in Python plus a low-level language.
TensorFlow was developed by the Google Brain team for internal Google use. Formerly DistBelief.
Open sourced in Nov 2015.
OS: Linux, MacOS, and Windows (since Nov 2016).
GPU support: NVIDIA CUDA.
TPU (tensor processing unit), built specifically for machine learning and tailored for TensorFlow.
Mobile device deployment: TensorFlow Lite (May 2017) for Android and iOS.
When you have a hammer, everything looks like a nail.
R users can access Keras and TensorFlow via the keras and tensorflow packages.
#install.packages("keras")
library(keras)
install_keras()
# install_keras(tensorflow = "gpu") # if NVIDIA GPU is available