Reading group on Deep Learning

LEAR - XRCE, 23 Nov. 2012

Notes on this document:

this HTML page was generated from the IPython notebook available here (pure python version executable in any python interpreter: here)
dependencies to run the code (only Open Source Software!):
- Python: the best programming language, ever.
- IPython: an enhanced python interpreter + great tools for scientific workflow
- Numpy: the Matlab library for Python (multi-dimensional arrays, linear algebra, ...)
- Matplotlib: the scientific plotting library for Python
- Networkx: package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
- Theano: efficient tensor library, CPU + GPU expresion compiler, used for Deep Learning

Outline

Refresher on Neural Networks
- Neural Networks
- Back-Propagation
Deep Learning in practice
- What is Deep Learning?
- Overview of Theano
- Overview of EBLearn

Pointers

Reading material

The Elements of Statistical Learning (5th ed.) (T. Hastie, R. Tibshirani, & J. Friedman) : Chapter 11 (Neural Networks)
Learning Deep Architectures for AI (Y. Bengio, In Foundations & Trends in Machine Learning, 2009) : small book providing an introduction to deep learning and individual chapters on most of the main deep architectures
A fast learning algorithm for deep belief nets (G. Hinton et al., Neural Computation, 2006): fast algorithm for learning deep beliefs nets + justification of greedy layer-wise training and stacking
Building High-level Features Using Large Scale Unsupervised Learning (Quoc V. Le et al., ICML'2012): Large scale deep learning (sparse auto-encoders) using 16000 cores; led to face and cat neurons from unlabeled data; state of the art on ImageNet from raw pixels; made the front page of NY Times.
On Deep Generative Models with Applications to Recognition (M.A. Ranzato et al., CVPR 2011)
Y. LeCun's recent papers (cf. his publication list)
CVPR'2012 workshop on Deep Learning Methods for Vision: overview of most common deep architectures, transfer learning, application to motion and video
CVPR'2012 workshop on Multiview Feature Learning: recent workshop about deep unsupervised feature learning, includes application of deep learning techniques to video
tutorial slides at the NIPS 2010 Workshop on Deep Learning and Unsupervised Feature Learning
cf. the deeplearning.net reading list for additional pointers

Websites

deeplearning.net : Big hub with many links related to deep learning (software, publications, ...)
Deep learning tutorials : Set of tutorials on deep learning (RBM, ConvNets, ...) with theano
Neural Nets F.A.Q. : extremely comprehensive FAQ on neural nets (from stats to software)
Homepage of G. Hinton
Homepage of Y. LeCun
Mainstream media coverage of advances in deep learning:
- NY Times front page article on the ICML'12 paper of Andrew Ng's group (Ng's interview: here)
- Another NY Times article about the recent progress in deep learning on a wide range of problems
- Blog post of Rick Rashid Microsoft's CRO, with an interesting video explaining the progress made in Automatic Speech Recognition and Translation: from matching sound waves, to HMMs, to Deep Learning; R. Rashid makes a cool live demo (at the end of the video) where his speech gets translated on the fly in (spoken!) Mandarin

Code

theano : Python library for efficient computations on tensors (multi-dimensional arrays)
EBLearn : C++ library developped @NYU (LeCun) for Energy-Based Learning
cuda-convnet : fast implementation of ConvNet in C++ and Cuda (GPU), by winners of ImageNet LSVRC 2012 (Krizhevsky A. et al. NIPS'12 paper to come)

Refresher on Neural Networks

Central idea

extract derived features = linear combination of the inputs
predict target using a non-linear function of these features

Preliminary: Projection Pursuit Regression (PPR)

Supervised learning: input vector $X \in \mathbb{R}^p$, target $Y$
$V_m = w_m^T X$ : derived features, with $M$ unit vectors $w_m \in \mathbb{R}^p$
PPR prediction:

$$f(X) = \sum_{m=1}^{M} g_m(V_m)$$

Parameters:
- $\{w_m\}_m$ : projection directions
- $\{g_m\}_m$ : ridge functions (each varies only in the direction of $w_m$)

In [1]:

# Plot of two ridge function examples


from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
import numpy as np


def g_left(X1, X2):
    V = (X1 + X2) / np.sqrt(2)
    return 1.0 / (1.0 + np.exp(-5 * (V - 0.5)))


def g_right(X1, X2):
    V = X1
    return (V + 0.1) * np.sin(10 / (V / 3.0 + 0.1))


def plot_surface_3d(X1, X2, func, fig=None, subplot=(1, 1, 1)):
    if fig is None:
        fig = plt.figure(figsize=(12, 5))
    ax = fig.add_subplot(*subplot, projection='3d')
    X1, X2 = np.meshgrid(X1, X2)
    Z = func(X1, X2)
    surf = ax.plot_surface(X1, X2, Z,
        rstride=1, cstride=1, cmap=cm.coolwarm,
        linewidth=0, antialiased=False)


fig = plt.figure(figsize=(12, 5))
X1, X2 = np.linspace(-1.5, 1.5, 150), np.linspace(-0.5, 1, 150)
plot_surface_3d(X1, X2, g_left, fig, subplot=(1, 2, 1))
X1, X2 =  np.linspace(-0.05, 0.05, 300), np.linspace(0, 1, 100)
plot_surface_3d(X1, X2, g_right, fig, subplot=(1, 2, 2))

PPR model is a universal approximator: can approximate any continuous function in $\mathbb{R}^p$ (if M is taken arbitrarily large, for appropriate choices of $g_m$)
but interpretation of the fitted model is difficult (mixing of the input features)
Learning: PPR model is fitted by minimizing the squared error in an alternating manner:
- Given $w_m$, estimate $g_m$: 1D smoothing problem (e.g. use splines)
- Given $g_m$, estimate $w_m$: non-linear least-squares (e.g. use Gauss-Newton)
- Iterate these two steps until convergence
- Greedily add a new pair $(w_{m+1}, g_{m+1})$
PPR not used much (originally for computational reasons)

Neural Networks (NN)

The model

NN = non-linear statistical model inspired by biological neural networks
Set of neurons (much like $g_m(w_m^T X)$ in PPR) interconnected in layers in a feed-forward fashion

In [2]:

# How to plot a network diagram of a NN


import matplotlib.pyplot as plt
import networkx as nx


class NeuralNetwork(object):
    """ A simple neural network class for visualization purposes
    """
    def __init__(self, n_nodes_per_layer):
        """ Build the network layer by layer
        """
        self.n_nodes_per_layer = n_nodes_per_layer
        self.graph = nx.DiGraph()
        self.nodes_pos = {}
        self.nodes_label = {}
        self.input_units = []
        self.hidden_units = []
        self.output_units = []

        # add the units per layer
        for layer_i, layer_size in enumerate(n_nodes_per_layer):

            # add the nodes for this layer
            for node_i in range(layer_size):

                node = "%d_%d" % (layer_i, node_i)  # simple encoding of a node
                self.graph.add_node(node)
                self.nodes_pos[node] = (layer_i, layer_size / 2.0 - node_i)

                if layer_i == 0:
                    # label for input layer
                    self.input_units.append(node)
                    self.nodes_label[node] = r"$X_%d$" % node_i
                elif layer_i == len(n_nodes_per_layer) - 1:
                    # label for output layer
                    self.output_units.append(node)
                    self.nodes_label[node] = r"$Y_%d$" % node_i
                else:
                    # hidden layer
                    self.hidden_units.append(node)
                    self.nodes_label[node] = r"$Z_{%d, %d}$" % (layer_i, node_i)

            # add the edges: full connection between layer_i -1 and layer_i
            if layer_i > 0:
                prev_layer_i = layer_i - 1
                prev_layer_size = n_nodes_per_layer[prev_layer_i]
                self.graph.add_edges_from([
                    ("%d_%d" % (prev_layer_i, l), "%d_%d" % (layer_i, k))
                    for k in range(layer_size) for l in range(prev_layer_size)
                ])

    def draw(self, ax=None):
        """ Draw the neural network
        """
        if ax is None:
            ax = plt.figure(figsize=(10, 6)).add_subplot(1, 1, 1)
        nx.draw_networkx_edges(self.graph, pos=self.nodes_pos, alpha=0.7, ax=ax)
        nx.draw_networkx_nodes(self.graph, nodelist=self.input_units,
                               pos=self.nodes_pos, ax=ax,
                               node_color='#66FFFF', node_size=700)
        nx.draw_networkx_nodes(self.graph, nodelist=self.hidden_units,
                               pos=self.nodes_pos, ax=ax,
                               node_color='#CCCCCC', node_size=900)
        nx.draw_networkx_nodes(self.graph, nodelist=self.output_units,
                               pos=self.nodes_pos, ax=ax,
                               node_color='#FFFF99', node_size=700)
        nx.draw_networkx_labels(self.graph, labels=self.nodes_label,
                                pos=self.nodes_pos, font_size=14, ax=ax)
        ax.axis('off')


# classifcal 3 layer neural network example
n_layers = 3  # input layer | hidden layer | output layer
n_input_dims = 5
n_hidden_units = 3
n_output_dims = 2
n_nodes_per_layer = [n_input_dims, n_hidden_units, n_output_dims]
nn1 = NeuralNetwork(n_nodes_per_layer)

# another example
nn2 = NeuralNetwork([10, 5, 10])

# show the graphical representations
fig = plt.figure(figsize=(12, 8))
nn1.draw(fig.add_subplot(1, 2, 1))
nn2.draw(fig.add_subplot(1, 2, 2))

Formal definition of NN (assuming full connectivity between layers):

Input layer: one unit for each dimension of the input $X \in \mathbb{R}^p$
Hidden layers: derived features from the input of the previous layer
- Assuming one hidden layer with $M$ units
- Derived features: $Z_m = \sigma (\alpha_m^T X + \alpha_{0, m}), \ m = 1, \cdots, M$
- Activation function $\sigma$: typically a sigmoid $\sigma(v) = (1 + e^{-v})^{-1}$
Output layer: one unit for each dimension of the output $Y$
- Output $Y_k$ again computed using a linear combination of the outputs $Z$ of the previous layer: $Y_k = f_k(X) = g_k(\beta_m^T Z + \beta_{0, m})$
- Output function $g_k$ chosen based on task:
  - regression (e.g. denoising with auto-encoders, $p$ units to obtain $Y \in \mathbb{R}^p$ = denoised $X$): identity $g_k(V) = V$
  - classification (e.g. for object recognition, $K$ units to output probability for each possible object category): softmax $g_k(V) = \frac{e^{V_k}}{\sum_{l=1}^K e^{V_k}}$
Parameters (weights, noted $\theta$) to estimate from the data:
- the $\alpha_m$'s and bias terms $\alpha_{0, m}$ of each unit at each hidden layer
- the $\beta_m$'s and bias terms $\beta_{0, m}$ of each unit of the output layer
Remarks:
- linear $\sigma \Rightarrow$ linear model
- one hidden layer $\Rightarrow$ similar to PPR (but PPR uses few non-parametric $g_m$ functions, whereas NN uses many units with a simple $\sigma$ function)

The Back-Propagation algorithm

(Again assuming one hidden layer for brevity)

Algorithm used to learn the weights $\theta$ in a NN
Minimization of a loss function $R(\theta)$
- Regression: SSE
  
  $$R(\theta) = \sum_{k=1}^K \sum_{i=1}^N (y_{i,k} - f_k(x_i))^2$$
- Classification: (SSE or) Cross-Entropy
  
  $$R(\theta) = - \sum_{i=1}^N \sum_{k=1}^K y_{i,k} \mathrm{log} f_k(x_i)$$
Regularization:
- by adding a penalty term (e.g. $\Vert \theta \Vert_2^2$, called weight decay)
- or early stopping
Back-Propagation:
- minimization of $R(\theta)$ by gradient descent:
  
  $$\beta_{k, m}^{(r+1)} = \beta_{k, m}^{(r)} - \gamma_r \frac{\partial R_i}{\partial \beta_{k, m}^{(r)}}$$
  
  $$\alpha_{m, l}^{(r+1)} = \alpha_{m, l}^{(r)} - \gamma_r \frac{\partial R_i}{\partial \alpha_{m, l}^{(r)}}$$
- Computing the gradient: using the chain rule, done in a forward and backward sweep over the network
  - forward pass: fix the weights, get the predicted values $\hat{f}_k(x_i)$
  - backward pass: compute the errors and back-propagate (estimate weights from output $\rightarrow$ input layers

Remarks

back-propagation can be done in batch or online, i.e. one observation at a time $\Rightarrow$ (SGD: stochastic gradient descent)
learning rate $\gamma_r$:
- usually constant in batch
- must verify $\gamma_r \rightarrow 0$, $\sum_r \gamma_r = \infty$, and $\sum_r \gamma_r^2 \lt \infty$ (e.g. $\gamma_r = 1/r$) in SGD
back-prop is very slow in practice: use conjugate gradient
scaling of inputs determines scaling of weights $\Rightarrow$ standardize (0 mean, unit variance)
initialization:
- important in practice: loss is not convex (many local minima!)
- multiple re-starts with random near-zero weights (nearly linear regime for sigmoid)
- Combine the outputs of $L$ different learned NNs by averaing their predictions
  
  $$\hat{f}(X_{\mathrm{new}}) = \sum_{l=1}^L w_l \mathbb{E}(Y_{\mathrm{new}} | X_{\mathrm{new}}, \hat{\theta}_l)$$
  - bagging: $w_l = 1/L$ and $\hat{\theta}_l$ parameters obtained by fitting on different random permutations of the training data
  - boosting: $w_l = 1$ but $\hat{\theta}_l$ chosen in non-random sequential fashion to improve the fit
  - Bayesian inference: $w_l = 1/L$ and sample $\theta_l$ from the posterior using MCMC
  - number of hidden:
    - units: the more the better + regularization
    - layers: task-dependent (multiple layers $\Rightarrow$ hierarchical features)
$\Rightarrow$ that is where hand-tuning comes in! no free lunch ;-)
difficult to interpret / visualize what is learned (is it still true?)

Deep Learning in practice

What is Deep Learning?

According to deeplearning.net:

"Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence."

"... moving beyond shallow learning since 2006!"
Learning of probabilistic generative models with deep architectures, i.e. with many layers of non-linearities

Examples

Deep Multi-Layer Perceptrons (i.e. NN like before, but with many hidden layers)
Convolutional Networks
Stacked Auto-encoders
Restricted Boltzman Machines (RBM)
Deep Belief Networks (DBN)

Why going deep?

Can trade-off number of units for layers (more efficient: less connections and therefore weights)
Hierarchical: useful for multi-scale analysis
"Combinatorial sharing of statistical strength between multiple levels of latent variables" (NIPS'10 workshop)... ?
it works (according to Hinton and Lecun... but starting to generalize more and more to other researchers)

How to learn deeply?

Back-prop has some severe limitations:
- Gradient gets progressively diluted across layers (very limited corrections in first layers)
- Local minima (very sensitive to initialization)
- Needs labeled data (in usual settings)
What works, e.g. on Deep Belief Networks (DBN):
- greedy layer-wise unsupervised learning (serves as initialization step):
  - learn a RBM by maximizing lower bound on likelihood
  - stack a new RBM on top, update the prior of the second layer with posterior of the first, learn
  - iterate
  - justified in theory (cf. Hinton'2006)
- top-down supervised training as final refinement step
Many tools to learn deep architectures:
- At the core: libraries to work on tensors (multi-dimensional arrays)
- We will review two recent, efficient, easy to use, and actively maintained librairies:
  - Theano (python)
  - EBLearn (C++)

Overview of Theano

Theano:
- python Open Source library (BSD-like permissive license, cross-platform)
- Active project maintained by many people, including members of Yoshua Bengio's LISA group at U. of Montreal
Main purpose of Theano (from their website):

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:

tight integration with numpy -- Use numpy.ndarray in Theano-compiled functions.

transparent use of a GPU -- Perform data-intensive calculations up to 140x faster than with CPU.(float32 only)

efficient symbolic differentiation -- Theano does your derivatives for function with one or many inputs.

speed and stability optimizations -- Get the right answer for log(1+x) even when x is really tiny.

dynamic C code generation -- Evaluate expressions faster.

extensive unit-testing and self-verification -- Detect and diagnose many types of mistake.

Sort of mix between Numpy (Python's Matlab) and Sympy (Python's Mathematica) focusing on tensor operations

execution speed optimizations: Theano can use g++ or nvcc to compile parts your expression graph into CPU or GPU instructions, which run much faster than pure Python.

symbolic differentiation: Theano can automatically build symbolic graphs for computing gradients.

stability optimizations: Theano can recognize [some] numerically unstable expressions and compute them with more stable algorithms.

Ambitious goals (and they're making progress!)

Support tensor and sparse operations

Support linear algebra operations

Graph Transformations

Differentiation/higher order differentiation

'R' and 'L' differential operators

Speed/memory optimizations

Numerical stability optimizations

Can use many compiled languages, instructions sets: C/C++, CUDA, OpenCL, PTX, CAL, AVX, ...

Lazy evaluation

Loop

Parallel execution (SIMD, multi-core, multi-node on cluster, multi-node distributed)

Support all NumPy/basic SciPy functionality

Easy wrapping of library functions in Theano

Easy to install (just do: pip install Theano)
Extensively documented (docs)
Great tutorial (tuto)
Great tutorials on deep learning written by members of the LISA lab

In [3]:

# Sneak peek

import theano
from theano import tensor

# declare two symbolic floating-point scalars
a = tensor.dscalar()
b = tensor.dscalar()

# create a simple expression
c = a + b

# convert the expression into a callable object that takes (a,b)
# values as input and computes a value for c
f = theano.function([a,b], c)

# bind 1.5 to 'a', 2.5 to 'b', and evaluate 'c'
assert 4.0 == f(1.5, 2.5)

In [4]:

# Logistic Regression simple example

import numpy
import theano
import theano.tensor as T
rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats), rng.randint(size=N,low=0, high=2))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats), name="w")
b = theano.shared(0., name="b")
#print "Initial model:"
#print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))       # Probability that target = 1
prediction = p_1 > 0.5                        # The prediction thresholded
xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) # Cross-entropy loss function
cost = xent.mean() + 0.01 * (w ** 2).sum()    # The cost to minimize
gw, gb = T.grad(cost, [w, b])                 # Compute the gradient of the cost

# Compile
train = theano.function(
          inputs=[x,y],
          outputs=[prediction, xent],
          updates={w: w - 0.1 * gw, b: b - 0.1 * gb})
predict = theano.function(inputs=[x], outputs=prediction)

# Train
for i in range(training_steps):
    pred, err = train(D[0], D[1])

# To display the model, uncomment below:
#print "Final model:"
#print w.get_value(), b.get_value()
#print "target values for D:", D[1]
#print "prediction on D:", predict(D[0])

Deep Learning in Theano

Many implementations already available and documented in the deep learning tutorials (cf. the code on github)

Supervised learning

Introduction: notations, datasets, extensive primer on supervised deep learning
Logistic Regression (LR) + application on MNIST (both implemented using conjugate gradient and SGD): often used as the last layer of classification architectures

$$ P(Y=i|x, W,b) = softmax_i(W x + b) \ = \frac {e^{W_i x + b_i}} {\sum_j e^{W_j x + b_j}} $$

$$y_{pred} = {\rm argmax}_i P(Y=i|x,W,b)$$

Multilayer Perceptron (MLP): going from LR to MLP, train with SGD + tricks of the trade (activation functions, regularization, initialization, learning rate, ...)

Deep Convolutional Neural Networks (CNN): variants of MLP with local connectivity (to mimic biological receptive fields), weight sharing (detect features regardless of position + reduces parameters / capacity control + still trainable with SGD), and many layers, succession of pooling and subsampling operations, example CNN used: LeNet (cf. "Gradient-based learning applied to document recognition", LeCun et al., 1998)

Unsupervised learning

Auto-Encoders: building a latent representation (coding), mapped back (decoding) into a reconstruction of the input by minimzing the reconstruction error, application to image denoising
Stacked Denoising Auto-Encoders: unsupervised pre-training one layer at a time for deep nets, final stage: task-specific fine tuning using a supervised MLP
Restricted Boltzmann Machines (RBM):
- single layer generative Energy-Based Model (EBM, cf. EBLearn overview below)
- energy: $E(v, h) = -b' v - c'h - h'Wv$, where $v$ is the observed input units, $h$ the hidden units, $W$ are the connection weights, $b, c$ are offset parameters
- a particular form of log-linear Markov Random Field (MRF), i.e., for which the energy function is linear in its free parameters
- learned by SGD with MCMC sampling to approximate the gradient
Deep Belief Networks (DBN): stacked RBMs, trained in a greedy layer-wise manner

Overview of EBLearn

EBLearn:
- well-architectured C++ library for Energy-Based Learning, maintained by Yann LeCun's group @NYU (in particular Pierre Sermanet)
- used in robotics (c.f the Learning Applied to Ground Robots (LAGR) DARPA project)
- particular focus on supervised training of convolutional neural networks (especially LeNet)
Enery-Based learning (overview from here):
- associate a scalar energy to each configuration of the variables of interest
- learning corresponds to modifying that energy function so that its shape has desirable properties:
  - plausible configurations $\Rightarrow$ low energy
  - bad configurations $\Rightarrow$ high energy
- probability distribution: $Pr(x) = e^{-E(x)} / Z$, where $Z=\sum_x e^{-E(x)}$ is the partition function
- model learnt by (stochastic) gradient descent on the empirical negative log-likelihood of the training data
  
  $$ l(\theta, \mathcal{D}) = - \frac{1}{N} \sum_{x^{(i)} \in \mathcal{D}} \mathrm{log} Pr(x^{(i)}) $$
with latent variables $h$, the model becomes

$$ Pr(x) = \sum_h Pr(x, h) = \sum_h \frac{e^{-E(x,h)}}{Z} = \frac{e^{-\mathcal{F}(x)}}{Z} $$

where $\mathcal{F}(x) = - \mathrm{log} \sum_h e^{-E(x, h)}$ is called the free energy and $Z = \sum_x e^{-E(x,h)}$
in this case, the gradient of the data's negative log-likelihood decomposes over a positive and negative phase:

$$ \frac{\partial \log p(x)}{\partial \theta} = \frac{\partial \mathcal{F}(x)}{\partial \theta} - \sum_{\tilde{x}} p(\tilde{x}) \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta} $$
approximate estimation of the negative phase is obtained by averaging over negative particles $\mathcal{N}$ sampled using MCMC

$$ \frac{\partial \log p(x)}{\partial \theta} \approx \frac{\partial \mathcal{F}(x)}{\partial \theta} - \frac{1}{|\mathcal{N}|}\sum_{\tilde{x} \in \mathcal{N}} \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta} $$

EBLearn usage

Starting point: paper, tutorials, demos, and documentation
Installation: using CMake, cross-platform (even Android and iOS!), only few dependencies
Beginner tutorial (part 1, part 2):
- build a LeNet5 digit regognizer on MNIST
- no C++ coding required, just write EBLearn configuration files and use EBLearn's command-line tools (e.g. train)
Set of more advanced tutorials (cf. here) decribe how to use EBLearn's C++ API, resting on its powerful tensor library libidx (tuto) and energy-based learning library libeblearn (tuto)
Face demo on the "Labeled Faces in the Wild" dataset: just type make detect :-)