# Reading group on Deep Learning

LEAR - XRCE, 23 Nov. 2012

Notes on this document:

• this HTML page was generated from the IPython notebook available here (pure python version executable in any python interpreter: here)

• dependencies to run the code (only Open Source Software!):

• Python: the best programming language, ever.
• IPython: an enhanced python interpreter + great tools for scientific workflow
• Numpy: the Matlab library for Python (multi-dimensional arrays, linear algebra, ...)
• Matplotlib: the scientific plotting library for Python
• Networkx: package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
• Theano: efficient tensor library, CPU + GPU expresion compiler, used for Deep Learning

## Outline

• Refresher on Neural Networks

• Neural Networks

• Back-Propagation

• Deep Learning in practice

• What is Deep Learning?

• Overview of Theano

• Overview of EBLearn

## Pointers

### Code

• theano : Python library for efficient computations on tensors (multi-dimensional arrays)
• EBLearn : C++ library developped @NYU (LeCun) for Energy-Based Learning
• cuda-convnet : fast implementation of ConvNet in C++ and Cuda (GPU), by winners of ImageNet LSVRC 2012 (Krizhevsky A. et al. NIPS'12 paper to come)

# Refresher on Neural Networks

## Central idea

• extract derived features = linear combination of the inputs
• predict target using a non-linear function of these features

## Preliminary: Projection Pursuit Regression (PPR)

• Supervised learning: input vector $X \in \mathbb{R}^p$, target $Y$
• $V_m = w_m^T X$ : derived features, with $M$ unit vectors $w_m \in \mathbb{R}^p$
• PPR prediction:

$$f(X) = \sum_{m=1}^{M} g_m(V_m)$$

• Parameters:
• $\{w_m\}_m$ : projection directions
• $\{g_m\}_m$ : ridge functions (each varies only in the direction of $w_m$)
In [1]:
# Plot of two ridge function examples

from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
import numpy as np

def g_left(X1, X2):
V = (X1 + X2) / np.sqrt(2)
return 1.0 / (1.0 + np.exp(-5 * (V - 0.5)))

def g_right(X1, X2):
V = X1
return (V + 0.1) * np.sin(10 / (V / 3.0 + 0.1))

def plot_surface_3d(X1, X2, func, fig=None, subplot=(1, 1, 1)):
if fig is None:
fig = plt.figure(figsize=(12, 5))
X1, X2 = np.meshgrid(X1, X2)
Z = func(X1, X2)
surf = ax.plot_surface(X1, X2, Z,
rstride=1, cstride=1, cmap=cm.coolwarm,
linewidth=0, antialiased=False)

fig = plt.figure(figsize=(12, 5))
X1, X2 = np.linspace(-1.5, 1.5, 150), np.linspace(-0.5, 1, 150)
plot_surface_3d(X1, X2, g_left, fig, subplot=(1, 2, 1))
X1, X2 =  np.linspace(-0.05, 0.05, 300), np.linspace(0, 1, 100)
plot_surface_3d(X1, X2, g_right, fig, subplot=(1, 2, 2))
• PPR model is a universal approximator: can approximate any continuous function in $\mathbb{R}^p$ (if M is taken arbitrarily large, for appropriate choices of $g_m$)
• but interpretation of the fitted model is difficult (mixing of the input features)
• Learning: PPR model is fitted by minimizing the squared error in an alternating manner:
• Given $w_m$, estimate $g_m$: 1D smoothing problem (e.g. use splines)
• Given $g_m$, estimate $w_m$: non-linear least-squares (e.g. use Gauss-Newton)
• Iterate these two steps until convergence
• Greedily add a new pair $(w_{m+1}, g_{m+1})$
• PPR not used much (originally for computational reasons)

## Neural Networks (NN)

### The model

• NN = non-linear statistical model inspired by biological neural networks
• Set of neurons (much like $g_m(w_m^T X)$ in PPR) interconnected in layers in a feed-forward fashion
In [2]:
# How to plot a network diagram of a NN

import matplotlib.pyplot as plt
import networkx as nx

class NeuralNetwork(object):
""" A simple neural network class for visualization purposes
"""
def __init__(self, n_nodes_per_layer):
""" Build the network layer by layer
"""
self.n_nodes_per_layer = n_nodes_per_layer
self.graph = nx.DiGraph()
self.nodes_pos = {}
self.nodes_label = {}
self.input_units = []
self.hidden_units = []
self.output_units = []

# add the units per layer
for layer_i, layer_size in enumerate(n_nodes_per_layer):

# add the nodes for this layer
for node_i in range(layer_size):

node = "%d_%d" % (layer_i, node_i)  # simple encoding of a node
self.nodes_pos[node] = (layer_i, layer_size / 2.0 - node_i)

if layer_i == 0:
# label for input layer
self.input_units.append(node)
self.nodes_label[node] = r"$X_%d$" % node_i
elif layer_i == len(n_nodes_per_layer) - 1:
# label for output layer
self.output_units.append(node)
self.nodes_label[node] = r"$Y_%d$" % node_i
else:
# hidden layer
self.hidden_units.append(node)
self.nodes_label[node] = r"$Z_{%d, %d}$" % (layer_i, node_i)

# add the edges: full connection between layer_i -1 and layer_i
if layer_i > 0:
prev_layer_i = layer_i - 1
prev_layer_size = n_nodes_per_layer[prev_layer_i]
("%d_%d" % (prev_layer_i, l), "%d_%d" % (layer_i, k))
for k in range(layer_size) for l in range(prev_layer_size)
])

def draw(self, ax=None):
""" Draw the neural network
"""
if ax is None:
ax = plt.figure(figsize=(10, 6)).add_subplot(1, 1, 1)
nx.draw_networkx_edges(self.graph, pos=self.nodes_pos, alpha=0.7, ax=ax)
nx.draw_networkx_nodes(self.graph, nodelist=self.input_units,
pos=self.nodes_pos, ax=ax,
node_color='#66FFFF', node_size=700)
nx.draw_networkx_nodes(self.graph, nodelist=self.hidden_units,
pos=self.nodes_pos, ax=ax,
node_color='#CCCCCC', node_size=900)
nx.draw_networkx_nodes(self.graph, nodelist=self.output_units,
pos=self.nodes_pos, ax=ax,
node_color='#FFFF99', node_size=700)
nx.draw_networkx_labels(self.graph, labels=self.nodes_label,
pos=self.nodes_pos, font_size=14, ax=ax)
ax.axis('off')

# classifcal 3 layer neural network example
n_layers = 3  # input layer | hidden layer | output layer
n_input_dims = 5
n_hidden_units = 3
n_output_dims = 2
n_nodes_per_layer = [n_input_dims, n_hidden_units, n_output_dims]
nn1 = NeuralNetwork(n_nodes_per_layer)

# another example
nn2 = NeuralNetwork([10, 5, 10])

# show the graphical representations
fig = plt.figure(figsize=(12, 8))

Formal definition of NN (assuming full connectivity between layers):

• Input layer: one unit for each dimension of the input $X \in \mathbb{R}^p$
• Hidden layers: derived features from the input of the previous layer
• Assuming one hidden layer with $M$ units
• Derived features: $Z_m = \sigma (\alpha_m^T X + \alpha_{0, m}), \ m = 1, \cdots, M$
• Activation function $\sigma$: typically a sigmoid $\sigma(v) = (1 + e^{-v})^{-1}$
• Output layer: one unit for each dimension of the output $Y$
• Output $Y_k$ again computed using a linear combination of the outputs $Z$ of the previous layer: $Y_k = f_k(X) = g_k(\beta_m^T Z + \beta_{0, m})$
• Output function $g_k$ chosen based on task:
• regression (e.g. denoising with auto-encoders, $p$ units to obtain $Y \in \mathbb{R}^p$ = denoised $X$): identity $g_k(V) = V$
• classification (e.g. for object recognition, $K$ units to output probability for each possible object category): softmax $g_k(V) = \frac{e^{V_k}}{\sum_{l=1}^K e^{V_k}}$
• Parameters (weights, noted $\theta$) to estimate from the data:
• the $\alpha_m$'s and bias terms $\alpha_{0, m}$ of each unit at each hidden layer
• the $\beta_m$'s and bias terms $\beta_{0, m}$ of each unit of the output layer
• Remarks:
• linear $\sigma \Rightarrow$ linear model
• one hidden layer $\Rightarrow$ similar to PPR (but PPR uses few non-parametric $g_m$ functions, whereas NN uses many units with a simple $\sigma$ function)

## The Back-Propagation algorithm

(Again assuming one hidden layer for brevity)

• Algorithm used to learn the weights $\theta$ in a NN
• Minimization of a loss function $R(\theta)$

• Regression: SSE

$$R(\theta) = \sum_{k=1}^K \sum_{i=1}^N (y_{i,k} - f_k(x_i))^2$$

• Classification: (SSE or) Cross-Entropy

$$R(\theta) = - \sum_{i=1}^N \sum_{k=1}^K y_{i,k} \mathrm{log} f_k(x_i)$$

• Regularization:

• by adding a penalty term (e.g. $\Vert \theta \Vert_2^2$, called weight decay)
• or early stopping
• Back-Propagation:

• minimization of $R(\theta)$ by gradient descent:

$$\beta_{k, m}^{(r+1)} = \beta_{k, m}^{(r)} - \gamma_r \frac{\partial R_i}{\partial \beta_{k, m}^{(r)}}$$

$$\alpha_{m, l}^{(r+1)} = \alpha_{m, l}^{(r)} - \gamma_r \frac{\partial R_i}{\partial \alpha_{m, l}^{(r)}}$$

• Computing the gradient: using the chain rule, done in a forward and backward sweep over the network

• forward pass: fix the weights, get the predicted values $\hat{f}_k(x_i)$
• backward pass: compute the errors and back-propagate (estimate weights from output $\rightarrow$ input layers

### Remarks

• back-propagation can be done in batch or online, i.e. one observation at a time $\Rightarrow$ (SGD: stochastic gradient descent)
• learning rate $\gamma_r$:
• usually constant in batch
• must verify $\gamma_r \rightarrow 0$, $\sum_r \gamma_r = \infty$, and $\sum_r \gamma_r^2 \lt \infty$ (e.g. $\gamma_r = 1/r$) in SGD
• back-prop is very slow in practice: use conjugate gradient
• scaling of inputs determines scaling of weights $\Rightarrow$ standardize (0 mean, unit variance)
• initialization:

• important in practice: loss is not convex (many local minima!)
• multiple re-starts with random near-zero weights (nearly linear regime for sigmoid)
• Combine the outputs of $L$ different learned NNs by averaing their predictions

$$\hat{f}(X_{\mathrm{new}}) = \sum_{l=1}^L w_l \mathbb{E}(Y_{\mathrm{new}} | X_{\mathrm{new}}, \hat{\theta}_l)$$

• bagging: $w_l = 1/L$ and $\hat{\theta}_l$ parameters obtained by fitting on different random permutations of the training data
• boosting: $w_l = 1$ but $\hat{\theta}_l$ chosen in non-random sequential fashion to improve the fit
• Bayesian inference: $w_l = 1/L$ and sample $\theta_l$ from the posterior using MCMC
• number of hidden:
• units: the more the better + regularization
• layers: task-dependent (multiple layers $\Rightarrow$ hierarchical features)

$\Rightarrow$ that is where hand-tuning comes in! no free lunch ;-)

• difficult to interpret / visualize what is learned (is it still true?)

# Deep Learning in practice

## What is Deep Learning?

• According to deeplearning.net:

"Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence."

"... moving beyond shallow learning since 2006!"

• Learning of probabilistic generative models with deep architectures, i.e. with many layers of non-linearities

### Examples

• Deep Multi-Layer Perceptrons (i.e. NN like before, but with many hidden layers)
• Convolutional Networks
• Stacked Auto-encoders
• Restricted Boltzman Machines (RBM)
• Deep Belief Networks (DBN)

### Why going deep?

• Can trade-off number of units for layers (more efficient: less connections and therefore weights)
• Hierarchical: useful for multi-scale analysis
• "Combinatorial sharing of statistical strength between multiple levels of latent variables" (NIPS'10 workshop)... ?
• it works (according to Hinton and Lecun... but starting to generalize more and more to other researchers)

### How to learn deeply?

• Back-prop has some severe limitations:

• Gradient gets progressively diluted across layers (very limited corrections in first layers)
• Local minima (very sensitive to initialization)
• Needs labeled data (in usual settings)
• What works, e.g. on Deep Belief Networks (DBN):

• greedy layer-wise unsupervised learning (serves as initialization step):
• learn a RBM by maximizing lower bound on likelihood
• stack a new RBM on top, update the prior of the second layer with posterior of the first, learn
• iterate
• justified in theory (cf. Hinton'2006)
• top-down supervised training as final refinement step
• Many tools to learn deep architectures:

• At the core: libraries to work on tensors (multi-dimensional arrays)
• We will review two recent, efficient, easy to use, and actively maintained librairies:
• Theano (python)
• EBLearn (C++)

## Overview of Theano

• python Open Source library (BSD-like permissive license, cross-platform)
• Active project maintained by many people, including members of Yoshua Bengio's LISA group at U. of Montreal
• Main purpose of Theano (from their website):

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:

• tight integration with numpy -- Use numpy.ndarray in Theano-compiled functions.
• transparent use of a GPU -- Perform data-intensive calculations up to 140x faster than with CPU.(float32 only)
• efficient symbolic differentiation -- Theano does your derivatives for function with one or many inputs.
• speed and stability optimizations -- Get the right answer for log(1+x) even when x is really tiny.
• dynamic C code generation -- Evaluate expressions faster.
• extensive unit-testing and self-verification -- Detect and diagnose many types of mistake.
• Sort of mix between Numpy (Python's Matlab) and Sympy (Python's Mathematica) focusing on tensor operations
• execution speed optimizations: Theano can use g++ or nvcc to compile parts your expression graph into CPU or GPU instructions, which run much faster than pure Python.

• symbolic differentiation: Theano can automatically build symbolic graphs for computing gradients.

• stability optimizations: Theano can recognize [some] numerically unstable expressions and compute them with more stable algorithms.

• Ambitious goals (and they're making progress!)
• Support tensor and sparse operations
• Support linear algebra operations
• Graph Transformations
• Differentiation/higher order differentiation
• 'R' and 'L' differential operators
• Speed/memory optimizations
• Numerical stability optimizations
• Can use many compiled languages, instructions sets: C/C++, CUDA, OpenCL, PTX, CAL, AVX, ...
• Lazy evaluation
• Loop
• Parallel execution (SIMD, multi-core, multi-node on cluster, multi-node distributed)
• Support all NumPy/basic SciPy functionality
• Easy wrapping of library functions in Theano
In [3]:
# Sneak peek

import theano
from theano import tensor

# declare two symbolic floating-point scalars
a = tensor.dscalar()
b = tensor.dscalar()

# create a simple expression
c = a + b

# convert the expression into a callable object that takes (a,b)
# values as input and computes a value for c
f = theano.function([a,b], c)

# bind 1.5 to 'a', 2.5 to 'b', and evaluate 'c'
assert 4.0 == f(1.5, 2.5)
In [4]:
# Logistic Regression simple example

import numpy
import theano
import theano.tensor as T
rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats), rng.randint(size=N,low=0, high=2))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats), name="w")
b = theano.shared(0., name="b")
#print "Initial model:"
#print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))       # Probability that target = 1
prediction = p_1 > 0.5                        # The prediction thresholded
xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) # Cross-entropy loss function
cost = xent.mean() + 0.01 * (w ** 2).sum()    # The cost to minimize
gw, gb = T.grad(cost, [w, b])                 # Compute the gradient of the cost

# Compile
train = theano.function(
inputs=[x,y],
outputs=[prediction, xent],
updates={w: w - 0.1 * gw, b: b - 0.1 * gb})
predict = theano.function(inputs=[x], outputs=prediction)

# Train
for i in range(training_steps):
pred, err = train(D[0], D[1])

# To display the model, uncomment below:
#print "Final model:"
#print w.get_value(), b.get_value()
#print "target values for D:", D[1]
#print "prediction on D:", predict(D[0])

## Deep Learning in Theano

Many implementations already available and documented in the deep learning tutorials (cf. the code on github)

### Supervised learning

$$P(Y=i|x, W,b) = softmax_i(W x + b) \ = \frac {e^{W_i x + b_i}} {\sum_j e^{W_j x + b_j}}$$

$$y_{pred} = {\rm argmax}_i P(Y=i|x,W,b)$$

• Multilayer Perceptron (MLP): going from LR to MLP, train with SGD + tricks of the trade (activation functions, regularization, initialization, learning rate, ...)

### Unsupervised learning

• Auto-Encoders: building a latent representation (coding), mapped back (decoding) into a reconstruction of the input by minimzing the reconstruction error, application to image denoising

• Stacked Denoising Auto-Encoders: unsupervised pre-training one layer at a time for deep nets, final stage: task-specific fine tuning using a supervised MLP

• single layer generative Energy-Based Model (EBM, cf. EBLearn overview below)
• energy: $E(v, h) = -b' v - c'h - h'Wv$, where $v$ is the observed input units, $h$ the hidden units, $W$ are the connection weights, $b, c$ are offset parameters

• a particular form of log-linear Markov Random Field (MRF), i.e., for which the energy function is linear in its free parameters
• learned by SGD with MCMC sampling to approximate the gradient
• Deep Belief Networks (DBN): stacked RBMs, trained in a greedy layer-wise manner

## Overview of EBLearn

• well-architectured C++ library for Energy-Based Learning, maintained by Yann LeCun's group @NYU (in particular Pierre Sermanet)

• Enery-Based learning (overview from here):

• associate a scalar energy to each configuration of the variables of interest
• learning corresponds to modifying that energy function so that its shape has desirable properties:
• plausible configurations $\Rightarrow$ low energy
• bad configurations $\Rightarrow$ high energy
• probability distribution: $Pr(x) = e^{-E(x)} / Z$, where $Z=\sum_x e^{-E(x)}$ is the partition function
• model learnt by (stochastic) gradient descent on the empirical negative log-likelihood of the training data

$$l(\theta, \mathcal{D}) = - \frac{1}{N} \sum_{x^{(i)} \in \mathcal{D}} \mathrm{log} Pr(x^{(i)})$$

• with latent variables $h$, the model becomes

$$Pr(x) = \sum_h Pr(x, h) = \sum_h \frac{e^{-E(x,h)}}{Z} = \frac{e^{-\mathcal{F}(x)}}{Z}$$

where $\mathcal{F}(x) = - \mathrm{log} \sum_h e^{-E(x, h)}$ is called the free energy and $Z = \sum_x e^{-E(x,h)}$

• in this case, the gradient of the data's negative log-likelihood decomposes over a positive and negative phase:

$$\frac{\partial \log p(x)}{\partial \theta} = \frac{\partial \mathcal{F}(x)}{\partial \theta} - \sum_{\tilde{x}} p(\tilde{x}) \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}$$

• approximate estimation of the negative phase is obtained by averaging over negative particles $\mathcal{N}$ sampled using MCMC

$$\frac{\partial \log p(x)}{\partial \theta} \approx \frac{\partial \mathcal{F}(x)}{\partial \theta} - \frac{1}{|\mathcal{N}|}\sum_{\tilde{x} \in \mathcal{N}} \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}$$

### EBLearn usage

• Starting point: paper, tutorials, demos, and documentation

• Installation: using CMake, cross-platform (even Android and iOS!), only few dependencies

• Beginner tutorial (part 1, part 2):

• build a LeNet5 digit regognizer on MNIST

• no C++ coding required, just write EBLearn configuration files and use EBLearn's command-line tools (e.g. train)
• Set of more advanced tutorials (cf. here) decribe how to use EBLearn's C++ API, resting on its powerful tensor library libidx (tuto) and energy-based learning library libeblearn (tuto)

• Face demo on the "Labeled Faces in the Wild" dataset: just type make detect :-)