# -*- coding: utf-8 -*-
#
# # Reading group on Deep Learning
#
# LEAR - XRCE, 23 Nov. 2012
#
# Notes on this document:
#
# * this HTML page was generated from the [IPython](http://ipython.org/) notebook available [here](http://lear.inrialpes.fr/people/gaidon/lear_xrce_deep_learning_01.ipynb) (pure python version executable in any python interpreter: [here](http://lear.inrialpes.fr/people/gaidon/lear_xrce_deep_learning_01.py))
#
# * dependencies to run the code (only Open Source Software!):
# * [Python](http://python.org/): the best programming language, ever.
# * [IPython](http://ipython.org/): an enhanced python interpreter + great tools for scientific workflow
# * [Numpy](http://numpy.scipy.org/): the Matlab library for Python (multi-dimensional arrays, linear algebra, ...)
# * [Matplotlib](http://matplotlib.org/): the scientific plotting library for Python
# * [Networkx](http://networkx.lanl.gov/): package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
# * [Theano](http://deeplearning.net/software/theano/): efficient tensor library, CPU + GPU expression compiler, used for Deep Learning
#
#
# ## Outline
#
# * Refresher on Neural Networks
#     * Neural Networks
#     * Back-Propagation
#
# * Deep Learning in practice
#     * What is Deep Learning?
#     * Overview of Theano
#     * Overview of EBLearn
#
#
# ## Pointers
#
#
# ### Reading material
#
# * [The Elements of Statistical Learning (2nd ed.)](http://www-stat.stanford.edu/~tibs/ElemStatLearn/) (T. Hastie, R. Tibshirani, & J. Friedman) : Chapter 11 (Neural Networks)
# * [Learning Deep Architectures for AI](http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/239) (Y. Bengio, In Foundations & Trends in Machine Learning, 2009) : small book providing an introduction to deep learning and individual chapters on most of the main deep architectures
# * [A fast learning algorithm for deep belief nets](http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf) (G. Hinton *et al.*, Neural Computation, 2006): fast algorithm for learning deep belief nets + justification of greedy layer-wise training and stacking
# * [Building High-level Features Using Large Scale Unsupervised Learning](http://research.google.com/archive/unsupervised_icml2012.html) (Quoc V. Le *et al.*, ICML'2012): Large scale deep learning (sparse auto-encoders) using 16000 cores; led to face and cat neurons from unlabeled data; state of the art on ImageNet from raw pixels; made the front page of NY Times.
# * [On Deep Generative Models with Applications to Recognition](http://www.cs.toronto.edu/~hinton/absps/ranzato_cvpr2011.pdf) (M.A. Ranzato *et al.*, CVPR 2011)
# * Y. LeCun's recent papers (cf. his [publication list](http://yann.lecun.com/exdb/publis/index.html))
# * [CVPR'2012 workshop on Deep Learning Methods for Vision](http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/): overview of most common deep architectures, transfer learning, application to motion and video
# * [CVPR'2012 workshop on Multiview Feature Learning](http://learning.cs.toronto.edu/~rfm/multiview-feature-learning-cvpr/): recent workshop about deep unsupervised feature learning, includes application of deep learning techniques to video
# * [tutorial slides](http://deeplearningworkshopnips2010.files.wordpress.com/2010/09/nips10-workshop-tutorial-final.pdf) at the NIPS 2010 Workshop on Deep Learning and Unsupervised Feature Learning
# * cf. the [deeplearning.net reading list](http://deeplearning.net/reading-list/) for additional pointers
#
#
# ### Websites
#
# * [deeplearning.net](http://deeplearning.net) : Big hub with many links related to deep learning (software, publications, ...)
# * [Deep learning tutorials](http://deeplearning.net/tutorial/) : Set of tutorials on deep learning (RBM, ConvNets, ...) with theano
# * [Neural Nets F.A.Q.](ftp://ftp.sas.com/pub/neural/FAQ.html) : extremely comprehensive FAQ on neural nets (from stats to software)
# * [Homepage of G. Hinton](http://www.cs.toronto.edu/~hinton/)
# * [Homepage of Y. LeCun](http://yann.lecun.com/)
# * Mainstream media coverage of advances in deep learning:
# * [NY Times front page article](http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html) on the [ICML'12 paper](http://research.google.com/archive/unsupervised_icml2012.html) of Andrew Ng's group (Ng's interview: [here](http://www.npr.org/2012/06/26/155792609/a-massive-google-network-learns-to-identify))
# * [Another NY Times article](http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html?pagewanted=all&_r=1&&_r=0) about the recent progress in deep learning on a wide range of problems
# * [Blog post](http://blogs.technet.com/b/next/archive/2012/11/08/microsoft-research-shows-a-promising-new-breakthrough-in-speech-translation-technology.aspx#.ULM2fZu9DSJ) by [Rick Rashid](http://research.microsoft.com/en-us/people/rashid/), Microsoft's CRO, with an interesting video explaining the progress made in Automatic Speech Recognition and Translation: from matching sound waves, to HMMs, to Deep Learning; R. Rashid gives a cool live demo (at the end of the video) where his speech gets translated on the fly into (spoken!) Mandarin
#
#
# ### Code
#
# * [theano](http://deeplearning.net/software/theano/) : Python library for efficient computations on tensors (multi-dimensional arrays)
# * [EBLearn](http://eblearn.cs.nyu.edu:21991/doku.php) : C++ library developed @NYU (LeCun) for *Energy-Based Learning*
# * [cuda-convnet](http://code.google.com/p/cuda-convnet/) : fast implementation of ConvNet in C++ and Cuda (GPU), by winners of ImageNet LSVRC 2012 (Krizhevsky A. *et al.* NIPS'12 paper to come)
#
# # Refresher on Neural Networks
#
#
# ## Central idea
#
# * extract *derived features* = linear combination of the inputs
# * predict target using a *non-linear* function of these features
#
#
# ## Preliminary: Projection Pursuit Regression (PPR)
#
# * Supervised learning: input vector $X \in \mathbb{R}^p$, target $Y$
# * $V_m = w_m^T X$ : derived features, with $M$ unit vectors $w_m \in \mathbb{R}^p$
# * PPR prediction:
#
# > $$f(X) = \sum_{m=1}^{M} g_m(V_m)$$
#
# * Parameters:
# * $\{w_m\}_m$ : projection directions
# * $\{g_m\}_m$ : ridge functions (each varies only in the direction of $w_m$)
#
# Plot of two ridge function examples
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
import numpy as np
def g_left(X1, X2):
V = (X1 + X2) / np.sqrt(2)
return 1.0 / (1.0 + np.exp(-5 * (V - 0.5)))
def g_right(X1, X2):
V = X1
return (V + 0.1) * np.sin(10 / (V / 3.0 + 0.1))
def plot_surface_3d(X1, X2, func, fig=None, subplot=(1, 1, 1)):
if fig is None:
fig = plt.figure(figsize=(12, 5))
ax = fig.add_subplot(*subplot, projection='3d')
X1, X2 = np.meshgrid(X1, X2)
Z = func(X1, X2)
surf = ax.plot_surface(X1, X2, Z,
rstride=1, cstride=1, cmap=cm.coolwarm,
linewidth=0, antialiased=False)
fig = plt.figure(figsize=(12, 5))
X1, X2 = np.linspace(-1.5, 1.5, 150), np.linspace(-0.5, 1, 150)
plot_surface_3d(X1, X2, g_left, fig, subplot=(1, 2, 1))
X1, X2 = np.linspace(-0.05, 0.05, 300), np.linspace(0, 1, 100)
plot_surface_3d(X1, X2, g_right, fig, subplot=(1, 2, 2))
#
# * PPR model is a *universal approximator*: can approximate any continuous function in $\mathbb{R}^p$ (if M is taken arbitrarily large, for appropriate choices of $g_m$)
# * but interpretation of the fitted model is difficult (mixing of the input features)
# * **Learning**: PPR model is fitted by minimizing the squared error in an alternating manner:
# * Given $w_m$, estimate $g_m$: 1D smoothing problem (e.g. use splines)
# * Given $g_m$, estimate $w_m$: non-linear least-squares (e.g. use Gauss-Newton)
# * Iterate these two steps until convergence
# * Greedily add a new pair $(w_{m+1}, g_{m+1})$
# * PPR not used much (originally for computational reasons)
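#
# A minimal numpy sketch of the alternating PPR fit described above, for a
# single term ($M=1$); the polynomial smoother stands in for the spline one
# would use in practice, and the synthetic data are illustrative assumptions
import numpy as np
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
w_true = np.array([1.0, 1.0]) / np.sqrt(2)
y = np.sin(3 * X.dot(w_true)) + 0.1 * rng.randn(200)
w = np.array([1.0, 0.0])  # initial projection direction
for it in range(20):
    v = X.dot(w)
    # given w, estimate g: 1D smoothing (here a crude degree-5 polynomial fit)
    g = np.poly1d(np.polyfit(v, y, 5))
    # given g, estimate w: one Gauss-Newton step on the squared error
    r = y - g(v)                      # residuals
    J = -g.deriv()(v)[:, None] * X    # Jacobian of the residuals w.r.t. w
    w = w - np.linalg.lstsq(J, r, rcond=None)[0]
    w = w / np.linalg.norm(w)         # keep w a unit vector
print(np.round(w, 2))                 # estimated direction (close to w_true up to sign)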
#
# ## Neural Networks (NN)
#
#
# ### The model
#
# * NN = non-linear statistical model inspired by biological neural networks
# * Set of neurons (much like $g_m(w_m^T X)$ in PPR) interconnected in layers in a feed-forward fashion
#
# How to plot a network diagram of a NN
import matplotlib.pyplot as plt
import networkx as nx
class NeuralNetwork(object):
""" A simple neural network class for visualization purposes
"""
def __init__(self, n_nodes_per_layer):
""" Build the network layer by layer
"""
self.n_nodes_per_layer = n_nodes_per_layer
self.graph = nx.DiGraph()
self.nodes_pos = {}
self.nodes_label = {}
self.input_units = []
self.hidden_units = []
self.output_units = []
# add the units per layer
for layer_i, layer_size in enumerate(n_nodes_per_layer):
# add the nodes for this layer
for node_i in range(layer_size):
node = "%d_%d" % (layer_i, node_i) # simple encoding of a node
self.graph.add_node(node)
self.nodes_pos[node] = (layer_i, layer_size / 2.0 - node_i)
if layer_i == 0:
# label for input layer
self.input_units.append(node)
self.nodes_label[node] = r"$X_%d$" % node_i
elif layer_i == len(n_nodes_per_layer) - 1:
# label for output layer
self.output_units.append(node)
self.nodes_label[node] = r"$Y_%d$" % node_i
else:
# hidden layer
self.hidden_units.append(node)
self.nodes_label[node] = r"$Z_{%d, %d}$" % (layer_i, node_i)
# add the edges: full connection between layer_i -1 and layer_i
if layer_i > 0:
prev_layer_i = layer_i - 1
prev_layer_size = n_nodes_per_layer[prev_layer_i]
self.graph.add_edges_from([
("%d_%d" % (prev_layer_i, l), "%d_%d" % (layer_i, k))
for k in range(layer_size) for l in range(prev_layer_size)
])
def draw(self, ax=None):
""" Draw the neural network
"""
if ax is None:
ax = plt.figure(figsize=(10, 6)).add_subplot(1, 1, 1)
nx.draw_networkx_edges(self.graph, pos=self.nodes_pos, alpha=0.7, ax=ax)
nx.draw_networkx_nodes(self.graph, nodelist=self.input_units,
pos=self.nodes_pos, ax=ax,
node_color='#66FFFF', node_size=700)
nx.draw_networkx_nodes(self.graph, nodelist=self.hidden_units,
pos=self.nodes_pos, ax=ax,
node_color='#CCCCCC', node_size=900)
nx.draw_networkx_nodes(self.graph, nodelist=self.output_units,
pos=self.nodes_pos, ax=ax,
node_color='#FFFF99', node_size=700)
nx.draw_networkx_labels(self.graph, labels=self.nodes_label,
pos=self.nodes_pos, font_size=14, ax=ax)
ax.axis('off')
# classical 3-layer neural network example
n_layers = 3 # input layer | hidden layer | output layer
n_input_dims = 5
n_hidden_units = 3
n_output_dims = 2
n_nodes_per_layer = [n_input_dims, n_hidden_units, n_output_dims]
nn1 = NeuralNetwork(n_nodes_per_layer)
# another example
nn2 = NeuralNetwork([10, 5, 10])
# show the graphical representations
fig = plt.figure(figsize=(12, 8))
nn1.draw(fig.add_subplot(1, 2, 1))
nn2.draw(fig.add_subplot(1, 2, 2))
#
# Formal definition of NN (assuming full connectivity between layers):
#
# * **Input layer**: one unit for each dimension of the input $X \in \mathbb{R}^p$
# * **Hidden layers**: *derived features* from the input of the previous layer
# * Assuming one hidden layer with $M$ units
# * Derived features: $Z_m = \sigma (\alpha_m^T X + \alpha_{0, m}), \ m = 1, \cdots, M$
# * Activation function $\sigma$: typically a sigmoid $\sigma(v) = (1 + e^{-v})^{-1}$
# * **Output layer**: one unit for each dimension of the output $Y$
# * Output $Y_k$ again computed using a linear combination of the outputs $Z$ of the previous layer: $Y_k = f_k(X) = g_k(\beta_k^T Z + \beta_{0, k})$
# * Output function $g_k$ chosen based on task:
# * regression (e.g. denoising with auto-encoders, $p$ units to obtain $Y \in \mathbb{R}^p$ = denoised $X$): identity $g_k(V) = V$
# * classification (e.g. for object recognition, $K$ units to output probability for each possible object category): softmax $g_k(V) = \frac{e^{V_k}}{\sum_{l=1}^K e^{V_l}}$
# * Parameters (*weights*, noted $\theta$) to estimate from the data:
# * the $\alpha_m$'s and bias terms $\alpha_{0, m}$ of each unit at each hidden layer
# * the $\beta_k$'s and bias terms $\beta_{0, k}$ of each unit of the output layer
# * Remarks:
# * linear $\sigma \Rightarrow$ linear model
# * one hidden layer $\Rightarrow$ similar to PPR (but PPR uses few non-parametric $g_m$ functions, whereas NN uses many units with a simple $\sigma$ function)
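#
# A minimal numpy sketch of the forward pass just defined (one hidden layer,
# softmax output); the sizes and random weights are illustrative assumptions
import numpy as np
def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))
def forward(x, alpha, alpha0, beta, beta0):
    """x: input in R^p; alpha: (M, p); beta: (K, M); returns K class probabilities"""
    Z = sigmoid(alpha.dot(x) + alpha0)   # derived features Z_m
    t = beta.dot(Z) + beta0              # linear combinations at the output layer
    e = np.exp(t - t.max())              # softmax, shifted for numerical stability
    return e / e.sum()
rng = np.random.RandomState(0)
p, M, K = 5, 3, 2                        # input dim., hidden units, classes
probs = forward(rng.randn(p), rng.randn(M, p), rng.randn(M),
                rng.randn(K, M), rng.randn(K))
assert abs(probs.sum() - 1.0) < 1e-12    # a proper distribution over classes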
#
# ## The Back-Propagation algorithm
#
# (Again assuming one hidden layer for brevity)
#
# * Algorithm used to learn the weights $\theta$ in a NN
# * Minimization of a loss function $R(\theta)$
# * Regression: SSE
#
# $$R(\theta) = \sum_{k=1}^K \sum_{i=1}^N (y_{i,k} - f_k(x_i))^2$$
#
# * Classification: (SSE or) Cross-Entropy
#
# $$R(\theta) = - \sum_{i=1}^N \sum_{k=1}^K y_{i,k} \mathrm{log} f_k(x_i)$$
#
# * **Regularization**:
# * by adding a penalty term (e.g. $\Vert \theta \Vert_2^2$, called weight decay)
# * or early stopping
#
# * **Back-Propagation**:
# * minimization of $R(\theta) = \sum_{i=1}^N R_i$ by *gradient descent* (updates shown for a single observation $i$; sum over $i$ for the batch version), with learning rate $\gamma_r$:
#
# $$\beta_{k, m}^{(r+1)} = \beta_{k, m}^{(r)} - \gamma_r \frac{\partial R_i}{\partial \beta_{k, m}^{(r)}}$$
#
# $$\alpha_{m, l}^{(r+1)} = \alpha_{m, l}^{(r)} - \gamma_r \frac{\partial R_i}{\partial \alpha_{m, l}^{(r)}}$$
#
#
# * Computing the gradient: using the chain rule, done in a forward and backward sweep over the network
# * forward pass: fix the weights, get the predicted values $\hat{f}_k(x_i)$
# * backward pass: compute the errors and back-propagate them (weights updated from output $\rightarrow$ input layers)
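#
# A minimal numpy sketch of one back-propagation step (single observation, one
# hidden layer, identity output, squared error); the sizes and random values
# are illustrative assumptions
import numpy as np
rng = np.random.RandomState(0)
p, M, K = 5, 3, 2
x, y = rng.randn(p), rng.randn(K)
alpha, alpha0 = rng.randn(M, p), np.zeros(M)
beta, beta0 = rng.randn(K, M), np.zeros(K)
gamma = 0.1                                          # learning rate
# forward pass: fix the weights, compute the predictions
Z = 1.0 / (1.0 + np.exp(-(alpha.dot(x) + alpha0)))   # sigmoid hidden units
f = beta.dot(Z) + beta0                              # identity output (regression)
# backward pass: chain rule, from the output layer back to the input layer
delta = -2.0 * (y - f)                     # dR_i/df_k at the output units
s = beta.T.dot(delta) * Z * (1 - Z)        # back-propagated errors at the hidden units
beta -= gamma * np.outer(delta, Z)         # dR_i/dbeta_{k,m} = delta_k * Z_m
beta0 -= gamma * delta
alpha -= gamma * np.outer(s, x)            # dR_i/dalpha_{m,l} = s_m * x_l
alpha0 -= gamma * s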
#
# ### Remarks
#
# * back-propagation can be done in batch or *online*, i.e. one observation at a time $\Rightarrow$ (SGD: stochastic gradient descent)
# * learning rate $\gamma_r$:
# * usually constant in batch
# * must verify $\gamma_r \rightarrow 0$, $\sum_r \gamma_r = \infty$, and $\sum_r \gamma_r^2 \lt \infty$ (e.g. $\gamma_r = 1/r$) in SGD
# * back-prop is very slow in practice: use conjugate gradient
# * scaling of inputs determines scaling of weights $\Rightarrow$ standardize (0 mean, unit variance)
# * initialization:
# * important in practice: loss is not convex (many local minima!)
# * multiple re-starts with random near-zero weights (nearly linear regime for sigmoid)
# * Combine the outputs of $L$ different learned NNs by averaging their predictions
#
# $$\hat{f}(X_{\mathrm{new}}) = \sum_{l=1}^L w_l \mathbb{E}(Y_{\mathrm{new}} | X_{\mathrm{new}}, \hat{\theta}_l)$$
#
# * *bagging*: $w_l = 1/L$ and $\hat{\theta}_l$ parameters obtained by fitting on bootstrap samples of the training data (sketched in code at the end of this section)
# * *boosting*: $w_l = 1$ but $\hat{\theta}_l$ chosen in non-random sequential fashion to improve the fit
# * *Bayesian inference*: $w_l = 1/L$ and sample $\theta_l$ from the posterior using MCMC
# * number of hidden units and layers:
# * units: the more the better + regularization
# * layers: task-dependent (multiple layers $\Rightarrow$ hierarchical features)
#
# > $\Rightarrow$ that is where hand-tuning comes in! no free lunch ;-)
#
# * difficult to interpret / visualize what is learned (is it still true?)
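#
# A minimal sketch of bagging-style averaging: fit the same model on bootstrap
# samples and average the predictions ($w_l = 1/L$); a linear least-squares fit
# stands in for a trained NN, and the data are illustrative assumptions
import numpy as np
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X.dot(rng.randn(5)) + 0.5 * rng.randn(100)
X_new = rng.randn(10, 5)
L = 20
preds = []
for l in range(L):
    idx = rng.randint(0, len(X), size=len(X))             # bootstrap sample
    theta_l = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    preds.append(X_new.dot(theta_l))
f_hat = np.mean(preds, axis=0)                             # averaged prediction, w_l = 1/L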
#
# # Deep Learning in practice
#
# ## What is Deep Learning?
#
# * According to [deeplearning.net](http://deeplearning.net):
#
# > "Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence."
#
# > "... moving beyond shallow learning since 2006!"
#
# * Learning of probabilistic generative models with deep architectures, i.e. with many layers of non-linearities
#
#
# ### Examples
#
# * Deep Multi-Layer Perceptrons (i.e. NN like before, but with many hidden layers)
# * Convolutional Networks
# * Stacked Auto-encoders
# * Restricted Boltzmann Machines (RBM)
# * Deep Belief Networks (DBN)
#
#
# ### Why go deep?
#
# * Can trade off the number of units for the number of layers (more efficient: fewer connections and therefore fewer weights)
# * Hierarchical: useful for multi-scale analysis
# * "Combinatorial sharing of statistical strength between multiple levels of latent variables" (NIPS'10 workshop)... ?
# * it works (according to Hinton and LeCun... and, increasingly, to more and more other researchers)
#
#
# ### How to learn deeply?
#
# * Back-prop has some severe limitations:
# * Gradient gets progressively diluted across layers (very limited corrections in first layers)
# * Local minima (very sensitive to initialization)
# * Needs labeled data (in usual settings)
#
# * What works, e.g. on Deep Belief Networks (DBN):
# * greedy layer-wise unsupervised learning (serves as initialization step):
# * learn a RBM by maximizing lower bound on likelihood
# * stack a new RBM on top, update the prior of the second layer with the posterior of the first, learn
# * iterate
# * justified in theory (cf. [Hinton'2006](http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf))
# * top-down supervised training as final refinement step
#
# * Many tools to learn deep architectures:
# * At the core: libraries to work on *tensors* (multi-dimensional arrays)
# * We will review two recent, efficient, easy-to-use, and actively maintained libraries:
# * Theano (python)
# * EBLearn (C++)
#
# ## Overview of Theano
#
# * [Theano](http://deeplearning.net/software/theano/):
# * python Open Source library (BSD-like permissive license, cross-platform)
# * Active project maintained by many people, including members of Yoshua Bengio's [LISA](http://www.iro.umontreal.ca/rubrique.php3?id_rubrique=27) group at U. of Montreal
#
# * Main purpose of Theano (from their [website](http://deeplearning.net/software/theano/)):
#
# > Theano is a Python library that allows you to define, optimize, and
# > evaluate mathematical expressions involving multi-dimensional
# > arrays efficiently. Theano features:
# >
# > * **tight integration with numpy** -- Use `numpy.ndarray` in Theano-compiled functions.
# > * **transparent use of a GPU** -- Perform data-intensive calculations up to 140x faster than with CPU (float32 only).
# > * **efficient symbolic differentiation** -- Theano does your derivatives for functions with one or many inputs.
# > * **speed and stability optimizations** -- Get the right answer for ``log(1+x)`` even when ``x`` is really tiny.
# > * **dynamic C code generation** -- Evaluate expressions faster.
# > * **extensive unit-testing and self-verification** -- Detect and diagnose many types of mistake.
#
# * A sort of mix between [Numpy](http://numpy.scipy.org/) (Python's Matlab) and [Sympy](http://sympy.org/en/index.html) (Python's Mathematica), focusing on tensor operations
#
# > * *execution speed optimizations*: Theano can use `g++` or `nvcc` to compile
# > parts of your expression graph into CPU or GPU instructions, which run
# > much faster than pure Python.
# >
# > * *symbolic differentiation*: Theano can automatically build symbolic graphs
# > for computing gradients.
# >
# > * *stability optimizations*: Theano can recognize [some] numerically unstable
# > expressions and compute them with more stable algorithms.
#
# * Ambitious goals (and they're making progress!)
#
# > * Support tensor and sparse operations
# > * Support linear algebra operations
# > * Graph Transformations
# > * Differentiation/higher order differentiation
# > * 'R' and 'L' differential operators
# > * Speed/memory optimizations
# > * Numerical stability optimizations
# > * Can use many compiled languages, instruction sets: C/C++, CUDA, OpenCL, PTX, CAL, AVX, ...
# > * Lazy evaluation
# > * Loop
# > * Parallel execution (SIMD, multi-core, multi-node on cluster,
# > multi-node distributed)
# > * Support all NumPy/basic SciPy functionality
# > * Easy wrapping of library functions in Theano
#
# * Easy to install (just do: ``pip install Theano``)
# * Extensively documented ([docs](http://deeplearning.net/software/theano/library/index.html#libdoc))
# * Great tutorial ([tuto](http://deeplearning.net/software/theano/tutorial/index.html#tutorial))
# * Great [tutorials on deep learning](http://deeplearning.net/tutorial/) written by members of the LISA lab
#
# Sneak peek
import theano
from theano import tensor
# declare two symbolic floating-point scalars
a = tensor.dscalar()
b = tensor.dscalar()
# create a simple expression
c = a + b
# convert the expression into a callable object that takes (a,b)
# values as input and computes a value for c
f = theano.function([a,b], c)
# bind 1.5 to 'a', 2.5 to 'b', and evaluate 'c'
assert 4.0 == f(1.5, 2.5)
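#
# Symbolic differentiation in action: Theano builds the gradient graph for us
# (a tiny example of ours, not from the tutorial)
x = tensor.dscalar('x')
gy = tensor.grad(x ** 2, x)        # symbolic derivative: 2 * x
dydx = theano.function([x], gy)
assert dydx(4.0) == 8.0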
#
# Logistic Regression simple example
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats), name="w")
b = theano.shared(0., name="b")
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b)) # Probability that target = 1
prediction = p_1 > 0.5 # The prediction thresholded
xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) # Cross-entropy loss function
cost = xent.mean() + 0.01 * (w ** 2).sum() # The cost to minimize
gw, gb = T.grad(cost, [w, b]) # Compute the gradient of the cost
# Compile
train = theano.function(
inputs=[x,y],
outputs=[prediction, xent],
updates={w: w - 0.1 * gw, b: b - 0.1 * gb})
predict = theano.function(inputs=[x], outputs=prediction)
# Train
for i in range(training_steps):
pred, err = train(D[0], D[1])
# To display the model, uncomment below:
#print "Final model:"
#print w.get_value(), b.get_value()
#print "target values for D:", D[1]
#print "prediction on D:", predict(D[0])
#
# ## Deep Learning in Theano
#
# Many implementations already available and documented in the [deep learning tutorials](http://deeplearning.net/tutorial/) (cf. [the code on github](https://github.com/lisa-lab/DeepLearningTutorials))
#
#
# ### Supervised learning
#
# * [Introduction](http://deeplearning.net/tutorial/gettingstarted.html): notations, datasets, extensive primer on supervised deep learning
# * [Logistic Regression (LR) + application on MNIST](http://deeplearning.net/tutorial/logreg.html) (both implemented using [conjugate gradient](https://github.com/lisa-lab/DeepLearningTutorials/blob/master/code/logistic_cg.py) and [SGD](https://github.com/lisa-lab/DeepLearningTutorials/blob/master/code/logistic_sgd.py)): often used as the last layer of classification architectures
#
# > $$
# > P(Y=i|x, W, b) = \mathrm{softmax}_i(W x + b) = \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}}
# > $$
#
# > $$y_{pred} = {\rm argmax}_i P(Y=i|x,W,b)$$
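#
# The two equations above written as Theano expressions (a minimal sketch;
# the 784-dimensional input and 10 classes are illustrative assumptions)
import numpy
import theano
import theano.tensor as T
x = T.matrix('x')                                      # one example per row
W = theano.shared(numpy.zeros((784, 10)), name='W')
b = theano.shared(numpy.zeros(10), name='b')
p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)          # P(Y=i|x, W, b)
y_pred = T.argmax(p_y_given_x, axis=1)                 # argmax_i P(Y=i|x, W, b)
classify = theano.function([x], y_pred)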
#
# * [Multilayer Perceptron (MLP)](http://deeplearning.net/tutorial/mlp.html): going from LR to MLP, train with SGD + tricks of the trade (activation functions, regularization, initialization, learning rate, ...)
#
#
#
# * [Deep Convolutional Neural Networks (CNN)](http://deeplearning.net/tutorial/lenet.html): variants of MLP with local connectivity (to mimic biological receptive fields), weight sharing (detects features regardless of position, reduces the number of parameters for capacity control, and remains trainable with SGD), and many layers alternating convolution, pooling, and subsampling operations; example CNN used: LeNet (cf. ["Gradient-based learning applied to document recognition", LeCun *et al.*, 1998](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf)); a sketch of the two building blocks follows
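#
# A minimal numpy sketch of the two CNN building blocks named above: a "valid"
# 2D convolution (local connectivity + weight sharing) followed by 2x2
# max-pooling; the 28x28 input and 5x5 kernel are illustrative assumptions
import numpy as np
def conv2d_valid(img, kernel):
    kh, kw = kernel.shape
    H, W = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # the same weights are applied at every location (weight sharing)
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out
def max_pool_2x2(fmap):
    H, W = fmap.shape[0] // 2, fmap.shape[1] // 2
    return fmap[:2 * H, :2 * W].reshape(H, 2, W, 2).max(axis=(1, 3))
rng = np.random.RandomState(0)
feature_map = conv2d_valid(rng.rand(28, 28), rng.randn(5, 5))
pooled = max_pool_2x2(feature_map)         # (24, 24) -> (12, 12)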
#
#
#
#
# ### Unsupervised learning
#
# * [Auto-Encoders](http://deeplearning.net/tutorial/dA.html#daa): building a latent representation (*coding*), mapped back (*decoding*) into a reconstruction of the input by minimizing the reconstruction error; application to image denoising
#
# * [Stacked Denoising Auto-Encoders](http://deeplearning.net/tutorial/SdA.html#sda): unsupervised pre-training one layer at a time for deep nets, final stage: task-specific fine tuning using a supervised MLP
#
# * [Restricted Boltzmann Machines (RBM)](http://deeplearning.net/tutorial/rbm.html#rbm):
# * single layer generative Energy-Based Model (EBM, cf. EBLearn overview below)
# * energy: $E(v, h) = -b' v - c'h - h'Wv$, where $v$ denotes the observed (visible) units, $h$ the hidden units, $W$ the connection weights, and $b, c$ the offset parameters
#
#
#
# * a particular form of log-linear Markov Random Field (MRF), i.e., for which the energy function is linear in its free parameters
# * learned by SGD, with MCMC sampling to approximate the gradient (cf. the free-energy sketch after this list)
#
# * [Deep Belief Networks (DBN)](http://deeplearning.net/tutorial/DBN.html#dbn): stacked RBMs, trained in a greedy layer-wise manner
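#
# A minimal numpy sketch of the free energy of the binary RBM above, using the
# closed form $\mathcal{F}(v) = -b'v - \sum_j \log(1 + e^{c_j + (Wv)_j})$ that
# follows from summing $e^{-E(v,h)}$ over all $h$; sizes and values are
# illustrative assumptions
import numpy as np
rng = np.random.RandomState(0)
n_visible, n_hidden = 6, 4
W = 0.1 * rng.randn(n_hidden, n_visible)   # connection weights
b = np.zeros(n_visible)                    # visible offsets
c = np.zeros(n_hidden)                     # hidden offsets
def free_energy(v):
    # np.logaddexp(0, z) computes log(1 + exp(z)) in a numerically stable way
    return -b.dot(v) - np.logaddexp(0.0, c + W.dot(v)).sum()
v = rng.randint(0, 2, size=n_visible)      # a binary visible configuration
print(free_energy(v))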
#
#
#
# ## Overview of EBLearn
#
# * [EBLearn](http://eblearn.cs.nyu.edu:21991/doku.php):
# * well-architected C++ library for Energy-Based Learning, maintained by Yann LeCun's group @NYU (in particular [Pierre Sermanet](http://cs.nyu.edu/~sermanet/))
#
#
#
# * used in robotics (cf. the [Learning Applied to Ground Robots (LAGR)](http://www.cs.nyu.edu/~yann/research/lagr/) DARPA project)
# * particular focus on supervised training of convolutional neural networks (especially LeNet)
#
# * Energy-Based learning (overview from [here](http://deeplearning.net/tutorial/rbm.html)):
# * associate a scalar energy to each configuration of the variables of interest
# * learning corresponds to modifying that energy function so that its shape has desirable properties:
# * plausible configurations $\Rightarrow$ low energy
# * bad configurations $\Rightarrow$ high energy
# * probability distribution: $Pr(x) = e^{-E(x)} / Z$, where $Z=\sum_x e^{-E(x)}$ is the *partition function*
# * model learnt by (stochastic) gradient descent on the empirical negative log-likelihood of the training data
#
# $$
# l(\theta, \mathcal{D}) = - \frac{1}{N} \sum_{x^{(i)} \in \mathcal{D}} \mathrm{log} Pr(x^{(i)})
# $$
#
# * with latent variables $h$, the model becomes
#
# $$
# Pr(x) = \sum_h Pr(x, h) = \sum_h \frac{e^{-E(x,h)}}{Z} = \frac{e^{-\mathcal{F}(x)}}{Z}
# $$
#
# where $\mathcal{F}(x) = - \mathrm{log} \sum_h e^{-E(x, h)}$ is called the **free energy** and $Z = \sum_x e^{-\mathcal{F}(x)}$
#
# * in this case, the gradient of the data's negative log-likelihood decomposes over a *positive* and *negative* phase:
#
# $$
# \frac{\partial \log p(x)}{\partial \theta}
# = \frac{\partial \mathcal{F}(x)}{\partial \theta} -
# \sum_{\tilde{x}} p(\tilde{x})
# \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}
# $$
#
# * approximate estimation of the *negative* phase is obtained by averaging over *negative particles* $\mathcal{N}$ sampled using MCMC
#
# $$
# \frac{\partial \log p(x)}{\partial \theta}
# \approx \frac{\partial \mathcal{F}(x)}{\partial \theta} -
# \frac{1}{|\mathcal{N}|}\sum_{\tilde{x} \in \mathcal{N}}
# \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}
# $$
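#
# A toy illustration (ours) of the energy-based view: for a model over 4 binary
# variables we can enumerate all 16 configurations, compute the partition
# function $Z$ exactly, and check that $Pr(x) = e^{-E(x)}/Z$ sums to one; real
# models need the MCMC approximation above precisely because this enumeration
# is intractable
import itertools
import numpy as np
rng = np.random.RandomState(0)
n = 4
J = rng.randn(n, n); J = (J + J.T) / 2.0             # symmetric pairwise couplings
def energy(x):
    return -x.dot(J).dot(x)
states = [np.array(s) for s in itertools.product([0, 1], repeat=n)]
Z = sum(np.exp(-energy(s)) for s in states)          # partition function
probs = np.array([np.exp(-energy(s)) / Z for s in states])
assert abs(probs.sum() - 1.0) < 1e-12                # a proper distribution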
#
#
# ### EBLearn usage
#
# * Starting point: [paper](http://cs.nyu.edu/~koray/publis/sermanet-ictai-09.pdf), [tutorials](http://eblearn.cs.nyu.edu:21991/doku.php?id=start), [demos](http://eblearn.cs.nyu.edu:21991/doku.php?id=all_demos), and [documentation](http://eblearn.cs.nyu.edu:21991/doku.php?id=all_docs)
#
# * Installation: using CMake, cross-platform (even Android and iOS!), only a few dependencies
#
# * Beginner tutorial ([part 1](http://eblearn.cs.nyu.edu:21991/doku.php?id=beginner_tutorial1_dscompile), [part 2](http://eblearn.cs.nyu.edu:21991/doku.php?id=beginner_tutorial2_train)):
# * build a LeNet5 digit recognizer on MNIST
#
#
#
# * no C++ coding required, just write EBLearn configuration files and use EBLearn's command-line tools (e.g. `train`)
#
# * A set of more advanced tutorials (cf. [here](http://eblearn.cs.nyu.edu:21991/doku.php?id=start)) describes how to use EBLearn's C++ API, resting on its powerful tensor library `libidx` ([tuto](http://eblearn.cs.nyu.edu:21991/doku.php?id=libidx)) and energy-based learning library `libeblearn` ([tuto](http://eblearn.sourceforge.net/old/tutorials/libeblearn/index.shtml))
#
# * [Face demo](http://eblearn.cs.nyu.edu:21991/doku.php?id=face_detector) on the "Labeled Faces in the Wild" dataset: just type `make detect` :-)
#
#
#