# -*- coding: utf-8 -*-

# # Reading group on Deep Learning
#
# LEAR - XRCE, 23 Nov. 2012
#
# Notes on this document:
#
# * this HTML page was generated from the [IPython](http://ipython.org/) notebook available [here](http://lear.inrialpes.fr/people/gaidon/lear_xrce_deep_learning_01.ipynb) (pure python version executable in any python interpreter: [here](http://lear.inrialpes.fr/people/gaidon/lear_xrce_deep_learning_01.py))
#
# * dependencies to run the code (only Open Source Software!):
#     * [Python](http://python.org/): the best programming language, ever.
#     * [IPython](http://ipython.org/): an enhanced python interpreter + great tools for scientific workflow
#     * [Numpy](http://numpy.scipy.org/): the Matlab library for Python (multi-dimensional arrays, linear algebra, ...)
#     * [Matplotlib](http://matplotlib.org/): the scientific plotting library for Python
#     * [Networkx](http://networkx.lanl.gov/): package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
#     * [Theano](http://deeplearning.net/software/theano/): efficient tensor library, CPU + GPU expression compiler, used for Deep Learning

# ## Outline
#
# * Refresher on Neural Networks
#     * Neural Networks
#     * Back-Propagation
# * Deep Learning in practice
#     * What is Deep Learning?
#     * Overview of Theano
#     * Overview of EBLearn

# ## Pointers
#
# ### Reading material
#
# * [The Elements of Statistical Learning (5th ed.)](http://www-stat.stanford.edu/~tibs/ElemStatLearn/) (T. Hastie, R. Tibshirani, & J. Friedman): Chapter 11 (Neural Networks)
# * [Learning Deep Architectures for AI](http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/239) (Y. Bengio, in Foundations & Trends in Machine Learning, 2009): small book providing an introduction to deep learning and individual chapters on most of the main deep architectures
# * [A fast learning algorithm for deep belief nets](http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf) (G. Hinton *et al.*, Neural Computation, 2006): fast algorithm for learning deep belief nets + justification of greedy layer-wise training and stacking
# * [Building High-level Features Using Large Scale Unsupervised Learning](http://research.google.com/archive/unsupervised_icml2012.html) (Quoc V. Le *et al.*, ICML'2012): large scale deep learning (sparse auto-encoders) using 16000 cores; led to face and cat neurons from unlabeled data; state of the art on ImageNet from raw pixels; made the front page of the NY Times.
# * [On Deep Generative Models with Applications to Recognition](http://www.cs.toronto.edu/~hinton/absps/ranzato_cvpr2011.pdf) (M.A. Ranzato *et al.*, CVPR 2011)
# * Y. LeCun's recent papers (cf. his [publication list](http://yann.lecun.com/exdb/publis/index.html))
# * [CVPR'2012 workshop on Deep Learning Methods for Vision](http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/): overview of the most common deep architectures, transfer learning, application to motion and video
# * [CVPR'2012 workshop on Multiview Feature Learning](http://learning.cs.toronto.edu/~rfm/multiview-feature-learning-cvpr/): recent workshop about deep unsupervised feature learning, includes applications of deep learning techniques to video
# * [tutorial slides](http://deeplearningworkshopnips2010.files.wordpress.com/2010/09/nips10-workshop-tutorial-final.pdf) at the NIPS 2010 Workshop on Deep Learning and Unsupervised Feature Learning
# * cf. the [deeplearning.net reading list](http://deeplearning.net/reading-list/) for additional pointers
# ### Websites
#
# * [deeplearning.net](http://deeplearning.net): big hub with many links related to deep learning (software, publications, ...)
# * [Deep learning tutorials](http://deeplearning.net/tutorial/): set of tutorials on deep learning (RBM, ConvNets, ...) with theano
# * [Neural Nets F.A.Q.](ftp://ftp.sas.com/pub/neural/FAQ.html): extremely comprehensive FAQ on neural nets (from stats to software)
# * [Homepage of G. Hinton](http://www.cs.toronto.edu/~hinton/)
# * [Homepage of Y. LeCun](http://yann.lecun.com/)
# * Mainstream media coverage of advances in deep learning:
#     * [NY Times front page article](http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html) on the [ICML'12 paper](http://research.google.com/archive/unsupervised_icml2012.html) of Andrew Ng's group (Ng's interview: [here](http://www.npr.org/2012/06/26/155792609/a-massive-google-network-learns-to-identify))
#     * [Another NY Times article](http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html?pagewanted=all&_r=1&&_r=0) about the recent progress in deep learning on a wide range of problems
#     * [Blog post](http://blogs.technet.com/b/next/archive/2012/11/08/microsoft-research-shows-a-promising-new-breakthrough-in-speech-translation-technology.aspx#.ULM2fZu9DSJ) of [Rick Rashid](http://research.microsoft.com/en-us/people/rashid/), Microsoft's CRO, with an interesting video explaining the progress made in Automatic Speech Recognition and Translation: from matching sound waves, to HMMs, to Deep Learning; R. Rashid makes a cool live demo (at the end of the video) where his speech gets translated on the fly into (spoken!) Mandarin
#
# ### Code
#
# * [theano](http://deeplearning.net/software/theano/): Python library for efficient computations on tensors (multi-dimensional arrays)
# * [EBLearn](http://eblearn.cs.nyu.edu:21991/doku.php): C++ library developed @NYU (LeCun) for *Energy-Based Learning*
# * [cuda-convnet](http://code.google.com/p/cuda-convnet/): fast implementation of ConvNets in C++ and Cuda (GPU), by the winners of ImageNet LSVRC 2012 (Krizhevsky A. *et al.*, NIPS'12 paper to come)
# # Refresher on Neural Networks
#
# ## Central idea
#
# * extract *derived features* = linear combinations of the inputs
# * predict the target using a *non-linear* function of these features
#
# ## Preliminary: Projection Pursuit Regression (PPR)
#
# * Supervised learning: input vector $X \in \mathbb{R}^p$, target $Y$
# * $V_m = w_m^T X$ : derived features, with $M$ unit vectors $w_m \in \mathbb{R}^p$
# * PPR prediction:
#
# > $$f(X) = \sum_{m=1}^{M} g_m(V_m)$$
#
# * Parameters:
#     * $\\{w_m\\}_m$ : projection directions
#     * $\\{g_m\\}_m$ : ridge functions (each varies only in the direction of $w_m$)

# Plot of two ridge function examples
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
import numpy as np


def g_left(X1, X2):
    V = (X1 + X2) / np.sqrt(2)
    return 1.0 / (1.0 + np.exp(-5 * (V - 0.5)))


def g_right(X1, X2):
    V = X1
    return (V + 0.1) * np.sin(10 / (V / 3.0 + 0.1))


def plot_surface_3d(X1, X2, func, fig=None, subplot=(1, 1, 1)):
    if fig is None:
        fig = plt.figure(figsize=(12, 5))
    ax = fig.add_subplot(*subplot, projection='3d')
    X1, X2 = np.meshgrid(X1, X2)
    Z = func(X1, X2)
    surf = ax.plot_surface(X1, X2, Z, rstride=1, cstride=1, cmap=cm.coolwarm,
                           linewidth=0, antialiased=False)

fig = plt.figure(figsize=(12, 5))
X1, X2 = np.linspace(-1.5, 1.5, 150), np.linspace(-0.5, 1, 150)
plot_surface_3d(X1, X2, g_left, fig, subplot=(1, 2, 1))
X1, X2 = np.linspace(-0.05, 0.05, 300), np.linspace(0, 1, 100)
plot_surface_3d(X1, X2, g_right, fig, subplot=(1, 2, 2))

# * The PPR model is a *universal approximator*: it can approximate any continuous function in $\mathbb{R}^p$ (if $M$ is taken arbitrarily large, for appropriate choices of $g_m$)
# * but interpretation of the fitted model is difficult (mixing of the input features)
# * **Learning**: the PPR model is fitted by minimizing the squared error in an alternating manner (a toy sketch follows below):
#     * Given $w_m$, estimate $g_m$: 1D smoothing problem (e.g. use splines)
#     * Given $g_m$, estimate $w_m$: non-linear least-squares (e.g. use Gauss-Newton)
#     * Iterate these two steps until convergence
#     * Greedily add a new pair $(w_{m+1}, g_{m+1})$
# * PPR is not used much (originally for computational reasons)
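# To make the alternating scheme concrete, here is a small NumPy sketch for a
# single ridge term ($M = 1$) on synthetic data. It is only an illustration of
# the two alternating steps above, not a faithful PPR implementation: a
# polynomial fit stands in for the 1D spline smoother, and the direction update
# is one Gauss-Newton step written as a weighted least-squares problem.

# Toy alternating fit of one PPR ridge term (illustration only)
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
w_true = np.array([0.8, 0.6])                    # unknown projection direction
y = np.sin(X.dot(w_true)) + 0.1 * rng.randn(200)

w = np.array([1.0, 0.0])                         # initial unit direction
for it in range(20):
    v = X.dot(w)
    # step 1: given w, estimate g (1D smoothing; crude polynomial stand-in)
    g = np.poly1d(np.polyfit(v, y, deg=5))
    gp = np.polyder(g)                           # g' is needed for Gauss-Newton
    # step 2: given g, update w with one Gauss-Newton / weighted least-squares step
    gpv = np.where(np.abs(gp(v)) < 1e-6, 1e-6, gp(v))
    weights = gpv ** 2
    targets = v + (y - g(v)) / gpv               # linearized working response
    WX = X * weights[:, np.newaxis]
    w = np.linalg.solve(X.T.dot(WX), WX.T.dot(targets))
    w /= np.linalg.norm(w)                       # keep w a unit vector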
# ## Neural Networks (NN)
#
# ### The model
#
# * NN = non-linear statistical model inspired by biological neural networks
# * Set of neurons (much like $g_m(w_m^T X)$ in PPR) interconnected in layers in a feed-forward fashion

# How to plot a network diagram of a NN
import matplotlib.pyplot as plt
import networkx as nx


class NeuralNetwork(object):
    """ A simple neural network class for visualization purposes """

    def __init__(self, n_nodes_per_layer):
        """ Build the network layer by layer """
        self.n_nodes_per_layer = n_nodes_per_layer
        self.graph = nx.DiGraph()
        self.nodes_pos = {}
        self.nodes_label = {}
        self.input_units = []
        self.hidden_units = []
        self.output_units = []
        # add the units per layer
        for layer_i, layer_size in enumerate(n_nodes_per_layer):
            # add the nodes for this layer
            for node_i in range(layer_size):
                node = "%d_%d" % (layer_i, node_i)  # simple encoding of a node
                self.graph.add_node(node)
                self.nodes_pos[node] = (layer_i, layer_size / 2.0 - node_i)
                if layer_i == 0:
                    # label for input layer
                    self.input_units.append(node)
                    self.nodes_label[node] = r"$X_%d$" % node_i
                elif layer_i == len(n_nodes_per_layer) - 1:
                    # label for output layer
                    self.output_units.append(node)
                    self.nodes_label[node] = r"$Y_%d$" % node_i
                else:
                    # hidden layer
                    self.hidden_units.append(node)
                    self.nodes_label[node] = r"$Z_{%d, %d}$" % (layer_i, node_i)
            # add the edges: full connection between layer_i - 1 and layer_i
            if layer_i > 0:
                prev_layer_i = layer_i - 1
                prev_layer_size = n_nodes_per_layer[prev_layer_i]
                self.graph.add_edges_from([
                    ("%d_%d" % (prev_layer_i, l), "%d_%d" % (layer_i, k))
                    for k in range(layer_size)
                    for l in range(prev_layer_size)])

    def draw(self, ax=None):
        """ Draw the neural network """
        if ax is None:
            ax = plt.figure(figsize=(10, 6)).add_subplot(1, 1, 1)
        nx.draw_networkx_edges(self.graph, pos=self.nodes_pos, alpha=0.7, ax=ax)
        nx.draw_networkx_nodes(self.graph, nodelist=self.input_units,
                               pos=self.nodes_pos, ax=ax,
                               node_color='#66FFFF', node_size=700)
        nx.draw_networkx_nodes(self.graph, nodelist=self.hidden_units,
                               pos=self.nodes_pos, ax=ax,
                               node_color='#CCCCCC', node_size=900)
        nx.draw_networkx_nodes(self.graph, nodelist=self.output_units,
                               pos=self.nodes_pos, ax=ax,
                               node_color='#FFFF99', node_size=700)
        nx.draw_networkx_labels(self.graph, labels=self.nodes_label,
                                pos=self.nodes_pos, font_size=14, ax=ax)
        ax.axis('off')

# classical 3-layer neural network example
n_layers = 3  # input layer | hidden layer | output layer
n_input_dims = 5
n_hidden_units = 3
n_output_dims = 2
n_nodes_per_layer = [n_input_dims, n_hidden_units, n_output_dims]
nn1 = NeuralNetwork(n_nodes_per_layer)
# another example
nn2 = NeuralNetwork([10, 5, 10])
# show the graphical representations
fig = plt.figure(figsize=(12, 8))
nn1.draw(fig.add_subplot(1, 2, 1))
nn2.draw(fig.add_subplot(1, 2, 2))

# Formal definition of NN (assuming full connectivity between layers):
#
# * **Input layer**: one unit for each dimension of the input $X \in \mathbb{R}^p$
# * **Hidden layers**: *derived features* computed from the output of the previous layer
#     * Assuming one hidden layer with $M$ units
#     * Derived features: $Z_m = \sigma (\alpha_m^T X + \alpha_{0, m}), \ m = 1, \cdots, M$
#     * Activation function $\sigma$: typically a sigmoid $\sigma(v) = (1 + e^{-v})^{-1}$
# * **Output layer**: one unit for each dimension of the output $Y$
#     * Output $Y_k$ again computed using a linear combination of the outputs $Z$ of the previous layer: $Y_k = f_k(X) = g_k(\beta_k^T Z + \beta_{0, k})$
#     * Output function $g_k$ chosen based on the task:
#         * regression (e.g. denoising with auto-encoders, $p$ units to obtain $Y \in \mathbb{R}^p$ = denoised $X$): identity $g_k(V) = V$
#         * classification (e.g. for object recognition, $K$ units to output a probability for each possible object category): softmax $g_k(V) = \frac{e^{V_k}}{\sum_{l=1}^K e^{V_l}}$
# * Parameters (*weights*, noted $\theta$) to estimate from the data:
#     * the $\alpha_m$'s and bias terms $\alpha_{0, m}$ of each unit at each hidden layer
#     * the $\beta_k$'s and bias terms $\beta_{0, k}$ of each unit of the output layer
# * Remarks (a small NumPy sketch of this forward computation follows the list):
#     * linear $\sigma \Rightarrow$ linear model
#     * one hidden layer $\Rightarrow$ similar to PPR (but PPR uses few non-parametric $g_m$ functions, whereas NN uses many units with a simple $\sigma$ function)
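# To make the notation concrete, here is a minimal NumPy sketch (not part of
# the original notes) of the forward computation for one hidden layer with
# sigmoid activations and a softmax output layer; the weights are random, the
# point is just to spell out the formulas and the shapes involved.

# Forward pass of a single-hidden-layer network (illustration only)
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def softmax(v):
    e = np.exp(v - v.max())  # shift by the max for numerical stability
    return e / e.sum()

p, M, K = 5, 3, 2                                # input dim, hidden units, output units
rng = np.random.RandomState(0)
alpha, alpha0 = rng.randn(M, p), rng.randn(M)    # hidden-layer weights and biases
beta, beta0 = rng.randn(K, M), rng.randn(K)      # output-layer weights and biases

X = rng.randn(p)                                 # one input example
Z = sigmoid(alpha.dot(X) + alpha0)               # Z_m = sigma(alpha_m^T X + alpha_{0,m})
Y = softmax(beta.dot(Z) + beta0)                 # Y_k = g_k(beta_k^T Z + beta_{0,k}), softmax output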
# ## The Back-Propagation algorithm
#
# (Again assuming one hidden layer for brevity)
#
# * Algorithm used to learn the weights $\theta$ of a NN
# * Minimization of a loss function $R(\theta)$
#     * Regression: SSE
#
# $$R(\theta) = \sum_{k=1}^K \sum_{i=1}^N (y_{i,k} - f_k(x_i))^2$$
#
#     * Classification: (SSE or) Cross-Entropy
#
# $$R(\theta) = - \sum_{i=1}^N \sum_{k=1}^K y_{i,k} \log f_k(x_i)$$
#
# * **Regularization**:
#     * by adding a penalty term (e.g. $\Vert \theta \Vert_2^2$, called weight decay)
#     * or early stopping
#
# * **Back-Propagation**:
#     * minimization of $R(\theta)$ by *gradient descent* (updates written here for a single observation $i$; sum over $i$ for batch updates):
#
# $$\beta_{k, m}^{(r+1)} = \beta_{k, m}^{(r)} - \gamma_r \frac{\partial R_i}{\partial \beta_{k, m}^{(r)}}$$
#
# $$\alpha_{m, l}^{(r+1)} = \alpha_{m, l}^{(r)} - \gamma_r \frac{\partial R_i}{\partial \alpha_{m, l}^{(r)}}$$
#
# * Computing the gradient: using the chain rule, done in a forward and a backward sweep over the network
#     * forward pass: fix the weights, get the predicted values $\hat{f}_k(x_i)$
#     * backward pass: compute the errors and back-propagate them (estimate weights from the output $\rightarrow$ input layers); a small NumPy sketch of one such step is given at the end of this section
#
# ### Remarks
#
# * back-propagation can be done in batch or *online*, i.e. one observation at a time $\Rightarrow$ SGD (stochastic gradient descent)
# * learning rate $\gamma_r$:
#     * usually constant in batch mode
#     * must verify $\gamma_r \rightarrow 0$, $\sum_r \gamma_r = \infty$, and $\sum_r \gamma_r^2 \lt \infty$ (e.g. $\gamma_r = 1/r$) in SGD
# * back-prop with plain gradient descent is very slow in practice: use conjugate gradient (or other second-order methods)
# * the scaling of the inputs determines the scaling of the weights $\Rightarrow$ standardize (0 mean, unit variance)
# * initialization:
#     * important in practice: the loss is not convex (many local minima!)
#     * multiple re-starts with random near-zero weights (nearly linear regime for the sigmoid)
# * Combine the outputs of $L$ different learned NNs by averaging their predictions
#
# $$\hat{f}(X_{\mathrm{new}}) = \sum_{l=1}^L w_l \mathbb{E}(Y_{\mathrm{new}} | X_{\mathrm{new}}, \hat{\theta}_l)$$
#
#     * *bagging*: $w_l = 1/L$ and the $\hat{\theta}_l$ parameters obtained by fitting on different bootstrap samples of the training data
#     * *boosting*: $w_l = 1$ but $\hat{\theta}_l$ chosen in a non-random sequential fashion to improve the fit
#     * *Bayesian inference*: $w_l = 1/L$ and sample $\theta_l$ from the posterior using MCMC
# * number of hidden:
#     * units: the more the better + regularization
#     * layers: task-dependent (multiple layers $\Rightarrow$ hierarchical features)
#
# > $\Rightarrow$ that is where hand-tuning comes in! no free lunch ;-)
#
# * difficult to interpret / visualize what is learned (is it still true?)
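# As announced above, here is a schematic NumPy implementation (an
# illustration, not optimized code) of one online / SGD back-propagation step
# for a single-hidden-layer network with squared error and identity output
# (the regression case above); the gradients are exactly the chain-rule terms
# of the forward and backward sweeps.

# One SGD step of back-propagation (illustration only)
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.RandomState(0)
p, M, K = 5, 3, 2                        # input dim, hidden units, output units
alpha = 0.01 * rng.randn(M, p + 1)       # hidden weights, last column = bias
beta = 0.01 * rng.randn(K, M + 1)        # output weights, last column = bias
gamma = 0.1                              # learning rate


def sgd_step(x, y, alpha, beta, gamma):
    # forward pass: fix the weights, compute the prediction
    z = sigmoid(alpha[:, :-1].dot(x) + alpha[:, -1])
    f = beta[:, :-1].dot(z) + beta[:, -1]               # identity output (regression)
    # backward pass: back-propagate the errors (chain rule)
    delta_out = f - y                                   # gradient of 1/2 squared error w.r.t. f
    delta_hidden = beta[:, :-1].T.dot(delta_out) * z * (1 - z)
    beta -= gamma * np.outer(delta_out, np.append(z, 1.0))
    alpha -= gamma * np.outer(delta_hidden, np.append(x, 1.0))
    return alpha, beta

x_i, y_i = rng.randn(p), rng.randn(K)    # one (input, target) observation
alpha, beta = sgd_step(x_i, y_i, alpha, beta, gamma)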
# # Deep Learning in practice
#
# ## What is Deep Learning?
#
# * According to [deeplearning.net](http://deeplearning.net):
#
# > "Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence."
#
# > "... moving beyond shallow learning since 2006!"
#
# * Learning of probabilistic generative models with deep architectures, i.e. with many layers of non-linearities
#
# ### Examples
#
# * Deep Multi-Layer Perceptrons (i.e. NN like before, but with many hidden layers)
# * Convolutional Networks
# * Stacked Auto-encoders
# * Restricted Boltzmann Machines (RBM)
# * Deep Belief Networks (DBN)
#
# ### Why go deep?
#
# * Can trade off the number of units for layers (more efficient: fewer connections and therefore fewer weights)
# * Hierarchical: useful for multi-scale analysis
# * "Combinatorial sharing of statistical strength between multiple levels of latent variables" (NIPS'10 workshop)... ?
# * it works (according to Hinton and LeCun at first... but it is starting to generalize more and more to other researchers)
#
# ### How to learn deeply?
#
# * Back-prop has some severe limitations:
#     * the gradient gets progressively diluted across layers (very limited corrections in the first layers)
#     * local minima (very sensitive to initialization)
#     * needs labeled data (in the usual settings)
#
# * What works, e.g. on Deep Belief Networks (DBN):
#     * greedy layer-wise unsupervised learning (serves as an initialization step; a toy sketch is given at the end of this section):
#         * learn an RBM by maximizing a lower bound on the likelihood
#         * stack a new RBM on top, update the prior of the second layer with the posterior of the first, learn
#         * iterate
#         * justified in theory (cf. [Hinton'2006](http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf))
#     * top-down supervised training as a final refinement step
#
# * Many tools to learn deep architectures:
#     * At the core: libraries to work on *tensors* (multi-dimensional arrays)
#     * We will review two recent, efficient, easy to use, and actively maintained libraries:
#         * Theano (python)
#         * EBLearn (C++)
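# A very small NumPy sketch of the greedy layer-wise recipe above (an
# illustration, not part of the original notes): each layer is a binary RBM
# trained with one step of contrastive divergence (CD-1, the common
# approximation of the likelihood gradient; RBMs are described in more detail
# further below), and the hidden activations of one layer become the input of
# the next one.

# Toy greedy layer-wise pre-training with CD-1 RBMs (illustration only)
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def train_rbm_cd1(V, n_hidden, n_epochs=5, lr=0.1, rng=np.random):
    """ Train a binary RBM on the rows of V with CD-1 """
    n_samples, n_visible = V.shape
    W = 0.01 * rng.randn(n_visible, n_hidden)
    b, c = np.zeros(n_visible), np.zeros(n_hidden)      # visible / hidden offsets
    for epoch in range(n_epochs):
        # positive phase
        ph = sigmoid(V.dot(W) + c)                       # P(h=1 | v)
        h = (rng.rand(*ph.shape) < ph).astype(float)     # sample the hidden units
        # negative phase: one Gibbs step
        pv = sigmoid(h.dot(W.T) + b)                     # reconstruction P(v=1 | h)
        ph_neg = sigmoid(pv.dot(W) + c)
        # CD-1 approximation of the likelihood gradient
        W += lr * (V.T.dot(ph) - pv.T.dot(ph_neg)) / n_samples
        b += lr * (V - pv).mean(axis=0)
        c += lr * (ph - ph_neg).mean(axis=0)
    return W, b, c

# greedy layer-wise stacking: train a layer, feed its hidden activations upward
rng = np.random.RandomState(0)
data = (rng.rand(100, 20) > 0.5).astype(float)           # toy binary data
layers, inputs = [], data
for n_hidden in [15, 10]:
    W, b, c = train_rbm_cd1(inputs, n_hidden, rng=rng)
    layers.append((W, b, c))
    inputs = sigmoid(inputs.dot(W) + c)                  # representation for the next layer
# a top-down supervised fine-tuning pass (e.g. an MLP initialized with these
# weights) would follow as the final refinement step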
# ## Overview of Theano
#
# * [Theano](http://deeplearning.net/software/theano/):
#     * python Open Source library (BSD-like permissive license, cross-platform)
#     * Active project maintained by many people, including members of Yoshua Bengio's [LISA](http://www.iro.umontreal.ca/rubrique.php3?id_rubrique=27) group at U. of Montreal
#
# * Main purpose of Theano (from their [website](http://deeplearning.net/software/theano/)):
#
# > Theano is a Python library that allows you to define, optimize, and
# > evaluate mathematical expressions involving multi-dimensional
# > arrays efficiently. Theano features:
# >
# > * **tight integration with numpy** -- Use `numpy.ndarray` in Theano-compiled functions.
# > * **transparent use of a GPU** -- Perform data-intensive calculations up to 140x faster than with CPU. (float32 only)
# > * **efficient symbolic differentiation** -- Theano does your derivatives for functions with one or many inputs.
# > * **speed and stability optimizations** -- Get the right answer for ``log(1+x)`` even when ``x`` is really tiny.
# > * **dynamic C code generation** -- Evaluate expressions faster.
# > * **extensive unit-testing and self-verification** -- Detect and diagnose many types of mistakes.
#
# * Sort of a mix between [Numpy](http://numpy.scipy.org/) (Python's Matlab) and [Sympy](http://sympy.org/en/index.html) (Python's Mathematica) focusing on tensor operations
#
# > * *execution speed optimizations*: Theano can use `g++` or `nvcc` to compile
# > parts of your expression graph into CPU or GPU instructions, which run
# > much faster than pure Python.
# >
# > * *symbolic differentiation*: Theano can automatically build symbolic graphs
# > for computing gradients.
# >
# > * *stability optimizations*: Theano can recognize [some] numerically unstable
# > expressions and compute them with more stable algorithms.
#
# * Ambitious goals (and they're making progress!)
#
# > * Support tensor and sparse operations
# > * Support linear algebra operations
# > * Graph Transformations
# > * Differentiation/higher order differentiation
# > * 'R' and 'L' differential operators
# > * Speed/memory optimizations
# > * Numerical stability optimizations
# > * Can use many compiled languages, instruction sets: C/C++, CUDA, OpenCL, PTX, CAL, AVX, ...
# > * Lazy evaluation
# > * Loop
# > * Parallel execution (SIMD, multi-core, multi-node on cluster,
# > multi-node distributed)
# > * Support all NumPy/basic SciPy functionality
# > * Easy wrapping of library functions in Theano
#
# * Easy to install (just do: ``pip install Theano``)
# * Extensively documented ([docs](http://deeplearning.net/software/theano/library/index.html#libdoc))
# * Great tutorial ([tuto](http://deeplearning.net/software/theano/tutorial/index.html#tutorial))
# * Great [tutorials on deep learning](http://deeplearning.net/tutorial/) written by members of the LISA lab

# Sneak peek
import theano
from theano import tensor

# declare two symbolic floating-point scalars
a = tensor.dscalar()
b = tensor.dscalar()
# create a simple expression
c = a + b
# convert the expression into a callable object that takes (a, b)
# values as input and computes a value for c
f = theano.function([a, b], c)
# bind 1.5 to 'a', 2.5 to 'b', and evaluate 'c'
assert 4.0 == f(1.5, 2.5)

# Logistic Regression simple example
import numpy
import theano
import theano.tensor as T

rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats), name="w")
b = theano.shared(0., name="b")
#print "Initial model:"
#print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))            # Probability that target = 1
prediction = p_1 > 0.5                             # The prediction thresholded
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)  # Cross-entropy loss function
cost = xent.mean() + 0.01 * (w ** 2).sum()         # The cost to minimize
gw, gb = T.grad(cost, [w, b])                      # Compute the gradient of the cost

# Compile
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates={w: w - 0.1 * gw, b: b - 0.1 * gb})
predict = theano.function(inputs=[x], outputs=prediction)

# Train
for i in range(training_steps):
    pred, err = train(D[0], D[1])

# To display the model, uncomment below:
#print "Final model:"
#print w.get_value(), b.get_value()
#print "target values for D:", D[1]
#print "prediction on D:", predict(D[0])
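# The Logistic Regression example above already relies on `T.grad`; as an even
# smaller illustration of the "efficient symbolic differentiation" feature
# quoted earlier (this mirrors the "Derivatives" part of the official Theano
# tutorial), one can build and inspect the gradient graph of a toy expression:

# Symbolic differentiation in a nutshell
import theano
from theano import tensor, pp

x = tensor.dscalar('x')
y = x ** 2
gy = tensor.grad(y, x)             # symbolic gradient dy/dx
print pp(gy)                       # pretty-print the (not yet optimized) gradient graph
fgrad = theano.function([x], gy)   # compiled function; the graph simplifies to 2 * x
assert fgrad(4.0) == 8.0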
# ## Deep Learning in Theano
#
# Many implementations are already available and documented in the [deep learning tutorials](http://deeplearning.net/tutorial/) (cf. [the code on github](https://github.com/lisa-lab/DeepLearningTutorials))
#
# ### Supervised learning
#
# * [Introduction](http://deeplearning.net/tutorial/gettingstarted.html): notations, datasets, extensive primer on supervised deep learning
# * [Logistic Regression (LR) + application on MNIST](http://deeplearning.net/tutorial/logreg.html) (implemented both with [conjugate gradient](https://github.com/lisa-lab/DeepLearningTutorials/blob/master/code/logistic_cg.py) and with [SGD](https://github.com/lisa-lab/DeepLearningTutorials/blob/master/code/logistic_sgd.py)): often used as the last layer of classification architectures (a tiny NumPy version of this prediction rule is sketched after this list)
#
# > $$
# > P(Y=i|x, W,b) = \mathrm{softmax}_i(W x + b)
# > = \frac {e^{W_i x + b_i}} {\sum_j e^{W_j x + b_j}}
# > $$
#
# > $$y_{pred} = {\rm argmax}_i P(Y=i|x,W,b)$$
#
# * [Multilayer Perceptron (MLP)](http://deeplearning.net/tutorial/mlp.html): going from LR to MLP, training with SGD + tricks of the trade (activation functions, regularization, initialization, learning rate, ...)
#
# * [Deep Convolutional Neural Networks (CNN)](http://deeplearning.net/tutorial/lenet.html): variants of MLP with local connectivity (to mimic biological receptive fields), weight sharing (detect features regardless of position + reduces the number of parameters / capacity control + still trainable with SGD), and many layers, i.e. a succession of convolution, pooling and subsampling operations; example CNN used: LeNet (cf. ["Gradient-based learning applied to document recognition", LeCun *et al.*, 1998](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf))
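# As referenced above, a minimal NumPy illustration of the softmax prediction
# rule of the LR layer (toy random parameters, not the tutorials' Theano code):

# Softmax classification rule (illustration only)
import numpy as np


def softmax(a):
    e = np.exp(a - a.max())        # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.RandomState(0)
W, b = rng.randn(10, 784), rng.randn(10)   # 10 classes, 784 input dims (MNIST-like)
x = rng.randn(784)
p = softmax(np.dot(W, x) + b)              # P(Y=i | x, W, b) for each class i
y_pred = np.argmax(p)                      # predicted class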
# ### Unsupervised learning
#
# * [Auto-Encoders](http://deeplearning.net/tutorial/dA.html#daa): build a latent representation (*coding*), mapped back (*decoding*) into a reconstruction of the input by minimizing the reconstruction error; application to image denoising
#
# * [Stacked Denoising Auto-Encoders](http://deeplearning.net/tutorial/SdA.html#sda): unsupervised pre-training one layer at a time for deep nets; final stage: task-specific fine tuning using a supervised MLP
#
# * [Restricted Boltzmann Machines (RBM)](http://deeplearning.net/tutorial/rbm.html#rbm):
#     * single-layer generative Energy-Based Model (EBM, cf. the EBLearn overview below)
#     * energy: $E(v, h) = -b' v - c'h - h'Wv$, where $v$ denotes the observed (visible) input units, $h$ the hidden units, $W$ the connection weights, and $b, c$ the offset parameters
#     * a particular form of log-linear Markov Random Field (MRF), i.e. one for which the energy function is linear in its free parameters
#     * learned by SGD with MCMC sampling to approximate the gradient
#
# * [Deep Belief Networks (DBN)](http://deeplearning.net/tutorial/DBN.html#dbn): stacked RBMs, trained in a greedy layer-wise manner
#
# ## Overview of EBLearn
#
# * [EBLearn](http://eblearn.cs.nyu.edu:21991/doku.php):
#     * well-architected C++ library for Energy-Based Learning, maintained by Yann LeCun's group @NYU (in particular [Pierre Sermanet](http://cs.nyu.edu/~sermanet/))
#     * used in robotics (cf. the [Learning Applied to Ground Robots (LAGR)](http://www.cs.nyu.edu/~yann/research/lagr/) DARPA project)
#     * particular focus on supervised training of convolutional neural networks (especially LeNet)
#
# * Energy-Based learning (overview from [here](http://deeplearning.net/tutorial/rbm.html)):
#     * associate a scalar energy to each configuration of the variables of interest
#     * learning corresponds to modifying that energy function so that its shape has desirable properties:
#         * plausible configurations $\Rightarrow$ low energy
#         * bad configurations $\Rightarrow$ high energy
#     * probability distribution: $Pr(x) = e^{-E(x)} / Z$, where $Z=\sum_x e^{-E(x)}$ is the *partition function*
#     * model learnt by (stochastic) gradient descent on the empirical negative log-likelihood of the training data
#
# $$
# l(\theta, \mathcal{D}) = - \frac{1}{N} \sum_{x^{(i)} \in \mathcal{D}} \log Pr(x^{(i)})
# $$
#
#     * with latent variables $h$, the model becomes
#
# $$
# Pr(x) = \sum_h Pr(x, h) = \sum_h \frac{e^{-E(x,h)}}{Z} = \frac{e^{-\mathcal{F}(x)}}{Z}
# $$
#
# where $\mathcal{F}(x) = - \log \sum_h e^{-E(x, h)}$ is called the **free energy** and $Z = \sum_x e^{-\mathcal{F}(x)}$
#
#     * in this case, the gradient of the data's negative log-likelihood decomposes over a *positive* and a *negative* phase:
#
# $$
# - \frac{\partial \log p(x)}{\partial \theta}
# = \frac{\partial \mathcal{F}(x)}{\partial \theta} -
# \sum_{\tilde{x}} p(\tilde{x})
# \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}
# $$
#
#     * an approximate estimate of the *negative* phase is obtained by averaging over *negative particles* $\mathcal{N}$ sampled using MCMC
#
# $$
# - \frac{\partial \log p(x)}{\partial \theta}
# \approx \frac{\partial \mathcal{F}(x)}{\partial \theta} -
# \frac{1}{|\mathcal{N}|}\sum_{\tilde{x} \in \mathcal{N}}
# \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}
# $$
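# For a binary RBM the free energy defined above has a simple closed form,
# $\mathcal{F}(v) = -b'v - \sum_j \log(1 + e^{c_j + (Wv)_j})$ (this is what the
# Theano RBM tutorial implements). Here is the same quantity in plain NumPy,
# with toy sizes and random parameters, just to fix the notation:

# Free energy of a binary RBM (illustration only)
import numpy as np

rng = np.random.RandomState(0)
n_visible, n_hidden = 6, 4
W = rng.randn(n_hidden, n_visible)             # connection weights
b = rng.randn(n_visible)                       # visible offsets
c = rng.randn(n_hidden)                        # hidden offsets
v = (rng.rand(n_visible) > 0.5).astype(float)  # one binary visible configuration

free_energy = -b.dot(v) - np.sum(np.log1p(np.exp(c + W.dot(v))))
# Pr(v) is proportional to exp(-free_energy); the intractable part is the
# normalization Z, hence the MCMC approximation of the negative phase above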
# ### EBLearn usage
#
# * Starting point: [paper](http://cs.nyu.edu/~koray/publis/sermanet-ictai-09.pdf), [tutorials](http://eblearn.cs.nyu.edu:21991/doku.php?id=start), [demos](http://eblearn.cs.nyu.edu:21991/doku.php?id=all_demos), and [documentation](http://eblearn.cs.nyu.edu:21991/doku.php?id=all_docs)
#
# * Installation: using CMake, cross-platform (even Android and iOS!), only a few dependencies
#
# * Beginner tutorial ([part 1](http://eblearn.cs.nyu.edu:21991/doku.php?id=beginner_tutorial1_dscompile), [part 2](http://eblearn.cs.nyu.edu:21991/doku.php?id=beginner_tutorial2_train)):
#     * build a LeNet5 digit recognizer on MNIST
#     * no C++ coding required, just write EBLearn configuration files and use EBLearn's command-line tools (e.g. `train`)
#
# * A set of more advanced tutorials (cf. [here](http://eblearn.cs.nyu.edu:21991/doku.php?id=start)) describes how to use EBLearn's C++ API, resting on its powerful tensor library `libidx` ([tuto](http://eblearn.cs.nyu.edu:21991/doku.php?id=libidx)) and its energy-based learning library `libeblearn` ([tuto](http://eblearn.sourceforge.net/old/tutorials/libeblearn/index.shtml))
#
# * [Face demo](http://eblearn.cs.nyu.edu:21991/doku.php?id=face_detector) on the "Labeled Faces in the Wild" dataset: just type `make detect` :-)