# -*- coding: utf-8 -*-

# # Reading group on Deep Learning
#
# LEAR - XRCE, 23 Nov. 2012
#
# Notes on this document:
#
# * this HTML page was generated from the [IPython](http://ipython.org/) notebook available [here](http://lear.inrialpes.fr/people/gaidon/lear_xrce_deep_learning_01.ipynb) (pure python version executable in any python interpreter: [here](http://lear.inrialpes.fr/people/gaidon/lear_xrce_deep_learning_01.py))
#
# * dependencies to run the code (only Open Source Software!):
#     * [Python](http://python.org/): the best programming language, ever.
#     * [IPython](http://ipython.org/): an enhanced python interpreter + great tools for scientific workflow
#     * [Numpy](http://numpy.scipy.org/): the Matlab library for Python (multi-dimensional arrays, linear algebra, ...)
#     * [Matplotlib](http://matplotlib.org/): the scientific plotting library for Python
#     * [Networkx](http://networkx.lanl.gov/): package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
#     * [Theano](http://deeplearning.net/software/theano/): efficient tensor library, CPU + GPU expression compiler, used for Deep Learning

# ## Outline
#
# * Refresher on Neural Networks
#     * Neural Networks
#     * Back-Propagation
# * Deep Learning in practice
#     * What is Deep Learning?
#     * Overview of Theano
#     * Overview of EBLearn

# ## Pointers
#
# ### Reading material
#
# * [The Elements of Statistical Learning (5th ed.)](http://www-stat.stanford.edu/~tibs/ElemStatLearn/) (T. Hastie, R. Tibshirani, & J. Friedman): Chapter 11 (Neural Networks)
# * [Learning Deep Architectures for AI](http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/239) (Y. Bengio, in Foundations & Trends in Machine Learning, 2009): small book providing an introduction to deep learning and individual chapters on most of the main deep architectures
# * [A fast learning algorithm for deep belief nets](http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf) (G. Hinton *et al.*, Neural Computation, 2006): fast algorithm for learning deep belief nets + justification of greedy layer-wise training and stacking
# * [Building High-level Features Using Large Scale Unsupervised Learning](http://research.google.com/archive/unsupervised_icml2012.html) (Quoc V. Le *et al.*, ICML'2012): large scale deep learning (sparse auto-encoders) using 16000 cores; led to face and cat neurons from unlabeled data; state of the art on ImageNet from raw pixels; made the front page of the NY Times.
# * [On Deep Generative Models with Applications to Recognition](http://www.cs.toronto.edu/~hinton/absps/ranzato_cvpr2011.pdf) (M.A. Ranzato *et al.*, CVPR 2011)
# * Y. LeCun's recent papers (cf. his [publication list](http://yann.lecun.com/exdb/publis/index.html))
# * [CVPR'2012 workshop on Deep Learning Methods for Vision](http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/): overview of the most common deep architectures, transfer learning, application to motion and video
# * [CVPR'2012 workshop on Multiview Feature Learning](http://learning.cs.toronto.edu/~rfm/multiview-feature-learning-cvpr/): recent workshop about deep unsupervised feature learning, includes applications of deep learning techniques to video
# * [tutorial slides](http://deeplearningworkshopnips2010.files.wordpress.com/2010/09/nips10-workshop-tutorial-final.pdf) at the NIPS 2010 Workshop on Deep Learning and Unsupervised Feature Learning
# * cf. the [deeplearning.net reading list](http://deeplearning.net/reading-list/) for additional pointers
# ### Websites
#
# * [deeplearning.net](http://deeplearning.net): big hub with many links related to deep learning (software, publications, ...)
# * [Deep learning tutorials](http://deeplearning.net/tutorial/): set of tutorials on deep learning (RBM, ConvNets, ...) with theano
# * [Neural Nets F.A.Q.](ftp://ftp.sas.com/pub/neural/FAQ.html): extremely comprehensive FAQ on neural nets (from stats to software)
# * [Homepage of G. Hinton](http://www.cs.toronto.edu/~hinton/)
# * [Homepage of Y. LeCun](http://yann.lecun.com/)
# * Mainstream media coverage of advances in deep learning:
#     * [NY Times front page article](http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html) on the [ICML'12 paper](http://research.google.com/archive/unsupervised_icml2012.html) of Andrew Ng's group (Ng's interview: [here](http://www.npr.org/2012/06/26/155792609/a-massive-google-network-learns-to-identify))
#     * [Another NY Times article](http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html?pagewanted=all&_r=1&&_r=0) about the recent progress in deep learning on a wide range of problems
#     * [Blog post](http://blogs.technet.com/b/next/archive/2012/11/08/microsoft-research-shows-a-promising-new-breakthrough-in-speech-translation-technology.aspx#.ULM2fZu9DSJ) of [Rick Rashid](http://research.microsoft.com/en-us/people/rashid/), Microsoft's CRO, with an interesting video explaining the progress made in Automatic Speech Recognition and Translation: from matching sound waves, to HMMs, to Deep Learning; R. Rashid makes a cool live demo (at the end of the video) where his speech gets translated on the fly into (spoken!) Mandarin
#
# ### Code
#
# * [theano](http://deeplearning.net/software/theano/): Python library for efficient computations on tensors (multi-dimensional arrays)
# * [EBLearn](http://eblearn.cs.nyu.edu:21991/doku.php): C++ library developed @NYU (LeCun) for *Energy-Based Learning*
# * [cuda-convnet](http://code.google.com/p/cuda-convnet/): fast implementation of ConvNets in C++ and Cuda (GPU), by the winners of ImageNet LSVRC 2012 (Krizhevsky A. *et al.*, NIPS'12 paper to come)
# # Refresher on Neural Networks
#
# ## Central idea
#
# * extract *derived features* = linear combinations of the inputs
# * predict the target using a *non-linear* function of these features
#
# ## Preliminary: Projection Pursuit Regression (PPR)
#
# * Supervised learning: input vector $X \in \mathbb{R}^p$, target $Y$
# * $V_m = w_m^T X$ : derived features, with $M$ unit vectors $w_m \in \mathbb{R}^p$
# * PPR prediction:
#
# > $$f(X) = \sum_{m=1}^{M} g_m(V_m)$$
#
# * Parameters:
#     * $\\{w_m\\}_m$ : projection directions
#     * $\\{g_m\\}_m$ : ridge functions (each varies only in the direction of $w_m$)

# Plot of two ridge function examples
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
import numpy as np


def g_left(X1, X2):
    V = (X1 + X2) / np.sqrt(2)
    return 1.0 / (1.0 + np.exp(-5 * (V - 0.5)))


def g_right(X1, X2):
    V = X1
    return (V + 0.1) * np.sin(10 / (V / 3.0 + 0.1))


def plot_surface_3d(X1, X2, func, fig=None, subplot=(1, 1, 1)):
    if fig is None:
        fig = plt.figure(figsize=(12, 5))
    ax = fig.add_subplot(*subplot, projection='3d')
    X1, X2 = np.meshgrid(X1, X2)
    Z = func(X1, X2)
    surf = ax.plot_surface(X1, X2, Z, rstride=1, cstride=1, cmap=cm.coolwarm,
                           linewidth=0, antialiased=False)

fig = plt.figure(figsize=(12, 5))
X1, X2 = np.linspace(-1.5, 1.5, 150), np.linspace(-0.5, 1, 150)
plot_surface_3d(X1, X2, g_left, fig, subplot=(1, 2, 1))
X1, X2 = np.linspace(-0.05, 0.05, 300), np.linspace(0, 1, 100)
plot_surface_3d(X1, X2, g_right, fig, subplot=(1, 2, 2))

# * The PPR model is a *universal approximator*: it can approximate any continuous function in $\mathbb{R}^p$ (if $M$ is taken arbitrarily large, for appropriate choices of $g_m$)
# * but interpretation of the fitted model is difficult (mixing of the input features)
# * **Learning**: the PPR model is fitted by minimizing the squared error in an alternating manner (a toy sketch follows below):
#     * Given $w_m$, estimate $g_m$: 1D smoothing problem (e.g. use splines)
#     * Given $g_m$, estimate $w_m$: non-linear least-squares (e.g. use Gauss-Newton)
#     * Iterate these two steps until convergence
#     * Greedily add a new pair $(w_{m+1}, g_{m+1})$
# * PPR is not used much (originally for computational reasons)
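# To make the alternating scheme concrete, here is a small NumPy sketch for a
# single ridge term ($M = 1$) on synthetic data. It is only an illustration of
# the two alternating steps above, not a faithful PPR implementation: a
# polynomial fit stands in for the 1D spline smoother, and the direction update
# is one Gauss-Newton step written as a weighted least-squares problem.

# Toy alternating fit of one PPR ridge term (illustration only)
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
w_true = np.array([0.8, 0.6])                    # unknown projection direction
y = np.sin(X.dot(w_true)) + 0.1 * rng.randn(200)

w = np.array([1.0, 0.0])                         # initial unit direction
for it in range(20):
    v = X.dot(w)
    # step 1: given w, estimate g (1D smoothing; crude polynomial stand-in)
    g = np.poly1d(np.polyfit(v, y, deg=5))
    gp = np.polyder(g)                           # g' is needed for Gauss-Newton
    # step 2: given g, update w with one Gauss-Newton / weighted least-squares step
    gpv = np.where(np.abs(gp(v)) < 1e-6, 1e-6, gp(v))
    weights = gpv ** 2
    targets = v + (y - g(v)) / gpv               # linearized working response
    WX = X * weights[:, np.newaxis]
    w = np.linalg.solve(X.T.dot(WX), WX.T.dot(targets))
    w /= np.linalg.norm(w)                       # keep w a unit vector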
# ## Neural Networks (NN)
#
# ### The model
#
# * NN = non-linear statistical model inspired by biological neural networks
# * Set of neurons (much like $g_m(w_m^T X)$ in PPR) interconnected in layers in a feed-forward fashion

# How to plot a network diagram of a NN
import matplotlib.pyplot as plt
import networkx as nx


class NeuralNetwork(object):
    """ A simple neural network class for visualization purposes """

    def __init__(self, n_nodes_per_layer):
        """ Build the network layer by layer """
        self.n_nodes_per_layer = n_nodes_per_layer
        self.graph = nx.DiGraph()
        self.nodes_pos = {}
        self.nodes_label = {}
        self.input_units = []
        self.hidden_units = []
        self.output_units = []
        # add the units per layer
        for layer_i, layer_size in enumerate(n_nodes_per_layer):
            # add the nodes for this layer
            for node_i in range(layer_size):
                node = "%d_%d" % (layer_i, node_i)  # simple encoding of a node
                self.graph.add_node(node)
                self.nodes_pos[node] = (layer_i, layer_size / 2.0 - node_i)
                if layer_i == 0:
                    # label for input layer
                    self.input_units.append(node)
                    self.nodes_label[node] = r"$X_%d$" % node_i
                elif layer_i == len(n_nodes_per_layer) - 1:
                    # label for output layer
                    self.output_units.append(node)
                    self.nodes_label[node] = r"$Y_%d$" % node_i
                else:
                    # hidden layer
                    self.hidden_units.append(node)
                    self.nodes_label[node] = r"$Z_{%d, %d}$" % (layer_i, node_i)
            # add the edges: full connection between layer_i - 1 and layer_i
            if layer_i > 0:
                prev_layer_i = layer_i - 1
                prev_layer_size = n_nodes_per_layer[prev_layer_i]
                self.graph.add_edges_from([
                    ("%d_%d" % (prev_layer_i, l), "%d_%d" % (layer_i, k))
                    for k in range(layer_size)
                    for l in range(prev_layer_size)])

    def draw(self, ax=None):
        """ Draw the neural network """
        if ax is None:
            ax = plt.figure(figsize=(10, 6)).add_subplot(1, 1, 1)
        nx.draw_networkx_edges(self.graph, pos=self.nodes_pos, alpha=0.7, ax=ax)
        nx.draw_networkx_nodes(self.graph, nodelist=self.input_units,
                               pos=self.nodes_pos, ax=ax,
                               node_color='#66FFFF', node_size=700)
        nx.draw_networkx_nodes(self.graph, nodelist=self.hidden_units,
                               pos=self.nodes_pos, ax=ax,
                               node_color='#CCCCCC', node_size=900)
        nx.draw_networkx_nodes(self.graph, nodelist=self.output_units,
                               pos=self.nodes_pos, ax=ax,
                               node_color='#FFFF99', node_size=700)
        nx.draw_networkx_labels(self.graph, labels=self.nodes_label,
                                pos=self.nodes_pos, font_size=14, ax=ax)
        ax.axis('off')

# classical 3-layer neural network example
n_layers = 3  # input layer | hidden layer | output layer
n_input_dims = 5
n_hidden_units = 3
n_output_dims = 2
n_nodes_per_layer = [n_input_dims, n_hidden_units, n_output_dims]
nn1 = NeuralNetwork(n_nodes_per_layer)
# another example
nn2 = NeuralNetwork([10, 5, 10])
# show the graphical representations
fig = plt.figure(figsize=(12, 8))
nn1.draw(fig.add_subplot(1, 2, 1))
nn2.draw(fig.add_subplot(1, 2, 2))

# Formal definition of NN (assuming full connectivity between layers):
#
# * **Input layer**: one unit for each dimension of the input $X \in \mathbb{R}^p$
# * **Hidden layers**: *derived features* computed from the output of the previous layer
#     * Assuming one hidden layer with $M$ units
#     * Derived features: $Z_m = \sigma (\alpha_m^T X + \alpha_{0, m}), \ m = 1, \cdots, M$
#     * Activation function $\sigma$: typically a sigmoid $\sigma(v) = (1 + e^{-v})^{-1}$
# * **Output layer**: one unit for each dimension of the output $Y$
#     * Output $Y_k$ again computed using a linear combination of the outputs $Z$ of the previous layer: $Y_k = f_k(X) = g_k(\beta_k^T Z + \beta_{0, k})$
#     * Output function $g_k$ chosen based on the task:
#         * regression (e.g. denoising with auto-encoders, $p$ units to obtain $Y \in \mathbb{R}^p$ = denoised $X$): identity $g_k(V) = V$
#         * classification (e.g. for object recognition, $K$ units to output a probability for each possible object category): softmax $g_k(V) = \frac{e^{V_k}}{\sum_{l=1}^K e^{V_l}}$
# * Parameters (*weights*, noted $\theta$) to estimate from the data:
#     * the $\alpha_m$'s and bias terms $\alpha_{0, m}$ of each unit at each hidden layer
#     * the $\beta_k$'s and bias terms $\beta_{0, k}$ of each unit of the output layer
# * Remarks (a small NumPy sketch of this forward computation follows the list):
#     * linear $\sigma \Rightarrow$ linear model
#     * one hidden layer $\Rightarrow$ similar to PPR (but PPR uses few non-parametric $g_m$ functions, whereas NN uses many units with a simple $\sigma$ function)
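# To make the notation concrete, here is a minimal NumPy sketch (not part of
# the original notes) of the forward computation for one hidden layer with
# sigmoid activations and a softmax output layer; the weights are random, the
# point is just to spell out the formulas and the shapes involved.

# Forward pass of a single-hidden-layer network (illustration only)
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def softmax(v):
    e = np.exp(v - v.max())  # shift by the max for numerical stability
    return e / e.sum()

p, M, K = 5, 3, 2                                # input dim, hidden units, output units
rng = np.random.RandomState(0)
alpha, alpha0 = rng.randn(M, p), rng.randn(M)    # hidden-layer weights and biases
beta, beta0 = rng.randn(K, M), rng.randn(K)      # output-layer weights and biases

X = rng.randn(p)                                 # one input example
Z = sigmoid(alpha.dot(X) + alpha0)               # Z_m = sigma(alpha_m^T X + alpha_{0,m})
Y = softmax(beta.dot(Z) + beta0)                 # Y_k = g_k(beta_k^T Z + beta_{0,k}), softmax output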
# ## The Back-Propagation algorithm
#
# (Again assuming one hidden layer for brevity)
#
# * Algorithm used to learn the weights $\theta$ of a NN
# * Minimization of a loss function $R(\theta)$
#     * Regression: SSE
#
# $$R(\theta) = \sum_{k=1}^K \sum_{i=1}^N (y_{i,k} - f_k(x_i))^2$$
#
#     * Classification: (SSE or) Cross-Entropy
#
# $$R(\theta) = - \sum_{i=1}^N \sum_{k=1}^K y_{i,k} \log f_k(x_i)$$
#
# * **Regularization**:
#     * by adding a penalty term (e.g. $\Vert \theta \Vert_2^2$, called weight decay)
#     * or early stopping
#
# * **Back-Propagation**:
#     * minimization of $R(\theta)$ by *gradient descent* (updates written here for a single observation $i$; sum over $i$ for batch updates):
#
# $$\beta_{k, m}^{(r+1)} = \beta_{k, m}^{(r)} - \gamma_r \frac{\partial R_i}{\partial \beta_{k, m}^{(r)}}$$
#
# $$\alpha_{m, l}^{(r+1)} = \alpha_{m, l}^{(r)} - \gamma_r \frac{\partial R_i}{\partial \alpha_{m, l}^{(r)}}$$
#
# * Computing the gradient: using the chain rule, done in a forward and a backward sweep over the network
#     * forward pass: fix the weights, get the predicted values $\hat{f}_k(x_i)$
#     * backward pass: compute the errors and back-propagate them (estimate weights from the output $\rightarrow$ input layers); a small NumPy sketch of one such step is given at the end of this section
#
# ### Remarks
#
# * back-propagation can be done in batch or *online*, i.e. one observation at a time $\Rightarrow$ SGD (stochastic gradient descent)
# * learning rate $\gamma_r$:
#     * usually constant in batch mode
#     * must verify $\gamma_r \rightarrow 0$, $\sum_r \gamma_r = \infty$, and $\sum_r \gamma_r^2 \lt \infty$ (e.g. $\gamma_r = 1/r$) in SGD
# * back-prop with plain gradient descent is very slow in practice: use conjugate gradient (or other second-order methods)
# * the scaling of the inputs determines the scaling of the weights $\Rightarrow$ standardize (0 mean, unit variance)
# * initialization:
#     * important in practice: the loss is not convex (many local minima!)
#     * multiple re-starts with random near-zero weights (nearly linear regime for the sigmoid)
# * Combine the outputs of $L$ different learned NNs by averaging their predictions
#
# $$\hat{f}(X_{\mathrm{new}}) = \sum_{l=1}^L w_l \mathbb{E}(Y_{\mathrm{new}} | X_{\mathrm{new}}, \hat{\theta}_l)$$
#
#     * *bagging*: $w_l = 1/L$ and the $\hat{\theta}_l$ parameters obtained by fitting on different bootstrap samples of the training data
#     * *boosting*: $w_l = 1$ but $\hat{\theta}_l$ chosen in a non-random sequential fashion to improve the fit
#     * *Bayesian inference*: $w_l = 1/L$ and sample $\theta_l$ from the posterior using MCMC
# * number of hidden:
#     * units: the more the better + regularization
#     * layers: task-dependent (multiple layers $\Rightarrow$ hierarchical features)
#
# > $\Rightarrow$ that is where hand-tuning comes in! no free lunch ;-)
#
# * difficult to interpret / visualize what is learned (is it still true?)
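# As announced above, here is a schematic NumPy implementation (an
# illustration, not optimized code) of one online / SGD back-propagation step
# for a single-hidden-layer network with squared error and identity output
# (the regression case above); the gradients are exactly the chain-rule terms
# of the forward and backward sweeps.

# One SGD step of back-propagation (illustration only)
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.RandomState(0)
p, M, K = 5, 3, 2                        # input dim, hidden units, output units
alpha = 0.01 * rng.randn(M, p + 1)       # hidden weights, last column = bias
beta = 0.01 * rng.randn(K, M + 1)        # output weights, last column = bias
gamma = 0.1                              # learning rate


def sgd_step(x, y, alpha, beta, gamma):
    # forward pass: fix the weights, compute the prediction
    z = sigmoid(alpha[:, :-1].dot(x) + alpha[:, -1])
    f = beta[:, :-1].dot(z) + beta[:, -1]               # identity output (regression)
    # backward pass: back-propagate the errors (chain rule)
    delta_out = f - y                                   # gradient of 1/2 squared error w.r.t. f
    delta_hidden = beta[:, :-1].T.dot(delta_out) * z * (1 - z)
    beta -= gamma * np.outer(delta_out, np.append(z, 1.0))
    alpha -= gamma * np.outer(delta_hidden, np.append(x, 1.0))
    return alpha, beta

x_i, y_i = rng.randn(p), rng.randn(K)    # one (input, target) observation
alpha, beta = sgd_step(x_i, y_i, alpha, beta, gamma)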
# # Deep Learning in practice
#
# ## What is Deep Learning?
#
# * According to [deeplearning.net](http://deeplearning.net):
#
# > "Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence."
#
# > "... moving beyond shallow learning since 2006!"
#
# * Learning of probabilistic generative models with deep architectures, i.e. with many layers of non-linearities
#
# ### Examples
#
# * Deep Multi-Layer Perceptrons (i.e. NN like before, but with many hidden layers)
# * Convolutional Networks
# * Stacked Auto-encoders
# * Restricted Boltzmann Machines (RBM)
# * Deep Belief Networks (DBN)
#
# ### Why go deep?
#
# * Can trade off the number of units for layers (more efficient: fewer connections and therefore fewer weights)
# * Hierarchical: useful for multi-scale analysis
# * "Combinatorial sharing of statistical strength between multiple levels of latent variables" (NIPS'10 workshop)... ?
# * it works (according to Hinton and LeCun at first... but it is starting to generalize more and more to other researchers)
#
# ### How to learn deeply?
#
# * Back-prop has some severe limitations:
#     * the gradient gets progressively diluted across layers (very limited corrections in the first layers)
#     * local minima (very sensitive to initialization)
#     * needs labeled data (in the usual settings)
#
# * What works, e.g. on Deep Belief Networks (DBN):
#     * greedy layer-wise unsupervised learning (serves as an initialization step; a toy sketch is given at the end of this section):
#         * learn an RBM by maximizing a lower bound on the likelihood
#         * stack a new RBM on top, update the prior of the second layer with the posterior of the first, learn
#         * iterate
#         * justified in theory (cf. [Hinton'2006](http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf))
#     * top-down supervised training as a final refinement step
#
# * Many tools to learn deep architectures:
#     * At the core: libraries to work on *tensors* (multi-dimensional arrays)
#     * We will review two recent, efficient, easy to use, and actively maintained libraries:
#         * Theano (python)
#         * EBLearn (C++)
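# A very small NumPy sketch of the greedy layer-wise recipe above (an
# illustration, not part of the original notes): each layer is a binary RBM
# trained with one step of contrastive divergence (CD-1, the common
# approximation of the likelihood gradient; RBMs are described in more detail
# further below), and the hidden activations of one layer become the input of
# the next one.

# Toy greedy layer-wise pre-training with CD-1 RBMs (illustration only)
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def train_rbm_cd1(V, n_hidden, n_epochs=5, lr=0.1, rng=np.random):
    """ Train a binary RBM on the rows of V with CD-1 """
    n_samples, n_visible = V.shape
    W = 0.01 * rng.randn(n_visible, n_hidden)
    b, c = np.zeros(n_visible), np.zeros(n_hidden)      # visible / hidden offsets
    for epoch in range(n_epochs):
        # positive phase
        ph = sigmoid(V.dot(W) + c)                       # P(h=1 | v)
        h = (rng.rand(*ph.shape) < ph).astype(float)     # sample the hidden units
        # negative phase: one Gibbs step
        pv = sigmoid(h.dot(W.T) + b)                     # reconstruction P(v=1 | h)
        ph_neg = sigmoid(pv.dot(W) + c)
        # CD-1 approximation of the likelihood gradient
        W += lr * (V.T.dot(ph) - pv.T.dot(ph_neg)) / n_samples
        b += lr * (V - pv).mean(axis=0)
        c += lr * (ph - ph_neg).mean(axis=0)
    return W, b, c

# greedy layer-wise stacking: train a layer, feed its hidden activations upward
rng = np.random.RandomState(0)
data = (rng.rand(100, 20) > 0.5).astype(float)           # toy binary data
layers, inputs = [], data
for n_hidden in [15, 10]:
    W, b, c = train_rbm_cd1(inputs, n_hidden, rng=rng)
    layers.append((W, b, c))
    inputs = sigmoid(inputs.dot(W) + c)                  # representation for the next layer
# a top-down supervised fine-tuning pass (e.g. an MLP initialized with these
# weights) would follow as the final refinement step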
# ## Overview of Theano
#
# * [Theano](http://deeplearning.net/software/theano/):
#     * python Open Source library (BSD-like permissive license, cross-platform)
#     * Active project maintained by many people, including members of Yoshua Bengio's [LISA](http://www.iro.umontreal.ca/rubrique.php3?id_rubrique=27) group at U. of Montreal
#
# * Main purpose of Theano (from their [website](http://deeplearning.net/software/theano/)):
#
# > Theano is a Python library that allows you to define, optimize, and
# > evaluate mathematical expressions involving multi-dimensional
# > arrays efficiently. Theano features:
# >
# > * **tight integration with numpy** -- Use `numpy.ndarray` in Theano-compiled functions.
# > * **transparent use of a GPU** -- Perform data-intensive calculations up to 140x faster than with CPU. (float32 only)
# > * **efficient symbolic differentiation** -- Theano does your derivatives for functions with one or many inputs.
# > * **speed and stability optimizations** -- Get the right answer for ``log(1+x)`` even when ``x`` is really tiny.
# > * **dynamic C code generation** -- Evaluate expressions faster.
# > * **extensive unit-testing and self-verification** -- Detect and diagnose many types of mistakes.
#
# * Sort of a mix between [Numpy](http://numpy.scipy.org/) (Python's Matlab) and [Sympy](http://sympy.org/en/index.html) (Python's Mathematica) focusing on tensor operations
#
# > * *execution speed optimizations*: Theano can use `g++` or `nvcc` to compile
# > parts of your expression graph into CPU or GPU instructions, which run
# > much faster than pure Python.
# >
# > * *symbolic differentiation*: Theano can automatically build symbolic graphs
# > for computing gradients.
# >
# > * *stability optimizations*: Theano can recognize [some] numerically unstable
# > expressions and compute them with more stable algorithms.
#
# * Ambitious goals (and they're making progress!)
#
# > * Support tensor and sparse operations
# > * Support linear algebra operations
# > * Graph Transformations
# > * Differentiation/higher order differentiation
# > * 'R' and 'L' differential operators
# > * Speed/memory optimizations
# > * Numerical stability optimizations
# > * Can use many compiled languages, instruction sets: C/C++, CUDA, OpenCL, PTX, CAL, AVX, ...
# > * Lazy evaluation
# > * Loop
# > * Parallel execution (SIMD, multi-core, multi-node on cluster,
# > multi-node distributed)
# > * Support all NumPy/basic SciPy functionality
# > * Easy wrapping of library functions in Theano
#
# * Easy to install (just do: ``pip install Theano``)
# * Extensively documented ([docs](http://deeplearning.net/software/theano/library/index.html#libdoc))
# * Great tutorial ([tuto](http://deeplearning.net/software/theano/tutorial/index.html#tutorial))
# * Great [tutorials on deep learning](http://deeplearning.net/tutorial/) written by members of the LISA lab

# Sneak peek
import theano
from theano import tensor

# declare two symbolic floating-point scalars
a = tensor.dscalar()
b = tensor.dscalar()
# create a simple expression
c = a + b
# convert the expression into a callable object that takes (a, b)
# values as input and computes a value for c
f = theano.function([a, b], c)
# bind 1.5 to 'a', 2.5 to 'b', and evaluate 'c'
assert 4.0 == f(1.5, 2.5)

# Logistic Regression simple example
import numpy
import theano
import theano.tensor as T

rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats), name="w")
b = theano.shared(0., name="b")
#print "Initial model:"
#print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))            # Probability that target = 1
prediction = p_1 > 0.5                             # The prediction thresholded
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)  # Cross-entropy loss function
cost = xent.mean() + 0.01 * (w ** 2).sum()         # The cost to minimize
gw, gb = T.grad(cost, [w, b])                      # Compute the gradient of the cost

# Compile
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates={w: w - 0.1 * gw, b: b - 0.1 * gb})
predict = theano.function(inputs=[x], outputs=prediction)

# Train
for i in range(training_steps):
    pred, err = train(D[0], D[1])

# To display the model, uncomment below:
#print "Final model:"
#print w.get_value(), b.get_value()
#print "target values for D:", D[1]
#print "prediction on D:", predict(D[0])
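# The Logistic Regression example above already relies on `T.grad`; as an even
# smaller illustration of the "efficient symbolic differentiation" feature
# quoted earlier (this mirrors the "Derivatives" part of the official Theano
# tutorial), one can build and inspect the gradient graph of a toy expression:

# Symbolic differentiation in a nutshell
import theano
from theano import tensor, pp

x = tensor.dscalar('x')
y = x ** 2
gy = tensor.grad(y, x)             # symbolic gradient dy/dx
print pp(gy)                       # pretty-print the (not yet optimized) gradient graph
fgrad = theano.function([x], gy)   # compiled function; the graph simplifies to 2 * x
assert fgrad(4.0) == 8.0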
# ## Deep Learning in Theano
#
# Many implementations are already available and documented in the [deep learning tutorials](http://deeplearning.net/tutorial/) (cf. [the code on github](https://github.com/lisa-lab/DeepLearningTutorials))
#
# ### Supervised learning
#
# * [Introduction](http://deeplearning.net/tutorial/gettingstarted.html): notations, datasets, extensive primer on supervised deep learning
# * [Logistic Regression (LR) + application on MNIST](http://deeplearning.net/tutorial/logreg.html) (implemented both with [conjugate gradient](https://github.com/lisa-lab/DeepLearningTutorials/blob/master/code/logistic_cg.py) and with [SGD](https://github.com/lisa-lab/DeepLearningTutorials/blob/master/code/logistic_sgd.py)): often used as the last layer of classification architectures (a tiny NumPy version of this prediction rule is sketched after this list)
#
# > $$
# > P(Y=i|x, W,b) = \mathrm{softmax}_i(W x + b)
# > = \frac {e^{W_i x + b_i}} {\sum_j e^{W_j x + b_j}}
# > $$
#
# > $$y_{pred} = {\rm argmax}_i P(Y=i|x,W,b)$$
#
# * [Multilayer Perceptron (MLP)](http://deeplearning.net/tutorial/mlp.html): going from LR to MLP, training with SGD + tricks of the trade (activation functions, regularization, initialization, learning rate, ...)
#
# * [Deep Convolutional Neural Networks (CNN)](http://deeplearning.net/tutorial/lenet.html): variants of MLP with local connectivity (to mimic biological receptive fields), weight sharing (detect features regardless of position + reduces the number of parameters / capacity control + still trainable with SGD), and many layers, i.e. a succession of convolution, pooling and subsampling operations; example CNN used: LeNet (cf. ["Gradient-based learning applied to document recognition", LeCun *et al.*, 1998](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf))
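# As referenced above, a minimal NumPy illustration of the softmax prediction
# rule of the LR layer (toy random parameters, not the tutorials' Theano code):

# Softmax classification rule (illustration only)
import numpy as np


def softmax(a):
    e = np.exp(a - a.max())        # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.RandomState(0)
W, b = rng.randn(10, 784), rng.randn(10)   # 10 classes, 784 input dims (MNIST-like)
x = rng.randn(784)
p = softmax(np.dot(W, x) + b)              # P(Y=i | x, W, b) for each class i
y_pred = np.argmax(p)                      # predicted class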
# ### Unsupervised learning
#
# * [Auto-Encoders](http://deeplearning.net/tutorial/dA.html#daa): build a latent representation (*coding*), mapped back (*decoding*) into a reconstruction of the input by minimizing the reconstruction error; application to image denoising
#
# * [Stacked Denoising Auto-Encoders](http://deeplearning.net/tutorial/SdA.html#sda): unsupervised pre-training one layer at a time for deep nets; final stage: task-specific fine tuning using a supervised MLP
#
# * [Restricted Boltzmann Machines (RBM)](http://deeplearning.net/tutorial/rbm.html#rbm):
#     * single-layer generative Energy-Based Model (EBM, cf. the EBLearn overview below)
#     * energy: $E(v, h) = -b' v - c'h - h'Wv$, where $v$ denotes the observed (visible) input units, $h$ the hidden units, $W$ the connection weights, and $b, c$ the offset parameters
#     * a particular form of log-linear Markov Random Field (MRF), i.e. one for which the energy function is linear in its free parameters
#     * learned by SGD with MCMC sampling to approximate the gradient
#
# * [Deep Belief Networks (DBN)](http://deeplearning.net/tutorial/DBN.html#dbn): stacked RBMs, trained in a greedy layer-wise manner
#
# ## Overview of EBLearn
#
# * [EBLearn](http://eblearn.cs.nyu.edu:21991/doku.php):
#     * well-architected C++ library for Energy-Based Learning, maintained by Yann LeCun's group @NYU (in particular [Pierre Sermanet](http://cs.nyu.edu/~sermanet/))
#     * used in robotics (cf. the [Learning Applied to Ground Robots (LAGR)](http://www.cs.nyu.edu/~yann/research/lagr/) DARPA project)
#     * particular focus on supervised training of convolutional neural networks (especially LeNet)
#
# * Energy-Based learning (overview from [here](http://deeplearning.net/tutorial/rbm.html)):
#     * associate a scalar energy to each configuration of the variables of interest
#     * learning corresponds to modifying that energy function so that its shape has desirable properties:
#         * plausible configurations $\Rightarrow$ low energy
#         * bad configurations $\Rightarrow$ high energy
#     * probability distribution: $Pr(x) = e^{-E(x)} / Z$, where $Z=\sum_x e^{-E(x)}$ is the *partition function*
#     * model learnt by (stochastic) gradient descent on the empirical negative log-likelihood of the training data
#
# $$
# l(\theta, \mathcal{D}) = - \frac{1}{N} \sum_{x^{(i)} \in \mathcal{D}} \log Pr(x^{(i)})
# $$
#
#     * with latent variables $h$, the model becomes
#
# $$
# Pr(x) = \sum_h Pr(x, h) = \sum_h \frac{e^{-E(x,h)}}{Z} = \frac{e^{-\mathcal{F}(x)}}{Z}
# $$
#
# where $\mathcal{F}(x) = - \log \sum_h e^{-E(x, h)}$ is called the **free energy** and $Z = \sum_x e^{-\mathcal{F}(x)}$
#
#     * in this case, the gradient of the data's negative log-likelihood decomposes over a *positive* and a *negative* phase:
#
# $$
# - \frac{\partial \log p(x)}{\partial \theta}
# = \frac{\partial \mathcal{F}(x)}{\partial \theta} -
# \sum_{\tilde{x}} p(\tilde{x})
# \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}
# $$
#
#     * an approximate estimate of the *negative* phase is obtained by averaging over *negative particles* $\mathcal{N}$ sampled using MCMC
#
# $$
# - \frac{\partial \log p(x)}{\partial \theta}
# \approx \frac{\partial \mathcal{F}(x)}{\partial \theta} -
# \frac{1}{|\mathcal{N}|}\sum_{\tilde{x} \in \mathcal{N}}
# \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}
# $$
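# For a binary RBM the free energy defined above has a simple closed form,
# $\mathcal{F}(v) = -b'v - \sum_j \log(1 + e^{c_j + (Wv)_j})$ (this is what the
# Theano RBM tutorial implements). Here is the same quantity in plain NumPy,
# with toy sizes and random parameters, just to fix the notation:

# Free energy of a binary RBM (illustration only)
import numpy as np

rng = np.random.RandomState(0)
n_visible, n_hidden = 6, 4
W = rng.randn(n_hidden, n_visible)             # connection weights
b = rng.randn(n_visible)                       # visible offsets
c = rng.randn(n_hidden)                        # hidden offsets
v = (rng.rand(n_visible) > 0.5).astype(float)  # one binary visible configuration

free_energy = -b.dot(v) - np.sum(np.log1p(np.exp(c + W.dot(v))))
# Pr(v) is proportional to exp(-free_energy); the intractable part is the
# normalization Z, hence the MCMC approximation of the negative phase above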
# ### EBLearn usage
#
# * Starting point: [paper](http://cs.nyu.edu/~koray/publis/sermanet-ictai-09.pdf), [tutorials](http://eblearn.cs.nyu.edu:21991/doku.php?id=start), [demos](http://eblearn.cs.nyu.edu:21991/doku.php?id=all_demos), and [documentation](http://eblearn.cs.nyu.edu:21991/doku.php?id=all_docs)
#
# * Installation: using CMake, cross-platform (even Android and iOS!), only a few dependencies
#
# * Beginner tutorial ([part 1](http://eblearn.cs.nyu.edu:21991/doku.php?id=beginner_tutorial1_dscompile), [part 2](http://eblearn.cs.nyu.edu:21991/doku.php?id=beginner_tutorial2_train)):
#     * build a LeNet5 digit recognizer on MNIST
#     * no C++ coding required, just write EBLearn configuration files and use EBLearn's command-line tools (e.g. `train`)
#
# * A set of more advanced tutorials (cf. [here](http://eblearn.cs.nyu.edu:21991/doku.php?id=start)) describes how to use EBLearn's C++ API, resting on its powerful tensor library `libidx` ([tuto](http://eblearn.cs.nyu.edu:21991/doku.php?id=libidx)) and its energy-based learning library `libeblearn` ([tuto](http://eblearn.sourceforge.net/old/tutorials/libeblearn/index.shtml))
#
# * [Face demo](http://eblearn.cs.nyu.edu:21991/doku.php?id=face_detector) on the "Labeled Faces in the Wild" dataset: just type `make detect` :-)