Neural Networks
===============

We look at neural networks from the perspective of applied mathematics.

Definitions
-----------

.. topic:: Definition of a neuron.

   A *neuron* is a nonlinear bounded function

   .. math::

      y = f(x_1, x_2, \ldots, x_n; w_1, w_2, \ldots, w_p)

   where

   * :math:`x_i` are the variables, or inputs, :math:`i=1,2,\ldots, n`, and

   * :math:`w_j` are the parameters, or weights, :math:`j = 1,2, \ldots,p`.

:numref:`figNeuron` shows a schematic example of a :index:`neuron`.

.. _figNeuron:

.. figure:: ./figNeuron.png
   :align: center

   A schematic of a neuron with three inputs and one output.

Two concrete examples of the function ``f`` are

.. math::

   y = \tanh \left( \sum_{i=1}^n w_i x_i + w_{n+1} \right)

and

.. math::

   y = \exp \left( - \sum_{i=1}^n (x_i - w_i)^2 \Big/ (2 w^2_{n+1}) \right).

.. topic:: Definition of a neural network.

   A *neural network* is the composition of the nonlinear functions
   of two or more neurons.

An example of a :index:`neural network` is defined as

.. math::

   y = g({\bf x}, {\bf w})
     = \sum_{i=1}^{N_c}
       \left( w_{N_c+1, i}
       \tanh \left( \sum_{j=1}^n w_{i,j} x_j + w_{i,n+1} \right) \right)
       + w_{N_c+1, n+1}

where

* :math:`\bf x` is the input vector, of :math:`n` inputs, or variables, and

* :math:`\bf w` is the vector of :math:`(n+1)N_c + N_c + 1` parameters,
  or weights.

The network shown in :numref:`figNeuralNetwork`

* has three inputs with one bias :math:`x_0 = 1`,

* has three hidden neurons, and

* has one linear output neuron.

.. _figNeuralNetwork:

.. figure:: ./figNeuralNetwork.png
   :align: center

   An example of a neural network.

.. topic:: Definition of training of a neural network.

   The *training of a neural network* is the algorithmic procedure

   * whereby the parameters of the neurons of the network are estimated,

   * in order for the network to fulfill the tasks it is assigned to do.

.. index:: training of a neural network

.. index:: supervised training, unsupervised training

Two categories of training are considered:

1. *supervised* training: we know the nonlinear function analytically,
   or numerical values of the function are known.

2. *unsupervised* training: no target values of the function are provided.

The training of a neural network can be viewed
as a function approximation problem,
where we keep in mind the following two properties.

1. Nonlinear in their parameters,
   neural networks are universal approximators.

.. topic:: Proposition of the existence proof.

   Any bounded, sufficiently regular function can be approximated
   uniformly, with arbitrary accuracy in a finite region of variable space,
   by a neural network with a single layer of hidden neurons,
   having the same activation function, and a linear output neuron.

2. The most *parsimonious* model has the smallest number of parameters.

An Artificial Neural Network
----------------------------

The training of a neural network will be illustrated
via the processing of a collection of labeled points,
shown in :numref:`figDataLabeled`.
We could interpret the labels of the points as the outcome of a probe for oil.
With the neural network we would like to predict the size of the oil field,
or the location where to drill next.

.. _figDataLabeled:

.. figure:: ./figDataLabeled.png
   :align: center

   A collection of labeled points.

The coordinates of the points are given in two vectors:

::

   x1 = [0.1,0.3,0.1,0.6,0.4,0.6,0.5,0.9,0.4,0.7]
   x2 = [0.1,0.4,0.5,0.9,0.2,0.3,0.6,0.2,0.4,0.6]

and their labels are defined as

::

   y = [ones(1,5) zeros(1,5); zeros(1,5) ones(1,5)]
   2x10 Matrix{Float64}:
    1.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0
    0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0  1.0

The problem can then be formulated as follows.
Find a function :math:`F` such that :math:`y = F(z_1, z_2)`
is the label of any point :math:`(z_1, z_2)`.

To solve this problem, we use sigmoid functions.
The :index:`sigmoid` function is

.. math::

   \sigma(x) = \frac{1}{1 + e^{-x}}

and it can be viewed as a smoothed version of a step function,
shown in :numref:`figSigmoid1`.

.. _figSigmoid1:

.. figure:: ./figSigmoid1.png
   :align: center

   The sigmoid function :math:`\sigma(x) = 1/(1 + e^{-x})`.

A scaled and shifted sigmoid function :math:`\sigma(3(x-5))`
is shown in :numref:`figSigmoid2`.

.. _figSigmoid2:

.. figure:: ./figSigmoid2.png
   :align: center

   The scaled and shifted sigmoid :math:`\sigma(3(x-5))`.

A Julia function for :math:`\sigma(W x + b)` is listed below:

::

   """
       function activate(x,W,b)

   evaluates the sigmoid function at W*x + b, for the input vector x,
   with weight matrix W and bias vector b.
   """
   function activate(x,W,b)
       dim = size(W,1)
       y = zeros(dim)
       argexp = -(W*x + b)
       for i=1:dim
           y[i] = 1.0/(1.0 + exp(argexp[i]))
       end
       return y
   end

The dimensions of the variables ``x``, ``W``, and ``b`` are as follows:
if :math:`\mbox{x} \in {\mathbb R}^n`,
:math:`\mbox{W} \in {\mathbb R}^{m \times n}`,
:math:`\mbox{b} \in {\mathbb R}^m`,
then :math:`\mbox{y} \in {\mathbb R}^m`.

The Julia function ``F`` implements a neural network as

::

   function F(x, W2, W3, W4, b2, b3, b4)
       a2 = activate(x, W2, b2)
       a3 = activate(a2, W3, b3)
       a4 = activate(a3, W4, b4)
       return a4
   end

and it computes

.. math::

   \begin{array}{rcl}
      a_2 & = & \sigma(\mbox{W2*} \mbox{x} + \mbox{b2}) \\
      a_3 & = & \sigma(\mbox{W3*} a_2 + \mbox{b3}) \\
      a_4 & = & \sigma(\mbox{W4*} a_3 + \mbox{b4})
   \end{array}

which can be represented as in :numref:`figNetworkLayers`.

.. _figNetworkLayers:

.. figure:: ./figNetworkLayers.png
   :align: center

   The layers in an artificial neural network.

.. index:: weight vector, bias vector

The weights and bias vectors in this neural network define the function

.. math::

   F(x) = \sigma(~\!\mbox{W4 }\! \sigma(~\!\mbox{W3 }\!
          \sigma(~\!\mbox{W2 }\!\mbox{x} + \mbox{b2}) + \mbox{b3}) + \mbox{b4})

which produces labels for each point.
The cost function is

.. math::

   \mbox{Cost}\left( \vphantom{\frac{1}{2}}
   \mbox{W2}, \mbox{W3}, \mbox{W4}, \mbox{b2}, \mbox{b3}, \mbox{b4} \right)
   = \frac{1}{10} \sum_{i=1}^{10} \frac{1}{2}
     \Big\| y\left(x_1^{(i)}, x_2^{(i)}\right)
          - F\left(x_1^{(i)}, x_2^{(i)}\right) \Big\|_2^2.

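
Evaluating :math:`F` or the cost requires values for the weight matrices
and bias vectors, which the code above takes as given.
A minimal sketch of one way to set them up is shown next.
The layer widths 2, 2, 3, 2, the standard normal random entries, the seed,
and the names ``sigmoid_layer`` and ``random_parameters``
are all illustrative assumptions, not taken from the text.

::

   using Random

   # Compact stand-in for activate(x,W,b): the sigmoid applied
   # componentwise to W*x + b (hypothetical helper).
   sigmoid_layer(x, W, b) = 1.0 ./ (1.0 .+ exp.(-(W*x .+ b)))

   """
       function random_parameters(sizes)

   returns weight matrices and bias vectors with standard normal entries
   for a network whose layer widths are listed in sizes.
   """
   function random_parameters(sizes)
       W = [randn(sizes[k+1], sizes[k]) for k=1:length(sizes)-1]
       b = [randn(sizes[k+1]) for k=1:length(sizes)-1]
       return W, b
   end

   Random.seed!(2024)     # fixed seed, only for reproducibility
   sizes = [2, 2, 3, 2]   # assumed widths: input, two hidden layers, output
   W, b = random_parameters(sizes)
   W2, W3, W4 = W
   b2, b3, b4 = b

   x = [0.1, 0.1]         # the first labeled point
   a2 = sigmoid_layer(x, W2, b2)
   a3 = sigmoid_layer(a2, W3, b3)
   a4 = sigmoid_layer(a3, W4, b4)
   println(size(W2), " ", size(W3), " ", size(W4))
   println(a4)

Each matrix has as many rows as the layer it feeds into
and as many columns as the layer it reads from,
so the products in :math:`F` are well defined for any choice of widths.
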
The first exercise asks to count the number of parameters
in the neural network defined by the function :math:`F`.

The problem of constructing the neural network is now reduced
to minimizing the cost function.
For this optimization problem,
we consider Taylor series and the gradient vector.
Assume the current parameter vector
is :math:`p = (p_1, p_2, \ldots, p_N)`.
Consider the Taylor series of the cost function at :math:`p`:

.. math::

   \mbox{Cost}(p + \Delta p) = \mbox{Cost}(p)
   + \sum_{i=1}^N
     \left( \frac{\partial \mbox{Cost}}{\partial p_i}(p) \right) \Delta p_i
   + O(\| \Delta p \|^2).

Denoting the gradient vector

.. math::

   \nabla \mbox{Cost}(p) =
   \left(
      \frac{\partial \mbox{Cost}}{\partial p_1}(p) ,
      \frac{\partial \mbox{Cost}}{\partial p_2}(p) ,
      \ldots ,
      \frac{\partial \mbox{Cost}}{\partial p_N}(p)
   \right)

and ignoring the :math:`O(\| \Delta p \|^2)` term:

.. math::

   \mbox{Cost}(p + \Delta p) \approx \mbox{Cost}(p)
   + \left( \vphantom{\frac{1}{2}} \nabla \mbox{Cost}(p) \right)^T \Delta p.

Minimizing the cost function is the objective.
In an iterative method we replace :math:`p` by :math:`p + \Delta p`,
choosing :math:`\Delta p` that makes
:math:`\displaystyle \left( \vphantom{\frac{1}{2}} \nabla \mbox{Cost}(p) \right)^T \Delta p`
as negative as possible.
By application of the Cauchy-Schwarz inequality:

.. math::

   \left|
   \left( \vphantom{\frac{1}{2}} \nabla \mbox{Cost}(p) \right)^T \Delta p ~
   \right|
   \leq
   \left\| \vphantom{\frac{1}{2}} \nabla \mbox{Cost}(p) \right\|_2
   \| \Delta p \|_2.

.. index:: learning rate, stochastic gradient method

The lower bound
:math:`-\| \nabla \mbox{Cost}(p) \|_2 \| \Delta p \|_2`
is attained when :math:`\Delta p` is a positive multiple
of :math:`-\nabla \mbox{Cost}(p)`.
Therefore, choose :math:`\Delta p`
in the direction of :math:`-\nabla \mbox{Cost}(p)`.
Update :math:`p` as :math:`p = p - \eta \nabla \mbox{Cost}(p)`,
where the step size :math:`\eta` is called the *learning rate*.

When there are many parameters and many training points,
computing the full gradient vector at each step is too costly.
Instead, take one single, *randomly chosen* training point,
and evaluate the gradient at that point.
This gives rise to the *stochastic gradient method*.

To run the stochastic gradient method,
we need to evaluate the derivatives efficiently.
Let :math:`a^{(1)} = x`, then the network returns :math:`a^{(L)}`, where

.. math::

   \begin{array}{rcl}
      z^{(\ell)} & = & W^{(\ell)} a^{(\ell-1)} + b^{(\ell)} \\
      a^{(\ell)} & = & {\displaystyle \sigma \left( z^{(\ell)} \right)},
      \quad \mbox{for} ~\ell = 2, 3, \ldots, L.
   \end{array}

Let :math:`C` be the cost at the chosen training point, and define
:math:`\displaystyle \delta_j^{(\ell)} = \frac{\partial C}{\partial z_j^{(\ell)}}`
as the error of neuron :math:`j` at layer :math:`\ell`.
By the chain rule, we have
(with :math:`\circ` as the componentwise product):

.. math::

   \begin{array}{rcl}
      \delta^{(L)} & = & \sigma'(z^{(L)}) \circ (a^{(L)} - y) \\
      \delta^{(\ell)} & = & \sigma'(z^{(\ell)})
         \circ (W^{(\ell+1)})^T \delta^{(\ell+1)} \\
      {\displaystyle \frac{\partial C}{\partial b_j^{(\ell)}}}
      & = & \delta_j^{(\ell)}
      \quad \mbox{and} \quad
      {\displaystyle \frac{\partial C}{\partial w_{j,k}^{(\ell)}}}
      ~~ = ~~ \delta_j^{(\ell)} a_k^{(\ell - 1)}.
   \end{array}

Sigmoids have convenient derivatives:
:math:`\sigma'(x) = \sigma(x) ( 1 - \sigma(x))`.

The formulas lead to the following algorithm:

1. The forward pass evaluates
   :math:`a^{(1)}, z^{(2)}, a^{(2)}, z^{(3)}, \ldots, a^{(L)}`.

2. The backward pass evaluates
   :math:`\delta^{(L)}, \delta^{(L-1)}, \ldots, \delta^{(2)}`.

This way of computing gradients is *back propagation*.

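
The back propagation formulas can be checked numerically
against a central finite difference of the cost at one training point.
The sketch below does this for a smaller network with one hidden layer
(:math:`L = 3`), to keep it short.
The weights, the step ``h``, and the names ``sigma``, ``pointcost``,
and ``backprop`` are hypothetical choices made for this illustration.

::

   # Sigmoid applied componentwise (hypothetical helper).
   sigma(z) = 1.0 ./ (1.0 .+ exp.(-z))

   # Cost at one training point (x, y):
   # half the squared 2-norm of the residual.
   function pointcost(x, y, W2, W3, b2, b3)
       a2 = sigma(W2*x .+ b2)
       a3 = sigma(W3*a2 .+ b3)
       return 0.5*sum((a3 - y).^2)
   end

   # Gradients of pointcost with respect to W2, W3, b2, b3,
   # computed with a forward pass followed by a backward pass.
   function backprop(x, y, W2, W3, b2, b3)
       a2 = sigma(W2*x .+ b2)
       a3 = sigma(W3*a2 .+ b3)
       delta3 = a3 .* (1 .- a3) .* (a3 - y)
       delta2 = a2 .* (1 .- a2) .* (W3'*delta3)
       return delta2*x', delta3*a2', delta2, delta3
   end

   # Arbitrary fixed weights for a 2-3-2 network (assumed values).
   W2 = [0.1 -0.2; 0.3 0.4; -0.5 0.6]
   b2 = [0.1, -0.1, 0.2]
   W3 = [0.2 -0.3 0.5; -0.4 0.1 0.3]
   b3 = [0.05, -0.05]
   x = [0.1, 0.1]
   y = [1.0, 0.0]

   gW2, gW3, gb2, gb3 = backprop(x, y, W2, W3, b2, b3)

   # Central difference for the derivative with respect to W2[2,1].
   h = 1.0e-6
   E = zeros(size(W2))
   E[2,1] = h
   fd = (pointcost(x, y, W2 + E, W3, b2, b3)
       - pointcost(x, y, W2 - E, W3, b2, b3))/(2h)
   println(abs(fd - gW2[2,1]))

The printed difference should be close to the rounding level of the
finite difference, which indicates that the outer products
``delta2*x'`` and ``delta3*a2'`` hold the partial derivatives
with respect to the weights.
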
The Julia code to train the network is listed below.
The weight matrices and bias vectors must hold initial values,
for instance random numbers, before the loop starts.
In a script, as opposed to an interactive session,
the loop must be wrapped in a function or the updated variables
must be declared ``global``.

::

   eta = 0.05         # learning rate
   Niter = 1000000    # number of stochastic gradient steps
   for counter = 1:Niter
       k = rand(1:10)  # index of a randomly chosen training point
       x = [x1[k], x2[k]]
       # Forward pass
       a2 = activate(x,W2,b2)
       a3 = activate(a2,W3,b3)
       a4 = activate(a3,W4,b4)
       # Backward pass
       delta4 = a4.*(ones(length(a4))-a4).*(a4-y[:,k])
       delta3 = a3.*(ones(length(a3))-a3).*(W4'*delta4)
       delta2 = a2.*(ones(length(a2))-a2).*(W3'*delta3)
       # Gradient step
       W2 = W2 - eta*delta2*x'
       W3 = W3 - eta*delta3*a2'
       W4 = W4 - eta*delta4*a3'
       b2 = b2 - eta*delta2
       b3 = b3 - eta*delta3
       b4 = b4 - eta*delta4
   end

The output of

::

   Z = [[x1[k], x2[k]] for k=1:10]
   Y = [F(z, W2, W3, W4, b2, b3, b4) for z in Z]

is

::

   10-element Vector{Vector{Float64}}:
    [0.9914991960849379, 0.008456961045118156]
    [0.9938902272050594, 0.006075750247468293]
    [0.998360435186392, 0.001628134278139295]
    [0.9865419031464729, 0.01338759273828821]
    [0.9737963089611226, 0.02615530711128986]
    [0.013028283269915059, 0.9869736384774328]
    [0.016354675508034607, 0.9837066212812329]
    [0.0005627461156236613, 0.9994383300176262]
    [0.02106910309704885, 0.979014892476464]
    [0.0012977070814389772, 0.9987047112752311]

The result of evaluating :math:`F(x_1, x_2)`
over :math:`[0,1] \times [0,1]`
is shown in :numref:`figEvaluatedResults`.

.. _figEvaluatedResults:

.. figure:: ./figEvaluatedResults.png
   :align: center

   Evaluating the trained neural network
   over :math:`[0,1] \times [0,1]`.

The white and black dots in :numref:`figEvaluatedResults`
are the ten given points.

Exercises
---------

1. How many parameters are in the neural network
   defined by the function :math:`F(x)`?  Justify your answer.

2. Examine the computational cost to evaluate
   :math:`\nabla \mbox{Cost}(p)` of our
   :math:`\displaystyle F(x) = \sigma(~\!\mbox{W4 }\! \sigma(~\!\mbox{W3 }\! \sigma(~\!\mbox{W2 }\!\mbox{x} + \mbox{b2}) + \mbox{b3}) + \mbox{b4})`.

3. How fast did the stochastic gradient method converge?

   1. Write a function to evaluate :math:`\mbox{Cost}(p)`.

   2. Call the function in the code to train the network
      and store the value of the cost function at each step.

   3. Make a plot of the cost over all steps in the code.

Bibliography
------------

1. Gerard Dreyfus:
   *Neural Networks. Methodology and Applications.*
   Springer-Verlag 2005.

2. Catherine F. Higham and Desmond J. Higham:
   **Deep Learning: An Introduction for Applied Mathematicians.**
   *SIAM Review*, Vol. 61, No. 4, pages 860-891, 2019.