Introduction to Deep Learning
=============================

We redo the example of an artificial neural network from the last lecture,
but now with the Julia package Flux.
As an illustration of a larger example,
we consider the problem of recognizing handwritten numbers.

Getting Started with Flux
-------------------------

A workable definition of machine learning, also called
*learning from experience*, is that a program learns from experience *E*
with respect to some tasks *T* and performance measure *P*,
if its performance at the tasks in *T*, as measured by *P*,
improves with experience *E*.

.. index:: deep neural network

.. topic:: Definition of a deep neural network.

   A *deep neural network* is a neural network with multiple layers.

The illustration in :numref:`figDeepNeuralNetwork` is taken from
*Deep Learning Notes using Julia with Flux*,
by Hugh Murrel and Nando de Freitas, available at , 2019.

.. _figDeepNeuralNetwork:

.. figure:: ./figDeepNeuralNetwork.png
   :align: center

   A deep neural network.

:index:`Flux` is a library for machine learning geared towards
high-performance production pipelines, written entirely in Julia.
Our first example comes from the quickstart documentation of Flux.
Let us go through the example step by step:

1. define the task
2. get training data
3. define and initialize the model
4. define the loss function
5. set the optimizer
6. train the model

step 1: *the task*

Consider a weight matrix :math:`W_{\mbox{true}}`
and bias vector :math:`b_{\mbox{true}}`:

.. math::

   W_{\mbox{true}} = \left[ \begin{array}{ccccc}
      1 & 2 & 3 & 4 & 5 \\
      5 & 4 & 3 & 2 & 1
   \end{array} \right], \quad
   b_{\mbox{true}} = \left[ \begin{array}{c} -1 \\ -2 \end{array} \right].

:math:`W_{\mbox{true}}` and :math:`b_{\mbox{true}}`
are parameters in the function

.. math::

   F_{\mbox{true}}(x) = W_{\mbox{true}} x + b_{\mbox{true}}

which

* takes as input :math:`x`, a vector of five numbers, and
* returns :math:`y = F_{\mbox{true}}(x)`, a vector of two numbers.

The task is then to recover :math:`W_{\mbox{true}}` and :math:`b_{\mbox{true}}`
from the observed data :math:`(x_{\mbox{train}}, y_{\mbox{train}})`.

step 2: *training data*

Let :math:`N` be the number of vectors in the training data
:math:`x_{\mbox{train}}` and :math:`y_{\mbox{train}}`.  Then,

* the *i*-th vector of :math:`x_{\mbox{train}}` is :math:`x_i`, and
* the *i*-th vector of :math:`y_{\mbox{train}}` is :math:`y_i`,

with

.. math::

   x_i = \left[ \begin{array}{c}
      5 + 5 u_{1,i} \\ 5 + 5 u_{2,i} \\ 5 + 5 u_{3,i} \\
      5 + 5 u_{4,i} \\ 5 + 5 u_{5,i}
   \end{array} \right], \quad
   z_i = F_{\mbox{true}}(x_i), \quad
   y_i = \left[ \begin{array}{c}
      z_{1,i} + 0.2 v_{1,i} \\ z_{2,i} + 0.2 v_{2,i}
   \end{array} \right],

where :math:`i` runs from 1 to :math:`N`, and

* :math:`u_{j,i}` are chosen at random, uniformly from :math:`[0,1]`, and
* :math:`v_{j,i}` are random, standard normally distributed numbers.

step 3: *defining and initializing the model*

The model is

.. math::

   M(x) = W x + b, \quad W \in {\mathbb R}^{2 \times 5}, \quad
   b \in {\mathbb R}^{2}.

We initialize :math:`M(x)` with

1. a random 2-by-5 matrix for :math:`W`, and
2. a random 2-vector for :math:`b`.

step 4: *the loss function*

The performance is measured by a loss function:

.. math::

   \mbox{loss}(x, y) = \sum_{i=1}^2
   \left( \vphantom{\frac{1}{2}} y_i - \widehat{y}_i \right)^2, \quad
   y = F_{\mbox{true}}(x), \quad \widehat{y} = M(x).

step 5: *set the optimizer*

We choose the classic :index:`gradient descent` as the optimizer,
with learning rate :math:`\eta = 0.01`.
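Before training, it helps to see the first five steps together in code.
The sketch below is one possible setup, not part of the Flux documentation:
the names ``x_train``, ``y_train``, ``W``, ``b``, ``loss``, and ``opt``
are chosen to match the training code in step 6 below,
and the value of ``N`` is the one questioned in the first exercise.

::

   using Flux

   # step 1: the parameters that define the task
   W_true = [1.0 2.0 3.0 4.0 5.0;
             5.0 4.0 3.0 2.0 1.0]
   b_true = [-1.0; -2.0]
   F_true(x) = W_true*x + b_true

   # step 2: N noisy observations of F_true
   N = 10_000
   x_train = [5 .+ 5*rand(5) for _ in 1:N]
   y_train = [F_true(x) + 0.2*randn(2) for x in x_train]

   # step 3: the model, with randomly initialized parameters
   W = rand(2, 5)
   b = rand(2)
   M(x) = W*x + b

   # step 4: the sum of squared differences as the loss
   loss(x, y) = sum((y - M(x)).^2)

   # step 5: classic gradient descent with learning rate 0.01
   opt = Descent(0.01)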
step 6: *train the model*

::

   Flux.train!(loss, params(W, b), train_data, opt)

To train the model the longer way, run the code below:

::

   opt = Descent(0.01)
   train_data = zip(x_train, y_train)
   ps = Flux.params(W, b)
   for (x, y) in train_data
       gs = gradient(ps) do    # gradient of the loss at (x, y)
           loss(x, y)
       end
       Flux.Optimise.update!(opt, ps, gs)   # one gradient descent step
   end

The main benefit of this lower-level way to train the model is that
we can monitor the progress of the loss function,
for example by printing ``loss(x, y)`` inside the loop.

An Artificial Neural Network Revisited
--------------------------------------

.. _figNetworkLayers2:

.. figure:: ./figNetworkLayers2.png
   :align: center

   The layers in an artificial neural network.

The artificial neural network of the last lecture
is shown in :numref:`figNetworkLayers2`.
We will redo the training of this model with Flux.
The weight matrices and the bias vectors are

.. math::

   \mbox{W2} \in {\mathbb R}^{2 \times 2}, \quad
   \mbox{W3} \in {\mathbb R}^{3 \times 2}, \quad
   \mbox{W4} \in {\mathbb R}^{2 \times 3}, \quad
   \mbox{b2} \in {\mathbb R}^{2}, \quad
   \mbox{b3} \in {\mathbb R}^{3}, \quad
   \mbox{b4} \in {\mathbb R}^{2}.

With ``Flux.jl``, we define the model as

::

   L2 = Dense(W2, b2, sigmoid)
   L3 = Dense(W3, b3, sigmoid)
   L4 = Dense(W4, b4, sigmoid)
   M = Chain(L2, L3, L4)

In the code above, ``L2 = Dense(W2, b2, sigmoid)`` makes one layer
with weights ``W2``, bias ``b2``, and the sigmoid activation function.
Multiple layers are collected into one network by ``M = Chain(L2, L3, L4)``.
The output is

::

   Chain(
     Dense(2 => 2, σ),                     # 6 parameters
     Dense(2 => 3, σ),                     # 9 parameters
     Dense(3 => 2, σ),                     # 8 parameters
   )                   # Total: 6 arrays, 23 parameters, 568 bytes.

Observe that this output solves the first exercise of the last lecture
on the number of parameters in the model.

To train the network, we next define the loss function:

.. math::

   \mbox{loss}\left( \vphantom{\frac{1}{2}}
   \mbox{W2}, \mbox{W3}, \mbox{W4}, \mbox{b2}, \mbox{b3}, \mbox{b4} \right)
   = \frac{1}{10} \sum_{i=1}^{10} \frac{1}{2}
   \Big\| y\left(x_1^{(i)}, x_2^{(i)}\right)
        - M\left(x_1^{(i)}, x_2^{(i)}\right) \Big\|_2^2,

where :math:`M` is the model, defined in the code below:

::

   function loss(x, y)
       result = 0.0
       for i=1:10
           point = [x1[i]; x2[i]]
           yy = M(point)
           result = result + 0.5*((yy[1] - ylabels[1,i])^2
                                + (yy[2] - ylabels[2,i])^2)
       end
       return result/10.0
   end

Observe that ``loss`` ignores its arguments ``x`` and ``y``
and always evaluates the model at the ten fixed data points.
For the training data, we take one million points:

::

   N = 1_000_000
   xtrain = [randn(2) for _ in 1:N];
   ytrain = [randn(2) for _ in 1:N];

We define the optimizer and the parameters:

::

   train_data = zip(xtrain, ytrain)
   opt = Descent(0.01)
   ps = Flux.params(M)

Then the training of the model happens via

::

   Flux.train!(loss, ps, train_data, opt)

which applies one million gradient descent updates,
one for each point in ``train_data``.
Evaluating :math:`F(x_1, x_2)` over :math:`[0,1] \times [0,1]`
produces plots similar to those in the previous lecture.

Recognizing Handwritten Numbers
-------------------------------

So far, our neural networks have been small.
In this section we consider a larger problem.
The MNIST database of handwritten digits is a classic
image classification dataset with 70,000 images, available at .
The data is described in the paper
*Gradient-based learning applied to document recognition*
by Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner,
published in the Proceedings of the IEEE, November 1998.

In Julia, we access this data set as follows:

::

   julia> using MLDatasets: MNIST
   julia> MNIST.download()

The preparation of this lecture benefited from the post at .
The first five images are shown in :numref:`figMNISTfirst5images`
and are labeled 5, 0, 4, 1, and 9.

.. _figMNISTfirst5images:

.. figure:: ./figMNISTfirst5images.png
   :align: center

   The first five images in the MNIST database.
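As a quick sanity check, the small sketch below
(assuming the data set has been downloaded as above)
prints the shape of the training data and the first five labels,
using the accessors listed after this code:

::

   using MLDatasets: MNIST

   trainset = MNIST(split=:train)
   println(size(trainset.features))  # (28, 28, 60000): 60,000 training images of 28-by-28 pixels
   println(trainset.targets[1:5])    # [5, 0, 4, 1, 9], the labels of the first five images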
* The *i*-th image is available as ``MNIST(split=:train).features[:,:,i]``.
* Its corresponding label is in ``MNIST(split=:train).targets[i]``.

The training of the network will be described in a step-by-step fashion.

step 1: *representing the labels as vectors*

Consider the statements:

::

   using Flux: onehotbatch
   labels = onehotbatch(MNIST(split=:train).targets, 0:9)

With ``onehotbatch``, each training label is converted to a vector.
As ``MNIST(split=:train).targets[i]`` returns an integer
in the range from 0 to 9,
the vector that corresponds to the label has length 10.
The vector is zero, except for a one at the entry
that corresponds to the value of the label.
For example, the number 5 corresponds to the vector

.. math::

   [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]^T.

step 2: *representing the images as vectors*

The images are 28-by-28 matrices of ``UInt8`` types.
The code below takes the first 100 images,
reshapes each matrix into a vector,
and converts the numbers to ``Float32``.

::

   NBR = 100
   images = Matrix{Float32}(zeros(28*28, NBR))
   for i=1:size(images,2)
       image = MNIST(split=:train).features[:,:,i]
       imagereshaped = reshape(image, :)
       imagenumbers = [Float32(x) for x in imagereshaped]
       for j=1:size(images,1)
           images[j, i] = imagenumbers[j]
       end
   end

step 3: *defining the model*

We define the model as follows:

::

   using Flux
   model = Chain(Dense(28*28, 40, relu),
                 Dense(40, 10),
                 softmax)

where

* ``relu`` stands for the rectified linear unit, and
* ``softmax`` turns the ten outputs into a probability distribution,
  as in logistic regression.

The output of the ``Chain`` statement is

::

   Chain(
     Dense(784 => 40, relu),               # 31_400 parameters
     Dense(40 => 10),                      # 410 parameters
     NNlib.softmax,
   )                  # Total: 4 arrays, 31_810 parameters, 124.508 KiB.

Observe the large number of parameters.
To test whether the definition is valid,
we evaluate the model at the 10-th image: ``model(images[:,10])``.

step 4: *the loss function*

In classifications with multiple classes,
where the labels are given in a one-hot format,
we use the cross entropy as the loss function:

::

   using Flux: crossentropy
   loss(X, y) = crossentropy(model(X), y)

step 5: *select the optimizer*

The optimizer is set to one of the gradient descent methods:

::

   opt = Adam()

``Adam`` (adaptive moment estimation) is a variant of gradient descent
in which every component of the gradient gets its own adaptive
learning rate.

step 6: *training the model*

To monitor the progress during the training,
we define a callback function:

::

   progress = () -> @show(loss(images, labels[:,1:NBR]))

Then the training happens as follows:

::

   using Flux: throttle
   for epoch in 1:100
       Flux.train!(loss, Flux.params(model), [(images, labels[:,1:NBR])],
                   opt, cb = throttle(progress, 10))
   end

The loss is reported every 10 seconds.
One ``epoch`` loops over the data only once.
In the training, we limited the number of epochs to 100.

To verify the model, we evaluate an image in the model,
and then check for the index of the largest value:

::

   M1 = model(images[:,1])
   println("output of the model :\n", M1)
   println("the number is ", argmax(M1)-1)

The first five images are classified correctly.
To check a random image from the training data:

::

   idx = rand(1:NBR, 1)
   Midx = model(images[:,idx[1]])
   println("output of the model :\n", Midx)
   println("the number is ", argmax(Midx)-1)
   println("label : ", MNIST(split=:train).targets[idx])
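To summarize how well the trained network does on all 100 training images,
the correctly classified images can be counted.
The code below is a small sketch, not part of the original lecture,
reusing the names ``model``, ``images``, and ``NBR`` defined above;
verifying images outside the training data is the subject
of the last exercise.

::

   # count the correctly classified images among the first NBR training images
   targets = MNIST(split=:train).targets
   correct = count(i -> argmax(model(images[:, i])) - 1 == targets[i], 1:NBR)
   println("classified correctly : ", correct, " out of ", NBR)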
Proposal of a Project Topic
---------------------------

1. Machine learning with Flux.

   One software that learns from experience is Flux.

   1. Read the software documentation.
   2. Describe how it fits in the computational ecosystem of Julia.
   3. Illustrate the capabilities by a good use case.
      Do not use MNIST, but a similarly good data set.

   What is scientific machine learning?

Exercises
---------

1. For the first example with Flux, make a plot of the loss function
   for all :math:`N` steps.
   Does :math:`N` really have to be 10,000 for sufficient accuracy?

2. In our artificial neural network, do we really need
   one million data points?
   Plot the evolution of the loss function for all steps.

3. In our artificial neural network, what if we use ``rand``
   instead of ``randn`` in the definition of the ``xtrain``
   and ``ytrain`` vectors?

4. After training the neural network on the MNIST data,
   verify a random image that is not in the training data.

Bibliography
------------

1. Catherine F. Higham and Desmond J. Higham:
   **Deep Learning: An Introduction for Applied Mathematicians.**
   *SIAM Review*, Vol. 61, No. 4, pages 860-891, 2019.

2. Michael Innes, Elliot Saba, Keno Fischer, Dhairya Gandhi,
   Marco Concetto Rudilosso, Neethu Mariya Joy, Tejan Karmali,
   Avik Pal, and Viral Shah:
   **Fashionable Modelling with Flux**, , 2018.

3. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner:
   **Gradient-based learning applied to document recognition.**
   *Proceedings of the IEEE*, 86(11):2278-2324, November 1998.

4. Hugh Murrel and Nando de Freitas:
   **Deep Learning Notes using Julia with Flux**, , 2019.