Face Recognition and Ranking Data
=================================

This lecture covers two topics in data science.

Face Recognition
----------------

Face recognition is challenging because of the position of the head,
lighting conditions, and moods and expressions.
Many automated systems have been developed.
A linear algebra approach is based on eigenfaces,
a method proposed by Sirovich and Kirby, 1987.

The input is a *p*-by-*q* grayscale image.
The resolution of an image is :math:`m = p \times q`.
All images of the same resolution live in an *m*-dimensional space.
The subspace of *all facial images* has a low dimension,
*independent of the resolution.*
This result is described in the paper
*Singular Value Decomposition, Eigenfaces, and 3D Reconstructions*
by Neil Muller, Lourenco Magaia, B. M. Herbst,
*SIAM Review*, Vol. 46, No. 3, pages 518-545, 2004.

The :index:`singular value decomposition` of a *p*-by-*q* matrix :math:`A` is

.. math::

   A = U \Sigma V^T, \quad
   U \in {\mathbb R}^{p \times p}, \quad
   \Sigma \in {\mathbb R}^{p \times q}, \quad
   V \in {\mathbb R}^{q \times q},

where :math:`U` and :math:`V` are orthogonal:
:math:`U^{-1} = U^T`, :math:`V^{-1} = V^T`,
and :math:`\Sigma` is a diagonal matrix,
with on its diagonal the *singular values* of the matrix :math:`A`:

.. math::

   \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(p,q)} \geq 0.

If :math:`\mbox{rank}(A) = r`, then :math:`\sigma_i = 0` for all :math:`i > r`.
Ignoring the smallest singular values leads to a dimension reduction.

The geometric interpretation of the singular value decomposition
is illustrated in :numref:`figSVDgeometry`.

.. _figSVDgeometry:

.. figure:: ./figSVDgeometry.png
   :align: center

   Computing :math:`U \Sigma V^T` corresponds to rotating a circle,
   stretching the circle into an ellipse, and then rotating the ellipse.

The singular values and vectors relate to the :index:`eigenvalues`
and :index:`eigenvectors` as follows.
By the singular value decomposition of :math:`A`:

.. math::

   A = U \Sigma V^T \quad \Rightarrow \quad A~\!V = U \Sigma.

This implies

.. math::

   A v_j = \sigma_j u_j, \quad j = 1,2, \ldots, \min(p,q).

The largest singular value measures the magnitude of :math:`A`:

.. math::

   \| A \|_2 = \max_{\|x\|_2 = 1} \| A x \|_2 = \sigma_1.
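As a quick numerical check of these relations,
consider the following small NumPy sketch on a random matrix:

.. code-block:: python

   import numpy as np

   p, q = 5, 3                    # a small p-by-q test matrix
   A = np.random.rand(p, q)

   # full SVD: A = U diag(sigma) V^T
   U, sigma, Vt = np.linalg.svd(A)

   # check A v_j = sigma_j u_j for j = 1, ..., min(p,q)
   # (the rows of Vt are the right singular vectors v_j)
   for j in range(min(p, q)):
       assert np.allclose(A @ Vt[j, :], sigma[j] * U[:, j])

   # the 2-norm of A equals the largest singular value
   assert np.isclose(np.linalg.norm(A, 2), sigma[0])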
In machine learning, the eigenfaces method is known as
Principal Component Analysis (PCA).
The procedure by Neil Muller et al. is summarized below:

* The XM2VTS Face Database of the University of Surrey
  consists of many RGB color images against a blue background.

* The background is removed by color separation.

* Averaging the RGB values, the images were converted to grayscale.

* The training set has 600 images, three for each of 200 individuals.

If the training set is representative,
then all facial images lie in a 150-dimensional subspace.

*What is the typical range of measurements of a face?*
Consider :math:`n` measurements :math:`(x_i, y_i)` for :math:`i=1,2,\ldots,n`, where

1. :math:`x_i` is the width, measured ear-to-ear, and

2. :math:`y_i` is the height, measured from chin to between the eyes.

.. _figEllipseAxes:

.. figure:: ./figEllipseAxes.png
   :align: center

   A coordinate change defined by the axes of an ellipse.

Assume all :math:`n` measurements lie within an ellipse
as shown in :numref:`figEllipseAxes`.
Then, the coordinate change to :math:`u_1` and :math:`u_2`
as in :numref:`figEllipseAxes` is computed
via a singular value decomposition of a shifted matrix.
To find the typical range of the width and height of a face:

1. Compute the mean :math:`(\overline{x}, \overline{y})`
   of all :math:`n` observations.

2. Subtract :math:`\overline{x}` from all :math:`x_i`
   and subtract :math:`\overline{y}` from all :math:`y_i`.

3. Divide the :math:`n` shifted observations by :math:`\sqrt{n}`.

4. Apply the singular value decomposition to the matrix :math:`A`,
   which has in its columns the shifted observations.

The first two columns of :math:`U` give the directions of maximum variation
and the two largest singular values are the magnitudes
of the standard deviations in those two directions.
With the mean and standard deviation of the measurements,
we can define a normal distribution.
The farther observations fall outside the normal ellipse,
the less likely the measurements represent a face.

Stacking the columns of each image into one vector :math:`f_j`,
we obtain the matrix :math:`F = [ f_1 ~~ f_2 ~~ \cdots ~~ f_n ]`.

1. Compute the mean :math:`\displaystyle \mu = \frac{1}{n} \sum_{j=1}^n f_j`.

2. Shift the data: :math:`a_j = f_j - \mu`, :math:`j=1,2,\ldots,n`.

3. Define the matrix

   .. math::

      A = \frac{1}{\sqrt{n}}
      \left[ \vphantom{\frac{1}{2}} ~~ a_1 ~~ a_2 ~~ \cdots ~~ a_n ~~ \right].

4. Compute the singular value decomposition :math:`A = U \Sigma V^T`.

The columns of :math:`U` are the eigenvectors of :math:`A A^T`
(the covariance matrix).
Those columns of :math:`U` are *the eigenfaces of F.*
The mean of the training set is :math:`\mu`.
The matrix :math:`U` is orthogonal.
Use the first :math:`\nu` eigenfaces:

.. math::

   U_\nu = [u_1 ~~ u_2 ~~ \cdots ~~ u_\nu].

For any image :math:`f` (as a stacked vector), compute its projection

.. math::

   y = U_\nu^T (f - \mu),

which is *the eigenface representation of f.*
Then *the eigenface reconstruction of f* is
:math:`\widetilde{f} = U_\nu y + \mu`.
Comparing the reconstructed faces to the given ones in the training set
leads to the best value for :math:`\nu`.
A small sketch of this procedure follows the summary
of the principal component analysis below.

The three steps in the :index:`Principal Component Analysis` are listed below:

1. Standardize the data, via the formula

   .. math::

      z = \frac{x - \mbox{mean}}{\mbox{standard deviation}},

   applied to each number :math:`x` in the input.

2. Set up the covariance matrix :math:`C`.
   For each pair :math:`(z_i, z_j)`,

   .. math::

      {\rm cov}(z_i, z_j) \mbox{ is the } (i,j)\mbox{-th element of the matrix.}

   The matrix :math:`C` is symmetric and positive semidefinite.

3. Compute the eigenvectors and eigenvalues of :math:`C`.
   Sort the eigenvectors in decreasing magnitude of the eigenvalues.
   Then the feature vector consists of the first :math:`p` eigenvectors.
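The sketch below illustrates the eigenface computation in NumPy.
The array ``F``, the resolution, and the number of retained eigenfaces
are placeholders for illustration only, not the actual XM2VTS data.

.. code-block:: python

   import numpy as np

   # placeholder training data: each column of F is one stacked grayscale image
   m, n = 64 * 64, 600                  # resolution m = p*q, n training images
   F = np.random.rand(m, n)

   mu = F.mean(axis=1)                  # 1. the mean image
   A = (F - mu[:, None]) / np.sqrt(n)   # 2. and 3. shift and scale the data

   # 4. the columns of U are the eigenfaces
   U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

   nu = 150                             # number of eigenfaces retained
   U_nu = U[:, :nu]

   f = F[:, 0]                          # any image, as a stacked vector
   y = U_nu.T @ (f - mu)                # the eigenface representation of f
   f_tilde = U_nu @ y + mu              # the eigenface reconstruction of f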
Ranking Data
------------

To build a search engine, the following problem must be solved:

.. topic:: the search engine problem

   Given a search query from a user,
   among the immense set of web pages that contain the query words,
   which pages are most important to the user?

This is an engineering challenge,
as the search engine must answer tens of millions of queries each day.
Google is designed to be a scalable search engine.

The crawlable web is a graph of pages (nodes) and links (edges).

* Every page has forward links (outedges) and backlinks (inedges).

* Highly linked pages are more important than pages with few links.

.. topic:: Definition of the simplified PageRank.

   For a web page :math:`u`,

   * :math:`F_u` is the set of pages :math:`u` points to,

   * :math:`B_u` is the set of pages that point to :math:`u`.

   Let :math:`N_u = \left| F_u \right|` and let :math:`c` be a normalization factor.
   A simple ranking, *the simplified PageRank*, is defined by

   .. math::

      R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}.

The rank of page :math:`u` is
:math:`\displaystyle R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}`, where

* :math:`N_u` is the number of pages the page :math:`u` points to,

* :math:`B_u` is the set of pages that point to :math:`u`, and

* :math:`c < 1` because many pages do not point to any other pages.

Let :math:`A` be a square matrix indexed by web pages:

* :math:`A_{u, v} = 1/N_v`, if there is an edge from :math:`v` to :math:`u`,

* :math:`A_{u, v} = 0`, otherwise.

Consider the vector :math:`R = [R(u) \mbox{ for all pages } u]`, then:

.. math::

   R = c~\!A~\!R, \quad \mbox{or equivalently} \quad A~\!R = (1/c) R,

so :math:`R` is an eigenvector of :math:`A` with eigenvalue :math:`1/c`.

The problem with the simplified PageRank is a *rank sink*,
caused by two pages pointing to each other but to no other page,
and a third page pointing to one of those two pages.
The loop accumulates rank, but never distributes any rank outside of it.
Therefore, the simplified PageRank is modified as follows:

.. topic:: Definition of the PageRank.

   The *PageRank of page u* is

   .. math::

      R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c~\!E(u),

   where :math:`E(u)` is some vector that corresponds to the source of the rank,
   such that :math:`c` is maximized and :math:`\| R' \|_1 = 1`.
   The 1-norm is :math:`\| R' \|_1 = |R'_1| + |R'_2| + \cdots + |R'_n|`,
   where :math:`n` is the length of :math:`R'`.
   In matrix notation: :math:`R' = c (A~\!R' + E)`.

To compute the PageRank, we introduce the random surfer model.
The PageRank can be derived from random walks in graphs.
If a surfer ever gets into a small loop of web pages,
then the surfer will most likely jump out of the loop.
The likelihood of a surfer jumping to a random page
is modeled by :math:`E` in

.. math::

   R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c~\!E(u).

The algorithm to compute the PageRank is outlined below:

0. Initialize :math:`R_0` to some random vector.

1. do

   1. :math:`R_{i+1} = A~\!R_i`

   2. :math:`d = \| R_i \|_1 - \| R_{i+1} \|_1`

   3. :math:`R_{i+1} = R_{i+1} + d~\!E`

   4. :math:`\delta = \| R_{i+1} - R_i \|_1`

   while :math:`\delta > \epsilon`.

We recognize a variant of the power method to compute
the eigenvector associated with the dominant eigenvalue.

Let :math:`P = [p_{i,j}]` be a square matrix,
with indices :math:`i` and :math:`j` running over all web pages,
where :math:`p_{i,j}` is the probability of moving from page :math:`j` to page :math:`i`.
Because the :math:`p_{i,j}`'s are probabilities,
the sum of all elements in the *j*-th column of :math:`P` equals one.
The matrix :math:`P` is called a *stochastic matrix*.
We can look at the equation :math:`x = c~\!P x`,
which corresponds to the simplified PageRank definition.

To model the likelihood that a surfer jumps out of a loop:

* With probability :math:`\alpha`, in the next move
  the surfer follows one outedge at page :math:`j`.

* With probability :math:`1 - \alpha`, the surfer *teleports*
  to any other page according to the distribution vector :math:`v`.

Then we arrive at the equation

.. math::

   x = \alpha~\!P x + (1 - \alpha) v.

The constant :math:`\alpha` is called the *teleportation parameter.*
The search engine problem is reduced to the PageRank problem.

.. topic:: the PageRank problem

   Let :math:`P` be a stochastic matrix:

   1. all entries of :math:`P` are nonnegative, and

   2. each column of :math:`P` sums up to one.

   Let :math:`v` be a column stochastic vector
   and :math:`\alpha` the teleportation constant.
   Then *the PageRank vector* :math:`x` satisfies

   .. math::

      (I - \alpha P) x = (1 - \alpha) v,

   where :math:`I` is the identity matrix of the same dimension as :math:`P`.

This reduces the PageRank problem to solving a linear system.
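For a small illustration, the linear system can be solved directly.
In the sketch below, the stochastic matrix ``P`` encodes a made-up
four-page link graph; it is not taken from the lecture.

.. code-block:: python

   import numpy as np

   # a made-up link graph on four pages:
   # column j holds the probabilities of moving from page j to each page i,
   # so every column sums to one
   P = np.array([[0.0, 0.5, 0.0, 0.0],
                 [1.0, 0.0, 0.0, 0.5],
                 [0.0, 0.5, 0.0, 0.5],
                 [0.0, 0.0, 1.0, 0.0]])

   alpha = 0.85                  # teleportation parameter
   n = P.shape[0]
   v = np.ones(n) / n            # uniform teleportation vector

   # the PageRank vector solves (I - alpha P) x = (1 - alpha) v
   x = np.linalg.solve(np.eye(n) - alpha * P, (1 - alpha) * v)

   print(x)                      # the ranks, which sum to one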
For the relation with eigenvectors,
consider summing up the entries in the column stochastic vector :math:`v`:

.. math::

   e^T = [1, 1, \ldots, 1], \quad e^T v = 1.

Then we rewrite :math:`(I - \alpha P) x = (1 - \alpha) v` as

.. math::

   \begin{array}{rcl}
     x & = & \alpha P x + (1 - \alpha) v, \\
       & = & (\alpha P + (1 - \alpha) v~\!e^T) x.
   \end{array}

In the rewrite we applied :math:`e^T x = 1`,
as the PageRank vector is a column stochastic vector.
Thus, the PageRank vector :math:`x` is an eigenvector
of the matrix :math:`\alpha P + (1 - \alpha) v~\!e^T`,
associated with the eigenvalue one.

The mathematics of PageRank apply to any graph or network
and the ideas occur in a wide range of applications:

* GeneRank and ProteinRank in bioinformatics

* IsoRank in ranking graph isomorphisms

* PageRank of the Linux kernel

* Roads and urban spaces

* PageRank of winner networks in sports

* BookRank in literature, recommender systems

* AuthorRank in the graph of coauthors

* Paper and citation networks

Proposal for a Project Topic
----------------------------

Read the paper and consider the software:

* Florian Schroff, Dmitry Kalenichenko, and James Philbin:
  **FaceNet: A Unified Embedding for Face Recognition and Clustering.**

* OpenFace: free and open source face recognition with deep neural networks.

Do the following:

1. Read the paper and install the software.

2. Experiment with many pictures of the same two people.
   Train the neural network to recognize the faces
   on the first half of the pictures.
   Test the neural network on the second half.

3. What is the success rate?
   Can you improve the selection of the pictures?
   What if passport-style pictures are used?

Exercises
---------

1. Verify that :math:`A = U \Sigma V^T` implies
   :math:`A^T A V = V \Sigma^T \Sigma` and :math:`A A^T U = U \Sigma \Sigma^T`.

2. In the algorithm to compute the PageRank,
   what about the growth of :math:`A~\!R_i`?
   Why does the algorithm above not divide :math:`R` by its largest component?

Bibliography
------------

1. S. Brin and L. Page: **The anatomy of a large-scale hypertextual Web search engine.**
   *Computer Networks and ISDN Systems* 30, pages 107-117, 1998.

2. David F. Gleich: **PageRank Beyond the Web.**
   *SIAM Review*, Vol. 57, No. 3, pages 321-363, 2015.

3. Neil Muller, Lourenco Magaia, B. M. Herbst:
   **Singular Value Decomposition, Eigenfaces, and 3D Reconstructions.**
   *SIAM Review*, Vol. 46, No. 3, pages 518-545, 2004.

4. L. Page, S. Brin, R. Motwani, T. Winograd:
   **The PageRank Citation Ranking: Bringing Order to the Web.**
   Technical Report, Stanford InfoLab.