Regression ========== We introduce regression from a linear algebra perspective. Solving Overdetermined Linear Systems ------------------------------------- Consider :math:`A {\bf x} = {\bf b}`, :math:`A` is *m*-by-*n*, with :math:`m \geq n`. If :math:`m > n`, then we have more equations than unknowns and we can ask: *How to solve this overdetermined linear system?* * As the overdetermined linear system has in many cases no exact solution, the problem of solving the system is reformulated into an optimization problem: Minimize :math:`\| {\bf b} - A {\bf x} \|_2^2`, or in words: solve the linear system in the least squares sence. * To solve in the least square sense, the matrix :math:`A` is factored into a product of an orthogonal matrix :math:`Q` and and upper triangular matrix :math:`R`. A numerically stable algorithm to compute the QR decomposition is the Householder QR algorithm. * While the Householder QR is a direct method to solve the least squares problem, for very large problems, an iterative algorithm such as the Generalized Minimum Residual Method is recommended. * To verify a solution of a least squares problem, we exploit the orthogonality: :math:`\langle {\bf b} - A {\bf x} , A \rangle = 0`, which expresses that the residual vector :math:`{\bf b} - A {\bf x}` is perpendicular to the column space of the matrix :math:`A`. * The solution of a least squares problem gives the coefficients of a data fit with a basis of polynomials, exponentials, or periodic functions. The backslash operator applies to overdetermined systems, as illustrated in the Julia session below: :: julia> A = rand(3,2); b = rand(3,1); julia> x = A\b; julia> r = b - A*x 3×1 Matrix{Float64}: -0.038628108251110294 0.4381284983642053 -0.17773235808476184 julia> r'*A 1×2 Matrix{Float64}: -2.77556e-17 -2.77556e-17 The outcome of ``r'*A`` or :math:`\langle {\bf b} - A {\bf x} , A \rangle` is below machine precision. A typical application is to find a trend in numerical data. .. index:: linear fit * Input: :math:`(x_i, y_i)`, :math:`i=1,2,\ldots,m`. * Output: :math:`(a, b)`, slope and intercept of a line :math:`y = a x + b`, so that .. math:: \sum_{i=1}^m \left( \vphantom{\frac{1}{2}} y_i - (a x_i + b) \right)^2 is minimal. If the slope :math:`a` is positive, then the :index:`trend in the data` is increasing. In :numref:`figFitData` is an illustration of :index:`fitting noisy data`. .. _figFitdata: .. figure:: ./figFitData.png :align: center Fitting data sampled from a line, with noise added. The Julia code to make the plot in :numref:`figFitData`, for execution in a Jupyter notebook, is below: :: using Plots # plot the exact data on the line y = 0.5*x + 1 x = -0.1:0.1:1.1 y = 0.5*x .+ 1 plot(x,y,label="y=0.5*x + 1",legend=:topleft) # add normally distributed noise ynoisy = [y[i] + 0.2*randn() for i=1:length(y)] scatter!(x[2:length(x)-1],ynoisy[2:length(x)-1], marker=(:red),label="data",legend=:topleft) # set up the linear system A = ones(length(x),2) A[:,2] = x # apply the backslash operator c = A\ynoisy # plot the fitted line lbl = string("y=", string(c[2]), "*x+", string(c[1])) fity = c[1] .+ c[2]*x plot!(x,fity,label=lbl) To fit exponential growth, :index:`linearization` is applied. Given :math:`(t_i, f_i)`, :math:`i=0,1,\ldots,n`, find the coefficients of .. math:: y(t) = c_1 e^{c_2 t} = c_1 \exp(c_2 t), so :math:`y(t)` fits the data :math:`(t_i, f_i)` best. With linearization, we consider .. math:: \ln(y(t)) = \ln(c_1 e^{c_2 t}) = \ln(c_1) + c_2 t. The :index:`logarithmic scale` helps to deal with numerical instabilities caused by extreme values which may arise with :index:`exponential growth`. Predictive Models ----------------- In statistical software, the :index:`least squares` problem is known as the ordinary least squares. The Generalized Linear Model is abbreviated as GLM. An example from the GLM documentation is illustrated below. :: julia> using DataFrames, GLM julia> data = DataFrame(X=[1,2,3], Y=[2,4,7]) 3x2 DataFrame Row X Y Int64 Int64 1 1 2 2 2 4 3 3 7 For the give data, we define a :index:`linear model` in the continuation of the Julia session: :: julia> ols = lm(@formula(Y ~ X), data) StatsModels.TableRegressionModel{ LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}} Y ~ 1 + X Coefficients: Coef. Std. Error (Intercept) -0.666667 0.62361 X 2.5 0.288675 As an application of the least squares model, we can :index:`predict data`. :: julia> predict(ols) 3-element Vector{Float64}: 1.8333333333333308 4.333333333333334 6.833333333333336 julia> line(x) = -2/3 + 2.5*x line (generic function with 1 method) julia> [line(t) for t in [1,2,3]] 3-element Vector{Float64}: 1.8333333333333335 4.333333333333333 6.833333333333333 Recall an earlier lecture analyzing data and the question *Are the winters in Chicago becoming milder?* To answer this question, we obtained data from the National Weather Service, a site from the National Oceanic and Atmospheric Administration (NOAA). The data are average monthly temperatures in Chicago over the past 100 years, from 1922 to 2021. The file ``tempChicago100years.txt`` stores the data. The Julia code in the Jupyter notebook loads the data and makes a plot, shown in :numref:`figDecemberTemperatures`. .. _figDecemberTemperatures: .. figure:: ./figDecemberTemperatures.png :align: center Average December temperatures in Chicago from 1922 to 2021. :: using DelimitedFiles A = readdlm("tempChicago100years.txt") dectemp = A[2:end,13] plot(1:100, dectemp, label="Dec", legend=:bottomleft) The code in the Jupyter notebook continues with the preparation of the data for the fit. The type of the elements in the matrix ``A`` is ``Any``, which then causes problems later, so we must convert to ``Float64``. :: years = A[2:end,1] tempdf = DataFrame(X=Vector{Float64}(years), Y=Vector{Float64}(dectemp)) The command ``olstemp = lm(@formula(Y ~ X), tempdf)`` fits the average December temperatures, and gives the output: :: Y ~ 1 + X Coefficients: --------------------------------- Coef. Std. Error --------------------------------- (Intercept) 20.4113 35.8233 X 0.00433003 0.0181687 Observe that the slope ``0.00433`` is tiny ... which becomes apparent in the plot in :numref:`figDecemberTemperaturesFitted100`. .. _figDecemberTemperaturesFitted100: .. figure:: ./figDecemberTemperaturesFitted100.png :align: center A linear fit of the average December temperatures. To make the plot, we grab the coefficients of the fit: :: cff = coef(olstemp) and then compute the predictions: :: predicted = [cff[1] + cff[2]*t for t in 1922:2021] and plot for the same range ``1:100`` as the December temperatures: :: plot!(1:100, predicted, label="predictions") Instead of making the fit for the entire 100 years, let us look at the last 20 years, done by the code below: :: Xlast20=Vector{Float64}(years[end-19:end]) Ylast20=Vector{Float64}(dectemp[end-19:end]) tempdf20 = DataFrame(X=Xlast20,Y=Ylast20) olstemp20 = lm(@formula(Y ~ X), tempdf20) cff20 = coef(olstemp20) The coefficients are :: -607.1678947233604 0.3168421052564556 The larger slope ``0.316`` is noticeable in :numref:`figDecemberTemperaturesFittedLast20`. .. _figDecemberTemperaturesFittedLast20: .. figure:: ./figDecemberTemperaturesFittedLast20.png :align: center A linear fit of the average December temperatures, over the last 20 years. Regression as a Data Mining Task -------------------------------- .. index:: data mining .. topic:: Definition of Regression as a Data Mining Task. In the modeling view of data mining, *regression* is a task of predicting a numeric target attribute which represents some quantity of interest. This quantity of interest could be * an outcome or a parameter of an industrial process, * an amount of money earned or spent, * a cost or gain due to a business decision, etc. Data mining has its origins in machine learning and statistics. In data mining: *domain*, *instance*, *attribute*, and *dataset*. In statistics: *population*, *observation*, *variable*, and *sample* are the respective counterparts to the data mining terms. Proposals of Project Topics --------------------------- 1. Fit census data. Using census data from 1790 to today, find experimentally a best fitting polynomial for the U.S. population in millions against time in decades. Consider the following questions: * What is the best degree of polynomial? * Instead of one single polynomial, would a piecewise polynomial model fit better? 2. Fit corona virus data. Use data from the covid-19 pandemic to model the exponential growth in a surge of infections. Consider the following questions: * What is the exponent in the surge? * Do different surges have different exponents? Exercises --------- 1. Does Moore's Law still hold? The data in :numref:`tabTransistorCount` are taken from wikipedia. .. _tabTransistorCount: .. table:: Transistor counts for various Intel processors. +------+--------------------------+------------------+ | year | Intel processor | transistor count | +======+==========================+==================+ | 2000 | Pentium III Coppermine | 21,000,000 | +------+--------------------------+------------------+ | 2001 | Pentium III Tualatin | 45,000,000 | +------+--------------------------+------------------+ | 2002 | Pentium IV Northwood | 55,000,000 | +------+--------------------------+------------------+ | 2003 | Itanium 2 Madison | 410,000,000 | +------+--------------------------+------------------+ | 2004 | Itanium 2 | 592,000,000 | +------+--------------------------+------------------+ | 2005 | Pentium D Smithfield | 228,000,000 | +------+--------------------------+------------------+ | 2006 | Dual-core Itanium 2 | 1,700,000,000 | +------+--------------------------+------------------+ | 2007 | Core 2 Duo Wolfdale | 411,000,000 | +------+--------------------------+------------------+ | 2008 | Core i7 quad-core | 731,000,000 | +------+--------------------------+------------------+ | 2010 | Xeon Nehalem-EX (8-core) | 2,300,000,000 | +------+--------------------------+------------------+ Fit the data in :numref:`tabTransistorCount` to :math:`y(t) = c_1 2^{c_2 t}`. Make a :math:`\log_2` scale plot: :math:`(\mbox{year}, \log_2(\mbox{transistor count}))` of the data, and the model :math:`y(t)`. For year, use 0, 1, :math:`\ldots`, 10. 2. Complete the plot in :numref:`figDecemberTemperaturesFittedLast20` with the fit for the last 20 years with the fits for 1922-1941, 1942-1961, 1962-1981, and 1982-2001. 3. Adjust the code of the previous exercise, so that instead of 20, any divisor of 100 could be used to break up the data. Bibliography ------------ 1. Jose Storopoli, Rik Huijzer, Lazaro Alonso: *Julia Data Science*. First edition published 2021. Creative Commons Attribution-Noncommercial-ShareAlike 4.0 International. 2. Pawel Cichosz: *Data Mining Algorithms: Explained Using R*. Wiley 2015.