Lecture 43: Statistical Computing with R
========================================

R is a language and environment for statistical computing and graphics.
We can run R from a terminal session or evaluate cells in a Sage
notebook interface with R.

Computations with R
-------------------

The ``R`` in Sage is a little ``r.``

::

   c = r('choose(5,2)')
   print(c, 'has type', type(c))
   print(c.sage())

With the ``sage()`` we convert an ``RElement`` into a corresponding
Sage expression.

We can evaluate ``R`` commands given as strings to ``r.eval().``
The result of this evaluation is a string.

::

   r.eval('choose(100, 30)')

Notice that '[1] 2.937234e+25' contains a floating-point number.
R is not a computer algebra package, unlike Sage.
In Sage we compute binomials as follows.

::

   print(binomial(5,2))
   print(binomial(100,30))

To estimate \ :math:`\pi` we can run a Monte Carlo simulation.
We can use that

.. math:: \int_0^1 \sqrt{1-x^2} dx = \frac{\pi}{4}

to estimate \ :math:`\pi`.
We generate one million uniformly distributed points in the interval [0,1]
and compute the mean of the height of the points.

::

   data = r('u <- runif(1000000,min=0,max=1)')
   estpi = r('4*mean(sqrt(1 - u^2))')
   estpi

We see the estimate ``3.142004.``
To check, we use ``integrate.``

::

   f = r('integrand <- function(x) { sqrt(1-x^2) }')
   i = r('integrate(integrand, lower=0, upper=1)')
   i

The output is ``0.7853983 with absolute error < 0.00011.``

Plotting Data
-------------

We generate 100 random normally distributed numbers, sort the numbers,
and then make a stem and leaf plot.

::

   r.eval('rn = rnorm(100)')
   print r.eval('sort(rn)')
   p = r.eval('stem(rn)')
   p

The stem and leaf plot is stored as a string in ``p``.

::

   The decimal point is at the |

   -3 | 4
   -2 | 72
   -1 | 9665544332100
   -0 | 999888877776665555444433333332222110
    0 | 001111222222333334455566667788999
    1 | 000224455569
    2 | 036

We illustrate the r.call() and making a histogram of data.

::

   r.quartz() # to plot on Mac OS X
   data = r.call('rnorm', 100)
   r.call('hist', data)

The data for the histogram is below.
The histogram plot is in :numref:`figRhistplot`.

::

   $breaks
   [1] -3.0 -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5
   $counts
   [1]  1  1  6  8  9 14 21 20 15  3  2
   $density
   [1] 0.02 0.02 0.12 0.16 0.18 0.28 0.42 0.40 0.30 0.06 0.04
   $mids
   [1] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25  0.25  0.75  1.25  1.75  2.25

.. _figRhistplot:

.. figure:: ./figRhistplot.png
    :width: 75%
    :align: center

    A histogram of 100 random, normally distributed, data points.

Making a box plot of the data is straightforward.

::

   r.call('boxplot', data)

The data is shown below and the plot is in :numref:`figRboxplot`.

::

   $stats
              [,1]
   [1,] -2.0077439
   [2,] -0.4777200
   [3,]  0.2852553
   [4,]  0.9061189
   [5,]  2.2852012
   $n
   [1] 100
   $conf
              [,1]
   [1,] 0.06660879
   [2,] 0.50390188
   $out
   [1] -2.620645

.. _figRboxplot:

.. figure:: ./figRboxplot.png
    :width: 75%
    :align: center

    A box plot of 100 random, normally distributed, data points.

The middle line in the plot indicates the medium.
What is in the box are the points between the first and third quartile
of the data.  The horizontal lines at bottom and top show respectively
the minimum and maximum of the data.  Outliers are marked by circles.


Fitting Linear Models
---------------------

To fit linear models, we apply ``lm``.
As a test we generate 50 points in [0, 10] as ``x``.
We set ``y`` equal to ``x`` with some random noise added.
Observe the assignment in ``r``.

::

   r('x <- 10*runif(50)')
   r('y <- x + rnorm(50)')
   r.quartz()
   r('plot(x,y)')

The plot is shown in :numref:`figRpointsplot`.

.. _figRpointsplot:

.. figure:: ./figRpointsplot.png
    :width: 75%
    :align: center

    A plot of 50 points on a line with random noise added.

Now we do the fit with the command ``lm``.

::

   r('fit <- lm(y ~ x)')

For the output we can read off the intercept ``b`` and the slope ``a``
for the linear fit ``y = a*x + b``.  See below:

::

   Call:
   lm(formula = y ~ x)

   Coefficients:
   (Intercept)            x  
        0.1891       0.9573

We can ask for a summary.

::

   r('summary(fit)')

The summary of the ``fit`` is below.

::

   Call:
   lm(formula = y ~ x)

   Residuals:
        Min       1Q   Median       3Q      Max 
   -1.84906 -0.75229 -0.07965  0.40908  2.46077 

   Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
   (Intercept)  0.18907    0.28316   0.668    0.508    
   x            0.95734    0.04729  20.245   <2e-16 ***
   ---
   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

   Residual standard error: 1.014 on 48 degrees of freedom
   Multiple R-squared:  0.8952,	Adjusted R-squared:  0.893 
   F-statistic: 409.8 on 1 and 48 DF,  p-value: < 2.2e-16