R: A Brief Introduction

A. Colin Cameron, Dept. of Economics, Univ. of Calif. - Davis

This March 2016 help sheet gives information on installing and using R.

WHAT IS R?

R is a free matrix programming language and software environment that is widely used among statisticians for developing statistical software and data analysis. It is based on an earlier language S that became the commercial product S-Plus. It runs on Windows, Mac and Linux. Unlike S-Plus, R is free. See http://en.wikipedia.org/wiki/R_(programming_language)

For regression analysis one needs:

Necessary: The latest base version of R. This includes some basic regression commands as part of the R Stats package.
Strongly Recommended: RStudio, which is a simpler front-end to R.
Perhaps useful for beginners: R Commander, which is a GUI for which you don't need to know R commands in advance.
(It shows the commands produced, so you know them for future use. But it only covers a restricted number of commands.)
As Needed: Relevant user-written programs called packages that include additional commands (functions) that are not part of the base version of R.

INSTALL R

You first need to have R installed.

Go to http://www.r-project.org and install the latest version of R (install base).

INSTALL R-STUDIO

Go to http://www.rstudio.com and install the front-end RStudio.

If you install RStudio then from now on start RStudio, not R.
(Note: R Commander runs both under R and under RStudio.)

OPTIONAL: INSTALL R COMMANDER, A GUI FRONT-END FOR R

Once in R give the command install.packages("Rcmdr", dependencies = TRUE) and then the command library(Rcmdr).

If you install R Commander then from now on start R Commander by giving the command library(Rcmdr) once in RStudio.

INSTALL OTHER PACKAGES

Go to http://cran.r-project.org/web/views to see lists of potentially useful packages organized for convenience by task. Tasks include Econometrics, Finance, SocialSciences and TimeSeries.
Install a package. For example, to install package np using RStudio, open RStudio, go to the Install Packages window, search for np, and click on np.
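The same installation can also be done from the R console instead of the Install Packages window. The following is a minimal sketch: the helper name ensure_package is hypothetical (it is not part of R), and it calls install.packages only when the package is not already present.

```r
# Hypothetical helper (not part of R): install a package only if it is missing.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from CRAN; needs an internet connection
  }
  # TRUE if the package can now be loaded
  requireNamespace(pkg, quietly = TRUE)
}

# stats ships with base R, so this call downloads nothing
ensure_package("stats")

# To install and load np as in the example above, one would run:
# ensure_package("np")
# library(np)
```

Installing is needed only once per machine, whereas library() must be given in every new R session that uses the package.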

FURTHER INFORMATION

The repository for R packages is CRAN (The Comprehensive R Archive Network).
Packages are only put on CRAN if they pass quality assurance checks, especially stability. Packages that are also published in journals like
These can be accessed from http://cran.r-project.org/web/packages/ either by individual package or by Task.
Basic R documentation is at https://cran.r-project.org/manuals.html
See especially "An Introduction to R". https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
R Commander documentation is installed in the Rcmdr directory
RStudio has useful documentation at http://www.rstudio.org/docs/ including http://www.rstudio.org/docs/help_with_r
A useful introduction for econometricians is Jeff Racine and Rob Hyndman (2002), "Using R to Teach Econometrics," Journal of Applied Econometrics, 17, 175-189.
Jeff Racine has useful general information on getting going in R (in addition to his own package np). See https://socialsciences.mcmaster.ca/racinej/Gallery/Home.html
The website http://www.statmethods.net/ is useful.

SOME R BASICS

> is the R prompt
<- is the R assignment ("equality") operator (often can instead use = )
( , ) R function arguments are given in parentheses and are comma separated
? is the help prefix e.g. ?lm gives help on the lm function
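As a small illustration of these basics, the following sketch assigns two values, calls a function with comma-separated arguments, and shows (commented out) how help would be requested. The variable names a, b and total are made up for this example.

```r
a <- 3               # assignment with <-
b = 4                # = also works for assignment at the prompt
total <- sum(a, b)   # function arguments go in parentheses, separated by commas
total                # typing a name prints its value

# ?sum               # uncomment to open the help page for the sum function
```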

For some sample programs see http://cameron.econ.ucdavis.edu/R/R.html

R EXAMPLE: LINEAR REGRESSION

First remove all variables from the workspace

> rm(list=ls())

Consider linear regression of y on x with five observations (y,x)=(1,1),(2,2),(2,3),(2,4),(3,5).
To type in data use the c function (here > is the prompt from R and is not typed).

> y = c(1,2,2,2,3)
> x = c(1,2,3,4,5)

To see the data (the [1] at the start of each output line is the index of the first element printed on that line; y is a vector with 5 elements, listed here as a row)

> y
[1] 1 2 2 2 3
> x
[1] 1 2 3 4 5

To see the mean of y

> mean(y)
[1] 2

To summarize y

> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1       2       2       2       2       3

To plot y against x

> plot(x,y)
.... output omitted ...
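The plot function takes the horizontal-axis variable first, so y appears on the vertical axis when x is the first argument. A minimal sketch that also labels the axes and overlays the fitted OLS line follows; the title text and the object name fit are chosen only for this illustration.

```r
y <- c(1, 2, 2, 2, 3)
x <- c(1, 2, 3, 4, 5)

# Scatterplot with x on the horizontal axis and y on the vertical axis
plot(x, y, xlab = "x", ylab = "y", main = "Scatterplot of y against x")

# Overlay the fitted regression line from an OLS regression of y on x
fit <- lm(y ~ x)
abline(fit)
```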

To OLS regress y on x and have coefficients reported.

> lm(y~x)
Call:
lm(formula = y ~x)
Coefficients:
(Intercept)            x
        0.8          0.4

To regress y on x and obtain more complete regression output, first save the results in lm.cars and then use summary (which calls summary.lm for lm objects) to print out complete results.

> lm.cars <- lm(y~x)
> summary(lm.cars)
Call:
lm(formula = y ~x)
Residuals:
         1          2          3          4          5
-2.000e-01 4.000e-01 -6.855e-17 -4.000e-01 2.000e-01
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.8000     0.3830   2.089   0.1279
x             0.4000     0.1155   3.464   0.0405 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.3651 on 3 degrees of freedom
Multiple R-squared:   0.8,      Adjusted R-squared: 0.7333
F-statistic:    12 on 1 and 3 DF, p-value: 0.04052
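Individual pieces of this output can also be pulled out of the saved results rather than read off the printout. The sketch below uses standard extractor functions in base R; the object names b, r2, ci and yhat, and the prediction point x = 6, are chosen only for illustration.

```r
y <- c(1, 2, 2, 2, 3)
x <- c(1, 2, 3, 4, 5)
lm.cars <- lm(y ~ x)

b    <- coef(lm.cars)                   # intercept and slope estimates
r2   <- summary(lm.cars)$r.squared      # R-squared
ci   <- confint(lm.cars, level = 0.95)  # 95% confidence intervals
yhat <- predict(lm.cars, newdata = data.frame(x = 6))  # prediction at x = 6

b
r2
ci
yhat
```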

MORE REGRESSION

To obtain heteroskedasticity-consistent (sandwich) standard errors use function vcovHC in package sandwich, which may first need to be installed.
(Or can use coeftest in package lmtest)

> library(sandwich)
Error in library(sandwich): there is no package called `sandwich'
> install.packages("sandwich")
.... output omitted
> library(sandwich)
> model <- lm(y ~x)
> vcovHC(model)
            (Intercept)           x
(Intercept)  0.28489796 -0.07959184
x           -0.07959184  0.02653061

Function vcovHC gave the variance matrix. We need to get the standard errors, the square root of the diagonal entries. These are 0.5337 and 0.1628 compared to earlier default standard errors of 0.3830 and 0.1155.

> sqrt(diag(vcovHC(model)))
(Intercept)           x
  0.5337583   0.1628822
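To see what lies behind these numbers, the sandwich formula can be computed by hand in base R. The sketch below assumes the HC3 variant, in which each squared residual is divided by (1 - h)^2 where h is the leverage; this is my understanding of the vcovHC default, and it reproduces the standard errors reported above. The names bread, meat, V and se are chosen for illustration.

```r
y <- c(1, 2, 2, 2, 3)
x <- c(1, 2, 3, 4, 5)
model <- lm(y ~ x)

X <- model.matrix(model)   # regressor matrix including the intercept column
e <- residuals(model)      # OLS residuals
h <- hatvalues(model)      # leverage of each observation

# Sandwich estimator: (X'X)^-1 X' diag(w) X (X'X)^-1, with HC3 weights (assumed)
bread <- solve(t(X) %*% X)
meat  <- t(X) %*% diag(e^2 / (1 - h)^2) %*% X
V     <- bread %*% meat %*% bread

se <- sqrt(diag(V))        # robust standard errors
se
```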

To produce the original OLS results (using default standard errors) as a LaTeX table

> install.packages("xtable")
> library(xtable)
> xtable(model)

OLS WITH MATRIX ALGEBRA

To do regression manually using matrix commands, first create matrices X (including intercept) and y and then form the inverse of X'X times X'y.

> x
[1] 1 2 3 4 5
> X <- cbind(1,x)
> X
       x
[1,] 1 1
[2,] 1 2
[3,] 1 3
[4,] 1 4
[5,] 1 5
> bhat <- solve(t(X)%*%X)%*%t(X)%*%y
> bhat
  [,1]
   0.8
x  0.4

Here t(X) transposes X, %*% denotes matrix multiplication, solve forms the matrix inverse.
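The same matrix approach extends to the default standard errors. The sketch below also shows an alternative that avoids forming the inverse explicitly, by passing X'y as the second argument of solve. The names bhat_solve, s2 and se are chosen for illustration.

```r
y <- c(1, 2, 2, 2, 3)
x <- c(1, 2, 3, 4, 5)
X <- cbind(1, x)

# Coefficients via the explicit inverse, as in the text
bhat <- solve(t(X) %*% X) %*% t(X) %*% y

# Alternative: solve the normal equations (X'X) b = X'y directly
bhat_solve <- solve(crossprod(X), crossprod(X, y))

# Default OLS standard errors: s^2 (X'X)^-1 with s^2 = e'e / (n - k)
e  <- y - X %*% bhat
s2 <- sum(e^2) / (nrow(X) - ncol(X))
se <- sqrt(diag(s2 * solve(t(X) %*% X)))

bhat
se
```

The standard errors obtained this way should agree with the Std. Error column that summary reports for the lm fit.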

READ IN A COMMA SEPARATED DATASET

To read in a comma-separated values file

> mydata <- read.csv("http://cameron.econ.ucdavis.edu/excel/carsdata.csv")

> summary(mydata)

      CARS      HH.SIZE
 Min.   :1   Min.   :1
 1st Qu.:2   1st Qu.:2
 Median :2   Median :3
 Mean   :2   Mean   :3
 3rd Qu.:2   3rd Qu.:4
 Max.   :3   Max.   :5
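Once the data are in a data frame, variables are referred to by name in the regression formula through the data argument. The sketch below can be run without an internet connection because it reads CSV text typed inline. The five rows are made up for illustration and are not claimed to be the contents of carsdata.csv; only the column names CARS and HH.SIZE are taken from the summary output above.

```r
# Made-up CSV text for illustration (not the actual carsdata.csv file)
csv_text <- "CARS,HH.SIZE
1,1
2,2
2,3
2,4
3,5"

mydata <- read.csv(text = csv_text)   # read.csv also accepts a file name or URL
str(mydata)                           # check variable names and types

# Regress CARS on HH.SIZE using the columns of the data frame
lm.cars <- lm(CARS ~ HH.SIZE, data = mydata)
coef(lm.cars)
```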

READ IN A STATA DATASET

The Stata dataset cannot be from too recent a version of Stata. Version 12 or earlier is okay.

> install.packages("foreign")
> library(foreign)
> mydata2 <- read.dta("http://cameron.econ.ucdavis.edu/stata/carsdata.dta")
> summary(mydata2)
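The function read.dta also reads a Stata file stored on the local disk. The sketch below can be run offline: it writes a small made-up data frame to a temporary file in an old Stata format with write.dta (also in package foreign) and reads it back. The variable names cars and hhsize and the object names df, tmp and mydata3 are hypothetical.

```r
library(foreign)   # recommended package, normally shipped with R

# Made-up data for illustration
df <- data.frame(cars = c(1, 2, 2, 2, 3), hhsize = c(1, 2, 3, 4, 5))

tmp <- tempfile(fileext = ".dta")   # temporary file name
write.dta(df, tmp)                  # write in an old Stata format
mydata3 <- read.dta(tmp)            # read it back, as one would a local file
summary(mydata3)

unlink(tmp)                         # delete the temporary file
```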