Monday, December 6, 2010

The R Project for Statistical Computing

R (r-project.org) is a free (and open source) language and software environment for statistical computing and graphics. If you need to perform statistical analysis or create visualizations of data sets, check out R. It works on most operating systems; ask Google for any one of the many fine installation tutorials.

The biggest barrier to entry to using R comes from its lack of a graphical user interface (GUI). Instead of seeing and manipulating data and selecting commands from a menu as in Microsoft Excel, users of R type in functions at the command line. It looks just like the DOS prompt that so few people use now. Does this mean you have to be a programmer to use it? No, but when you start using it, you'll be creating mini computer programs without realizing it. The rest of this article contains information about a few things I found useful when learning R.

R Commander

require(Rcmdr)
R Commander is the closest thing that users have to a GUI. It is probably the best way to start learning how to use R, because as a user selects commands from the menu, she can see the corresponding R command that gets run. However, one should quickly move toward typing commands at the command line directly; it will become evident how much more powerful the command line really is.
R Commander. Image source: John Fox
To start using R Commander, use it to play around with some of the sample data that comes with a standard installation of R. See below.

Included data sets

data()
R comes with many sample data sets pre-loaded. IPSUR (see below) uses many of these small data sets to illustrate concepts. To see a list of all the example data sets included by default with R, simply type in data() at the command prompt. To see a sample data set, simply type in its name. For example, type in cars to see the data about speed and stopping distance of cars. Playing with these data sets using R Commander is a great way to start learning R syntax. For example, try the command hist(cars$speed) to see R generate a histogram of data from the "speed" column of the data set called "cars."
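As a minimal sketch of the exploration described above, the following uses only the built-in "cars" data set. The plot title and axis label are my own additions for illustration, and the name speed_hist is arbitrary.

```r
# data() lists the bundled data sets; it is commented out here because
# it opens a separate listing in an interactive session.
# data()

# Print the first few rows rather than the whole data set.
head(cars)

# Show the column names, their types, and the number of rows.
str(cars)

# Five-number summary plus the mean for the "speed" column.
summary(cars$speed)

# Draw a histogram of the "speed" column. hist() also returns the
# bin counts, which are kept here in speed_hist.
speed_hist <- hist(cars$speed,
                   main = "Speed of cars",
                   xlab = "Speed (mph)")
```

Typing speed_hist at the prompt afterward shows the bin boundaries and counts that R used to draw the plot.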

Introduction to Probability and Statistics Using R

To get started with R, I'd recommend the freely available textbook Introduction to Probability and Statistics Using R (IPSUR [PDF]). Like so many other great internet resources, it's free and open source. The author, G. Jay Kerns of Youngstown State University, has compiled the whole book in LaTeX (also open source), producing a publication-quality textbook based specifically on using the R language and environment. This book makes use of the many example data sets included in R. It includes installation instructions and walks the reader through the mathematics of statistics while using R to perform the calculations.

I did not use IPSUR to learn statistics; I only used it to apply what I already knew about statistics to the R environment. As such, I can't comment on it as a statistics book, but only as an R book. It may be that the typical user should start with an introductory statistics book to learn the vocabulary and motivation for statistics, and then move to IPSUR and R with that foundation in place.

The lattice package

library(lattice)
I found the lattice package to be particularly useful in creating quality graphics. Make it available to R by typing library(lattice) at the command line. With the lattice package loaded, you have more graphical functions available. For example, in addition to hist(), you also have histogram(). Try histogram(cars$speed) and compare it to the default hist(cars$speed).
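The comparison can be sketched as follows. One difference worth knowing: lattice functions return a plot object instead of drawing immediately, so inside a script the plot appears only when the object is printed. The name speed_plot is arbitrary.

```r
library(lattice)

# Base graphics: draws the histogram right away.
hist(cars$speed)

# Lattice: builds a "trellis" object describing the plot.
# The formula form ~ speed looks up the column in the data argument.
speed_plot <- histogram(~ speed, data = cars)

# Printing the object draws it. At the interactive prompt this
# happens automatically when you type the histogram() call itself.
print(speed_plot)
```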

Data sets with two variables can be displayed easily using a simple XY chart, or the Cartesian plane familiar from high school algebra. The relationship is visualized by plotting each variable along one of the two axes. Lattice gets most of its power from its ability to meaningfully display data sets with more than two variables.

For example, look at the "environmental" data set (loaded along with the lattice package) by typing in the command head(environmental). Notice that it has four variables: ozone, radiation, temperature, and wind. (You could see this more directly with the command names(environmental).) To visualize the relationship of all possible pairs of these variables together, use the command splom(~environmental, data=environmental).
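Put together, those steps look roughly like this. It is a sketch that uses only the lattice package and its "environmental" data set; the name pairs_plot is arbitrary.

```r
library(lattice)

# Column names of the data set.
names(environmental)

# Number of rows and columns.
dim(environmental)

# Scatter plot matrix: one panel for every pair of variables.
# The formula ~ environmental passes the whole data frame to splom().
pairs_plot <- splom(~ environmental)
print(pairs_plot)
```

Each off-diagonal panel is an ordinary XY chart of two of the four variables, so a single figure shows all six pairings at once.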

Some useful commands:

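  • ?lattice Open the help page for the topic "lattice"; replace "lattice" with the name of any R function or data set to see its help page.
  • head(cars) View the column headings and the first six rows of the data set "cars"; replace "cars" with any data set to see a truncated listing, which will avoid printing out possibly thousands of lines of data in the command window. Useful for seeing what a particular data set looks like.
  • neuse$month_integer <- as.POSIXlt(neuse$DATE)$mon Add a column holding the month of each date, numbered from 0 (January) to 11 (December). Note that "neuse" is not one of the data sets included with R; substitute your own data frame and column names.
  • boxplot(Log_SURFCHLA~SOURCE + month_integer, data=neuse) Draw one box plot of Log_SURFCHLA for each combination of SOURCE and month.
  • splom(~ cbind(Murder, Assault, Rape), data = USArrests) Draw a scatter plot matrix of three columns from the included data set "USArrests" (requires lattice).
  • histogram(~Log_SURFCHLA|month_integer, data = neuse) Draw a separate histogram of Log_SURFCHLA for each month (requires lattice).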
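The "neuse" commands in the list cannot be run as written, because that data set does not come with R. The sketch below substitutes a tiny hypothetical data frame, neuse_demo, whose values are entirely made up; only the column names are borrowed from the commands in the list, so that the same pattern can be tried out.

```r
library(lattice)

# Hypothetical stand-in for "neuse"; every value here is invented
# purely for illustration.
neuse_demo <- data.frame(
  DATE = as.Date(c("2010-01-15", "2010-01-28", "2010-02-10",
                   "2010-02-24", "2010-03-05", "2010-03-19")),
  SOURCE = c("A", "B", "A", "B", "A", "B"),
  Log_SURFCHLA = c(0.8, 1.1, 0.9, 1.4, 1.2, 1.6)
)

# as.POSIXlt() splits a date into parts; $mon is the month,
# counted from 0 (January) to 11 (December).
neuse_demo$month_integer <- as.POSIXlt(neuse_demo$DATE)$mon

# Base graphics: one box per month.
boxplot(Log_SURFCHLA ~ month_integer, data = neuse_demo)

# Lattice: one histogram panel per month. Wrapping the month in
# factor() makes lattice treat it as a set of discrete groups.
month_hist <- histogram(~ Log_SURFCHLA | factor(month_integer),
                        data = neuse_demo)
print(month_hist)
```

Because $mon starts at 0, add 1 if you want the conventional month numbers 1 through 12.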

Other statistics packages

R isn't the only option. Below are some other software packages that have robust statistical computational capacity. Unfortunately, all are closed-source and proprietary.
  • Minitab ($1,395; Minitab 16)
  • Stata ($1,245; Stata IC)
  • SAS ($8,100; SAS Analytics Pro)
  • Excel ($139.99; Excel 2010)
