R programming: first steps

R is a (very) high level programming language that is used for statistical analysis. It’s open source and free, and is used in many industrial-strength applications, so if you’re planning on doing any data analysis, it’s worth a look.

Here, I’ll start with installing R and looking at a few basic concepts.

Installing R and RStudio

R runs on all major computing platforms (Windows, Mac, Linux) but I’ll restrict myself to Windows, since that’s all I have. Installing R is quite straightforward. Visit the CRAN website and download the R package, then install it in the usual way (on Windows, the install file is an exe, so just run it).

Although R on its own runs from the command line and many tutorials assume this is the environment you’re using, if you’re used to an IDE for your programming in other languages I’d highly recommend that you now install RStudio, which is a free (for non-commercial use) graphical interface for R development. You can get it here. From now on, I’ll assume this is the environment we’re using. RStudio should find your R installation, but if it doesn’t, or you want to change the version of R it uses, open Tools –> Global Options and select the General tab.

R is an object-oriented, functional language that contains most of the usual features such as arithmetic operators, if statements and loops. Since these don’t differ much from other languages such as Java or C#, we won’t dwell on them here. Rather, we’ll start off by examining some simple operations on data sets, which is what R is designed for.

One feature of R that is at once powerful and frustrating is that there is a pre-written function to do almost anything you can think of. It’s frustrating because the sheer number of such functions makes it virtually impossible to remember them (so remember Google is your friend) and even if you do, their usage is often far from obvious. There is a built-in ‘help’ facility in R, but, at least for novices, it’s often far from helpful.

Anyway, open up RStudio and follow along. You should find that that there is a Console window in the lower left (or possibly covering the entire left side) into which you can type R commands.

Data types

R comes with several built-in primitive data types such as ‘logical’, ‘character’, ‘double’, ‘numeric’, ‘integer’ and ‘complex’. R doesn’t require you to declare your variables before using them; rather the type of the variable is determined by how it is used, and the same variable name can be reassigned to different data types within the same program. The current type of a variable can be found with the typeof() function:

> x = 12 
> typeof(x) 
[1] "double" 

A note about the assignment operator: in older versions of R the backwards arrow <- was the only acceptable assignment operator so the x = 12 statement above would be written x <- 12. More recent versions of R allow both <- and the more intuitive = for assignment. I'll use = here since it's what I'm used to from other languages.

Vectors

For anything beyond trivial commands, we’ll be dealing with collections of data so we need to see how R handles these. There are four main types of object that are used for storing data: vectors, lists, matrices and data frames. The simplest of these is the vector.

We can define a vector using the range operator (a colon : ) if we want, say, a sequence of integers:

 
> v = 1:3 
> v [1] 1 2 3 

(Typing a variable on a line by itself prints out the value of that variable.)

Another way to create a vector is to use the c() function (‘c’ for ‘concatenate’), which produces a vector from its arguments. Thus we could have written the above vector as

 
> v = c(1,2,3) 
> v [1] 1 2 3 

There is, however, a subtle distinction between the two, which can be seen by using typeof(v). In the first example, the range operator : produces a vector of integers from 1 to 3 so typeof(v) produces ‘integer’. In the second example just listing the numbers 1, 2, 3 makes R think they are doubles.

Notice that applying typeof() to a vector produces the type of its elements, and not the type of the vector itself (which is ‘vector’).

If you’ve been wondering what the [1] at the start of the output line means, it indicates that the first element printed in that line is element 1 of the vector. Printing out a longer vector causes the output to appear on several lines, and the index of the first element in each line is printed at the start of that line. Try it yourself by generating a vector with the integers 1 through 100 and then print it.

The elements of a vector can be accessed individually using square bracket notation, so that v[1] is the first element of v, v[2] is the second element, and so on. Note that vector indexes begin at 1, not 0 as in many other languages like Java and C#.

Coercion

We can apply c() to a list of any types of arguments, even mixed types. However, the elements of a vector must be all the same type, so what happens if we try something like

 
> u = c(42, "Hello", TRUE) 
> typeof(u) 
[1] "character" 
> u 
[1] "42" "Hello" "TRUE" 

This illustrates coercion; each element is coerced into the most general data type. In this case 42 is double, “Hello” is character and TRUE is logical. A character string, in general, can’t be interpreted as a number or a logical variable (true or false), while the other elements can be interpreted as just character strings rather than as actual values. So in this case, c() coerces all the elements to be of type character. When we print u we see all its elements are in quotes, indicating that they are merely strings, not values.

R scripts

For anything more than the odd isolated command, typing R statements at the command line can get tedious, especially if you need to repeat several commands. It’s easier to create an R script file and run that. In RStudio, click on the ‘New’ icon in the top left (a blank sheet with a green circle with a plus sign) and create a new R Script. An empty window will appear in the top left. Any code entered in this window can be run by clicking the Source icon at the top right of this window (or press Ctrl + Shift + S). This just runs the code but doesn’t produce any output. If you want to see the output, you can open the drop-down menu to the right of the Source icon and select “Source with echo” (or press Ctrl + Shift + Enter). This will echo the code as well as the output in the console.

It’s also worth noting that any R objects created by your code (either in the console or from running a script) remain in your environment until you clear it. They are visible in RStudio in the top right panel under the Environment tab.

Anyway, back to vectors. Here are a few things you can do with vectors that illustrate some of the built-in R functions and operators.

 
x = 1:10
y = seq(2, 20, by = 2)
x
y
x + y    # Add corresponding elements
x - y    # Subtract corresponding elements
x * y    # Multiply corresponding elements
x / y    # Divide corresponding elements
x %*% y  # Inner product (produces 1 value)
crossprod(x, y)  #Caution! Same as x %*% y
sqrt(x)  # Square root of each element
x^3      # Cube of each element
sum(x)   # Add up all elements
mean(x)  # Mean of elements
var(x)   # Variance of elements
x = c(x, 11:15) # Add 11:15 to end of x and reassign x to result
y = c(y, seq(22, 30, by = 2))  # Extend y similarly
x
y
z = 1:5
x + z    # Add vectors of different lengths
q = 1:6
y + q    # Longer length not a multiple of shorter length

Try copying and pasting this code into RStudio and run it to see what you get. Line 2 shows the seq() function which generalizes the colon operator by allowing you to specify a step size with ‘by’. The four standard arithmetic operators each operate on every element in the two vectors, so x + y performs x[1] + y[1], x[2] + y[2] and so on.

The %*% operator on line 9 performs an inner product, which is essentially the same thing as a dot or scalar product between the two vectors, equal to x[1] * y[1] + x[2] * y[2] + … There is also a function called, confusingly, crossprod() which does the same thing. This is NOT the cross or vector product that you may be familiar with from linear algebra! As far as I can tell, if you want the cross product you’ll need to write an R function to do that yourself (though it’s not hard).

The caret ^ is the exponentiation operator, and operates on each vector element separately. sum(), mean() and var() calculate the sum, mean and variance of the elements in the vector, so each returns a single value.

It’s possible to extend a vector using c() as shown in lines 17 and 18. We add 11:15 to the end of x and then reassign x to be this longer vector.

Finally, it’s worth noting what R does if we use vectors of different length in the arithmetic operations. We create a vector z of length 5 and add it to x, which is now length 15. R repeats the shorter vector enough times to make it fit the longer vector, so that x + z produces a vector with elements [1+1, 2+2, 3+3, 4+4, 5+5, 6+1, 7+2, 8+3,…]. If the longer vector’s length is a multiple of the shorter vector’s length, R will do the computation silently, but if this isn’t the case, as with y+q, it will still wrap the shorter vector enough times to match the longer one, but you’ll get a warning that the length of the longer vector isn’t a multiple of that of the shorter.

Some commands won’t work on vectors of different lengths. For example, the %*% operator requires two vectors of equal length.

Advertisements
Post a comment or leave a trackback: Trackback URL.

Trackbacks

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: