Data frames in R: basic operations

Creating data frames

The data frame in R is a two-dimensional data structure. The data within each column of a data frame must all be of the same type, but separate columns can contain data of different types. It is probably the most commonly used data structure in R, as its layout resembles that of a spreadsheet. We can create a simple data frame using R’s built-in data editor. We’ll build a data frame that contains the ASCII codes of a few letters. First, we create an empty data frame with two columns; the first column contains the letters and is of type character, and the second column contains the codes and is of type integer. After that, we invoke the data editor using the edit() function:

> ascii = data.frame(Symbol = character(), Code = integer(),
+                    stringsAsFactors = F)
> ascii = edit(ascii)
> ascii
  Symbol Code
1      A   65
2      B   66
3      C   67
4      D   68
5      E   69
6      F   70

Some things to note here:

  1. The function for creating a data frame is data.frame(). If you come from a more traditional object-oriented background, you might think we are calling a function named frame() from an object named data. However, in R, the period or full stop ‘.’ is not a member-access operator; it is just another character that is allowed in variable and function names. Thus the name data.frame is a single function name and doesn’t refer to any object.
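  2. The names Symbol and Code are the labels of the two columns. When defining a data frame, we give each column name together with a vector that supplies its values; here character() and integer() create empty vectors of those types, which fixes the type of each column. There’s no need to put the names in quotes.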
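  3. In versions of R before 4.0.0, a character column in a data frame is by default interpreted as defining factors, which are basically labels we can use to categorize the rows in a data frame (more on this later). If we don’t want strings to be factors, we need to explicitly switch this behaviour off, which is what stringsAsFactors = F does (F is shorthand for FALSE). From R 4.0.0 onwards the default is already FALSE, so the argument is harmless but no longer required.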
  4. When opening the data editor with edit(ascii), make sure to assign the result to a data frame variable, otherwise all your edits will be lost! The edit() function should pop up a separate window in RStudio. Just type in the values you want and then close the window by clicking the little X icon in the upper right.
  5. When R prints out the contents of a data frame, it provides names for the rows if you didn’t specify them yourself. Here the row names are just the numbers 1 through 6; they aren’t part of the data stored in the data frame.
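Because edit() is interactive, the session above can’t be replayed from a script. As a minimal sketch, the same six rows can instead be supplied directly as vectors. The values are copied from the output above; using LETTERS and 65:70 to generate them is a choice made here for brevity, not something the original session does.

```r
# Build the same data frame non-interactively by passing the column vectors.
# stringsAsFactors = FALSE only matters in R versions before 4.0.0.
ascii <- data.frame(Symbol = LETTERS[1:6],
                    Code = 65:70,
                    stringsAsFactors = FALSE)

# data.frame is one function name; the dot is just part of the name.
is.function(data.frame)

# str() shows one character column and one integer column.
str(ascii)
```

A data frame built this way can be used in place of the edited one in the sections that follow.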

Row and column names

If we want to change (or add) the row or column names we can use rownames() or colnames():

> rownames(ascii) = c("alpha", "beta", "gamma", "delta", "epsilon", "zeta")
> colnames(ascii) = c("letter", "asciiCode")
> ascii
        letter asciiCode
alpha        A        65
beta         B        66
gamma        C        67
delta        D        68
epsilon      E        69
zeta         F        70

The row and column names can be used to access individual elements, but somewhat confusingly, these names must now be enclosed in quotes:

> ascii["beta","letter"]
[1] "B"
> ascii["beta","asciiCode"]
[1] 66

We can use the $ notation as with lists to refer to columns, but not rows. A few examples of selecting rows and columns:

> beta = ascii["beta",]
> beta
     letter asciiCode
beta      B        66
> str(beta)
'data.frame':	1 obs. of  2 variables:
 $ letter   : chr "B"
 $ asciiCode: num 66
> ascii$letter
[1] "A" "B" "C" "D" "E" "F"
> str(ascii$letter)
 chr [1:6] "A" "B" "C" "D" "E" "F"

In the first command, we select row beta. The str() function shows the structure of a variable; in this case we see that the isolated row is itself a data frame. However, when we isolate the letter column with ascii$letter, we see that it is a character vector, not a data frame.
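The difference between getting a data frame back and getting a bare vector back can be checked directly with class(). The sketch below rebuilds the renamed data frame from vectors so that it runs on its own; the variable names row_beta, col_vector, col_same and col_frame are made up for this illustration.

```r
ascii <- data.frame(letter = LETTERS[1:6], asciiCode = 65:70,
                    stringsAsFactors = FALSE)
rownames(ascii) <- c("alpha", "beta", "gamma", "delta", "epsilon", "zeta")

row_beta   <- ascii["beta", ]     # a single row is still a data frame
col_vector <- ascii$letter        # $ returns the bare column vector
col_same   <- ascii[["letter"]]   # [[ ]] returns the same vector as $
col_frame  <- ascii["letter"]     # one index in [ ] gives a one-column data frame

class(row_beta)     # "data.frame"
class(col_vector)   # "character"
class(col_frame)    # "data.frame"
```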

Adding rows and columns

We can add extra rows or columns using rbind() or cbind():

> x = cbind(ascii,Reverse = c(6,5,4,3,2,1))
> x
        letter asciiCode Reverse
alpha        A        65       6
beta         B        66       5
gamma        C        67       4
delta        D        68       3
epsilon      E        69       2
zeta         F        70       1
> x = rbind(ascii,eta = c("G",71))
> x
        letter asciiCode
alpha        A        65
beta         B        66
gamma        C        67
delta        D        68
epsilon      E        69
zeta         F        70
eta          G        71

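These functions create and return a new data frame by adding to an existing data frame, so don’t forget to save the result in a variable. One thing to watch out for: c("G",71) is a character vector, because all the elements of a vector must have the same type, so binding it as a row silently converts the asciiCode column to character strings. To keep the column numeric, bind a one-row data frame rather than a vector.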
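To see the effect of the row’s type on the result, here is a small sketch comparing the two ways of adding the row. It rebuilds ascii from vectors so that it runs on its own, and the names coerced, eta and kept are introduced only for this example.

```r
ascii <- data.frame(letter = LETTERS[1:6], asciiCode = 65:70,
                    stringsAsFactors = FALSE)
rownames(ascii) <- c("alpha", "beta", "gamma", "delta", "epsilon", "zeta")

# c("G", 71) is the character vector c("G", "71"), so the numeric
# column is converted to character when the row is bound.
coerced <- rbind(ascii, eta = c("G", 71))
class(coerced$asciiCode)   # "character"

# Binding a one-row data frame keeps each column's own type.
eta <- data.frame(letter = "G", asciiCode = 71L, stringsAsFactors = FALSE)
rownames(eta) <- "eta"
kept <- rbind(ascii, eta)
class(kept$asciiCode)      # "integer"
```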

Deleting rows and columns

Deleting rows and columns using the index numbers of the desired rows and columns is done using the - (minus) operator:

> x[-2,]
        letter asciiCode
alpha        A        65
gamma        C        67
delta        D        68
epsilon      E        69
zeta         F        70
eta          G        71
> x[-c(2:4),]
        letter asciiCode
alpha        A        65
epsilon      E        69
zeta         F        70
eta          G        71
> x[,-1]
[1] "65" "66" "67" "68" "69" "70" "71"

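The first expression deletes row 2, the second deletes rows 2 through 4, and the third deletes the first column, leaving only column 2, which R simplifies to a vector. The values appear in quotes because the earlier rbind() call with c("G",71) turned this column into character strings. All these commands create and return a new data frame (or vector) without modifying the original.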
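If a data frame is wanted even when only one column is left, the drop argument of the [ operator controls the simplification. A minimal sketch, using a freshly built x with an integer column rather than the character one produced above:

```r
x <- data.frame(letter = LETTERS[1:7], asciiCode = 65:71,
                stringsAsFactors = FALSE)

as_vector <- x[, -1]                 # one column left: simplified to a vector
as_frame  <- x[, -1, drop = FALSE]   # drop = FALSE keeps a one-column data frame

class(as_vector)   # "integer"
class(as_frame)    # "data.frame"
dim(x)             # the original x still has 7 rows and 2 columns
```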

Deleting using row and column names is a bit trickier. For columns, we can delete by setting the column to NULL, but be warned that this deletes the column from the original data frame! If you want to save the original and produce a new data frame with the column deleted, create a copy of the original first:

> xsave = x
> x$letter = NULL
> x
        asciiCode
alpha          65
beta           66
gamma          67
delta          68
epsilon        69
zeta           70
eta            71
> xsave
        letter asciiCode
alpha        A        65
beta         B        66
gamma        C        67
delta        D        68
epsilon      E        69
zeta         F        70
eta          G        71

We copy x to xsave and then delete the letter column. We see that x has this column deleted but xsave remains unaltered. We can’t use the $ notation to delete rows.
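The same steps can be checked in a script by comparing the column names before and after. The sketch below also shows subset() with a negated column name as one way of getting the reduced data frame without touching the original; no_letter is a name invented for this example.

```r
x <- data.frame(letter = LETTERS[1:7], asciiCode = 65:71,
                stringsAsFactors = FALSE)

xsave <- x          # an independent copy: later changes to x don't affect it
x$letter <- NULL    # removes the column from x itself

names(x)            # "asciiCode"
names(xsave)        # "letter" "asciiCode"

# subset() returns a new data frame and leaves its input alone.
no_letter <- subset(xsave, select = -letter)
names(no_letter)    # "asciiCode"
```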

A safer, non-destructive method that works for both rows and columns is as shown:

> y = cbind(ascii,Reverse = c(6,5,4,3,2,1))
> y
        letter asciiCode Reverse
alpha        A        65       6
beta         B        66       5
gamma        C        67       4
delta        D        68       3
epsilon      E        69       2
zeta         F        70       1
> !colnames(y) %in% c("letter", "Reverse")
[1] FALSE  TRUE FALSE
> y[,!colnames(y) %in% c("letter", "Reverse")]
[1] 65 66 67 68 69 70
> y
        letter asciiCode Reverse
alpha        A        65       6
beta         B        66       5
gamma        C        67       4
delta        D        68       3
epsilon      E        69       2
zeta         F        70       1

We first show y as it starts out. The expression !colnames(y) %in% c("letter", "Reverse") looks rather cryptic, but it produces the logical vector shown beneath it. The colnames() function returns a vector of the column names of y. The %in% operator tests each element in its left operand to see if it is present in its right operand and generates a logical vector with TRUE if the item is present and FALSE if it isn’t. The ! operator is applied after %in% and negates the result, so the vector is TRUE exactly for the columns we want to keep. If we then pass this logical vector as the column index for y, only those columns in a TRUE location are kept, so the result is that the letter and Reverse columns are deleted. Since only one column remains, the result is again simplified to a vector. Finally we print out y to show that it’s unaltered. The same technique works for rows with rownames(y) replacing colnames(y).
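As a sketch of the row version, the following removes two rows by name. It rebuilds y from vectors so that it runs on its own, and the names keep_rows, z and w are made up for the illustration.

```r
y <- data.frame(letter = LETTERS[1:6], asciiCode = 65:70, Reverse = 6:1,
                stringsAsFactors = FALSE)
rownames(y) <- c("alpha", "beta", "gamma", "delta", "epsilon", "zeta")

# TRUE for every row whose name is NOT in the list of rows to delete.
keep_rows <- !rownames(y) %in% c("beta", "delta")
z <- y[keep_rows, ]
rownames(z)   # "alpha" "gamma" "epsilon" "zeta"

# For columns, drop = FALSE keeps a data frame when one column remains.
w <- y[, !colnames(y) %in% c("letter", "Reverse"), drop = FALSE]

nrow(y)       # y itself still has all 6 rows
```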
