Reading an R data frame from a file; Customized coercion for date-times

Reading a data file into a data frame

For any realistic use of data frames, we’ll be dealing with large sets of data, usually stored in an external file. R has a number of functions for reading data from various file types, but we’ll look at one of the simplest here, which is reading from .csv (comma-separated values) files. CSV files are produced by many applications, including popular spreadsheets such as Excel and LibreOffice. Data in a CSV file are given in rows, with each row consisting of a fixed number of columns separated by commas. For illustration, I’ll use a data file containing weather readings for April 2014 taken from my weather station. There are 25 columns in this file, giving data on things like temperature, rainfall, wind speed and direction and so on. We’ll load this file into R and then do a few manipulations of the data.

> april2014 = read.csv("april2014.csv", stringsAsFactors = FALSE)
> str(april2014)
'data.frame':	2880 obs. of  25 variables:
 $ dateTime       : chr  "2014-04-01 00:00:00" "2014-04-01 00:15:00" "2014-04-01 00:30:00" "2014-04-01 00:45:00" ...
 $ archiveInterval: int  15 15 15 15 15 15 15 15 15 15 ...
 $ iconFlags      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ moreFlags      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ packedTime     : int  15 30 45 60 75 90 105 120 135 150 ...
 $ outsideTemp    : num  6.72 6.67 6.67 6.61 6.56 ...
 $ hiOutsideTemp  : num  6.78 6.72 6.67 6.67 6.61 ...
 $ lowOutsideTemp : num  6.72 6.67 6.61 6.61 6.56 ...
 $ insideTemp     : num  21.1 20.9 20.8 20.8 20.5 ...
 $ barometer      : num  1014 1014 1014 1014 1014 ...
 $ outsideHum     : int  94 94 94 94 94 94 94 94 94 94 ...
 $ insideHum      : int  53 53 53 53 53 53 53 53 53 53 ...
 $ rain           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hiRainRate     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ windSpeed      : num  6.44 6.44 6.44 8.05 8.05 ...
 $ hiWindSpeed    : num  17.7 12.9 16.1 17.7 19.3 ...
 $ windDirection  : int  3 3 3 3 3 3 3 3 3 2 ...
 $ hiWindDirection: int  4 4 3 5 3 1 4 4 3 2 ...
 $ numWindSamples : int  342 341 341 343 343 343 342 342 342 342 ...
 $ solarRad       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hiSolarRad     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ UV             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hiUV           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ DayTime        : num  1 1.01 1.02 1.03 1.04 ...
 $ Year           : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...

We use the read.csv() function to read the file into a data frame. All the data are numeric with the exception of the dateTime column, which contains the date and time as a character string. Passing stringsAsFactors = FALSE prevents R from interpreting dateTime as a factor (this has been the default since R 4.0.0, but older versions converted strings to factors unless told otherwise). The str() output above shows the structure of the resulting data frame. The weather station records data every 15 minutes, so dateTime starts at midnight on April 1 and advances in 15-minute intervals.
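If you don’t have a weather file to hand, the same read can be tried on a tiny stand-in. The following is only a sketch: the three columns and their values are a made-up extract (the real file has 25 columns), and the names csvLines, tmpFile and weatherSample are invented for the illustration.

```r
# Hypothetical three-row extract written to a temporary file;
# the real april2014.csv has 25 columns and 2880 rows.
csvLines = c(
  "dateTime,outsideTemp,outsideHum",
  "2014-04-01 00:00:00,6.72,94",
  "2014-04-01 00:15:00,6.67,94",
  "2014-04-01 00:30:00,6.67,94"
)
tmpFile = tempfile(fileext = ".csv")
writeLines(csvLines, tmpFile)

# stringsAsFactors = FALSE keeps dateTime as character rather than factor.
weatherSample = read.csv(tmpFile, stringsAsFactors = FALSE)

# Inspect the structure and the class that read.csv chose for each column.
str(weatherSample)
sapply(weatherSample, class)
```

Here read.csv guesses the column types from the contents: whole numbers become integer, numbers with a decimal point become numeric, and anything else stays as character.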

Converting a date-time string to a date-time object

It’s useful to convert the character strings giving the date and time to a proper date-time object. Unfortunately, the functions for doing this have non-intuitive names. There is a function called as.Date() but it returns only the date part, ignoring the time. If we want a proper date-time variable, we can use as.POSIXct() (I told you it was non-intuitive!). The acronym POSIX stands for Portable Operating System Interface (the X is a nod to the standard’s Unix heritage) and is a collection of IEEE standards. The ‘ct’ stands for ‘calendar time’. We can convert the dateTime column to POSIXct as follows:

> april2014[,"dateTime"] = as.POSIXct(april2014[,"dateTime"])
> str(april2014$dateTime)
 POSIXct[1:2880], format: "2014-04-01 00:00:00" "2014-04-01 00:15:00" ...

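As our date-time data are already in a standard format, we don’t need to specify the format for as.POSIXct(). Unless a tz argument is given, the strings are interpreted in the current local time zone. If the date-time is in some other format, we can specify it explicitly with the format argument. The call below spells out the default format just to show the syntax: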

> april2014[,"dateTime"] = as.POSIXct(april2014[,"dateTime"], format="%Y-%m-%d %H:%M:%S")

Other date formats are possible; the R help entry for strptime gives the details. [To get help for this command, type ?strptime at the R console prompt in RStudio. The help will appear in the lower right panel.]
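To see the format argument doing some real work, here is a minimal sketch using strings in day/month/year order with no seconds. The strings are made up, and tz = "UTC" is an assumption added so that the result doesn’t depend on the time zone of the machine running the code.

```r
# Hypothetical date-time strings in day/month/year order, without seconds.
rawTimes = c("01/04/2014 00:00", "01/04/2014 00:15")

# tz = "UTC" is an assumption for this sketch; omit it to use local time.
parsed = as.POSIXct(rawTimes, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Print the parsed values back in the default layout.
format(parsed, "%Y-%m-%d %H:%M:%S")

# as.Date() keeps only the date part and drops the time.
as.Date(parsed)

# A string that doesn't match the format gives NA rather than an error.
mismatch = as.POSIXct("2014-04-01 00:00:00", format = "%d/%m/%Y %H:%M", tz = "UTC")
is.na(mismatch)
```

The silent NA on a mismatch is worth remembering: after converting a column, it’s a good idea to check it for NA values with something like anyNA().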

Reading data by specifying column classes

There is another way of reading the data that avoids the necessity of converting character strings to POSIXct date-time objects after reading. We can specify the classes (data types) of the columns in the CSV file as part of the read.csv command. In our example with the weather data, we know that all columns contain numerical data except the first, which is a date-time string in the default format that as.POSIXct() understands, so we can create a vector specifying these data types and pass it to read.csv.

> classes = c("POSIXct", rep("numeric", times = 24))
> april2014 = read.csv("april2014.csv", colClasses = classes) 

We’ve used the rep() function to generate a vector containing 24 strings, all saying "numeric", and concatenated it onto a "POSIXct" string.

This gives a slightly different structure to the data frame april2014, as all columns except the first are now of type numeric rather than some being numeric and some being integer, but we could fine-tune the data types by giving a more detailed classes vector if we wanted to.
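A more detailed classes vector might look like the following sketch. It uses a made-up three-column extract rather than the full 25-column file, and the names csvLines, tmpFile, detailedClasses and weatherTyped are invented for the illustration.

```r
# Hypothetical three-column extract of the weather data.
csvLines = c(
  "dateTime,outsideTemp,outsideHum",
  "2014-04-01 00:00:00,6.72,94",
  "2014-04-01 00:15:00,6.67,94"
)
tmpFile = tempfile(fileext = ".csv")
writeLines(csvLines, tmpFile)

# One class per column, in column order: date-time, decimal, whole number.
detailedClasses = c("POSIXct", "numeric", "integer")
weatherTyped = read.csv(tmpFile, colClasses = detailedClasses)

str(weatherTyped)
```

For the full file the vector would need 25 entries, one for each column, in the same order as the columns appear in the file.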

We cheated a bit here, since this works only if the date-times are in the default POSIXct format as shown above. It is possible to tell read.csv the format of a date-time that isn’t in the default form, but it’s a bit tricky.

The technique relies on the fact that what read.csv does when given a colClasses vector is try to coerce the raw character string read from the CSV file into the data type specified for that column. In order for this to work, there needs to be what is known as an ‘as’ function that performs this coercion (like the as.POSIXct() function we used above to coerce the string to a POSIXct object). R provides ‘as’ functions for all the basic data types like numeric and also a few other data types like POSIXct. However, it’s possible to create your own data type and write an ‘as’ function that coerces a string (or, indeed, any other data type) into that new data type. We can use this method to read date-times in a non-standard format. Here’s the code:

> setClass("myDateTime")
> setAs("character","myDateTime", function(from) 
+ as.POSIXct(from, format="%Y-%m-%d %H:%M:%S") )
> customClasses = c("myDateTime", rep("numeric", times = 24))
> april2014 = read.csv("april2014.csv", colClasses = customClasses)
> str(april2014)
'data.frame':	2880 obs. of  25 variables:
 $ dateTime       : POSIXct, format: "2014-04-01 00:00:00" "2014-04-01 00:15:00" "2014-04-01 00:30:00" ...

First we call setClass() to define a new class called myDateTime. Then we use setAs() to define a coercion from character to myDateTime. setAs() takes 3 arguments (in its most basic form). The first is the data type we want to coerce from, the second is the data type to coerce to, and the third is a function that takes a single argument, which must be an instance of the ‘from’ data type. This function returns the coerced value that stands in for the ‘to’ data type. In this case, the function uses the built-in as.POSIXct() function to coerce the date-time string with the given format to a POSIXct object. In R, functions can be passed as arguments to other functions, and the value of the last expression evaluated in a function is that function’s return value.

As can be seen in the structure of april2014, the dateTime column in the data frame has the POSIXct data type.
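The example above still uses the default format, so here is a sketch of the same technique applied to a date-time that really isn’t in the default form. Everything specific in it is an assumption made for illustration: the class name dmyDateTime, the two-column sample data, the day/month/year layout, and the choice of tz = "UTC" to keep the result independent of the local time zone.

```r
library(methods)  # setClass(), setAs() and as() live here

# "dmyDateTime" is a hypothetical class name used only as a label for read.csv.
setClass("dmyDateTime")

# Coercion from character: parse day/month/year strings into POSIXct.
setAs("character", "dmyDateTime",
      function(from) as.POSIXct(from, format = "%d/%m/%Y %H:%M", tz = "UTC"))

# Made-up two-column file with date-times in the non-default layout.
csvLines = c(
  "dateTime,outsideTemp",
  "01/04/2014 00:00,6.72",
  "01/04/2014 00:15,6.67"
)
tmpFile = tempfile(fileext = ".csv")
writeLines(csvLines, tmpFile)

# read.csv applies the coercion to the first column as it reads the file.
weather = read.csv(tmpFile, colClasses = c("dmyDateTime", "numeric"))
str(weather)

# The same coercion can also be called directly.
single = as("01/04/2014 00:30", "dmyDateTime")
```

Note that the column ends up as an ordinary POSIXct object, not as an instance of the new class. The class name serves only as a hook that lets read.csv find our conversion function.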

Clearly there are a lot of techniques that we’ve glossed over here, but we’ll hopefully return to these in later posts for a more thorough understanding of how R handles classes and functions.
