Exercise 1: Brief introduction to R

Preliminaries

R is at once a statistical software package, a scientific computing environment, and a programming language. R is free and open source.

R can be intimidating, but most users of R are not doing R programming. However, using R does require that you enter commands into a command-line interface.

The two main components of an R session are objects and functions. Objects are variables, data, and results that we have input, uploaded, or created in memory. Functions are special types of objects that take one or more arguments and then do something (e.g., create a new object, make a visualization, or write a file).

R is built on contributed packages, which mostly contain libraries of new, related R functions. Most contributed packages are stored in a public repository called CRAN (the Comprehensive R Archive Network).

Basic R commands

The best way to learn how to use R is by doing, so let's open a new R session and proceed. For all the code chunks below it should be possible to copy & paste from this window into your R session to reproduce the operations we have conducted here. Of course you can also retype the commands if you prefer. This helps build a familiarity with the environment if you are new to R.

We can create objects using the assignment operator, an arrow that points towards the object that is being created:

n <- 15
n ## this tells us some information about the object, in this case its value

## [1] 15

## we can assign objects in either direction
5 -> m
m

## [1] 5

## objects are case sensitive
x <- 1
X <- 10
x

## [1] 1

X

## [1] 10

## all basic mathematical operations are available in R
(10 + 2) * 5

## [1] 60

(X / 500) * 12

## [1] 0.24
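
Beyond the four basic operations, R also provides operators for exponentiation, integer division, and remainders, along with the usual mathematical functions. A short sketch (the particular numbers are arbitrary):

```r
2^10                 ## exponentiation
17 %/% 5             ## integer division
17 %% 5              ## remainder (modulo)
sqrt(16)             ## square root
log(100, base = 10)  ## logarithm with an explicit base
```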

There are many functions that can help us manage our R session. One very useful one is ls, which lists all the R objects in the current session:

name <- "Liam"
n1<-10
n2<-100
m<-0.5
ls()

## [1] "m"    "n"    "n1"   "n2"   "name" "x"    "X"

We can also supply it with a pattern:

ls(pattern="m") ## lists all objects whose names contain "m"

## [1] "m"    "name"

## the function ls.str also displays some information about
## the objects we have in memory
ls.str()

## m :  num 0.5
## n :  num 15
## n1 :  num 10
## n2 :  num 100
## name :  chr "Liam"
## x :  num 1
## X :  num 10

How did we know which arguments could be passed to ls or ls.str? If these are functions that we are already familiar with & we just want to remind ourselves, we can use the very helpful function args:

args(ls)

## function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
##     pattern, sorted = TRUE) 
## NULL

args(ls.str)

## function (pos = -1, name, envir, all.names = FALSE, pattern, 
##     mode = "any") 
## NULL

args(args)

## function (name) 
## NULL
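
The argument names reported by args can be used directly in a function call. As a small illustration using seq (a base R function not otherwise used in this exercise), arguments can be matched by position or by name, and named arguments can be given in any order:

```r
a <- seq(1, 10, 3)                   ## matched by position
b <- seq(from = 1, to = 10, by = 3)  ## matched by name
d <- seq(by = 3, to = 10, from = 1)  ## named arguments in any order
a
identical(a, b)
identical(a, d)
```

All three calls should produce the same sequence. Naming arguments tends to make longer calls easier to read.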

However, more typically, we might want to examine the help pages for the function. All packages (at least all packages stored on CRAN) have a help page for each function in the package. In many cases (but not always) documentation is extensive - but it may be somewhat unfamiliar to new users. Let's look at a page:

help(ls)

## starting httpd help server ...

##  done

help.search("anova")
help.search("phylogeny")
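
A few related tools can also help when the exact name of a function is unknown. The sketch below uses apropos, which searches the names of objects currently available in the session, and notes (as a comment) two common shortcuts. The search string is just an example:

```r
## ?ls is shorthand for help(ls), and ??anova for help.search("anova")
hits <- apropos("colMeans")  ## object names that contain "colMeans"
hits
"colMeans" %in% hits
```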

Types of data objects in R

There are five main types of data object in R: vector, factor, matrix, data frame, and list. All data objects have attributes and values.

Vector: a vector is a series of elements of the same type. It has two attributes: mode and length. Let's look at a few vectors of different types.

# mode "numeric"
x<-1:5
x

## [1] 1 2 3 4 5

mode(x)

## [1] "numeric"

length(x)

## [1] 5

# mode "logical"
y<-c(FALSE,TRUE)
y

## [1] FALSE  TRUE

mode(y)

## [1] "logical"

length(y)

## [1] 2

# logical vectors can also result from logical operations
x>=3

## [1] FALSE FALSE  TRUE  TRUE  TRUE

# mode "character"
z<-c("order","superfamily","family","genus","species")
mode(z)

## [1] "character"

length(z)

## [1] 5

z

## [1] "order"       "superfamily" "family"      "genus"       "species"
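
Because all elements of a vector must share a single type, R silently converts mixed input to the most general type present. A brief sketch of this coercion (v1 and v2 are hypothetical names):

```r
v1 <- c(1, TRUE, FALSE)  ## logical values are converted to numbers
v1
mode(v1)
v2 <- c(1, "a", TRUE)    ## everything is converted to character
v2
mode(v2)
```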

We can access individual elements in a vector using numerical indexing. For example:

z[2]

## [1] "superfamily"

z[c(1,3)]

## [1] "order"  "family"

i<-c(4,5)
z[i]

## [1] "genus"   "species"

z[c(1,1,1)]

## [1] "order" "order" "order"

# negative index removes the corresponding element
z[-2]

## [1] "order"   "family"  "genus"   "species"

or with logical indexing, for example:

z[c(TRUE,FALSE,TRUE,TRUE,TRUE)]

## [1] "order"   "family"  "genus"   "species"

Logical indexing, optionally combined with the function which, is a useful way to select only those elements of a vector that satisfy some condition:

x<-runif(n=6,min=0,max=10)
x

## [1] 2.737364 3.270049 9.766799 5.868191 7.553235 7.995842

x>=5

## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE

which(x>=5)

## [1] 3 4 5 6

x[x>=5]

## [1] 9.766799 5.868191 7.553235 7.995842

x[which(x>=5)]

## [1] 9.766799 5.868191 7.553235 7.995842
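
Logical conditions can also be combined with & (and), | (or), and ! (not) before indexing. The sketch below uses a fixed vector v (a hypothetical name, with values copied from the random draw printed above) so that the result is reproducible:

```r
v <- c(2.737364, 3.270049, 9.766799, 5.868191, 7.553235, 7.995842)
v[v >= 5 & v < 8]     ## elements that are at least 5 but less than 8
which(v < 3 | v > 9)  ## positions of the smallest & largest values
v[!(v >= 5)]          ## elements that fail the condition
```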

Vectors (and other R objects) sometimes, but not always, have a third attribute: names. In phylogenetic analysis, our vectors will very frequently have names!

x<-1:5
names(x)<-z
x

##       order superfamily      family       genus     species 
##           1           2           3           4           5
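
Once a vector has names, its elements can be selected by name as well as by position. A minimal sketch, rebuilding the same named vector under the hypothetical name ranks:

```r
ranks <- setNames(1:5, c("order", "superfamily", "family", "genus", "species"))
ranks["genus"]                ## select a single element by name
ranks[c("species", "order")]  ## elements are returned in the order requested
names(ranks)[ranks >= 4]      ## names of the elements that meet a condition
```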

A factor is derived from a vector, but it has the additional attribute of levels:

f<-c("Male","Male","Female","Female","Female")
f<-factor(f)
f

## [1] Male   Male   Female Female Female
## Levels: Female Male

# or we could build the same factor from numeric codes
# (note that the levels are now in the order Male, Female)
f<-c(0,0,1,1,1)
f<-factor(f)
levels(f)<-c("Male","Female")
table(f)

## f
##   Male Female 
##      2      3

summary(f)

##   Male Female 
##      2      3
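
Internally, a factor stores integer codes that point to its levels, and the order of the levels can be set explicitly rather than left alphabetical. A short sketch, using the hypothetical name sex:

```r
sex <- factor(c("Male", "Male", "Female", "Female", "Female"),
              levels = c("Male", "Female"))  ## set the level order explicitly
levels(sex)
as.integer(sex)  ## the underlying integer codes
nlevels(sex)
```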

A matrix is a vector arranged in a tabular way. It has the additional attribute dim. This can be seen using the following example:

X<-matrix(1:9,3,3)
X

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

X<-1:9
dim(X)<-c(3,3)
X

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

# this is also handy
X<-matrix(1:9,3,3,byrow=TRUE)
X

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

We can call elements of a matrix using numerical indexing as well, in row/column order:

X[3,2]

## [1] 8

X[,3] # the third column

## [1] 3 6 9

X[2,] # the second row

## [1] 4 5 6
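
Matrices can also carry row & column names, which allow indexing by name. Note too that extracting a single row or column returns a plain vector unless drop=FALSE is specified. A sketch with a hypothetical matrix M & made-up dimension names:

```r
M <- matrix(1:9, 3, 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), c("a", "b", "d")))
M["r3", "b"]                 ## the same element as M[3,2]
dim(M)
M[, "d"]                     ## a named vector
dim(M[, "d", drop = FALSE])  ## still a matrix, with a single column
```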

A data frame is a very important type of object in R as well. It looks like a matrix (although it's actually stored as a list, see below). It is the type of data object that is created by reading (say) a spreadsheet from a file.

Y<-data.frame(z,y=1:5,x=5:1)
Y

##             z y x
## 1       order 1 5
## 2 superfamily 2 4
## 3      family 3 3
## 4       genus 4 2
## 5     species 5 1
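
Columns of a data frame can be extracted by name, and rows can be selected with the same logical indexing used for vectors. A small sketch, using a hypothetical data frame D with the same layout as Y:

```r
D <- data.frame(rank = c("order", "superfamily", "family", "genus", "species"),
                y = 1:5, x = 5:1)
D$y               ## one column, as a vector
D[D$y > 3, ]      ## the rows for which y exceeds 3
dim(D)            ## number of rows & columns
sapply(D, class)  ## the type of each column
```

In versions of R before 4.0.0, character columns are converted to factors by default unless stringsAsFactors=FALSE is supplied, so the reported class of the first column depends on the R version.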

Finally, a list is the most general data structure. It is the way we store all kinds of custom data types (including, most importantly for our purposes, phylogenetic trees). A list can be seen as a vector in which each element can be any kind of object. For example:

L<-list(z=z,1:2,Y)
length(L)

## [1] 3

names(L)

## [1] "z" ""  ""

There are multiple ways that we can access the elements of a list:

L[[1]]

## [1] "order"       "superfamily" "family"      "genus"       "species"

L$z

## [1] "order"       "superfamily" "family"      "genus"       "species"
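
The distinction between single & double square brackets matters for lists: single brackets return a (shorter) list, while double brackets or $ return the element itself. A sketch using a hypothetical list P:

```r
P <- list(z = c("genus", "species"), n = 1:2)
class(P["z"])      ## single brackets: still a list
class(P[["z"]])    ## double brackets: the character vector itself
P[["n"]][2]        ## index into an extracted element
P$label <- "demo"  ## new elements can be added by name
names(P)
```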

Reading & writing data from & to files

Many utilities in R exist to read & write data from & to files. We'll cover reading & writing phylogenetic data in a subsequent section, but let's first just try reading & writing a couple of different input & output formats.

First, let's use read.csv to read in a .csv (comma delimited) file from disk. To follow along, download the file anole.data.csv & save it in your R working directory.

## this is what comma delimited text looks like:
cat(readLines("anole.data.csv",10),sep="\n")

## "","SVL","HL","HLL","FLL","LAM","TL"
## "ahli",4.03913,2.88266,3.96202,3.34498,2.8662,4.504
## "alayoni",3.8157,2.70212,3.2795,2.80245,3.07527,4.07265
## "alfaroi",3.52665,2.37816,3.30542,2.48366,2.73387,4.41601
## "aliniger",4.03656,2.89884,3.64623,3.15908,3.15677,4.54173
## "allisoni",4.37539,3.35896,3.96069,3.4462,3.23921,5.05911
## "allogus",4.04014,2.86103,3.94018,3.33829,2.80827,4.52189
## "altitudinalis",3.84299,2.85273,3.25665,2.88466,3.19846,4.16762
## "alumina",3.58894,2.41783,3.44101,2.63466,2.69425,4.66676
## "alutaceus",3.55489,2.43405,3.3511,2.56994,2.78245,4.55473

## read it in
X<-read.csv("anole.data.csv",header=TRUE,row.names=1)
## let's look at a bit of this object
head(X)

##              SVL      HL     HLL     FLL     LAM      TL
## ahli     4.03913 2.88266 3.96202 3.34498 2.86620 4.50400
## alayoni  3.81570 2.70212 3.27950 2.80245 3.07527 4.07265
## alfaroi  3.52665 2.37816 3.30542 2.48366 2.73387 4.41601
## aliniger 4.03656 2.89884 3.64623 3.15908 3.15677 4.54173
## allisoni 4.37539 3.35896 3.96069 3.44620 3.23921 5.05911
## allogus  4.04014 2.86103 3.94018 3.33829 2.80827 4.52189
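
Writing works in much the same way. As a sketch, write.csv can save a data frame to a comma-delimited file, which read.csv can then read back. The example below writes to a temporary file (created with tempfile) so that nothing in the working directory is overwritten. The small data frame D is a hypothetical example built from the first two rows & columns shown above:

```r
D <- data.frame(SVL = c(4.03913, 3.8157), HL = c(2.88266, 2.70212),
                row.names = c("ahli", "alayoni"))
tmp <- tempfile(fileext = ".csv")
write.csv(D, file = tmp)  ## row names are written by default
D2 <- read.csv(tmp, header = TRUE, row.names = 1)
all.equal(D, D2)          ## check that the round trip preserved the data
unlink(tmp)               ## delete the temporary file
```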

Basic R scripting

Finally, we can explore some basics of R programming/scripting. For instance, let's use a for loop to compute the average value for each trait across species:

averages<-vector(mode="numeric",length=ncol(X))
for(i in 1:ncol(X)) averages[i]<-mean(X[,i])
names(averages)<-colnames(X)
averages

##      SVL       HL      HLL      FLL      LAM       TL 
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669
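
One caution about loops of the form for(i in 1:n): if n is zero, 1:n counts down from 1 to 0 rather than producing an empty sequence. The functions seq_len & seq_along avoid this, as the following sketch illustrates (n0 & count are hypothetical names):

```r
n0 <- 0
1:n0         ## counts down: 1 0
seq_len(n0)  ## an empty sequence
count <- 0
for (i in seq_len(n0)) count <- count + 1  ## the loop body never runs
count
```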

R also lets us do this in multiple ways. For instance, we can also use apply family functions:

averages<-apply(X,2,mean)
averages

##      SVL       HL      HLL      FLL      LAM       TL 
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669

In addition, R has built-in functions that can be handy here:

averages<-colMeans(X)
averages

##      SVL       HL      HLL      FLL      LAM       TL 
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669

In the above example, we used a for loop around very simple code, but we can also loop over multiple lines of code. Just for demonstrative purposes, let's print out the averages we have computed after each loop:

averages<-setNames(vector(mode="numeric",length=ncol(X)),colnames(X))
for(i in 1:ncol(X)){
    averages[i]<-mean(X[,i])
    print(averages[1:i])
}

##      SVL 
## 4.096289 
##      SVL       HL 
## 4.096289 2.961037 
##      SVL       HL      HLL 
## 4.096289 2.961037 3.826542 
##      SVL       HL      HLL      FLL 
## 4.096289 2.961037 3.826542 3.253969 
##      SVL       HL      HLL      FLL      LAM 
## 4.096289 2.961037 3.826542 3.253969 3.018961 
##      SVL       HL      HLL      FLL      LAM       TL 
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669

averages

##      SVL       HL      HLL      FLL      LAM       TL 
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669

As we saw above, an alternative to the for loop for iterating over the columns or rows of a matrix, or over the elements of a vector or list, is the apply family of functions. For instance:

averages<-apply(X,2,mean)
averages

##      SVL       HL      HLL      FLL      LAM       TL 
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669

This function call can be read as: apply the function mean to X over its second dimension (the columns). The apply family of functions takes some getting used to, but it is very helpful.
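
The same call pattern works over rows, and the related function sapply iterates over the elements of a vector or list. A sketch using a small hypothetical matrix M:

```r
M <- matrix(1:6, 2, 3)
apply(M, 1, sum)  ## over the first dimension (the rows)
apply(M, 2, sum)  ## over the second dimension (the columns)
sapply(list(a = 1:3, b = 4:6), mean)  ## over list elements, simplified to a vector
```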

Finally, it is straightforward to write custom functions within R to perform idiosyncratic tasks. For instance, let's imagine that colMeans does not exist, & create a new function col_means to duplicate its operation:

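col_means<-function(x,na.rm=TRUE){
    ## one slot for the mean of each column of x
    obj<-vector(mode="numeric",length=ncol(x))
    ## column mean: sum of the values over the number of non-missing values
    for(i in 1:ncol(x))
        obj[i]<-sum(x[,i],na.rm=na.rm)/sum(!is.na(x[,i]))
    ## label each mean with its column name
    setNames(obj,colnames(x))
}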
averages<-col_means(X)
averages

##      SVL       HL      HLL      FLL      LAM       TL 
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669
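
One way to gain confidence in a custom function is to compare it against a trusted equivalent on a small test case. The sketch below defines a compact variant, col_means2 (a hypothetical name), built on vapply, and checks it against colMeans on a made-up matrix that contains a missing value:

```r
col_means2 <- function(x, na.rm = TRUE) {
    ## compute the mean of each column, returning one number per column
    obj <- vapply(seq_len(ncol(x)),
                  function(i) mean(x[, i], na.rm = na.rm),
                  numeric(1))
    setNames(obj, colnames(x))
}
M <- matrix(c(1, 2, NA, 4, 5, 6), 3, 2,
            dimnames = list(NULL, c("a", "b")))
col_means2(M)
all.equal(col_means2(M), colMeans(M, na.rm = TRUE))
```

Note that col_means2 removes missing values by default, whereas colMeans does not, so na.rm=TRUE is given explicitly in the comparison.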

Neat.

Developed by Liam J. Revell. Last updated 27 Jun. 2016.