Exercise 1: Brief introduction to R

Preliminaries

R is at the same time a statistical software, a scientific computing environment, and a programming language. R is distributed free and open source.

R can be intimidating, but most users of R are not doing R programming. However use of R requires that you enter commands into a command-line interface.

The two main components of an R session are objects and functions. Objects are variables, data, and results that we have input, uploaded, or created in memory. Functions are special types of objects that take one or more arguments and then do something (e.g., create a new object, make a visualization, or write a file).

R is built on contributed packages, which mostly contain libraries of new, related R functions. Most contributed packages are stored in a public repository called CRAN (the Comprehensive R Archive Network).

Basic R commands

The best way to learn how to use R is by doing, so let's open a new R session and proceed. For all the code chunks below it should be possible to copy & paste from this window into your R session to reproduce the operations we have conducted here. Of course you can also retype the commands if you prefer. This helps build a familiarity with the environment if you are new to R.

We can create objects using the assign operator, a combination of symbols that looks like an arrow (<- or ->) and points towards the object that is being created.

For instance, here we create a variable (“object”) with the name n and value 15:

n <- 15

Entering the name of a variable we have created in the R interface normally gives us some information about the object:

## [1] 15

Use of the assignment operator (<- or ->) also allows us to create objects from either direction. For instance, the following operation creates an object m with value 5:

5 -> m
m

## [1] 5

All objects are case-sensitive. That means that x is different from X and so on:

x <- 1
X <- 10
x

## [1] 1

## [1] 10

All basic mathematical operations are available in R, such as multiplication, division, addition, etc.:

(10 + 2) * 5

## [1] 60

(X / 500) * 12

## [1] 0.24

There are many functions that we can use to facilitate our session in R. Any time we use a function, we enter the name of the function and a pair of round parentheses (( and )) that encircle the arguments taken by the function. Some functions do not require any arguments. This includes the very useful function ls which (in its most basic operation) lists all the objects in the current R session. For instance, here we create some objects (in addition to those we have already created), and then we can use ls to list them:

name <- "Luke"
n1<-10
n2<-100
m<-0.5
ls()

## [1] "m"    "n"    "n1"   "n2"   "name" "x"    "X"

The function ls can also take various input arguments. For instance we can supply it with a pattern so that it lists only objects whose names include the specified pattern, as follows:

ls(pat="m")

## [1] "m"    "name"

A closely related function, ls.str, lists all the objects in the current session, but also includes some information about their attributes. For instance:

ls.str()

## m :  num 0.5
## n :  num 15
## n1 :  num 10
## n2 :  num 100
## name :  chr "Luke"
## x :  num 1
## X :  num 10

shows us not only the object names, but also their modes (which we will learn more about below) and values.

How did we know which arguments could be input into ls or ls.str? Well, if these were functions that we were already familiar with & we just wanted to remind ourselves, we can use the very helpful function args:

args(ls)

## function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
##     pattern, sorted = TRUE) 
## NULL

args(ls.str)

## function (pos = -1, name, envir, all.names = FALSE, pattern, 
##     mode = "any") 
## NULL

args(args)

## function (name) 
## NULL

More typically, we might want to examine the help pages for the function. All packages have a help page for each function in the package. In many cases documentation is extensive - but it may be somewhat unfamiliar to new users.

Let's look at the help page for ls:

help(ls)

## starting httpd help server ...

##  done

We can see that each help page contains a section describing the function (Description), a section giving general usage of the function (Usage), and a section listing & describing all the arguments that the function takes (Arguments). Help pages may also include sections with additional information (Details), a section containing information about the object (if any) returned by the function (Value), and a section containing worked examples of the function in use (Examples), as well as sections containing information about the function authors and references.

Note that help(ls) is exactly equivalent to ?ls, so I prefer to use the latter since it is more compact.

If we don't even know the name of the function we want to use, we can search for help using the function help.search. Unlike help and ? which look up help pages only for currently loaded packages, help.search automatically searches across all currently installed packages. For instance:

help.search("anova")
help.search("phylogeny")

Note that you may have different results from these help searches because we have different sets of installed packages.

Types of data objects in R

There are five main types of data object in R: vector, factor, matrix, data frame, and list. All data objects have attributes and values.

Vector: a vector is a series of elements of the same type. It has two attributes: mode and length. Let's look a few vectors of different types.

First, we can create a simple vector of mode numeric containing the values 1 through 5

x<-c(1,2,3,4,5)
x

## [1] 1 2 3 4 5

mode(x)

## [1] "numeric"

length(x)

## [1] 5

Note that this vector could have equivalently been created as follows using the operator :, or the function seq, showing that there is usually many different ways to accomplish the same operation in R:

x<-1:5
x

## [1] 1 2 3 4 5

x<-seq(from=1,to=5,by=1)
x

## [1] 1 2 3 4 5

Next, here is a vector of mode "logical", meaning it contains values of TRUE or FALSE:

y<-c(FALSE,TRUE)
y

## [1] FALSE  TRUE

mode(y)

## [1] "logical"

length(y)

## [1] 2

Although we just created this vector using the function c, logical vectors most often result from logical operations - that is operations whose result is either a TRUE or FALSE. For instance:

x>=3

## [1] FALSE FALSE  TRUE  TRUE  TRUE

Next, here is a vector of mode "character". Vectors of this mode can contain elements that are single characters or strings. For instance:

z<-c("order","superfamily","family","genus","species")
mode(z)

## [1] "character"

length(z)

## [1] 5

## [1] "order"       "superfamily" "family"      "genus"       "species"

For vectors of any mode, we can access individual elements in a vector using numerical indexing. For example:

z[2]

## [1] "superfamily"

z[c(1,3)]

## [1] "order"  "family"

i<-c(4,5)
z[i]

## [1] "genus"   "species"

z[c(1,1,1)]

## [1] "order" "order" "order"

Positive indices, even repeating ones, access individual elements of the vector. By contrast, negative indices removes the corresponding elements. For example:

z[-c(2,3)]

## [1] "order"   "genus"   "species"

obj<-z[-2]
obj

## [1] "order"   "family"  "genus"   "species"

Logical indexing combined with the function which is a useful way to select some data from a vector, but not others:

x<-runif(n=6,min=0,max=10)
x

## [1] 9.6259945 8.0330775 0.6469662 9.4720569 0.8363865 1.2417605

x>=5

## [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE

which(x>=5)

## [1] 1 2 4

x[x>=5]

## [1] 9.625995 8.033078 9.472057

x[which(x>=5)]

## [1] 9.625995 8.033078 9.472057

Vectors (and other R objects) sometimes but don't always have a third attribute, names. In phylogenetic analysis, our vectors will very frequently have names!

x<-1:5
names(x)<-z
x

##       order superfamily      family       genus     species 
##           1           2           3           4           5

A factor is derived from a vector, but it has the additional attribute of levels. For example, here I'll create a factor with two levels Male and Female.

f<-c("Male","Male","Female","Female","Female")
f<-factor(f)
f

## [1] Male   Male   Female Female Female
## Levels: Female Male

table(f)

## f
## Female   Male 
##      3      2

summary(f)

## Female   Male 
##      3      2

Columns in a data file that we read into R that contain non-numeric values will usually be interpreted as factors.

A matrix is vector arranged in a tabular way. It has the additional attribute dim: the dimensions of the matrix. Because a matrix is just a tabularly arranged vector, all the elements of a matrix must have the same mode. This can be seen using the following example:

X<-matrix(1:9,3,3)
X

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

## this is equivalent
X<-1:9
dim(X)<-c(3,3)
X

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

We can call elements of a matrix using numerical indexing as well. When working with matrices, it is very important to remember that the indexing is in row/column order:

X[3,2]

## [1] 6

Similarly, we can access individual rows or columns (themselves vectors) in the following way:

X[,3] # the third column

## [1] 7 8 9

X[2,] # the second row

## [1] 2 5 8

A data frame is a very important type of object in R as well. It looks like a matrix (although it's actually stored as a list, see below). It is the type of data object that is created by reading (say) a spreadsheet from a file.

Y<-data.frame(z,y=1:5,x=5:1)
Y

##             z y x
## 1       order 1 5
## 2 superfamily 2 4
## 3      family 3 3
## 4       genus 4 2
## 5     species 5 1

Finally, a list is the most general data structure. It is the way we store all kinds of custom data types (including, and most importantly - for our purposes, phylogenetic trees). A list can be seen as a vector where each of the elements can be any kind of object (including another list!). For example:

L<-list(z=z,1:2,Y)
length(L)

## [1] 3

## $z
## [1] "order"       "superfamily" "family"      "genus"       "species"    
## 
## [[2]]
## [1] 1 2
## 
## [[3]]
##             z y x
## 1       order 1 5
## 2 superfamily 2 4
## 3      family 3 3
## 4       genus 4 2
## 5     species 5 1

Note that the elements of a list can also have names, though they don't have to:

names(L)

## [1] "z" ""  ""

There are multiple ways that we can access the elements of a list. For instance, if we know the position in the list of the object we're interested it we can use numerical indexing - but here the double square brackets ([[ and ]]):

L[[3]]

##             z y x
## 1       order 1 5
## 2 superfamily 2 4
## 3      family 3 3
## 4       genus 4 2
## 5     species 5 1

If we know the object's name, we can use the $ operator as follows:

L$z

## [1] "order"       "superfamily" "family"      "genus"       "species"

or it's name & square brackets as follows:

L[["z"]]

## [1] "order"       "superfamily" "family"      "genus"       "species"

Reading & writing data from & to files

Many utilities in R exist to read & write data from & to files. We'll cover reading & writing phylogenetic data in a subsequent section, but let's first just try reading & writing a couple of different input & ouput formats.

First, let's use read.csv to read in a .csv (comma delimited) file from memory. To follow along, download the file anole.data.csv.

Comma delimited text is a very common file format for these types of data. Note that usually we will have to convert our file to a 'plain text' format like this one before reading it into R! To see what comma delimited text looks like, you can open the file you have just downloaded - or you can copy & paste the following code to view the first few lines of the file:

cat(readLines("anole.data.csv",10),sep="\n")

## "","SVL","HL","HLL","FLL","LAM","TL"
## "ahli",4.03913,2.88266,3.96202,3.34498,2.8662,4.504
## "alayoni",3.8157,2.70212,3.2795,2.80245,3.07527,4.07265
## "alfaroi",3.52665,2.37816,3.30542,2.48366,2.73387,4.41601
## "aliniger",4.03656,2.89884,3.64623,3.15908,3.15677,4.54173
## "allisoni",4.37539,3.35896,3.96069,3.4462,3.23921,5.05911
## "allogus",4.04014,2.86103,3.94018,3.33829,2.80827,4.52189
## "altitudinalis",3.84299,2.85273,3.25665,2.88466,3.19846,4.16762
## "alumina",3.58894,2.41783,3.44101,2.63466,2.69425,4.66676
## "alutaceus",3.55489,2.43405,3.3511,2.56994,2.78245,4.55473

Now let's read the file. To read it you can use the function read.csv:

X<-read.csv("anole.data.csv",header=TRUE,row.names=1)
class(X)

## [1] "data.frame"

Here is the object we have just created:

##                     SVL      HL     HLL     FLL     LAM      TL
## ahli            4.03913 2.88266 3.96202 3.34498 2.86620 4.50400
## alayoni         3.81570 2.70212 3.27950 2.80245 3.07527 4.07265
## alfaroi         3.52665 2.37816 3.30542 2.48366 2.73387 4.41601
## aliniger        4.03656 2.89884 3.64623 3.15908 3.15677 4.54173
## allisoni        4.37539 3.35896 3.96069 3.44620 3.23921 5.05911
## allogus         4.04014 2.86103 3.94018 3.33829 2.80827 4.52189
## altitudinalis   3.84299 2.85273 3.25665 2.88466 3.19846 4.16762
## alumina         3.58894 2.41783 3.44101 2.63466 2.69425 4.66676
## alutaceus       3.55489 2.43405 3.35110 2.56994 2.78245 4.55473
## angusticeps     3.78860 2.69018 3.23424 2.74548 2.86502 4.17223
## argenteolus     3.97131 2.75342 3.85752 3.26076 3.25025 4.75503
## argillaceus     3.75787 2.61359 3.35404 2.89988 3.02318 4.37048
## armouri         4.12168 3.06894 3.98845 3.37523 2.86689 4.71407
## bahorucoensis   3.82745 2.77540 3.69687 2.91831 2.80896 4.76335
## baleatus        5.05306 3.87847 4.73633 4.24528 3.44588 5.70632
## baracoae        5.04278 3.92857 4.72552 4.19824 3.69051 5.78458
## barahonae       5.07696 3.88092 4.76629 4.25025 3.43950 5.69561
## barbatus        5.00395 4.07397 4.51634 4.19162 3.39410 5.03053
## barbouri        3.66393 2.46801 3.45810 2.84025 2.41364 4.55342
## bartschi        4.28055 3.09878 4.18102 3.56553 3.28801 5.06832
## bremeri         4.11337 2.86044 3.90039 3.30585 2.97009 4.72996
## breslini        4.05111 3.00184 3.94958 3.33550 2.74677 4.82870
## brevirostris    3.87415 2.63920 3.66806 3.19585 3.03143 4.34161
## caudalis        3.91174 2.69233 3.64868 3.19965 2.94333 4.36260
## centralis       3.69794 2.52205 3.27249 2.80047 2.87894 4.38583
## chamaeleonides  5.04235 4.12856 4.52661 4.18213 3.51925 5.03155
## chlorocyanus    4.27545 3.07128 3.92347 3.37616 3.32593 4.97128
## christophei     3.88465 2.69607 3.74315 3.24156 2.98497 4.52612
## clivicola       3.75873 2.62919 3.59003 2.91615 2.85538 4.60138
## coelestinus     4.29797 3.12764 3.96332 3.41641 3.20854 5.04332
## confusus        3.93844 2.75838 3.76970 3.18221 2.85608 4.42371
## cooki           4.09154 2.93022 3.92244 3.30786 3.01479 4.89025
## cristatellus    4.18982 3.02450 4.01065 3.45393 3.08046 4.76218
## cupeyalensis    3.46201 2.31221 3.17127 2.43147 2.56376 4.52332
## cuvieri         4.87501 3.78908 4.69174 4.13141 3.39324 5.65807
## cyanopleurus    3.63016 2.46678 3.48687 2.67912 2.72007 4.74826
## cybotes         4.21098 3.16044 4.08094 3.50367 2.91083 4.85735
## darlingtoni     4.30204 3.21286 3.80977 3.33861 3.13549 4.49981
## distichus       3.92880 2.71507 3.74769 3.28819 3.03557 4.33953
## dolichocephalus 3.90855 2.88824 3.71678 2.89803 2.94351 4.87903
## equestris       5.11399 3.97656 4.75764 4.26757 3.72262 5.75012
## etheridgei      3.65799 2.50607 3.68383 2.89028 2.76044 4.51655
## eugenegrahami   4.12850 2.81780 4.08143 3.49302 3.22573 4.86316
## evermanni       4.16561 3.00815 3.95661 3.43032 3.31310 4.76766
## fowleri         4.28878 3.09388 4.07720 3.53726 2.97009 5.30472
## garmani         4.76947 3.60137 4.51390 3.97805 3.30958 5.52986
## grahami         4.15427 3.04139 3.92669 3.38041 3.28017 4.81155
## guafe           3.87746 2.67639 3.70675 3.15604 2.88914 4.35295
## guamuhaya       5.03695 4.11969 4.58159 4.14393 3.59647 5.15329
## guazuma         3.76388 2.65309 3.05933 2.58081 2.93307 3.72491
## gundlachi       4.18810 3.05589 4.10680 3.48066 2.79925 4.88040
## haetianus       4.31654 3.29590 4.22252 3.61242 2.93190 5.03960
## hendersoni      3.85983 2.81146 3.69235 2.89748 2.90119 4.86071
## homolechis      4.03281 2.83492 3.84678 3.29391 2.93592 4.52981
## imias           4.09969 2.85293 3.98565 3.41402 2.94375 4.65242
## inexpectatus    3.53744 2.42414 3.32925 2.46873 2.74432 4.60069
## insolitus       3.80047 2.66088 3.30379 2.75910 2.75087 3.93131
## isolepis        3.65709 2.60933 3.10660 2.67816 3.00959 3.84291
## jubar           3.95260 2.76726 3.75630 3.21232 2.91083 4.41338
## krugi           3.88650 2.75824 3.74280 3.05050 2.92595 4.89789
## lineatopus      4.12861 3.08489 4.00563 3.40064 2.86179 4.83357
## longitibialis   4.24210 3.13074 4.16673 3.59074 2.85046 4.97288
## loysiana        3.70124 2.55961 3.29042 2.86581 2.90001 4.07280
## lucius          4.19891 3.01269 4.08725 3.54050 3.31943 4.89860
## luteogularis    5.10109 3.95631 4.76912 4.27415 3.69997 5.73405
## macilentus      3.71576 2.50716 3.50014 2.69124 2.70694 4.67887
## marcanoi        4.07948 2.98856 3.93748 3.33499 2.93827 4.81486
## marron          3.83181 2.66263 3.62530 3.15137 2.92108 4.28998
## mestrei         3.98715 2.78904 3.84586 3.25122 2.89939 4.51303
## monticola       3.77061 2.61602 3.77539 2.99206 2.85608 4.71892
## noblei          5.08347 3.98972 4.77654 4.26465 3.74571 5.81288
## occultus        3.66305 2.52932 2.94088 2.49856 2.75722 3.64910
## olssoni         3.79390 2.53732 3.62697 2.82685 2.68045 4.93759
## opalinus        3.83838 2.67525 3.61507 3.07685 3.03601 4.41582
## ophiolepis      3.63796 2.43599 3.33231 2.72045 2.61274 4.49759
## oporinus        3.84567 2.71668 3.26919 2.87864 3.04452 4.14313
## paternus        3.80296 2.68938 3.35633 2.82450 2.99523 4.38717
## placidus        3.77397 2.62394 3.03639 2.66671 2.79536 3.84734
## poncensis       3.82038 2.63844 3.54908 2.90001 2.63391 4.78582
## porcatus        4.25899 3.23143 3.83813 3.29191 3.35154 4.94983
## porcus          5.03803 4.10583 4.51046 4.14064 3.50572 5.20247
## pulchellus      3.79902 2.72121 3.55535 2.90369 2.85477 4.76540
## pumilis         3.46686 2.28442 3.03187 2.59615 2.81442 4.13262
## quadriocellifer 3.90162 2.69606 3.70555 3.15652 2.84395 4.41491
## reconditus      4.48261 3.35907 4.41143 3.79420 3.05076 5.23530
## ricordii        5.01396 3.84360 4.72415 4.20138 3.41388 5.65557
## rubribarbus     4.07847 2.89425 3.96135 3.35641 2.86751 4.56108
## sagrei          4.06716 2.83515 3.85786 3.24267 2.91872 4.77603
## semilineatus    3.69663 2.56225 3.56689 2.71825 2.70627 4.81083
## sheplani        3.68292 2.49486 2.85877 2.45531 2.71918 3.74943
## shrevei         3.98300 2.91355 3.81600 3.20757 2.76989 4.65644
## singularis      4.05800 2.91768 3.68946 3.17729 3.14975 4.59542
## smallwoodi      5.03510 3.95492 4.75021 4.19821 3.73134 5.74107
## strahmi         4.27427 3.14616 4.21885 3.64446 2.99069 4.94861
## stratulus       3.86988 2.71712 3.57823 3.09671 3.09993 4.36548
## valencienni     4.32152 3.19328 3.83121 3.38065 3.20445 4.61083
## vanidicus       3.62621 2.49031 3.31618 2.53310 2.57858 4.54037
## vermiculatus    4.80285 3.65602 4.61175 3.92550 3.30124 5.55271
## websteri        3.91655 2.69056 3.65170 3.19690 2.97354 4.31559
## whitemani       4.09748 3.04389 3.97375 3.36546 2.78226 4.83677

Basic R scripting

Finally, we can explore some basics of R programming/scripting.

Most users of R are not R programmers. Nonetheless it can be very useful to understand some basics of scripting in order to make your use of R more efficient and more reproducible.

To start, we can write a basic for loop. A for loop is useful when we want to repeat an operation a certain, fixed number of times - for instance, when performing the same operation over the rows or columns of a matrix, or across all the phylogenies in a dataset (such as a posterior sample of trees from a Bayesian analysis).

Here, we can use a for loop to compute the average value for each trait (column in our data frame) across species.

The way to read the following code is just that we first create a vector, averages, that is empty of content; then we iterate from 1 through the number of columns in X (ncol(X)), each time computing the average value (using the function mean) for each column of the object. Next, we set the names of averages to match the names of each column in X and print averages to screen:

averages<-vector(mode="numeric",length=ncol(X))
for(i in 1:ncol(X)) averages[i]<-mean(X[,i])
names(averages)<-colnames(X)
averages

##      SVL       HL      HLL      FLL      LAM       TL 
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669

R lets us do this in multiple ways. For instance, we can also use apply family functions. apply family function tend to work with matrices, data frames, and lists and tend to take the form of “apply to X in dimension DIMENSION function FUN”, for instance:

averages<-apply(X=X,MARGIN=2,FUN=mean)

means apply to X in dimension 2 (the columns) the function mean. The result should match what we obtained with our for loop, above.

averages

##      SVL       HL      HLL      FLL      LAM       TL 
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669

apply family functions take some getting used to - but they are very helpful.

Finally, R has custom functions that can be handy here:

averages<-colMeans(X)
averages

##      SVL       HL      HLL      FLL      LAM       TL 
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669

In the above example, we used a for loop around very simple code, but we can also loop over multiple lines of code using curly brackets ({ and }) to encircle the code chunk we'd like to iterate. Just for demonstrative purposes, let's print out the name of the trait that we've computed the average for in each loop:

averages<-setNames(vector(mode="numeric",length=ncol(X)),colnames(X))
for(i in 1:ncol(X)){
    averages[i]<-mean(X[,i])
    if(i==1) cat("------\n")
    cat(paste("Average for",colnames(X)[i],":",averages[i],"\n"))
    if(i==ncol(X)) cat("------\nDone!\n")
}

## ------
## Average for SVL : 4.0962894 
## Average for HL : 2.9610366 
## Average for HLL : 3.8265419 
## Average for FLL : 3.2539689 
## Average for LAM : 3.0189606 
## Average for TL : 4.7236695 
## ------
## Done!

averages

##      SVL       HL      HLL      FLL      LAM       TL 
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669

Finally, it is straightforward to write custom functions within R to perform idiosyncratic tasks. For instance, let's imagine that colMeans does not exist, & create a new function col_means to duplicate its operation:

col_means<-function(x,na.rm=TRUE){
        obj<-vector(mode="numeric",length=ncol(x))
        for(i in 1:ncol(x))
                obj[i]<-sum(x[,i],na.rm=na.rm)/sum(!is.na(x[,i]))
        setNames(obj,colnames(x))
}
averages<-col_means(X)
averages

##      SVL       HL      HLL      FLL      LAM       TL 
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669

Neat.

Developed by Liam J. Revell & Luke J. Harmon. Last updated 31 July 2017.