R is at the same time a statistical software, a scientific computing environment, and a programming language. R is distributed free and open source.
R can be intimidating, but most users of R are not doing R programming. However use of R requires that you enter commands into a command-line interface.
The two main components of an R session are objects and functions. Objects are variables, data, and results that we have input, uploaded, or created in memory. Functions are special types of objects that take one or more arguments and then do something (e.g., create a new object, make a visualization, or write a file).
R is built on contributed packages, which mostly contain libraries of new, related R functions. Most contributed packages are stored in a public repository called CRAN (the Comprehensive R Archive Network).
The best way to learn how to use R is by doing, so let's open a new R session and proceed. For all the code chunks below it should be possible to copy & paste from this window into your R session to reproduce the operations we have conducted here. Of course you can also retype the commands if you prefer. This helps build a familiarity with the environment if you are new to R.
We can create objects using the assign operator, a combination of
symbols that looks like an arrow (<-
or ->
) and
points towards the object that is being created.
For instance, here we create a variable (“object”) with the name
n
and value 15
:
n <- 15
Entering the name of a variable we have created in the R interface normally gives us some information about the object:
n
## [1] 15
Use of the assignment operator (<-
or ->
) also
allows us to create objects from either direction. For instance, the following
operation creates an object m
with value 5
:
5 -> m
m
## [1] 5
All objects are case-sensitive. That means that x
is
different from X
and so on:
x <- 1
X <- 10
x
## [1] 1
X
## [1] 10
All basic mathematical operations are available in R, such as multiplication, division, addition, etc.:
(10 + 2) * 5
## [1] 60
(X / 500) * 12
## [1] 0.24
There are many functions that we can use to facilitate our session in R. Any
time we use a function, we enter the name of the function and a pair of round
parentheses ((
and )
) that encircle the arguments
taken by the function. Some functions do not require any arguments. This
includes the very useful function ls
which (in its most basic
operation) lists all the objects in the current R session. For instance, here
we create some objects (in addition to those we have already created), and
then we can use ls
to list them:
name <- "Luke"
n1<-10
n2<-100
m<-0.5
ls()
## [1] "m" "n" "n1" "n2" "name" "x" "X"
The function ls
can also take various input arguments. For
instance we can supply it with a pattern so that it lists only objects
whose names include the specified pattern, as follows:
ls(pat="m")
## [1] "m" "name"
A closely related function, ls.str
, lists all the objects in the
current session, but also includes some information about their attributes. For
instance:
ls.str()
## m : num 0.5
## n : num 15
## n1 : num 10
## n2 : num 100
## name : chr "Luke"
## x : num 1
## X : num 10
shows us not only the object names, but also their modes (which we will learn more about below) and values.
How did we know which arguments could be input into ls
or
ls.str
? Well, if these were functions that we were already
familiar with & we just wanted to remind ourselves, we can use the very
helpful function args
:
args(ls)
## function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
## pattern, sorted = TRUE)
## NULL
args(ls.str)
## function (pos = -1, name, envir, all.names = FALSE, pattern,
## mode = "any")
## NULL
args(args)
## function (name)
## NULL
More typically, we might want to examine the help pages for the function. All packages have a help page for each function in the package. In many cases documentation is extensive - but it may be somewhat unfamiliar to new users.
Let's look at the help page for ls
:
help(ls)
## starting httpd help server ...
## done
We can see that each help page contains a section describing the function (Description), a section giving general usage of the function (Usage), and a section listing & describing all the arguments that the function takes (Arguments). Help pages may also include sections with additional information (Details), a section containing information about the object (if any) returned by the function (Value), and a section containing worked examples of the function in use (Examples), as well as sections containing information about the function authors and references.
Note that help(ls)
is exactly equivalent to ?ls
,
so I prefer to use the latter since it is more compact.
If we don't even know the name of the function we want to use, we can search
for help using the function help.search
. Unlike help
and ?
which look up help pages only for currently loaded packages,
help.search
automatically searches across all currently installed
packages. For instance:
help.search("anova")
help.search("phylogeny")
Note that you may have different results from these help searches because we have different sets of installed packages.
There are five main types of data object in R: vector, factor, matrix, data frame, and list. All data objects have attributes and values.
Vector: a vector is a series of elements of the same type. It has two attributes: mode and length. Let's look a few vectors of different types.
First, we can create a simple vector of mode numeric containing the values
1
through 5
x<-c(1,2,3,4,5)
x
## [1] 1 2 3 4 5
mode(x)
## [1] "numeric"
length(x)
## [1] 5
Note that this vector could have equivalently been created as follows
using the operator :
, or the function seq
, showing
that there is usually many different ways to accomplish the same operation
in R:
x<-1:5
x
## [1] 1 2 3 4 5
x<-seq(from=1,to=5,by=1)
x
## [1] 1 2 3 4 5
Next, here is a vector of mode "logical"
, meaning it contains
values of TRUE
or FALSE
:
y<-c(FALSE,TRUE)
y
## [1] FALSE TRUE
mode(y)
## [1] "logical"
length(y)
## [1] 2
Although we just created this vector using the function c
, logical
vectors most often result from logical operations - that is operations whose
result is either a TRUE
or FALSE
. For instance:
x>=3
## [1] FALSE FALSE TRUE TRUE TRUE
Next, here is a vector of mode "character"
. Vectors of this mode
can contain elements that are single characters or strings. For instance:
z<-c("order","superfamily","family","genus","species")
mode(z)
## [1] "character"
length(z)
## [1] 5
z
## [1] "order" "superfamily" "family" "genus" "species"
For vectors of any mode, we can access individual elements in a vector using numerical indexing. For example:
z[2]
## [1] "superfamily"
z[c(1,3)]
## [1] "order" "family"
i<-c(4,5)
z[i]
## [1] "genus" "species"
z[c(1,1,1)]
## [1] "order" "order" "order"
Positive indices, even repeating ones, access individual elements of the vector. By contrast, negative indices removes the corresponding elements. For example:
z[-c(2,3)]
## [1] "order" "genus" "species"
obj<-z[-2]
obj
## [1] "order" "family" "genus" "species"
Logical indexing combined with the function which
is a useful way to
select some data from a vector, but not others:
x<-runif(n=6,min=0,max=10)
x
## [1] 9.6259945 8.0330775 0.6469662 9.4720569 0.8363865 1.2417605
x>=5
## [1] TRUE TRUE FALSE TRUE FALSE FALSE
which(x>=5)
## [1] 1 2 4
x[x>=5]
## [1] 9.625995 8.033078 9.472057
x[which(x>=5)]
## [1] 9.625995 8.033078 9.472057
Vectors (and other R objects) sometimes but don't always have a third attribute, names. In phylogenetic analysis, our vectors will very frequently have names!
x<-1:5
names(x)<-z
x
## order superfamily family genus species
## 1 2 3 4 5
A factor is derived from a vector, but it has the additional attribute of levels.
For example, here I'll create a factor with two levels Male
and
Female
.
f<-c("Male","Male","Female","Female","Female")
f<-factor(f)
f
## [1] Male Male Female Female Female
## Levels: Female Male
table(f)
## f
## Female Male
## 3 2
summary(f)
## Female Male
## 3 2
Columns in a data file that we read into R that contain non-numeric values will usually be interpreted as factors.
A matrix is vector arranged in a tabular way. It has the additional attribute
dim
: the dimensions of the matrix. Because a matrix is just a tabularly
arranged vector, all the elements of a matrix must have the same mode. This can be
seen using the following example:
X<-matrix(1:9,3,3)
X
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## this is equivalent
X<-1:9
dim(X)<-c(3,3)
X
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
We can call elements of a matrix using numerical indexing as well. When working with matrices, it is very important to remember that the indexing is in row/column order:
X[3,2]
## [1] 6
Similarly, we can access individual rows or columns (themselves vectors) in the following way:
X[,3] # the third column
## [1] 7 8 9
X[2,] # the second row
## [1] 2 5 8
A data frame is a very important type of object in R as well. It looks like a matrix (although it's actually stored as a list, see below). It is the type of data object that is created by reading (say) a spreadsheet from a file.
Y<-data.frame(z,y=1:5,x=5:1)
Y
## z y x
## 1 order 1 5
## 2 superfamily 2 4
## 3 family 3 3
## 4 genus 4 2
## 5 species 5 1
Finally, a list is the most general data structure. It is the way we store all kinds of custom data types (including, and most importantly - for our purposes, phylogenetic trees). A list can be seen as a vector where each of the elements can be any kind of object (including another list!). For example:
L<-list(z=z,1:2,Y)
length(L)
## [1] 3
L
## $z
## [1] "order" "superfamily" "family" "genus" "species"
##
## [[2]]
## [1] 1 2
##
## [[3]]
## z y x
## 1 order 1 5
## 2 superfamily 2 4
## 3 family 3 3
## 4 genus 4 2
## 5 species 5 1
Note that the elements of a list can also have names, though they don't have to:
names(L)
## [1] "z" "" ""
There are multiple ways that we can access the elements of a list. For instance,
if we know the position in the list of the object we're interested it we can use
numerical indexing - but here the double square brackets ([[
and
]]
):
L[[3]]
## z y x
## 1 order 1 5
## 2 superfamily 2 4
## 3 family 3 3
## 4 genus 4 2
## 5 species 5 1
If we know the object's name, we can use the $
operator as follows:
L$z
## [1] "order" "superfamily" "family" "genus" "species"
or it's name & square brackets as follows:
L[["z"]]
## [1] "order" "superfamily" "family" "genus" "species"
Many utilities in R exist to read & write data from & to files. We'll cover reading & writing phylogenetic data in a subsequent section, but let's first just try reading & writing a couple of different input & ouput formats.
First, let's use read.csv
to read in a .csv (comma delimited) file
from memory. To follow along, download the file
anole.data.csv.
Comma delimited text is a very common file format for these types of data. Note that usually we will have to convert our file to a 'plain text' format like this one before reading it into R! To see what comma delimited text looks like, you can open the file you have just downloaded - or you can copy & paste the following code to view the first few lines of the file:
cat(readLines("anole.data.csv",10),sep="\n")
## "","SVL","HL","HLL","FLL","LAM","TL"
## "ahli",4.03913,2.88266,3.96202,3.34498,2.8662,4.504
## "alayoni",3.8157,2.70212,3.2795,2.80245,3.07527,4.07265
## "alfaroi",3.52665,2.37816,3.30542,2.48366,2.73387,4.41601
## "aliniger",4.03656,2.89884,3.64623,3.15908,3.15677,4.54173
## "allisoni",4.37539,3.35896,3.96069,3.4462,3.23921,5.05911
## "allogus",4.04014,2.86103,3.94018,3.33829,2.80827,4.52189
## "altitudinalis",3.84299,2.85273,3.25665,2.88466,3.19846,4.16762
## "alumina",3.58894,2.41783,3.44101,2.63466,2.69425,4.66676
## "alutaceus",3.55489,2.43405,3.3511,2.56994,2.78245,4.55473
Now let's read the file. To read it you can use the function read.csv
:
X<-read.csv("anole.data.csv",header=TRUE,row.names=1)
class(X)
## [1] "data.frame"
Here is the object we have just created:
X
## SVL HL HLL FLL LAM TL
## ahli 4.03913 2.88266 3.96202 3.34498 2.86620 4.50400
## alayoni 3.81570 2.70212 3.27950 2.80245 3.07527 4.07265
## alfaroi 3.52665 2.37816 3.30542 2.48366 2.73387 4.41601
## aliniger 4.03656 2.89884 3.64623 3.15908 3.15677 4.54173
## allisoni 4.37539 3.35896 3.96069 3.44620 3.23921 5.05911
## allogus 4.04014 2.86103 3.94018 3.33829 2.80827 4.52189
## altitudinalis 3.84299 2.85273 3.25665 2.88466 3.19846 4.16762
## alumina 3.58894 2.41783 3.44101 2.63466 2.69425 4.66676
## alutaceus 3.55489 2.43405 3.35110 2.56994 2.78245 4.55473
## angusticeps 3.78860 2.69018 3.23424 2.74548 2.86502 4.17223
## argenteolus 3.97131 2.75342 3.85752 3.26076 3.25025 4.75503
## argillaceus 3.75787 2.61359 3.35404 2.89988 3.02318 4.37048
## armouri 4.12168 3.06894 3.98845 3.37523 2.86689 4.71407
## bahorucoensis 3.82745 2.77540 3.69687 2.91831 2.80896 4.76335
## baleatus 5.05306 3.87847 4.73633 4.24528 3.44588 5.70632
## baracoae 5.04278 3.92857 4.72552 4.19824 3.69051 5.78458
## barahonae 5.07696 3.88092 4.76629 4.25025 3.43950 5.69561
## barbatus 5.00395 4.07397 4.51634 4.19162 3.39410 5.03053
## barbouri 3.66393 2.46801 3.45810 2.84025 2.41364 4.55342
## bartschi 4.28055 3.09878 4.18102 3.56553 3.28801 5.06832
## bremeri 4.11337 2.86044 3.90039 3.30585 2.97009 4.72996
## breslini 4.05111 3.00184 3.94958 3.33550 2.74677 4.82870
## brevirostris 3.87415 2.63920 3.66806 3.19585 3.03143 4.34161
## caudalis 3.91174 2.69233 3.64868 3.19965 2.94333 4.36260
## centralis 3.69794 2.52205 3.27249 2.80047 2.87894 4.38583
## chamaeleonides 5.04235 4.12856 4.52661 4.18213 3.51925 5.03155
## chlorocyanus 4.27545 3.07128 3.92347 3.37616 3.32593 4.97128
## christophei 3.88465 2.69607 3.74315 3.24156 2.98497 4.52612
## clivicola 3.75873 2.62919 3.59003 2.91615 2.85538 4.60138
## coelestinus 4.29797 3.12764 3.96332 3.41641 3.20854 5.04332
## confusus 3.93844 2.75838 3.76970 3.18221 2.85608 4.42371
## cooki 4.09154 2.93022 3.92244 3.30786 3.01479 4.89025
## cristatellus 4.18982 3.02450 4.01065 3.45393 3.08046 4.76218
## cupeyalensis 3.46201 2.31221 3.17127 2.43147 2.56376 4.52332
## cuvieri 4.87501 3.78908 4.69174 4.13141 3.39324 5.65807
## cyanopleurus 3.63016 2.46678 3.48687 2.67912 2.72007 4.74826
## cybotes 4.21098 3.16044 4.08094 3.50367 2.91083 4.85735
## darlingtoni 4.30204 3.21286 3.80977 3.33861 3.13549 4.49981
## distichus 3.92880 2.71507 3.74769 3.28819 3.03557 4.33953
## dolichocephalus 3.90855 2.88824 3.71678 2.89803 2.94351 4.87903
## equestris 5.11399 3.97656 4.75764 4.26757 3.72262 5.75012
## etheridgei 3.65799 2.50607 3.68383 2.89028 2.76044 4.51655
## eugenegrahami 4.12850 2.81780 4.08143 3.49302 3.22573 4.86316
## evermanni 4.16561 3.00815 3.95661 3.43032 3.31310 4.76766
## fowleri 4.28878 3.09388 4.07720 3.53726 2.97009 5.30472
## garmani 4.76947 3.60137 4.51390 3.97805 3.30958 5.52986
## grahami 4.15427 3.04139 3.92669 3.38041 3.28017 4.81155
## guafe 3.87746 2.67639 3.70675 3.15604 2.88914 4.35295
## guamuhaya 5.03695 4.11969 4.58159 4.14393 3.59647 5.15329
## guazuma 3.76388 2.65309 3.05933 2.58081 2.93307 3.72491
## gundlachi 4.18810 3.05589 4.10680 3.48066 2.79925 4.88040
## haetianus 4.31654 3.29590 4.22252 3.61242 2.93190 5.03960
## hendersoni 3.85983 2.81146 3.69235 2.89748 2.90119 4.86071
## homolechis 4.03281 2.83492 3.84678 3.29391 2.93592 4.52981
## imias 4.09969 2.85293 3.98565 3.41402 2.94375 4.65242
## inexpectatus 3.53744 2.42414 3.32925 2.46873 2.74432 4.60069
## insolitus 3.80047 2.66088 3.30379 2.75910 2.75087 3.93131
## isolepis 3.65709 2.60933 3.10660 2.67816 3.00959 3.84291
## jubar 3.95260 2.76726 3.75630 3.21232 2.91083 4.41338
## krugi 3.88650 2.75824 3.74280 3.05050 2.92595 4.89789
## lineatopus 4.12861 3.08489 4.00563 3.40064 2.86179 4.83357
## longitibialis 4.24210 3.13074 4.16673 3.59074 2.85046 4.97288
## loysiana 3.70124 2.55961 3.29042 2.86581 2.90001 4.07280
## lucius 4.19891 3.01269 4.08725 3.54050 3.31943 4.89860
## luteogularis 5.10109 3.95631 4.76912 4.27415 3.69997 5.73405
## macilentus 3.71576 2.50716 3.50014 2.69124 2.70694 4.67887
## marcanoi 4.07948 2.98856 3.93748 3.33499 2.93827 4.81486
## marron 3.83181 2.66263 3.62530 3.15137 2.92108 4.28998
## mestrei 3.98715 2.78904 3.84586 3.25122 2.89939 4.51303
## monticola 3.77061 2.61602 3.77539 2.99206 2.85608 4.71892
## noblei 5.08347 3.98972 4.77654 4.26465 3.74571 5.81288
## occultus 3.66305 2.52932 2.94088 2.49856 2.75722 3.64910
## olssoni 3.79390 2.53732 3.62697 2.82685 2.68045 4.93759
## opalinus 3.83838 2.67525 3.61507 3.07685 3.03601 4.41582
## ophiolepis 3.63796 2.43599 3.33231 2.72045 2.61274 4.49759
## oporinus 3.84567 2.71668 3.26919 2.87864 3.04452 4.14313
## paternus 3.80296 2.68938 3.35633 2.82450 2.99523 4.38717
## placidus 3.77397 2.62394 3.03639 2.66671 2.79536 3.84734
## poncensis 3.82038 2.63844 3.54908 2.90001 2.63391 4.78582
## porcatus 4.25899 3.23143 3.83813 3.29191 3.35154 4.94983
## porcus 5.03803 4.10583 4.51046 4.14064 3.50572 5.20247
## pulchellus 3.79902 2.72121 3.55535 2.90369 2.85477 4.76540
## pumilis 3.46686 2.28442 3.03187 2.59615 2.81442 4.13262
## quadriocellifer 3.90162 2.69606 3.70555 3.15652 2.84395 4.41491
## reconditus 4.48261 3.35907 4.41143 3.79420 3.05076 5.23530
## ricordii 5.01396 3.84360 4.72415 4.20138 3.41388 5.65557
## rubribarbus 4.07847 2.89425 3.96135 3.35641 2.86751 4.56108
## sagrei 4.06716 2.83515 3.85786 3.24267 2.91872 4.77603
## semilineatus 3.69663 2.56225 3.56689 2.71825 2.70627 4.81083
## sheplani 3.68292 2.49486 2.85877 2.45531 2.71918 3.74943
## shrevei 3.98300 2.91355 3.81600 3.20757 2.76989 4.65644
## singularis 4.05800 2.91768 3.68946 3.17729 3.14975 4.59542
## smallwoodi 5.03510 3.95492 4.75021 4.19821 3.73134 5.74107
## strahmi 4.27427 3.14616 4.21885 3.64446 2.99069 4.94861
## stratulus 3.86988 2.71712 3.57823 3.09671 3.09993 4.36548
## valencienni 4.32152 3.19328 3.83121 3.38065 3.20445 4.61083
## vanidicus 3.62621 2.49031 3.31618 2.53310 2.57858 4.54037
## vermiculatus 4.80285 3.65602 4.61175 3.92550 3.30124 5.55271
## websteri 3.91655 2.69056 3.65170 3.19690 2.97354 4.31559
## whitemani 4.09748 3.04389 3.97375 3.36546 2.78226 4.83677
Finally, we can explore some basics of R programming/scripting.
Most users of R are not R programmers. Nonetheless it can be very useful to understand some basics of scripting in order to make your use of R more efficient and more reproducible.
To start, we can write a basic for
loop. A for
loop
is useful when we want to repeat an operation a certain, fixed number of times -
for instance, when performing the same operation over the rows or columns of
a matrix, or across all the phylogenies in a dataset (such as a posterior sample
of trees from a Bayesian analysis).
Here, we can use a for
loop to compute the average value for each
trait (column in our data frame) across species.
The way to read the following code is just that we first create a vector,
averages
, that is empty of content; then we iterate from 1 through
the number of columns in X
(ncol(X)
), each time
computing the average value (using the function mean
) for each
column of the object. Next, we set the names of averages
to match
the names of each column in X
and print averages
to
screen:
averages<-vector(mode="numeric",length=ncol(X))
for(i in 1:ncol(X)) averages[i]<-mean(X[,i])
names(averages)<-colnames(X)
averages
## SVL HL HLL FLL LAM TL
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669
R lets us do this in multiple ways. For instance, we can also use
apply
family functions. apply
family function tend to
work with matrices, data frames, and lists and tend to take the form of
“apply to X in dimension DIMENSION function FUN”, for instance:
averages<-apply(X=X,MARGIN=2,FUN=mean)
means apply to X
in dimension 2
(the columns) the
function mean
. The result should match what we obtained with our
for
loop, above.
averages
## SVL HL HLL FLL LAM TL
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669
apply
family functions take some getting used to - but they are very
helpful.
Finally, R has custom functions that can be handy here:
averages<-colMeans(X)
averages
## SVL HL HLL FLL LAM TL
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669
In the above example, we used a for loop around very simple code, but we can also
loop over multiple lines of code using curly brackets ({
and
}
) to encircle the code chunk we'd like to iterate. Just for
demonstrative purposes, let's print out the name of the trait that we've computed
the average for in each loop:
averages<-setNames(vector(mode="numeric",length=ncol(X)),colnames(X))
for(i in 1:ncol(X)){
averages[i]<-mean(X[,i])
if(i==1) cat("------\n")
cat(paste("Average for",colnames(X)[i],":",averages[i],"\n"))
if(i==ncol(X)) cat("------\nDone!\n")
}
## ------
## Average for SVL : 4.0962894
## Average for HL : 2.9610366
## Average for HLL : 3.8265419
## Average for FLL : 3.2539689
## Average for LAM : 3.0189606
## Average for TL : 4.7236695
## ------
## Done!
averages
## SVL HL HLL FLL LAM TL
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669
Finally, it is straightforward to write custom functions within R to perform
idiosyncratic tasks. For instance, let's imagine that colMeans
does
not exist, & create a new function col_means
to duplicate its
operation:
col_means<-function(x,na.rm=TRUE){
obj<-vector(mode="numeric",length=ncol(x))
for(i in 1:ncol(x))
obj[i]<-sum(x[,i],na.rm=na.rm)/sum(!is.na(x[,i]))
setNames(obj,colnames(x))
}
averages<-col_means(X)
averages
## SVL HL HLL FLL LAM TL
## 4.096289 2.961037 3.826542 3.253969 3.018961 4.723669
Neat.
Developed by Liam J. Revell & Luke J. Harmon. Last updated 31 July 2017.