Introduction to phylogenies in R

This tutorial gives a basic introduction to phylogenies in the R language and statistical computing environment.

We can use the package "ctv" (i.e., CRAN Task View) to automatically install & update all the packages for R phylogenetic analysis that are available and listed in the Task View.

## install CRAN Task View for phylogenetics install.packages('ctv')
## library('ctv') install.views('Phylogenetics')
## update.views('Phylogenetics')

Now we've installed (and/or updated) pretty much every known phylogeny package for R that is available on CRAN - by barely lifting a finger. The most important core package for phylogenies in R is called "ape", which stands for Analysis of Phylogenetics and Evolution in R.

## load ape
library(ape)

"ape" does many different things. To get started let's simulate and plot a phylogeny, which ape can do under different models.

## simulate a phylogeny
tree <- rtree(n = 20)
plot(tree, edge.width = 2)
plot of chunk unnamed-chunk-3

The object created in memory when we simulate or estimate a phylogeny, or read one from an input file, is a list of class "phylo". A list is just a customizable object type that can combine different objects of different types. For instance, a list might have a vector of real numbers (with class "numeric) as its first element; and then a vector of strings (with class "character") as its second element; and so on.

An object of class "phylo" has at least 4 parts. These are normally hidden, for instance, just typing the name of your "phylo" object does not give you the structure in memory, as it does for many R objects:

tree
## 
## Phylogenetic tree with 20 tips and 19 internal nodes.
## 
## Tip labels:
## 	t1, t20, t2, t7, t15, t17, ...
## 
## Rooted; includes branch lengths.

However, we can see the structure using:

str(tree)
## List of 4
##  $ edge       : int [1:38, 1:2] 21 22 23 24 24 23 25 26 27 27 ...
##  $ tip.label  : chr [1:20] "t1" "t20" "t2" "t7" ...
##  $ edge.length: num [1:38] 0.0729 0.8743 0.1989 0.5346 0.1056 ...
##  $ Nnode      : int 19
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"

To get at the core of how an object of class "phylo" encodes phylogenetic, let's use a simple case: a tree with 5 tips & no edge lengths.

tree <- read.tree(text = "(((A,B),(C,D)),E);")
plot(tree, type = "cladogram", edge.width = 2)
plot of chunk unnamed-chunk-6

The object, in this case, consists of three components and a class attribute:

tree$edge
##      [,1] [,2]
## [1,]    6    7
## [2,]    7    8
## [3,]    8    1
## [4,]    8    2
## [5,]    7    9
## [6,]    9    3
## [7,]    9    4
## [8,]    6    5

The matrix edge contains the beginning and ending node number for all the nodes and tips in the tree. By convention, the tips of the tree are numbered 1 through n for n tips; and the nodes are numbered n + 1 through n + m for m nodes. m = n - 1 for a fully bifurcating tree. This is just to keep track of which nodes are internal and which are leaves.

tree$tip.label
## [1] "A" "B" "C" "D" "E"

The vector tip.label contains the labels for all the tips in the tree. The order of tip.label is the order of the tips numbered 1 through n in edge.

tree$Nnode
## [1] 4

The integer Nnode contains the number of internal nodes in the tree, including the root of the tree if the tree is rooted.

If it seems difficult to imagine that this object could contain all the information in the tree, here is a plotted tree with the edge numbers overlain:

plot(tree, edge.width = 2, label.offset = 0.1, type = "cladogram")
nodelabels()
tiplabels()
plot of chunk unnamed-chunk-10

We can see that all of the relationship information among the taxa in the tree is containing in the starting & ending nodes for each edge. Edges that share a common starting node number are descended from an immediate common ancestor, etc.

An object of class "phylo" also (by definition) has at least one attribute - its class. This is just a value to tell various methods in R what to do with an object of this type. For instance, if we call the method plot, R knows to use plot.phylo in the R package "ape".

"phylo" can also have other components, the most common of which are edge.length (a vector of class "numeric" containing all the edge lengths of the tree in the same order as the rows in edge; and root.edge, a numeric value giving the length of the root edge, if one exists. In addition, other elements & attributes can be added for special types of phylogenetic trees.

It's important to keep in mind that this is neither the only way to store a phylogeny in computer memory, nor even the only way to store a tree in R! However the advantage of storing trees in the same way across different packages is that it improves package interoperability - which is a fancy way of saying that developers can concentrate on developing methods that build on existing functionality, rather than reinventing the wheel anew for each project.

For instance, one of many reasons to build on the structures developed for the "ape" package, is that there are many different utility functions in "ape" and the other R packages for phylogenetics, especially my package "phytools", for reading, writing, and manipulating phylogenetic trees stored in memory using this structure. There are a huge number of "utility" functions of this type for phylogenies, so I'm just going to review some of the more important & useful ones.

Writing & reading phylogenetic trees

We can easily write & read trees to & from files, for example:

write.tree(tree, "example.tre")
cat(readLines("example.tre"))
## (((A,B),(C,D)),E);
## load phytools
library(phytools)
writeNexus(tree, "example.nex")
cat(readLines("example.nex"), sep = "\n")
## #NEXUS
## [R-package PHYTOOLS, Wed Aug 07 10:27:09 2013]
## 
## BEGIN TAXA;
## 	DIMENSIONS NTAX = 5;
## 	TAXLABELS
## 		A
## 		B
## 		C
## 		D
## 		E
## 	;
## END;
## BEGIN TREES;
## 	TRANSLATE
## 		1	A,
## 		2	B,
## 		3	C,
## 		4	D,
## 		5	E
## 	;
## 	TREE * UNTITLED = [&R] (((1,2),(3,4)),5);
## END;

Simulating, plotting, extracting clades, & dropping tips

We can also simulate trees, plot them, extract clades, & drop tips from the tree:

## (I'm going to first set the seed for repeatability)
set.seed(1)
## simulate a birth-death tree using phytools
tree <- pbtree(b = 1, d = 0.2, n = 40)
## stopping criterion is 40 extant species, in this case
plotTree(tree, setEnv = TRUE)
## setEnv=TRUE is experimental. please be patient with bugs
nodelabels()
plot of chunk unnamed-chunk-12
## ok, now extract the clade descended from node #62
tt62 <- extract.clade(tree, 62)
plotTree(tt62)
plot of chunk unnamed-chunk-12
## now drop 10 tips from the tree (I'm going to pick them at random)
dtips <- sample(tree$tip.label, 10)
dt <- drop.tip(tree, dtips)
plotTree(dt)
## we could also, say, drop all tips that go extinct before the present
## this is a fun way, but not the only way to do this:
et <- fancyTree(tree, type = "droptip", tip = getExtinct(tree), cex = 0.7)
plot of chunk unnamed-chunk-12
plot of chunk unnamed-chunk-12
print(et)
## 
## Phylogenetic tree with 40 tips and 39 internal nodes.
## 
## Tip labels:
## 	t39, t40, t37, t38, t20, t22, ...
## 
## Rooted; includes branch lengths.

Binary & polytomous trees

"ape" and most other phylogenetics packages are equipped to handle phylogenies that are binary or multifurcating; however not all functions will be. We can easily check if our tree is binary, and convert between binary & multifurcating trees.

t1 <- read.tree(text = "((A,B,C),D);")
plot(t1, type = "cladogram")
plot of chunk unnamed-chunk-13
## check if binary
is.binary.tree(t1)
## [1] FALSE
## randomly resolve polytomies
t2 <- multi2di(t1)
plot(t2, type = "cladogram")
plot of chunk unnamed-chunk-14
is.binary.tree(t2)
## [1] TRUE

Miscellaneous (rotating nodes, re-rooting, etc...)

Lots of other manipulations are possible in R, but here are some simple ones:

## rotating nodes
plotTree(tree, node.numbers = T)
plot of chunk unnamed-chunk-16
## first, rotate about node #50
rt.50 <- rotate(tree, 50)
plotTree(rt.50)
plot of chunk unnamed-chunk-16
## now rotate all nodes
rt.all <- rotateNodes(tree, "all")
plotTree(rt.all)
plot of chunk unnamed-chunk-16
## ok, now let's re-root the tree at node #67
rr.67 <- root(tree, node = 67)
plotTree(rr.67)
plot of chunk unnamed-chunk-16
## this creates a trifurcation at the root we could instead re-root at
## along an edge
rr.62 <- reroot(tree, 62, position = 0.5 * tree$edge.length[which(tree$edge[, 
    2] == 62)])
plotTree(rr.62)
plot of chunk unnamed-chunk-16

Comparing trees

We can, for example, check to see if our trees are equal. Trees with rotated nodes are equal. Re-rooted trees are not; however they are the same if unrooted. For example:

## check if tree & rt.all are equal
all.equal(tree, rt.all)
## [1] TRUE
## check if tree & rr.62 are equal
all.equal(tree, rr.62)
## [1] FALSE
## check if unrooted tree & rr.62 are equal
all.equal(unroot(tree), unroot(rr.62))
## [1] TRUE

Multiple trees

Finally, it is sometimes useful to store multiple phylogenies in a single object. This would be the case, for example, if we had a set of trees in a posterior sample from Bayesian phylogeny inference; or if we wanted to replicate a simulation analysis across a large number of phylogenies.

Multiple trees are stored as an object of class "multiPhylo". This is just a list of objects of class "phylo", with the class attribute "multiPhylo". Many, but not all, functions in "ape", "phytools", and other R packages can be applied to both "phylo" and "multiPhylo" objects. For instance:

## simulate 10 pure-birth trees
trees <- pbtree(n = 6, nsim = 10, scale = 1)
print(trees)
## 10 phylogenetic trees
## round the edge lengths of the tree to 3 digits
trees <- roundBranches(trees, 1)
## write to file
write.tree(trees, file = "example.trees")
cat(readLines("example.trees"), sep = "\n")
## (t1:1,((t3:0.2,t4:0.2):0.6,(t2:0.7,(t5:0.2,t6:0.2):0.5):0.1):0.2);
## (t1:1,(((t4:0.2,(t5:0.1,t6:0.1):0.1):0,t3:0.2):0.2,t2:0.4):0.6);
## ((t1:0.4,t2:0.4):0.6,(t3:0.3,(t4:0.2,(t5:0,t6:0):0.1):0.2):0.7);
## ((t3:0.2,t4:0.2):0.8,(t1:1,(t2:0.8,(t5:0.1,t6:0.1):0.8):0.2):0);
## ((t3:0.2,t4:0.2):0.8,(((t5:0.1,t6:0.1):0.1,t2:0.3):0.5,t1:0.8):0.2);
## ((t3:0.3,t4:0.3):0.7,((t5:0,t6:0):0.8,(t1:0.5,t2:0.5):0.2):0.2);
## ((t2:0.3,t3:0.3):0.7,(t1:0.3,(t4:0.1,(t5:0,t6:0):0.1):0.2):0.7);
## (((t2:0.4,(t3:0.4,t4:0.4):0.1):0.1,(t5:0.3,t6:0.3):0.3):0.4,t1:1);
## (((t2:0.4,(t3:0.4,t4:0.4):0):0,t1:0.5):0.5,(t5:0.1,t6:0.1):0.9);
## (((t3:0.4,t4:0.4):0.5,t1:0.9):0.1,(t2:0.8,(t5:0.3,t6:0.3):0.5):0.2);

Many other tasks are also possible in R phylogenetics. For an overview, read the CRAN Phylogenetics Task View. Also check out my blog: http://blog.phytools.org, or review the PDF manuals of different R packages such as ape, phytools, phangorn, and geiger.

Written by Liam J. Revell. Last updated Aug. 7, 2013