This tutorial gives a basic introduction to phylogenies in the R language and statistical computing environment.
We can use the package "ctv"
(i.e., CRAN Task View) to automatically
install & update all the packages for R phylogenetic analysis that are available
and listed in the Task View.
## install CRAN Task View for phylogenetics install.packages('ctv') ## library('ctv') install.views('Phylogenetics') ## update.views('Phylogenetics')
Now we've installed (and/or updated) pretty much every known phylogeny package for
R that is available on CRAN - by barely lifting a finger. The most important core
package for phylogenies in R is called "ape"
, which stands for Analysis
of Phylogenetics and Evolution in R.
## load ape library(ape)
"ape"
does many different things. To get started let's simulate and
plot a phylogeny, which ape can do under different models.
## simulate a phylogeny tree <- rtree(n = 20) plot(tree, edge.width = 2)
The object created in memory when we simulate or estimate a phylogeny, or read one from
an input file, is a list of class "phylo"
. A list is just a
customizable object type that can combine different objects of different types. For
instance, a list might have a vector of real numbers (with class "numeric
) as
its first element; and then a vector of strings (with class "character"
) as
its second element; and so on.
An object of class "phylo"
has at least 4 parts. These are normally
hidden, for instance, just typing the name of your "phylo"
object does
not give you the structure in memory, as it does for many R objects:
tree
## ## Phylogenetic tree with 20 tips and 19 internal nodes. ## ## Tip labels: ## t1, t20, t2, t7, t15, t17, ... ## ## Rooted; includes branch lengths.
However, we can see the structure using:
str(tree)
## List of 4 ## $ edge : int [1:38, 1:2] 21 22 23 24 24 23 25 26 27 27 ... ## $ tip.label : chr [1:20] "t1" "t20" "t2" "t7" ... ## $ edge.length: num [1:38] 0.0729 0.8743 0.1989 0.5346 0.1056 ... ## $ Nnode : int 19 ## - attr(*, "class")= chr "phylo" ## - attr(*, "order")= chr "cladewise"
To get at the core of how an object of class "phylo"
encodes phylogenetic,
let's use a simple case: a tree with 5 tips & no edge lengths.
tree <- read.tree(text = "(((A,B),(C,D)),E);") plot(tree, type = "cladogram", edge.width = 2)
The object, in this case, consists of three components and a class attribute:
tree$edge
## [,1] [,2] ## [1,] 6 7 ## [2,] 7 8 ## [3,] 8 1 ## [4,] 8 2 ## [5,] 7 9 ## [6,] 9 3 ## [7,] 9 4 ## [8,] 6 5
The matrix edge
contains the beginning and ending node number for
all the nodes and tips in the tree. By convention, the tips of the tree are
numbered 1 through n for n tips; and the nodes are numbered n
+ 1 through n + m for m nodes. m = n - 1 for a
fully bifurcating tree. This is just to keep track of which nodes are internal and
which are leaves.
tree$tip.label
## [1] "A" "B" "C" "D" "E"
The vector tip.label
contains the labels for all the tips in the
tree. The order of tip.label
is the order of the tips numbered 1
through n in edge
.
tree$Nnode
## [1] 4
The integer Nnode
contains the number of internal nodes in the
tree, including the root of the tree if the tree is rooted.
If it seems difficult to imagine that this object could contain all the information in the tree, here is a plotted tree with the edge numbers overlain:
plot(tree, edge.width = 2, label.offset = 0.1, type = "cladogram") nodelabels() tiplabels()
We can see that all of the relationship information among the taxa in the tree is containing in the starting & ending nodes for each edge. Edges that share a common starting node number are descended from an immediate common ancestor, etc.
An object of class "phylo"
also (by definition) has at least one
attribute - its class. This is just a value to tell various methods in R what to do
with an object of this type. For instance, if we call the method plot
,
R knows to use plot.phylo
in the R package "ape"
.
"phylo"
can also have other components, the most common of which
are edge.length
(a vector of class "numeric"
containing
all the edge lengths of the tree in the same order as the rows in edge
;
and root.edge
, a numeric value giving the length of the root edge, if
one exists. In addition, other elements & attributes can be added for special types
of phylogenetic trees.
It's important to keep in mind that this is neither the only way to store a phylogeny in computer memory, nor even the only way to store a tree in R! However the advantage of storing trees in the same way across different packages is that it improves package interoperability - which is a fancy way of saying that developers can concentrate on developing methods that build on existing functionality, rather than reinventing the wheel anew for each project.
For instance, one of many reasons to build on the structures developed for the
"ape"
package, is that there are many different utility functions in
"ape"
and the other R packages for phylogenetics, especially my
package "phytools"
, for reading, writing, and manipulating
phylogenetic trees stored in memory using this structure. There are a huge number of
"utility" functions of this type for phylogenies, so I'm just going to review some of
the more important & useful ones.
We can easily write & read trees to & from files, for example:
write.tree(tree, "example.tre") cat(readLines("example.tre"))
## (((A,B),(C,D)),E);
## load phytools library(phytools) writeNexus(tree, "example.nex") cat(readLines("example.nex"), sep = "\n")
## #NEXUS ## [R-package PHYTOOLS, Wed Aug 07 10:27:09 2013] ## ## BEGIN TAXA; ## DIMENSIONS NTAX = 5; ## TAXLABELS ## A ## B ## C ## D ## E ## ; ## END; ## BEGIN TREES; ## TRANSLATE ## 1 A, ## 2 B, ## 3 C, ## 4 D, ## 5 E ## ; ## TREE * UNTITLED = [&R] (((1,2),(3,4)),5); ## END;
We can also simulate trees, plot them, extract clades, & drop tips from the tree:
## (I'm going to first set the seed for repeatability) set.seed(1) ## simulate a birth-death tree using phytools tree <- pbtree(b = 1, d = 0.2, n = 40) ## stopping criterion is 40 extant species, in this case plotTree(tree, setEnv = TRUE)
## setEnv=TRUE is experimental. please be patient with bugs
nodelabels()
## ok, now extract the clade descended from node #62 tt62 <- extract.clade(tree, 62) plotTree(tt62)
## now drop 10 tips from the tree (I'm going to pick them at random) dtips <- sample(tree$tip.label, 10) dt <- drop.tip(tree, dtips) plotTree(dt) ## we could also, say, drop all tips that go extinct before the present ## this is a fun way, but not the only way to do this: et <- fancyTree(tree, type = "droptip", tip = getExtinct(tree), cex = 0.7)
print(et)
## ## Phylogenetic tree with 40 tips and 39 internal nodes. ## ## Tip labels: ## t39, t40, t37, t38, t20, t22, ... ## ## Rooted; includes branch lengths.
"ape"
and most other phylogenetics packages are equipped to handle
phylogenies that are binary or multifurcating; however not all functions will be.
We can easily check if our tree is binary, and convert between binary &
multifurcating trees.
t1 <- read.tree(text = "((A,B,C),D);") plot(t1, type = "cladogram")
## check if binary is.binary.tree(t1)
## [1] FALSE
## randomly resolve polytomies t2 <- multi2di(t1) plot(t2, type = "cladogram")
is.binary.tree(t2)
## [1] TRUE
Lots of other manipulations are possible in R, but here are some simple ones:
## rotating nodes plotTree(tree, node.numbers = T)
## first, rotate about node #50 rt.50 <- rotate(tree, 50) plotTree(rt.50)
## now rotate all nodes rt.all <- rotateNodes(tree, "all") plotTree(rt.all)
## ok, now let's re-root the tree at node #67 rr.67 <- root(tree, node = 67) plotTree(rr.67)
## this creates a trifurcation at the root we could instead re-root at ## along an edge rr.62 <- reroot(tree, 62, position = 0.5 * tree$edge.length[which(tree$edge[, 2] == 62)]) plotTree(rr.62)
We can, for example, check to see if our trees are equal. Trees with rotated nodes are equal. Re-rooted trees are not; however they are the same if unrooted. For example:
## check if tree & rt.all are equal all.equal(tree, rt.all)
## [1] TRUE
## check if tree & rr.62 are equal all.equal(tree, rr.62)
## [1] FALSE
## check if unrooted tree & rr.62 are equal all.equal(unroot(tree), unroot(rr.62))
## [1] TRUE
Finally, it is sometimes useful to store multiple phylogenies in a single object. This would be the case, for example, if we had a set of trees in a posterior sample from Bayesian phylogeny inference; or if we wanted to replicate a simulation analysis across a large number of phylogenies.
Multiple trees are stored as an object of class "multiPhylo"
. This
is just a list of objects of class "phylo"
, with the class attribute
"multiPhylo"
. Many, but not all, functions in "ape"
,
"phytools"
, and other R packages can be applied to both
"phylo"
and "multiPhylo"
objects. For instance:
## simulate 10 pure-birth trees trees <- pbtree(n = 6, nsim = 10, scale = 1) print(trees)
## 10 phylogenetic trees
## round the edge lengths of the tree to 3 digits trees <- roundBranches(trees, 1) ## write to file write.tree(trees, file = "example.trees") cat(readLines("example.trees"), sep = "\n")
## (t1:1,((t3:0.2,t4:0.2):0.6,(t2:0.7,(t5:0.2,t6:0.2):0.5):0.1):0.2); ## (t1:1,(((t4:0.2,(t5:0.1,t6:0.1):0.1):0,t3:0.2):0.2,t2:0.4):0.6); ## ((t1:0.4,t2:0.4):0.6,(t3:0.3,(t4:0.2,(t5:0,t6:0):0.1):0.2):0.7); ## ((t3:0.2,t4:0.2):0.8,(t1:1,(t2:0.8,(t5:0.1,t6:0.1):0.8):0.2):0); ## ((t3:0.2,t4:0.2):0.8,(((t5:0.1,t6:0.1):0.1,t2:0.3):0.5,t1:0.8):0.2); ## ((t3:0.3,t4:0.3):0.7,((t5:0,t6:0):0.8,(t1:0.5,t2:0.5):0.2):0.2); ## ((t2:0.3,t3:0.3):0.7,(t1:0.3,(t4:0.1,(t5:0,t6:0):0.1):0.2):0.7); ## (((t2:0.4,(t3:0.4,t4:0.4):0.1):0.1,(t5:0.3,t6:0.3):0.3):0.4,t1:1); ## (((t2:0.4,(t3:0.4,t4:0.4):0):0,t1:0.5):0.5,(t5:0.1,t6:0.1):0.9); ## (((t3:0.4,t4:0.4):0.5,t1:0.9):0.1,(t2:0.8,(t5:0.3,t6:0.3):0.5):0.2);
Many other tasks are also possible in R phylogenetics. For an overview, read the CRAN Phylogenetics Task View. Also check out my blog: http://blog.phytools.org, or review the PDF manuals of different R packages such as ape, phytools, phangorn, and geiger.
Written by Liam J. Revell. Last updated Aug. 7, 2013