For better or for worse, the estimation of phenotypic trait values for ancestral nodes in the tree continues to be an important goal in phylogenetic comparative biology.
This tutorial focuses on the estimation of ancestral character states under Brownian motion, including when some values for internal nodes are known.
Let's start by estimating ancestral states for a a continuous character, in this case overall body size, on a phylogeny of Anolis lizards from the Caribbean.
To estimate the states for a continuously valued character at ancestral nodes, we need find the states that have the maximum probability of having arisen under our model. These will be our maximum likelihood estimates.
Ancestral character esimation is implemented in a variety of different R functions. The most commonly used is ace
. Here, we start with the phytools function fastAnc
.
The files you will need are here:
If you have a good web connection, it is also always possible to read in your files directly from the web, too.
## load libraries
library(phytools)
## Loading required package: ape
## Loading required package: maps
##
## # maps v3.1: updated 'world': all lakes moved to separate new #
## # 'lakes' database. Type '?world' or 'news(package="maps")'. #
## read tree from file
tree<-read.tree("Anolis.tre")
## or
tree<-read.tree("http://www.phytools.org/SanJuan2016/data/Anolis.tre")
## plot tree
plotTree(tree,type="fan",ftype="i")
## read data
svl<-read.csv("svl.csv",row.names=1)
## or
svl<-read.csv("http://www.phytools.org/SanJuan2016/data/svl.csv",
row.names=1)
## change this into a vector
svl<-setNames(svl$svl,rownames(svl))
svl
## ahli alayoni alfaroi aliniger
## 4.039125 3.815705 3.526655 4.036557
## allisoni allogus altitudinalis alumina
## 4.375390 4.040138 3.842994 3.588941
## alutaceus angusticeps argenteolus argillaceus
## 3.554891 3.788595 3.971307 3.757869
## armouri bahorucoensis baleatus baracoae
## 4.121684 3.827445 5.053056 5.042780
## barahonae barbatus barbouri bartschi
## 5.076958 5.003946 3.663932 4.280547
## bremeri breslini brevirostris caudalis
## 4.113371 4.051111 3.874155 3.911743
## centralis chamaeleonides chlorocyanus christophei
## 3.697941 5.042349 4.275448 3.884652
## clivicola coelestinus confusus cooki
## 3.758726 4.297965 3.938442 4.091535
## cristatellus cupeyalensis cuvieri cyanopleurus
## 4.189820 3.462014 4.875012 3.630161
## cybotes darlingtoni distichus dolichocephalus
## 4.210982 4.302036 3.928796 3.908550
## equestris etheridgei eugenegrahami evermanni
## 5.113994 3.657991 4.128504 4.165605
## fowleri garmani grahami guafe
## 4.288780 4.769473 4.154274 3.877457
## guamuhaya guazuma gundlachi haetianus
## 5.036953 3.763884 4.188105 4.316542
## hendersoni homolechis imias inexpectatus
## 3.859835 4.032806 4.099687 3.537439
## insolitus isolepis jubar krugi
## 3.800471 3.657088 3.952605 3.886500
## lineatopus longitibialis loysiana lucius
## 4.128612 4.242103 3.701240 4.198915
## luteogularis macilentus marcanoi marron
## 5.101085 3.715765 4.079485 3.831810
## mestrei monticola noblei occultus
## 3.987147 3.770613 5.083473 3.663049
## olssoni opalinus ophiolepis oporinus
## 3.793899 3.838376 3.637962 3.845670
## paternus placidus poncensis porcatus
## 3.802961 3.773967 3.820378 4.258991
## porcus pulchellus pumilis quadriocellifer
## 5.038034 3.799022 3.466860 3.901619
## reconditus ricordii rubribarbus sagrei
## 4.482607 5.013963 4.078469 4.067162
## semilineatus sheplani shrevei singularis
## 3.696631 3.682924 3.983003 4.057997
## smallwoodi strahmi stratulus valencienni
## 5.035096 4.274271 3.869881 4.321524
## vanidicus vermiculatus websteri whitemani
## 3.626206 4.802849 3.916546 4.097479
Now we can estimate ancestral states. We will also compute variances & 95% confidence intervals for each node:
fit<-fastAnc(tree,svl,vars=TRUE,CI=TRUE)
print(fit,printlen=10)
## Ancestral character estimates using fastAnc:
## 101 102 103 104 105 106 107 108
## 4.065918 4.045601 4.047993 4.066923 4.036743 4.049514 4.054528 4.045218
## 109 110
## 3.979493 3.95244 ....
##
## Variances on ancestral states:
## 101 102 103 104 105 106 107 108
## 0.011775 0.00841 0.009452 0.01187 0.01446 0.018113 0.011686 0.007269
## 109 110
## 0.014488 0.012814 ....
##
## Lower & upper 95% CIs:
## lower upper
## 101 3.853231 4.278604
## 102 3.865861 4.225341
## 103 3.857439 4.238548
## 104 3.853382 4.280465
## 105 3.801056 4.27243
## 106 3.785731 4.313298
## 107 3.842649 4.266406
## 108 3.878112 4.212323
## 109 3.743577 4.215408
## 110 3.730566 4.174314
## .... ....
## as we discussed in class, 95% CIs can be broad
range(svl) ## compare to root node
## [1] 3.462014 5.113994
We will learn more about this later, but there are also some methods for visualizing the ancestral state reconstructions for continuous traits on the tree. For example:
## projection of the reconstruction onto the edges of the tree
obj<-contMap(tree,svl,plot=FALSE)
obj
## Object of class "contMap" containing:
##
## (1) A phylogenetic tree with 100 tips and 99 internal nodes.
##
## (2) A mapped continuous trait on the range (3.462014, 5.113994).
plot(obj,type="fan",legend=0.7*max(nodeHeights(tree)))
Next, we can explore some of the properties of ancestral state reconstruction of continuous traits in general, starting wih our Brownian model.
First, let's simulate some data:
## simulate a tree & some data
tree<-pbtree(n=26,scale=1,tip.label=LETTERS[26:1])
## simulate with ancestral states
x<-fastBM(tree,internal=TRUE)
## ancestral states
a<-x[as.character(1:tree$Nnode+Ntip(tree))]
## tip data
x<-x[tree$tip.label]
Now, let's estimate ancestral states using fastAnc
which uses the re-rooting algorithm discussed in class:
fit<-fastAnc(tree,x,CI=TRUE)
print(fit,printlen=6)
## Ancestral character estimates using fastAnc:
## 27 28 29 30 31 32
## -0.195128 0.137528 -0.361822 -0.65777 -0.660149 -0.666084 ....
##
## Lower & upper 95% CIs:
## lower upper
## 27 -0.886855 0.496599
## 28 -0.540217 0.815274
## 29 -0.948743 0.225099
## 30 -1.114127 -0.201412
## 31 -1.110236 -0.210061
## 32 -1.111167 -0.221
## .... ....
We can easily compare these estimates to the (normally unknown) generating values for simulated data:
plot(a,fit$ace,xlab="true states",ylab="estimated states")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red") ## 1:1 line
One of the most common critiques of ancestral state estimation is that the uncertainty around ancestral states can be large; and, furthermore, that this uncertainty is often ignored. Let's address this by obtaining 95% CIs on ancestral values, and then let's show that the 95% CIs include the generating values around 95% of the time:
## first, let's go back to our previous dataset
print(fit)
## Ancestral character estimates using fastAnc:
## 27 28 29 30 31 32 33
## -0.195128 0.137528 -0.361822 -0.657770 -0.660149 -0.666084 -0.800669
## 34 35 36 37 38 39 40
## -0.938376 -0.793030 -0.583928 -0.509131 -0.614028 -0.839242 -0.476535
## 41 42 43 44 45 46 47
## -0.115802 -0.609347 -0.670228 -0.702526 -0.803653 -0.269509 -0.270716
## 48 49 50 51
## -0.381415 -0.595714 -0.376348 -0.412522
##
## Lower & upper 95% CIs:
## lower upper
## 27 -0.886855 0.496599
## 28 -0.540217 0.815274
## 29 -0.948743 0.225099
## 30 -1.114127 -0.201412
## 31 -1.110236 -0.210061
## 32 -1.111167 -0.221000
## 33 -1.197824 -0.403515
## 34 -1.298520 -0.578231
## 35 -1.022320 -0.563739
## 36 -0.883963 -0.283892
## 37 -0.537165 -0.481096
## 38 -1.032292 -0.195764
## 39 -1.192058 -0.486425
## 40 -0.835417 -0.117653
## 41 -0.454910 0.223305
## 42 -0.949638 -0.269057
## 43 -0.827204 -0.513252
## 44 -0.834681 -0.570372
## 45 -1.067799 -0.539506
## 46 -0.680942 0.141923
## 47 -0.681679 0.140247
## 48 -0.781315 0.018484
## 49 -0.971147 -0.220281
## 50 -0.633324 -0.119372
## 51 -0.625281 -0.199763
mean(((a>=fit$CI95[,1]) + (a<=fit$CI95[,2]))==2)
## [1] 0.72
One small tree doesn't tell us much, so let's repeat for a sample of trees & reconstructions:
## custom function that conducts a simulation, estimates ancestral
## states, & returns the fraction on 95% CI
foo<-function(){
tree<-pbtree(n=100)
x<-fastBM(tree,internal=TRUE)
fit<-fastAnc(tree,x[1:length(tree$tip.label)],CI=TRUE)
mean(((x[1:tree$Nnode+length(tree$tip.label)]>=fit$CI95[,1]) +
(x[1:tree$Nnode+length(tree$tip.label)]<=fit$CI95[,2]))==2)
}
## conduct 100 simulations
pp<-replicate(100,foo())
mean(pp)
## [1] 0.9485859
Cool. Well, this shows us that although CIs can be large, when the model is correct they will include the generating value (1-α)% of the time.
We can theoretically fix the state for any internal node during MLE of ancestral states. In practice, we would do this by attaching a terminal edge of zero length to the internal node we wanted to fix, and then treat it like another tip value in our analyses.
This method is implemented in the phytools function fastAnc
.
anc.states<-a[as.character(c(27,31,35))]
anc.states
## 27 31 35
## 0.0000000 -0.1594359 -0.8463845
fit<-fastAnc(tree,x,anc.states=anc.states,CI=TRUE)
plot(a,fit$ace,xlab="true states",ylab="estimated states")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red") ## 1:1 line
Another possibility, which also allows for the possibility that ancestral states are uncertain, is to use an informative prior probability density on one or multiple states at internal nodes, and then estimate ancestral states using Bayesian MCMC.
Let's try this using a really bad case for ancestral character estimation from contemporaneous tip data: Brownian evolution with a trend. Note that although we theoretically could do so - we are not fitting a trended model here.
tree<-pbtree(n=100,scale=1)
## simulate data with a trend
x<-fastBM(tree,internal=TRUE,mu=3)
phenogram(tree,x,ftype="off",spread.labels=FALSE)
a<-x[as.character(1:tree$Nnode+Ntip(tree))]
x<-x[tree$tip.label]
## let's see how bad we do if we ignore the trend
plot(a,fastAnc(tree,x),xlab="true values",
ylab="estimated states under BM")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red")
title("estimated without prior information")
## incorporate prior knowledge
pm<-setNames(c(1000,rep(0,tree$Nnode)),
c("sig2",1:tree$Nnode+length(tree$tip.label)))
## the root & two randomly chosen nodes
nn<-as.character(c(length(tree$tip.label)+1,
sample(2:tree$Nnode+length(tree$tip.label),2)))
pm[nn]<-a[as.character(nn)]
## prior variance
pv<-setNames(c(1000^2,rep(1000,length(pm)-1)),names(pm))
pv[as.character(nn)]<-1e-100
## run MCMC
mcmc<-anc.Bayes(tree,x,ngen=100000,
control=list(pr.mean=pm,pr.var=pv,
a=pm[as.character(length(tree$tip.label)+1)],
y=pm[as.character(2:tree$Nnode+length(tree$tip.label))]))
## Control parameters (set by user or default):
## List of 7
## $ sig2 : num 0.933
## $ a : Named num 0
## ..- attr(*, "names")= chr "101"
## $ y : Named num [1:98] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:98] "102" "103" "104" "105" ...
## $ pr.mean: Named num [1:100] 1000 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:100] "sig2" "101" "102" "103" ...
## $ pr.var : Named num [1:100] 1e+06 1e-100 1e+03 1e+03 1e+03 ...
## ..- attr(*, "names")= chr [1:100] "sig2" "101" "102" "103" ...
## $ prop : num [1:100] 0.00933 0.00933 0.00933 0.00933 0.00933 ...
## $ sample : num 100
## Starting MCMC...
## Done MCMC.
anc.est<-colMeans(mcmc[201:1001,
as.character(1:tree$Nnode+length(tree$tip.label))])
plot(a[names(anc.est)],anc.est,xlab="true values",
ylab="estimated states using informative prior")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red")
title("estimated using informative prior")
Written by Liam J. Revell. Last updated 29 Jun. 2015