Exercise 7: Reconstructing ancestral states for continuous characters

For better or for worse, the estimation of phenotypic trait values for ancestral nodes in the tree continues to be an important goal in phylogenetic comparative biology.

This tutorial focuses on the estimation of ancestral character states under Brownian motion, including when some values for internal nodes are known.

Anole body size

Let's start by estimating ancestral states for a a continuous character, in this case overall body size, on a phylogeny of Anolis lizards from the Caribbean.

To estimate the states for a continuously valued character at ancestral nodes, we need find the states that have the maximum probability of having arisen under our model. These will be our maximum likelihood estimates.

Ancestral character esimation is implemented in a variety of different R functions. The most commonly used is ace. Here, we start with the phytools function fastAnc.

The files you will need are here:

Anolis.tre
svl.csv.

If you have a good web connection, it is also always possible to read in your files directly from the web, too.

## load libraries
library(phytools)

## Loading required package: ape

## Loading required package: maps

## 
##  # maps v3.1: updated 'world': all lakes moved to separate new #
##  # 'lakes' database. Type '?world' or 'news(package="maps")'.  #

## read tree from file
tree<-read.tree("Anolis.tre")

## or
tree<-read.tree("http://www.phytools.org/SanJuan2016/data/Anolis.tre")

## plot tree
plotTree(tree,type="fan",ftype="i")

plot of chunk unnamed-chunk-3

## read data
svl<-read.csv("svl.csv",row.names=1)

## or
svl<-read.csv("http://www.phytools.org/SanJuan2016/data/svl.csv",
    row.names=1)

## change this into a vector
svl<-setNames(svl$svl,rownames(svl))
svl

##            ahli         alayoni         alfaroi        aliniger 
##        4.039125        3.815705        3.526655        4.036557 
##        allisoni         allogus   altitudinalis         alumina 
##        4.375390        4.040138        3.842994        3.588941 
##       alutaceus     angusticeps     argenteolus     argillaceus 
##        3.554891        3.788595        3.971307        3.757869 
##         armouri   bahorucoensis        baleatus        baracoae 
##        4.121684        3.827445        5.053056        5.042780 
##       barahonae        barbatus        barbouri        bartschi 
##        5.076958        5.003946        3.663932        4.280547 
##         bremeri        breslini    brevirostris        caudalis 
##        4.113371        4.051111        3.874155        3.911743 
##       centralis  chamaeleonides    chlorocyanus     christophei 
##        3.697941        5.042349        4.275448        3.884652 
##       clivicola     coelestinus        confusus           cooki 
##        3.758726        4.297965        3.938442        4.091535 
##    cristatellus    cupeyalensis         cuvieri    cyanopleurus 
##        4.189820        3.462014        4.875012        3.630161 
##         cybotes     darlingtoni       distichus dolichocephalus 
##        4.210982        4.302036        3.928796        3.908550 
##       equestris      etheridgei   eugenegrahami       evermanni 
##        5.113994        3.657991        4.128504        4.165605 
##         fowleri         garmani         grahami           guafe 
##        4.288780        4.769473        4.154274        3.877457 
##       guamuhaya         guazuma       gundlachi       haetianus 
##        5.036953        3.763884        4.188105        4.316542 
##      hendersoni      homolechis           imias    inexpectatus 
##        3.859835        4.032806        4.099687        3.537439 
##       insolitus        isolepis           jubar           krugi 
##        3.800471        3.657088        3.952605        3.886500 
##      lineatopus   longitibialis        loysiana          lucius 
##        4.128612        4.242103        3.701240        4.198915 
##    luteogularis      macilentus        marcanoi          marron 
##        5.101085        3.715765        4.079485        3.831810 
##         mestrei       monticola          noblei        occultus 
##        3.987147        3.770613        5.083473        3.663049 
##         olssoni        opalinus      ophiolepis        oporinus 
##        3.793899        3.838376        3.637962        3.845670 
##        paternus        placidus       poncensis        porcatus 
##        3.802961        3.773967        3.820378        4.258991 
##          porcus      pulchellus         pumilis quadriocellifer 
##        5.038034        3.799022        3.466860        3.901619 
##      reconditus        ricordii     rubribarbus          sagrei 
##        4.482607        5.013963        4.078469        4.067162 
##    semilineatus        sheplani         shrevei      singularis 
##        3.696631        3.682924        3.983003        4.057997 
##      smallwoodi         strahmi       stratulus     valencienni 
##        5.035096        4.274271        3.869881        4.321524 
##       vanidicus    vermiculatus        websteri       whitemani 
##        3.626206        4.802849        3.916546        4.097479

Now we can estimate ancestral states. We will also compute variances & 95% confidence intervals for each node:

fit<-fastAnc(tree,svl,vars=TRUE,CI=TRUE)
print(fit,printlen=10)

## Ancestral character estimates using fastAnc:
##       101      102      103      104      105      106      107      108
##  4.065918 4.045601 4.047993 4.066923 4.036743 4.049514 4.054528 4.045218
##       109     110     
##  3.979493 3.95244 ....
## 
## Variances on ancestral states:
##       101     102      103     104     105      106      107      108
##  0.011775 0.00841 0.009452 0.01187 0.01446 0.018113 0.011686 0.007269
##       109      110     
##  0.014488 0.012814 ....
## 
## Lower & upper 95% CIs:
##        lower    upper
## 101 3.853231 4.278604
## 102 3.865861 4.225341
## 103 3.857439 4.238548
## 104 3.853382 4.280465
## 105 3.801056  4.27243
## 106 3.785731 4.313298
## 107 3.842649 4.266406
## 108 3.878112 4.212323
## 109 3.743577 4.215408
## 110 3.730566 4.174314
##         ....     ....

## as we discussed in class, 95% CIs can be broad
range(svl) ## compare to root node

## [1] 3.462014 5.113994

We will learn more about this later, but there are also some methods for visualizing the ancestral state reconstructions for continuous traits on the tree. For example:

## projection of the reconstruction onto the edges of the tree
obj<-contMap(tree,svl,plot=FALSE)
obj

## Object of class "contMap" containing:
## 
## (1) A phylogenetic tree with 100 tips and 99 internal nodes.
## 
## (2) A mapped continuous trait on the range (3.462014, 5.113994).

plot(obj,type="fan",legend=0.7*max(nodeHeights(tree)))

plot of chunk unnamed-chunk-7

Properties of ancestral states of ancestral state reconstruction

Next, we can explore some of the properties of ancestral state reconstruction of continuous traits in general, starting wih our Brownian model.

First, let's simulate some data:

## simulate a tree & some data
tree<-pbtree(n=26,scale=1,tip.label=LETTERS[26:1])
## simulate with ancestral states
x<-fastBM(tree,internal=TRUE)
## ancestral states
a<-x[as.character(1:tree$Nnode+Ntip(tree))]
## tip data
x<-x[tree$tip.label]

Now, let's estimate ancestral states using fastAnc which uses the re-rooting algorithm discussed in class:

fit<-fastAnc(tree,x,CI=TRUE)
print(fit,printlen=6)

## Ancestral character estimates using fastAnc:
##         27       28        29       30        31        32     
##  -0.195128 0.137528 -0.361822 -0.65777 -0.660149 -0.666084 ....
## 
## Lower & upper 95% CIs:
##        lower     upper
## 27 -0.886855  0.496599
## 28 -0.540217  0.815274
## 29 -0.948743  0.225099
## 30 -1.114127 -0.201412
## 31 -1.110236 -0.210061
## 32 -1.111167    -0.221
##         ....      ....

We can easily compare these estimates to the (normally unknown) generating values for simulated data:

plot(a,fit$ace,xlab="true states",ylab="estimated states")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red") ## 1:1 line

plot of chunk unnamed-chunk-10

One of the most common critiques of ancestral state estimation is that the uncertainty around ancestral states can be large; and, furthermore, that this uncertainty is often ignored. Let's address this by obtaining 95% CIs on ancestral values, and then let's show that the 95% CIs include the generating values around 95% of the time:

## first, let's go back to our previous dataset
print(fit)

## Ancestral character estimates using fastAnc:
##        27        28        29        30        31        32        33 
## -0.195128  0.137528 -0.361822 -0.657770 -0.660149 -0.666084 -0.800669 
##        34        35        36        37        38        39        40 
## -0.938376 -0.793030 -0.583928 -0.509131 -0.614028 -0.839242 -0.476535 
##        41        42        43        44        45        46        47 
## -0.115802 -0.609347 -0.670228 -0.702526 -0.803653 -0.269509 -0.270716 
##        48        49        50        51 
## -0.381415 -0.595714 -0.376348 -0.412522 
## 
## Lower & upper 95% CIs:
##        lower     upper
## 27 -0.886855  0.496599
## 28 -0.540217  0.815274
## 29 -0.948743  0.225099
## 30 -1.114127 -0.201412
## 31 -1.110236 -0.210061
## 32 -1.111167 -0.221000
## 33 -1.197824 -0.403515
## 34 -1.298520 -0.578231
## 35 -1.022320 -0.563739
## 36 -0.883963 -0.283892
## 37 -0.537165 -0.481096
## 38 -1.032292 -0.195764
## 39 -1.192058 -0.486425
## 40 -0.835417 -0.117653
## 41 -0.454910  0.223305
## 42 -0.949638 -0.269057
## 43 -0.827204 -0.513252
## 44 -0.834681 -0.570372
## 45 -1.067799 -0.539506
## 46 -0.680942  0.141923
## 47 -0.681679  0.140247
## 48 -0.781315  0.018484
## 49 -0.971147 -0.220281
## 50 -0.633324 -0.119372
## 51 -0.625281 -0.199763

mean(((a>=fit$CI95[,1]) + (a<=fit$CI95[,2]))==2)

## [1] 0.72

One small tree doesn't tell us much, so let's repeat for a sample of trees & reconstructions:

## custom function that conducts a simulation, estimates ancestral
## states, & returns the fraction on 95% CI
foo<-function(){
    tree<-pbtree(n=100)
    x<-fastBM(tree,internal=TRUE)
    fit<-fastAnc(tree,x[1:length(tree$tip.label)],CI=TRUE)
    mean(((x[1:tree$Nnode+length(tree$tip.label)]>=fit$CI95[,1]) +
        (x[1:tree$Nnode+length(tree$tip.label)]<=fit$CI95[,2]))==2)
}
## conduct 100 simulations
pp<-replicate(100,foo())    
mean(pp)

## [1] 0.9485859

Cool. Well, this shows us that although CIs can be large, when the model is correct they will include the generating value (1-α)% of the time.

Ancestral state estimates when some nodes are known

We can theoretically fix the state for any internal node during MLE of ancestral states. In practice, we would do this by attaching a terminal edge of zero length to the internal node we wanted to fix, and then treat it like another tip value in our analyses.

This method is implemented in the phytools function fastAnc.

anc.states<-a[as.character(c(27,31,35))]
anc.states

##         27         31         35 
##  0.0000000 -0.1594359 -0.8463845

fit<-fastAnc(tree,x,anc.states=anc.states,CI=TRUE)
plot(a,fit$ace,xlab="true states",ylab="estimated states")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red") ## 1:1 line

plot of chunk unnamed-chunk-13

Another possibility, which also allows for the possibility that ancestral states are uncertain, is to use an informative prior probability density on one or multiple states at internal nodes, and then estimate ancestral states using Bayesian MCMC.

Let's try this using a really bad case for ancestral character estimation from contemporaneous tip data: Brownian evolution with a trend. Note that although we theoretically could do so - we are not fitting a trended model here.

tree<-pbtree(n=100,scale=1)
## simulate data with a trend
x<-fastBM(tree,internal=TRUE,mu=3)
phenogram(tree,x,ftype="off",spread.labels=FALSE)

plot of chunk unnamed-chunk-14

a<-x[as.character(1:tree$Nnode+Ntip(tree))]
x<-x[tree$tip.label]
## let's see how bad we do if we ignore the trend
plot(a,fastAnc(tree,x),xlab="true values",
    ylab="estimated states under BM")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red")
title("estimated without prior information")

plot of chunk unnamed-chunk-14

## incorporate prior knowledge
pm<-setNames(c(1000,rep(0,tree$Nnode)),
    c("sig2",1:tree$Nnode+length(tree$tip.label)))
## the root & two randomly chosen nodes
nn<-as.character(c(length(tree$tip.label)+1,
    sample(2:tree$Nnode+length(tree$tip.label),2)))
pm[nn]<-a[as.character(nn)]
## prior variance
pv<-setNames(c(1000^2,rep(1000,length(pm)-1)),names(pm))
pv[as.character(nn)]<-1e-100
## run MCMC
mcmc<-anc.Bayes(tree,x,ngen=100000,
    control=list(pr.mean=pm,pr.var=pv,
    a=pm[as.character(length(tree$tip.label)+1)],
    y=pm[as.character(2:tree$Nnode+length(tree$tip.label))]))

## Control parameters (set by user or default):

## List of 7
##  $ sig2   : num 0.933
##  $ a      : Named num 0
##   ..- attr(*, "names")= chr "101"
##  $ y      : Named num [1:98] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "names")= chr [1:98] "102" "103" "104" "105" ...
##  $ pr.mean: Named num [1:100] 1000 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "names")= chr [1:100] "sig2" "101" "102" "103" ...
##  $ pr.var : Named num [1:100] 1e+06 1e-100 1e+03 1e+03 1e+03 ...
##   ..- attr(*, "names")= chr [1:100] "sig2" "101" "102" "103" ...
##  $ prop   : num [1:100] 0.00933 0.00933 0.00933 0.00933 0.00933 ...
##  $ sample : num 100

## Starting MCMC...

## Done MCMC.

anc.est<-colMeans(mcmc[201:1001,
    as.character(1:tree$Nnode+length(tree$tip.label))])
plot(a[names(anc.est)],anc.est,xlab="true values",
    ylab="estimated states using informative prior")
lines(range(c(x,a)),range(c(x,a)),lty="dashed",col="red")
title("estimated using informative prior")

plot of chunk unnamed-chunk-14

Written by Liam J. Revell. Last updated 29 Jun. 2015