Studying the evolution of discrete & continuous characters using the threshold model

In this tutorial, we're going to cover three different analyses:

  1. We're going to estimate the evolutionary covariances between discrete & continuous characters under the threshold model using THRESHML via the R package Rphylip.

  2. We're going to estimate the correlation between discrete & continuous characters using Bayesian MCMC in the phytools function threshBayes.

  3. We're going to reconstruct the ancestral states of a multistate, naturally ordered character on the tree assuming the threshold model.

For the latter two parts, if you want to following along, but do not have a fast enough computer to run the MCMC, you can download and load the .Rdata file thresh.Rdata and then run everything except the MCMC.

## remove comment characters
## load("http://www.phytools.org/eqg2015/data/thresh.Rdata")

Discrete & continuous character evolution using THRESHML

In the first part, we will estimate the covariances between discrete and continuous characters using THRESHML, but run in R using the package Rphylip.

This exercise will use the following two file:

  1. threshold.data.csv

  2. threshold-tree.tre.

It also uses THRESHML which can be downloaded from its website. If you cannot uncompress the .zip file, you can also find the binaries here. If you are working in OSX or Linux you will have to delete the extension .osx (or as the case may be) after downloading the file to your current working directory. Rphylip, though on CRAN, should also be downloaded directly from its website too to get the most recent version:

install.packages("http://www.phytools.org/Rphylip/Rphylip_0.1-27.tar.gz",
    type="source",repos=NULL)
## Installing package into 'C:/Users/Liam/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
library(Rphylip)
## Loading required package: ape
packageVersion("Rphylip")
## [1] '0.1.27'
tree<-read.tree("http://www.phytools.org/eqg2015/data/threshold-tree.tre")
tree
## 
## Phylogenetic tree with 50 tips and 49 internal nodes.
## 
## Tip labels:
##  t1, t4, t5, t21, t22, t36, ...
## 
## Rooted; includes branch lengths.
X<-read.csv("http://www.phytools.org/eqg2015/data/threshold-data.csv",row.names=1)
head(X,20)
##              v1          v2 v3 v4          v5
## t1   0.23306960  0.34327039  A  A -0.17692784
## t4   0.66946987  1.53877935  B  A -0.01637280
## t5   0.99797049  2.11457033  B  B  0.63324808
## t21  1.19041251 -0.40707020  B  A -0.02923156
## t22  1.34913569  0.15694197  B  B -0.15965937
## t36  0.05208294 -1.48938136  B  B -1.50470947
## t37 -0.28370646 -1.77117587  A  B -1.65830140
## t26  0.15242352 -1.83272531  B  B -1.83731420
## t6   0.39157829 -0.30742614  B  B -1.16719631
## t12  0.67568316 -0.00825866  B  A -0.67593535
## t13  1.42193689  0.56928709  B  B -0.29758866
## t15  0.13407221 -0.12217047  B  A -0.70981908
## t16  0.78582147  0.41583799  B  B  0.05201095
## t9   1.19247059  1.15075296  B  B  0.50817894
## t7   2.36650908  1.95628960  B  A  0.56247386
## t41  2.41623554  0.16918743  B  A -0.35497459
## t42  2.54605619 -0.27074478  B  A -0.57023929
## t43  2.61563649  0.01028018  B  B -0.50709251
## t44  2.32194468  0.32693328  B  B -0.32591717
## t10  2.13656850  1.00232247  B  B -0.48040619
fit<-Rthreshml(tree,X,proposal=0.1,path=".",nchain=5)
## 
## Threshold character Maximum Likelihood method version 3.7a
## 
##  In this output the first 3 characters are continuous,
##  and the next 2 characters are discrete.
## 
##  Covariance matrix of continuous characters
##  and liabilities of discrete characters
##  (the continuous characters are first)
## 
##              1          2          3          4          5    
## 
##   1       0.98256    0.11758    0.13758    0.40052    0.08170
##   2       0.11758    2.72493    1.39526    0.05377    0.11475
##   3       0.13758    1.39526    1.00221    0.08303    0.02970
##   4       0.40052    0.05377    0.08303    1.00000   -0.13886
##   5       0.08170    0.11475    0.02970   -0.13886    1.00000
## 
## 
##  Transform from independent variables (columns)
##  to liabilities or characters (rows)
## 
##              1          2          3          4          5    
## 
##   1       0.99124    0.00000    0.00000    0.00000    0.00000
##   2       0.11862    1.64647    0.00000    0.00000    0.00000
##   3       0.13880    0.83742    0.53072    0.00000    0.00000
##   4       0.40406    0.00355    0.04518    0.91361    0.00000
##   5       0.08243    0.06376   -0.06619   -0.18542    0.97487
## 
## 
##    For            Variance of
##    variable       its change
##    --------       -----------
##        1            3.525106
##        2            1.389672
##        3            1.038796
##        4            0.539244
##        5            0.216876
## 
##  Transform from liabilities or characters (columns)
##  to continuous variables or liabilities (rows)
## 
##              1          2          3          4          5    
## 
##   1       0.07483    0.86832    0.48627    0.04387    0.04519
##   2      -0.66398    0.09915   -0.02304   -0.72607    0.14696
##   3       0.32835   -0.04654   -0.04388   -0.11594    0.93523
##   4      -0.66252    0.03561   -0.05198    0.67630    0.31578
##   5       0.08235    0.48245   -0.87085    0.00918   -0.04463
fit
## $Covariance_matrix
##         v1      v2       v3       v4      v5
## v1 0.98256 0.11758  0.40052  0.08170 0.13758
## v2 0.11758 2.72493  0.05377  0.11475 1.39526
## v3 0.40052 0.05377  1.00000 -0.13886 0.08303
## v4 0.08170 0.11475 -0.13886  1.00000 0.02970
## v5 0.13758 1.39526  0.08303  0.02970 1.00221
## 
## $Transform_indepvar_liab
##         v1      v2       v3      v4       v5
## v1 0.99124 0.00000  0.00000 0.00000  0.00000
## v2 0.11862 1.64647  0.00000 0.00000  0.00000
## v3 0.40406 0.00355  0.91361 0.00000  0.04518
## v4 0.08243 0.06376 -0.18542 0.97487 -0.06619
## v5 0.13880 0.83742  0.00000 0.00000  0.53072
## 
## $Var_change
##       v1       v2       v3       v4       v5 
## 3.525106 1.389672 0.539244 0.216876 1.038796 
## 
## $Transform_liab_cont
##          v1       v2       v3       v4       v5
## v1  0.07483  0.86832  0.04387  0.04519  0.48627
## v2 -0.66398  0.09915 -0.72607  0.14696 -0.02304
## v3 -0.66252  0.03561  0.67630  0.31578 -0.05198
## v4  0.08235  0.48245  0.00918 -0.04463 -0.87085
## v5  0.32835 -0.04654 -0.11594  0.93523 -0.04388

By chance, we know the true covariance, which in this case is:

true.Cov<-matrix(c(1,0,0.8,0,0,
0,2,0,0,0,
0.8,0,1,-0.5,1.2,
0,0,-0.5,1,0,
0,1.2,0,0,1),5,5)
colnames(true.Cov)<-rownames(true.Cov)<-colnames(X)
true.Cov
##     v1 v2   v3   v4  v5
## v1 1.0  0  0.8  0.0 0.0
## v2 0.0  2  0.0  0.0 1.2
## v3 0.8  0  1.0 -0.5 0.0
## v4 0.0  0 -0.5  1.0 0.0
## v5 0.0  0  1.2  0.0 1.0

We can compare it to our estimated matrix using Rthreshml:

plot(true.Cov,fit$Covariance_matrix,xlab="true covariances",
    ylab="estimated covariances")
lines(c(-1,2),c(-1,2),lty="dashed",col="red")

plot of chunk unnamed-chunk-4

Correlation between discrete & continuous characters under the threshold model

Next, we're going to use a Bayesian MCMC approach to do something similar. This is implemented in phytools; however it only permits two characters - so we will use the same data as before, but this time with only a subset of our data.

library(phytools)
## Loading required package: maps
## set some parameters for the MCMC
sample<-500
ngen<-200000 ## in 'real' studies this should be larger
burnin<-0.2*ngen
## unfortunately, we have to convert to a 0/1 character
xy<-cbind(X[,1],as.numeric(X[,3])-1)
colnames(xy)<-colnames(X)[c(1,3)]
rownames(xy)<-rownames(X)
head(xy,20)
##              v1 v3
## t1   0.23306960  0
## t4   0.66946987  1
## t5   0.99797049  1
## t21  1.19041251  1
## t22  1.34913569  1
## t36  0.05208294  1
## t37 -0.28370646  0
## t26  0.15242352  1
## t6   0.39157829  1
## t12  0.67568316  1
## t13  1.42193689  1
## t15  0.13407221  1
## t16  0.78582147  1
## t9   1.19247059  1
## t7   2.36650908  1
## t41  2.41623554  1
## t42  2.54605619  1
## t43  2.61563649  1
## t44  2.32194468  1
## t10  2.13656850  1
## we can start by running our Bayesian MCMC on the original data
mcmc_13<-threshBayes(tree,xy,types=c("cont","disc"),
    ngen=ngen,control=list(sample=sample))
## gen 1000
## gen 2000
## gen 3000
## gen 4000
## gen 5000
## gen 6000
## gen 7000
## gen 8000
## gen 9000
## gen 10000
## ....
## gen 199000
## gen 200000
## here is our "post burnin" mean from the posterior sample
mean(mcmc_13$par[(burnin/sample+1):nrow(mcmc_13$par),"r"])
## [1] 0.5215849
## plot our likelihood profile
plot(mcmc_13$par[,"gen"],mcmc_13$par[,"logL"],type="l",
    xlab="generation",ylab="logL")

plot of chunk unnamed-chunk-5

## plot our posterior density for the correlation
plot(density(mcmc_13$par[(burnin/sample+1):nrow(mcmc_13$par),
    "r"],bw=0.1),xlab="r",main="posterior density for r")
lines(rep(cov2cor(true.Cov)[1,3],2),par()$usr[3:4],lty="dashed")

plot of chunk unnamed-chunk-6

Next, let's analyze two binary characters under the same model. Remember, each time we're losing information about the underlying correlation between liabilities. Nonetheless….

xy<-cbind(as.numeric(X[,3])-1,as.numeric(X[,4])-1)
colnames(xy)<-colnames(X)[c(3,4)]
rownames(xy)<-rownames(X)
head(xy,20)
##     v3 v4
## t1   0  0
## t4   1  0
## t5   1  1
## t21  1  0
## t22  1  1
## t36  1  1
## t37  0  1
## t26  1  1
## t6   1  1
## t12  1  0
## t13  1  1
## t15  1  0
## t16  1  1
## t9   1  1
## t7   1  0
## t41  1  0
## t42  1  0
## t43  1  1
## t44  1  1
## t10  1  1
mcmc_34<-threshBayes(tree,xy,types=c("disc","disc"),
    ngen=ngen,control=list(sample=sample))
## gen 1000
## gen 2000
## gen 3000
## gen 4000
## gen 5000
## gen 6000
## gen 7000
## gen 8000
## gen 9000
## gen 10000
## ....
## gen 199000
## gen 200000
## what is our estimate of the correlation?
mean(mcmc_34$par[(burnin/sample+1):nrow(mcmc_34$par),"r"])
## [1] -0.2516828
## plot our likelihood profile
plot(mcmc_34$par[,"logL"],type="l",xlab="generation",
    ylab="logL")

plot of chunk unnamed-chunk-7

## plot our posterior sample for the correlation
plot(density(mcmc_34$par[(burnin/sample+1):nrow(mcmc_34$par),
    "r"],bw=0.1),xlab="r",main="posterior density for r")
lines(rep(cov2cor(true.Cov)[3,4],2),par()$usr[3:4],lty="dashed")

plot of chunk unnamed-chunk-8

So we can see that we can see that threshBayes does a reasonable job at estimating the correlation between binary & continuous characters under the threshold model.

Ancestral character estimation under the threshold model

We can also do ancestral character reconstruction for discrete characters using the threshold model.

## simulate tree & data
ngen<-100000
sample<-500
obj<-read.csv("http://www.phytools.org/eqg2015/data/ancThresh-data.csv",
    row.names=1)
x<-setNames(as.character(obj[,1]),rownames(obj))
mcmc<-ancThresh(tree,x,ngen=ngen,sequence=letters[1:4],
    control=list(sample=sample,plot=FALSE))
## MCMC starting....
## gen 1000
## gen 2000
## gen 3000
## gen 4000
## gen 5000
## gen 6000
## gen 7000
## gen 8000
## gen 9000
## gen 10000
## ....
## gen 99000
## gen 100000
plotTree(tree,ftype="off")
tiplabels(pie=to.matrix(x,letters[1:4]),
    piecol=palette()[1:4],cex=0.3)
nodelabels(pie=mcmc$ace,piecol=palette()[1:4],cex=0.8)

plot of chunk unnamed-chunk-9

Theoretically, we can even use this method to identify the relative positions of thresholds:

colMeans(mcmc$par[(0.2*ngen/sample):(ngen/sample)+1,
    letters[1:4]])
##         a         b         c         d 
## 0.0000000 0.8665975 1.2204302       Inf
par(mfrow=c(2,1))
par(mar=c(4.1,5.1,2.1,2.1))
plot(density(mcmc$par[(0.2*ngen/sample):(ngen/sample)+1,
    letters[2]],bw=0.5),xlab="threshold a->b",main="")
lines(c(0.5,0.5),c(0,1000),lty="dashed")
plot(density(mcmc$par[(0.2*ngen/sample):(ngen/sample)+1,
    letters[3]],bw=0.5),xlab="threshold b->c",main="")
lines(c(2,2),c(0,1000),lty="dashed")

plot of chunk unnamed-chunk-10

Save the workspace:

save.image(file="thresh.Rdata")

Written by Liam J. Revell. Last updated August 15, 2015