Phylogenetic analysis of the threshold model

In this tutorial, we're going to cover two different analyses:

1. We're going to estimate the correlation between two binary characters, as well as between a binary & a continuous character, assuming a threshold model of character evolution for our discrete character using both threshBayes in the phytools package in Rthreshml in Rphylip; and
2. We're going to reconstruct the ancestral states of a multistate, naturally ordered character on the tree assuming the threshold model.

If you want to following along, but do not have a fast enough computer to run the MCMC, you can download and load the .Rdata file thresh.Rdata and then run everything except the MCMC.

Correlation between discrete & continuous characters under the threshold model

First, we're going to use Bayesian MCMC method to sample the evolutionary parameters of our model from their joint posterior probability distribution. These parameters include evolutionary rates (σ2s) for each continuous character; and the correlation between our discrete and continuous character. Since this is a binary model with a single threshold - we don't have to worry about the relative positions of the thresholds.

We'll start by simulating a tree & two correlated Brownian variables on the tree.

For illustrative purposes, let's start by using threshBayes on two continuous characters & show that the result lines up closely with our MLE of the correlation.

## first, load phytools
library(phytools)
## Loading required package: ape
## Loading required package: maps
## Loading required package: rgl
tree <- pbtree(n = 50, scale = 1)
r <- 0.75  # simulate using a high correlation
V <- matrix(c(1, r, r, 1), 2, 2)
X <- sim.corrs(tree, V)
## set some parameters for the MCMC
sample <- 500
ngen <- 2e+05  ## in 'real' studies this should be larger
burnin <- 0.2 * ngen
## we can start by running our Bayesian MCMC on the original data
AA <- threshBayes(tree, X, types = c("cont", "cont"), ngen = ngen, control = list(sample = sample))
## [1] "gen 1000"
## [1] "gen 2000"
## [1] "gen 3000"
## [1] "gen 4000"
## [1] "gen 5000"
## [1] "gen 6000"
## ....
## [1] "gen 200000"
## we can compute our MLE of the correlation
V.mle <- phyl.vcv(X, vcv(tree), lambda = 1)$R
V.mle[1, 2]/sqrt(V.mle[1, 1] * V.mle[2, 2])
## [1] 0.756
## here is our 'post burnin' mean from the posterior sample
mean(AA$par[(burnin/sample + 1):nrow(AA$par), "r"])
## [1] 0.7467
## plot our likelihood profile
plot(AA$par[, "logL"], type = "l", xlab = "generation", ylab = "logL")
plot of chunk unnamed-chunk-1
## plot our posterior density for the correlation
plot(density(AA$par[(burnin/sample + 1):nrow(AA$par), "r"], bw = 0.1), xlab = "r",
    main = "posterior density for r")
lines(c(r, r), c(0, 1000), lty = "dashed")
plot of chunk unnamed-chunk-2

OK, this was just a proof-of-concept exercise. Now let's try the same thing after having converted one of our trait values into a threshold character.

## convert X[,2] to a binary character
X[, 2] <- as.numeric(sapply(X[, 2], threshState, thresholds = setNames(c(0,
    Inf), 0:1)))
BB <- threshBayes(tree, X, types = c("cont", "disc"), ngen = ngen, control = list(sample = sample))
## [1] "gen 1000"
## [1] "gen 2000"
## [1] "gen 3000"
## ....
## [1] "gen 200000"
## what is our estimate of the correlation?
mean(BB$par[(burnin/sample + 1):nrow(BB$par), "r"])
## [1] 0.7214
## plot our likelihood profile
plot(BB$par[, "logL"], type = "l", xlab = "generation", ylab = "logL")
plot of chunk unnamed-chunk-3
## plot our estimated posterior density for the correlation
plot(density(BB$par[(burnin/sample + 1):nrow(BB$par), "r"], bw = 0.1), xlab = "r",
    main = "posterior density for r")
lines(c(r, r), c(0, 1000), lty = "dashed")
plot of chunk unnamed-chunk-4

Finally, let's analyze two binary characters under the same model. Remember, each time we're losing information about the underlying correlation between liabilities. Nonetheless....

## convert X[,2] to a binary character
X[, 1] <- as.numeric(sapply(X[, 1], threshState, thresholds = setNames(c(0,
    Inf), 0:1)))
CC <- threshBayes(tree, X, types = c("disc", "disc"), ngen = ngen, control = list(sample = sample))
## [1] "gen 1000"
## [1] "gen 2000"
## [1] "gen 3000"
## ...
## [1] "gen 200000"
## what is our estimate of the correlation?
mean(CC$par[(burnin/sample + 1):nrow(CC$par), "r"])
## [1] 0.5179
## plot our likelihood profile
plot(CC$par[, "logL"], type = "l", xlab = "generation", ylab = "logL")
plot of chunk unnamed-chunk-5
## plot our posterior sample for the correlation
plot(density(CC$par[(burnin/sample + 1):nrow(CC$par), "r"], bw = 0.1), xlab = "r",
    main = "posterior density for r")
lines(c(r, r), c(0, 1000), lty = "dashed")
plot of chunk unnamed-chunk-6

So we can see that we can see that threshBayes does a reasonable job at estimating the correlation between binary & continuous characters under the threshold model.

Finally, in 'real' empirical studies, it is always wise to ensure that you have a 'good' sample from the posterior distribution - for example, by computing the effective sample size of your posterior sample.

## let's use coda to compute effective sample sizes & HPDs
require(coda)
## Loading required package: coda
## Loading required package: lattice
## first, analysis A
rA <- AA$par[(burnin/sample + 1):nrow(AA$par), "r"]
class(rA) <- "mcmc"
effectiveSize(rA)
##  var1 
## 313.2
HPDinterval(rA)
##       lower  upper
## var1 0.5998 0.8556
## attr(,"Probability")
## [1] 0.9502
## second, analysis B
rB <- BB$par[(burnin/sample + 1):nrow(AA$par), "r"]
class(rB) <- "mcmc"
effectiveSize(rB)
##  var1 
## 29.75
HPDinterval(rB)
##       lower  upper
## var1 0.3822 0.9493
## attr(,"Probability")
## [1] 0.9502
## finally, analysis C
rC <- CC$par[(burnin/sample + 1):nrow(AA$par), "r"]
class(rC) <- "mcmc"
effectiveSize(rC)
##  var1 
## 76.72
HPDinterval(rC)
##       lower  upper
## var1 0.1638 0.8465
## attr(,"Probability")
## [1] 0.9502

In addition to this, we can also use Rthreshml in the Rphylip package to call THRESHML from the PHYLIP package internally. This is much faster & should produce similar etimates of the evolutionary covariances under most circumstances. It can also compute the covariances between multiple traits.

Since THRESHML is not in the current version of PHYLIP, it is first necessary to download & install Felsenstein's software. I would advise putting the executable file into the same folder as all your other PHYLIP executables, or (if you have not previously installed PHYLIP), put THRESHML into your current working directory.

Now let's try to run Rthreshml on our dataset from before. We'll just use the default conditions & compare our estimated covariances to those obtained by threshBayes.

require(Rphylip)
## Loading required package: Rphylip
fit <- Rthreshml(tree, X, types = c("cont", "disc"))
## 
## Threshold character Maximum Likelihood method version 3.7a
## 
##  In this output the first character is continuous,
##  and the next character is discrete.
## 
##  Covariance matrix of continuous character
##  and liability of the discrete character
##  (the continuous characters are first)
## 
##              1          2    
## 
##   1       0.18911    0.22566
##   2       0.22566    1.00000
## 
## 
##  Transform from independent variables (columns)
##  to liabilities or characters (rows)
## 
##              1          2    
## 
##   1       0.43487    0.00000
##   2       0.51891    0.85483
## 
## 
##    For            Variance of
##    variable       its change
##    --------       -----------
##        1            1.058568
##        2            0.130546
## 
##  Transform from liabilities or characters (columns)
##  to continuous variables or liabilities (rows)
## 
##              1          2    
## 
##   1      -0.25122   -0.96793
##   2      -0.96793    0.25122
fit
## $Covariance_matrix
##        [,1]   [,2]
## [1,] 0.1891 0.2257
## [2,] 0.2257 1.0000
## 
## $Transform_indepvar_liab
##        [,1]   [,2]
## [1,] 0.4349 0.0000
## [2,] 0.5189 0.8548
## 
## $Var_change
## [1] 1.0586 0.1305
## 
## $Transform_liab_cont
##         [,1]    [,2]
## [1,] -0.2512 -0.9679
## [2,] -0.9679  0.2512

Ancestral character estimation under the threshold model

We can also do ancestral character reconstruction for discrete characters using the threshold model.

## simulate tree & data
n <- 50
ngen <- 1e+05
sample <- 500
tree <- pbtree(n = n, scale = 10)
x <- fastBM(tree, sig2 = 1, a = 0.5, internal = TRUE)
th <- sapply(x, threshState, thresholds = setNames(c(0, 0.5, 2, Inf), letters[1:4]))
mcmc <- ancThresh(tree, th[1:n], ngen = ngen, sequence = letters[1:4], control = list(sample = sample,
    plot = FALSE))
## MCMC starting....
## gen 1000
## gen 2000
## ...
## gen 100000
plotTree(tree, ftype = "off")
tiplabels(pie = to.matrix(th[1:n], letters[1:4]), piecol = palette()[1:4], cex = 0.3)
nodelabels(pie = mcmc$ace, piecol = palette()[1:4], cex = 0.8)
nodelabels(pie = to.matrix(th[1:tree$Nnode + n], letters[1:4]), piecol = palette()[1:4],
    cex = 0.4)
plot of chunk unnamed-chunk-9

Theoretically, we can even use this method to identify the relative positions of thresholds:

colMeans(mcmc$par[(0.2 * ngen/sample):(ngen/sample) + 1, letters[1:4]])
##      a      b      c      d 
## 0.0000 0.7229 2.3610    Inf
par(mfrow = c(2, 1))
par(mar = c(4.1, 5.1, 2.1, 2.1))
plot(density(mcmc$par[(0.2 * ngen/sample):(ngen/sample) + 1, letters[2]], bw = 0.5),
    xlab = "threshold a->b", main = "")
lines(c(0.5, 0.5), c(0, 1000), lty = "dashed")
plot(density(mcmc$par[(0.2 * ngen/sample):(ngen/sample) + 1, letters[3]], bw = 0.5),
    xlab = "threshold b->c", main = "")
lines(c(2, 2), c(0, 1000), lty = "dashed")
plot of chunk unnamed-chunk-10

Save the workspace:

save.image(file = "thresh.Rdata")

That's it!

Written by Liam J. Revell. Last updated May 27, 2014