Solution to Challenge Problem 1: Phylogenetically Independent Contrasts

This first challenge problem had two parts.

This first part asked you to download the following two files (pic-exercise-data.csv and pic-exercise-tree.tre) then fit an OLS (non-phylogenetic) regression model for y~x and a phylogenetic regression model using contrasts, as well as to explain what you found.

The second (optional) part asked you to (1) show that OLS for data simulated on the tree results in elevated type I error; and (2) to show that phylogenetic independent contrasts results in the correct type I error.

Part 1

First, load packages:

## load phytools
library(phytools)

## Loading required package: ape

## Loading required package: maps

Read the data and tree from file:

## load data
X<-read.csv("pic-exercise-data.csv",row.names=1)
## load tree
tree<-read.tree("pic-exercise-tree.tre")
tree

## 
## Phylogenetic tree with 100 tips and 99 internal nodes.
## 
## Tip labels:
##  t64, t4, t14, t18, t92, t35, ...
## 
## Rooted; includes branch lengths.

This is not strictly necessary, but it will make things easier if we pull out the columns of X as separate vectors:

x<-X[,1]
names(x)<-rownames(X)
## or
y<-setNames(X[,2],rownames(X))

Now, let's plot x & y:

plot(x,y,pch=21,bg="grey",cex=1.4)

plot of chunk unnamed-chunk-4

Looks like there is no relationship, right?

fit<-lm(y~x)
plot(x,y)
abline(fit)

plot of chunk unnamed-chunk-5

fit

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##    -1.95546     -0.02282

summary(fit)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.70042 -0.18339  0.04348  0.41484  0.95401 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.95546    0.12017 -16.272   <2e-16 ***
## x           -0.02282    0.06582  -0.347     0.73    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6289 on 98 degrees of freedom
## Multiple R-squared:  0.001225,   Adjusted R-squared:  -0.008967 
## F-statistic: 0.1202 on 1 and 98 DF,  p-value: 0.7296

anova(fit)

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value Pr(>F)
## x          1  0.048 0.04753  0.1202 0.7296
## Residuals 98 38.763 0.39554

However, we cannot forget that our data are related by a phylogenetic tree. Let's account for that covariance among the data in x & y using the contrasts method:

## compute contrasts
pic.x<-pic(x,tree)
pic.y<-pic(y,tree)
## plot contrasts
plot(pic.x,pic.y,pch=21,bg="grey",cex=1.4)
## fit model without intercept
fit<-lm(pic.y~pic.x+0)
abline(fit)

plot of chunk unnamed-chunk-6

summary(fit)

## 
## Call:
## lm(formula = pic.y ~ pic.x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5957 -0.7257  0.2020  0.7100  2.3120 
## 
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)    
## pic.x  0.70387    0.09648   7.296 7.81e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9944 on 98 degrees of freedom
## Multiple R-squared:  0.352,  Adjusted R-squared:  0.3454 
## F-statistic: 53.23 on 1 and 98 DF,  p-value: 7.808e-11

How can we account for this? Well, if we project our tree into the phenotype space, we will see that within each clade there is a relationship, but our ability to measure this relationship breaks down due to differences between clades:

phylomorphospace(tree,cbind(x,y),label="off",node.size=c(0,0))
points(cbind(x,y),pch=21,cex=1.4,bg="grey")

plot of chunk unnamed-chunk-7

Part 2 (optional)

This part of the challenge was a little bit trickier.

The first task is to demonstrate that OLS has type I error above the nominal level (say, 0.05) when the data arose on a phylogeny.

The way I will do this is first generate 1000 random, “pure-birth” phylogenetic trees. Next, I will simulate the uncorrelated Brownian evolution of two traits (x and y) on each tree. Finally, I will fit a standard OLS bivariate regression model to each dataset. I will estimate the type I error rate as the fraction of fitted models that are significant at the α = 0.05 level.

Let's simulate 200 pure-birth trees using pbtree, along with datasets for x & y generated with no evolutionary correlation between characters.

N<-1000 # number of simulations
trees<-pbtree(n=100,scale=1,nsim=N)
x<-lapply(trees,fastBM)
y<-lapply(trees,fastBM)

I used an lapply call, but I could've equally well used a for loop, e.g.:

x<-list()
for(i in 1:length(trees)) x[[i]]<-fastBM(trees[[i]])

Next, we fit each model:

fits.ols<-list()
for(i in 1:length(trees)) fits.ols[[i]]<-lm(y[[i]]~x[[i]])

Finally, let's count the number that are significant at the α = 0.05 level:

p.ols<-sapply(fits.ols,function(x) anova(x)$"Pr(>F)"[1])
## this should be uniform
hist(p.ols,breaks=seq(0,1,by=0.05),col="grey")

plot of chunk unnamed-chunk-11

## this is our realized type I error rate
mean(p.ols<=0.05)

## [1] 0.457

From this we should see that type I error can be substantially increased if the phylogeny is not taken into account.

Now, I'll show that our type I errors go to the nominal level (α = 0.05) when we first compute phylogenetically independent contrasts following Felsenstein (1985), and then fit our regression model without an intercept term to the contrasts rather than to the original data. To do this, I will use the trees generated in the previous exercise. Conveniently, I also kept our simulated data from last time, so I can re-use that here for a more direct comparison!

First, here is an updated version of the custom function for type I error rate estimation, but this one uses the contrasts transformation before analysis, and also allows us to supply the data for x & y that we simulated before.

fits.pic<-list()
for(i in 1:length(trees)){
    pic.x<-pic(x[[i]],trees[[i]])
    pic.y<-pic(y[[i]],trees[[i]])
    fits.pic[[i]]<-lm(pic.y~pic.x+0)
}
p.pic<-sapply(fits.pic,function(x) anova(x)$"Pr(>F)"[1])
hist(p.pic,breaks=seq(0,1,by=0.1),col="grey")

plot of chunk unnamed-chunk-12

mean(p.pic<=0.05)

## [1] 0.059

We see the following in this exercise: (1) PIC gives a uniform distribution for our p-values under the null; and (2) PIC restores the type I error to at or about the nominal level (0.05).

Written by Liam J. Revell. Last updated 6 December 2016.