This first challenge problem had two parts.
This first part asked you to download the following two files
(pic-exercise-data.csv
and pic-exercise-tree.tre)
then fit an OLS (non-phylogenetic) regression model for y~x and
a phylogenetic regression model using contrasts, as well as to explain what
you found.
The second (optional) part asked you to (1) show that OLS for data simulated on the tree results in elevated type I error; and (2) to show that phylogenetic independent contrasts results in the correct type I error.
First, load packages:
## load phytools
library(phytools)
## Loading required package: ape
## Loading required package: maps
Read the data and tree from file:
## load data
X<-read.csv("pic-exercise-data.csv",row.names=1)
## load tree
tree<-read.tree("pic-exercise-tree.tre")
tree
##
## Phylogenetic tree with 100 tips and 99 internal nodes.
##
## Tip labels:
## t64, t4, t14, t18, t92, t35, ...
##
## Rooted; includes branch lengths.
This is not strictly necessary, but it will make things easier if we pull
out the columns of X as separate vectors:
x<-X[,1]
names(x)<-rownames(X)
## or
y<-setNames(X[,2],rownames(X))
Now, let's plot x & y:
plot(x,y,pch=21,bg="grey",cex=1.4)
Looks like there is no relationship, right?
fit<-lm(y~x)
plot(x,y)
abline(fit)
fit
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -1.95546 -0.02282
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.70042 -0.18339 0.04348 0.41484 0.95401
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.95546 0.12017 -16.272 <2e-16 ***
## x -0.02282 0.06582 -0.347 0.73
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6289 on 98 degrees of freedom
## Multiple R-squared: 0.001225, Adjusted R-squared: -0.008967
## F-statistic: 0.1202 on 1 and 98 DF, p-value: 0.7296
anova(fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 0.048 0.04753 0.1202 0.7296
## Residuals 98 38.763 0.39554
However, we cannot forget that our data are related by a phylogenetic tree.
Let's account for that covariance among the data in x &
y using the contrasts method:
## compute contrasts
pic.x<-pic(x,tree)
pic.y<-pic(y,tree)
## plot contrasts
plot(pic.x,pic.y,pch=21,bg="grey",cex=1.4)
## fit model without intercept
fit<-lm(pic.y~pic.x+0)
abline(fit)
summary(fit)
##
## Call:
## lm(formula = pic.y ~ pic.x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5957 -0.7257 0.2020 0.7100 2.3120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## pic.x 0.70387 0.09648 7.296 7.81e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9944 on 98 degrees of freedom
## Multiple R-squared: 0.352, Adjusted R-squared: 0.3454
## F-statistic: 53.23 on 1 and 98 DF, p-value: 7.808e-11
How can we account for this? Well, if we project our tree into the phenotype space, we will see that within each clade there is a relationship, but our ability to measure this relationship breaks down due to differences between clades:
phylomorphospace(tree,cbind(x,y),label="off",node.size=c(0,0))
points(cbind(x,y),pch=21,cex=1.4,bg="grey")
This part of the challenge was a little bit trickier.
The first task is to demonstrate that OLS has type I error above the nominal level (say, 0.05) when the data arose on a phylogeny.
The way I will do this is first generate 1000 random, “pure-birth” phylogenetic trees. Next, I will simulate the uncorrelated Brownian evolution of two traits (x and y) on each tree. Finally, I will fit a standard OLS bivariate regression model to each dataset. I will estimate the type I error rate as the fraction of fitted models that are significant at the α = 0.05 level.
Let's simulate 200 pure-birth trees using pbtree, along with datasets for
x & y generated with no evolutionary correlation between
characters.
N<-1000 # number of simulations
trees<-pbtree(n=100,scale=1,nsim=N)
x<-lapply(trees,fastBM)
y<-lapply(trees,fastBM)
I used an lapply call, but I could've equally well used a for
loop, e.g.:
x<-list()
for(i in 1:length(trees)) x[[i]]<-fastBM(trees[[i]])
Next, we fit each model:
fits.ols<-list()
for(i in 1:length(trees)) fits.ols[[i]]<-lm(y[[i]]~x[[i]])
Finally, let's count the number that are significant at the α = 0.05 level:
p.ols<-sapply(fits.ols,function(x) anova(x)$"Pr(>F)"[1])
## this should be uniform
hist(p.ols,breaks=seq(0,1,by=0.05),col="grey")
## this is our realized type I error rate
mean(p.ols<=0.05)
## [1] 0.457
From this we should see that type I error can be substantially increased if the phylogeny is not taken into account.
Now, I'll show that our type I errors go to the nominal level (α = 0.05) when we first compute phylogenetically independent contrasts following Felsenstein (1985), and then fit our regression model without an intercept term to the contrasts rather than to the original data. To do this, I will use the trees generated in the previous exercise. Conveniently, I also kept our simulated data from last time, so I can re-use that here for a more direct comparison!
First, here is an updated version of the custom function for type I error rate
estimation, but this one uses the contrasts transformation before analysis, and
also allows us to supply the data for x & y that we
simulated before.
fits.pic<-list()
for(i in 1:length(trees)){
pic.x<-pic(x[[i]],trees[[i]])
pic.y<-pic(y[[i]],trees[[i]])
fits.pic[[i]]<-lm(pic.y~pic.x+0)
}
p.pic<-sapply(fits.pic,function(x) anova(x)$"Pr(>F)"[1])
hist(p.pic,breaks=seq(0,1,by=0.1),col="grey")
mean(p.pic<=0.05)
## [1] 0.059
We see the following in this exercise: (1) PIC gives a uniform distribution for our p-values under the null; and (2) PIC restores the type I error to at or about the nominal level (0.05).
Written by Liam J. Revell. Last updated 6 December 2016.