Tuesday, November 22, 2016

Double standard

Nobody today would publish a phylogeny that has been inferred using a model that does not account for rate variation across sites, at least not a phylogeny over a large evolutionary scale (e.g. animals, eukaryotes, etc.).

And indeed, using such rates-across-sites models, we usually do uncover substantial rate variation in empirical sequence alignments: the shape parameter of the gamma distribution (alpha) is typically less than 1, on the order of 0.7. Below is a plot of the density of the gamma for alpha=0.7. With the mean rate normalized to 1, the 0.25 and 0.75 quantiles are 0.18 and 1.38 respectively. Thus, every site among the 25% fastest evolves at least 1.38/0.18 ≈ 7.7 times faster than any site among the 25% slowest, so the two groups differ by roughly an order of magnitude.
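The quantiles quoted above can be checked with a few lines of Python. This is only a minimal sketch using the standard library: it assumes the usual convention that the gamma distribution of rates has mean 1 (shape = rate = alpha), and the helper names `gamma_cdf` and `gamma_quantile` are mine, not taken from any phylogenetics package.

```python
import math

def gamma_cdf(x, shape, rate):
    """CDF of a gamma(shape, rate) variable, from the power series of the
    regularized lower incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    z = rate * x
    # P(a, z) = z^a e^(-z) / Gamma(a + 1) * sum_k z^k / ((a + 1) ... (a + k))
    term, total, k = 1.0, 1.0, 1
    while term > 1e-16 * total:
        term *= z / (shape + k)
        total += term
        k += 1
    return total * math.exp(shape * math.log(z) - z - math.lgamma(shape + 1.0))

def gamma_quantile(p, shape, rate):
    """Invert the CDF by bisection (slow but dependency-free)."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, shape, rate) < p:  # grow the bracket until it contains p
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, rate) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha = 0.7  # shape = rate = alpha gives a mean rate of 1 (assumed convention)
q25 = gamma_quantile(0.25, alpha, alpha)
q75 = gamma_quantile(0.75, alpha, alpha)
print(f"0.25 quantile: {q25:.2f}")
print(f"0.75 quantile: {q75:.2f}")
print(f"ratio of quantiles: {q75 / q25:.1f}")
```

The ratio of the two quantiles is a lower bound on the contrast between the fast and slow quartiles, since every site in the fast quartile lies above the upper quantile and every site in the slow quartile lies below the lower one.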

Also, it is clear that much of this variation occurs within genes: in just about any gene, you expect to find both slow and fast sites (you just need to fit the gamma model on each gene separately and observe that the estimated value for the alpha parameter is, again, typically less than 1).

Now, why is there so much variation across sites in the overall rate of substitution? Mostly because of selection. There is some contribution from varying mutation rates. However, this is probably minor. For the most part, what explains rate variation across sites is simply that some sites are highly conserved (strong purifying selection), other sites are less conserved, and a minor fraction is under positive selection.

But then, if the selective pressure is so different at different sites, in terms of its overall strength, don’t we expect that it should also be different in terms of which amino acids are preferred at each site?

What I am suggesting here is that using a model that allows for rate variation across sites, but imposes the same amino-acid replacement matrix across all sites of a given gene — or, worse, across an entire multi-gene alignment — amounts to applying a double standard with respect to the consequences of varying selection along the sequence.

And yet this is what many current phylogenetic reconstruction models still do.
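The contrast drawn above can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than a model from this post: a hypothetical alphabet of 4 "amino acids", made-up preference profiles, and an F81-style matrix in which a substitution toward state b occurs at a rate proportional to the site's preference for b. Under a rates-across-sites model alone, slow and fast sites share the same stationary profile; with site-specific profiles, each site has its own.

```python
def f81_rate_matrix(profile, rate=1.0):
    """F81-style matrix: a substitution a -> b occurs at rate * profile[b]."""
    n = len(profile)
    q = [[rate * profile[b] if a != b else 0.0 for b in range(n)]
         for a in range(n)]
    for a in range(n):
        q[a][a] = -sum(q[a])  # each row of a rate matrix sums to zero
    return q

def flux(profile, q):
    """Net change profile * Q; all zeros means `profile` is stationary under q."""
    n = len(profile)
    return [sum(profile[a] * q[a][b] for a in range(n)) for b in range(n)]

# Made-up profiles over a hypothetical 4-letter alphabet.
shared = [0.25, 0.25, 0.25, 0.25]
buried_site = [0.70, 0.20, 0.05, 0.05]
exposed_site = [0.05, 0.05, 0.20, 0.70]

# Rate variation only: the matrix is rescaled, the preferences are not.
slow = f81_rate_matrix(shared, rate=0.18)
fast = f81_rate_matrix(shared, rate=1.38)

# Rate variation plus site-specific preferences.
site_a = f81_rate_matrix(buried_site, rate=0.18)
site_b = f81_rate_matrix(exposed_site, rate=1.38)

print("shared profile stationary at slow site:",
      all(abs(v) < 1e-12 for v in flux(shared, slow)))
print("shared profile stationary at fast site:",
      all(abs(v) < 1e-12 for v in flux(shared, fast)))
print("shared profile stationary at site_a:",
      all(abs(v) < 1e-12 for v in flux(shared, site_a)))
print("buried profile stationary at site_a:",
      all(abs(v) < 1e-12 for v in flux(buried_site, site_a)))
```

Rescaling a matrix by a site rate changes how fast a site evolves but not which states it prefers; only a site-specific profile changes the expected composition at that site.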


  1. I think this "double standard" of sorts exists for two reasons: computational feasibility, and how well the literature has shown that incorporating rate heterogeneity in analyses (i.e. model + gamma) results in more accurate trees. My concern, and I suspect the concern of many, is that current implementations of site-heterogeneous models are often not computationally feasible and have not really been shown to produce exceptionally accurate trees under a variety of situations.

    For example, both CAT models and other models have recovered ctenophores as the sister lineage to all other animals on a variety of published datasets. I think it is inherently problematic to hold up the instances where CAT models recovered sponges as the sister lineage to all other animals as evidence both that CAT models are accurate and that the other analyses fell prey to long-branch attraction (LBA) in their placement of ctenophores. Such an argument rests entirely on an assumption about animal phylogeny, which may very well not be accurate. The argument also ignores many analyses with CAT models that disagree with the sponges-sister hypothesis.

  2. Nicolas, I absolutely agree. Your point has been clearly correct for at least the last 20 years. But I also think there is a completely unjustifiable double standard in relation to rate and amino acid preference across sites versus over time. Phylogenetic applications are all about predicting rates and patterns of homoplasy, and current models are pathetic at it.