## Tuesday, November 29, 2016

### Site-specific selection and phylogenetic accuracy

In my previous post, I argued that classical amino-acid replacement matrices do not faithfully describe the long-term evolutionary process at typical positions of protein-coding sequences. More specifically, those matrices do not correctly capture the long-run restrictions that purifying selection imposes on the spectrum of amino-acids accepted at a typical site -- they anticipate too many amino-acids in the long run. In the words of Claus Wilke, empirical amino-acid replacement matrices are derived by pooling sites, but in the end, no specific, individual site behaves like a ‘pooled’ site.

Now, this has very important consequences for phylogenetic accuracy. To illustrate this point, here is a case presented by Andrew Roger in his phyloseminar, concerning the position of Microsporidia in the phylogeny of eukaryotes. This is based on a phylogenomic dataset of Brinkmann et al (2005), with 133 genes ($\sim$ 24000 aligned positions). I have reanalyzed this dataset using two models: one that uses the same amino-acid replacement matrix for all sites (LG), and another that models site-specific amino-acid equilibrium frequencies (CAT-GTR).

The tree reconstructed by the LG model gives Microsporidia sister group to all other eukaryotes:

In contrast, the tree obtained once you account for site-specific amino-acid preference in the long run (using CAT-GTR) shows Microsporidia sister-group to Fungi:

We used to believe that Microsporidia were ‘early-emerging’ eukaryotes, as suggested here by LG. The fact that they lack mitochondria was interpreted as a primitive trait: supposedly, the split between Microsporidia and the rest of eukaryotes would have pre-dated the endosymbiotic event leading to the acquisition of the mitochondrion by the so-called ‘crown’ eukaryotes. However, there are now quite a few lines of evidence suggesting that Microsporidia are in fact closely related to Fungi, as suggested here by CAT-GTR; also, their lack of mitochondria is probably due to a secondary loss (or, more accurately, a secondary simplification, since Microsporidia have hydrogenosome-like organelles that might well be the vestige of their mitochondria).

In fact, the tree obtained under the LG model does not just show Microsporidia sister to all other eukaryotes. It also gives a somewhat unusual phylogeny for those other eukaryotes, with Fungi branching first. However, if we ignore the rooting, this tree differs from the tree obtained by CAT-GTR by just moving Archaea from their branching point between unikonts and bikonts to a position where they are sister group to Microsporidia. Thus, here, it is not so much that Microsporidia ‘move down’ in the tree because they are attracted by the archaeal outgroup. Instead, it is the outgroup itself which ‘goes up’ in the tree, being attracted by Microsporidia. In any case, whichever way you decide to see it, this is basically a long branch attracted by another long branch -- thus, a paradigmatic case of a long-branch attraction artifact.

So, we have a clear example here, where accounting for site-specific amino-acid preference appears to result in a greater robustness against tree reconstruction errors. See also Lartillot et al (2007) for other examples, concerning the phylogenetic position of nematodes and platyhelminths.

Why is it that models accounting for site-specific amino-acid restrictions are more accurate? The main reason is relatively simple and can be summarized as follows.

Convergent substitutions (or, equivalently, reverting substitutions) represent the primary source of tree reconstruction errors. Therefore, in order to correctly discriminate between shared-derived characters (signal) and convergent or reverting substitutions (noise), it is particularly important to correctly model all of the factors that could cause a high rate of convergent substitution in real data.

Now, the probability of convergent evolution toward the same amino-acid at the same site, in two unrelated species separated by a large evolutionary distance, is roughly the inverse of the number of possible amino-acids at that site. Thus, a small number of acceptable amino-acids per site mechanically implies a high probability of convergent evolution.
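To make the "inverse of the number of possible amino-acids" rule concrete, here is a minimal sketch. It assumes that the two lineages are so distant that each one has independently reached the site's equilibrium distribution, so that the chance of both showing the same amino-acid is the sum of the squared equilibrium frequencies. The function names `convergence_probability` and `uniform_profile` are mine, introduced only for illustration.

```python
from fractions import Fraction


def convergence_probability(freqs):
    """Probability that two independent draws from `freqs` give the same amino-acid."""
    return sum(p * p for p in freqs)


def uniform_profile(k):
    """Equilibrium frequencies of a toy site accepting k amino-acids equally often."""
    return [Fraction(1, k)] * k


# A site accepting k amino-acids equally often converges with probability 1/k.
for k in (20, 5, 4, 2):
    print(k, convergence_probability(uniform_profile(k)))
# prints:
# 20 1/20
# 5 1/5
# 4 1/4
# 2 1/2

# A skewed profile over 3 amino-acids converges even more often than 1/3.
skewed = [Fraction(7, 10), Fraction(2, 10), Fraction(1, 10)]
print(convergence_probability(skewed))  # prints 27/50
```

With equal frequencies the sum of squares reduces exactly to 1/k, which is the rule of thumb stated above; with unequal frequencies, the probability is higher still.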

And thus, one can easily imagine that a model not correctly accounting for site-specific amino-acid restrictions, because it essentially assumes that all amino-acids are accepted at all sites in the long run, will automatically underestimate the probability of convergent evolution. Therefore, by mistaking noise for signal, it will tend to produce artifacts.

In contrast, a model explicitly accounting for site-specific amino-acid restrictions is in a better position to correctly estimate the probability of convergent substitution, and as a result, is better calibrated in terms of the signal/noise ratio. This is what makes such site-specific models more robust against tree reconstruction artifacts.

All of what I have seen thus far suggests that this effect due to site-specific amino-acid restrictions is quite strong and that it can have a significant impact on phylogenetic accuracy over deep evolutionary times. In a sense, this is not so surprising: the number of amino-acids typically accepted at a given site is on average of the order of 4 or 5, out of 20 amino-acids -- which means that there is room here for an underestimation of the probability of convergent evolution (and thus, an underestimation of the prevalence of noise) by a factor of 4 to 5. I will come back to this point in later posts, in order to make it more precise and more quantitative.
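As a rough check on this order of magnitude, the following toy calculation compares a site-specific description with a ‘pooled’ one. It is only a sketch under strong assumptions that are mine and not taken from any real dataset: each site accepts k amino-acids with equal frequencies, the accepted sets are sliding windows over an arbitrary ordering of the 20 amino-acids, and the pooled model simply averages the site profiles (real empirical matrices such as LG do not have uniform frequencies). All function names are hypothetical.

```python
from fractions import Fraction

N_AA = 20  # size of the amino-acid alphabet


def site_profile(start, k):
    """Toy site accepting k consecutive amino-acids (indices wrap around), equally often."""
    accepted = {(start + j) % N_AA for j in range(k)}
    return [Fraction(1, k) if a in accepted else Fraction(0) for a in range(N_AA)]


def convergence_probability(freqs):
    """Probability that two independent draws from `freqs` give the same amino-acid."""
    return sum(p * p for p in freqs)


def underestimation_factor(k):
    """Ratio of the site-specific to the pooled convergence probability."""
    sites = [site_profile(s, k) for s in range(N_AA)]
    # Site-specific view: compute the probability per site, then average over sites.
    site_specific = sum(convergence_probability(f) for f in sites) / len(sites)
    # Pooled view: average the profiles over sites first, then compute the probability.
    pooled = [sum(f[a] for f in sites) / len(sites) for a in range(N_AA)]
    return site_specific / convergence_probability(pooled)


for k in (4, 5):
    print(k, underestimation_factor(k))
# prints:
# 4 5
# 5 4
```

In this idealized setting the pooled profile is exactly uniform over the 20 amino-acids, so the factor is simply 20/k. Real data will deviate from this, but the sketch shows where the factor comes from: averaging the profiles before computing the convergence probability erases the site-specific restrictions.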

In any case, this connection between site-specific amino-acid restrictions and the rate of convergent evolution captures one of the most fundamental limitations of the simple amino-acid replacement models that are still currently used in phylogenomics. It compounds the problem we have seen earlier: imposing a single amino-acid replacement matrix on a heterogeneous collection of amino-acid positions, each with its own selective requirements.

--

Brinkmann, H., van der Giezen, M., Zhou, Y., Poncelin de Raucourt, G., & Philippe, H. (2005). An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Systematic Biology, 54(5), 743–757. http://doi.org/10.1080/10635150500234609

Lartillot, N., Brinkmann, H., & Philippe, H. (2007). Suppression of long-branch attraction artifacts in the animal phylogeny using a site-heterogeneous model. BMC Evolutionary Biology, 7 Suppl 1, S4. http://doi.org/10.1186/1471-2148-7-S1-S4