Thursday, December 15, 2016

Site-specific selection and evolutionary distances

Coming back the Microsporidia example, the LG model does not just give an artifactual position for Microsporidia. It also gives an estimate for the total tree length of about 10.3 amino-acid changes per site. On the same dataset, CAT-GTR, which accounts for site-specific amino-acid preferences, infers a total tree length of 18.6. 

And this makes total sense. In fact, the problem was very clearly identified more than 18 years ago, by Halpern and Bruno (1998, from the abstract):

Estimation of evolutionary distances from coding sequences must take into account protein-level selection to avoid relative underestimation of longer evolutionary distances. Current modeling of selection via site-to-site rate heterogeneity generally neglects another aspect of selection, namely position-specific amino acid frequencies. These frequencies determine the maximum dissimilarity expected for highly diverged but functionally and structurally conserved sequences, and hence are crucial for estimating long distances. 
Thus, nothing new — although I find it instructive to measure this effect more quantitatively. Here, if CAT-GTR is correct, then this means that, by ignoring site-specific amino-acid preferences, we are missing more than one third of all amino-acid substitutions. The same pattern is obtained (also with a difference of about 30 to 40%) on recent phylogenomic datasets at the scale of metazoans.

Incidentally, it is also interesting to note that, in spite of the importance of the problem raised by Halpern and Bruno, the ’current models’ of 1998 are still more or less the default option in many phylogenetic reconstruction programs.


Halpern, A. L., & Bruno, W. J. (1998). Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Molecular Biology and Evolution, 15(7), 910–917.

Tuesday, December 13, 2016

Second-order amino-acid replacement processes

As I mentioned earlier, classical amino-acid replacement matrices indirectly encode site-specific amino-acid preferences in their first-order Markov dependencies, and this is optimal in a low saturation regime. Now, this suggests that we could derive second-order or k-th order Markov processes that would generalize this idea and capture more information about site-specific selection, in particular about the longer-term behavior. How would we do that?

Let us consider the second-order case. Assume that, at time $t$, the current amino-acid at a given site is $b$, and the previous amino-acid state at that site, before the last substitution event, was $a$. We can then characterize the current state of the process by the $(a,b)$ pair of amino-acids.

Then, we can define a Markov process directly on the pairs $(a,b)$, with $a \neq b$, and such that transitions are allowed only between compatible pairs — i.e. such that the current state before the event becomes the previous state after the event. Thus, for instance:

$Q_{(a,b) \to (b,c)}$

is the rate from $(a,b)$ to $(b,c)$, i.e. the rate of substitution from $b$ to $c$, given that the previous amino-acid state (before $b$) was $a$.

All other rates are set equal to 0, i.e.

$Q_{(a,b) \to (c,d)} = 0$

whenever $b \neq c$.

This defines a second order 380x380 amino-acid replacement matrix, which can be exponentiated and then used for likelihood computation. Note that the instant rate matrix Q defined above is sparse, however, this will not be the case for the exponentiated matrix $P = e^{tQ}$.

Concerning the pruning algorithm, the only modification, compared to the first-order version classically implemented, concerns the initialization of the recursion at the tips. If the observed state for a given taxon is b, thern the conditional likelihood should be set equal to 1 for all pairs $(a,b)$.

We can see that, compared to first-order matrices, this second-order model will be in a better position to implement the empirically observed tendency to make repeated substitution events among overlapping subsets of amino-acids — basically, by estimating high rates of transition of the following type:

$Q_{(E,D) \to (D,E)}$
$Q_{(N,D) \to (D,N)}$

The approach can in principle be generalized to k-th order processes, by defining transition rates between $(x_1,x_2,…x_k)$ and $(x_2,x_3…x_{k+1})$.

Computationally speaking, this will not be super-efficient. The pruning algorithm is quadratic in the number of states, and thus, even for the second-order case, we already end up with a nearly 400-fold increase in computational time.

Still, I could imagine that, compared to simple 20x20 amino-acid replacement matrices, the second-order version could already make quite a difference in terms of phylogenetic accuracy. One advantage of this approach is that it is really a classical parametric likelihood model, without any complicated issue about modeling random-effects across sites, from an unknown, and potentially complex, distribution.

Conceptually speaking, this model also illustrates a fundamental idea about the relation between processes and parameters. Essentially, the parameterization of a model does not necessarily correspond to the actual mechanism. Instead, it already incorporates a level of statistical thinking, about the identifiable signal induced by the process over a large sample of observations. Usually, this signal can be decomposed in terms of frequencies over successive moments — and those frequencies are then captured by a parametric model, in terms of either proportions or rates.

Wednesday, December 7, 2016

The effective number of amino-acids per site

As we have seen, a fundamental consequence of site-specific selective constraints is that the number of amino-acids allowed at each site is typically much smaller than 20. This in turn results in a purely combinatorial effect: few amino-acids per site means that convergent substitution events are more frequent than what is anticipated by a model assuming that all 20 amino-acids are equally acceptable at each site — with potentially important consequences on phylogenetic accuracy.

Here is a more quantitative version of this argument.

First, imagine a position that can accept only $K$ amino-acids, but accepts each of them with equal frequency $1/K$. Then, if this position is sufficiently fast evolving, the probability of observing the same amino-acid at that position in two distantly related species (i.e. the long-term probability of convergent evolution) is just $p_c = 1/K$. Or, equivalently, $K = p_c^{-1}$.

Now, consider an arbitrary amino-acid replacement process, with equilibrium frequency vector $(\pi_k)_{k=1..20}$ over the 20 amino-acids. The probability of sampling twice the same amino-acid by chance from this equilibrium frequency vector is:

$p_c \, = \, \sum_{k=1}^{20} \, \pi_k^2$.

By analogy with the simple case considered above, we can define the effective number of amino-acids (or the effective size of the state space) implied by this amino-acid replacement process as:

$K_e \, = \, p_c^{-1} = \, \left( \sum_{k=1}^{20} \, \pi_k^2 \right) ^{-1}$

Note that this definition of $K_e$ is different from the one often used (e.g. Pollock et al, 2006, Echave and Wilke, 2016), which is instead related to Shannon entropy. The present definition is, I think, more adequate in the present context, as it more directly captures the fundamental relation between amino-acid restrictions and probability of convergent evolution ($p_c = 1/K_e$). The two definitions give qualitatively similar results in practice.

We can calculate the effective number of amino-acids at each site, based on the site-specific amino-acid equilibrium frequency profiles estimated by the CAT-GTR model. Here is the distribution of $K_e$ across the entire alignment in the case of Microsporidia:

The average across sites is $<K_e> =7.22$, and the first 25% quantile is 4.1. In other words, according to CAT-GTR, 25% of the sites effectively visit less than 4 distinct amino-acids in the long-run. Importantly, the mean substitution rate for those sites is such that they make on average 16.5 substitution events across the whole tree, and 25% of them make more than 20 substitutions repeatedly over at most 4 amino-acids. Thus, sites with a small $K_e$ are not particularly slowly evolving, and many of them are even highly saturated.

In contrast, the bar shown in red on this plot represents the effective number of amino-acids implied by the use of single amino-acid replacement matrix for all sites (i.e. calculated on the empirical frequencies over the whole alignment). This effective number is equal to 16.1, which is more than twice as much as the mean across sites estimated by CAT-GTR, and 4 times as large as the effective number for the 25% most restricted positions of the alignment. 

Also, the tail-area under the blue curve, to the right of this red bar, is only about 0.2% — which clearly illustrates that the amino-acid replacement process estimated by pooling sites is definitely not typical of what happens at most sites.

Coming back to the question of phylogenetic accuracy, all this means that using a single amino-acid replacement matrix for all sites leads to underestimating the long-run probability of convergent substitution for 99.8% of all sites, by a factor 2 on average, and by a factor at least 4 for 25% of the positions — clearly not a well-calibrated model, in terms of its estimation of the phylogenetic noise.


Pollock, D. D., Thiltgen, G., & Goldstein, R. A. (2012). Amino acid coevolution induces an evolutionary Stokes shift. Proceedings of the National Academy of Sciences of the United States of America, 109(21), E1352–9.

Echave, J. and Wilke, C. O. (2016). Biophysical models of protein evolution: understanding the patterns of evolutionary sequence divergence.

Tuesday, November 29, 2016

Site-specific selection and phylogenetic accuracy

In my last post, I argued that classical amino-acid replacement matrices do not faithfully describe the long-term evolutionary process at typical positions of protein-coding sequences. More specifically, those matrices do not correctly capture the long-run restrictions induced by puriyfing selection on the spectrum of amino-acids that are accepted at a typical site -- they anticipate too many amino-acids in the long run. In the words of Claus Wilke, empirical amino-acid replacement matrices are derived by pooling sites, but in the end, no specific, individual site behaves like a ‘pooled’ site.

Now, this has very important consequences on phylogenetic accuracy. To illustrate this point, here is a case presented by Andrew Roger in his phyloseminar, concerning the position of Microsporidia in the phylogeny of eukaryotes. This is based on a phylogenomic dataset of Brinkmann et al (2005), with 133 genes ($\sim$ 24000 aligned positions). I have reanalyzed this dataset, using two models, one that uses the same amino-acid replacement matrix for all sites (LG), and another one that models site-specific amino-acid equilibrium frequencies (CAT-GTR).

The tree reconstructed by the LG model gives Microsporidia sister group to all other eukaryotes:

In contrast, the tree obtained once you account for site-specific amino-acid preference in the long run (using CAT-GTR) shows Microsporidia sister-group to Fungi:

We used to believe that Microsporidia were ‘early-emerging’ eukaryotes, as suggested here by LG. The fact that they lack mitochondria was interpreted as a primitive trait: supposedly, the split between Microsporidia and the rest of eukaryotes would have pre-dated the endosymbiotic event leading to the acquisition of the mitochondrion by the so-called ‘crown’ eukaryotes. However, there are now quite a few lines of evidence suggesting that Microsporidia are in fact closely related to Fungi, as suggested here by CAT-GTR; also, their lack of mitochondria is probably due to a secondary loss (or, more accurately, a secondary simplification, since Microsporidia have hydrogenosome-like organelles that might well be the vestige of their mitochondria).

In fact, the tree obtained under the LG model does not just show Microsporida sister to all other eukaryotes. It also gives a somewhat unusual phylogeny for those other eukaryotes, with Fungi branching first. However, if we ignore the rooting, this tree differs from the tree obtained by CAT-GTR by just moving Archaea from their branching point between unikonts and bikonts to a position where they are sister group to Microsporidia. Thus, here, it is not so much that Microsporidia ‘move down’ in the tree because they are attracted by the archaean outgroup. Instead, it is the outgroup itself which ‘goes up’ in the tree, being attracted by Microsporidia. In any case, whichever way you decide to see it, this is basically a long branch attracted by another long branch -- thus, a paradigmatic case of a long-branch attraction artifact.

So, we have a clear example here, where accounting for site-specific amino-acid preference appears to result in a greater robustness against tree reconstruction errors. See also Lartillot et al (2007) for other examples, concerning the phylogenetic position of nematodes and platyhelminths.

Why is it that models accounting for site-specific amino-acid restrictions are more accurate? The main reason is relatively simple and can be summarized as follows. 

Convergent substitutions (or, equivalently, reverting substitutions) represent the primary source of tree reconstruction errors. Therefore, in order to correctly discriminate between shared-derived characters (signal) and convergent or reverting substitutions (noise), it is particularly important to correctly model all of the factors that could cause a high rate of convergent substitution in real data.

Now, the probability of convergent evolution toward the same amino-acid at the same site, in two unrelated species separated by a large evolutionary distance, is roughly the inverse of the number of possible amino-acids at that site. Thus, a small number of acceptable amino-acids per site mechanically implies a high probability of convergent evolution.

And thus, one can easily imagine that a model not correctly accounting for site-specific amino-acid restrictions, because it essentially assumes that all amino-acids are accepted at all sites in the long run, will automatically underestimate the probability of convergent evolution. Therefore, by mistaking noise for signal, it will tend to produce artifacts.

In contrast, a model explicitly accounting for site-specific amino-acid restrictions is in a better position to correctly estimate the probability of convergent substitution, and as a result, is better calibrated in terms of the signal/noise ratio. This is what make such site-specific models more robust against tree reconstruction artifacts.

All of what I have seen thus far suggests that this effect due to site-specific amino-acid restrictions is quite strong and that it can have a significant impact on phylogenetic accuracy over deep evolutionary times. In a sense, this is not so surprising: the number of amino-acids typically accepted at a given site is on average of the order of 4 or 5, out of 20 amino-acids -- which means that there is room here for an underestimation of the probability of convergent evolution (and thus, an underestimation of the prevalence of noise) by a factor of 4. I will come back to this point in later posts, in order to make it more precise and more quantitative.

In any case, this connection between site-specific amino-acid restrictions and the rate of convergent evolution, combined with what we have seen earlier concerning the problems inherent to the idea of imposing a single amino-acid replacement matrix to a heterogeneous collection of amino-acid positions, each with its own selective requirements, captures one of the most fundamental limitations of the simple amino-acid replacement models that are still currently used in phylogenomics.


Henner Brinkmann, Mark van der Giezen, Yan Zhou, Gaëtan Poncelin de Raucourt, and Hervé Philippe (2006). An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics., 54(5), 743–757.

Lartillot, N., Brinkmann, H., & Philippe, H. (2007). Suppression of long-branch attraction artifacts in the animal phylogeny using a site-heterogeneous model. BMC Evolutionary Biology, 7 Suppl 1, S4.

Friday, November 25, 2016

Double standard (2)

Just a side remark: in my last post, I pointed out that classical amino-acid replacement matrices implicitly encode site-specific selection in the first-order Markov dependencies of the process, and that doing so is optimal in situations of low sequence divergence (more specifically, when the fractions of sites that have more than one substitution event over the whole tree is negligible).

But then, in this situation, there is also no information that can be gained about rate variation across sites. In this situation, sites have either 0 or 1 substitution event, and the only meaningful statistic that we have is the fraction of sites with one substitution — which gives us an estimate of the mean rate of substitution across sites, but otherwise, does not tell us anything about the variance (the shape parameter alpha).

An analogy can be made here with an imaginary coin-tossing experiment, in which you would make each toss with a different coin, all of which have potentially different biases (different probabilities of giving a head). Thus, there is an unknown distribution, over coins, of the probability of getting head.

In this situation, with only one draw per coin, your fraction of heads is equal to the average probability of getting head over all coins. However, this is the only information that you can get from this experiment. You need to make at least 2 tosses per coin in order to have information about the variance of the head probability over coins. An increasing number of tosses per coin give you information about an increasing number of the moments of the distribution.

In any case, coming back to sequence evolution, the main point I wanted to make is this: empirical amino-acid replacement matrices are the perfect solution in those situations where you also cannot meaningfully estimate the variance in rates across sites. Conversely, as soon as you have sufficient sequence divergence, giving you empirical information about rate-heterogeneity, then you also have useful information concerning pattern-heterogeneity — which will not be captured by a single amino-acid replacement matrix.

On a more general note, making the parallel between rate- and pattern-heterogeneity is, I think, insightful.

Also, to address one of the comments: of course, there are computational issues behind our potential methodological inconsistencies. But this does not prevent us from pointing them out and raising a little sense of discomfort about what we easily take for granted in our daily routine. That's the only way we can make any progress.

Wednesday, November 23, 2016

Long-term site-specific selective constraints

There is a catch in my argument about the double standards of some of the current phylogenetic models, concerning rate- versus pattern-heterogeneity across sites. It is unfair to suggest that a model does not capture the consequences of varying selection along the sequence just because it uses the same amino-acid replacement matrix across all sites. In fact, empirical amino-acid replacement matrices do capture the effects of site-specific selection, but it is just that they do it implicitly and globally -- simply by proposing higher exchange rates among biochemically similar amino-acids.

If we look for instance at the WAG matrix:

we see that this matrix gives particularly high exchange rates between I and V and E and D. The way those exchange rates capture site-specific selection is implicit, the logic being as follows: given that a site is currently is state D, then, probably, this site is under (site-specific) selection for a negatively charged amino-acid, and therefore, we expect that the next substitution at that site will be an E with high probability.

So, classical amino-acid replacement matrices do model site-specific selection. Technically, they encode it in the first-order Markov dependencies of the amino-acid replacement process. And in fact, this approach is optimal in non-saturated situations, where each site of the alignment typically has 0 or 1 substitution event over the whole tree, but rarely more than that. In this situation, there is no hope to get more information than what is already captured in this first-order Markov structure.

However, things are quite different in a deep evolutionary context, where each site makes potentially many substitution events. Here is an example (taken from an empirical alignment at the scale of metazoans):

We see that this site is apparently under strong selective constraint to accept negatively charged amino-acids, but otherwise, can make repeated substitutions between D and E (and sometimes visit other amino-acids). In other words, this site (and many other sites) tends to be confined, in the long-run, within a very restricted subset of all possible amino-acids.

Are such site patterns easily predicted (and reproduced) by classical empirical matrices? To check this, we could for instance simulate repeated substitution histories for a given site that would start on an aspartic acid (D), and then tabulate the frequencies at which the 20 possible amino-acids are observed after 1, 2, ... k substitutions. Here is what you get with the WAG model for a site starting on an aspartic acid (D) or on a lysine (K), and over the first 5 substitution events:

As you can see, the substitution process implied by the WAG matrix does not maintain itself for a very long time within the restricted orbit of the two negatively charged amino-acids D and E (same thing for the positively charged amino-acids K and R). Instead, a typical site very quickly looses memory of the selective constraint it is supposed to stay under in the long run, such that after as few as 4 substitution events, it ends up sampling amino-acids according to proteome-wide average frequencies.

Why doesn't that work? There are at least two reasons. The first is that, as you can check from the matrix of relative exchange rates shown above, there is a high exchange rate, not just between D and E, but also between D and N. Then, from N, there is a high rate to S, etc. So, basically, what happens is that, in reality, you have some sites that make repeated substitutions between D and E, other sites between D and N, and still many other sites visiting other different but overlapping amino-acid orbits. In terms of selection pressure, this betrays the fact that the same amino-acid can be selected at different sites for different reasons: D and E are both negatively charged, but on the other hand, D and N have similar geometries for their side chain.

In this situation, when you estimate an average amino-acid replacement matrix simultaneously over all of those sites, you end up with significant exchange rates between D and a broad set of alternative amino-acids: E, N, but also G and H -- which then makes it very difficult for this matrix to explain the long-term confinement that can be seen at typical sites under long-term purifying selection.

The second reason is more conceptual and connected to the following question: what should be encoded at the level of the transient regime of the process, versus what should be captured by its stationary regime?

The site shown above suggests that the selective constraint experienced by a typical site under strong purifying selection is stable in the long run. This may not always be true, but this is generally the case. This might in fact be even more pronounced in the context of phylogenetic analyses, where there is a selection bias in favor of proteins that have a highly conserved structure (this is what makes the alignment possible at such a large evolutionary scale in the first place). And thus, sites of those proteins tend to be in a conserved biochemical environment over very long evolutionary times.

If selection is stable in the long run, then it would perhaps make more sense to encode it in the stationary regime of the substitution process at a given site. Instead, the relative exchange rates of an amino-acid replacement matrix encode the transient regime of the substitution process, whereas the stationary regime is entirely determined by the equilibrium frequencies, which are basically equal to the observed amino-acid frequencies over the entire alignment. And what the figure above indicates is that the transient regime encoded by a typical amino-acid replacement matrix is, indeed, very transient.

This argument suggests that site-specific amino-acid preferences should perhaps be encoded at the level of the equilibrium frequency vector over the 20 amino-acids. But then this means that the equiibrium frequencies of the substitution process should be site-speciifc. This idea was first proposed, and explored, by Halpern and Bruno, back in 1998.

Tuesday, November 22, 2016

Double standard

Nobody today would publish a phylogeny that has been inferred using a model that does not account for rate variation across sites, at least not a phylogeny over a large evolutionary scale (e.g. animals, eukaryotes, etc).

And indeed, using such rates-across-sites models, we usually do uncover substantial rate variation in empirical sequence alignments: the shape parameter of the gamma distribution (alpha) is typically less than 1, of the order of 0.7. Below is a plot of the density of the gamma for alpha=0.7. The 0.25 and 0.75 quantiles are 0.18 and 1.38 respectively. Thus, the 25% fastest evolving sites are evolving about 10 times faster than the 25% slowest sites.

Also, it is clear that much of this variation occurs within genes: in about any gene, you expect to find both slow and fast sites (you just need to fit the gamma model on each gene separately and observe that the estimated value for the alpha parameter is, again, typically less than 1).

Now, why is there so much variation across sites in the overall rate of substitution? Mostly because of selection. There is some contribution from varying mutation rates. However, this is probably minor. For the most part, what explains rate variation across sites is simply that some sites are highly conserved (strong purifying selection), other sites are less conserved, and a minor fraction is under positive selection.

But then, if the selective pressure is so different at different sites, in terms of its overall strength, don’t we expect that it should also be different in terms of which types amino-acids are preferred at each site?

What I am suggesting here is that using a model that allows for rate variation across sites, but imposes the same amino-acid replacement matrix across all sites of a given gene — or, worse, across an entire multi-gene alignment — amounts to applying a double standard with respect to the consequences of varying selection along the sequence.

And yet this is what many current phylogenetic reconstruction models still do.

Monday, November 21, 2016

Ctenophores and the role of models in phylogenetics

I gave a phyloseminar last week, on the problem of dealing with patter-heterogeneity across sites in phylogenetic inference. I find this idea of phyloseminars really neat. Many thanks (and kudos) to Erick Matsen, for organizing all this. There is now quite a rich collection of phyloseminars given by many people and on many subjects in evolutionary sciences, which represents a such a valuable resource on the web. Here is the link:

To put things in context, I have been caught in the middle of a controversy about the phylogeny of animals — more specifically, but not only, concerning the phylogenetic position of ctenophores (comb jellies).  As you may know, ctenophores have been recently proposed as being the sister-group to all other animals, including sponges (Moroz et al, 2014, Ryan et al, 2014). This ctenophore-sister hypothesis is not supported, however, when you use more sophisticated models accounting for variation across sites in the amino-acid replacement process (the CAT and CAT-GTR models, implemented in PhyloBayes). Instead, these two models give sponges sister to all other animals, as is more traditionally believed (although never with really strong support, and in a way that depends on the exact choice for the outgroup). A detailed analysis is given in Pisani et al (2015).

My interpretation of all this (which is also the interpretation suggested in Pisani et al, 2015) is that ctenophore-sister is fundamentally a long-branch attraction artifact caused by the use of inadequate models of sequence evolution. But then, not everyone agree with this interpretation (e.g. Halanych et al, 2016). And progressively, this controversy about ctenophores is becoming a question about which models — and which software programs —  should be used in the context of large-scale phylogenetic reconstructions. The phyloseminar I gave last week is in fact the last of a series of three (the other two were given by Andrew Roger and Nathan Whelan), series which can be seen as organizing a debate about the role of site-heterogeneity in phylogenetic inference. 

In any case, all this raises very important questions about the current situation in the field of phylogenetic studies. Are current models and methods adequate, in particular, for reconstrcuting ‘deep’ phylogenies (over deep evolutionary times)? Where do we stand today, and what can we hope, in terms of methodological progress, for the next years on this front?

In the next posts I will publish on this blog, I will come back to some of those questions. So, bear with me!..


Halanych, K. M., Whelan, N. V., Kocot, K. M., Kohn, A. B., & Moroz, L. L. (2016). Miscues misplace sponges. Proceedings of the National Academy of Sciences, 113(8), E946–7.

Moroz, L. L., Kocot, K. M., Citarella, M. R., Dosung, S., Norekian, T. P., Povolotskaya, I. S., et al. (2014). The ctenophore genome and the evolutionary origins of neural systems. Nature, 510(7503), 109–114.

Pisani, D., Pett, W., Dohrmann, M., Feuda, R., Rota-Stabelli, O., Philippe, H., et al. (2015). Genomic data do not support comb jellies as the sister group to all other animals. Proceedings of the National Academy of Sciences, 112(50), 15402–15407.

Ryan, J. F., Pang, K., Schnitzler, C. E., Nguyen, A.-D., Moreland, R. T., Simmons, D. K., et al. (2013). The genome of the ctenophore Mnemiopsis leidyi and its implications for cell type evolution. Science, 342(6164), 1242592.


This blog has been silent for such a long time (two years…). But there are quite a few things I would like to share with you, concerning various current developments in phylogenetics and in Bayesian inference. So, let start it up again.