Thursday, December 15, 2016

Site-specific selection and evolutionary distances

Coming back the Microsporidia example, the LG model does not just give an artifactual position for Microsporidia. It also gives an estimate for the total tree length of about 10.3 amino-acid changes per site. On the same dataset, CAT-GTR, which accounts for site-specific amino-acid preferences, infers a total tree length of 18.6. 

And this makes total sense. In fact, the problem was very clearly identified more than 18 years ago, by Halpern and Bruno (1998, from the abstract):

Estimation of evolutionary distances from coding sequences must take into account protein-level selection to avoid relative underestimation of longer evolutionary distances. Current modeling of selection via site-to-site rate heterogeneity generally neglects another aspect of selection, namely position-specific amino acid frequencies. These frequencies determine the maximum dissimilarity expected for highly diverged but functionally and structurally conserved sequences, and hence are crucial for estimating long distances. 
Thus, nothing new — although I find it instructive to measure this effect more quantitatively. Here, if CAT-GTR is correct, then this means that, by ignoring site-specific amino-acid preferences, we are missing more than one third of all amino-acid substitutions. The same pattern is obtained (also with a difference of about 30 to 40%) on recent phylogenomic datasets at the scale of metazoans.

Incidentally, it is also interesting to note that, in spite of the importance of the problem raised by Halpern and Bruno, the ’current models’ of 1998 are still more or less the default option in many phylogenetic reconstruction programs.


Halpern, A. L., & Bruno, W. J. (1998). Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Molecular Biology and Evolution, 15(7), 910–917.

Tuesday, December 13, 2016

Second-order amino-acid replacement processes

As I mentioned earlier, classical amino-acid replacement matrices indirectly encode site-specific amino-acid preferences in their first-order Markov dependencies, and this is optimal in a low saturation regime. Now, this suggests that we could derive second-order or k-th order Markov processes that would generalize this idea and capture more information about site-specific selection, in particular about the longer-term behavior. How would we do that?

Let us consider the second-order case. Assume that, at time $t$, the current amino-acid at a given site is $b$, and the previous amino-acid state at that site, before the last substitution event, was $a$. We can then characterize the current state of the process by the $(a,b)$ pair of amino-acids.

Then, we can define a Markov process directly on the pairs $(a,b)$, with $a \neq b$, and such that transitions are allowed only between compatible pairs — i.e. such that the current state before the event becomes the previous state after the event. Thus, for instance:

$Q_{(a,b) \to (b,c)}$

is the rate from $(a,b)$ to $(b,c)$, i.e. the rate of substitution from $b$ to $c$, given that the previous amino-acid state (before $b$) was $a$.

All other rates are set equal to 0, i.e.

$Q_{(a,b) \to (c,d)} = 0$

whenever $b \neq c$.

This defines a second order 380x380 amino-acid replacement matrix, which can be exponentiated and then used for likelihood computation. Note that the instant rate matrix Q defined above is sparse, however, this will not be the case for the exponentiated matrix $P = e^{tQ}$.

Concerning the pruning algorithm, the only modification, compared to the first-order version classically implemented, concerns the initialization of the recursion at the tips. If the observed state for a given taxon is b, thern the conditional likelihood should be set equal to 1 for all pairs $(a,b)$.

We can see that, compared to first-order matrices, this second-order model will be in a better position to implement the empirically observed tendency to make repeated substitution events among overlapping subsets of amino-acids — basically, by estimating high rates of transition of the following type:

$Q_{(E,D) \to (D,E)}$
$Q_{(N,D) \to (D,N)}$

The approach can in principle be generalized to k-th order processes, by defining transition rates between $(x_1,x_2,…x_k)$ and $(x_2,x_3…x_{k+1})$.

Computationally speaking, this will not be super-efficient. The pruning algorithm is quadratic in the number of states, and thus, even for the second-order case, we already end up with a nearly 400-fold increase in computational time.

Still, I could imagine that, compared to simple 20x20 amino-acid replacement matrices, the second-order version could already make quite a difference in terms of phylogenetic accuracy. One advantage of this approach is that it is really a classical parametric likelihood model, without any complicated issue about modeling random-effects across sites, from an unknown, and potentially complex, distribution.

Conceptually speaking, this model also illustrates a fundamental idea about the relation between processes and parameters. Essentially, the parameterization of a model does not necessarily correspond to the actual mechanism. Instead, it already incorporates a level of statistical thinking, about the identifiable signal induced by the process over a large sample of observations. Usually, this signal can be decomposed in terms of frequencies over successive moments — and those frequencies are then captured by a parametric model, in terms of either proportions or rates.

Wednesday, December 7, 2016

The effective number of amino-acids per site

As we have seen, a fundamental consequence of site-specific selective constraints is that the number of amino-acids allowed at each site is typically much smaller than 20. This in turn results in a purely combinatorial effect: few amino-acids per site means that convergent substitution events are more frequent than what is anticipated by a model assuming that all 20 amino-acids are equally acceptable at each site — with potentially important consequences on phylogenetic accuracy.

Here is a more quantitative version of this argument.

First, imagine a position that can accept only $K$ amino-acids, but accepts each of them with equal frequency $1/K$. Then, if this position is sufficiently fast evolving, the probability of observing the same amino-acid at that position in two distantly related species (i.e. the long-term probability of convergent evolution) is just $p_c = 1/K$. Or, equivalently, $K = p_c^{-1}$.

Now, consider an arbitrary amino-acid replacement process, with equilibrium frequency vector $(\pi_k)_{k=1..20}$ over the 20 amino-acids. The probability of sampling twice the same amino-acid by chance from this equilibrium frequency vector is:

$p_c \, = \, \sum_{k=1}^{20} \, \pi_k^2$.

By analogy with the simple case considered above, we can define the effective number of amino-acids (or the effective size of the state space) implied by this amino-acid replacement process as:

$K_e \, = \, p_c^{-1} = \, \left( \sum_{k=1}^{20} \, \pi_k^2 \right) ^{-1}$

Note that this definition of $K_e$ is different from the one often used (e.g. Pollock et al, 2006, Echave and Wilke, 2016), which is instead related to Shannon entropy. The present definition is, I think, more adequate in the present context, as it more directly captures the fundamental relation between amino-acid restrictions and probability of convergent evolution ($p_c = 1/K_e$). The two definitions give qualitatively similar results in practice.

We can calculate the effective number of amino-acids at each site, based on the site-specific amino-acid equilibrium frequency profiles estimated by the CAT-GTR model. Here is the distribution of $K_e$ across the entire alignment in the case of Microsporidia:

The average across sites is $<K_e> =7.22$, and the first 25% quantile is 4.1. In other words, according to CAT-GTR, 25% of the sites effectively visit less than 4 distinct amino-acids in the long-run. Importantly, the mean substitution rate for those sites is such that they make on average 16.5 substitution events across the whole tree, and 25% of them make more than 20 substitutions repeatedly over at most 4 amino-acids. Thus, sites with a small $K_e$ are not particularly slowly evolving, and many of them are even highly saturated.

In contrast, the bar shown in red on this plot represents the effective number of amino-acids implied by the use of single amino-acid replacement matrix for all sites (i.e. calculated on the empirical frequencies over the whole alignment). This effective number is equal to 16.1, which is more than twice as much as the mean across sites estimated by CAT-GTR, and 4 times as large as the effective number for the 25% most restricted positions of the alignment. 

Also, the tail-area under the blue curve, to the right of this red bar, is only about 0.2% — which clearly illustrates that the amino-acid replacement process estimated by pooling sites is definitely not typical of what happens at most sites.

Coming back to the question of phylogenetic accuracy, all this means that using a single amino-acid replacement matrix for all sites leads to underestimating the long-run probability of convergent substitution for 99.8% of all sites, by a factor 2 on average, and by a factor at least 4 for 25% of the positions — clearly not a well-calibrated model, in terms of its estimation of the phylogenetic noise.


Pollock, D. D., Thiltgen, G., & Goldstein, R. A. (2012). Amino acid coevolution induces an evolutionary Stokes shift. Proceedings of the National Academy of Sciences of the United States of America, 109(21), E1352–9.

Echave, J. and Wilke, C. O. (2016). Biophysical models of protein evolution: understanding the patterns of evolutionary sequence divergence.