The Bayesian kitchen: Long-term site-specific selective constraints

There is a catch in my argument about the double standards of some of the current phylogenetic models, concerning rate- versus pattern-heterogeneity across sites. It is unfair to suggest that a model does not capture the consequences of varying selection along the sequence just because it uses the same amino-acid replacement matrix across all sites. In fact, empirical amino-acid replacement matrices do capture the effects of site-specific selection, but it is just that they do it implicitly and globally -- simply by proposing higher exchange rates among biochemically similar amino-acids.

If we look for instance at the WAG matrix:

we see that this matrix gives particularly high exchange rates between I and V, and between E and D. The way those exchange rates capture site-specific selection is implicit, the logic being as follows: given that a site is currently in state D, this site is probably under (site-specific) selection for a negatively charged amino-acid, and we therefore expect that the next substitution at that site will be to E with high probability.

So, classical amino-acid replacement matrices do model site-specific selection. Technically, they encode it in the first-order Markov dependencies of the amino-acid replacement process. And in fact, this approach is optimal in non-saturated situations, where each site of the alignment typically undergoes 0 or 1 substitution event over the whole tree, but rarely more than that. In this situation, there is no hope of getting more information than what is already captured in this first-order Markov structure.

However, things are quite different in a deep evolutionary context, where each site potentially undergoes many substitution events. Here is an example (taken from an empirical alignment at the scale of metazoans):

We see that this site is apparently under strong selective constraint to accept negatively charged amino-acids, but otherwise, can make repeated substitutions between D and E (and sometimes visit other amino-acids). In other words, this site (and many other sites) tends to be confined, in the long-run, within a very restricted subset of all possible amino-acids.

Are such site patterns easily predicted (and reproduced) by classical empirical matrices? To check this, we could for instance simulate repeated substitution histories for a given site that would start on an aspartic acid (D), and then tabulate the frequencies at which the 20 possible amino-acids are observed after 1, 2, ... k substitutions. Here is what you get with the WAG model for a site starting on an aspartic acid (D) or on a lysine (K), and over the first 5 substitution events:

As you can see, the substitution process implied by the WAG matrix does not maintain itself for a very long time within the restricted orbit of the two negatively charged amino-acids D and E (same thing for the positively charged amino-acids K and R). Instead, a typical site very quickly loses memory of the selective constraint it is supposed to stay under in the long run, such that after as few as 4 substitution events, it ends up sampling amino-acids according to proteome-wide average frequencies.
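For readers who want to redo this kind of experiment, here is a minimal Python sketch. Rather than simulating histories, it computes the tabulated frequencies exactly, by stepping through the embedded jump chain of the substitution process, assuming the usual parameterization in which the rate from i to j is the relative exchange rate r_ij times the equilibrium frequency of j. The 4-letter alphabet, the exchange rates and the uniform frequencies are made-up illustrative values, not the WAG estimates, and the function names are mine.

```python
import numpy as np

# Toy 4-letter alphabet. These numbers are illustrative assumptions, NOT WAG.
STATES = ["D", "E", "N", "S"]
# Symmetric relative exchange rates r[i, j] (hypothetical values).
R = np.array([
    [0.0, 6.0, 5.0, 1.0],
    [6.0, 0.0, 1.0, 1.0],
    [5.0, 1.0, 0.0, 4.0],
    [1.0, 1.0, 4.0, 0.0],
])
PI = np.array([0.25, 0.25, 0.25, 0.25])  # equilibrium frequencies (assumed uniform)


def jump_matrix(exchange, freqs):
    """Jump-chain probabilities for rates Q[i, j] = r[i, j] * pi[j], i != j."""
    rates = exchange * freqs[np.newaxis, :]
    np.fill_diagonal(rates, 0.0)  # a substitution always changes the state
    return rates / rates.sum(axis=1, keepdims=True)


def state_after_k_substitutions(exchange, freqs, start, k_max):
    """Row k-1 holds the distribution over states after exactly k substitutions."""
    jumps = jump_matrix(exchange, freqs)
    dist = np.zeros(len(freqs))
    dist[start] = 1.0
    rows = []
    for _ in range(k_max):
        dist = dist @ jumps
        rows.append(dist.copy())
    return np.array(rows)


table = state_after_k_substitutions(R, PI, STATES.index("D"), 5)
for k, row in enumerate(table, start=1):
    print(k, dict(zip(STATES, np.round(row, 3))))
```

Replacing the toy `R` and `PI` with the 20 x 20 exchange rates and frequencies of a real empirical matrix should give the kind of table discussed above, one row per number of substitutions.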

Why doesn't that work? There are at least two reasons. The first is that, as you can check from the matrix of relative exchange rates shown above, there is a high exchange rate, not just between D and E, but also between D and N. Then, from N, there is a high rate to S, etc. So, basically, what happens is that, in reality, you have some sites that make repeated substitutions between D and E, other sites between D and N, and still many other sites visiting other different but overlapping amino-acid orbits. In terms of selection pressure, this betrays the fact that the same amino-acid can be selected at different sites for different reasons: D and E are both negatively charged, but on the other hand, D and N have similar geometries for their side chain.

In this situation, when you estimate an average amino-acid replacement matrix simultaneously over all of those sites, you end up with significant exchange rates between D and a broad set of alternative amino-acids: E, N, but also G and H -- which then makes it very difficult for this matrix to explain the long-term confinement that can be seen at typical sites under long-term purifying selection.
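This averaging effect can be illustrated with a deliberately crude toy calculation in Python. I assume two hypothetical classes of sites, one alternating between D and E and one alternating between D and N, pool their substitution counts into a single jump matrix, and then ask how likely a site starting on D is to stay within {D, E} for k consecutive substitutions. The counts are invented for the sake of the illustration and the function names are mine.

```python
import numpy as np

STATES = ["D", "E", "N"]

# Hypothetical substitution counts (rows: from, columns: to) for two site classes.
COUNTS_CHARGE = np.array([[0, 90, 0], [90, 0, 0], [0, 0, 0]], dtype=float)  # D <-> E
COUNTS_SHAPE = np.array([[0, 0, 60], [0, 0, 0], [60, 0, 0]], dtype=float)   # D <-> N


def row_normalize(counts):
    """Turn counts into jump probabilities; rows without counts stay at zero."""
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


def confinement_probability(jumps, orbit, start, k):
    """Probability that k successive substitutions all remain inside `orbit`."""
    idx = [STATES.index(s) for s in orbit]
    sub = jumps[np.ix_(idx, idx)]  # keep only moves that stay within the orbit
    vec = np.zeros(len(idx))
    vec[orbit.index(start)] = 1.0
    for _ in range(k):
        vec = vec @ sub
    return vec.sum()


charge_only = row_normalize(COUNTS_CHARGE)
pooled = row_normalize(COUNTS_CHARGE + COUNTS_SHAPE)

for k in range(1, 5):
    p_true = confinement_probability(charge_only, ["D", "E"], "D", k)
    p_pooled = confinement_probability(pooled, ["D", "E"], "D", k)
    print(k, round(p_true, 3), round(p_pooled, 3))
```

In this toy setting, the class-specific process never leaves {D, E}, whereas under the pooled matrix the probability of staying confined shrinks every time the site passes through D.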

The second reason is more conceptual and connected to the following question: what should be encoded at the level of the transient regime of the process, versus what should be captured by its stationary regime?

The site shown above suggests that the selective constraint experienced by a typical site under strong purifying selection is stable in the long run. This may not always be true, but this is generally the case. This might in fact be even more pronounced in the context of phylogenetic analyses, where there is a selection bias in favor of proteins that have a highly conserved structure (this is what makes the alignment possible at such a large evolutionary scale in the first place). And thus, sites of those proteins tend to be in a conserved biochemical environment over very long evolutionary times.

If selection is stable in the long run, then it would perhaps make more sense to encode it in the stationary regime of the substitution process at a given site. Instead, the relative exchange rates of an amino-acid replacement matrix encode the transient regime of the substitution process, whereas the stationary regime is entirely determined by the equilibrium frequencies, which are basically equal to the observed amino-acid frequencies over the entire alignment. And what the figure above indicates is that the transient regime encoded by a typical amino-acid replacement matrix is, indeed, very transient.

This argument suggests that site-specific amino-acid preferences should perhaps be encoded at the level of the equilibrium frequency vector over the 20 amino-acids. But then this means that the equilibrium frequencies of the substitution process should be site-specific. This idea was first proposed, and explored, by Halpern and Bruno, back in 1998.
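To get a feeling for what this changes, here is a small Python sketch. It keeps one set of relative exchange rates but swaps the global equilibrium frequencies for a site-specific profile concentrated on D and E, using the simple form in which the rate from i to j is r_ij times the frequency of j. This is only an illustration of the idea, not necessarily the parameterization used by Halpern and Bruno, and all numbers (toy alphabet, rates, profiles) and names are assumptions of mine.

```python
import numpy as np

# Toy 4-letter alphabet with hypothetical exchange rates (not an empirical matrix).
STATES = ["D", "E", "N", "S"]
R = np.array([
    [0.0, 6.0, 5.0, 1.0],
    [6.0, 0.0, 1.0, 1.0],
    [5.0, 1.0, 0.0, 4.0],
    [1.0, 1.0, 4.0, 0.0],
])
PI_GLOBAL = np.full(4, 0.25)                  # alignment-wide frequencies (assumed)
PI_SITE = np.array([0.45, 0.45, 0.05, 0.05])  # assumed profile of a "D/E" site


def rate_matrix(exchange, freqs):
    """Q[i, j] = r[i, j] * pi[j] off the diagonal; each row sums to zero."""
    q = exchange * freqs[np.newaxis, :]
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))
    return q


def mass_in_orbit(exchange, freqs, start, orbit, k_max):
    """Probability of being inside `orbit` after 1, 2, ..., k_max substitutions."""
    jumps = rate_matrix(exchange, freqs)
    np.fill_diagonal(jumps, 0.0)
    jumps /= jumps.sum(axis=1, keepdims=True)  # embedded jump chain
    dist = np.zeros(len(freqs))
    dist[start] = 1.0
    masses = []
    for _ in range(k_max):
        dist = dist @ jumps
        masses.append(dist[orbit].sum())
    return np.array(masses)


orbit = [STATES.index("D"), STATES.index("E")]
global_mass = mass_in_orbit(R, PI_GLOBAL, STATES.index("D"), orbit, 5)
site_mass = mass_in_orbit(R, PI_SITE, STATES.index("D"), orbit, 5)
print("global profile:", np.round(global_mass, 3))
print("site profile:  ", np.round(site_mass, 3))
```

With symmetric exchange rates, the chosen profile is the stationary distribution of the process by construction, so in this sketch the long-term confinement is carried by the stationary regime rather than by the exchange rates.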

Wednesday, November 23, 2016
