## Tuesday, December 13, 2016

### Second-order amino-acid replacement processes

As I mentioned earlier, classical amino-acid replacement matrices indirectly encode site-specific amino-acid preferences in their first-order Markov dependencies, and this is optimal in a low saturation regime. Now, this suggests that we could derive second-order or k-th order Markov processes that would generalize this idea and capture more information about site-specific selection, in particular about the longer-term behavior. How would we do that?

Let us consider the second-order case. Assume that, at time $t$, the current amino-acid at a given site is $b$, and the previous amino-acid state at that site, before the last substitution event, was $a$. We can then characterize the current state of the process by the $(a,b)$ pair of amino-acids.

Then, we can define a Markov process directly on the pairs $(a,b)$, with $a \neq b$, and such that transitions are allowed only between compatible pairs — i.e. such that the current state before the event becomes the previous state after the event. Thus, for instance:

$Q_{(a,b) \to (b,c)}$

is the rate from $(a,b)$ to $(b,c)$, i.e. the rate of substitution from $b$ to $c$, given that the previous amino-acid state (before $b$) was $a$.

All other rates are set equal to 0, i.e.

$Q_{(a,b) \to (c,d)} = 0$

whenever $b \neq c$.
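As a rough illustration of this construction, here is a minimal Python sketch that enumerates the ordered pairs and fills in the allowed transitions. The function name `build_rate_matrix` and the callable `rate(a, b, c)` are hypothetical placeholders (in practice the rates would be free parameters to estimate), and the matrix is stored densely only for simplicity.

```python
from itertools import permutations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# States are ordered pairs (previous, current) with previous != current.
STATES = list(permutations(AMINO_ACIDS, 2))
INDEX = {state: i for i, state in enumerate(STATES)}


def build_rate_matrix(rate):
    """Build Q over pair states.

    `rate(a, b, c)` is a hypothetical callable returning the rate of the
    substitution b -> c given that the state before b was a.
    """
    n = len(STATES)
    Q = [[0.0] * n for _ in range(n)]
    for (a, b) in STATES:
        i = INDEX[(a, b)]
        for c in AMINO_ACIDS:
            if c == b:
                continue  # a substitution must change the current state
            # Only compatible pairs are reachable: (a, b) -> (b, c).
            Q[i][INDEX[(b, c)]] = rate(a, b, c)
        # The diagonal is still 0 here, so this sums the off-diagonal rates.
        Q[i][i] = -sum(Q[i])
    return Q
```

With a constant unit rate, `build_rate_matrix(lambda a, b, c: 1.0)` gives 19 non-zero off-diagonal entries per row, one for each possible next amino-acid (reversions to $a$ included).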

This defines a second-order 380x380 amino-acid replacement matrix (there are $20 \times 19 = 380$ ordered pairs with $a \neq b$), which can be exponentiated and then used for likelihood computation. Note that the instantaneous rate matrix $Q$ defined above is sparse; this will not be the case, however, for the exponentiated matrix $P = e^{tQ}$.
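To see the loss of sparsity concretely, the following toy sketch uses a hypothetical 3-letter alphabet (6 pair states) with unit rates, and a plain Taylor series with scaling and squaring as a stand-in for a proper matrix-exponential routine. Both choices are assumptions made to keep the example dependency-free, not a recommendation for real analyses.

```python
from itertools import permutations

ALPHABET = "ABC"  # toy alphabet, assumed for illustration only
STATES = list(permutations(ALPHABET, 2))
INDEX = {s: i for i, s in enumerate(STATES)}
N = len(STATES)

# Sparse rate matrix: (a, b) -> (b, c) at unit rate, zero elsewhere.
Q = [[0.0] * N for _ in range(N)]
for (a, b) in STATES:
    i = INDEX[(a, b)]
    for c in ALPHABET:
        if c != b:
            Q[i][INDEX[(b, c)]] = 1.0
    Q[i][i] = -sum(Q[i])


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(Q, t, squarings=10, terms=12):
    """exp(tQ) by Taylor series on tQ / 2**squarings, then repeated squaring."""
    n = len(Q)
    scale = t / (2 ** squarings)
    M = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for k in range(1, terms + 1):
        # term now holds M**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, M)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P


P = expm(Q, t=1.0)
zeros_in_Q = sum(1 for row in Q for x in row if x == 0.0)
all_positive = all(x > 0.0 for row in P for x in row)
```

In this toy case half of the entries of `Q` are zero, whereas every entry of `P` is strictly positive: any pair can be reached from any other through a short chain of compatible transitions.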

Concerning the pruning algorithm, the only modification, compared to the classically implemented first-order version, concerns the initialization of the recursion at the tips. If the observed state for a given taxon is $b$, then the conditional likelihood should be set equal to 1 for all pairs $(a,b)$, with $a \neq b$, and to 0 for all other pairs.
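The tip initialization can be sketched as follows. This is only an illustration of the rule just stated; the helper name `tip_partials` and the state ordering are assumptions, not part of any existing implementation.

```python
from itertools import permutations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = list(permutations(AMINO_ACIDS, 2))  # (previous, current) pairs
INDEX = {state: i for i, state in enumerate(STATES)}


def tip_partials(observed):
    """Conditional likelihood vector at a tip whose observed amino-acid is
    `observed`: 1 for every pair whose current state matches, 0 otherwise."""
    return [1.0 if current == observed else 0.0
            for (_previous, current) in STATES]


partials = tip_partials("D")
```

For an observed `D`, exactly 19 of the 380 entries are set to 1, one for each possible previous state.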

We can see that, compared to first-order matrices, this second-order model will be in a better position to implement the empirically observed tendency to make repeated substitution events among overlapping subsets of amino-acids — basically, by estimating high rates of transition of the following type:

$Q_{(E,D) \to (D,E)}$

$Q_{(N,D) \to (D,N)}$

etc.

The approach can in principle be generalized to k-th order processes, by defining transition rates between $(x_1, x_2, \ldots, x_k)$ and $(x_2, x_3, \ldots, x_{k+1})$.

Computationally speaking, this will not be very efficient. The pruning algorithm is quadratic in the number of states, and thus, even for the second-order case, going from 20 to 380 states already means a $(380/20)^2 = 361$-fold, i.e. nearly 400-fold, increase in computational time.
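The arithmetic behind this estimate, and its extension to the k-th order case, can be sketched as below. The state count assumes that consecutive amino-acids in a history must differ (as for the pairs $(a,b)$ with $a \neq b$), and the cost ratio counts only the quadratic per-branch term of the pruning algorithm, ignoring the cost of exponentiating the rate matrix. The function names are made up for this sketch.

```python
def n_states(k, alphabet=20):
    """Number of length-k histories in which consecutive states differ."""
    return alphabet * (alphabet - 1) ** (k - 1)


def pruning_cost_ratio(k, alphabet=20):
    """Cost of the quadratic pruning step relative to the first-order model."""
    return (n_states(k, alphabet) // alphabet) ** 2


for k in (1, 2, 3):
    print(k, n_states(k), pruning_cost_ratio(k))
# 1 20 1
# 2 380 361
# 3 7220 130321
```

The ratio grows as $19^{2(k-1)}$, which suggests that anything beyond the second order would need to exploit the sparsity of $Q$ or some other shortcut.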

Still, I could imagine that, compared to simple 20x20 amino-acid replacement matrices, the second-order version could already make quite a difference in terms of phylogenetic accuracy. One advantage of this approach is that it is really a classical parametric likelihood model, without any complicated issue about modeling random-effects across sites, from an unknown, and potentially complex, distribution.

Conceptually speaking, this model also illustrates a fundamental idea about the relation between processes and parameters. Essentially, the parameterization of a model does not necessarily correspond to the actual mechanism. Instead, it already incorporates a level of statistical thinking, about the identifiable signal induced by the process over a large sample of observations. Usually, this signal can be decomposed in terms of frequencies over successive moments — and those frequencies are then captured by a parametric model, in terms of either proportions or rates.

1. Interesting post!

(1) One important feature of the above formulation is that it automatically loses reversibility! Thus, the likelihood depends on the choice of the root node. Is there a formulation of a "second order" process that can maintain reversibility?

(2) Another feature of the above proposal is that it depends on the current and previous amino acid states regardless of how long the current amino acid has been present. Thus a highly conserved site and a purely neutral site have the same rates if they have the same two most recent amino acids.

A different formulation of a second order chain is to have the rates depend on the current amino acid and the amino acid t time units ago.

This can be generalized to a k-th order chain by having the rates depend on the current amino acid and the amino acids $i \cdot t$ time units ago for $i = 1$ to $k-1$.

This class of models has the nice feature that it approaches the first order model for fixed $k$ as $t \to 0$ but can be made to approach the model where the rates depend on the whole substitution history if we let $k \to \infty$ and $t \to 0$ in an appropriate manner.

Q: Do there exist non-trivial high-order reversible models in this fixed-lag class? The answer is not obvious to me.

2. Thanks for your comments and questions.

Good point about time-reversibility, although I am not sure about the exact reasons why we would really want to enforce this property. It is certainly convenient (the pulley principle, as you point out), but it is not necessarily adequate, empirically speaking (e.g. we know that mutation rates are not time-reversible).

Concerning the second point: what you say is true, but after all, the same objection could be raised in the case of classical first-order processes: the amino-acids that are accepted at a given site, given the current state, do not depend on the substitution rate. Yet, it is reasonable to imagine that a more constrained site should in principle undergo more conservative amino-acid replacements, compared to a less constrained site.

3. This is an interesting idea indeed. Yet, it is not quite clear to me how this model articulates with the pruning algorithm. Under a 'traditional' first-order Markov model of substitution, the partial likelihood at a tip u corresponds to Pr(D(u)=y|N(u)=x) where D(u) is the observed sequence data (at one particular site), and N(u) is the state at tip u. Then Pr(D(u)=y|N(u)=x)=1 whenever x 'agrees' with y and 0 otherwise (e.g., for nucleotide data, if y is a purine then Pr(D(u)=y|N(u)=x)=1 if x is in {A,G}). Now, under the model you propose here, Pr(D(u)=y|N(u)=x)=1 for all substitutions x that lead to nt/aa/codon y and zero otherwise. But, surely one cannot simply ignore the timing of these substitutions, right? More generally, because the Markov model you propose here involves only non-observable states, it is not quite clear to me how the 'connection' to actual sequences works.