Tuesday, November 29, 2016

Site-specific selection and phylogenetic accuracy

In my , I argued that classical amino-acid replacement matrices do not faithfully describe the long-term evolutionary process at typical positions of protein-coding sequences. More specifically, those matrices do not correctly capture the long-run restrictions induced by puriyfing selection on the spectrum of amino-acids that are accepted at a typical site -- they anticipate too many amino-acids in the long run. In the words of Claus Wilke, empirical amino-acid replacement matrices are derived by pooling sites, but in the end, no specific, individual site behaves like a ‘pooled’ site.

Now, this has very important consequences on phylogenetic accuracy. To illustrate this point, here is a case presented by Andrew Roger in his phyloseminar, concerning the position of Microsporidia in the phylogeny of eukaryotes. This is based on a phylogenomic dataset of Brinkmann et al (2005), with 133 genes ($\sim$ 24000 aligned positions). I have reanalyzed this dataset, using two models, one that uses the same amino-acid replacement matrix for all sites (LG), and another one that models site-specific amino-acid equilibrium frequencies (CAT-GTR).

The tree reconstructed by the LG model gives Microsporidia sister group to all other eukaryotes:

In contrast, the tree obtained once you account for site-specific amino-acid preference in the long run (using CAT-GTR) shows Microsporidia sister-group to Fungi:

We used to believe that Microsporidia were ‘early-emerging’ eukaryotes, as suggested here by LG. The fact that they lack mitochondria was interpreted as a primitive trait: supposedly, the split between Microsporidia and the rest of eukaryotes would have pre-dated the endosymbiotic event leading to the acquisition of the mitochondrion by the so-called ‘crown’ eukaryotes. However, there are now quite a few lines of evidence suggesting that Microsporidia are in fact closely related to Fungi, as suggested here by CAT-GTR; also, their lack of mitochondria is probably due to a secondary loss (or, more accurately, a secondary simplification, since Microsporidia have hydrogenosome-like organelles that might well be the vestige of their mitochondria).

In fact, the tree obtained under the LG model does not just show Microsporida sister to all other eukaryotes. It also gives a somewhat unusual phylogeny for those other eukaryotes, with Fungi branching first. However, if we ignore the rooting, this tree differs from the tree obtained by CAT-GTR by just moving Archaea from their branching point between unikonts and bikonts to a position where they are sister group to Microsporidia. Thus, here, it is not so much that Microsporidia ‘move down’ in the tree because they are attracted by the archaean outgroup. Instead, it is the outgroup itself which ‘goes up’ in the tree, being attracted by Microsporidia. In any case, whichever way you decide to see it, this is basically a long branch attracted by another long branch -- thus, a paradigmatic case of a long-branch attraction artifact.

So, we have a clear example here, where accounting for site-specific amino-acid preference appears to result in a greater robustness against tree reconstruction errors. See also Lartillot et al (2007) for other examples, concerning the phylogenetic position of nematodes and platyhelminths.

Why is it that models accounting for site-specific amino-acid restrictions are more accurate? The main reason is relatively simple and can be summarized as follows.

Convergent substitutions (or, equivalently, reverting substitutions) represent the primary source of tree reconstruction errors. Therefore, in order to correctly discriminate between shared-derived characters (signal) and convergent or reverting substitutions (noise), it is particularly important to correctly model all of the factors that could cause a high rate of convergent substitution in real data.

Now, the probability of convergent evolution toward the same amino-acid at the same site, in two unrelated species separated by a large evolutionary distance, is roughly the inverse of the number of possible amino-acids at that site. Thus, a small number of acceptable amino-acids per site mechanically implies a high probability of convergent evolution.

And thus, one can easily imagine that a model not correctly accounting for site-specific amino-acid restrictions, because it essentially assumes that all amino-acids are accepted at all sites in the long run, will automatically underestimate the probability of convergent evolution. Therefore, by mistaking noise for signal, it will tend to produce artifacts.

In contrast, a model explicitly accounting for site-specific amino-acid restrictions is in a better position to correctly estimate the probability of convergent substitution, and as a result, is better calibrated in terms of the signal/noise ratio. This is what make such site-specific models more robust against tree reconstruction artifacts.

All of what I have seen thus far suggests that this effect due to site-specific amino-acid restrictions is quite strong and that it can have a significant impact on phylogenetic accuracy over deep evolutionary times. In a sense, this is not so surprising: the number of amino-acids typically accepted at a given site is on average of the order of 4 or 5, out of 20 amino-acids -- which means that there is room here for an underestimation of the probability of convergent evolution (and thus, an underestimation of the prevalence of noise) by a factor of 4. I will come back to this point in later posts, in order to make it more precise and more quantitative.

In any case, this connection between site-specific amino-acid restrictions and the rate of convergent evolution, combined with what we have seen earlier concerning the problems inherent to the idea of imposing a single amino-acid replacement matrix to a heterogeneous collection of amino-acid positions, each with its own selective requirements, captures one of the most fundamental limitations of the simple amino-acid replacement models that are still currently used in phylogenomics.

--

Henner Brinkmann, Mark van der Giezen, Yan Zhou, Gaëtan Poncelin de Raucourt, and Hervé Philippe (2006). An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics., 54(5), 743–757. http://doi.org/10.1080/10635150500234609

Lartillot, N., Brinkmann, H., & Philippe, H. (2007). Suppression of long-branch attraction artifacts in the animal phylogeny using a site-heterogeneous model. BMC Evolutionary Biology, 7 Suppl 1, S4. http://doi.org/10.1186/1471-2148-7-S1-S4

Friday, November 25, 2016

Double standard (2)

Just a side remark: in my last post, I pointed out that classical amino-acid replacement matrices implicitly encode site-specific selection in the first-order Markov dependencies of the process, and that doing so is optimal in situations of low sequence divergence (more specifically, when the fractions of sites that have more than one substitution event over the whole tree is negligible).

But then, in this situation, there is also no information that can be gained about rate variation across sites. In this situation, sites have either 0 or 1 substitution event, and the only meaningful statistic that we have is the fraction of sites with one substitution — which gives us an estimate of the mean rate of substitution across sites, but otherwise, does not tell us anything about the variance (the shape parameter alpha).

An analogy can be made here with an imaginary coin-tossing experiment, in which you would make each toss with a different coin, all of which have potentially different biases (different probabilities of giving a head). Thus, there is an unknown distribution, over coins, of the probability of getting head.

In this situation, with only one draw per coin, your fraction of heads is equal to the average probability of getting head over all coins. However, this is the only information that you can get from this experiment. You need to make at least 2 tosses per coin in order to have information about the variance of the head probability over coins. An increasing number of tosses per coin give you information about an increasing number of the moments of the distribution.

In any case, coming back to sequence evolution, the main point I wanted to make is this: empirical amino-acid replacement matrices are the perfect solution in those situations where you also cannot meaningfully estimate the variance in rates across sites. Conversely, as soon as you have sufficient sequence divergence, giving you empirical information about rate-heterogeneity, then you also have useful information concerning pattern-heterogeneity — which will not be captured by a single amino-acid replacement matrix.

On a more general note, making the parallel between rate- and pattern-heterogeneity is, I think, insightful.

Also, to address one of the comments: of course, there are computational issues behind our potential methodological inconsistencies. But this does not prevent us from pointing them out and raising a little sense of discomfort about what we easily take for granted in our daily routine. That's the only way we can make any progress.

Wednesday, November 23, 2016

Long-term site-specific selective constraints

There is a catch in my argument about the double standards of some of the current phylogenetic models, concerning rate- versus pattern-heterogeneity across sites. It is unfair to suggest that a model does not capture the consequences of varying selection along the sequence just because it uses the same amino-acid replacement matrix across all sites. In fact, empirical amino-acid replacement matrices do capture the effects of site-specific selection, but it is just that they do it implicitly and globally -- simply by proposing higher exchange rates among biochemically similar amino-acids.

If we look for instance at the WAG matrix:

we see that this matrix gives particularly high exchange rates between I and V and E and D. The way those exchange rates capture site-specific selection is implicit, the logic being as follows: given that a site is currently is state D, then, probably, this site is under (site-specific) selection for a negatively charged amino-acid, and therefore, we expect that the next substitution at that site will be an E with high probability.

So, classical amino-acid replacement matrices do model site-specific selection. Technically, they encode it in the first-order Markov dependencies of the amino-acid replacement process. And in fact, this approach is optimal in non-saturated situations, where each site of the alignment typically has 0 or 1 substitution event over the whole tree, but rarely more than that. In this situation, there is no hope to get more information than what is already captured in this first-order Markov structure.

However, things are quite different in a deep evolutionary context, where each site makes potentially many substitution events. Here is an example (taken from an empirical alignment at the scale of metazoans):

We see that this site is apparently under strong selective constraint to accept negatively charged amino-acids, but otherwise, can make repeated substitutions between D and E (and sometimes visit other amino-acids). In other words, this site (and many other sites) tends to be confined, in the long-run, within a very restricted subset of all possible amino-acids.

Are such site patterns easily predicted (and reproduced) by classical empirical matrices? To check this, we could for instance simulate repeated substitution histories for a given site that would start on an aspartic acid (D), and then tabulate the frequencies at which the 20 possible amino-acids are observed after 1, 2, ... k substitutions. Here is what you get with the WAG model for a site starting on an aspartic acid (D) or on a lysine (K), and over the first 5 substitution events:

As you can see, the substitution process implied by the WAG matrix does not maintain itself for a very long time within the restricted orbit of the two negatively charged amino-acids D and E (same thing for the positively charged amino-acids K and R). Instead, a typical site very quickly looses memory of the selective constraint it is supposed to stay under in the long run, such that after as few as 4 substitution events, it ends up sampling amino-acids according to proteome-wide average frequencies.

Why doesn't that work? There are at least two reasons. The first is that, as you can check from the matrix of relative exchange rates shown above, there is a high exchange rate, not just between D and E, but also between D and N. Then, from N, there is a high rate to S, etc. So, basically, what happens is that, in reality, you have some sites that make repeated substitutions between D and E, other sites between D and N, and still many other sites visiting other different but overlapping amino-acid orbits. In terms of selection pressure, this betrays the fact that the same amino-acid can be selected at different sites for different reasons: D and E are both negatively charged, but on the other hand, D and N have similar geometries for their side chain.

In this situation, when you estimate an average amino-acid replacement matrix simultaneously over all of those sites, you end up with significant exchange rates between D and a broad set of alternative amino-acids: E, N, but also G and H -- which then makes it very difficult for this matrix to explain the long-term confinement that can be seen at typical sites under long-term purifying selection.

The second reason is more conceptual and connected to the following question: what should be encoded at the level of the transient regime of the process, versus what should be captured by its stationary regime?

The site shown above suggests that the selective constraint experienced by a typical site under strong purifying selection is stable in the long run. This may not always be true, but this is generally the case. This might in fact be even more pronounced in the context of phylogenetic analyses, where there is a selection bias in favor of proteins that have a highly conserved structure (this is what makes the alignment possible at such a large evolutionary scale in the first place). And thus, sites of those proteins tend to be in a conserved biochemical environment over very long evolutionary times.

If selection is stable in the long run, then it would perhaps make more sense to encode it in the stationary regime of the substitution process at a given site. Instead, the relative exchange rates of an amino-acid replacement matrix encode the transient regime of the substitution process, whereas the stationary regime is entirely determined by the equilibrium frequencies, which are basically equal to the observed amino-acid frequencies over the entire alignment. And what the figure above indicates is that the transient regime encoded by a typical amino-acid replacement matrix is, indeed, very transient.

This argument suggests that site-specific amino-acid preferences should perhaps be encoded at the level of the equilibrium frequency vector over the 20 amino-acids. But then this means that the equiibrium frequencies of the substitution process should be site-speciifc. This idea was first proposed, and explored, by Halpern and Bruno, back in 1998.

Tuesday, November 22, 2016

Double standard

Nobody today would publish a phylogeny that has been inferred using a model that does not account for rate variation across sites, at least not a phylogeny over a large evolutionary scale (e.g. animals, eukaryotes, etc).

And indeed, using such rates-across-sites models, we usually do uncover substantial rate variation in empirical sequence alignments: the shape parameter of the gamma distribution (alpha) is typically less than 1, of the order of 0.7. Below is a plot of the density of the gamma for alpha=0.7. The 0.25 and 0.75 quantiles are 0.18 and 1.38 respectively. Thus, the 25% fastest evolving sites are evolving about 10 times faster than the 25% slowest sites.

Also, it is clear that much of this variation occurs within genes: in about any gene, you expect to find both slow and fast sites (you just need to fit the gamma model on each gene separately and observe that the estimated value for the alpha parameter is, again, typically less than 1).

Now, why is there so much variation across sites in the overall rate of substitution? Mostly because of selection. There is some contribution from varying mutation rates. However, this is probably minor. For the most part, what explains rate variation across sites is simply that some sites are highly conserved (strong purifying selection), other sites are less conserved, and a minor fraction is under positive selection.

But then, if the selective pressure is so different at different sites, in terms of its overall strength, don’t we expect that it should also be different in terms of which types amino-acids are preferred at each site?

What I am suggesting here is that using a model that allows for rate variation across sites, but imposes the same amino-acid replacement matrix across all sites of a given gene — or, worse, across an entire multi-gene alignment — amounts to applying a double standard with respect to the consequences of varying selection along the sequence.

And yet this is what many current phylogenetic reconstruction models still do.

Monday, November 21, 2016

Ctenophores and the role of models in phylogenetics

I gave a phyloseminar last week, on the problem of dealing with patter-heterogeneity across sites in phylogenetic inference. I find this idea of phyloseminars really neat. Many thanks (and kudos) to Erick Matsen, for organizing all this. There is now quite a rich collection of phyloseminars given by many people and on many subjects in evolutionary sciences, which represents a such a valuable resource on the web. Here is the link:

To put things in context, I have been caught in the middle of a controversy about the phylogeny of animals — more specifically, but not only, concerning the phylogenetic position of ctenophores (comb jellies).  As you may know, ctenophores have been recently proposed as being the sister-group to all other animals, including sponges (Moroz et al, 2014, Ryan et al, 2014). This ctenophore-sister hypothesis is not supported, however, when you use more sophisticated models accounting for variation across sites in the amino-acid replacement process (the CAT and CAT-GTR models, implemented in PhyloBayes). Instead, these two models give sponges sister to all other animals, as is more traditionally believed (although never with really strong support, and in a way that depends on the exact choice for the outgroup). A detailed analysis is given in Pisani et al (2015).

My interpretation of all this (which is also the interpretation suggested in Pisani et al, 2015) is that ctenophore-sister is fundamentally a long-branch attraction artifact caused by the use of inadequate models of sequence evolution. But then, not everyone agree with this interpretation (e.g. Halanych et al, 2016). And progressively, this controversy about ctenophores is becoming a question about which models — and which software programs —  should be used in the context of large-scale phylogenetic reconstructions. The phyloseminar I gave last week is in fact the last of a series of three (the other two were given by Andrew Roger and Nathan Whelan), series which can be seen as organizing a debate about the role of site-heterogeneity in phylogenetic inference.

In any case, all this raises very important questions about the current situation in the field of phylogenetic studies. Are current models and methods adequate, in particular, for reconstrcuting ‘deep’ phylogenies (over deep evolutionary times)? Where do we stand today, and what can we hope, in terms of methodological progress, for the next years on this front?

In the next posts I will publish on this blog, I will come back to some of those questions. So, bear with me!..

References

Halanych, K. M., Whelan, N. V., Kocot, K. M., Kohn, A. B., & Moroz, L. L. (2016). Miscues misplace sponges. Proceedings of the National Academy of Sciences, 113(8), E946–7. http://doi.org/10.1073/pnas.1525332113

Moroz, L. L., Kocot, K. M., Citarella, M. R., Dosung, S., Norekian, T. P., Povolotskaya, I. S., et al. (2014). The ctenophore genome and the evolutionary origins of neural systems. Nature, 510(7503), 109–114. http://doi.org/10.1038/nature13400

Pisani, D., Pett, W., Dohrmann, M., Feuda, R., Rota-Stabelli, O., Philippe, H., et al. (2015). Genomic data do not support comb jellies as the sister group to all other animals. Proceedings of the National Academy of Sciences, 112(50), 15402–15407. http://doi.org/10.1073/pnas.1518127112

Ryan, J. F., Pang, K., Schnitzler, C. E., Nguyen, A.-D., Moreland, R. T., Simmons, D. K., et al. (2013). The genome of the ctenophore Mnemiopsis leidyi and its implications for cell type evolution. Science, 342(6164), 1242592. http://doi.org/10.1126/science.1242592

Resurrection

This blog has been silent for such a long time (two years…). But there are quite a few things I would like to share with you, concerning various current developments in phylogenetics and in Bayesian inference. So, let start it up again.

Thursday, November 13, 2014

Bayesian diversification models (2)

Concerning my last post, I realize that my notation for rejection sampling models is a bit elliptic. In particular, the rejection step corresponding to the fact that we condition on the observed data is never explicitly written. So, let me just reformulate the technical arguments and restate the main idea.

In the general case, i.e. for any Bayesian model, the rejection sampling algorithm could be written as follows:

REPEAT
sample $\theta$ from prior
simulate data given $\theta$
UNTIL simulated data match observed data $D$
return $\theta$

What this algorithm returns is one single value of $\theta$ sampled from the posterior distribution:

$p(\theta \mid D) \propto p(\theta) p(D \mid \theta)$.

Of course, you can imagine that one would typically run this algorithm many times and then use the resulting sample of independent draws from the posterior to estimate posterior mean, median, credibility intervals, etc.

Note that I use upper case letters for the REPEAT-UNTIL loop corresponding to the conditioning on empirical data -- to distinguish it from other repeat-until loops that might also be invoked, just because of the data acquisition protocol.

Now, in the case of a diversification model, we have to account for the fact that we observe only surviving species clades. In addition, we imagine that there is some natural variation among clades in diversification rates, represented by a distribution $\phi$ on the diversification parameter $\theta = (\lambda, \mu)$, such that, each clade produced by Nature would have its speciation and extinction rates independently sampled from $\phi$.

Thus, altogether, our complete rejection sampling would look like:

REPEAT
repeat
Nature chooses $\theta$ from $\phi$
Nature runs diversification process given $\theta$
until process survives
UNTIL simulated phylogeny matches observed phylogeny $T$
return $\theta$

We now have two repeat-until loops: the inner one represents the fact that, by construction, we only consider non extinct groups. The outer one represents our conditioning on the observed phylogeny $T$. However, this could just as well be written as:

REPEAT
Nature chooses $\theta$ from $\phi$
Nature runs diversification process given $\theta$
UNTIL simulated phylogeny matches observed phylogeny
return $\theta$

simply because an empty tree will never match our observed phylogeny anyway. Thus, we end up with a value of $\theta$ sampled from a very simple posterior distribution, one that does not invoke any conditional likelihood (this was my equation 1 in the previous post):

(1)    $p(\theta \mid T) \propto \phi(\theta) \, p(T \mid \theta)$

This posterior distribution combines a prior not conditional on survival $\phi(\theta)$, with a likelihood also not conditional on survival $p(T \mid \theta)$.

Now, you can always rewrite this as:

$p(\theta \mid T) \propto \phi(\theta) p(S \mid \theta) \, p(T \mid \theta) / p(S \mid \theta)$

or, equivalently:

(2)   $p(\theta \mid T) \propto p(\theta \mid S) \, p(T \mid \theta, S)$

where $p(S \mid \theta)$ is the probability of survival of the clade. Thus, we now have a prior conditional on survival together with a likelihood also conditional on survival. Equation 1 and 2 are equivalent, and you can choose between them, fundamentally, based on which prior information you have (fossils: equation 1, other neontological studies: equation 2).

But in any case, you never get, under this algorithmic model, an unconditional prior combined with a conditional likelihood. The only way to get this combination is by considering another model:

REPEAT
choose $\theta$ from some prior $p(\theta)$
repeat
Nature runs diversification process given $\theta$
until process survives
UNTIL simulated phylogeny matches observed phylogeny

which basically corresponds to the case where Nature has produced many clades, all under the exact same diversification rates (same parameter vector $\theta$), and we have randomly sampled one among the surviving clades thus produced. Here, we still draw $\theta$ from some prior distribution $p(\theta)$, but this distribution now merely represents our prior uncertainty about the fixed value of $\theta$ 'chosen by Nature' -- not the natural variation of $\theta$ among groups. So, here, we cannot collapse the two repeat-until loops, we have to keep the two: an internal loop corresponding to the actual 'repeated natural experiment', under fixed but unknown $\theta$, and an external loop simply representing our Bayesian inference about this unknown value of $\theta$, by rejection sampling, based on some subjective prior $p(\theta)$.

Personally, I consider this second model less convincing than the first one: we do see in practice quite some variation among closely related groups, so, we should invoke some distribution $\phi$ representing this variation.

Of course, ultimately, none of the two models is really convincing: as soon as you acknowledge the existence of variation in diversification rates between clades, then you should also expect variation within clades: if diversification rates differ between rodents and primates, then, they almost certainly vary among primates, or among rodents. Mathematically, $\theta$ should itself be the result of a stochastic process along the lineages. In fact, some interesting work has already been done along those lines, using a compound Poisson process (Rabosky, 2014).

===

Rabosky, D. L. (2014). Automatic detection of key innovations, rate shifts, and diversity-dependence on phylogenetic trees. PLoS One, 9(2), e89543. doi:10.1371/journal.pone.0089543

Monday, November 10, 2014

Bayesian priors for diversification studies

In my last post, I was asking whether one should condition the likelihood on survival in the context of diversification studies. The whole discussion, however, was purely at the level of the likelihood. Now, how should we look at this problem from a Bayesian perspective?

Assuming that $\theta = (\lambda, \mu)$ is our vector representing the speciation and extinction rates, I was suggesting the following algorithmic model:

Nature has fixed distribution $\phi(\theta)$
repeat
Nature chooses $\theta$ from $\phi$
Nature runs diversification process given $\theta$
until process survives

where $\phi$ is the natural variation in diversification rate among clades (both extant and extinct). Our observed phylogeny can then be seen as one particular draw from this simulation model. Conditioning on the observed phylogeny $T$ leads to the following posterior distribution:

(1)    $p(\theta \mid T) \propto \phi(\theta) \, p(T \mid \theta)$

involving the likelihood unconditional on ultimate survival. I was then considering two limiting cases of this equation in the context of a maximum a posteriori approach.

Now, simply identifying $\phi$ with our prior and sampling from the posterior distribution defined by equation 1 would represent a meaningful Bayesian formalization of the problem. However, this assumes that we can see our prior as a reasonable estimate of the true natural variation in diversification rates $\phi$  and not just as a representation of some loosely defined prior beliefs about possible values of $\theta$. So, in particular, this has to be an informative prior.

Do we have practical ways to do this? Yes, I think so. In fact, I can see two different approaches to this problem.

Fossil data  Suppose we are able to come up with an acceptable estimate of the distribution of diversification rates from the fossil record. Since $\phi$ in equation 1 is to be interpreted as the distribution of diversification rates across all clades, both extant and extinctwe can directly identify $\phi$ with our empirical estimate $\hat \phi$ obtained from fossils, which leads us to the following posterior distribution:

(2)    $p(\theta \mid T) \propto \hat \phi(\theta) \, p(T \mid \theta)$.

Neontological data  Alternatively, suppose that we already have estimates of $\theta$ across several other clades (e.g. several mammalian orders), obtained by fitting or conditioning diversification models on their respective phylogenies. We would like to use this information to derive the prior for our clade of interest (say, Carnivores). Our prior information, however, is now itself conditional on survival, so, we can not simply calculate the mean and the variance of the speciation and extinction rates across other clades and use it as a proxy for $\phi$, since $\phi$ is not conditional on extinction.

On the other hand, we can re-write equation 1 as follows:

$p(\theta \mid T) = \phi(\theta) p(T \mid \theta) = \phi(\theta) p(S \mid \theta) \, p(T \mid \theta) / p(S \mid \theta) = p(\theta \mid S) \, p(T \mid \theta, S)$.

where $p(S \mid \theta)$ is, as in the previous post, the probability of ultimate survival of the clade given $\theta$. So, here, we have two factors: a conditional prior:

$p(\theta \mid S) \propto \phi(\theta) p(S \mid \theta)$,

which can be represented by the following sampling model:

repeat
choose $\theta$ from $\phi$
run a diversification process under $\theta$
until process survives
return $\theta$

clearly suggesting that this is the prior distribution of values of $\theta$ across surviving groups. And a likelihood now conditional on survival:

$p(T \mid \theta, S) = p(T \mid \theta) / p(S \mid \theta)$,

which we have already seen. What all this means is that, instead of working with $\phi$ and with the unconditional likelihood, we can equivalently work with a prior conditioned on survival, but then we should combine it with the conditional likelihood. In practice, we can for instance calculate the mean and the variance of speciation and extinction rates estimated on other mammalian orders and, using a moment-matching argument, derive an empirical prior $\hat \psi(\theta)$, now conditional on survival. The posterior is then:

(3)    $p(\theta \mid T) \propto \hat \psi(\theta) \, p(T \mid \theta, S)$.

Note that this idea of combining the conditional likelihood with a conditional prior is similar to what is proposed in Stadler et al (2013).

Thus, it all depends on what prior information you have. If your prior information is obtained from fossil evidence, then this should be considered as unconditional prior, to be used in conjunction with an unconditional likelihood (equation 2). If, on the other hand, your prior information has been gathered from other neontological studies, then this prior information is conditional on survival, and as such it should be be combined with a conditional likelihood (equation 3).

No prior information  Now, what if we do not want to use fossil evidence, nor any information gained from other clades? We would like to say that we do not know anything about $\lambda$ and $\mu$ and, accordingly, use a non-informative prior. However, which likelihood should we combine with this uninformative prior? conditional or unconditional?

This is a bit tricky. First, there would be much to say about uninformative priors in general, before dealing with the question of how to use them in the present case. Second, technically, the only Bayesian prior that we can meaningfully introduce here should express our uncertainty about $\phi$. So, we would end up with two levels in our model: $\theta$ sampled from $\phi$, itself sampled from some hyperprior. We could explicitly derive something along those lines, but this would be uselessly complicated for the present matter. Instead, we can pretend that we implicitly marginalize our complicated two-level model over the hyperprior, ending up with an effective prior directly on $\theta$.

At the end, I think it is just a question of deciding whether we want to express lack of information about $\theta$ among all clades or only among surviving clades. In the first case, we would combine our uninformative prior with the unconditional likelihood, in the second case, with the conditional likelihood*. Personally, I think I would prefer to express lack of information among all clades, for it is not true that I don't know anything about the diversification rate of an extant species group: I know in particular that $\lambda - \mu$ cannot be unreasonably low, otherwise, I would not have observed this clade.

In summary, in a Bayesian context, there seems to be some flexibility in which version of the likelihood we can use: in fact, it all depends on the meaning attached to the prior.

We can of course imagine more complicated settings: estimating $\phi$ by simultaneously considering several clades in the context of a hierarchical Bayesian model, constructing integrative models jointly considering fossil and molecular data, etc. All this can lead to relatively complicated arguments and calculations for deriving the exact prior to be used  but this is exactly where algorithmic representations can be useful.

===

Stadler, T., Kühnert, D., Bonhoeffer, S., & Drummond, A. J. (2013). Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV). Proceedings of the National Academy of Sciences, 110(1), 228–233. doi:10.1073/pnas.1207965110

* Strictly speaking, there is a third case: we could also entertain the model in which there is in fact no available variation under $\phi$, so that $\phi$ is highly concentrated around some value $\theta_0$ (as considered in my previous post). But we don't know the location of $\theta_0$ and therefore we put an uninformative prior on it. In that case, we would combine the uninformative prior with the conditional likelihood. However, personally, I am not convinced by this specific model: I tend to think that there is variation among clades, so that the clades that we observe are statistically enriched in higher diversification rates -- a fact that should be included in, and not factored out from, what we call our empirical data.

Thursday, November 6, 2014

Should we condition on non-extinction?

Diversification studies are concerned with the problem of testing models of species diversification and estimating their parameters (such as speciation and extinction rates) based on time-calibrated phylogenies. One particular question that regularly comes up in this context is whether and how we should take into account, in the inference procedure, the fact that, by construction, we only consider non-extinct groups.

A good sign that this question is not so obvious is that many different solutions seem to be considered in practice: likelihood functions are sometimes conditional on non-extinction (Stadler et al, 2013), sometimes conditional on the age of the last common ancestor of the group (Nee et al, 1994), sometimes not conditioned at all (Maddison et al, 2007, FitzJohn, 2010), or even conditioned on the number of species sampled today (in the context of molecular dating, Rannala and Yang, 1996). So, clearly, the question requires some clarification.

As a simple illustrating case, let us assume a diversification process with constant but unknown rates of speciation ($\lambda$) and extinction ($\mu$), and let us call $\theta = (\lambda, \mu)$ our parameter vector. Technically, we also have a third parameter, $t_0$, the time of origin of the process, but let us leave this detail aside for the moment (it does not really change the main argument).

A first attempt at deriving a likelihood for this problem is simply to consider the probability of producing the observed tree $T$, conditional on the parameter $\theta$:

(1)    $p(T \mid \theta)$.

One problem, however, is that there is a positive probability that the process gets extinct and therefore returns an empty tree:

$p(T = \emptyset \mid \theta) = 1 - p(S \mid \theta) > 0$,

where $p(S \mid \theta)$ is the probability of ultimate survival of the group, given the parameters. Yet, by the very design of the experiment, the only cases that we consider in practice are non-empty trees, and we would like our likelihood to sum to one over all possible observable configurations for the data — thus, over all possible non-empty trees.

We can avoid this ‘missing mass’ problem and restore a correct normalization of the probability by just conditioning on ultimate survival:

(2)    $p(T \mid \theta, S) = \frac {p(T \mid \theta)} {p(S \mid \theta)}$.

We could then take this conditional probability as our likelihood and proceed to estimation, for instance, by maximizing this conditional likelihood with respect to $\theta$.

One important thing to note here is that, since the unconditional likelihood (1) accounts for the probability of surviving, while the conditional likelihood (2) factors this effect out of the probability, then, everything else being equal, parameter configurations that tend to increase the chances of ultimate survival will be favored by the unconditional likelihood (1), but not by the conditional likelihood (2). In particular, the probability of survival is strongly dependent on the net expected growth rate of the species group $r = \lambda - \mu$, with larger values of $r$ (higher growth rates) implying a higher chance of ultimate survival. Thus, if you compare your estimates returned by each of the two versions of the likelihood, you will in general obtain a larger value for $r$ under (1) than under (2).

So, which likelihood should we use? At first sight, the normalization argument suggested above would seem to imply that we should of course use the conditional version (equation 2). But is that so simple?

I am not sure. Let us try to derive an algorithmic model of our problem. Fundamentally, what our conditional likelihood means, in algorithmic terms, is this:

Nature chooses a fixed parameter configuration $\theta = (\lambda,\mu)$
repeat
- Nature runs a diversification process with rates $\lambda$ and $\mu$
until process survives

First, note that the repeat loop here refers to a true, objective, repetition of the generating process: we use a conditional likelihood precisely because we consider that a typical instance of the experiment potentially involves several repeated attempts until a surviving species clade is produced. In practice, we can imagine that Nature has 'run' many clades in parallel, only some of which did not suffer from premature extinction — and the one we are considering is randomly chosen from this surviving subset. But since we assume that all the runs are independent, it is mathematically equivalent to imagine, as suggested by the 'repeat' loop above, that Nature runs the process repeatedly until the first surviving clade is produced — and we can then identify this first successful draw with our observation.

But then, it seems that we are assuming that Nature has repeatedly, stubbornly, tried under the same parameter configuration, until succeeding. However, I am not sure that this is really a good model of what actually happens.

Imagine for instance that we are more specifically interested in one particular order of mammals; say, we want to estimate the speciation and extinction rates of Carnivores. Carnivores are just one among ~25 mammalian orders. There is no doubt that there is quite some variation in diversification rates among mammalian orders: think about rodents versus primates. But then, if there is variation in diversification rates among surviving mammalian clades, there has surely been variation, more generally, among all ‘replicates’ that Nature has run in parallel, surviving or not.

So, perhaps our model should instead be something like the following:

Nature chooses a fixed distribution $\phi$ over the parameter $\theta = (\lambda, \mu)$
repeat
- Nature chooses a random $\theta = (\lambda,\mu)$ from $\phi$
- Nature runs a diversification process with rates $\lambda$ and $\mu$
until process survives

Importantly, $\phi$ is not really a Bayesian prior. Instead, it is meant to be a representation of an objective property of the process: the true variation among clades, both extant and extinct.

Notice also that this new model introduces potential selection effects on the parameter configurations that you end up observing: those clades that survive are statistically enriched in values of $\theta$ that are less likely to lead to premature extinction — in particular, enriched in values of $\lambda$ and $\mu$ such that $r = \lambda - \mu$ is larger.

If we now try to write down the probability represented by this model, we first note that $\lambda$ and $\mu$ are now intermediate variables in the algorithm, thus they should be integrated out. In addition, we have a repeat-until loop, therefore, we condition on the exit clause -- which is also averaged over $\lambda$ and $\mu$. Altogether, the model leads to the following probability of producing our tree:

(3)    $p (T \mid \phi, S) = \frac{p(T \mid \phi)}{p(S \mid \phi)} = \frac{\int p(T \mid \theta) \phi(\theta) d \theta}{\int p(S \mid \theta) \phi(\theta) d \theta}$.

This probability is still normalized: it sums to one over all possible non-empty trees, although now, for fixed $\phi$. Note that this probability is not anymore a function of $\lambda$ and $\mu$, which have been integrated away, since they are now random effects. In other words, under this model, there is just no meaningful likelihood that would be a function of our parameters of interest.

If we knew $\phi$, we could still obtain a point estimate of $\theta= (\lambda, \mu)$, by taking the maximum a posteriori (MAP): that is, by calculating the value of $\theta$ maximizing the integrand of the numerator of equation 3:

(4)    $p(T \mid \theta) \phi(\theta)$.

But we generally do not know $\phi$, and without any further assumption, we cannot hope to estimate this distribution just based on one single tree, however non-empty. Still, we can at least consider the following two limiting cases:

At one extreme, $\phi$ might be highly peaked around some unknown value $\theta_0 = (\lambda_0, \mu_0)$. In the limit of a very sharp peak, $\phi$ is nearly a point mass (a ‘Dirac’) at $\theta_0$, and equation (3) then reduces to:

$p(T \mid \theta_0, S) = \frac{p(T \mid \theta_0)}{p(S \mid \theta_0)}$,

which we can maximize w.r.t. $\theta_0$.

This probability is identical to our initial conditional likelihood (equation 2) — which makes sense, since, in that regime, the process is indeed repeatedly running under a (nearly) fixed parameter configuration $\theta_0$, until producing a non-empty tree. This likelihood does not include any selection bias in favor of groups with higher growth rates — which also makes sense, since there is no meaningful variation on which this selection bias could act.

At the other extreme, $\phi$ could be a broad distribution, such that $\phi(\theta)$ is locally constant in the vicinity of the true parameter value. In that case, our posterior probability (equation 4) becomes proportional to

$p(T \mid \theta)$,

and the MAP estimate therefore reduces to the estimate obtained by maximizing the unconditional likelihood. As already mentioned above, this unconditional likelihood includes a bonus for higher growth rates $r$ — which makes sense, since there is now sufficient variation among clades to indeed experience a species selection effect on $\theta$ when considering only surviving groups. This unconditional likelihood is not normalized over observable data, but this is not a problem: this is just because, in fact, this is not really a likelihood in the present context — it is, up to a proportionality constant, a limiting case of a posterior probability.

So, what all this means is the following: if you use a conditional likelihood, this is because you believe that there is actual repetition of the stochastic process (there is a repeat-until loop in your algorithmic model of the process). Now, repetition opens the door to the possibility of variation in the value of the parameters across instances. If you also have good reasons to suspect that this variation is in fact substantial (if there is sufficient available variation in diversification rates among species clades, extant and extinct), then, apparently, you should not condition your likelihood on ultimate survival. You should not, because by doing so, you would factor out from your estimation the selection bias that is in fact induced on the diversification parameters by differential survival of the groups. Conversely, if you believe that variation in diversification rates is limiting, then you should use a conditional likelihood.

In practice, the difference between the two estimates is probably very small anyway. Nevertheless, regardless of the practical impact, it is interesting to contemplate the purely theoretical and philosophical implications of this problem. In particular, one can see here on a specific example that being a good Frequentist does not necessarily imply that you can always consider your parameter of interest as fixed. Sometimes, the design of the problem implies that you have to consider it instead as a random quantity. Or, to put it differently, insisting on using a conditional likelihood with a fixed parameter in the present case cannot be justified purely on the grounds of some preference for one statistical paradigm — it also seems to entail a specific hypothesis about the underlying objective process (no available variation in diversification rates among clades).

Conversely, the entire derivation done here is not, strictly speaking, Bayesian either: I did not invoke any subjective prior, nor any distribution that would not have an objective meaning in terms of the underlying macroevolutionary process. The MAP estimate derived above could in fact be considered as a frequentist MAP. But after all, there are other situations where Frequentists sometimes use MAP estimation: for instance, in population genetics, when you want to estimate the age of the last common ancestor of your sample using genetic sequence data (e.g. Thomson et al, 2000): there, you consider your genealogy, not as a fixed parameter, but as a random draw from Kingman’s coalescent. Therefore, you end up using a posterior distribution, and not a likelihood.

===

Fitzjohn, R. G. (2010). Quantitative traits and diversification. Systematic Biology, 59(6), 619–633. doi:10.1093/sysbio/syq053

Maddison, W., Midford, P., & Otto, S. P. (2007). Estimating a binary character's effect on speciation and extinction. Syst Biol, 56(5), 701–710. doi:10.1080/10635150701607033

Nee, S., May, R. M., & Harvey, P. H. (1994). The Reconstructed Evolutionary Process. Philosophical Transactions of the Royal Society B: Biological Sciences, 344(1309), 305–311. doi:10.1098/rstb.1994.0068

Rannala, B., & Yang, Z. (1996). Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. Journal of Molecular Evolution, 43(3), 304–311.

Stadler, T., Kühnert, D., Bonhoeffer, S., & Drummond, A. J. (2013). Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV). Proceedings of the National Academy of Sciences, 110(1), 228–233. doi:10.1073/pnas.1207965110

Thomson, R., Pritchard, J. K., Shen, P., Oefner, P. J., & Feldman, M. W. (2000). Recent common ancestry of human Y chromosomes: evidence from DNA sequence data. Proceedings of the National Academy of Sciences of the United States of America, 97(13), 7360–7365. doi:10.1073/pnas.97.13.6927