Thursday, November 13, 2014

Bayesian diversification models (2)

Concerning my last post, I realize that my notation for rejection sampling models is a bit elliptic. In particular, the rejection step corresponding to the fact that we condition on the observed data is never explicitly written. So, let me just reformulate the technical arguments and restate the main idea.

In the general case, i.e. for any Bayesian model, the rejection sampling algorithm could be written as follows:

   sample $\theta$ from prior
   simulate data given $\theta$
UNTIL simulated data match observed data $D$
return $\theta$

What this algorithm returns is one single value of $\theta$ sampled from the posterior distribution:

$p(\theta \mid D) \propto p(\theta) p(D \mid \theta)$.

Of course, you can imagine that one would typically run this algorithm many times and then use the resulting sample of independent draws from the posterior to estimate posterior mean, median, credibility intervals, etc.

Note that I use upper case letters for the REPEAT-UNTIL loop corresponding to the conditioning on empirical data -- to distinguish it from other repeat-until loops that might also be invoked, just because of the data acquisition protocol.

Now, in the case of a diversification model, we have to account for the fact that we observe only surviving species clades. In addition, we imagine that there is some natural variation among clades in diversification rates, represented by a distribution $\phi$ on the diversification parameter $\theta = (\lambda, \mu)$, such that, each clade produced by Nature would have its speciation and extinction rates independently sampled from $\phi$.

Thus, altogether, our complete rejection sampling would look like:

      Nature chooses $\theta$ from $\phi$
      Nature runs diversification process given $\theta$
   until process survives
UNTIL simulated phylogeny matches observed phylogeny $T$
return $\theta$

We now have two repeat-until loops: the inner one represents the fact that, by construction, we only consider non extinct groups. The outer one represents our conditioning on the observed phylogeny $T$. However, this could just as well be written as:

   Nature chooses $\theta$ from $\phi$
   Nature runs diversification process given $\theta$
UNTIL simulated phylogeny matches observed phylogeny
return $\theta$

simply because an empty tree will never match our observed phylogeny anyway. Thus, we end up with a value of $\theta$ sampled from a very simple posterior distribution, one that does not invoke any conditional likelihood (this was my equation 1 in the previous post):

(1)    $p(\theta \mid T) \propto \phi(\theta) \, p(T \mid \theta) $

This posterior distribution combines a prior not conditional on survival $\phi(\theta)$, with a likelihood also not conditional on survival $p(T \mid \theta)$.

Now, you can always rewrite this as:

$p(\theta \mid T) \propto \phi(\theta) p(S \mid \theta) \, p(T \mid \theta) / p(S \mid \theta)$

or, equivalently:

(2)   $p(\theta \mid T) \propto p(\theta \mid S) \, p(T \mid \theta, S)$

where $p(S \mid \theta)$ is the probability of survival of the clade. Thus, we now have a prior conditional on survival together with a likelihood also conditional on survival. Equation 1 and 2 are equivalent, and you can choose between them, fundamentally, based on which prior information you have (fossils: equation 1, other neontological studies: equation 2).

But in any case, you never get, under this algorithmic model, an unconditional prior combined with a conditional likelihood. The only way to get this combination is by considering another model:

   choose $\theta$ from some prior $p(\theta)$
      Nature runs diversification process given $\theta$
   until process survives
UNTIL simulated phylogeny matches observed phylogeny

which basically corresponds to the case where Nature has produced many clades, all under the exact same diversification rates (same parameter vector $\theta$), and we have randomly sampled one among the surviving clades thus produced. Here, we still draw $\theta$ from some prior distribution $p(\theta)$, but this distribution now merely represents our prior uncertainty about the fixed value of $\theta$ 'chosen by Nature' -- not the natural variation of $\theta$ among groups. So, here, we cannot collapse the two repeat-until loops, we have to keep the two: an internal loop corresponding to the actual 'repeated natural experiment', under fixed but unknown $\theta$, and an external loop simply representing our Bayesian inference about this unknown value of $\theta$, by rejection sampling, based on some subjective prior $p(\theta)$.

Personally, I consider this second model less convincing than the first one: we do see in practice quite some variation among closely related groups, so, we should invoke some distribution $\phi$ representing this variation.

Of course, ultimately, none of the two models is really convincing: as soon as you acknowledge the existence of variation in diversification rates between clades, then you should also expect variation within clades: if diversification rates differ between rodents and primates, then, they almost certainly vary among primates, or among rodents. Mathematically, $\theta$ should itself be the result of a stochastic process along the lineages. In fact, some interesting work has already been done along those lines, using a compound Poisson process (Rabosky, 2014).


Rabosky, D. L. (2014). Automatic detection of key innovations, rate shifts, and diversity-dependence on phylogenetic trees. PLoS One, 9(2), e89543. doi:10.1371/journal.pone.0089543

Monday, November 10, 2014

Bayesian priors for diversification studies

In my last post, I was asking whether one should condition the likelihood on survival in the context of diversification studies. The whole discussion, however, was purely at the level of the likelihood. Now, how should we look at this problem from a Bayesian perspective? 

Assuming that $\theta = (\lambda, \mu)$ is our vector representing the speciation and extinction rates, I was suggesting the following algorithmic model:

Nature has fixed distribution $\phi(\theta)$
   Nature chooses $\theta$ from $\phi$
   Nature runs diversification process given $\theta$
until process survives

where $\phi$ is the natural variation in diversification rate among clades (both extant and extinct). Our observed phylogeny can then be seen as one particular draw from this simulation model. Conditioning on the observed phylogeny $T$ leads to the following posterior distribution:

(1)    $p(\theta \mid T) \propto \phi(\theta) \, p(T \mid \theta) $

involving the likelihood unconditional on ultimate survival. I was then considering two limiting cases of this equation in the context of a maximum a posteriori approach.

Now, simply identifying $\phi$ with our prior and sampling from the posterior distribution defined by equation 1 would represent a meaningful Bayesian formalization of the problem. However, this assumes that we can see our prior as a reasonable estimate of the true natural variation in diversification rates $\phi$  and not just as a representation of some loosely defined prior beliefs about possible values of $\theta$. So, in particular, this has to be an informative prior.

Do we have practical ways to do this? Yes, I think so. In fact, I can see two different approaches to this problem.

Fossil data  Suppose we are able to come up with an acceptable estimate of the distribution of diversification rates from the fossil record. Since $\phi$ in equation 1 is to be interpreted as the distribution of diversification rates across all clades, both extant and extinctwe can directly identify $\phi$ with our empirical estimate $\hat \phi$ obtained from fossils, which leads us to the following posterior distribution:

(2)    $p(\theta \mid T) \propto \hat \phi(\theta) \, p(T \mid \theta)$.

Neontological data  Alternatively, suppose that we already have estimates of $\theta$ across several other clades (e.g. several mammalian orders), obtained by fitting or conditioning diversification models on their respective phylogenies. We would like to use this information to derive the prior for our clade of interest (say, Carnivores). Our prior information, however, is now itself conditional on survival, so, we can not simply calculate the mean and the variance of the speciation and extinction rates across other clades and use it as a proxy for $\phi$, since $\phi$ is not conditional on extinction.

On the other hand, we can re-write equation 1 as follows:

$p(\theta \mid T) = \phi(\theta) p(T \mid \theta) = \phi(\theta) p(S \mid \theta) \, p(T \mid \theta) / p(S \mid \theta) = p(\theta \mid S) \, p(T \mid \theta, S)$.

where $p(S \mid \theta)$ is, as in the previous post, the probability of ultimate survival of the clade given $\theta$. So, here, we have two factors: a conditional prior:

$p(\theta \mid S) \propto \phi(\theta) p(S \mid \theta)$,

which can be represented by the following sampling model:

   choose $\theta$ from $\phi$
   run a diversification process under $\theta$
until process survives
return $\theta$

clearly suggesting that this is the prior distribution of values of $\theta$ across surviving groups. And a likelihood now conditional on survival:

$p(T \mid \theta, S) = p(T \mid \theta) / p(S \mid \theta)$,

which we have already seen. What all this means is that, instead of working with $\phi$ and with the unconditional likelihood, we can equivalently work with a prior conditioned on survival, but then we should combine it with the conditional likelihood. In practice, we can for instance calculate the mean and the variance of speciation and extinction rates estimated on other mammalian orders and, using a moment-matching argument, derive an empirical prior $\hat \psi(\theta)$, now conditional on survival. The posterior is then:

(3)    $p(\theta \mid T) \propto \hat \psi(\theta) \, p(T \mid \theta, S)$.

Note that this idea of combining the conditional likelihood with a conditional prior is similar to what is proposed in Stadler et al (2013).

Thus, it all depends on what prior information you have. If your prior information is obtained from fossil evidence, then this should be considered as unconditional prior, to be used in conjunction with an unconditional likelihood (equation 2). If, on the other hand, your prior information has been gathered from other neontological studies, then this prior information is conditional on survival, and as such it should be be combined with a conditional likelihood (equation 3).

No prior information  Now, what if we do not want to use fossil evidence, nor any information gained from other clades? We would like to say that we do not know anything about $\lambda$ and $\mu$ and, accordingly, use a non-informative prior. However, which likelihood should we combine with this uninformative prior? conditional or unconditional? 

This is a bit tricky. First, there would be much to say about uninformative priors in general, before dealing with the question of how to use them in the present case. Second, technically, the only Bayesian prior that we can meaningfully introduce here should express our uncertainty about $\phi$. So, we would end up with two levels in our model: $\theta$ sampled from $\phi$, itself sampled from some hyperprior. We could explicitly derive something along those lines, but this would be uselessly complicated for the present matter. Instead, we can pretend that we implicitly marginalize our complicated two-level model over the hyperprior, ending up with an effective prior directly on $\theta$.

At the end, I think it is just a question of deciding whether we want to express lack of information about $\theta$ among all clades or only among surviving clades. In the first case, we would combine our uninformative prior with the unconditional likelihood, in the second case, with the conditional likelihood*. Personally, I think I would prefer to express lack of information among all clades, for it is not true that I don't know anything about the diversification rate of an extant species group: I know in particular that $\lambda - \mu$ cannot be unreasonably low, otherwise, I would not have observed this clade.

In summary, in a Bayesian context, there seems to be some flexibility in which version of the likelihood we can use: in fact, it all depends on the meaning attached to the prior.

We can of course imagine more complicated settings: estimating $\phi$ by simultaneously considering several clades in the context of a hierarchical Bayesian model, constructing integrative models jointly considering fossil and molecular data, etc. All this can lead to relatively complicated arguments and calculations for deriving the exact prior to be used  but this is exactly where algorithmic representations can be useful.


Stadler, T., Kühnert, D., Bonhoeffer, S., & Drummond, A. J. (2013). Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV). Proceedings of the National Academy of Sciences, 110(1), 228–233. doi:10.1073/pnas.1207965110

* Strictly speaking, there is a third case: we could also entertain the model in which there is in fact no available variation under $\phi$, so that $\phi$ is highly concentrated around some value $\theta_0$ (as considered in my previous post). But we don't know the location of $\theta_0$ and therefore we put an uninformative prior on it. In that case, we would combine the uninformative prior with the conditional likelihood. However, personally, I am not convinced by this specific model: I tend to think that there is variation among clades, so that the clades that we observe are statistically enriched in higher diversification rates -- a fact that should be included in, and not factored out from, what we call our empirical data.

Thursday, November 6, 2014

Should we condition on non-extinction?

Diversification studies are concerned with the problem of testing models of species diversification and estimating their parameters (such as speciation and extinction rates) based on time-calibrated phylogenies. One particular question that regularly comes up in this context is whether and how we should take into account, in the inference procedure, the fact that, by construction, we only consider non-extinct groups.

A good sign that this question is not so obvious is that many different solutions seem to be considered in practice: likelihood functions are sometimes conditional on non-extinction (Stadler et al, 2013), sometimes conditional on the age of the last common ancestor of the group (Nee et al, 1994), sometimes not conditioned at all (Maddison et al, 2007, FitzJohn, 2010), or even conditioned on the number of species sampled today (in the context of molecular dating, Rannala and Yang, 1996). So, clearly, the question requires some clarification.

As a simple illustrating case, let us assume a diversification process with constant but unknown rates of speciation ($\lambda$) and extinction ($\mu$), and let us call $\theta = (\lambda, \mu)$ our parameter vector. Technically, we also have a third parameter, $t_0$, the time of origin of the process, but let us leave this detail aside for the moment (it does not really change the main argument).

A first attempt at deriving a likelihood for this problem is simply to consider the probability of producing the observed tree $T$, conditional on the parameter $\theta$:

(1)    $p(T \mid \theta)$.

One problem, however, is that there is a positive probability that the process gets extinct and therefore returns an empty tree:

$p(T = \emptyset \mid \theta) = 1 - p(S \mid \theta) > 0$,

where $p(S \mid \theta)$ is the probability of ultimate survival of the group, given the parameters. Yet, by the very design of the experiment, the only cases that we consider in practice are non-empty trees, and we would like our likelihood to sum to one over all possible observable configurations for the data — thus, over all possible non-empty trees.

We can avoid this ‘missing mass’ problem and restore a correct normalization of the probability by just conditioning on ultimate survival:

(2)    $p(T \mid \theta, S) = \frac {p(T \mid \theta)} {p(S \mid \theta)}$.

We could then take this conditional probability as our likelihood and proceed to estimation, for instance, by maximizing this conditional likelihood with respect to $\theta$.

One important thing to note here is that, since the unconditional likelihood (1) accounts for the probability of surviving, while the conditional likelihood (2) factors this effect out of the probability, then, everything else being equal, parameter configurations that tend to increase the chances of ultimate survival will be favored by the unconditional likelihood (1), but not by the conditional likelihood (2). In particular, the probability of survival is strongly dependent on the net expected growth rate of the species group $r = \lambda - \mu$, with larger values of $r$ (higher growth rates) implying a higher chance of ultimate survival. Thus, if you compare your estimates returned by each of the two versions of the likelihood, you will in general obtain a larger value for $r$ under (1) than under (2).

So, which likelihood should we use? At first sight, the normalization argument suggested above would seem to imply that we should of course use the conditional version (equation 2). But is that so simple?

I am not sure. Let us try to derive an algorithmic model of our problem. Fundamentally, what our conditional likelihood means, in algorithmic terms, is this:

Nature chooses a fixed parameter configuration $\theta = (\lambda,\mu)$
- Nature runs a diversification process with rates $\lambda$ and $\mu$
until process survives

First, note that the repeat loop here refers to a true, objective, repetition of the generating process: we use a conditional likelihood precisely because we consider that a typical instance of the experiment potentially involves several repeated attempts until a surviving species clade is produced. In practice, we can imagine that Nature has 'run' many clades in parallel, only some of which did not suffer from premature extinction — and the one we are considering is randomly chosen from this surviving subset. But since we assume that all the runs are independent, it is mathematically equivalent to imagine, as suggested by the 'repeat' loop above, that Nature runs the process repeatedly until the first surviving clade is produced — and we can then identify this first successful draw with our observation.

But then, it seems that we are assuming that Nature has repeatedly, stubbornly, tried under the same parameter configuration, until succeeding. However, I am not sure that this is really a good model of what actually happens.

Imagine for instance that we are more specifically interested in one particular order of mammals; say, we want to estimate the speciation and extinction rates of Carnivores. Carnivores are just one among ~25 mammalian orders. There is no doubt that there is quite some variation in diversification rates among mammalian orders: think about rodents versus primates. But then, if there is variation in diversification rates among surviving mammalian clades, there has surely been variation, more generally, among all ‘replicates’ that Nature has run in parallel, surviving or not.

So, perhaps our model should instead be something like the following:

Nature chooses a fixed distribution $\phi$ over the parameter $\theta = (\lambda, \mu)$
- Nature chooses a random $\theta = (\lambda,\mu)$ from $\phi$
- Nature runs a diversification process with rates $\lambda$ and $\mu$
until process survives

Importantly, $\phi$ is not really a Bayesian prior. Instead, it is meant to be a representation of an objective property of the process: the true variation among clades, both extant and extinct.

Notice also that this new model introduces potential selection effects on the parameter configurations that you end up observing: those clades that survive are statistically enriched in values of $\theta$ that are less likely to lead to premature extinction — in particular, enriched in values of $\lambda$ and $\mu$ such that $r = \lambda - \mu$ is larger.

If we now try to write down the probability represented by this model, we first note that $\lambda$ and $\mu$ are now intermediate variables in the algorithm, thus they should be integrated out. In addition, we have a repeat-until loop, therefore, we condition on the exit clause -- which is also averaged over $\lambda$ and $\mu$. Altogether, the model leads to the following probability of producing our tree:

(3)    $p (T \mid \phi, S) = \frac{p(T \mid \phi)}{p(S \mid \phi)} = \frac{\int p(T \mid \theta) \phi(\theta) d \theta}{\int p(S \mid \theta) \phi(\theta) d \theta}$.

This probability is still normalized: it sums to one over all possible non-empty trees, although now, for fixed $\phi$. Note that this probability is not anymore a function of $\lambda$ and $\mu$, which have been integrated away, since they are now random effects. In other words, under this model, there is just no meaningful likelihood that would be a function of our parameters of interest.

If we knew $\phi$, we could still obtain a point estimate of $\theta= (\lambda, \mu)$, by taking the maximum a posteriori (MAP): that is, by calculating the value of $\theta$ maximizing the integrand of the numerator of equation 3:

(4)    $p(T \mid \theta) \phi(\theta)$.

But we generally do not know $\phi$, and without any further assumption, we cannot hope to estimate this distribution just based on one single tree, however non-empty. Still, we can at least consider the following two limiting cases:

At one extreme, $\phi$ might be highly peaked around some unknown value $\theta_0 = (\lambda_0, \mu_0)$. In the limit of a very sharp peak, $\phi$ is nearly a point mass (a ‘Dirac’) at $\theta_0$, and equation (3) then reduces to:

$p(T \mid \theta_0, S) = \frac{p(T \mid \theta_0)}{p(S \mid \theta_0)}$,

which we can maximize w.r.t. $\theta_0$.

This probability is identical to our initial conditional likelihood (equation 2) — which makes sense, since, in that regime, the process is indeed repeatedly running under a (nearly) fixed parameter configuration $\theta_0$, until producing a non-empty tree. This likelihood does not include any selection bias in favor of groups with higher growth rates — which also makes sense, since there is no meaningful variation on which this selection bias could act.

At the other extreme, $\phi$ could be a broad distribution, such that $\phi(\theta)$ is locally constant in the vicinity of the true parameter value. In that case, our posterior probability (equation 4) becomes proportional to

$p(T \mid \theta)$,

and the MAP estimate therefore reduces to the estimate obtained by maximizing the unconditional likelihood. As already mentioned above, this unconditional likelihood includes a bonus for higher growth rates $r$ — which makes sense, since there is now sufficient variation among clades to indeed experience a species selection effect on $\theta$ when considering only surviving groups. This unconditional likelihood is not normalized over observable data, but this is not a problem: this is just because, in fact, this is not really a likelihood in the present context — it is, up to a proportionality constant, a limiting case of a posterior probability.

So, what all this means is the following: if you use a conditional likelihood, this is because you believe that there is actual repetition of the stochastic process (there is a repeat-until loop in your algorithmic model of the process). Now, repetition opens the door to the possibility of variation in the value of the parameters across instances. If you also have good reasons to suspect that this variation is in fact substantial (if there is sufficient available variation in diversification rates among species clades, extant and extinct), then, apparently, you should not condition your likelihood on ultimate survival. You should not, because by doing so, you would factor out from your estimation the selection bias that is in fact induced on the diversification parameters by differential survival of the groups. Conversely, if you believe that variation in diversification rates is limiting, then you should use a conditional likelihood.

In practice, the difference between the two estimates is probably very small anyway. Nevertheless, regardless of the practical impact, it is interesting to contemplate the purely theoretical and philosophical implications of this problem. In particular, one can see here on a specific example that being a good Frequentist does not necessarily imply that you can always consider your parameter of interest as fixed. Sometimes, the design of the problem implies that you have to consider it instead as a random quantity. Or, to put it differently, insisting on using a conditional likelihood with a fixed parameter in the present case cannot be justified purely on the grounds of some preference for one statistical paradigm — it also seems to entail a specific hypothesis about the underlying objective process (no available variation in diversification rates among clades).

Conversely, the entire derivation done here is not, strictly speaking, Bayesian either: I did not invoke any subjective prior, nor any distribution that would not have an objective meaning in terms of the underlying macroevolutionary process. The MAP estimate derived above could in fact be considered as a frequentist MAP. But after all, there are other situations where Frequentists sometimes use MAP estimation: for instance, in population genetics, when you want to estimate the age of the last common ancestor of your sample using genetic sequence data (e.g. Thomson et al, 2000): there, you consider your genealogy, not as a fixed parameter, but as a random draw from Kingman’s coalescent. Therefore, you end up using a posterior distribution, and not a likelihood.


Fitzjohn, R. G. (2010). Quantitative traits and diversification. Systematic Biology, 59(6), 619–633. doi:10.1093/sysbio/syq053

Maddison, W., Midford, P., & Otto, S. P. (2007). Estimating a binary character's effect on speciation and extinction. Syst Biol, 56(5), 701–710. doi:10.1080/10635150701607033

Nee, S., May, R. M., & Harvey, P. H. (1994). The Reconstructed Evolutionary Process. Philosophical Transactions of the Royal Society B: Biological Sciences, 344(1309), 305–311. doi:10.1098/rstb.1994.0068

Rannala, B., & Yang, Z. (1996). Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. Journal of Molecular Evolution, 43(3), 304–311.

Stadler, T., Kühnert, D., Bonhoeffer, S., & Drummond, A. J. (2013). Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV). Proceedings of the National Academy of Sciences, 110(1), 228–233. doi:10.1073/pnas.1207965110

Thomson, R., Pritchard, J. K., Shen, P., Oefner, P. J., & Feldman, M. W. (2000). Recent common ancestry of human Y chromosomes: evidence from DNA sequence data. Proceedings of the National Academy of Sciences of the United States of America, 97(13), 7360–7365. doi:10.1073/pnas.97.13.6927