Monday, January 27, 2014

Overcoming the fear of over-parameterization



I am always suprised to see many phylogeneticists and evolutionary biologists often so afraid of parameter-rich models. For some reason, nothing scares people more than the risk of overfitting the data. Sometimes, this goes to the point that, in spite of overwhelming evidence of model violations and systematic biases of all kinds, clearly suggesting that the models used are in fact awfully under-parameterized, people will nevertheless stick to rigid models, refusing to abandon them in favor of more flexible alternatives just because they are afraid of overfitting.

For me, shunning rich and flexible models out of fear of over-parameterization is pure superstition -- a bit like refraining from the pleasures of life out of fear of going to hell.

There are at least three different reasons why I think that this fear of over-parameterization is irrational.

First, although overfitting may indeed be a true concern in some contexts, in particular using classical maximum likelihood, its consequences are greatly overestimated. In fact, if one has to choose between under or over-parameterization, the latter is, by far, the lesser of two evils. An over-parameterized model will typically result in high variance. In the context of a phylogenetic analysis, for instance, this will mean low boostrap support values. Low bootstrap support may not always be a good news, but at least, one does not get a strong support for false claims. In contrast, the use of an excessively rigid model often leads to bias, i.e. to a strong support for wrong trees -- we have seen that but too often in applied phylogenetics over the last twenty years. Thus, in case of doubt, and if we want to be on the safe side, we should in fact prefer more, not less, parameter-rich model configurations.

Second, overfitting is much less likely to occur today than it used to be. Multi-gene, not to mention genome-wide, datasets almost certainly contain enough information to estimate even the richest models currently available in phylogenetics. Similar situations are probably encountered in many other domains of applied statistics today. I think that people have kept some Pavlovian reflexes from old times, when data used to be a scarce resource. Clearly, these reflexes are totally outdated.

Third, and more fundamentally, over-parameterization problems are evitable -- parameter-rich models can easily be regularized. In particular, in a Bayesian framework, they can be regularized through the use of hierarchical priors.

The idea is very simple and can be illustrated by making an analogy with the idea of allowing for among site rate variation when reconstructing phylogenies. In this context, since maximizing the likelihood with respect to site-specific rates would result in over-fitting problems, the likelihood is instead integrated over site-specific rates. The integral is done over a distribution that is itself parameterized, so that its overall shape (in particular its variance) can be adjusted according to the information collected across all sites. In practice, the distribution is classically chosen to be a gamma of mean 1, parameterized by its shape parameter $\alpha$ tuning the variance of the distribution (Yang, 1994).

This model has a justification in the context of the classical maximum likelihood paradigm, as a random-effect model. But it can also be seen as a particular case of shrinkage. The amount of shrinkage is tuned by $\alpha$: if $\alpha$ is large, the variance of rates across sites is small, and shrinkage is strong. In turn, the $\alpha$ parameter is tuned by the data. In particular, if the sequences under study are such that there is no among site rate variation, then the estimated value of $\alpha$ will be very large, and rates across sites will be shrunk so strongly toward their mean that the model will effectively collapse into the uniform-rate model. In other words, shrinkage is self-tuned -- it automatically adjusts to what is needed by the data.

Thanks to this self-tuning behavior, explicit model selection (here, between the uniform or the variable rate models) it is not even necessary: one can directly use the more complex model (with variable rates) by default, given that it will automatically reduce to the simpler one if needed*. Note how this contradicts accepted widsom, according to which you are supposed to use the simpler model by default, unless it is firmly rejected by the data. 

This idea of self-tuned shrinkage can be used much more generally. Consider for instance the situation where you want to reconstruct a phylogeny using ribosomal RNA sequences from many bacterial species. We know that there is substantial compositional variation among lineages. We also know that compositional variation is an important source of phylogenetic error, potentially leading to an artifactual grouping of species with similar GC composition. Thus, we want to accomodate variation in GC content over the tree. A simple phenomenological way to do that is to invoke a distinct equilibrium GC for the substitution process on each branch. Doing this entails quite a few additional parameters (1 per branch), but we can shrink those GC parameters toward their mean over branches, e.g. by considering them i.i.d. from a distribution whose mean and variance are themselves hyperparameters of the model. The model will automatically regulate the level of shrinkage through the estimation of the variance parameter.

Endless variations on that theme can be imagined.  One can afford, not just an equilibrium GC, but more generally a distrinct equilbrium frequency profile of even a distinct substitution matrix on each branch, for each gene, for each site, etc. 

It is even possible to invoke several levels of shrinkage. For instance, in the context of a partitioned analysis, one would shrink branch lengths across partitions toward branch-specific means and according to branch-specific variances that are themselves shrunk across branches through global means and variances. In this way, the empirical signal is ultimately funnelled onto a small set of hyperparameters, thus providing a very flexible tuning of the global structure and intensity of shrinkage.

There are of course a few delicate issues here: how to choose the distributions, the architecture of the hierarchy, as well as the hyperpriors. In fact, in some cases, not doing those things properly can result in biases that might be worse than the excess variance that would have been incurred by bold maximization of the likelihood without any consideration for shrinkage (although not as bad as using a less parameter-rich model).

But in any case, what all this means is that over-parameterization is just not, in itself, a problem.

Now, I am not saying that parameter-rich models do not suffer from important limitations. In particular, complex models obviously raise substantial computational challenges. It is generally not very convenient to have as many substitution matrices as there are branches, genes or sites in a phylogenetic analysis. So, clearly, simpler models have an edge on the side of computational efficiency.

On a more philosophical note, it is also true that being able to capture the essence of a phenomenon through a compact model invoking as few parameters as possible is indeed an elegant approach to statistical modeling. I can of course see the value of such an ideal of conceptual parsimony.

But then, even if we sometimes want or need less parameter-rich models, this is for other reasons than just the issue of avoiding over-fitting. It is stupid to avoid parameter-rich models at all costs. Instead, one should learn to enjoy the freedom of choosing between rich or compact models, depending on what we want to do. Exactly as one can sometimes appreciate the idea of refraining from the pleasures of life, in praise of moderation and frugality -- but not out of fear of going to hell.

There are ways to deal with rich and flexible models. So, it is time to move on from our primitive fear of over-parameterization.

===

* The fact that the simpler model is obtained by setting $\alpha$ to infinity is a bit inconvenient, but this can easily be fixed by reparameterizing the model (e.g. in terms of $v = 1/\alpha$, so that $v=0$ is now the uniform-rate model).

Yang, Z. (1993). Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Molecular Biology and Evolution, 10(6), 1396–1401.

Friday, January 17, 2014

The adventurous Bayesian and the cautious Frequentist


Although the idea of controlling for false discovery rate (FDR) was first proposed in a purely classical context (Benjamini and Hochberg, 1995), the FDR has a simple empirical Bayes interpretation (Efron, 2010), which I would like to illustrate here on a simple case.

Also, I find it useful to contrast the two approaches, classical and empirical Bayes, on the issue of controlling for FDR and, more generally, on hypothesis testing, for it reveals interesting things about the fundamental difference between them.

To fix the ideas, let us consider the classical textbook case of a blood test for detecting some particular disease. The test has a rate of type I error equal to $\alpha$ (an expected fraction $\alpha$ of the people not affected by the disease will show up as positive), and a rate of type II error of $\beta$ (a fraction $\beta$ of affected individuals will be missed by the test). The population is composed of a fraction $p_0$ of unaffected people, and a fraction $p_1 = 1 - p_0$ of individuals affected by the disease.

Conducting the test on a large sample of individuals taken at random, the total fraction of positive outcomes will be:

$\pi = p_0 \alpha + p_1 (1-\beta)$,

that is, the total number of discoveries will be the sum of a fraction $\alpha$ of the unaffected individuals (the false positives) and a fraction ($1-\beta$) of affected individuals (the true positives). The rate of false discovery is then simply the proportion of all discoveries corresponding to false positives, i.e. the relative contribution of the first term:

$FDR = \frac{p_0 \alpha}{\pi}$.

Letting $H_0$ denote the hypothesis that a given individual is unaffected, and $H_1$ the hypothesis that the individual is affected, this can be rewritten as:

$\pi = p(positive) = p(H_0) p(positive \mid H_0) + p(H_1) p(positive \mid H_1)$,

$FDR = \frac{p(H_0) p(positive \mid H_0) } { p(positive) }  = p(H_0 \mid positive)$,

which makes the Bayesian nature of the concept of FDR more than apparent: the FDR is just the posterior probability of not being affected given that the test was positive.

An empirical Bayes approach typically tries to achieve an (asymptotically) exact control for the FDR, by estimating $p_0$ and $p_1$ directly from the data.

On the other hand, it is possible to control for FDR -- although now conservatively -- in a non-Bayesian context. The idea is very simple: the fraction $\pi$ is empirically observed, $\alpha$ is known by design, and $p_0 < 1$. Therefore:

$FDR = \frac{p_0 \alpha}{\pi} < \frac{\alpha}{\pi}$,

which gives an upper bound on the FDR. In plain words: at $\alpha=0.05$, I expect at most 5% of positive tests just by chance. I observe 15%. Therefore, at most one third of my disoveries are false.

If $p_1 << 1$, $p_0$ is very close to 1 and, therefore, the upper bound is tight (I expect only slightly less than 5% just by chance). Even if $p_1$ is not that small, as long as it is not very large (as long as we are not in a situation where the vast majority of the population is affected by the disease), the bound is still very useful. For instance, even if half of the population is affected, which is already a lot, and the nominal control is at 5%, then the true FDR is between 2 and 3 %, which is of the same order of magnitude as the nominal rate.

What is also interesting here is that, in order to obtain this useful upper bound on the FDR, we do not need to assume anything about the test nor about the population, except that $p_1$ is small. Thus, to control for FDR, we do not need to take the trouble to explicitly model the whole situation by empirical Bayes: we get good bounds based on elementary arguments and direct counting of the number of positive outcomes that we observe.

On the other hand, if we care, not just about FDR, but more generally about the sensitivity-specificity tradeoff, then we have to look at the power of the test: are we detecting a sufficient proportion of the true cases?

The problem, however, is that power often depends on some underlying effect size (let us call it $\theta$). Typically, the null hypothesis is just $\theta = 0$ and the alternative is $\theta \ne 0$. In our example, $\theta$ could be the concentration of some antibody or viral antigen in the blood. The power of the test will generally increase with $\theta$: values of $\theta$ close to 0 will be less easily detected than very large values of $\theta$.

In this context, what matters for the sensitivity-specificity tradeoff is the average power of the test over the population under $H_1$. Statistically, this corresponds to an average over the prior distribution of $\theta$ conditional on $H_1$. Thus, we see that the realized power of a test also has a natural empirical Bayes interpretation, like the FDR.

Again, typically, an empirical Bayes procedure will try to estimate the true law of $\theta$ under $H_1$. If it succeeds in this estimation, then the control of the sensitivity-specificity compromise is optimal: for a given level $\alpha$, we can obtain good estimate of the exact FDR (not an upper bound), as well as a good estimate of the average power of the test, so we are in the best position to fine-tune the balance between specificity and sensitivity.

On the other hand, all this assumes that we are able to correctly estimate the true law of $\theta$ under $H_1$, which is not so trivial. Failure to do so may result in some discrepancy between the sensitivity-specificity balance that we think we achieve and the true performances of the procedure. In other words, the whole empirical Bayes enterprise is attractive, in terms of its theoretically optimal properties, but potentially risky if the practical implementation details turn out to be difficult to control.

As an alternative, we could try to guarantee good bounds on the power of the test using non-Bayesian arguments. In particular, in some cases, there may be a lower bound on $\theta$ under $H_1$ -- say, we are reasonably certain that almost all affected individuals will have an effect size (e.g. a viral load) larger than some $\theta_0$. If we are lucky, this lower bound on $\theta_0$ may be sufficiently large to simultaneously guarantee a high power and a low FDR, and this, even in the worst-case situation where the entire affected population would be characterized by an effect size $\theta$ no larger than $\theta_0$. Such an argument provides the basis for an acceptable sensitivity-specificity tradeoff that may not be optimal, but good enough and, more importantly, robust with respect to the exact distribution of effect sizes (as long as it is indeed bounded below by $\theta_0$).

There are situations, however, where things are not so easy. In particular, there are many cases where $\theta$ can take values arbitrarily close to 0 under $H_1$, in which case tight bounds cannot be found. There, if we really want to have a good control and good sensitivity-specificity tradeoff, we might need to adopt a Bayesian perspective on the problem and explicitly model the statistical behavior of the effect size $\theta$ across invididuals belonging to $H_1$.

All this illustrates the fundamental difference between classical frequentism and empirical Bayes. Both care about the operational properties of their statistical decisional procedures in the long-run (thus, in a sense, both are frequentist). But it is just that empirical Bayes is more confident in its ability to come up with a complete probabilistic model of the situation and to reliably estimate this model from the data. In principle, the resulting procedure is asymptotically optimal. However, this optimality critically depends on the overall validity of the model and the estimation procedure. If these conditions are not met, then, the empirical Bayes approach may fail to keep its promise to deliver a reliable frequentist control.

In contrast, classical frequentism is more like a bet-hedging strategy, trying to guarantee some good properties even in the worst-case situations, provided a minimum set of critical, generally non-distributional, assumptions. In the case of FDR, it turns out that it is possible to get very useful bounds -- FDR is a case where classical worst-case thinking pays off. On the other hand, a good estimation of the power of a test, and therefore a good compromise between sensitivity and specificity, is typically more difficult to obtain using a worst-case driven statistical approach.

Both strategies are useful. In fact, they could be combined more systematically. Typically, the empirical Bayesian perspective on a given problem could be first contemplated, as the ideal approach, and then used as a starting point from which to derive more robust procedures by trying to obtain tight bounds that are less dependent on the exact prior distributions.

Efron, again: "One definition says that a frequentist is a Bayesian trying to do well, or at least not too badly, against any possible prior distribution."

See also Bickel (2012), who elaborates on the idea of Bayesian and frequentist inference being opposite extremes along a "degree-of-caution" axis.

===

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.

Bickel, D.R. (2012). Controlling the degree of caution in statistical inference with the Bayesian and frequentist approaches as opposite extremes. Electron. J. Statist. 6:1. Arxiv version here.

Efron, B. (2005). Bayesians, Frequentists, and Scientists. Journal of the American Statistical Association 100 (469), 1–5.

Efron, B. (2010). Large-scale inference: empirical Bayes methods for estimation, testing and prediction. (Institute of Mathematical Statistics Monographs). Cambridge University Press.




Saturday, January 11, 2014

The empirical Bayes future of genomics


Empirical Bayes is the future of genomics. None explains this point more clearly and more convincingly than Brad Efron, in a book that should really belong to your personal library if you are into statistical bioinformatics (Efron, 2010, see also Efron 2008).

The fundamental and very simple idea is that large amounts of data make it possible to estimate the priors directly from them, thus ensuring good frequentist properties for Bayesian estimation and hypothesis testing in this context.

In fact, it is not just that you have good frequentist properties: in many respects, you have a better frequentist paradigm than the classical one -- a paradigm that speaks to your intuition (it gives you the probability that your hypothesis is true!), is more relevant in what it controls for (see below), often has optimal properties, and finally, can much more easily be adapted, modulated, shrunk in various directions, depending on the specific needs of the empirical question of interest.

In evolutionary biology, empirical Bayes is used, among other things, for estimating gene trees, site-specific rates, site-specific selective regimes (dN/dS), or for reconstructing ancestral sequences. In genomics, for calling SNPs, detecting differentially expressed genes or for conducting genome-wide association studies. Empirical Bayes is a very natural idea in a fully Bayesian context but also in a maximum likelihood framework (in fact, it is more often associated with the latter). So, empirical Bayes is already at home in evolutionary biology.

Still, I have the impression that it is not always clearly realized, at least in our community, that these approaches should in principle have good asymptotic frequentist calibration properties. For instance, it is rarely emphasized that clade posterior probabilities might be well calibrated in a gene-tree species-tree reconciliation framework, that the credible intervals attached to site-specific estimates of evolutionary rates or dN/dS should be true confidence intervals (group-wise, i.e. 95% of them across positions contain the true value), and that there is a built-in asymptotic control for the false discovery rate (FDR) when testing for positively selected coding positions.

In particular, I think it is under-appreciated that the good frequentist property that you get nearly immediately in an empirical Bayes hypothesis testing context is, not control for type I error, but control for the false discovery rate.

For instance, if you decide to call "positively selected" all positions that have posterior probablity $p>0.95$ of having $dN/dS>1$, you are not ensuring that you will get at most 5% false positives among sites not under positive selection, which corresponds to the classical type I error idea. On the other hand, you are guaranteeing at most 5% of false discoveries among the set of all sites which you declared significantly under positive selection. Which is perhaps more useful in practice.

There are ways to obtain a good estimate of the actual false discovery rate, instead of just an upper bound. You just need to calculate the average of $(1 - p)$ over all sites for which $p>0.95$ (you can check that this will indeed be smaller than 5%). Conversely, defining a target FDR of, say 5%, the idea would be to identify the threshold $p_0$ such that the average of $(1-p)$ over all sites for which $p>p_0$ is exactly 5% and then call "positively selected" all positions with $p>p_0$.

All this has already been said and done, including in the context of detecting positive selection across sites (Guindon et al, 2005). Yet I have the impression that this is not yet totally integrated by our research community. Many people still seem to think that posterior probabilities, even in an empirical Bayes context, are not to be interpreted in frequentist terms (because they are somehow subjective), and that, conversely, being a frequentist means that you should think within the frame of a "type I error" paradigm.

Of course, all this is true only asymptotically and under correct model specification. Concerning the asymptotic condition, simulations suggest that calibration does indeed deteriorate for very small alignments and under weak selective effects (Guindon et al, 2005). Concerning model specification, the conditions to get good calibration are in particular that that the law of effect sizes (e.g. distribution of dN/dS) under the null and under the alternative should be correctly modelled. 

This may look like strict conditions. Yet, for me, this just means that we should move beyond gene-by-gene analyses and implement genome-wide studies instead, using hierarchical priors simultaneously accounting for gene-specific and site-specific effects. In this genome-wide context, we also start to get sufficient amounts of empirical signal for modelling the laws of effect sizes with more generality, using semi-parametric or non-parametric methods.

In brief, with genome-wide sequences, large computers and MPI parallelism, it now becomes possible to experiment the power and the frequentist properties of empirical Bayes in practically relevant situations.

---

Efron, B. (2008). Microarrays, empirical Bayes and the two-groups model. Statistical Science, 1–22.

Efron, B. (2010). Large-scale inference: empirical Bayes methods for estimation, testing and prediction. (Institute of Mathematical Statistics Monographs). Cambridge University Press.

Guindon, S., Black, M., & Rodrigo, A. (2006). Control of the false discovery rate applied to the detection of positively selected amino acid sites. Molecular biology and evolution, 23(5), 919-926.

Thursday, January 9, 2014

Shrinking gene trees


A particularly interesting example of empirical Bayes in current phylogenetics is provided by the gene-tree species-tree reconciliation methods (Arvestad et al, 2003, Akerborg et al, 2009, see also a related approach by maximum a posteriori, Boussau et al, 2013).

There are many reasons to be interested in reconstructing gene trees, and then trying to interpret these trees in terms of gene duplication and loss (and horizontal gene transfer, but let us ignore that for the moment). In particular, it is useful for assessing whether genes are paralogous or orthologous (you just need to look at their last common ancestor in the tree, and see whether this is a duplication or a speciation node).

However, genes usually have relatively short coding sequences. After aligning genes and trimming poorly aligned regions, we typically end up with something between 100 and 1000 coding positions. Thus, not much signal to infer a phylogenetic tree with high support.

On the other hand, the process of gene duplication and loss along the species tree can be modelled. If the species tree, the duplication rate and the loss rate are known, this process represents a reasonable mechanistic prior distribution over the set of all possible gene phylogenies.

If we do not know the species tree and the duplication and loss rates, we can adopt an empirical Bayes approach, i.e. devise a hierarchical model that will borrow strength across gene families for estimating the global parameters (the species tree and the duplication and loss rates), while inferring gene trees by local application of Bayes rule.

The Bayesian gene-tree species-tree reconciliation approach therefore can be ssen as a typical empirical Bayes shrinkage estimator -- gene trees are shrunk toward the species tree. And there is indeed quite some shrinkage here: gene trees are substantially more accurate and more supported than if reconstructed using standard phylogenetic methods that do not consider the species tree (Akerborg et al, 2009, Boussau et al, 2013).

Also, clade posterior probabilities (at the level of gene trees) are asymptotically calibrated. In the context of probabilistic orthology analysis (Sennblad and Lagergren, 2009), this means in that there is a good asymptotic control for the rate of false discovery (false orthology/paralogy assignment).

All this is true only asymptotically, but I guess that, for most practical purposes, we can equate "genome-wide" with "asymptotic".

And all this is true only under the assumptions of the model, but good frequentist properties are always conditional on correct model specification.

Still, if we want to question the assumptions of the model: one dubious assumption is the hypothesis that duplication and loss rates are homogeneous across gene families and across the species tree. If needed, however, it is always possible to add some flexibility, such as branch-specific or gene-specific rates, or a combination of both. This simply adds additional levels to the hierarchical model, without representing major theoretical challenges in terms of asymptotic accuracy and calibration.

Other potentially important issues, concerning model specification, is whether the other aspects of the model (substitution process, relaxed clock, etc) are reasonable. But again, this can be worked out.

Finally, there are still difficult computational challenges on the side of species tree estimation. In this respect, I sometimes have the impression that even the most sophisticated Monte Carlo approaches will not be sufficient to efficiently sample species trees. For this, some computational short-cuts are probably needed. One possibility is to use a maximum a posteriori approach instead of integrating over gene trees (Boussau et al, 2013).

--

Akerborg, O., Sennblad, B., Arvestad, L., & Lagergren, J. (2009). Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proceedings of the National Academy of Sciences of the United States of America, 106(14), 5714–5719.

Arvestad, L., Berglund, A.-C., Lagergren, J., & Sennblad, B. (2003). Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics, 19 Suppl 1, i7–15.

Boussau, B., Sz├Âllosi, G. J., Duret, L., Gouy, M., Tannier, E., & Daubin, V. (2013). Genome-scale coestimation of species and gene trees. Genome Res, 23(2), 323–330.

Sennblad, B., & Lagergren, J. (2009). Probabilistic orthology analysis. Syst Biol, 58(4), 411–424.

Wednesday, January 8, 2014

Empirical Bayes and shrinkage


One thing I did not mention in my last post is that empirical Bayes is fundamentally a shrinkage approach.

Shrinkage relates to the idea that you can improve an estimator by combining it with other information. In the example of the last post, when the population parameters are not known, estimating allele frequencies at a given locus can be improved by pooling similar problems (other unlinked loci), so as to estimate the population parameters across loci, and then use this information about the population for the estimation at each locus.

Shrinkage entails some tradeoff: you get better properties averaged over the group, at the expense of good frequentist properties for each item. The tradeoff is thus between point-wise and group-wise frequentist properties.

There is an interesting story about Stein's paradox, shrinkage and empirical Bayes (see Brad Efron and Carl Morris on the subject, references below).

In brief, here is the idea.

Suppose that we observe $X$, normally distributed of unknown mean $\theta$ and known variance $\sigma^2$: $X\sim N(\theta, \sigma^2)$. We want to estimate $\theta$. Obviously, $\hat \theta = X$ is our best estimator.

Suppose now that we observe $X_i \sim N(\theta_i, \sigma^2)$, for $i=1..N$. We observe all of the $X_i$'s and want to estimate all of the $\theta_i$'s. Obviously, we can use $\hat \theta_i = X_i$ for each $i$ (since these are just independent copies of the same problem).

Stein's paradox basically says that, for $N \ge 3$, our joint estimation is inadmissible, in the sense that we can always do better in terms of mean quadratic error. We can do better, in particular, by shrinking toward the empirical mean of our observations, $\bar X$, e.g. 

$\hat \theta_i = \bar X +  \left( 1 - \frac{N-3}{N V} \sigma^2 \right) (X_i - \bar X)$

where $V = 1/N \sum_i (X_i - \bar X)^2$ is the empirical variance of the observations.

Stein's shrinkage estimator can in fact be seen as an empirical Bayes procedure, under the following two-level model:
$X_i \sim N(\theta_i, \sigma^2)$
$\theta_i \sim N(\mu, \tau^2)$

Thus, from a Bayesian perspective, Stein's estimator amounts to assuming that the $\theta_i$'s are themselves normally distributed, of unknown mean and variance. These unknown mean and variance are estimated by pooling observations and then used to shrink the estimation of each of the $\theta_i$'s.


Why should shrinkage work, given that our $\theta_i$'s are not assumed to be generated by some common mechanism? Somehow, the fact that there is no condition on the $\theta_i$'s seems to imply that we will improve our estimation even by pooling things that have absolutely nothing to do with each other (this is why it is called a paradox).

I think I still need to understand this point, but in any case, a key property of Stein's shrinkage estimator is this inverse dependence on the empirical variance of the observations, $V$. Which practically means that, if the $\theta_i$'s really have nothing to do with each other, then the variance $V$ will typically be large, in which case the shrinkage effect will essentially vanish: for large $V$, $\hat \theta_i \simeq X_i$. [PS: this is not really an explanation, though, more like a Bayesian interpretation of a mathematical result...]

In any case, what Stein's paradox implies is just that, at least for normal means of same known variance, if properly done, shrinkage never costs you anything (in terms of group-wise performance). Stein's estimator has optimal self-tuning properties -- at worst, if not relevant in the context of the problem of interest, shrinkage automatically fades out. However, this is true only for normal means of same known variance. For more complicated cases, it is a bit less obvious.

In practice, my feeling is that shrinkage is generally a good idea as long as we ensure good self-tuning properties for our shrinkage procedures. In fact, if you think about it, this is the whole point of Bayesian hierarchical priors: you can see them as sophisticated self-tuning devices trying to shrink the estimation in all possible directions -- across loci, across genes, across branches of the phylogeny, across quantitative traits, etc. They are self-tuning in the sense that, through their hyperparameters, they can detect and regulate the amount of shrinkage to impose on the estimation.

In a sense, the fundamental point of hierarchical Bayes is to push the idea of shrinkage as far as it can go -- and this, hopefully for the best. Conversely, however, all this also suggests that, once hierarchical priors have been proposed in the context of a particular estimation problem, then, ideally, the frequentist properties of the resulting shrinkage estimator should perhaps be more carefully assessed.

--

Brad Efron (2010). Large-scale inference: empirical Bayes methods for estimation, testing and prediction. (Institute of Mathematical Statistics Monographs). Cambridge University Press.

Efron, B. and Morris, C (1977). Stein's paradox in statistics. Scientific American 

Efron, B. and Morris, C. (1973). Stein's estimation rule and its competitors -- an empirical Bayes approach. Journal of American Statistical Association, 68:117-130.

Empirical Bayes and calibration

An obvious case where Bayesian inference has good frequentist properties is when the prior itself has a clear interpretation in terms of empirical frequencies.

As a simple example to illustrate the point, imagine that we want to estimate the allele frequencies at a neutral bi-allelic locus, based on a random sample of individuals genotyped at this locus. Say we have sampled $n$ individuals, and $k$ had allele 1 and $n-k$ allele 0, and want to estimate $\pi$, the frequency of allele 1 in the population. We could of course use the usual estimator for the proportion of a binomial, which is simply the sample frequency, $k/n$, and calculate a confidence interval using some standard method for the binomial case.

However, under certain conditions (randomly mating population, constant population size), we have in fact very good prior knowledge about the site frequency spectrum of polymorphisms across neutral loci -- with symmetric mutation rates, it is a beta distribution of parameter $\theta = 4N u$, where $N$ is the population size and $u$ the mutation rate. Therefore, if we know the population size and the mutation rate, we can use this theoretical site frequency distribution as a prior, combine it with the binomial likelihood of our observations, so as to obtain a posterior distribution (the posterior turns out to be again a beta in the present case).

This example is the perfect case, where everyone, frequentists and Bayesians, would agree on using Bayes rule. This estimation method makes an optimal use of all of the available information about the unknown allele frequency. On average, the posterior mean estimate will be better (in quadratic error) than if we had used the standard maximum likelihood estimator for a binomial proportion. As for the credibility interval, it is an exact confidence interval, so we have frequentist calibration.

It is perhaps important to point out exactly what kind of calibration/coverage we have here: we do not have coverage if we draw one value of $\pi$, then repeatedly draw $k$ out of $n$, always from this constant value of $\pi$, and finally calculate our credible interval on each replicate and check whether it contains the true value. Instead, we get good frequentist coverage only if we redraw $\pi$ from the prior each time. In other words, we have good coverage across an imaginary infinite ensemble of independent neutral loci (for 95% of them, the credible interval will contain the true value).

--

Now, imagine that we do not know $\theta = 4Nu$. But we have sequence data from many (unlinked, neutral, bi-allelic) loci across the genome for $n$ individuals.

Here, we can design a two-level model: the unknown frequencies across loci, $\pi_i$, $i=1..P$, are i.i.d. from a beta prior, which is itself parameterized by an unknown $\theta$. With sufficiently many loci, we will have enough information across loci to reliably estimate $\theta$. As a result, our posterior credible intervals for each of the $\pi_i$'s will essentially be the same as if we had used the true value of $\theta$.

This is called empirical Bayes (EB) -- Bayes because we are using Bayes theorem to infer $\pi_i$, empirical because we have estimated the prior distribution on $\pi_i$ empirically, using the data at hand. It is also more specifically a case of parametric empirical Bayes (PEB), because we have assumed a parametric form for the prior (a beta distribution), and even more specifically, a particular application of PEB for which we have a mechanistic justification: we are not using a beta distribution because we find it convenient, but because population genetics theory leads us to indeed expect a beta in that case.

Note that we can use either maximum likelihood or Bayesian inference on $\theta$ (in both cases, however, we are being Bayesian with respect to the $\pi_i$'s). There are discussions about which is best, but asymptotically, this does not matter: they are asymptotically equivalent. For me, the choice between them is mostly driven by practical considerations (depending on what is computationally more convenient).

Thus, empirical Bayes means that we borrow strength across loci to estimate $\theta$, which in turn improves our local application, for each locus, of Bayes rule.

In terms of calibration: our dataset is essentially a finite-but-large version of the imaginary infinite ensemble of independent neutral loci mentioned above. In addition, empirical Bayes entails a finite-but-accurate estimation of $\theta$. Therefore, for sufficiently many loci, our collection of credible intervals across loci give us good coverage across loci -- in other words, under empirical Bayes, we have good group-wise calibration.

--

Finally, what if we do not trust the beta prior ? After all, this prior was derived based on fairly strong assumptions about the demography and the neutralilty of the loci. Our estimation may not be very robust to violations of these assumptions.

If we do not trust the beta prior, we can use a non-parametric empirical Bayes approach (NPEB). Essentially, the idea of NPEB is to estimate an arbitrary distribution of allele frequencies across loci directly from the data, without making the assumption that it belongs to some parametric family. One way to implement it is to explicitly invoke a prior over the space of all possible distributions for the $\pi_i$'s: e.g., a Dirichlet process mixture*.

Again, here, in principle, we get good group-wise calibration asymptotically. I say "in principle", because this depends on many technical conditions, essentially concerning asymptotic consistency of the non-parametric estimation of the unknown distribution.

--

In summary, parametric or non-parametric, the general idea of empirical Bayes is to borrow strength across observations (here, across loci) to somehow make Bayesian inference at the level of each observation (here, at the level of each locus) well justified by frequentist standards.

As Efron puts it, large datasets "carry with them their own prior".

Now, the next step is to identify the interesting problems currently faced by evolutionary biologists, which can be seen as particular instances of empirical Bayes, and for which we can therefore expect good frequentist properties for posterior probability evaluations. And there are in fact quite a few of them.

---

*In fact, here, with a binomial likelihood, there are issues about how much of this unknown distribution we can estimate, depending on the exact experimental settings. Also, to be accurate, the original non-parametric empirical Bayes idea of Robbins is a bit different -- it does not really attempt to estimate the distribution, it somehow eliminates it.

Wednesday, January 1, 2014

Two well-calibrated cases



Before getting into challenging problems, let us have a quick look at two simple cases of Bayesian methods that appear to have good frequentist calibration properties. Both are Gaussian comparative models.

(1) Revell and Reynolds (2012) estimate the parameters of a Brownian model of quantitative trait evolution with intraspecific variation, on data simulated under various conditions and on trees with 50 taxa. They report that the frequency at which their 95% confidence intervals contain the true value of the parameter(s) is statistically indistinguishable from 95% in all cases that were analyzed. This corresponds to the standard definition of calibration, as frequentist coverage associated to simple parameter estimation.

(2) Ancestral trait reconstruction based on a phylogenetic regression model (Lartillot, 2013). Here, two correlated traits are assumed to evolve according to a bivariate Brownian model. Both traits are known in extant species, and one (the predictor) is also known in the ancestors (interior nodes). The other trait is to be predicted for all ancestors along the phylogeny.

The frequentist calibration turns out to be very good in the sense that, on average, 95% of all credible intervals across the phylogeny contain the true value of the trait to be predicted. On the other hand, if you look at one particular ancestor along the phylogeny, then, across all replicates, the credible interval associated to that particular ancestor may contain the true value significantly more often than 95% of the time -- or significantly less often (minimum observed is 87%), depending on the ancestor. However, ancestor-specific deviations cancel out and the average coverage over ancestors is 95%. In other words, what I seem to get here is not point-wise calibration, but group-wise calibration (I still don't completely understand why).

Group-wise calibration is probably the only thing we can hope to obtain in many cases, in particular in an empirical Bayes context. But this is also most often what we need in order to make sense of our probabilistic evaluations in frequentist terms: most often, we want to know how many false discoveries we are making across ancestors (or genes, or observations, etc).

One should of course not generalize too hastily from these two relatively anecdotic examples. Gaussian models are relatively easy cases. Under some simple normal models, you can even obtain exact calibration for any sample size, using suitably invariant uninformative priors (Berger, 1985). But this is encouraging, at least in the context of relatively simple parametric problems with gentle probability distributions.

Yes, I know. Phylogenetics is not just about simple parametric problems with gentle probability distributions.

--

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (1985 ed.). New-York: Springer-Verlag.

Lartillot, N. (2013). A phylogenetic Kalman filter for ancestral trait reconstruction using molecular data. Bioinformatics, epub ahead of print.

Revell, L. J., & Graham Reynolds, R. (2012). A new Bayesian method for fitting evolutionary models to comparative data with intraspecific variation. Evolution, 66:2697–2707.