Thursday, February 27, 2014

Revised standards: p-values and false discoveries

Since we are again talking about p-values, evidence and scientific false discoveries (see my last post), I would like to come back to Valen Johnson's article of last fall (Johnson, 2013). Even if the main ideas are apparently not totally new (see e.g. Berger and Sellke, 1983), Johnson's article revives an important question, and this, apparently, at a very good moment (see also a similar point in Nuzzo, 2014).

The argument of Johnson's article relies on Bayes factors and uses a relatively technical concept of uniformly most powerful Bayesian tests. For that reason, it may give the impression that the whole question is about the choice between Bayes or frequentism, or between p-values or Bayes factors. Yet, for me, the whole problem could more directly be presented of as a false discovery rate issue.

In this direction, what I propose here is a relatively simple derivation of Johnson's main point (at least as I understand it). Apart from emphasizing the false discovery aspect, this derivation does not rely on uniformly most powerful Bayesian tests and is therefore hopefully easier to understand.

Suppose you conduct a one-sided test: you observe $X$ normally distributed of mean $\theta$ and variance 1 and want to test $H_0$: $\theta=0$ against $H_1$: $\theta > 0$. For instance, you observe a log-ratio of expression levels for a given gene between two conditions, and want to deternine whether the gene is over-expressed.

Imagine that you observe $x = 1.646$. This is slighty greater than $x_0 = 1.645$, the usual threshold for a type I error rate of 5%. Thus, you have obtained what is usually considered as marginally significant evidence in favor of over-expression, with a p-value just slightly less than 0.05.

Now, what does that imply in terms of the chances that this gene is indeed over-expressed?

This question sounds Bayesian: it asks for the posterior probability that $H_1$ is true conditional on the empirically observed value of $X$. But you can ask a similar question in a frequentist perspective: consider the collection of all one-sided normal tests (with known variance) that have been published over the last 10 years in evolutionary biology, and for which a marginal significance was reported. What fraction of them are false discoveries?

To answer that question, we need to contemplate the complete collection of all one-sided normal tests that were conducted (not just those that were published because they turned out to be significant): this represents a total of $N = N_0 + N_1$ tests, among which $N_0$ were true nulls and $N_1$ were true alternatives.  Let us call $p_0 = N_0 / N$ and $p_1 = N_1 / N$ the fractions of true nulls and alternatives, and $\theta_i$, $i=1..N_1$ the collection of true effect sizes across all true alternatives.

The probability of obtaining a marginally significant result ($x_0 \pm \delta x$) for a null case is $p(X=x_0 \mid H_0) \delta x$. For a given non-null case with effect size $\theta_i$, it is $p(X=x_0 \mid \theta_i) \delta x$. Therefore, the expected total number of marginally significant discoveries over the $N$ tests is just:

$n = N_0 p(X = x_0 \mid H_0) \delta x + \sum_{i=1}^{N_1} p(X = x_0 \mid \theta_i) \delta x$

which, upon dividing by $N \delta x$, can be rewritten as:

$\frac{n}{N \delta x} = p_0 L_0 + p_1 \bar L_1$

where $L_0 = p(X=x_0 \mid H_0)$ and $\bar L_1$ is the average likelihood under the alternative cases:

$\bar L_1 = \frac{1}{N_1} \sum_{i=1}^{N_1} p(X = x_0 \mid \theta_i)$.

The fraction of false discoveries is simply the contribution of the first term:

$fdr = p_0 L_0 / (p_0 L_0 + p_1 \bar L_1) = p_0 / (p_0 + p_1 B)$

where $B$ is:

$B = \bar L_1 / L_0$.

($B$ can be seen as an empirical version of the Bayes factor between the two hypotheses, but this is not essential for the present derivation.)

The average likelihood under $H_1$, $\bar L_1$, is less than the maximum likelihood under $H_1$, $\hat L_1$, which is here attained for $\hat \theta = x_0$. Using the formula for a normal density gives:

$B < B_{max} = \frac{\hat L_1}{L_0} = e^{\frac{1}{2} x_0^2}$

or equivalently:

$fdr > fdr_{min} = \frac{p_0}{p_0 + p_1 B_{max}}$.

Some numbers here. For $x_0 = 1.645$, $B_{max} = 3.87$. If $p_0 = p_1 = 0.5$, $fdr > 0.20$. In other words, at least 20% of your marginally significant discoveries are false. And still, this assumes that half of the tests were conducted on true alternatives, which is a generous assumption. If only 10% of the tested hypotheses are true ($p_1 = 0.1$), then, $fdr > 0.70$.

And all this is generous for yet another reason: it assumes that all your true alternatives are at $\theta=1.645$, the configuration that gives you the smallest possible local fdr. Reasonable distributions of effect sizes under the alternative can easily result in even higher false discovery rates. For example, if, under $H_1$, $\theta$ is distributed according to the positive half of a normal distribution of mean 0 and standard deviation of 3, then $B$ is less than 1, which implies more than 50% of false discoveries if half of the tested hypotheses are true, and more than 90% of false discoveries if 10% of the tested hypotheses are true.

I guess it is fair to say that most tested hypotheses are in fact false ($p_1$ is most probably less than 0.1) -- if most tested hypotheses were true, then this would mean that we already know most of what we are inquiring about, and thus research would just be an idle confirmatory exercise. It is also fair to say that the whole point of scientific hypothesis testing is to reverse this unfavorable proportion by filtering out the majority of non-findings and publishing an enriched set hopefully composed of a majority of true effects. Yet, as we can see here, this is not what happens, at least if we focus on marginal discoveries.

The entire argument above assumes that the variance of $X$ around $\theta$ is known. If it is unknown, or more generally if there are nuisance parameters under the null, things are a bit more complicated. The exact quantitative results also depend on the sampling model (normal or other). However, for the most part, the message is probably valid in many other circumstances and is very simple: what we call marginally significant findings are most often false discoveries.

Now, we can quibble over many details here, discuss the practical relevance of this result (do we really need to define a threshold for p-values?), object that significance is just one aspect of the problem (you can get very significant but weak effects), etc etc. Nevertheless, one should probably admit one thing: many of us (me included) have perhaps not adopted the correct psychological calibration in the face of p-values and have tended to over-estimate the significance of marginally significant findings.

In other words, it is probably fair to admit that we should indeed revise our standards for statistical evidence.

Also, we should more often think in terms of false discovery rate: it tells us important things that cannot be understood by just looking at p-values and type I errors.

Independently of this theoretical argument, it would certainly be interesting to conduct meta-studies here: based on the collection of reported p-values, and using standard algorithms like the one of Benjamini and Hochberg, 1995, one can retrospectively estimate the fraction of false discoveries across all published results in the context of molecular evolution, comparative or diversification studies, for instance (all of which have often relied on relatively weak effects), over the last ten years. I would be curious about the outcome of such meta-analyses.

===

Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association, 82:112.

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110:19313.

Nuzzo, R. (2014, February 13). Scientific method: statistical errors. Nature, pp. 150–152. doi:10.1038/506150a

Thursday, February 13, 2014

Blending p-values and posterior probabilities

Interesting column in Nature this week, by Regina Nuzzo, about p-values and why so many published findings are not true (see also Johnson, 2013, as well as many other earlier articles, e.g. Berger and Sellke, 1987).

Just one little point I find amusing: the entire column is about the problem of evaluating the "chances" that a scientific hypothesis is true or false given that a marginally significant effect was detected (thus leading to a publication of a result that may fail to replicate). The figure accompanying the column (the box entitled "probable cause") makes this point even more obvious, emphasizing that a p-value of 0.05 or even 0.01 does not imply that the chances are high that the effect is true -- in fact, in many situations, it implies that the chances are rather low.

It is of course a good idea to re-emphasize, one more time, the slippery nature of p-values. As an aside, the column proposes a globally interesting and integrated discussion, with a good dose of common sense, about many other aspects of the problem of statistical inference and scientific reproducibility.

Still, I am a bit surprised that the text does not make it clear that the "false alarm probability" is fundamentally a Bayesian posterior probability (of being in front of a true null given that the test was rejected). In particular, what the the figure shows is just that: it is a series of rough estimates of Bayesian posterior probabilities assuming different reasonable prior odds and assuming that different p-values have obtained.

The article does mention the idea of using Bayesian inference for addressing these problems, but only very late and very incidentally (and mentioning that this entail some "subjectivity"). As for the expression "posterior probability", it is not mentioned any single time in the whole article. Yet, for me, the figure should read as: prior probabilities (top), p-values (middle), and posterior probabilities (bottom).

Why not mentioning the name of the central concept of your whole discourse? I suspect this is because classical frequentism has censored this question for 70 years: you were not supposed to even talk about the probability that a hypothesis is true or false given the available evidence, since hypotheses were not supposed to be random events. Therefore, presumably, if you want to reach a sufficiently large audience, then, you should perhaps better not wave a bright red flag (posterior probabilities? subjective!).

Just to be clear: what I am saying here does not imply that Bayesian hypothesis testing, at least the current state-of-the-art version of it, is more reliable than p-values. Personally, I think that there are many, many problems to be addressed there as well. I am not even suggesting that, ultimately, Bayes factors or posterior probabilities should necessarily be used as the reference for assessing significance of scientific findings. I still don't have a very clear opinion about that.

Also, by Bayesian, I do not mean subjectivist. I just refer to the concept of evidential probabilities, i.e. probabilites of hypotheses conditional on observed facts (in this respect, I think I am true to the spirit of Bayes and Laplace.)

What I am saying is just that one should perhaps call things by their names.

In any case, the whole question of the relation between p-values and the lack of replication of scientific discoveries cannot be correctly conceptualized without articulating together ideas traditionally attached to both the Bayesian and the classical frequentist schools. That's what makes the question theoretically interesting, in addition to being urgent for its practical consequences.

===

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110:19313.

Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association, 82:112.

Monday, February 10, 2014

Parameter estimation: optimizing versus conditioning

Misunderstandings about how Bayesian inference works are very common. In particular, people often don't understand that, in Bayesian inference, you do not have the problem of parameter-rich models being inherently more prone to over-fitting.

The thing is, much of currently accepted wisdom about statistics has been acquired from a nearly universal and exclusive use, over decades, of classical methods such as least square or maximum likelihood approaches, all of which are based on the idea of optimizing a score (e.g. maximizing a likelihood or minimizing a sum of squares).

Bayesian inference, however, works differently. It does not rely on any optimization principle. In fact, strictly speaking, in the context of Bayesian inference, you do not fit a model to the data. Instead, you condition a model on the data. And this conditioning, as opposed to optimizing, means a radically different logic of inference.

I think that this logical difference is one of the main reasons why common intuition, being too reliant on a long-lasting experience of optimization-oriented paradigms, regularly fails when trying to understand some key aspects of Bayesian inference -- and in particular, totally fails when it comes to correctly visualizing how Bayesian inference deals with model complexity.

Optimization is, in some sense, an aggressive methodology, selecting the single parameter configuration that best explains the data. As a consequence, the more parameters you have, the more you can satisfy your greed -- which means in particular that optimizing an overly rich model will lead to fitting both the structural patterns (the signal) and the irrelevant aspects (the noise) in the data.

Conditioning, on the other hand, is more temperate. It works by eliminating, among all possible model configurations, those that are ruled out by the data, so as to keep all of those that are good enough (and not just the best one). One can see it very clearly in the context of approximate Bayesian computation: there, one repeatedly draws random model configurations from the prior, simulates data and discards those parameter values for which the simulation does not match the empirical data (up to a certain tolerance interval). Then, estimation, and more generally decision making, is typically done by averaging over all configurations remaining after this elimination step.

Thus, optimization and conditioning cannot be more different; a bit like positive versus negative selection. Optimization actively searches for the best fitting configuration and rules out everything else. Conditioning actively rules out what does not fit at all and keeps everything else.

Given such a fundamental difference between the two paradigms, one may expect that intuitions gained in one of two contexts may be counter-productive in the other context.

For instance, what happens under a parameter-rich model? In the context of an optimization-oriented approach, it is easy to tweak the parameters so as to make the model fit the irrelevant details of the data: you get over-fit. In contrast, in the context of Bayesian inference, such highly contrieved parameter configurations will not be typical among the set of model configurations that have not been eliminated by conditioning, so that their impact on the final estimate or decision will be completely swamped by all other configurations that have been kept.

Therefore, the whole problem that rich models might over-fit irrelevant aspects of the data is just not present in a Bayesian framework. Or, to put it differently, the apparent propensity of parameter-rich models to over-fit is merely a consequence of aggressive optimization -- not an inherent problem of parameter-rich models.

Now, all this does not mean that Bayesian inference is necessarily better. In particular, Bayesian inference may well have replaced over-fitting problems by prior sensitivity issues (although this is exactly where hierarchical priors have something to propose).

But it does raise the following question (for you to ponder...): how much of what you consider as obvious universal facts about statistics, such as the inherent ability of richer models to "have it easier" and their correlative propensity to over-fit, the fact that potentially useless extra-parameters are necessarily harmful, the necessity of externally imposed penalizations on richer models, the bias-variance tradeoff, among other things, are in fact consequences of the aggressive nature of one particular approach to statistics?

Again, this is a purely logical question, not a normative one. Some of the consequences of optimization (in particular, the possibility of playing with the bias-variance balance, which you cannot really do in a Bayesian context) may not be problematic at all. But it is just that one should perhaps not consider as "laws of nature" things that are in fact specific to one particular statistical paradigm.

More fundamentally, I find it interesting to exercise one's intuition by working out the differences between these two approaches to parameter estimation, by optimization or by conditioning. They offer complementary insights about statistical inference. Also, they can be combined (e.g. empirical Bayes by maximum marginal likelihood). Therefore, it is important to get used to both of them.

Statistical pluralism is not a problem, but an opportunity for us to diversify our thinking routines.

Monday, February 3, 2014

Soft and hard shrinkage

Shrinkage based on hierarchical priors can be seen as a way of reducing the effective number of parameters of a model.

Consider for instance the problem of allowing for amino-acid compositional variation through time in the context of phylogenetic estimation. In a maximum likelihood framework, and for small datasets (short sequences), we cannot afford unconstrained estimation of a distinct equilibrium frequency vector over the 20 amino-acids (19 free parameters) on each branch: this would presumably lead to over-fitting problems (although, again, I would like to see whether it would have such dire consequences in practice, but let us assume that it is the case).

A clever solution to this problem (Groussin et al, 2013) is to identify the first one or two principal components in the empirically observed compositional variation among taxa. Biologically, most of the compositional variation of proteomes is due to two independent factors: an underlying GC bias and a correlation with growth temperature (although the latter is true only in prokaryotes). Therefore, the first two components should absorb a large proportion of the total compositional variation.Then, a constrained model is implemented, allowing for compositional variation across branches only along these two principal directions. This reduces the number of parameters from 19 to 2 per branches, thus making the model suitable for shorter alignments.

Now, what would be the hierarchical Bayesian solution to this problem? Typically, one would consider the equilibrium frequencies across branches i.i.d. from a multivariate normal prior (for this to be well defined, equilibrium frequencies have to be log-transformed and reduced to 19 free parameters, see Lartillot, 2014). If $x_j$ is the 19-dimensional parameter associated with branch j, then:

$x_j \sim N(\mu, \Sigma)$

Here, $\mu$ is a 19-vector and $\Sigma$ a 19x19 covariance matrix. Both are hyperparameters, which would then be estimated by borrowing strength across branches.

This hierarchical prior will implement self-tuned shrinkage of equilibrium frequency profiles across branches in a direction-dependent manner (in the 19-dimensional space). Mathermatically $\Sigma$ can be decomposed into 19 independent variance components (corresponding to its eigenvalues). The prior will be loose in the directions corresponding to large eigenvalues, and tight in the directions corresponding to small eigenvalues. Thanks to this mechanism, the model will allow for compositional variation mostly along the first principal directions of $\Sigma$ (large eigenvalues), while shrinking more strongly along all other directions (small eigenvalues). Since $\Sigma$ is estimated on the dataset, its first principal directions will roughly correspond to the first few principal directions of the empirically observed compositional variation among taxa, which are also the ones selected by the method of Groussin et al (2013).

So, the two methods should do very similar things.

In this sense, hierarchical Bayes shrinkage indeed amounts to reducing the effective number of parameters, although in a soft manner: there is no truncation, no thresholding, and therefore no parameter-counting nor any model selection by AIC or BIC. Instead, there is an implicit context-dependent penalization imposed on the parameter vector.

In fact, we do not need to be Bayesian to use this kind of soft penalization. Alternatively, one could imagine to maximize a penalized likelihood, such as:

$\ln L - \lambda tr(V)$

Here $\ln L$ is the log of the likelihood, $V$ would be the empirical covariance matrix (the scatter matrix) of the (suitably log-transformed) equilibrium frequencies across branches, and $\lambda$ would be the strength of penalization. I don't know if this specific penalization scheme is the most reasonable one, but intuitively, the trace of $V$ is the sum of the variances along the principal directions, and therefore, the penalization term will implement shrinkage by allowing compositional variation only along the principal directions where a significant increase of the likelihood can be gained.

At first sight, this penalization scheme looks a bit like what is called ridge regression in statistics.

In any case, I just wanted to illustrate the idea of "soft" shrinkage: both in the Bayesian and the penalized likelihood framweork, we still have all the degrees of freedom in the model, and yet, the model is not over-parameterized -- its effective number of parameters is under control.

I tend to believe that, when modeling nuisances, soft shrinkage is better than hard shrinkage by dimensional reduction, fundamentally because soft shrinkage does not "throw away" part of the nuisances, as is done when using hard truncation. This may be a relatively minor problem in the case of protein composition, where the first two components explain a large proportion of the total variance. In other cases, however, hard truncation may result in models turning a blind eye to a significant proportion of the nuisances that were supposed to be accounted for. Soft shrinkage, in contrast, will embrace the whole of those nuisances, and should therefore lead to more accurate estimation of the parameters of interest.

Hierarchical Bayes will implement regularization by soft-shrinkage more or less automatically, and this is fundamentally what makes it such an attractive paradigm. On the other hand, if one really wants to stay in a maximum likelihood paradigm (in particular, for computational reasons), while being flexible with respect to the nuisances, then perhaps maximum penalized likelihood, instead of classical maximum likelihood, should be considered more seriously, in particular in phylogenetics (Kim and Sanderson, 2008).

==

Groussin, M., Boussau, B., & Gouy, M. (2013). A branch-heterogeneous model of protein evolution for efficient inference of ancestral sequences. Syst Biol, 62(4), 523–538. doi:10.1093/sysbio/syt016

Kim, J., & Sanderson, M. J. (2008). Penalized likelihood phylogenetic inference: bridging the parsimony-likelihood gap. Syst Biol, 57(5), 665–674. doi:10.1080/10635150802422274

Lartillot, N. (2014). A phylogenetic Kalman filter for ancestral trait reconstruction using molecular data. Bioinformatics. doi:10.1093/bioinformatics/btt707