Thursday, February 13, 2014

Blending p-values and posterior probabilities

Interesting column in Nature this week, by Regina Nuzzo, about p-values and why so many published findings are not true (see also Johnson, 2013, as well as many other earlier articles, e.g. Berger and Sellke, 1987).

Just one little point I find amusing: the entire column is about the problem of evaluating the "chances" that a scientific hypothesis is true or false given that a marginally significant effect was detected (thus leading to a publication of a result that may fail to replicate). The figure accompanying the column (the box entitled "probable cause") makes this point even more obvious, emphasizing that a p-value of 0.05 or even 0.01 does not imply that the chances are high that the effect is true -- in fact, in many situations, it implies that the chances are rather low.

It is of course a good idea to re-emphasize, one more time, the slippery nature of p-values. As an aside, the column offers an interesting and well-integrated discussion overall, with a good dose of common sense, of many other aspects of the problem of statistical inference and scientific reproducibility.

Still, I am a bit surprised that the text does not make it clear that the "false alarm probability" is fundamentally a Bayesian posterior probability (the probability that the null is true given that the test has rejected it). In particular, that is exactly what the figure shows: a series of rough estimates of Bayesian posterior probabilities, assuming different reasonable prior odds and assuming that different p-values have been obtained.

The article does mention the idea of using Bayesian inference to address these problems, but only very late and very incidentally (and noting that this entails some "subjectivity"). As for the expression "posterior probability", it does not appear a single time in the whole article. Yet, for me, the figure should read as: prior probabilities (top), p-values (middle), and posterior probabilities (bottom).
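To make this reading concrete, here is a minimal sketch in Python of how a prior probability and a p-value can be turned into a posterior probability that the null is true. It rests on an assumption that is not in the column's text: I use the calibration -e p ln(p) of Sellke, Bayarri and Berger (2001) as a lower bound on the Bayes factor in favor of the null. The function names are hypothetical, and the numbers it produces are rough lower bounds on the false alarm probability, not exact values.

```python
import math


def min_bayes_factor(p):
    """Lower bound on the Bayes factor in favor of the null: -e * p * ln(p).

    Assumed calibration (Sellke, Bayarri and Berger, 2001); it only
    applies for 0 < p < 1/e.
    """
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound only applies for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def posterior_prob_null(prior_prob_effect, p):
    """Lower bound on P(null is true | p-value), i.e. the false alarm probability.

    prior_prob_effect: prior probability that the effect is real (0 < x < 1).
    p: the observed p-value.
    """
    if not 0.0 < prior_prob_effect < 1.0:
        raise ValueError("prior_prob_effect must be strictly between 0 and 1")
    # Prior odds in favor of the null.
    prior_odds_null = (1.0 - prior_prob_effect) / prior_prob_effect
    # Posterior odds = prior odds * Bayes factor (both in favor of the null).
    posterior_odds_null = prior_odds_null * min_bayes_factor(p)
    # Convert odds back to a probability.
    return posterior_odds_null / (1.0 + posterior_odds_null)


if __name__ == "__main__":
    # Prior (top), p-value (middle), posterior (bottom).
    for prior in (0.05, 0.5, 0.9):
        for p in (0.05, 0.01):
            post = posterior_prob_null(prior, p)
            print(f"prior P(effect)={prior:.2f}  p={p:.2f}  "
                  f"P(null | data) >= {post:.2f}")
```

With even prior odds, for instance, a p-value of 0.05 leaves a probability of at least about 0.29 that the null is true, and a p-value of 0.01 still leaves about 0.11. With a prior probability of only 0.05 for the effect, p = 0.05 leaves the null at about 0.89.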

Why not mention the name of the central concept of your whole discourse? I suspect this is because classical frequentism has censored this question for 70 years: you were not supposed to even talk about the probability that a hypothesis is true or false given the available evidence, since hypotheses were not supposed to be random events. Therefore, presumably, if you want to reach a sufficiently large audience, you had perhaps better not wave a bright red flag (posterior probabilities? subjective!).

Just to be clear: what I am saying here does not imply that Bayesian hypothesis testing, at least the current state-of-the-art version of it, is more reliable than p-values. Personally, I think that there are many, many problems to be addressed there as well. I am not even suggesting that, ultimately, Bayes factors or posterior probabilities should necessarily be used as the reference for assessing significance of scientific findings. I still don't have a very clear opinion about that.

Also, by Bayesian, I do not mean subjectivist. I just refer to the concept of evidential probabilities, i.e. probabilities of hypotheses conditional on observed facts (in this respect, I think I am true to the spirit of Bayes and Laplace).

What I am saying is just that one should perhaps call things by their names.

In any case, the whole question of the relation between p-values and the lack of replication of scientific discoveries cannot be correctly conceptualized without articulating together ideas traditionally attached to both the Bayesian and the classical frequentist schools. That's what makes the question theoretically interesting, in addition to being urgent for its practical consequences.


Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110:19313.

Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association, 82:112.

1 comment:

  1. PS: Thinking about it, my claim in the last paragraph, i.e. that this question requires a combination of Bayesian and frequentist ideas, is perhaps not correct.

    The whole question is in fact a question of false discovery rate, and therefore you have the choice: you can either formalize it in an empirical Bayes style (like Efron, 2008), in which case you are allowing yourself to consider hypotheses as random variables having a well-defined prior and posterior probability of being true; or in a purely classical frequentist manner (like Benjamini and Hochberg, 1995), in which case you talk only about an expected proportion of false discoveries among a collection of fixed hypotheses.

    Thus, you can be a die-hard frequentist and make sense of this question... but I personally find it helpful to adopt a joint frequentist and (empirical) Bayesian perspective on the problem.

    Concerning Regina Nuzzo's column, on the other hand, the figure mentions "plausibilities" of hypotheses, so it does sound rather Bayesian (even if cryptically so).