Tuesday, April 22, 2014

FDR and the single-gene experiment

Imagine that you want to test for the presence of positive selection in a given gene in the human lineage, based on a protein alignment and using a codon model. Your data consist of aligned sequences from 5 mammals (human, chimp, macaque, mouse, rat and dog). You apply the branchwise test of codeml (Nielsen and Yang, 1998) on this alignment, with the branch leading to humans after the divergence from chimps taken as the focal branch. The test returns a p-value of 0.01 against the null model (i.e. the model in which the human lineage is just like all other branches making up the background).

With a p-value of 0.01, is it legimitate to consider that this gene is under positive selection in the human lineage? I guess that most of us would consider that it is legitimate.

Yet, in the context of a genome-wide scan for positive selection, based on the complete genomes of the same 5 mammalian species, Kosiol et al (2008) found that, in order to control for a false discovery rate (FDR) of 5% when detecting episodes of positive selection along the human lineage, one has to impose a significance threshold of p<0.000025. This threshold is 400 times smaller than 0.01. I have not made the exact calculation here, but I would not be surprised if a threshold of 0.01 in this context would imply something like a FDR of more than 50%, and a local fdr (i.e. among findings with a p-value at 0.01) of more than 80 % --  for the purpose of the following discussion, let us assume that this is the case.

Thus, if your gene of interest were analyzed, not in isolation, but in conjunction with all other protein coding sequences across these 5 mammalian genomes, then one would not have considered it as significantly under positive selection. Instead, one would have concluded that it was most probably a false discovery.

So, what should we do here? Are we supposed to adopt double standards and react differently when confronted to the same p-value, depending on whether it is considered in the context of a single-gene or a multiple-gene analysis? Or should we somehow transpose what we have obtained in the context of the FDR-controlled genome-wide scan to a new individual case? More fundamentally, what are we to conclude here, concerning the presence of positive selection on our gene in the human lineage?

At first sight, one could possibly argue that, technically, the two situations are not comparable. FDR is often presented as something meant to correct for multiple comparisons. When analyzing a gene in isolation, there is no multiple comparison involved, and therefore, no correction is needed. Thus, one should not transpose FDR results to a single case.

However, I think that this would be a fallacious argument. Fundamentally, what the FDR calibration tells you is that 50% of the genes for which p<0.01 are false discoveries in the present context. If your gene of interest has nothing special, then it is just like any gene taken at random and for which p<0.01, and therefore it has at least 50% chances of being a false discovery (in fact, 80% because it is at 0.01). As far as I can see, it does not matter whether or not this gene was tested in isolation or in conjunction with other genes.

Personally, I think that it is misleading to see FDR just as a way to 'correct for multiple comparisons'. For me, FDR is fundamentally an empirical Bayes posterior probability: the probability that a case is a true null given that it was rejected. This posterior probability is empirical in the sense that it is based on a prior that has been empirically estimated on a large series of cases. But apart from that, it is just an evidential probability.

Seen from this perspective, it becomes clear that the only difference between the multiple-testing and the single-gene settings is that the multiple-testing framework gives you the means to estimate your priors -- and therefore, in the present case, the means to know that you are in an unfavorable situation where a p-value of 0.01 just does not represent sufficient evidence for positive selection. In contrast, the single-gene experiment merely prevents you from knowing that a p-value of 0.01 does not represent much in the present case -- but of course, not knowing a problem does not imply that the problem does not exist.

Now, you could still object one point against this argument: you could claim that a single-gene analysis is different from a genome-wide scan, in that the gene under study has generally not been chosen uniformly at random among all possible genes. After all, one of the reasons that make the FDR so stringent is that the overall probability that a gene taken at random along the genome is under positive selection in humans is just very low. Therefore it takes strong empirical evidence to turn this very low prior probability into a reasonably high posterior probability. However, your gene may be different in this respect: you may have decided to study that gene precisely because you have good reason to believe that it is under positive selection -- for instance, a gene involved in the immune response against pathogens. If this is the case, then perhaps you could be a bit less stringent when interpreting the p-value returned by the test specifically for that gene.

Maybe. But, then, first, you should note that this is fundamentally a Bayesian argument, which amounts to invoking a special prior in the particular case of your gene of interest.

Second, assuming that you accept to play this Bayesian game, it remains to be determined how you can then justify your special prior. The typical subjective Bayes approach, relying on introspection and verbal arguments to that effect, may be fine in other situations in which one can show that the end result is not too sensitive to the prior anyway. But here, it is not the case: the prior is precisely the critical aspect of the question: it is the only factor that can make a difference between your gene and any gene with p=0.01.

For me, if you really want to claim that immunity-related genes are different and are more likely to be under positive selection, then you have to let the data speak by themselves on this issue. Which means in practice that, instead of searching your soul and coming up with 'your' prior probability for immunity-related genes to be under positive selection, you should use some well-accepted gene ontology, select the subset of genes that are involved in immunity and redo the FDR analysis on this subset of genes. If genes involved in immunity are indeed more likely to be under positive selection, this will automatically show up in the form of a less stringent significance threshold compared to the global genome-wide analysis -- perhaps up to the point that your gene will now be included in the genes significantly under positive selection at a FDR of 5% or 10%.

But as long as you do not let empirical data confirm your prior guess, any claim that your gene is different has just not been vindicated. In which case the global genome-wide FDR control represents, by default, our best calibration for assessing the chances that your gene, with p=0.01, is indeed under positive selection -- and based on what this FDR calibration says, these chances appear to be rather low.

So, in conclusion, once the fundamental meaning of FDR as an evidential (posterior) probability is understood, then it becomes clear that ignoring FDR calibrations derived from genome-wide analyses when interpreting the results of small-scale experiments is not rational. For me, it is like refusing to use what has been learned in the past on a large series of examples when considering a new similar but isolated case.

And indeed, there tends to be a surprising disconnect in our collective statistical practice, between the rather stringent thresholds, usually in the range of p<1e-4 or even p<1e-6, most often imposed in typical genome-wide analyses in which FDR control is explicitly implemented (like GWAS, or genome-wide scans for positive selection), and the substantially more liberal attitude, in the range of p<5e-2, most often adopted in the context of small-scale studies in which FDR is not explicitly considered.

The only possible logical justification for this difference in our attitude would be that we are generally confident that our small scale studies typically concern genes, or SNPs, for which the prior odds in favor of the existence of a true effect are typically 100x higher than for a random gene or SNP along the genome. Yet, I am not totally sure that we have such good reasons to be so confident in our pre-selected candidates.

I am not saying that one should always conduct FDR calibrations. Sometimes it is just too difficult to find a good reference set. But if, each time we can find a reasonably good reference set (and genes or SNPs represent an easy case in this respect), we are surprised by how stringent we have to be in order to control for a decent rate of false discovery, then perhaps it means something about how excessively liberal we tend to be when we don't have the means to know our true FDR.


Nielsen R, Yang Z (1998) Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148: 929–936.

Kosiol, C., Vinar, T., da Fonseca, R. R., Hubisz, M. J., Bustamante, C. D., Nielsen, R., & Siepel, A. (2008). Patterns of positive selection in six Mammalian genomes. PLoS Genet, 4(8), e1000144. doi:10.1371/journal.pgen.1000144

Waking up from hibernation

This blog has not been very active lately.. I was too busy on other things. But there are still many different subjects that I would like to discuss here.

Let me post one or two additional comments about posterior probabilities, false discovery rates and general issues in current statistics. After that, I would like to move on to other subjects more directly related to statistical evolutionary biology.

stay tuned!

Thursday, February 27, 2014

Revised standards: p-values and false discoveries

Since we are again talking about p-values, evidence and scientific false discoveries (see my last post), I would like to come back to Valen Johnson's article of last fall (Johnson, 2013). Even if the main ideas are apparently not totally new (see e.g. Berger and Sellke, 1983), Johnson's article revives an important question, and this, apparently, at a very good moment (see also a similar point in Nuzzo, 2014).

The argument of Johnson's article relies on Bayes factors and uses a relatively technical concept of uniformly most powerful Bayesian tests. For that reason, it may give the impression that the whole question is about the choice between Bayes or frequentism, or between p-values or Bayes factors. Yet, for me, the whole problem could more directly be presented of as a false discovery rate issue.

In this direction, what I propose here is a relatively simple derivation of Johnson's main point (at least as I understand it). Apart from emphasizing the false discovery aspect, this derivation does not rely on uniformly most powerful Bayesian tests and is therefore hopefully easier to understand.

Suppose you conduct a one-sided test: you observe $X$ normally distributed of mean $\theta$ and variance 1 and want to test $H_0$: $\theta=0$ against $H_1$: $\theta > 0$. For instance, you observe a log-ratio of expression levels for a given gene between two conditions, and want to deternine whether the gene is over-expressed.

Imagine that you observe $x = 1.646$. This is slighty greater than $x_0 = 1.645$, the usual threshold for a type I error rate of 5%. Thus, you have obtained what is usually considered as marginally significant evidence in favor of over-expression, with a p-value just slightly less than 0.05.

Now, what does that imply in terms of the chances that this gene is indeed over-expressed?

This question sounds Bayesian: it asks for the posterior probability that $H_1$ is true conditional on the empirically observed value of $X$. But you can ask a similar question in a frequentist perspective: consider the collection of all one-sided normal tests (with known variance) that have been published over the last 10 years in evolutionary biology, and for which a marginal significance was reported. What fraction of them are false discoveries?

To answer that question, we need to contemplate the complete collection of all one-sided normal tests that were conducted (not just those that were published because they turned out to be significant): this represents a total of $N = N_0 + N_1$ tests, among which $N_0$ were true nulls and $N_1$ were true alternatives.  Let us call $p_0 = N_0 / N$ and $p_1 = N_1 / N$ the fractions of true nulls and alternatives, and $\theta_i$, $i=1..N_1$ the collection of true effect sizes across all true alternatives.

The probability of obtaining a marginally significant result ($x_0 \pm \delta x$) for a null case is $ p(X=x_0 \mid H_0) \delta x$. For a given non-null case with effect size $\theta_i$, it is $ p(X=x_0 \mid \theta_i) \delta x$. Therefore, the expected total number of marginally significant discoveries over the $N$ tests is just:

$n = N_0  p(X = x_0 \mid H_0) \delta x + \sum_{i=1}^{N_1} p(X = x_0 \mid \theta_i) \delta x$

which, upon dividing by $N \delta x$, can be rewritten as:

$\frac{n}{N \delta x} = p_0  L_0 + p_1 \bar L_1$

where $L_0 = p(X=x_0 \mid H_0)$ and $\bar L_1$ is the average likelihood under the alternative cases:

$\bar L_1 = \frac{1}{N_1} \sum_{i=1}^{N_1} p(X = x_0 \mid \theta_i)$.

The fraction of false discoveries is simply the contribution of the first term:

$fdr = p_0 L_0 / (p_0 L_0 + p_1 \bar L_1) = p_0 / (p_0 + p_1 B)$

where $B$ is:

$B = \bar L_1 / L_0$.

($B$ can be seen as an empirical version of the Bayes factor between the two hypotheses, but this is not essential for the present derivation.)

The average likelihood under $H_1$, $\bar L_1$, is less than the maximum likelihood under $H_1$, $\hat L_1$, which is here attained for $\hat \theta = x_0$. Using the formula for a normal density gives:

$B < B_{max} = \frac{\hat L_1}{L_0} = e^{\frac{1}{2} x_0^2}$

or equivalently:

$fdr > fdr_{min} = \frac{p_0}{p_0 + p_1 B_{max}}$.

Some numbers here. For $x_0 = 1.645$, $B_{max} = 3.87$. If $p_0 = p_1 = 0.5$, $fdr > 0.20$. In other words, at least 20% of your marginally significant discoveries are false. And still, this assumes that half of the tests were conducted on true alternatives, which is a generous assumption. If only 10% of the tested hypotheses are true ($p_1 = 0.1$), then, $fdr > 0.70$.

And all this is generous for yet another reason: it assumes that all your true alternatives are at $\theta=1.645$, the configuration that gives you the smallest possible local fdr. Reasonable distributions of effect sizes under the alternative can easily result in even higher false discovery rates. For example, if, under $H_1$, $\theta$ is distributed according to the positive half of a normal distribution of mean 0 and standard deviation of 3, then $B$ is less than 1, which implies more than 50% of false discoveries if half of the tested hypotheses are true, and more than 90% of false discoveries if 10% of the tested hypotheses are true.

I guess it is fair to say that most tested hypotheses are in fact false ($p_1$ is most probably less than 0.1) -- if most tested hypotheses were true, then this would mean that we already know most of what we are inquiring about, and thus research would just be an idle confirmatory exercise. It is also fair to say that the whole point of scientific hypothesis testing is to reverse this unfavorable proportion by filtering out the majority of non-findings and publishing an enriched set hopefully composed of a majority of true effects. Yet, as we can see here, this is not what happens, at least if we focus on marginal discoveries.

The entire argument above assumes that the variance of $X$ around $\theta$ is known. If it is unknown, or more generally if there are nuisance parameters under the null, things are a bit more complicated. The exact quantitative results also depend on the sampling model (normal or other). However, for the most part, the message is probably valid in many other circumstances and is very simple: what we call marginally significant findings are most often false discoveries.

Now, we can quibble over many details here, discuss the practical relevance of this result (do we really need to define a threshold for p-values?), object that significance is just one aspect of the problem (you can get very significant but weak effects), etc etc. Nevertheless, one should probably admit one thing: many of us (me included) have perhaps not adopted the correct psychological calibration in the face of p-values and have tended to over-estimate the significance of marginally significant findings.

In other words, it is probably fair to admit that we should indeed revise our standards for statistical evidence.

Also, we should more often think in terms of false discovery rate: it tells us important things that cannot be understood by just looking at p-values and type I errors.

Independently of this theoretical argument, it would certainly be interesting to conduct meta-studies here: based on the collection of reported p-values, and using standard algorithms like the one of Benjamini and Hochberg, 1995, one can retrospectively estimate the fraction of false discoveries across all published results in the context of molecular evolution, comparative or diversification studies, for instance (all of which have often relied on relatively weak effects), over the last ten years. I would be curious about the outcome of such meta-analyses.


Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association, 82:112.

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110:19313.

Nuzzo, R. (2014, February 13). Scientific method: statistical errors. Nature, pp. 150–152. doi:10.1038/506150a

Thursday, February 13, 2014

Blending p-values and posterior probabilities

Interesting column in Nature this week, by Regina Nuzzo, about p-values and why so many published findings are not true (see also Johnson, 2013, as well as many other earlier articles, e.g. Berger and Sellke, 1987).

Just one little point I find amusing: the entire column is about the problem of evaluating the "chances" that a scientific hypothesis is true or false given that a marginally significant effect was detected (thus leading to a publication of a result that may fail to replicate). The figure accompanying the column (the box entitled "probable cause") makes this point even more obvious, emphasizing that a p-value of 0.05 or even 0.01 does not imply that the chances are high that the effect is true -- in fact, in many situations, it implies that the chances are rather low.

It is of course a good idea to re-emphasize, one more time, the slippery nature of p-values. As an aside, the column proposes a globally interesting and integrated discussion, with a good dose of common sense, about many other aspects of the problem of statistical inference and scientific reproducibility.

Still, I am a bit surprised that the text does not make it clear that the "false alarm probability" is fundamentally a Bayesian posterior probability (of being in front of a true null given that the test was rejected). In particular, what the the figure shows is just that: it is a series of rough estimates of Bayesian posterior probabilities assuming different reasonable prior odds and assuming that different p-values have obtained.

The article does mention the idea of using Bayesian inference for addressing these problems, but only very late and very incidentally (and mentioning that this entail some "subjectivity"). As for the expression "posterior probability", it is not mentioned any single time in the whole article. Yet, for me, the figure should read as: prior probabilities (top), p-values (middle), and posterior probabilities (bottom).

Why not mentioning the name of the central concept of your whole discourse? I suspect this is because classical frequentism has censored this question for 70 years: you were not supposed to even talk about the probability that a hypothesis is true or false given the available evidence, since hypotheses were not supposed to be random events. Therefore, presumably, if you want to reach a sufficiently large audience, then, you should perhaps better not wave a bright red flag (posterior probabilities? subjective!).

Just to be clear: what I am saying here does not imply that Bayesian hypothesis testing, at least the current state-of-the-art version of it, is more reliable than p-values. Personally, I think that there are many, many problems to be addressed there as well. I am not even suggesting that, ultimately, Bayes factors or posterior probabilities should necessarily be used as the reference for assessing significance of scientific findings. I still don't have a very clear opinion about that.

Also, by Bayesian, I do not mean subjectivist. I just refer to the concept of evidential probabilities, i.e. probabilites of hypotheses conditional on observed facts (in this respect, I think I am true to the spirit of Bayes and Laplace.)

What I am saying is just that one should perhaps call things by their names.

In any case, the whole question of the relation between p-values and the lack of replication of scientific discoveries cannot be correctly conceptualized without articulating together ideas traditionally attached to both the Bayesian and the classical frequentist schools. That's what makes the question theoretically interesting, in addition to being urgent for its practical consequences.


Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110:19313.

Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association, 82:112.

Monday, February 10, 2014

Parameter estimation: optimizing versus conditioning

Misunderstandings about how Bayesian inference works are very common. In particular, people often don't understand that, in Bayesian inference, you do not have the problem of parameter-rich models being inherently more prone to over-fitting.

The thing is, much of currently accepted wisdom about statistics has been acquired from a nearly universal and exclusive use, over decades, of classical methods such as least square or maximum likelihood approaches, all of which are based on the idea of optimizing a score (e.g. maximizing a likelihood or minimizing a sum of squares).

Bayesian inference, however, works differently. It does not rely on any optimization principle. In fact, strictly speaking, in the context of Bayesian inference, you do not fit a model to the data. Instead, you condition a model on the data. And this conditioning, as opposed to optimizing, means a radically different logic of inference.

I think that this logical difference is one of the main reasons why common intuition, being too reliant on a long-lasting experience of optimization-oriented paradigms, regularly fails when trying to understand some key aspects of Bayesian inference -- and in particular, totally fails when it comes to correctly visualizing how Bayesian inference deals with model complexity.

Optimization is, in some sense, an aggressive methodology, selecting the single parameter configuration that best explains the data. As a consequence, the more parameters you have, the more you can satisfy your greed -- which means in particular that optimizing an overly rich model will lead to fitting both the structural patterns (the signal) and the irrelevant aspects (the noise) in the data.

Conditioning, on the other hand, is more temperate. It works by eliminating, among all possible model configurations, those that are ruled out by the data, so as to keep all of those that are good enough (and not just the best one). One can see it very clearly in the context of approximate Bayesian computation: there, one repeatedly draws random model configurations from the prior, simulates data and discards those parameter values for which the simulation does not match the empirical data (up to a certain tolerance interval). Then, estimation, and more generally decision making, is typically done by averaging over all configurations remaining after this elimination step.

Thus, optimization and conditioning cannot be more different; a bit like positive versus negative selection. Optimization actively searches for the best fitting configuration and rules out everything else. Conditioning actively rules out what does not fit at all and keeps everything else. 

Given such a fundamental difference between the two paradigms, one may expect that intuitions gained in one of two contexts may be counter-productive in the other context.

For instance, what happens under a parameter-rich model? In the context of an optimization-oriented approach, it is easy to tweak the parameters so as to make the model fit the irrelevant details of the data: you get over-fit. In contrast, in the context of Bayesian inference, such highly contrieved parameter configurations will not be typical among the set of model configurations that have not been eliminated by conditioning, so that their impact on the final estimate or decision will be completely swamped by all other configurations that have been kept.

Therefore, the whole problem that rich models might over-fit irrelevant aspects of the data is just not present in a Bayesian framework. Or, to put it differently, the apparent propensity of parameter-rich models to over-fit is merely a consequence of aggressive optimization -- not an inherent problem of parameter-rich models.

Now, all this does not mean that Bayesian inference is necessarily better. In particular, Bayesian inference may well have replaced over-fitting problems by prior sensitivity issues (although this is exactly where hierarchical priors have something to propose).

But it does raise the following question (for you to ponder...): how much of what you consider as obvious universal facts about statistics, such as the inherent ability of richer models to "have it easier" and their correlative propensity to over-fit, the fact that potentially useless extra-parameters are necessarily harmful, the necessity of externally imposed penalizations on richer models, the bias-variance tradeoff, among other things, are in fact consequences of the aggressive nature of one particular approach to statistics?

Again, this is a purely logical question, not a normative one. Some of the consequences of optimization (in particular, the possibility of playing with the bias-variance balance, which you cannot really do in a Bayesian context) may not be problematic at all. But it is just that one should perhaps not consider as "laws of nature" things that are in fact specific to one particular statistical paradigm.

More fundamentally, I find it interesting to exercise one's intuition by working out the differences between these two approaches to parameter estimation, by optimization or by conditioning. They offer complementary insights about statistical inference. Also, they can be combined (e.g. empirical Bayes by maximum marginal likelihood). Therefore, it is important to get used to both of them.

Statistical pluralism is not a problem, but an opportunity for us to diversify our thinking routines.

Monday, February 3, 2014

Soft and hard shrinkage

Shrinkage based on hierarchical priors can be seen as a way of reducing the effective number of parameters of a model.

Consider for instance the problem of allowing for amino-acid compositional variation through time in the context of phylogenetic estimation. In a maximum likelihood framework, and for small datasets (short sequences), we cannot afford unconstrained estimation of a distinct equilibrium frequency vector over the 20 amino-acids (19 free parameters) on each branch: this would presumably lead to over-fitting problems (although, again, I would like to see whether it would have such dire consequences in practice, but let us assume that it is the case).

A clever solution to this problem (Groussin et al, 2013) is to identify the first one or two principal components in the empirically observed compositional variation among taxa. Biologically, most of the compositional variation of proteomes is due to two independent factors: an underlying GC bias and a correlation with growth temperature (although the latter is true only in prokaryotes). Therefore, the first two components should absorb a large proportion of the total compositional variation.Then, a constrained model is implemented, allowing for compositional variation across branches only along these two principal directions. This reduces the number of parameters from 19 to 2 per branches, thus making the model suitable for shorter alignments.

Now, what would be the hierarchical Bayesian solution to this problem? Typically, one would consider the equilibrium frequencies across branches i.i.d. from a multivariate normal prior (for this to be well defined, equilibrium frequencies have to be log-transformed and reduced to 19 free parameters, see Lartillot, 2014). If $x_j$ is the 19-dimensional parameter associated with branch j, then:

$x_j \sim N(\mu, \Sigma)$

Here, $\mu$ is a 19-vector and $\Sigma$ a 19x19 covariance matrix. Both are hyperparameters, which would then be estimated by borrowing strength across branches.

This hierarchical prior will implement self-tuned shrinkage of equilibrium frequency profiles across branches in a direction-dependent manner (in the 19-dimensional space). Mathermatically $\Sigma$ can be decomposed into 19 independent variance components (corresponding to its eigenvalues). The prior will be loose in the directions corresponding to large eigenvalues, and tight in the directions corresponding to small eigenvalues. Thanks to this mechanism, the model will allow for compositional variation mostly along the first principal directions of $\Sigma$ (large eigenvalues), while shrinking more strongly along all other directions (small eigenvalues). Since $\Sigma$ is estimated on the dataset, its first principal directions will roughly correspond to the first few principal directions of the empirically observed compositional variation among taxa, which are also the ones selected by the method of Groussin et al (2013).

So, the two methods should do very similar things.

In this sense, hierarchical Bayes shrinkage indeed amounts to reducing the effective number of parameters, although in a soft manner: there is no truncation, no thresholding, and therefore no parameter-counting nor any model selection by AIC or BIC. Instead, there is an implicit context-dependent penalization imposed on the parameter vector.

In fact, we do not need to be Bayesian to use this kind of soft penalization. Alternatively, one could imagine to maximize a penalized likelihood, such as:

$\ln L - \lambda tr(V)$

Here $\ln L$ is the log of the likelihood, $V$ would be the empirical covariance matrix (the scatter matrix) of the (suitably log-transformed) equilibrium frequencies across branches, and $\lambda$ would be the strength of penalization. I don't know if this specific penalization scheme is the most reasonable one, but intuitively, the trace of $V$ is the sum of the variances along the principal directions, and therefore, the penalization term will implement shrinkage by allowing compositional variation only along the principal directions where a significant increase of the likelihood can be gained.

At first sight, this penalization scheme looks a bit like what is called ridge regression in statistics.

In any case, I just wanted to illustrate the idea of "soft" shrinkage: both in the Bayesian and the penalized likelihood framweork, we still have all the degrees of freedom in the model, and yet, the model is not over-parameterized -- its effective number of parameters is under control.

I tend to believe that, when modeling nuisances, soft shrinkage is better than hard shrinkage by dimensional reduction, fundamentally because soft shrinkage does not "throw away" part of the nuisances, as is done when using hard truncation. This may be a relatively minor problem in the case of protein composition, where the first two components explain a large proportion of the total variance. In other cases, however, hard truncation may result in models turning a blind eye to a significant proportion of the nuisances that were supposed to be accounted for. Soft shrinkage, in contrast, will embrace the whole of those nuisances, and should therefore lead to more accurate estimation of the parameters of interest.

Hierarchical Bayes will implement regularization by soft-shrinkage more or less automatically, and this is fundamentally what makes it such an attractive paradigm. On the other hand, if one really wants to stay in a maximum likelihood paradigm (in particular, for computational reasons), while being flexible with respect to the nuisances, then perhaps maximum penalized likelihood, instead of classical maximum likelihood, should be considered more seriously, in particular in phylogenetics (Kim and Sanderson, 2008).


Groussin, M., Boussau, B., & Gouy, M. (2013). A branch-heterogeneous model of protein evolution for efficient inference of ancestral sequences. Syst Biol, 62(4), 523–538. doi:10.1093/sysbio/syt016

Kim, J., & Sanderson, M. J. (2008). Penalized likelihood phylogenetic inference: bridging the parsimony-likelihood gap. Syst Biol, 57(5), 665–674. doi:10.1080/10635150802422274

Lartillot, N. (2014). A phylogenetic Kalman filter for ancestral trait reconstruction using molecular data. Bioinformatics. doi:10.1093/bioinformatics/btt707

Monday, January 27, 2014

Overcoming the fear of over-parameterization

I am always suprised to see many phylogeneticists and evolutionary biologists often so afraid of parameter-rich models. For some reason, nothing scares people more than the risk of overfitting the data. Sometimes, this goes to the point that, in spite of overwhelming evidence of model violations and systematic biases of all kinds, clearly suggesting that the models used are in fact awfully under-parameterized, people will nevertheless stick to rigid models, refusing to abandon them in favor of more flexible alternatives just because they are afraid of overfitting.

For me, shunning rich and flexible models out of fear of over-parameterization is pure superstition -- a bit like refraining from the pleasures of life out of fear of going to hell.

There are at least three different reasons why I think that this fear of over-parameterization is irrational.

First, although overfitting may indeed be a true concern in some contexts, in particular using classical maximum likelihood, its consequences are greatly overestimated. In fact, if one has to choose between under or over-parameterization, the latter is, by far, the lesser of two evils. An over-parameterized model will typically result in high variance. In the context of a phylogenetic analysis, for instance, this will mean low boostrap support values. Low bootstrap support may not always be a good news, but at least, one does not get a strong support for false claims. In contrast, the use of an excessively rigid model often leads to bias, i.e. to a strong support for wrong trees -- we have seen that but too often in applied phylogenetics over the last twenty years. Thus, in case of doubt, and if we want to be on the safe side, we should in fact prefer more, not less, parameter-rich model configurations.

Second, overfitting is much less likely to occur today than it used to be. Multi-gene, not to mention genome-wide, datasets almost certainly contain enough information to estimate even the richest models currently available in phylogenetics. Similar situations are probably encountered in many other domains of applied statistics today. I think that people have kept some Pavlovian reflexes from old times, when data used to be a scarce resource. Clearly, these reflexes are totally outdated.

Third, and more fundamentally, over-parameterization problems are evitable -- parameter-rich models can easily be regularized. In particular, in a Bayesian framework, they can be regularized through the use of hierarchical priors.

The idea is very simple and can be illustrated by making an analogy with the idea of allowing for among site rate variation when reconstructing phylogenies. In this context, since maximizing the likelihood with respect to site-specific rates would result in over-fitting problems, the likelihood is instead integrated over site-specific rates. The integral is done over a distribution that is itself parameterized, so that its overall shape (in particular its variance) can be adjusted according to the information collected across all sites. In practice, the distribution is classically chosen to be a gamma of mean 1, parameterized by its shape parameter $\alpha$ tuning the variance of the distribution (Yang, 1994).

This model has a justification in the context of the classical maximum likelihood paradigm, as a random-effect model. But it can also be seen as a particular case of shrinkage. The amount of shrinkage is tuned by $\alpha$: if $\alpha$ is large, the variance of rates across sites is small, and shrinkage is strong. In turn, the $\alpha$ parameter is tuned by the data. In particular, if the sequences under study are such that there is no among site rate variation, then the estimated value of $\alpha$ will be very large, and rates across sites will be shrunk so strongly toward their mean that the model will effectively collapse into the uniform-rate model. In other words, shrinkage is self-tuned -- it automatically adjusts to what is needed by the data.

Thanks to this self-tuning behavior, explicit model selection (here, between the uniform or the variable rate models) it is not even necessary: one can directly use the more complex model (with variable rates) by default, given that it will automatically reduce to the simpler one if needed*. Note how this contradicts accepted widsom, according to which you are supposed to use the simpler model by default, unless it is firmly rejected by the data. 

This idea of self-tuned shrinkage can be used much more generally. Consider for instance the situation where you want to reconstruct a phylogeny using ribosomal RNA sequences from many bacterial species. We know that there is substantial compositional variation among lineages. We also know that compositional variation is an important source of phylogenetic error, potentially leading to an artifactual grouping of species with similar GC composition. Thus, we want to accomodate variation in GC content over the tree. A simple phenomenological way to do that is to invoke a distinct equilibrium GC for the substitution process on each branch. Doing this entails quite a few additional parameters (1 per branch), but we can shrink those GC parameters toward their mean over branches, e.g. by considering them i.i.d. from a distribution whose mean and variance are themselves hyperparameters of the model. The model will automatically regulate the level of shrinkage through the estimation of the variance parameter.

Endless variations on that theme can be imagined.  One can afford, not just an equilibrium GC, but more generally a distrinct equilbrium frequency profile of even a distinct substitution matrix on each branch, for each gene, for each site, etc. 

It is even possible to invoke several levels of shrinkage. For instance, in the context of a partitioned analysis, one would shrink branch lengths across partitions toward branch-specific means and according to branch-specific variances that are themselves shrunk across branches through global means and variances. In this way, the empirical signal is ultimately funnelled onto a small set of hyperparameters, thus providing a very flexible tuning of the global structure and intensity of shrinkage.

There are of course a few delicate issues here: how to choose the distributions, the architecture of the hierarchy, as well as the hyperpriors. In fact, in some cases, not doing those things properly can result in biases that might be worse than the excess variance that would have been incurred by bold maximization of the likelihood without any consideration for shrinkage (although not as bad as using a less parameter-rich model).

But in any case, what all this means is that over-parameterization is just not, in itself, a problem.

Now, I am not saying that parameter-rich models do not suffer from important limitations. In particular, complex models obviously raise substantial computational challenges. It is generally not very convenient to have as many substitution matrices as there are branches, genes or sites in a phylogenetic analysis. So, clearly, simpler models have an edge on the side of computational efficiency.

On a more philosophical note, it is also true that being able to capture the essence of a phenomenon through a compact model invoking as few parameters as possible is indeed an elegant approach to statistical modeling. I can of course see the value of such an ideal of conceptual parcimony.

But then, even if we sometimes want or need less parameter-rich models, this is for other reasons than just the issue of avoiding over-fitting. It is stupid to avoid parameter-rich models at all costs. Instead, one should learn to enjoy the freedom of choosing between rich or compact models, depending on what we want to do. Exactly as one can sometimes appreciate the idea of refraining from the pleasures of life, in praise of moderation and frugality -- but not out of fear of going to hell.

There are ways to deal with rich and flexible models. So, it is time to move on from our primitive fear of over-parameterization.


* The fact that the simpler model is obtained by setting $\alpha$ to infinity is a bit inconvenient, but this can easily be fixed by reparameterizing the model (e.g. in terms of $v = 1/\alpha$, so that $v=0$ is now the uniform-rate model).

Yang, Z. (1993). Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Molecular Biology and Evolution, 10(6), 1396–1401.

Friday, January 17, 2014

The adventurous Bayesian and the cautious Frequentist

Although the idea of controlling for false discovery rate (FDR) was first proposed in a purely classical context (Benjamini and Hochberg, 1995), the FDR has a simple empirical Bayes interpretation (Efron, 2010), which I would like to illustrate here on a simple case.

Also, I find it useful to contrast the two approaches, classical and empirical Bayes, on the issue of controlling for FDR and, more generally, on hypothesis testing, for it reveals interesting things about the fundamental difference between them.

To fix the ideas, let us consider the classical textbook case of a blood test for detecting some particular disease. The test has a rate of type I error equal to $\alpha$ (an expected fraction $\alpha$ of the people not affected by the disease will show up as positive), and a rate of type II error of $\beta$ (a fraction $\beta$ of affected individuals will be missed by the test). The population is composed of a fraction $p_0$ of unaffected people, and a fraction $p_1 = 1 - p_0$ of individuals affected by the disease.

Conducting the test on a large sample of individuals taken at random, the total fraction of positive outcomes will be:

$\pi = p_0 \alpha + p_1 (1-\beta)$,

that is, the total number of discoveries will be the sum of a fraction $\alpha$ of the unaffected individuals (the false positives) and a fraction ($1-\beta$) of affected individuals (the true positives). The rate of false discovery is then simply the proportion of all discoveries corresponding to false positives, i.e. the relative contribution of the first term:

$FDR = \frac{p_0 \alpha}{\pi}$.

Letting $H_0$ denote the hypothesis that a given individual is unaffected, and $H_1$ the hypothesis that the individual is affected, this can be rewritten as:

$\pi = p(positive) = p(H_0) p(positive \mid H_0) + p(H_1) p(positive \mid H_1)$,

$FDR = \frac{p(H_0) p(positive \mid H_0) } { p(positive) }  = p(H_0 \mid positive)$,

which makes the Bayesian nature of the concept of FDR more than apparent: the FDR is just the posterior probability of not being affected given that the test was positive.

An empirical Bayes approach typically tries to achieve an (asymptotically) exact control for the FDR, by estimating $p_0$ and $p_1$ directly from the data.

On the other hand, it is possible to control for FDR -- although now conservatively -- in a non-Bayesian context. The idea is very simple: the fraction $\pi$ is empirically observed, $\alpha$ is known by design, and $p_0 < 1$. Therefore:

$FDR = \frac{p_0 \alpha}{\pi} < \frac{\alpha}{\pi}$,

which gives an upper bound on the FDR. In plain words: at $\alpha=0.05$, I expect at most 5% of positive tests just by chance. I observe 15%. Therefore, at most one third of my disoveries are false.

If $p_1 << 1$, $p_0$ is very close to 1 and, therefore, the upper bound is tight (I expect only slightly less than 5% just by chance). Even if $p_1$ is not that small, as long as it is not very large (as long as we are not in a situation where the vast majority of the population is affected by the disease), the bound is still very useful. For instance, even if half of the population is affected, which is already a lot, and the nominal control is at 5%, then the true FDR is between 2 and 3 %, which is of the same order of magnitude as the nominal rate.

What is also interesting here is that, in order to obtain this useful upper bound on the FDR, we do not need to assume anything about the test nor about the population, except that $p_1$ is small. Thus, to control for FDR, we do not need to take the trouble to explicitly model the whole situation by empirical Bayes: we get good bounds based on elementary arguments and direct counting of the number of positive outcomes that we observe.

On the other hand, if we care, not just about FDR, but more generally about the sensitivity-specificity tradeoff, then we have to look at the power of the test: are we detecting a sufficient proportion of the true cases?

The problem, however, is that power often depends on some underlying effect size (let us call it $\theta$). Typically, the null hypothesis is just $\theta = 0$ and the alternative is $\theta \ne 0$. In our example, $\theta$ could be the concentration of some antibody or viral antigen in the blood. The power of the test will generally increase with $\theta$: values of $\theta$ close to 0 will be less easily detected than very large values of $\theta$.

In this context, what matters for the sensitivity-specificity tradeoff is the average power of the test over the population under $H_1$. Statistically, this corresponds to an average over the prior distribution of $\theta$ conditional on $H_1$. Thus, we see that the realized power of a test also has a natural empirical Bayes interpretation, like the FDR.

Again, typically, an empirical Bayes procedure will try to estimate the true law of $\theta$ under $H_1$. If it succeeds in this estimation, then the control of the sensitivity-specificity compromise is optimal: for a given level $\alpha$, we can obtain good estimate of the exact FDR (not an upper bound), as well as a good estimate of the average power of the test, so we are in the best position to fine-tune the balance between specificity and sensitivity.

On the other hand, all this assumes that we are able to correctly estimate the true law of $\theta$ under $H_1$, which is not so trivial. Failure to do so may result in some discrepancy between the sensitivity-specificity balance that we think we achieve and the true performances of the procedure. In other words, the whole empirical Bayes enterprise is attractive, in terms of its theoretically optimal properties, but potentially risky if the practical implementation details turn out to be difficult to control.

As an alternative, we could try to guarantee good bounds on the power of the test using non-Bayesian arguments. In particular, in some cases, there may be a lower bound on $\theta$ under $H_1$ -- say, we are reasonably certain that almost all affected individuals will have an effect size (e.g. a viral load) larger than some $\theta_0$. If we are lucky, this lower bound on $\theta_0$ may be sufficiently large to simultaneously guarantee a high power and a low FDR, and this, even in the worst-case situation where the entire affected population would be characterized by an effect size $\theta$ no larger than $\theta_0$. Such an argument provides the basis for an acceptable sensitivity-specificity tradeoff that may not be optimal, but good enough and, more importantly, robust with respect to the exact distribution of effect sizes (as long as it is indeed bounded below by $\theta_0$).

There are situations, however, where things are not so easy. In particular, there are many cases where $\theta$ can take values arbitrarily close to 0 under $H_1$, in which case tight bounds cannot be found. There, if we really want to have a good control and good sensitivity-specificity tradeoff, we might need to adopt a Bayesian perspective on the problem and explicitly model the statistical behavior of the effect size $\theta$ across invididuals belonging to $H_1$.

All this illustrates the fundamental difference between classical frequentism and empirical Bayes. Both care about the operational properties of their statistical decisional procedures in the long-run (thus, in a sense, both are frequentist). But it is just that empirical Bayes is more confident in its ability to come up with a complete probabilistic model of the situation and to reliably estimate this model from the data. In principle, the resulting procedure is asymptotically optimal. However, this optimality critically depends on the overall validity of the model and the estimation procedure. If these conditions are not met, then, the empirical Bayes approach may fail to keep its promise to deliver a reliable frequentist control.

In contrast, classical frequentism is more like a bet-hedging strategy, trying to guarantee some good properties even in the worst-case situations, provided a minimum set of critical, generally non-distributional, assumptions. In the case of FDR, it turns out that it is possible to get very useful bounds -- FDR is a case where classical worst-case thinking pays off. On the other hand, a good estimation of the power of a test, and therefore a good compromise between sensitivity and specificity, is typically more difficult to obtain using a worst-case driven statistical approach.

Both strategies are useful. In fact, they could be combined more systematically. Typically, the empirical Bayesian perspective on a given problem could be first contemplated, as the ideal approach, and then used as a starting point from which to derive more robust procedures by trying to obtain tight bounds that are less dependent on the exact prior distributions.

Efron, again: "One definition says that a frequentist is a Bayesian trying to do well, or at least not too badly, against any possible prior distribution."

See also Bickel (2012), who elaborates on the idea of Bayesian and frequentist inference being opposite extremes along a "degree-of-caution" axis.


Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.

Bickel, D.R. (2012). Controlling the degree of caution in statistical inference with the Bayesian and frequentist approaches as opposite extremes. Electron. J. Statist. 6:1. Arxiv version here.

Efron, B. (2005). Bayesians, Frequentists, and Scientists. Journal of the American Statistical Association 100 (469), 1–5.

Efron, B. (2010). Large-scale inference: empirical Bayes methods for estimation, testing and prediction. (Institute of Mathematical Statistics Monographs). Cambridge University Press.

Saturday, January 11, 2014

The empirical Bayes future of genomics

Empirical Bayes is the future of genomics. None explains this point more clearly and more convincingly than Brad Efron, in a book that should really belong to your personal library if you are into statistical bioinformatics (Efron, 2010, see also Efron 2008).

The fundamental and very simple idea is that large amounts of data make it possible to estimate the priors directly from them, thus ensuring good frequentist properties for Bayesian estimation and hypothesis testing in this context.

In fact, it is not just that you have good frequentist properties: in many respects, you have a better frequentist paradigm than the classical one -- a paradigm that speaks to your intuition (it gives you the probability that your hypothesis is true!), is more relevant in what it controls for (see below), often has optimal properties, and finally, can much more easily be adapted, modulated, shrinked in various directions, depending on the specific needs of the empirical question of interest.

In evolutionary biology, empirical Bayes is used, among other things, for estimating gene trees, site-specific rates, site-specific selective regimes (dN/dS), or for reconstructing ancestral sequences. In genomics, for calling SNPs, detecting differentially expressed genes or for conducting genome-wide association studies. Empirical Bayes is a very natural idea in a fully Bayesian context but also in a maximum likelihood framework (in fact, it is more often associated with the latter). So, empirical Bayes is already at home in evolutionary biology.

Still, I have the impression that it is not always clearly realized, at least in our community, that these approaches should in principle have good asymptotic frequentist calibration properties. For instance, it is rarely emphasized that clade posterior probabilities might be well calibrated in a gene-tree species-tree reconciliation framework, that the credible intervals attached to site-specific estimates of evolutionary rates or dN/dS should be true confidence intervals (group-wise, i.e. 95% of them across positions contain the true value), and that there is a built-in asymptotic control for the false discovery rate (FDR) when testing for positively selected coding positions.

In particular, I think it is under-appreciated that the good frequentist property that you get nearly immediately in an empirical Bayes hypothesis testing context is, not control for type I error, but control for the false discovery rate.

For instance, if you decide to call "positively selected" all positions that have posterior probablity $p>0.95$ of having $dN/dS>1$, you are not ensuring that you will get at most 5% false positives among sites not under positive selection, which corresponds to the classical type I error idea. On the other hand, you are guaranteeing at most 5% of false discoveries among the set of all sites which you declared significantly under positive selection. Which is perhaps more useful in practice.

There are ways to obtain a good estimate of the actual false discovery rate, instead of just an upper bound. You just need to calculate the average of $(1 - p)$ over all sites for which $p>0.95$ (you can check that this will indeed be smaller than 5%). Conversely, defining a target FDR of, say 5%, the idea would be to identify the threshold $p_0$ such that the average of $(1-p)$ over all sites for which $p>p_0$ is exactly 5% and then call "positively selected" all positions with $p>p_0$.

All this has already been said and done, including in the context of detecting positive selection across sites (Guindon et al, 2005). Yet I have the impression that this is not yet totally integrated by our research community. Many people still seem to think that posterior probabilities, even in an empirical Bayes context, are not to be interpreted in frequentist terms (because they are somehow subjective), and that, conversely, being a frequentist means that you should think within the frame of a "type I error" paradigm.

Of course, all this is true only asymptotically and under correct model specification. Concerning the asymptotic condition, simulations suggest that calibration does indeed deteriorate for very small alignments and under weak selective effects (Guindon et al, 2005). Concerning model specification, the conditions to get good calibration are in particular that that the law of effect sizes (e.g. distribution of dN/dS) under the null and under the alternative should be correctly modelled. 

This may look like strict conditions. Yet, for me, this just means that we should move beyond gene-by-gene analyses and implement genome-wide studies instead, using hierarchical priors simultaneously accounting for gene-specific and site-specific effects. In this genome-wide context, we also start to get sufficient amounts of empirical signal for modelling the laws of effect sizes with more generality, using semi-parametric or non-parametric methods.

In brief, with genome-wide sequences, large computers and MPI parallelism, it now becomes possible to experiment the power and the frequentist properties of empirical Bayes in practically relevant situations.


Efron, B. (2008). Microarrays, empirical Bayes and the two-groups model. Statistical Science, 1–22.

Efron, B. (2010). Large-scale inference: empirical Bayes methods for estimation, testing and prediction. (Institute of Mathematical Statistics Monographs). Cambridge University Press.

Guindon, S., Black, M., & Rodrigo, A. (2006). Control of the false discovery rate applied to the detection of positively selected amino acid sites. Molecular biology and evolution, 23(5), 919-926.