Tuesday, October 7, 2014

Cases where Bayes factors can be useful

Here are some situations where explicit computation of Bayes factors could still be meaningful and useful:

Bayes factors can be used for comparing a small number of alternative tree topologies in the context of a phylogenetic analysis. In particular, if the topology that would have been considered the most reasonable one before the analysis turns out to be totally absent from the sample obtained by Monte Carlo, we would typically want to know how strong the empirical evidence is against what we have deemed thus far to be true. It would therefore make sense to compute a Bayes factor in that case. This would have to be done explicitly, using thermodynamic integration or the stepping-stone method.
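
For readers who have not encountered it, here is a minimal numerical sketch of the stepping-stone idea, on a toy conjugate normal model of my own choosing (not a phylogenetic example): the power posteriors can be sampled exactly, and the true log marginal likelihood is known in closed form, so the estimator can be checked directly.

    import numpy as np

    rng = np.random.default_rng(0)

    # toy data: y_i ~ N(theta, 1), with a conjugate prior theta ~ N(0, tau2)
    n, tau2 = 20, 4.0
    y = rng.normal(1.0, 1.0, size=n)
    S, ss = y.sum(), (y ** 2).sum()

    def loglik(theta):
        # full log-likelihood (constants included), vectorized over theta
        return -0.5 * n * np.log(2 * np.pi) - 0.5 * ((y[:, None] - theta) ** 2).sum(axis=0)

    # exact log marginal likelihood, available here because the model is conjugate
    exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(1 + n * tau2)
             - 0.5 * ss + 0.5 * S ** 2 / (n + 1.0 / tau2))

    # stepping stone: log Z = sum_k log E_{beta_{k-1}} [ L(theta)^(beta_k - beta_{k-1}) ],
    # where the expectation is over the power posterior at beta_{k-1}
    betas = np.linspace(0, 1, 33) ** 3   # steps concentrated near the prior
    nsamp = 2000
    logZ = 0.0
    for bprev, bnext in zip(betas[:-1], betas[1:]):
        prec = bprev * n + 1.0 / tau2    # the power posterior at bprev is normal here
        theta = rng.normal(bprev * S / prec, np.sqrt(1.0 / prec), size=nsamp)
        logw = (bnext - bprev) * loglik(theta)
        logZ += np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()

    print(exact, logZ)   # the two numbers should roughly agree

In a real phylogenetic application, the power posteriors have to be sampled by MCMC, which is where the computational cost comes from.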

Another reason for explicitly comparing the marginal likelihoods of several candidate tree topologies is to make sure that the MCMC did not get stuck in a local maximum, returning a high apparent posterior probability for a tree that has in fact a low marginal likelihood.

On the other hand, the differences in log-marginal likelihoods between alternative (reasonable) topologies are usually very small compared to the absolute values of the log marginal likelihoods, pushing the numerical methods to their limits in terms of numerical accuracy.

Even in the situations mentioned in my previous post, where automatic model selection can easily be implemented, it can still be useful to compute marginal likelihoods or Bayes factors: just to make a point, for instance, about how a new model improves on the models currently considered as the standards of the field or, more generally, what features of a model are important in terms of empirical fit.

Finally, in those cases where it is difficult to implement an efficient reversible-jump MCMC for averaging over non-nested models, model averaging may sometimes be replaced by model selection through explicit calculation of marginal likelihoods (although it is often easy to come up with a more general model encompassing all of the non-nested models that we want to compare).

But then, these are relatively specific situations, quite distinct from the idea of systematically computing Bayes factors in order to first select the best model, and then conduct inference by parameter estimation, which is more specifically what I was arguing against.

For me, systematic and explicit model selection is one of those old reflexes inherited from maximum likelihood thinking -- but not really suited to the context of Bayesian inference.


Friday, October 3, 2014

Model averaging versus model selection

Coming back to my last post about model selection using Bayes factors, in which I suggested that explicit Bayes factor evaluation is most often unnecessary: the model-averaging aspect is in fact more important than I made it sound. Indeed, it is precisely because you average over models that you do not need to go through explicit model selection.

I think there is something fundamental to understand here, concerning the difference between the maximum likelihood and the Bayesian approaches to the question of choosing models, so let me belabor the point.

In a maximum likelihood context, one would generally not use the more complex model by default. Often, one would instead use the simplest model, unless this model is rejected by the data in favor of a more complex alternative. This approach to model selection is implemented, for instance, in sequential tests that go from very simple substitution models like Jukes-Cantor to the general time-reversible parameterization, going through the whole series of intermediates (F81, HKY85, TN93, etc.) sorted by increasing dimensionality.

However, this paradigm is, I think, inadequate in a Bayesian framework. In fact, the model ‘auto-tuning’ example I was mentioning in my last post suggests exactly the opposite: to always use the most general model by default, the one that has the highest dimension, given that it will automatically include whichever simpler alternative may turn out to be adequate for the particular dataset under consideration.

Why this difference between the two frameworks? I think there are two (related) reasons:

(1) point estimation versus marginalization: maximum likelihood is an either/or method: either JTT or GTR. Of course, as a model, GTR does include JTT as a particular case, but in practice, you still have to choose between two distinct point estimates for the exchangeability parameters: either those of JTT, or those obtained by maximum likelihood estimation. So you really end up with two distinct, non-overlapping models, and you will have to choose between them.

In contrast, in a Bayesian context, if you are using GTR, you are in fact averaging over all possible configurations for the exchangeability parameters, including the one represented by JTT. Seen from this perspective, it is clear that you are not choosing between two mutually exclusive options. Instead, choosing between GTR and JTT really means choosing between a risky bet on one single horse (JTT) and a bet-hedging strategy consisting of averaging over all possible configurations for the unknown exchangeability parameters.

(2) optimization versus conditioning: as I previously explained, maximum likelihood is inherently prone to over-fitting. Therefore, you should have good reasons to use a model that invokes additional parameters. In contrast, Bayesian inference works by conditioning, and therefore, does not have over-fitting problems.

Point (1) gives some intuition as to why we need explicit model selection in a maximum likelihood but not in a Bayesian analysis, while point (2) explains why we would typically conduct model selection by starting from the simplest model in a maximum likelihood framework, whereas we can safely opt directly for the more general model in a Bayesian context.

Now, averaging over models also means that we have to correctly define the measure over which to take this average: in other words, we have to correctly design our prior over the models, which is not necessarily obvious to do. I have suggested some ways to do it properly in my last post (mostly, by having key hyperparameters that are able to drive the whole model configuration down to certain notable submodels), but there is probably much more to say about this.

In any case, what all this suggests is that the focus, in Bayesian inference, should not be on explicit model selection. Instead, it should be on prior design — including the prior over models.


Wednesday, October 1, 2014

Selecting models without computing Bayes factors


Model comparison or selection is often considered to be a very important step in a typical Bayesian analysis. It is often done by calculating Bayes factors or, more generally, marginal likelihoods.

I used to be a fan of Bayes factors. I even spent quite some time on the numerical methods for reliably computing them, with the aim of comparing amino-acid substitution models or alternative relaxed molecular clocks. But I have progressively changed my mind about Bayes factors, or at least about the need to explicitly calculate them.

First, numerical evaluation of marginal likelihoods is tricky. For quite some time, it was not even done properly: unreliable methods such as the harmonic mean estimator (‘the worst Monte Carlo method ever’) were most often used for that purpose. Now, better numerical methods are employed, such as thermodynamic integration or its discretized equivalent, the stepping-stone approach. But the reliability of these numerical methods comes at a price: they are computationally expensive.

Given the costs associated with the whole enterprise, it is therefore important to realize that these laborious numerical methods are most often unnecessary. A much simpler alternative is to build a general model encompassing all of the models that we wish to compare. Then, by just carefully designing the prior over the parameters of this model, what we obtain is a probabilistic system that, upon being conditioned on a given dataset, automatically selects the parameter regime effectively implementing the model that happens to have the best fit for these data.

As a simple but not too trivial example, let us assume we have an amino-acid alignment and want to choose between two alternative substitution models: say, the empirical JTT matrix and the most general time-reversible model (GTR). The two models greatly differ in their dimensionality: the 190 exchangeability parameters are fixed to pre-determined values in the case of JTT, whereas they are estimated directly from the data in the case of GTR.

In this specific case, instead of numerically evaluating the marginal likelihoods of the two models, one could design a GTR model such that the prior on its exchangeability parameters would be centered on the exchangeabilities of JTT. The concentration of this prior around its center would be controlled by one single concentration parameter, which should in turn be endowed with a broad prior. Typically, I would use a log-uniform prior (possibly truncated to make it proper) or a half-Cauchy. Then, I would just condition the model on the data, without any further complicated model selection process. If the JTT model really provides a good fit, then the concentration parameter will automatically take on large values under the posterior, and as a result, the exchangeabilities of the model will shrink towards those defined by JTT. In other words, the GTR model will effectively implement the JTT model if this is what the data prefer. Importantly, this automatic 'collapse' of the model is driven by just one hyperparameter, so the approach should work well even for relatively small datasets.
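
To make this concrete, here is a minimal sketch of one possible way of writing down such a JTT-centered prior (a Dirichlet over the 190 normalized exchangeabilities, with a truncated log-uniform prior on the concentration parameter); the exact parameterization does not matter much for the argument, and the JTT values below are just a random stand-in for the published matrix.

    import numpy as np

    rng = np.random.default_rng(1)

    # stand-in for the 190 normalized JTT exchangeabilities (the real values
    # would be read from the published JTT matrix)
    jtt = rng.dirichlet(np.ones(190))

    def sample_prior():
        # single concentration hyperparameter with a broad, truncated log-uniform prior
        c = 10 ** rng.uniform(-1, 5)
        # exchangeabilities centered on JTT; large c shrinks them onto JTT
        rho = rng.dirichlet(c * jtt)
        return c, rho

    for _ in range(5):
        c, rho = sample_prior()
        print(f"c = {c:12.1f}   mean |rho - jtt| = {np.abs(rho - jtt).mean():.2e}")

Conditioning on the data then amounts to letting the posterior decide where the model should sit on this continuum, between essentially JTT and a fully free GTR.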

Note that the model will never exactly collapse onto the simpler model: using this approach, the exchangeabilities of the GTR model are never exactly equal to those of the JTT model. In all regimes, the model always has its full dimensionality. Is this really a problem?

A typical objection here would be that our estimation might suffer from increased variance. However, strictly speaking, this is true only if the data have been produced by a JTT model, which is never the case in practice, except in simulation studies. Even in simulation studies, I would be surprised to see any detectable difference in terms of estimation error (above the MCMC noise) between the hierarchical model setting suggested above and the true model: thanks to the specific hierarchical design of the prior, the extra variance induced by the more flexible parameterization should be very small -- technically, the effective number of parameters is much smaller than the nominal number of parameters.

More fundamentally, nobody believes that JTT is the true model for any real data, and therefore, it does not really make sense to see it as a relevant point null hypothesis to be tested against a more general alternative, as would be implied by a Bayes factor analysis between JTT and GTR. Instead, JTT at best represents a good first guess for the exchangeabilities, to be used when the sequence alignment under investigation is too small to provide this information entirely from scratch. But then, a first guess should probably not be formalized as a point null. Instead, it should be seen as a good center for the prior — which is exactly what the hierarchical prior suggested here does: using JTT as a first guess, while implementing some empirically adjustable flexibility around it.

This method can easily be generalized to deal with all sorts of model selection problems. In all cases, this will be done by defining hierarchical priors in a way that ensures that the model will be able to effectively collapse onto a simpler model if needed. Many examples can be imagined: a large value for the shape parameter of the gamma distribution of rates across sites effectively implements the uniform-rate model; efficient shrinkage of branch-specific equilibrium frequencies for the substitution model can effectively emulate a time-homogeneous model; or a suitably chosen prior over partition models gives the data the opportunity to efficiently funnel the model onto uniform substitution matrices or uniform branch lengths across partitions, if relevant.
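
As a quick numerical illustration of the first example above (rates across sites): a mean-one gamma distribution with shape parameter $\alpha$ has variance $1/\alpha$, so a large $\alpha$ effectively collapses the model onto uniform rates. A toy check:

    import numpy as np

    rng = np.random.default_rng(0)

    # site-specific rates under a mean-one gamma: sd = 1 / sqrt(alpha),
    # so the rates collapse onto 1 (the uniform-rate model) as alpha grows
    for alpha in [0.5, 2.0, 50.0, 5000.0]:
        rates = rng.gamma(shape=alpha, scale=1.0 / alpha, size=100_000)
        print(f"alpha = {alpha:7.1f}   sd of rates = {rates.std():.4f}")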

Technically, the approach could be seen as a case of model averaging. However, the expression ‘model averaging’ is meant to emphasize the ‘averaging’ aspect: the fact that you don’t want to bet 100% on one single model. Here, the averaging is perhaps not the most important point (even if it does play some role). For that reason, I would call this procedure model auto-tuning, thus referring to the fact that the model automatically finds the most adequate parameter regime.

In any case, I believe that substantial amounts of computational resources could be saved, simply by not computing Bayes factors between phylogenetic models: in most cases, this is at best unnecessary and, at worst, just an idle exercise.


Tuesday, April 22, 2014

FDR and the single-gene experiment



Imagine that you want to test for the presence of positive selection in a given gene in the human lineage, based on a protein alignment and using a codon model. Your data consist of aligned sequences from six mammals (human, chimp, macaque, mouse, rat and dog). You apply the branchwise test of codeml (Nielsen and Yang, 1998) on this alignment, with the branch leading to humans after the divergence from chimps taken as the focal branch. The test returns a p-value of 0.01 against the null model (i.e. the model in which the human lineage is just like all other branches making up the background).

With a p-value of 0.01, is it legitimate to consider that this gene is under positive selection in the human lineage? I guess that most of us would consider that it is.

Yet, in the context of a genome-wide scan for positive selection, based on the complete genomes of the same six mammalian species, Kosiol et al (2008) found that, in order to control for a false discovery rate (FDR) of 5% when detecting episodes of positive selection along the human lineage, one has to impose a significance threshold of p<0.000025. This threshold is 400 times smaller than 0.01. I have not made the exact calculation here, but I would not be surprised if a threshold of 0.01 in this context implied something like an FDR of more than 50%, and a local fdr (i.e. among findings with a p-value at 0.01) of more than 80% -- for the purpose of the following discussion, let us assume that this is the case.

Thus, if your gene of interest had been analyzed, not in isolation, but in conjunction with all other protein-coding sequences across these six mammalian genomes, then it would not have been considered significantly under positive selection. Instead, one would have concluded that it was most probably a false discovery.

So, what should we do here? Are we supposed to adopt double standards and react differently when confronted with the same p-value, depending on whether it is considered in the context of a single-gene or a multiple-gene analysis? Or should we somehow transpose what we have obtained in the context of the FDR-controlled genome-wide scan to a new individual case? More fundamentally, what are we to conclude here, concerning the presence of positive selection on our gene in the human lineage?

At first sight, one could possibly argue that, technically, the two situations are not comparable. FDR is often presented as something meant to correct for multiple comparisons. When analyzing a gene in isolation, there are no multiple comparisons involved, and therefore, no correction is needed. Thus, one should not transpose FDR results to a single case.

However, I think that this would be a fallacious argument. Fundamentally, what the FDR calibration tells you is that 50% of the genes for which p<0.01 are false discoveries in the present context. If there is nothing special about your gene of interest, then it is just like any gene taken at random for which p<0.01, and therefore it has at least a 50% chance of being a false discovery (in fact, about 80%, because its p-value sits right at 0.01). As far as I can see, it does not matter whether this gene was tested in isolation or in conjunction with other genes.

Personally, I think that it is misleading to see FDR just as a way to 'correct for multiple comparisons'. For me, FDR is fundamentally an empirical Bayes posterior probability: the probability that a case is a true null given that it was rejected. This posterior probability is empirical in the sense that it is based on a prior that has been empirically estimated on a large series of cases. But apart from that, it is just an evidential probability.
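
To make this explicit in symbols: writing $p_0$ and $p_1$ for the fractions of true nulls and true alternatives among the tested cases, $\alpha$ for the significance threshold, and $1-\beta$ for the average power over the true alternatives, Bayes' rule gives

$FDR = P(H_0 \mid \text{rejected}) = \frac{p_0 \alpha}{p_0 \alpha + p_1 (1-\beta)}$

which is an ordinary posterior probability, except that $p_0$ and $p_1$ are estimated empirically from a large collection of tests rather than posited subjectively.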

Seen from this perspective, it becomes clear that the only difference between the multiple-testing and the single-gene settings is that the multiple-testing framework gives you the means to estimate your priors -- and therefore, in the present case, the means to know that you are in an unfavorable situation, in which a p-value of 0.01 just does not represent sufficient evidence for positive selection. In contrast, the single-gene experiment merely deprives you of the means to know that a p-value of 0.01 does not represent much in the present case -- but of course, not knowing about a problem does not imply that the problem does not exist.

Now, you could still raise one objection against this argument: you could claim that a single-gene analysis is different from a genome-wide scan, in that the gene under study has generally not been chosen uniformly at random among all possible genes. After all, one of the reasons that make the FDR so stringent is that the overall probability that a gene taken at random along the genome is under positive selection in humans is just very low. Therefore, it takes strong empirical evidence to turn this very low prior probability into a reasonably high posterior probability. However, your gene may be different in this respect: you may have decided to study that gene precisely because you have good reason to believe that it is under positive selection -- for instance, a gene involved in the immune response against pathogens. If this is the case, then perhaps you could be a bit less stringent when interpreting the p-value returned by the test specifically for that gene.

Maybe. But, then, first, you should note that this is fundamentally a Bayesian argument, which amounts to invoking a special prior in the particular case of your gene of interest.

Second, assuming that you are willing to play this Bayesian game, it remains to be determined how you can then justify your special prior. The typical subjective Bayes approach, relying on introspection and verbal arguments to that effect, may be fine in other situations, in which one can show that the end result is not too sensitive to the prior anyway. But here, that is not the case: the prior is precisely the critical aspect of the question; it is the only factor that can make a difference between your gene and any gene with p=0.01.

For me, if you really want to claim that immunity-related genes are different and are more likely to be under positive selection, then you have to let the data speak for themselves on this issue. In practice, this means that, instead of searching your soul and coming up with 'your' prior probability for immunity-related genes to be under positive selection, you should use some well-accepted gene ontology, select the subset of genes that are involved in immunity and redo the FDR analysis on this subset of genes. If genes involved in immunity are indeed more likely to be under positive selection, this will automatically show up in the form of a less stringent significance threshold compared to the global genome-wide analysis -- perhaps up to the point that your gene will now be included among the genes significantly under positive selection at an FDR of 5% or 10%.

But as long as you do not let empirical data confirm your prior guess, any claim that your gene is different has simply not been substantiated. In that case, the global genome-wide FDR control represents, by default, our best calibration for assessing the chances that your gene, with p=0.01, is indeed under positive selection -- and based on what this FDR calibration says, these chances appear to be rather low.

So, in conclusion, once the fundamental meaning of FDR as an evidential (posterior) probability is understood, it becomes clear that ignoring FDR calibrations derived from genome-wide analyses when interpreting the results of small-scale experiments is not rational. For me, it is like refusing to use what has been learned in the past on a large series of examples when considering a new, similar but isolated case.

And indeed, there tends to be a surprising disconnect in our collective statistical practice, between the rather stringent thresholds, usually in the range of p<1e-4 or even p<1e-6, most often imposed in typical genome-wide analyses in which FDR control is explicitly implemented (like GWAS, or genome-wide scans for positive selection), and the substantially more liberal attitude, in the range of p<5e-2, most often adopted in the context of small-scale studies in which FDR is not explicitly considered.

The only possible logical justification for this difference in our attitude would be that we are generally confident that our small-scale studies concern genes, or SNPs, for which the prior odds in favor of the existence of a true effect are typically 100 times higher than for a random gene or SNP along the genome. Yet, I am not totally sure that we have such good reasons to be so confident in our pre-selected candidates.

I am not saying that one should always conduct FDR calibrations. Sometimes it is just too difficult to find a good reference set. But if, each time we can find a reasonably good reference set (and genes or SNPs represent an easy case in this respect), we are surprised by how stringent we have to be in order to control for a decent rate of false discovery, then perhaps it means something about how excessively liberal we tend to be when we don't have the means to know our true FDR.

==

Nielsen R, Yang Z (1998) Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148: 929–936.

Kosiol, C., Vinar, T., da Fonseca, R. R., Hubisz, M. J., Bustamante, C. D., Nielsen, R., & Siepel, A. (2008). Patterns of positive selection in six Mammalian genomes. PLoS Genet, 4(8), e1000144. doi:10.1371/journal.pgen.1000144

Waking up from hibernation


This blog has not been very active lately. I was too busy with other things. But there are still many different subjects that I would like to discuss here.

Let me post one or two additional comments about posterior probabilities, false discovery rates and general issues in current statistics. After that, I would like to move on to other subjects more directly related to statistical evolutionary biology.

Stay tuned!



Thursday, February 27, 2014

Revised standards: p-values and false discoveries



Since we are again talking about p-values, evidence and scientific false discoveries (see my last post), I would like to come back to Valen Johnson's article of last fall (Johnson, 2013). Even if the main ideas are apparently not totally new (see e.g. Berger and Sellke, 1987), Johnson's article revives an important question, and apparently at a very good moment (see also a similar point in Nuzzo, 2014).

The argument of Johnson's article relies on Bayes factors and uses a relatively technical concept of uniformly most powerful Bayesian tests. For that reason, it may give the impression that the whole question is about the choice between Bayes and frequentism, or between p-values and Bayes factors. Yet, for me, the whole problem can more directly be presented as a false discovery rate issue.

Along these lines, what I propose here is a relatively simple derivation of Johnson's main point (at least as I understand it). Apart from emphasizing the false discovery aspect, this derivation does not rely on uniformly most powerful Bayesian tests and is therefore hopefully easier to understand.

Suppose you conduct a one-sided test: you observe $X$, normally distributed with mean $\theta$ and variance 1, and want to test $H_0$: $\theta=0$ against $H_1$: $\theta > 0$. For instance, you observe a log-ratio of expression levels for a given gene between two conditions, and want to determine whether the gene is over-expressed.

Imagine that you observe $x = 1.646$. This is slightly greater than $x_0 = 1.645$, the usual threshold for a type I error rate of 5%. Thus, you have obtained what is usually considered as marginally significant evidence in favor of over-expression, with a p-value just slightly less than 0.05.

Now, what does that imply in terms of the chances that this gene is indeed over-expressed?

This question sounds Bayesian: it asks for the posterior probability that $H_1$ is true conditional on the empirically observed value of $X$. But you can ask a similar question in a frequentist perspective: consider the collection of all one-sided normal tests (with known variance) that have been published over the last 10 years in evolutionary biology, and for which a marginal significance was reported. What fraction of them are false discoveries?

To answer that question, we need to contemplate the complete collection of all one-sided normal tests that were conducted (not just those that were published because they turned out to be significant): this represents a total of $N = N_0 + N_1$ tests, among which $N_0$ were true nulls and $N_1$ were true alternatives. Let us call $p_0 = N_0 / N$ and $p_1 = N_1 / N$ the fractions of true nulls and alternatives, and $\theta_i$, $i=1, \ldots, N_1$, the collection of true effect sizes across all true alternatives.

The probability of obtaining a marginally significant result ($x_0 \pm \delta x$) for a null case is $ p(X=x_0 \mid H_0) \delta x$. For a given non-null case with effect size $\theta_i$, it is $ p(X=x_0 \mid \theta_i) \delta x$. Therefore, the expected total number of marginally significant discoveries over the $N$ tests is just:

$n = N_0  p(X = x_0 \mid H_0) \delta x + \sum_{i=1}^{N_1} p(X = x_0 \mid \theta_i) \delta x$

which, upon dividing by $N \delta x$, can be rewritten as:

$\frac{n}{N \delta x} = p_0  L_0 + p_1 \bar L_1$

where $L_0 = p(X=x_0 \mid H_0)$ and $\bar L_1$ is the average likelihood under the alternative cases:

$\bar L_1 = \frac{1}{N_1} \sum_{i=1}^{N_1} p(X = x_0 \mid \theta_i)$.

The fraction of false discoveries is simply the contribution of the first term:

$fdr = p_0 L_0 / (p_0 L_0 + p_1 \bar L_1) = p_0 / (p_0 + p_1 B)$

where $B$ is:

$B = \bar L_1 / L_0$.

($B$ can be seen as an empirical version of the Bayes factor between the two hypotheses, but this is not essential for the present derivation.)

The average likelihood under $H_1$, $\bar L_1$, is less than the maximum likelihood under $H_1$, $\hat L_1$, which is here attained for $\hat \theta = x_0$. Using the formula for a normal density gives:

$B < B_{max} = \frac{\hat L_1}{L_0} = e^{\frac{1}{2} x_0^2}$

or equivalently:

$fdr > fdr_{min} = \frac{p_0}{p_0 + p_1 B_{max}}$.

Some numbers here. For $x_0 = 1.645$, $B_{max} = 3.87$. If $p_0 = p_1 = 0.5$, $fdr > 0.20$. In other words, at least 20% of your marginally significant discoveries are false. And still, this assumes that half of the tests were conducted on true alternatives, which is a generous assumption. If only 10% of the tested hypotheses are true ($p_1 = 0.1$), then, $fdr > 0.70$.
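
These numbers are easy to check; a few lines of code reproducing them from the bound derived above:

    import numpy as np

    x0 = 1.645                      # one-sided 5% threshold for a standard normal
    Bmax = np.exp(0.5 * x0 ** 2)    # upper bound on B, as derived above (about 3.87)
    for p1 in [0.5, 0.1]:
        p0 = 1.0 - p1
        fdr_min = p0 / (p0 + p1 * Bmax)
        print(f"p1 = {p1}   Bmax = {Bmax:.2f}   fdr_min = {fdr_min:.2f}")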

And all this is generous for yet another reason: it assumes that all your true alternatives are at $\theta=1.645$, the configuration that gives you the smallest possible local fdr. Reasonable distributions of effect sizes under the alternative can easily result in even higher false discovery rates. For example, if, under $H_1$, $\theta$ is distributed according to the positive half of a normal distribution of mean 0 and standard deviation of 3, then $B$ is less than 1, which implies more than 50% of false discoveries if half of the tested hypotheses are true, and more than 90% of false discoveries if 10% of the tested hypotheses are true.

I guess it is fair to say that most tested hypotheses are in fact false ($p_1$ is most probably less than 0.1) -- if most tested hypotheses were true, then this would mean that we already know most of what we are inquiring about, and thus research would just be an idle confirmatory exercise. It is also fair to say that the whole point of scientific hypothesis testing is to reverse this unfavorable proportion by filtering out the majority of non-findings and publishing an enriched set hopefully composed of a majority of true effects. Yet, as we can see here, this is not what happens, at least if we focus on marginal discoveries.

The entire argument above assumes that the variance of $X$ around $\theta$ is known. If it is unknown, or more generally if there are nuisance parameters under the null, things are a bit more complicated. The exact quantitative results also depend on the sampling model (normal or other). However, for the most part, the message is probably valid in many other circumstances and is very simple: what we call marginally significant findings are most often false discoveries.
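
As a sanity check on this message, here is a small simulation under assumptions of my own choosing (10% true alternatives, all placed at $\theta = x_0$, i.e. the most favorable configuration used in the bound above); the fraction of false discoveries among marginally significant results should come out close to the roughly 0.70 computed earlier.

    import numpy as np

    rng = np.random.default_rng(42)

    # a large collection of one-sided normal tests: a fraction p1 are true
    # alternatives, all placed at theta = x0 (the most favorable configuration)
    N, p1, x0, dx = 2_000_000, 0.1, 1.645, 0.01
    is_alt = rng.random(N) < p1
    x = rng.normal(np.where(is_alt, x0, 0.0), 1.0)

    # 'marginally significant' results: observations falling just above the threshold
    marginal = (x > x0) & (x < x0 + dx)
    fdr = np.mean(~is_alt[marginal])
    print(f"fraction of false discoveries among marginal results: {fdr:.2f}")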

Now, we can quibble over many details here, discuss the practical relevance of this result (do we really need to define a threshold for p-values?), object that significance is just one aspect of the problem (you can get very significant but weak effects), and so on. Nevertheless, one should probably admit one thing: many of us (me included) have perhaps not adopted the correct psychological calibration in the face of p-values and have tended to over-estimate the significance of marginally significant findings.

In other words, it is probably fair to admit that we should indeed revise our standards for statistical evidence.

Also, we should more often think in terms of false discovery rate: it tells us important things that cannot be understood by just looking at p-values and type I errors.

Independently of this theoretical argument, it would certainly be interesting to conduct meta-studies here: based on the collection of reported p-values, and using standard algorithms like the one of Benjamini and Hochberg (1995), one could retrospectively estimate the fraction of false discoveries across all published results of the last ten years in the context of molecular evolution, comparative or diversification studies, for instance (all of which have often relied on relatively weak effects). I would be curious about the outcome of such meta-analyses.
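
For reference, a minimal implementation of the Benjamini-Hochberg step-up procedure is sketched below (the toy p-values at the end are made up for illustration; actually estimating the realized fraction of false discoveries from published p-values would require further assumptions about the null distribution):

    import numpy as np

    def benjamini_hochberg(pvals, alpha=0.05):
        """Boolean mask of discoveries under the BH step-up procedure at level alpha."""
        p = np.asarray(pvals)
        m = len(p)
        order = np.argsort(p)
        # largest k such that p_(k) <= alpha * k / m
        below = p[order] <= alpha * np.arange(1, m + 1) / m
        k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
        reject = np.zeros(m, dtype=bool)
        reject[order[:k]] = True
        return reject

    # toy usage: 90% true nulls (uniform p-values), 10% true effects (small p-values)
    rng = np.random.default_rng(0)
    p = np.concatenate([rng.uniform(size=900), rng.beta(0.1, 10.0, size=100)])
    print(benjamini_hochberg(p).sum(), "discoveries at a nominal FDR of 5%")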

===

Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association, 82:112.

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110:19313.

Nuzzo, R. (2014, February 13). Scientific method: statistical errors. Nature, pp. 150–152. doi:10.1038/506150a

Thursday, February 13, 2014

Blending p-values and posterior probabilities



Interesting column in Nature this week, by Regina Nuzzo, about p-values and why so many published findings are not true (see also Johnson, 2013, as well as many other earlier articles, e.g. Berger and Sellke, 1987).

Just one little point I find amusing: the entire column is about the problem of evaluating the "chances" that a scientific hypothesis is true or false given that a marginally significant effect was detected (thus leading to a publication of a result that may fail to replicate). The figure accompanying the column (the box entitled "probable cause") makes this point even more obvious, emphasizing that a p-value of 0.05 or even 0.01 does not imply that the chances are high that the effect is true -- in fact, in many situations, it implies that the chances are rather low.

It is of course a good idea to re-emphasize, one more time, the slippery nature of p-values. As an aside, the column offers an overall interesting and well-integrated discussion, with a good dose of common sense, of many other aspects of the problem of statistical inference and scientific reproducibility.

Still, I am a bit surprised that the text does not make it clear that the "false alarm probability" is fundamentally a Bayesian posterior probability (of being in front of a true null given that the test was rejected). In particular, what the figure shows is just that: a series of rough estimates of Bayesian posterior probabilities, assuming different reasonable prior odds and different observed p-values.

The article does mention the idea of using Bayesian inference for addressing these problems, but only very late and very incidentally (and while mentioning that this entails some "subjectivity"). As for the expression "posterior probability", it is not mentioned a single time in the whole article. Yet, for me, the figure should read as: prior probabilities (top), p-values (middle), and posterior probabilities (bottom).

Why not mention the name of the central concept of the whole discussion? I suspect this is because classical frequentism has censored this question for 70 years: you were not supposed to even talk about the probability that a hypothesis is true or false given the available evidence, since hypotheses were not supposed to be random events. Therefore, presumably, if you want to reach a sufficiently large audience, you had perhaps better not wave a bright red flag (posterior probabilities? subjective!).

Just to be clear: what I am saying here does not imply that Bayesian hypothesis testing, at least the current state-of-the-art version of it, is more reliable than p-values. Personally, I think that there are many, many problems to be addressed there as well. I am not even suggesting that, ultimately, Bayes factors or posterior probabilities should necessarily be used as the reference for assessing significance of scientific findings. I still don't have a very clear opinion about that.

Also, by Bayesian, I do not mean subjectivist. I just refer to the concept of evidential probabilities, i.e. probabilities of hypotheses conditional on observed facts (in this respect, I think I am true to the spirit of Bayes and Laplace).

What I am saying is just that one should perhaps call things by their names.

In any case, the whole question of the relation between p-values and the lack of replication of scientific discoveries cannot be correctly conceptualized without bringing together ideas traditionally attached to both the Bayesian and the classical frequentist schools. That is what makes the question theoretically interesting, in addition to being urgent for its practical consequences.

===

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110:19313.

Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association, 82:112.


Monday, February 10, 2014

Parameter estimation: optimizing versus conditioning



Misunderstandings about how Bayesian inference works are very common. In particular, people often don't understand that, in Bayesian inference, you do not have the problem of parameter-rich models being inherently more prone to over-fitting.

The thing is, much of the currently accepted wisdom about statistics has been acquired from a nearly universal and exclusive use, over decades, of classical methods such as least-squares or maximum likelihood approaches, all of which are based on the idea of optimizing a score (e.g. maximizing a likelihood or minimizing a sum of squares).

Bayesian inference, however, works differently. It does not rely on any optimization principle. In fact, strictly speaking, in the context of Bayesian inference, you do not fit a model to the data. Instead, you condition a model on the data. And this conditioning, as opposed to optimizing, means a radically different logic of inference.

I think that this logical difference is one of the main reasons why common intuition, being too reliant on a long experience with optimization-oriented paradigms, regularly fails when trying to understand some key aspects of Bayesian inference -- and in particular, totally fails when it comes to correctly visualizing how Bayesian inference deals with model complexity.

Optimization is, in some sense, an aggressive methodology, selecting the single parameter configuration that best explains the data. As a consequence, the more parameters you have, the more you can satisfy your greed -- which means in particular that optimizing an overly rich model will lead to fitting both the structural patterns (the signal) and the irrelevant aspects (the noise) in the data.

Conditioning, on the other hand, is more temperate. It works by eliminating, among all possible model configurations, those that are ruled out by the data, so as to keep all of those that are good enough (and not just the best one). One can see it very clearly in the context of approximate Bayesian computation: there, one repeatedly draws random model configurations from the prior, simulates data and discards those parameter values for which the simulation does not match the empirical data (up to a certain tolerance interval). Then, estimation, and more generally decision making, is typically done by averaging over all configurations remaining after this elimination step.
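
Here is a minimal ABC rejection sketch on a toy problem (inferring a normal mean), just to make this elimination-then-averaging logic concrete; the numbers are of course arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    # toy observed data: 50 draws from a unit-variance normal with unknown mean
    obs = rng.normal(2.0, 1.0, size=50)
    obs_mean = obs.mean()

    # ABC rejection: draw from the prior, simulate, keep what is 'good enough'
    ndraws, eps = 200_000, 0.05
    theta = rng.normal(0.0, 10.0, size=ndraws)        # broad prior on the mean
    sim_mean = rng.normal(theta, 1.0 / np.sqrt(50))   # shortcut: mean of 50 unit-variance draws ~ N(theta, 1/50)
    kept = theta[np.abs(sim_mean - obs_mean) < eps]   # eliminate what is ruled out by the data

    # estimation by averaging over everything that was kept, not just the best fit
    print(f"kept {kept.size} of {ndraws} draws; posterior mean ~ {kept.mean():.2f}")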

Thus, optimization and conditioning could not be more different; a bit like positive versus negative selection. Optimization actively searches for the best-fitting configuration and rules out everything else. Conditioning actively rules out what does not fit at all and keeps everything else.

Given such a fundamental difference between the two paradigms, one may expect that intuitions gained in one of the two contexts will be counter-productive in the other.

For instance, what happens under a parameter-rich model? In the context of an optimization-oriented approach, it is easy to tweak the parameters so as to make the model fit the irrelevant details of the data: you get over-fit. In contrast, in the context of Bayesian inference, such highly contrived parameter configurations will not be typical among the set of model configurations that have not been eliminated by conditioning, so that their impact on the final estimate or decision will be completely swamped by all other configurations that have been kept.

Therefore, the whole problem that rich models might over-fit irrelevant aspects of the data is just not present in a Bayesian framework. Or, to put it differently, the apparent propensity of parameter-rich models to over-fit is merely a consequence of aggressive optimization -- not an inherent problem of parameter-rich models.

Now, all this does not mean that Bayesian inference is necessarily better. In particular, Bayesian inference may well have replaced over-fitting problems with prior sensitivity issues (although this is exactly where hierarchical priors have something to offer).

But it does raise the following question (for you to ponder...): how much of what you consider to be obvious universal facts about statistics -- such as the inherent ability of richer models to "have it easier" and their corresponding propensity to over-fit, the idea that potentially useless extra parameters are necessarily harmful, the necessity of externally imposed penalizations on richer models, and the bias-variance tradeoff, among other things -- is in fact a consequence of the aggressive nature of one particular approach to statistics?

Again, this is a purely logical question, not a normative one. Some of the consequences of optimization (in particular, the possibility of playing with the bias-variance balance, which you cannot really do in a Bayesian context) may not be problematic at all. But it is just that one should perhaps not consider as "laws of nature" things that are in fact specific to one particular statistical paradigm.

More fundamentally, I find it interesting to exercise one's intuition by working out the differences between these two approaches to parameter estimation, by optimization or by conditioning. They offer complementary insights about statistical inference. Also, they can be combined (e.g. empirical Bayes by maximum marginal likelihood). Therefore, it is important to get used to both of them.

Statistical pluralism is not a problem, but an opportunity for us to diversify our thinking routines.

Monday, February 3, 2014

Soft and hard shrinkage



Shrinkage based on hierarchical priors can be seen as a way of reducing the effective number of parameters of a model.

Consider for instance the problem of allowing for amino-acid compositional variation through time in the context of phylogenetic estimation. In a maximum likelihood framework, and for small datasets (short sequences), we cannot afford unconstrained estimation of a distinct equilibrium frequency vector over the 20 amino-acids (19 free parameters) on each branch: this would presumably lead to over-fitting problems (although, again, I would like to see whether it would have such dire consequences in practice, but let us assume that it is the case).

A clever solution to this problem (Groussin et al, 2013) is to identify the first one or two principal components of the empirically observed compositional variation among taxa. Biologically, most of the compositional variation of proteomes is due to two independent factors: an underlying GC bias and a correlation with growth temperature (although the latter holds only in prokaryotes). Therefore, the first two components should absorb a large proportion of the total compositional variation. Then, a constrained model is implemented, allowing for compositional variation across branches only along these two principal directions. This reduces the number of parameters from 19 to 2 per branch, thus making the model suitable for shorter alignments.

Now, what would be the hierarchical Bayesian solution to this problem? Typically, one would take the equilibrium frequencies across branches to be i.i.d. from a multivariate normal prior (for this to be well defined, equilibrium frequencies have to be log-transformed and reduced to 19 free parameters, see Lartillot, 2014). If $x_j$ is the 19-dimensional parameter associated with branch $j$, then:

$x_j \sim N(\mu, \Sigma)$

Here, $\mu$ is a 19-dimensional vector and $\Sigma$ a $19 \times 19$ covariance matrix. Both are hyperparameters, which would then be estimated by borrowing strength across branches.

This hierarchical prior will implement self-tuned shrinkage of equilibrium frequency profiles across branches in a direction-dependent manner (in the 19-dimensional space). Mathematically, $\Sigma$ can be decomposed into 19 independent variance components (corresponding to its eigenvalues). The prior will be loose in the directions corresponding to large eigenvalues, and tight in the directions corresponding to small eigenvalues. Thanks to this mechanism, the model will allow for compositional variation mostly along the first principal directions of $\Sigma$ (large eigenvalues), while shrinking more strongly along all other directions (small eigenvalues). Since $\Sigma$ is estimated from the dataset, its first principal directions will roughly correspond to the first few principal directions of the empirically observed compositional variation among taxa, which are also the ones selected by the method of Groussin et al (2013).
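
A small numerical sketch of this mechanism, with an arbitrary, made-up $\Sigma$ whose variance is concentrated along two directions, standing in for a matrix estimated from data:

    import numpy as np

    rng = np.random.default_rng(0)

    d, nbranch = 19, 200
    # a covariance matrix with two large variance components and 17 small ones
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))             # random orthonormal directions
    Sigma = Q @ np.diag([5.0, 2.0] + [0.01] * (d - 2)) @ Q.T
    mu = rng.normal(0.0, 0.3, size=d)

    # branch-specific log-scale parameters x_j ~ N(mu, Sigma)
    x = rng.multivariate_normal(mu, Sigma, size=nbranch)

    # map back to 20 amino-acid frequencies (a softmax with one reference coordinate
    # fixed at zero -- one of several possible transforms, used here only for illustration)
    z = np.concatenate([x, np.zeros((nbranch, 1))], axis=1)
    freqs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

    # the realized variation across branches is concentrated along the top two
    # eigen-directions of Sigma, and softly shrunk everywhere else
    print(np.round(np.linalg.eigvalsh(np.cov(x.T))[::-1][:4], 2))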

So, the two methods should do very similar things.

In this sense, hierarchical Bayes shrinkage indeed amounts to reducing the effective number of parameters, although in a soft manner: there is no truncation, no thresholding, and therefore no parameter-counting nor any model selection by AIC or BIC. Instead, there is an implicit context-dependent penalization imposed on the parameter vector.

In fact, we do not need to be Bayesian to use this kind of soft penalization. Alternatively, one could imagine maximizing a penalized likelihood, such as:

$\ln L - \lambda \, \mathrm{tr}(V)$

Here $\ln L$ is the log of the likelihood, $V$ would be the empirical covariance matrix (the scatter matrix) of the (suitably log-transformed) equilibrium frequencies across branches, and $\lambda$ would be the strength of penalization. I don't know if this specific penalization scheme is the most reasonable one, but intuitively, the trace of $V$ is the sum of the variances along the principal directions, and therefore, the penalization term will implement shrinkage by allowing compositional variation only along the principal directions where a significant increase of the likelihood can be gained.
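
A quick check of the claim that the trace penalty is the sum of the variances along the principal directions (the eigenvalues of the scatter matrix), on made-up branch parameters:

    import numpy as np

    rng = np.random.default_rng(0)

    # made-up branch-specific log-transformed frequency parameters (200 branches x 19 dims)
    x = rng.normal(size=(200, 19))
    V = np.cov(x.T)   # empirical scatter matrix across branches

    # trace = total variance = sum of variances along the principal directions
    print(np.isclose(np.trace(V), np.linalg.eigvalsh(V).sum()))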

At first sight, this penalization scheme looks a bit like what is called ridge regression in statistics.

In any case, I just wanted to illustrate the idea of "soft" shrinkage: both in the Bayesian and in the penalized likelihood framework, we still have all the degrees of freedom in the model, and yet, the model is not over-parameterized -- its effective number of parameters is under control.

I tend to believe that, when modeling nuisances, soft shrinkage is better than hard shrinkage by dimensional reduction, fundamentally because soft shrinkage does not "throw away" part of the nuisances, as is done when using hard truncation. This may be a relatively minor problem in the case of protein composition, where the first two components explain a large proportion of the total variance. In other cases, however, hard truncation may result in models turning a blind eye to a significant proportion of the nuisances that were supposed to be accounted for. Soft shrinkage, in contrast, will embrace the whole of those nuisances, and should therefore lead to more accurate estimation of the parameters of interest.

Hierarchical Bayes will implement regularization by soft-shrinkage more or less automatically, and this is fundamentally what makes it such an attractive paradigm. On the other hand, if one really wants to stay in a maximum likelihood paradigm (in particular, for computational reasons), while being flexible with respect to the nuisances, then perhaps maximum penalized likelihood, instead of classical maximum likelihood, should be considered more seriously, in particular in phylogenetics (Kim and Sanderson, 2008).


==

Groussin, M., Boussau, B., & Gouy, M. (2013). A branch-heterogeneous model of protein evolution for efficient inference of ancestral sequences. Syst Biol, 62(4), 523–538. doi:10.1093/sysbio/syt016

Kim, J., & Sanderson, M. J. (2008). Penalized likelihood phylogenetic inference: bridging the parsimony-likelihood gap. Syst Biol, 57(5), 665–674. doi:10.1080/10635150802422274

Lartillot, N. (2014). A phylogenetic Kalman filter for ancestral trait reconstruction using molecular data. Bioinformatics. doi:10.1093/bioinformatics/btt707