Friday, October 31, 2014

Algorithmic models of probabilistic inference


Rejection sampling is clearly not the most efficient Monte Carlo estimator. Even so, it is a conceptually beautiful idea: fundamentally, rejection sampling represents an algorithmic model of Bayes theorem.

The algorithm works as follows. Given some observed data $d$ and a model parameterized by $\theta$:
- randomly choose parameter $\theta$ from the prior 
- simulate data $D$, using $\theta$ as your parameter vector
- reject if $D \neq d$, otherwise, keep sample aside
Repeat until exhaustion.

Rejection sampling represents an operational definition of what we mean by conditioning a distribution on the data: it identifies the posterior distribution with what you get by filtering out, from the random draws from the prior distribution, those parameter configurations that did not reproduce what you observed. We can get a quantitative derivation of this posterior by just reading out our algorithm: the probability of getting a particular parameter configuration $\theta$ in our final sample is proportional to the probability of first drawing this configuration from the prior, $p(\theta)$, multiplied by the probability of accepting it  which is just the probability that we simulate a dataset exactly equal to $d$, $p(d \mid \theta)$. Therefore, my final sample is distributed according to:

$p(\theta \mid d) \propto p(\theta) p(d \mid \theta)$.

Thus, rejection sampling gives us a simple intuitive 'proof' of Bayes theorem.

Personally, I find the idea of rejection sampling not just useful for explaining Bayesian inference in the context of an introductory course. I also use it regularly, reformulating my most complex problems of the moment in terms of this algorithmic representation, just to clarify my ideas. As I will try to show through some examples in later posts, rejection sampling and, more generally, algorithmic models of probabilistic settings, can sometimes be very helpful to clarify the meaning our models and to sort out the calculations.

But first, a few more preliminaries, in the form of three little examples of increasing complexity, illustrating how this works, and what this fundamentally means, in practice. Example 1 and 2 are rather trivial, but they are interesting in that they represent two different epistemological meanings that Bayes theorem can have in practical situations. They also set the stage for example 3, which deals with conditional likelihoods.

Example 1. Imagine we want to infer the allele frequencies at a neutral bi-allelic locus, based on a random sample of chromosomes from a given population. Say we have sampled $n=5$ chromosomes, of which $k=3$ had allele 1 and $n-k=2$ allele 0, and we want to infer $\pi$, the frequency of allele 1 in the population. Of course, we could just take $k/n$ as a quick estimate, but if $n$ is small, there will be quite some uncertainty, which we also want to quantify.

If we are dealing with a randomly mating species, with known scaled mutation rate $\theta = 4 N_e u$, we know that the frequency distribution of neutral alleles in the population is a Beta distribution of parameter $\theta$ (figure below, thin line). We can use this distribution as our prior on $\pi$ and use the rejection sampling algorithm:

- randomly choose $\pi$ from prior 
- simulate $n$ chromosomes, each with prob. $\pi$ of being of type 1.
- count number $K$ of chromosomes carrying allele 1.
- reject if $K \neq k$.

Or, more compactly:

- randomly choose $\pi$ from prior 
- simulate $K \sim$ Binomial($\pi,n)$
- reject if $K \neq k$.
Reading out this algorithm leads to the following posterior distribution (figure, thick line):

$p(\pi \mid k) \propto p(\pi) p(k \mid \pi)$.



This is a rather trivial case, although one for which the inferential semantics of the algorithm is particularly compelling: our locus of interest is just any neutral locus for which $k$ out of $n$ randomly chosen chromosomes happen to have allele 1. Therefore, simulating a large number $P$ of random instances of neutral loci, each with its own population frequency $\pi$, reproducing the data-acquistion protocol on each of them and then selecting only those instances that match our observation (conditioning step), gives us a typical set, of which our true observed instance is then supposed to be just an additional random member. In particular, if we could sort in ascending order the $P+1$ values of $\pi$ (the $P$ that we have simulated + the true one), then, our unknown $\pi$ is indistinguishable from simulated replicates, it should have an equal chance of being the first, or the second, or more generally, of having rank $p$ for any $p=1..P+1$. Therefore, the unknown $\pi$ has equal chance of being contained within any one among the successive $P+1$ intervals of our simulated series (including the two open-ended intervals at both ends): this is the epistemological meaning of a posterior distribution. 

Example 2. As another slightly less trivial problem, suppose now that we have genotyped, not just one, but $M$, unlinked neutral bi-allelic loci (or sites) in our $n$ individuals. We now we want to infer $\theta = 4 N_e u$, based on the allele counts $k_i$, $i=1..M$, across loci. Let us call $k$ the vector of observations. We have prior information about $\theta$, summarized in some prior distribution $p(\theta)$. In that case, the algorithm would run as follows:

randomly choose $\theta$ from $p(\theta)$

for site i=1..M
- randomly choose $\pi_i$ from $p(\pi_i \mid \theta)$
- simulate $K_i \sim$ Binomial$(\pi_i, n)$

reject unless $K_i = k_i$ for all $i=1..M$.

Here, the $\pi_i$’s are now just temporary random variables. The algorithm can therefore be written more compactly as:

randomly choose $\theta$ from $p(\theta)$
for i=1..M
- randomly simulate $K_i \sim p(K_i \mid \theta) = \int p(K_i \mid \pi_i) p(\pi_i \mid \theta) d \pi_i$
reject unless $K_i = k_i$ for all $i=1..M$.

where we have averaged out over all possible ways we could have sampled the $\pi_i$'s. The 'for' loop leads to a product, which gives me the following posterior distribution:

$p(\theta \mid k) \propto p(\theta) \prod_i p(k_i \mid \theta)$.

Still fairly simple. One little detail, however, concerning the epistemological meaning of the whole procedure in this new case: thus far, the formulation of the algorithm has been slightly incomplete. When I say ‘choose’, perhaps I should be more specific about who exactly ‘chooses’ here. In the case of problem 1:

Nature randomly chooses $\pi$ from the Beta distribution $p(\pi \mid \theta)$ 
Nature simulates $K \sim p(K \mid \pi)$
I observe and reject if $K \neq k$.

The fact that random draws are entirely under Nature's control is precisely why inference by conditioning is so compelling in that case. But for problem (2), it is a bit different:

I randomly choose $\theta$ from my prior $p(\theta)$ and imagine it is the true $\theta$

for i=1..M
- Nature randomly chooses $\pi_i \sim p(\pi_i \mid \theta)$
- Nature simulates $K_i \sim p(K_i \mid \pi_i)$

reject unless $K_i = k_i$ for all $i=1..M$.

This (somewhat less compelling) application of rejection sampling is called Bayesian inference: happily mixing subjective priors ('I choose') with distributions that are meant to represent objective processes ('Nature chooses'). Epistemologically, what we are now doing is this: we are filtering out, from our prior beliefs about $\theta$, those that don't easily reproduce the observed data — which still makes sense, although this assumes that you are comfortable with translating beliefs into probability distributions and vice versaI won’t discuss here the philosophical implications of this methodological choice. I will in due time, but for the moment, I just wanted to be as accurate (and as honest...) as possible in my formulation. Making this distinction beween the subjective and the objective components of the model can be useful in less trivial contexts.

Examples 3. Now, let us consider a more complicated problem: I still want to estimate $\theta = 4 N_e u$ based on $M$ unlinked loci. But the crucial difference with the previous example is that I have removed all the constant sites of my dataset, so as to consider only segregating sites: that is, only loci that happen to show at least one variant in my sample of $n$ individuals. So, by design, for all $i=1..M$, one has $0 < k_i < n$ (whereas in the previous case, one had instead that $0 \le k_i \le n$). Let us denote this additional constraint by $S$: the sample is entirely made of segregating sites. For $i=1..M$, $S_i$ stands for the constraint that site $i$ is segregating.

I should really take into account this new mode of data acquisition in my inference. If I do not, I will greatly overestimate $\theta$ (since it would take very high values of $\theta$ to make it likely that $M$ randomly chosen neutral sites were all segregating). Thus, I clearly need to work out the probabilistic calculations for my constrained model. The algorithmic representation of this model would look like this:

I randomly choose $\theta$ from $p(\theta)$ and imagine it is the true $\theta$

for i=1..M
  repeat
  - Nature randomly chooses $\pi_i \sim p(\pi_i \mid \theta)$
  - Nature randomly simulates $K_i \sim p(K_i \mid \pi_i)$
  until $1 < K_i < n$

I reject unless $K_i = k_i$ for all $i=1..M$.

Since sites are unlinked and, therefore, independent, collecting $M$ segregating sites among a large collection contanining a majority of non-segregating sites is just like letting Nature ‘run’ the process, site by site, intercepting only those that turn out to be segregating, until obtaining a total number of $M$ accepted sites (and then I decide whether or not to reject the whole sample). Note that there are now two levels of rejection: one that is ‘for free’ (in the repeat-until loop), and one that is not (the final step). By ‘free’, I mean: not accounted for in our total acceptance rate -- and, therefore, not accounted for in our total probability.

As above, we can first eliminate the $\pi_i$'s:

Randomly choose $\theta$ from $p(\theta)$

for i=1..M
  repeat
  - Nature randomly simulates $K_i \sim p(K_i \mid \theta) = \int p(K_i \mid \pi_i) p(\pi_i \mid \theta) d \pi_i$
  until $1 < K_i < n$

Reject unless $K_i = k_i$ for all $i=1..M$.

Then, for a given $\theta$ and a given $i$, carefully interpreting the repeat-until loop leads to the following proability for $k_i$, which is now conditioned on the constraint $S_i$ being fulfilled:

(1) $p(k_i \mid \theta, S_i) = p(k_i \mid \theta) / p(S_i \mid \theta)$.

This conditional likelihood is defined only over $k_i = 1..n-1$ (now excluding 0 and n, since I won't exit the repeat-until loop as long as these values are being produced) and has been suitably normalized, by dividing it by the probability of getting indeed a value for $k_i$ in the allowed interval (i.e. by indeed getting a segregating site given $\theta$):

$p(S_i \mid \theta) = \sum_{l=1}^{n-1} p(k_i \mid \theta)$.

Then, according to the algorithm, I just need to take the product over the $M$ sites (the 'for' loop) and combine this with my prior:

$p(\theta \mid k) \propto p(\theta) \prod_i p(k_i \mid \theta, S_i)$,

or, for short:

(2)    $p(\theta \mid k) \propto p(\theta) p(k \mid \theta, S)$.

This particular problem illustrates one general rule:

filtering the data $\implies$ repeat-until loops $\implies$ conditional likelihoods.

However, it is also interesting for another reason: there is apparently no simple way we can have $S$ appear at the correct place, in equation 2, by just blindly manipulating symbolic equations, as we often do in applied Bayesian inference, shuffling symbols around, sometimes to the left, sometimes to the right of the conditioning sign. We could try to write, for instance:

$p(\theta \mid k, S) \propto p(\theta \mid S) p(k \mid \theta, S)$,

but it is different from equation 2. More fundamentally, the meaning of what we are doing is not very clear. In particular, we have an ambiguity in the meaning of $p(\theta \mid k, S)$: usually, this would mean ‘the probability of $\theta$, given the observed data $k$ and given that $S$ is true’. However, what we really want in the present case is: ‘the probability of $\theta$, given the observed data $k$ and given that my data acquisition protocol is constrained to deliver something compatible with $S$’, which is different. Symbolic notation in terms of '$p(A \mid B)$' is just too sloppy. 

We could of course try to come up with better notations and still play the game of abstract symbolic calculus. However, when facing complex problems, perhaps it is just more helpful, and intuitively more illuminating, to directly rely on a semantic model of our probabilities. Rejection sampling, and more generally algorithmic models of our sampling protocol, provide such a semantic model. As a result, they represent a relatively foolproof method for deriving the correct version of our posterior distribution (or of our likelihood): we just need to spell out each of the probabilities, such as operationally defined by the algorithm, which gives us the solution.

Tuesday, October 28, 2014

Bayesian automatic control of model complexity

Still concerning Bayes factors and model selection, I wanted to show a little example, illustrating how one can rely on Bayesian model averaging and hierarchical priors to implement efficient and automatic control of model complexity -- model auto-tuning, as I like to call it.

First, I have simulated a single-gene amino-acid alignment of ~600 positions under a very simple model, using the LG amino-acid empirical matrix (Le and Gascuel, 2008) and assuming no variation in substitution rates among sites. For this simulation, I used as a template an alignment of elongation factor 2 in 30 species across eukaryotes. The simulation was done using the tree estimated on this dataset, thus under empirically reasonable settings.

Then, I have analyzed this simulated dataset directly using one of the most complex models I have implemented thus far: CAT-GTR + Gamma (Lartillot and Philippe, 2004). This model implements variation among sites in the rate of substitution (distributed across sites according to a discretized gamma distribution, with 4 categories), but also in the equilibrium frequency profile over the 20 amino-acids, using a Dirichlet process non-parametric prior. Finally, the 190 relative exchangeability parameters are also considered as unknown and are therefore estimated directly on the dataset.

According to accepted wisdom, one would think that this model is way too complex, way too rich, to be applied to such a small dataset. Even using a GTR model, instead of a LG model, in the present case would often be considered as excessive (190 free parameters! on a 600-site alignment!). Therefore, my estimation will certainly suffer from excessive variance.

Is this true?

Let us see, by just trying both models: the complex model, CAT-GTR + Gamma, and the true model, LG without Gamma, and see what comes up in terms of phylogenetic estimation (I let you guess which picture corresponds to which model):



Where is the excess variance here? There is virtually no detectable difference between the two estimates.

This represents a typical example showing that a complex model can automatically adapt to the specific case you are dealing with, making model selection totally dispensable. All this, however, is true only if your priors have been correctly designed, so as to get efficient shrinkage of your parameters, in particular under those circumstances where the data in fact contain none of the subtle variation that your model is meant to account for. So, perhaps I should spend some time explaining how my priors have been elicited in the present case. The main point is that these priors are hierarchical.

For the shape parameter $\alpha$ of the gamma distribution of rates across sites, I have used a log-uniform prior restricted between 0.1 and 100, thus allowing a broad range of variance among sites in relative rates. Large values of $\alpha$ effectively implement uniform rates across sites.

A Dirichlet process prior is used over the unknown distribution of amino-acid equilibrium frequency profiles across sites. This non-parametric prior effectively runs over infinite mixtures of amino-acid profiles. There are two main things that have to be controlled in the distribution of the components of the mixture: the relative weights of the components and the dispersion of the profiles attached to them.

The weights are regulated by a granularity parameter $\kappa$. For small values of $\kappa$, the mixture is such that one component of the mixture has an overwhelmingly large weight compared to the infinite series of all other components. In this regime, nearly all sites will be allocated to the large-weight component, and the model will implement strictly identical amino-acid profiles across sites. In contrast, for large values of $\kappa$, the Dirichlet process implements many components of very small weights, so that each site has effectively its own equilibrium frequency profile over amino-acids. Here, $\kappa$ is endowed with an exponential prior of mean 10, so that there is some room for implementing potentially rich mixtures by setting $\kappa$ to reasonably large values, but also to reduce the number of components to 1 or 2 (essentially, when $\kappa$ is less than 0.1).

As for the distribution of profiles across components, it is regulated by a concentration parameter, $\delta$. For large values of $\delta$, amino-acid profiles implemented by the mixture will all be very similar, whereas for small value of $\delta$, they will tend to be very different. Here, $\delta$ is endowed with an exponential prior of mean 20. So, it does not especially favor very similar components, but it can, if needed (when its value is typically of the order of 50 or above).

Finally, the prior on the exchangeability parameters is centered around the LG exchangeabilities, with a concentration hyperparameters $\zeta$ itself endowed with a log-uniform distribution between 0.1 and 100. Large values of $\zeta$ effectively shrink the exchangeabilities very close to those of LG.

Therefore, the most important priors and distributions are tuned by four key-parameters, such that, with large $\alpha$, large $\zeta$, small $\kappa$ and/or large $\delta$, the model reduces to the LG model with uniform rates and profiles across sites. Note that there are two redundant ways the model can come back to virtually identical profiles across sites: either by implementing one single component of overwhelming weight (small $\kappa$), or by just letting the amino-acid profiles of the mixture be all very close to each other (large $\delta$).

So, what happens exactly when I run this model on the dataset simulated under the simple LG-Uniform model?

First, the posterior distribution on $\alpha$, the shape parameter of the distribution of rates across sites (left) is shifted to very large values: $\alpha$ is inferred to be so large that, essentially, all relative rates of substitution across sites are all strongly shrunken around 1 (right) -- in other words, the model effectively implements uniform rates across sites, as should be done in the present case.



Then, what about variation in amino-acid profiles among sites? The figure below shows the posterior distribution on the number of occupied components:


as you can see, we get a large posterior probability for the configuration where all sites are allocated to one single component (pp = 0.6). Even when sites are distributed over more than one component (which still happens around 40% of the time), the components effectively implement nearly identical frequency profiles: $\delta$ is on average around 80 (not shown). Thus, efficient shrinkage of equilibrium frequency profiles across sites is achieved by a combination of discrete model selection (by playing on the number of occupied profiles) and continuous model averaging (by shrinking component-specific frequency profiles toward their mean).

We can have a more direct look at the average profiles estimated across sites, and check that they are indeed, for all practical means, constant across sites: 


It is also interesting to get a look at the accuracy of our estimation of the estimated relative exchangeabilities, by comparing them to the true values:




Such a beautifully accurate estimation of the relative exchangeabilities may in fact suggest that using a model centered on LG is just too easy... So, what happens if I use instead a uniform prior over exchangeabilities? In that case, shrinkage is somewhat less efficient, both in terms of the accuracy of the estimated exchangeabilities:



and in terms of the number of occupied components in the mixture (the posterior probability of getting just one occupied component is now less than 0.2):



On the other hand, $\delta$ is still large (again of the order of 80), which probably contributes to inducing a series of very similar amino-acid profiles across sites:



Finally, the estimated consensus tree is exactly the same as with the model centered on LG:



Thus, globally, there is no real loss incurred by using the more complex model, when in fact a much simpler model would be applicable to our data.

In contrast, on the real sequence alignment, there is rate variation among sites, and there is also a great deal of variation in amino-acid preferences along the sequence. We can get a clear picture of that by just running the CAT-GTR + Gamma model directly on the original empirical dataset of elongation factor sequences. Here, the posterior distributions that we get are totally different, in particular on the distribution of rates across sites ($\alpha$ is now clearly less than 1, of the order of 0.55) and concerning the distribution of amino-acid profiles across sites:


Other analyses with CAT-GTR on a now relatively large set of cases suggest that this is a very general pattern: on empirical sequence data, one always sees substantial variation among sites of both rates and amino-acid propensities.

Overall, one should probably not over-interpret those results, which all entirely rely on the fact that the true model is included in the set of possible configurations explored by CAT-GTR. In the real world, things are much more complicated: you almost certainly have other types of model violations (compositional heterogeneity among taxa, epistatic interactions between positions, due to the tertiary structure of the protein), which have been totally ignored in our analysis. One big question is, for instance: how does a model such as CAT-GTR react to these violations, compared to LG or to other models? Not an easy question, for sure.

However, this is not the kind of issues that you can hope to address by just comparing the empirical fit of the models. Addressing such problems requires more substantive arguments, about exactly what you want to estimate and how violations are likely to influence your estimation. In this direction, we may not know the specific impact of other model violations on CAT-GTR, but we already know that, by using LG, we are for sure creating one additional model violation, since we are ignoring what appears to be substantial and ubiquitous variation among sites in amino-acid propensities -- and this does have an impact on the accuracy of phylogenetic estimation (Lartillot et al, 2007).

But in any case, these model violation problems distract us from the main point of this post, which is more theoretical, being concerned with how Bayesian inference deals with model complexity. And the main message is this: unlike what happens in the context of maximum likelihood estimation, where we have to compensate for the intrinsic advantage of more complex models by implementing explicit model penalization and explicit model selection, in a Bayesian inference context, and using correctly specified priors, model complexity is just under automatic control. As a result: (1) it is possible to use a complex model by default, even on small datasets; this model will automatically reduce to the simpler model if needed (2) therefore, we do not need Bayes factors to be sure that we are using the right model, with the right complexity; (3) nor do we not need Bayes factors to see that, on a real dataset, there is variation in rate and in equilibrium frequency profiles: we just see it in the posterior distribution. And if we are really anxious, we can always check, on simulations under simpler models, that our Bayesian analysis would not infer such a variation if there weren't any.

===

Le, S. Q., & Gascuel, O. (2008). An improved general amino acid replacement matrix. Molecular Biology and Evolution, 25(7), 1307–1320. doi:10.1093/molbev/msn067

Lartillot, N., & Philippe, H. (2004). A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Molecular Biology and Evolution, 21(6), 1095–1109. doi:10.1093/molbev/msh112

Lartillot, N., Brinkmann, H., & Philippe, H. (2007). Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evolutionary Biology, 7 Suppl 1, S4. doi:10.1186/1471-2148-7-S1-S4


Monday, October 27, 2014

Multiple testing: a damnation or a blessing?

Just to follow up on a post by Joe Pickrell. I cannot agree more with Joe on the idea that, at least if your aim is to control for the false discovery rate (FDR), then multiple testing is not a damnation — it is, in fact, a blessing.

Why don't people see that the 'problem of multiple testing' has entirely disappeared with the advent of FDR ? Perhaps simply because FDR is still often presented as a method meant to 'correct for multiple testing', which I think is a very misleading way to see it.

As I already discussed, if you see FDR as a method correcting for multiple testing, then, somehow, this conveys the (wrong) idea that, the more tests you conduct, the more you have to fight against the combinatorial curse. But this is not what happens. FDR is just the fraction of true nulls among your rejected items, for a given threshold $\alpha$ on the p-values. Therefore, it depends only on the proportions of true nulls and true alternatives in your sample  not the absolute number of tests.

If, among 100 000 SNPs, using a cutoff of p < 1e-5 gives you an expected 10% FDR, this means that, among all findings with p-values below 1e-5, 10% are false discoveries. Now, you may randomly subsample your dataset, say, take one out of 10 of your original 100 000 SNPs: among those that have a p-value below 1e-5 in your subsample, you will obviously still have 10% of false discoveries. Thus, for a given threshold (here 1e-5), subsampling your original data does not decrease your FDR (or equivalently, for a given target FDR, you would not have to relax your threshold to analyze the subsample) -- which is however what you would expect if you were fighting against some combinatorial effect created by the sheer number of tests.

So, the absolute number of tests is not, per se, a damnation.

Conversely, an empirical Bayes perspective on the problem suggests that the FDR can be interpreted as the posterior probability that a rejected item is in fact a true null. This posterior probability is empirical, in the sense that it is based on priors (prior fraction of true nulls and prior distribution of effect sizes among true alternatives) that have been estimated directly on the dataset: these priors have a frequentist interpretation.

From this empirical Bayes perspective, it is immediately clear that, the more items you have, the better your priors are estimated. So, it is in fact a good idea to work on large datasets: it improves the quality of your FDR control -- and this, without causing any deterioration of your power.

So, the absolute number of tests is in fact, a blessing.

Now, of course, all this assumes that the proportion of true alternatives is the same, whichever the size of your dataset. So, this does not apply to the situations where you would for instance work on a subset of SNPs that have been pre-selected based on some objective criterion (say, SNPs close to protein coding genes), which you hope will increase the proportion of interesting items in your sample. Conversely, if working on large datasets generally means that you are working more indiscriminately (i.e. that you are working on sets that are likely to contain much more garbage), then, yes of course, you will have to use a more stringent threshold to achieve the same FDR. However, this is not, in itself, because you are working on a larger number of cases, but only because you have also swamped your interesting cases with a larger proportion of true nulls.

In fact, once you are comfortable with the empirical Bayes perspective on the concept of FDR, you can see ways of obtaining the best of both worlds: work on a large and indiscriminate set of cases, while at the same time enjoying the power that you would get from first enriching your sample based on good heuristic pre-selection criteria.

The idea is to make an explicit empirical Bayes model, in which your prior of being a true alternative now varies among items, in a way that systematically depends on the criteria that you would have used for pre-selection: essentially, doing a logistic regression model for the (now item-specific) prior of being a true alternative, with covariates provided by external criteria. The parameters of this model would be estimated on the dataset, by empirical Bayes. Doing this will automatically capture the potential enrichment effects within certain interesting subsets and will thus lead to an increase in your global number of findings, compared to the standard, uniform version of FDR analysis, while still controlling for the same level of global FDR.

Fundamentally, it is misleading to see FDR as a method for 'correcting for the devastating effects of multiple comparison'. It is far more adequate to see it as a method that takes advantage of the multiple comparison settings, to derive a frequentist control that is much more useful than the control of type I error.

So, yes, I am definitely up for a multiple-testing party!




Tuesday, October 7, 2014

Cases where Bayes factors can be useful

Here are some situations where explicit computation of Bayes factors could still be meaningful and useful:

Bayes factors can be used for comparing a small number of alternative tree topologies in the context of a phylogenetic analysis. In particular, if the topology that would have been considered the most reasonable one before the analysis turns out to be totally absent from the sample obtained by Monte Carlo, we would typcially want to now how strong the empirical evidence is against what we have deemed thus far to be true. It would therefore make sense to compute a Bayes factor in that case. This would have to be done explicitly, using thermodynamic integration or the stepping-stone method.

Another reason for explicitly comparing the marginal likelihoods of several candidate tree topologies is to make sure that the MCMC did not get stuck in a local maximum, returning a high apparent posterior probability for a tree that has in fact a low marginal likelihood.

On the other hand, the differences in log-marginal likelihoods between alternative (reasonable) topologies are usually very small compared to the absolute values of the log marginal likelihoods, pushing the numerical methods to their limits in terms of numerical accuracy.

Even in the situations mentioned in my previous post, where automatic model selection can easily be implemented, it can still be useful to compute marginal likelihoods or Bayes factors: just to make a point, for instance, about how a new model improves on the models currently considered as the standards of the field or, more generally, what features of a model are important in terms of empirical fit.

Finally, in those cases where it is difficult to implement an efficient reversible-jump MCMC for averaging over non-nested models, model averaging may sometimes be replaced by model selection through explicit calculation of marginal likelihoods (although it is often easy to come up with a more general model encompassing all of the non-nested models that we want to compare).

But then, these are relatively specific situations, quite distinct from the idea of systematically computing Bayes factors in order to first select the best model, and then conduct inference by parameter estimation, which is more specifically what I was arguing against.

For me, systematic and explicit model selection is one of those old reflexes inherited from maximum likelihood thinking -- but not really adapted to the context of Bayesian inference.


Friday, October 3, 2014

Model averaging versus model selection

Coming back to my last post about model selection using Bayes factors, in which I was suggesting that it is most often not necessary to implement explicit Bayes factor evaluation: the model-averaging aspect is in fact more important than I suggested. Indeed, it is precisely because you average over models that you do not need to go through explicit model selection.

I think there is something fundamental to understand here, concerning the difference between the maximum likelihood and the Bayesian approaches to the question of choosing models, so let me belabor the point.

In a maximum likelihood context, one would generally not use the more complex model by default. Often, one would instead use the simplest model, unless this model is rejected by the data in favor of a more complex alternative. This approach to model selection is implemented, for instance, in sequential tests that go from very simple substitution models like Jukes-Cantor to the general time-reversible parameterization, going through the whole series of intermediates (F81, HKY85, TN93, etc) sorted by increasing dimensionality.

However, this paradigm is, I think, inadequate in a Bayesian framework. In fact, the model ‘auto-tuning’ example I was mentioning in my last post suggests exactly the opposite: to always use the most general model by default, the one that has the highest dimension, given that it will automatically include whichever simpler alternative may turn out to be adequate for the particular dataset under consideration.

Why this difference between the two frameworks? I think there are two (related) reasons:

(1) point estimation versus marginalization: maximum-likelihood is an either/or method: either JTT or GTR. Of course, as a model, GTR does include JTT as a particular case, but in practice, you still have to choose between two distinct point estimates for the exchangeability parameters: either those of JTT, or those obtained by maximum likelihood estimation. So you really have two distinct, non-overlapping models at the end, and you will have to choose between them.

In contrast, in a Bayesian context, if you are using GTR, you are in fact averaging over all possible configurations for the exchangeability parameters, including the one represented by JTT. Seen from this perspective, it is clear that you are not choosing between two mutually exclusive options. Instead, choosing between GTR and JTT really means, choosing between a risky bet on one single horse (JTT) or a bet-hedging strategy consisting of averaging over all possible configurations for the unknown exchangeability parameters.

(2) optimization versus conditioning: as I previously explained, maximum likelihood is inherently prone to overfitting. Therefore, you should have good reasons to use a model that invokes additional parameters. In contrast, Bayesian inference works by conditioning, and therefore, does not have over-fitting problems.

Point (1) gives some intuition as to why we need explicit model selection in a maximum likelihood but not in a Bayesian analysis, while point (2) explains why we would typically conduct model selection by starting from the simplest model in a maximum likelihood framework, whereas we can safely opt directly for the more general model in a Bayesian context.

Now, averaging over models also means that we have to correctly define the measure over which to take this average: in other words, we have to correctly design our prior over the models, which is not necessarily something obvious to do. I have suggested some ways to do it properly in my last post (mostly, by having key hyperparameters that are able to drive the whole model configuration down to some remarkable submodels), but there is probably much more to say about this.

In any case, what all this suggests is that the focus, in Bayesian inference, should not be on explicit model selection. Instead, it should be on prior design — including the prior over models.


Wednesday, October 1, 2014

Selecting models without computing Bayes factors


Model comparison or selection is often considered to be a very important step in a typical Bayesian analysis. It is often done by calculating Bayes factors or, more generally, marginal likelihoods.

I used to be a fan of Bayes factors. I even spent quite some time on the numerical methods for reliably computing them, in the aim of comparing amino-acid substitution models or alternative relaxed molecular clocks. But I have progressively changed my mind about Bayes factors, or at least about the need to explicitly calculate them. 

First, numerical evaluation of marginal likelihoods is tricky. For quite some time, it was not even done properly: unreliable methods such as the harmonic mean estimator (‘the worst Monte Carlo method ever’) were most often used for that purpose. Now, better numerical methods are employed, such as thermodynamic integration or its discretized equivalent, the stepping-stone approach. But the reliability of these numerical methods comes at a price: they are computationally expensive.

Given the costs associated with the whole enterprise, it is therefore important to realize that these laborious numerical methods are most often unnecessary. A much simpler alternative is to make a general model encompassing all of the models that we wish to compare. Then, by just carefully designing the prior over the parameters of this model, what we obtain is a probabilistic system that, upon being conditioned on a given dataset, automatically selects the parameter regime effectively implementing the model that happens to have the highest fit for these data.

As a simple but not too trivial example, let us assume we have an amino-acid alignment and want to choose between two alternative substitution models: say, the empirical JTT matrix and the most general time-reversible model (GTR). The two models greatly differ in their dimensionality: the 190 exchangeability parameters are fixed to pre-determined values in the case of JTT, whereas they are estimated directly from the data in the case of GTR.

In this specific case, instead of numerically evaluating the marginal likelihoods of the two models, one could design a GTR model such that the prior on its exchangeability parameters would be centered on the exchangeabilities of JTT. The concentration of this prior around its center would be controlled by one single concentration parameter, which should in turn be endowed with a broad prior. Typically, I would use a log-uniform prior (possibly truncated to make it proper) or a half-Cauchy. Then, I would just condition the model on the data, without any further complicated model selection process. If the JTT model is really fit, then the concentration parameter will automatically take on large values under the posterior, and as a result, the exchangeabilities of the model will shrink towards those defined by JTT. In other words, the GTR model will effectively implement the JTT model if this is what the data prefer. Importantly, this automatic 'collapse' of the model is driven by just one hyperparameter, so that the approach should work well even for relatively small datasets.

Note that the model will never exactly collapse onto the simpler model: using this approach, the exchangeabilities of the GTR model are never exactly equal to those of the JTT model. In all regimes, the model always has its full dimensionality. Is this really a problem?

A typical objection here would be that our estimation might suffer increased variance. However, strictly speaking, this is true only if the data have been produced by a JTT model, which is never the case in practice, except in simulation studies. Even in simulation studies, I would be surprised to see any detectable difference in terms of estimation error (above the MCMC noise) between the hierarchical model settings suggested above and the true model: thanks to the specific hierarchical design of the prior, the extra-variance induced by the more flexible parameterization should be very small — technically, the effective number of parameters is much smaller than the nominal number of parameters.

More fundamentally, nobody believes that JTT is the true model for any real data, and therefore, it does not really make sense to see it as a relevant point null hypothesis to be tested against a more general alternative, as would be implied by a Bayes factor analysis between JTT and GTR. Instead, JTT at best represents a good first guess for the exchangeabilities, to be used when the sequence alignment under investigation is too small to provide this information entirely from scratch. But then, a first guess should probably not be formalized as a point null. Instead, it should be seen as a good center for the prior — which is exactly what the hierarchical prior suggested here does: using JTT as a first guess, while implementing some empirically adjustable flexibility around it.

This method can easily be generalized to deal with all sorts of model selection problems. In all cases, this will be done by defining hierarchical priors in a way that ensures that the model will be able to effectively collapse onto a simpler model if needed. Many examples can be imagined: a large value for the shape parameter of the gamma distribution of rates across sites effectively implements the uniform-rate model; efficient shrinkage of branch-specific equilibrium frequencies for the substitution model can effectively emulate a time-homogeneous model; or a suitably chosen prior over partition models gives the data the opportunity to efficiently funnel the model onto uniform substitution matrices or uniform branch lengths across partitions, if relevant.

Technically, the approach could be seen as a case of model averaging. However, the expression ‘model averaging’ is meant to emphasize the ‘averaging’ aspect: the fact that you don’t want to bet 100% on one single model. Here, the averaging is perhaps not the most important point (even if it does play some role). For that reason, I would call this procedure model auto-tuning, thus referring to the fact that the model automatically finds the most adequate parameter regime.
In any case, I believe that substantial amounts of computational resources could be saved, simply by not computing Bayes factors between phylogenetic models: in most cases, this is at best unnecessary and, at worst, just an idle exercise.


Tuesday, April 22, 2014

FDR and the single-gene experiment



Imagine that you want to test for the presence of positive selection in a given gene in the human lineage, based on a protein alignment and using a codon model. Your data consist of aligned sequences from 5 mammals (human, chimp, macaque, mouse, rat and dog). You apply the branchwise test of codeml (Nielsen and Yang, 1998) on this alignment, with the branch leading to humans after the divergence from chimps taken as the focal branch. The test returns a p-value of 0.01 against the null model (i.e. the model in which the human lineage is just like all other branches making up the background).

With a p-value of 0.01, is it legimitate to consider that this gene is under positive selection in the human lineage? I guess that most of us would consider that it is legitimate.

Yet, in the context of a genome-wide scan for positive selection, based on the complete genomes of the same 5 mammalian species, Kosiol et al (2008) found that, in order to control for a false discovery rate (FDR) of 5% when detecting episodes of positive selection along the human lineage, one has to impose a significance threshold of p<0.000025. This threshold is 400 times smaller than 0.01. I have not made the exact calculation here, but I would not be surprised if a threshold of 0.01 in this context would imply something like a FDR of more than 50%, and a local fdr (i.e. among findings with a p-value at 0.01) of more than 80 % --  for the purpose of the following discussion, let us assume that this is the case.

Thus, if your gene of interest were analyzed, not in isolation, but in conjunction with all other protein coding sequences across these 5 mammalian genomes, then one would not have considered it as significantly under positive selection. Instead, one would have concluded that it was most probably a false discovery.

So, what should we do here? Are we supposed to adopt double standards and react differently when confronted to the same p-value, depending on whether it is considered in the context of a single-gene or a multiple-gene analysis? Or should we somehow transpose what we have obtained in the context of the FDR-controlled genome-wide scan to a new individual case? More fundamentally, what are we to conclude here, concerning the presence of positive selection on our gene in the human lineage?

At first sight, one could possibly argue that, technically, the two situations are not comparable. FDR is often presented as something meant to correct for multiple comparisons. When analyzing a gene in isolation, there is no multiple comparison involved, and therefore, no correction is needed. Thus, one should not transpose FDR results to a single case.

However, I think that this would be a fallacious argument. Fundamentally, what the FDR calibration tells you is that 50% of the genes for which p<0.01 are false discoveries in the present context. If your gene of interest has nothing special, then it is just like any gene taken at random and for which p<0.01, and therefore it has at least 50% chances of being a false discovery (in fact, 80% because it is at 0.01). As far as I can see, it does not matter whether or not this gene was tested in isolation or in conjunction with other genes.

Personally, I think that it is misleading to see FDR just as a way to 'correct for multiple comparisons'. For me, FDR is fundamentally an empirical Bayes posterior probability: the probability that a case is a true null given that it was rejected. This posterior probability is empirical in the sense that it is based on a prior that has been empirically estimated on a large series of cases. But apart from that, it is just an evidential probability.

Seen from this perspective, it becomes clear that the only difference between the multiple-testing and the single-gene settings is that the multiple-testing framework gives you the means to estimate your priors -- and therefore, in the present case, the means to know that you are in an unfavorable situation where a p-value of 0.01 just does not represent sufficient evidence for positive selection. In contrast, the single-gene experiment merely prevents you from knowing that a p-value of 0.01 does not represent much in the present case -- but of course, not knowing a problem does not imply that the problem does not exist.

Now, you could still object one point against this argument: you could claim that a single-gene analysis is different from a genome-wide scan, in that the gene under study has generally not been chosen uniformly at random among all possible genes. After all, one of the reasons that make the FDR so stringent is that the overall probability that a gene taken at random along the genome is under positive selection in humans is just very low. Therefore it takes strong empirical evidence to turn this very low prior probability into a reasonably high posterior probability. However, your gene may be different in this respect: you may have decided to study that gene precisely because you have good reason to believe that it is under positive selection -- for instance, a gene involved in the immune response against pathogens. If this is the case, then perhaps you could be a bit less stringent when interpreting the p-value returned by the test specifically for that gene.

Maybe. But, then, first, you should note that this is fundamentally a Bayesian argument, which amounts to invoking a special prior in the particular case of your gene of interest.

Second, assuming that you accept to play this Bayesian game, it remains to be determined how you can then justify your special prior. The typical subjective Bayes approach, relying on introspection and verbal arguments to that effect, may be fine in other situations in which one can show that the end result is not too sensitive to the prior anyway. But here, it is not the case: the prior is precisely the critical aspect of the question: it is the only factor that can make a difference between your gene and any gene with p=0.01.

For me, if you really want to claim that immunity-related genes are different and are more likely to be under positive selection, then you have to let the data speak by themselves on this issue. Which means in practice that, instead of searching your soul and coming up with 'your' prior probability for immunity-related genes to be under positive selection, you should use some well-accepted gene ontology, select the subset of genes that are involved in immunity and redo the FDR analysis on this subset of genes. If genes involved in immunity are indeed more likely to be under positive selection, this will automatically show up in the form of a less stringent significance threshold compared to the global genome-wide analysis -- perhaps up to the point that your gene will now be included in the genes significantly under positive selection at a FDR of 5% or 10%.

But as long as you do not let empirical data confirm your prior guess, any claim that your gene is different has just not been vindicated. In which case the global genome-wide FDR control represents, by default, our best calibration for assessing the chances that your gene, with p=0.01, is indeed under positive selection -- and based on what this FDR calibration says, these chances appear to be rather low.

So, in conclusion, once the fundamental meaning of FDR as an evidential (posterior) probability is understood, then it becomes clear that ignoring FDR calibrations derived from genome-wide analyses when interpreting the results of small-scale experiments is not rational. For me, it is like refusing to use what has been learned in the past on a large series of examples when considering a new similar but isolated case.

And indeed, there tends to be a surprising disconnect in our collective statistical practice, between the rather stringent thresholds, usually in the range of p<1e-4 or even p<1e-6, most often imposed in typical genome-wide analyses in which FDR control is explicitly implemented (like GWAS, or genome-wide scans for positive selection), and the substantially more liberal attitude, in the range of p<5e-2, most often adopted in the context of small-scale studies in which FDR is not explicitly considered.

The only possible logical justification for this difference in our attitude would be that we are generally confident that our small scale studies typically concern genes, or SNPs, for which the prior odds in favor of the existence of a true effect are typically 100x higher than for a random gene or SNP along the genome. Yet, I am not totally sure that we have such good reasons to be so confident in our pre-selected candidates.

I am not saying that one should always conduct FDR calibrations. Sometimes it is just too difficult to find a good reference set. But if, each time we can find a reasonably good reference set (and genes or SNPs represent an easy case in this respect), we are surprised by how stringent we have to be in order to control for a decent rate of false discovery, then perhaps it means something about how excessively liberal we tend to be when we don't have the means to know our true FDR.

==

Nielsen R, Yang Z (1998) Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148: 929–936.

Kosiol, C., Vinar, T., da Fonseca, R. R., Hubisz, M. J., Bustamante, C. D., Nielsen, R., & Siepel, A. (2008). Patterns of positive selection in six Mammalian genomes. PLoS Genet, 4(8), e1000144. doi:10.1371/journal.pgen.1000144

Waking up from hibernation


This blog has not been very active lately.. I was too busy on other things. But there are still many different subjects that I would like to discuss here.

Let me post one or two additional comments about posterior probabilities, false discovery rates and general issues in current statistics. After that, I would like to move on to other subjects more directly related to statistical evolutionary biology.

stay tuned!



Thursday, February 27, 2014

Revised standards: p-values and false discoveries



Since we are again talking about p-values, evidence and scientific false discoveries (see my last post), I would like to come back to Valen Johnson's article of last fall (Johnson, 2013). Even if the main ideas are apparently not totally new (see e.g. Berger and Sellke, 1983), Johnson's article revives an important question, and this, apparently, at a very good moment (see also a similar point in Nuzzo, 2014).

The argument of Johnson's article relies on Bayes factors and uses a relatively technical concept of uniformly most powerful Bayesian tests. For that reason, it may give the impression that the whole question is about the choice between Bayes or frequentism, or between p-values or Bayes factors. Yet, for me, the whole problem could more directly be presented of as a false discovery rate issue.

In this direction, what I propose here is a relatively simple derivation of Johnson's main point (at least as I understand it). Apart from emphasizing the false discovery aspect, this derivation does not rely on uniformly most powerful Bayesian tests and is therefore hopefully easier to understand.

Suppose you conduct a one-sided test: you observe $X$ normally distributed of mean $\theta$ and variance 1 and want to test $H_0$: $\theta=0$ against $H_1$: $\theta > 0$. For instance, you observe a log-ratio of expression levels for a given gene between two conditions, and want to deternine whether the gene is over-expressed.

Imagine that you observe $x = 1.646$. This is slighty greater than $x_0 = 1.645$, the usual threshold for a type I error rate of 5%. Thus, you have obtained what is usually considered as marginally significant evidence in favor of over-expression, with a p-value just slightly less than 0.05.

Now, what does that imply in terms of the chances that this gene is indeed over-expressed?

This question sounds Bayesian: it asks for the posterior probability that $H_1$ is true conditional on the empirically observed value of $X$. But you can ask a similar question in a frequentist perspective: consider the collection of all one-sided normal tests (with known variance) that have been published over the last 10 years in evolutionary biology, and for which a marginal significance was reported. What fraction of them are false discoveries?

To answer that question, we need to contemplate the complete collection of all one-sided normal tests that were conducted (not just those that were published because they turned out to be significant): this represents a total of $N = N_0 + N_1$ tests, among which $N_0$ were true nulls and $N_1$ were true alternatives.  Let us call $p_0 = N_0 / N$ and $p_1 = N_1 / N$ the fractions of true nulls and alternatives, and $\theta_i$, $i=1..N_1$ the collection of true effect sizes across all true alternatives.

The probability of obtaining a marginally significant result ($x_0 \pm \delta x$) for a null case is $ p(X=x_0 \mid H_0) \delta x$. For a given non-null case with effect size $\theta_i$, it is $ p(X=x_0 \mid \theta_i) \delta x$. Therefore, the expected total number of marginally significant discoveries over the $N$ tests is just:

$n = N_0  p(X = x_0 \mid H_0) \delta x + \sum_{i=1}^{N_1} p(X = x_0 \mid \theta_i) \delta x$

which, upon dividing by $N \delta x$, can be rewritten as:

$\frac{n}{N \delta x} = p_0  L_0 + p_1 \bar L_1$

where $L_0 = p(X=x_0 \mid H_0)$ and $\bar L_1$ is the average likelihood under the alternative cases:

$\bar L_1 = \frac{1}{N_1} \sum_{i=1}^{N_1} p(X = x_0 \mid \theta_i)$.

The fraction of false discoveries is simply the contribution of the first term:

$fdr = p_0 L_0 / (p_0 L_0 + p_1 \bar L_1) = p_0 / (p_0 + p_1 B)$

where $B$ is:

$B = \bar L_1 / L_0$.

($B$ can be seen as an empirical version of the Bayes factor between the two hypotheses, but this is not essential for the present derivation.)

The average likelihood under $H_1$, $\bar L_1$, is less than the maximum likelihood under $H_1$, $\hat L_1$, which is here attained for $\hat \theta = x_0$. Using the formula for a normal density gives:

$B < B_{max} = \frac{\hat L_1}{L_0} = e^{\frac{1}{2} x_0^2}$

or equivalently:

$fdr > fdr_{min} = \frac{p_0}{p_0 + p_1 B_{max}}$.

Some numbers here. For $x_0 = 1.645$, $B_{max} = 3.87$. If $p_0 = p_1 = 0.5$, $fdr > 0.20$. In other words, at least 20% of your marginally significant discoveries are false. And still, this assumes that half of the tests were conducted on true alternatives, which is a generous assumption. If only 10% of the tested hypotheses are true ($p_1 = 0.1$), then, $fdr > 0.70$.

And all this is generous for yet another reason: it assumes that all your true alternatives are at $\theta=1.645$, the configuration that gives you the smallest possible local fdr. Reasonable distributions of effect sizes under the alternative can easily result in even higher false discovery rates. For example, if, under $H_1$, $\theta$ is distributed according to the positive half of a normal distribution of mean 0 and standard deviation of 3, then $B$ is less than 1, which implies more than 50% of false discoveries if half of the tested hypotheses are true, and more than 90% of false discoveries if 10% of the tested hypotheses are true.

I guess it is fair to say that most tested hypotheses are in fact false ($p_1$ is most probably less than 0.1) -- if most tested hypotheses were true, then this would mean that we already know most of what we are inquiring about, and thus research would just be an idle confirmatory exercise. It is also fair to say that the whole point of scientific hypothesis testing is to reverse this unfavorable proportion by filtering out the majority of non-findings and publishing an enriched set hopefully composed of a majority of true effects. Yet, as we can see here, this is not what happens, at least if we focus on marginal discoveries.

The entire argument above assumes that the variance of $X$ around $\theta$ is known. If it is unknown, or more generally if there are nuisance parameters under the null, things are a bit more complicated. The exact quantitative results also depend on the sampling model (normal or other). However, for the most part, the message is probably valid in many other circumstances and is very simple: what we call marginally significant findings are most often false discoveries.

Now, we can quibble over many details here, discuss the practical relevance of this result (do we really need to define a threshold for p-values?), object that significance is just one aspect of the problem (you can get very significant but weak effects), etc etc. Nevertheless, one should probably admit one thing: many of us (me included) have perhaps not adopted the correct psychological calibration in the face of p-values and have tended to over-estimate the significance of marginally significant findings.

In other words, it is probably fair to admit that we should indeed revise our standards for statistical evidence.

Also, we should more often think in terms of false discovery rate: it tells us important things that cannot be understood by just looking at p-values and type I errors.

Independently of this theoretical argument, it would certainly be interesting to conduct meta-studies here: based on the collection of reported p-values, and using standard algorithms like the one of Benjamini and Hochberg, 1995, one can retrospectively estimate the fraction of false discoveries across all published results in the context of molecular evolution, comparative or diversification studies, for instance (all of which have often relied on relatively weak effects), over the last ten years. I would be curious about the outcome of such meta-analyses.

===

Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association, 82:112.

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110:19313.

Nuzzo, R. (2014, February 13). Scientific method: statistical errors. Nature, pp. 150–152. doi:10.1038/506150a