Monday, February 3, 2014

Soft and hard shrinkage

Shrinkage based on hierarchical priors can be seen as a way of reducing the effective number of parameters of a model.

Consider for instance the problem of allowing for amino-acid compositional variation through time in the context of phylogenetic estimation. In a maximum likelihood framework, and for small datasets (short sequences), we cannot afford unconstrained estimation of a distinct equilibrium frequency vector over the 20 amino-acids (19 free parameters) on each branch: this would presumably lead to over-fitting problems (although, again, I would like to see whether it would have such dire consequences in practice, but let us assume that it is the case).

A clever solution to this problem (Groussin et al, 2013) is to identify the first one or two principal components in the empirically observed compositional variation among taxa. Biologically, most of the compositional variation of proteomes is due to two independent factors: an underlying GC bias and a correlation with growth temperature (although the latter is true only in prokaryotes). Therefore, the first two components should absorb a large proportion of the total compositional variation.Then, a constrained model is implemented, allowing for compositional variation across branches only along these two principal directions. This reduces the number of parameters from 19 to 2 per branches, thus making the model suitable for shorter alignments.

Now, what would be the hierarchical Bayesian solution to this problem? Typically, one would consider the equilibrium frequencies across branches i.i.d. from a multivariate normal prior (for this to be well defined, equilibrium frequencies have to be log-transformed and reduced to 19 free parameters, see Lartillot, 2014). If $x_j$ is the 19-dimensional parameter associated with branch j, then:

$x_j \sim N(\mu, \Sigma)$

Here, $\mu$ is a 19-vector and $\Sigma$ a 19x19 covariance matrix. Both are hyperparameters, which would then be estimated by borrowing strength across branches.

This hierarchical prior will implement self-tuned shrinkage of equilibrium frequency profiles across branches in a direction-dependent manner (in the 19-dimensional space). Mathermatically $\Sigma$ can be decomposed into 19 independent variance components (corresponding to its eigenvalues). The prior will be loose in the directions corresponding to large eigenvalues, and tight in the directions corresponding to small eigenvalues. Thanks to this mechanism, the model will allow for compositional variation mostly along the first principal directions of $\Sigma$ (large eigenvalues), while shrinking more strongly along all other directions (small eigenvalues). Since $\Sigma$ is estimated on the dataset, its first principal directions will roughly correspond to the first few principal directions of the empirically observed compositional variation among taxa, which are also the ones selected by the method of Groussin et al (2013).

So, the two methods should do very similar things.

In this sense, hierarchical Bayes shrinkage indeed amounts to reducing the effective number of parameters, although in a soft manner: there is no truncation, no thresholding, and therefore no parameter-counting nor any model selection by AIC or BIC. Instead, there is an implicit context-dependent penalization imposed on the parameter vector.

In fact, we do not need to be Bayesian to use this kind of soft penalization. Alternatively, one could imagine to maximize a penalized likelihood, such as:

$\ln L - \lambda tr(V)$

Here $\ln L$ is the log of the likelihood, $V$ would be the empirical covariance matrix (the scatter matrix) of the (suitably log-transformed) equilibrium frequencies across branches, and $\lambda$ would be the strength of penalization. I don't know if this specific penalization scheme is the most reasonable one, but intuitively, the trace of $V$ is the sum of the variances along the principal directions, and therefore, the penalization term will implement shrinkage by allowing compositional variation only along the principal directions where a significant increase of the likelihood can be gained.

At first sight, this penalization scheme looks a bit like what is called ridge regression in statistics.

In any case, I just wanted to illustrate the idea of "soft" shrinkage: both in the Bayesian and the penalized likelihood framweork, we still have all the degrees of freedom in the model, and yet, the model is not over-parameterized -- its effective number of parameters is under control.

I tend to believe that, when modeling nuisances, soft shrinkage is better than hard shrinkage by dimensional reduction, fundamentally because soft shrinkage does not "throw away" part of the nuisances, as is done when using hard truncation. This may be a relatively minor problem in the case of protein composition, where the first two components explain a large proportion of the total variance. In other cases, however, hard truncation may result in models turning a blind eye to a significant proportion of the nuisances that were supposed to be accounted for. Soft shrinkage, in contrast, will embrace the whole of those nuisances, and should therefore lead to more accurate estimation of the parameters of interest.

Hierarchical Bayes will implement regularization by soft-shrinkage more or less automatically, and this is fundamentally what makes it such an attractive paradigm. On the other hand, if one really wants to stay in a maximum likelihood paradigm (in particular, for computational reasons), while being flexible with respect to the nuisances, then perhaps maximum penalized likelihood, instead of classical maximum likelihood, should be considered more seriously, in particular in phylogenetics (Kim and Sanderson, 2008).


Groussin, M., Boussau, B., & Gouy, M. (2013). A branch-heterogeneous model of protein evolution for efficient inference of ancestral sequences. Syst Biol, 62(4), 523–538. doi:10.1093/sysbio/syt016

Kim, J., & Sanderson, M. J. (2008). Penalized likelihood phylogenetic inference: bridging the parsimony-likelihood gap. Syst Biol, 57(5), 665–674. doi:10.1080/10635150802422274

Lartillot, N. (2014). A phylogenetic Kalman filter for ancestral trait reconstruction using molecular data. Bioinformatics. doi:10.1093/bioinformatics/btt707


  1. This is a very nice paper indeed.

    I know that the methods and goals are different, but I'm a little surprised that they don't cite work by Zoller and Schneider (, which also applies a dimension-reduction approach to model estimation.

  2. thanks for pointing out. I must admit that I did not know that paper myself (I should probably have).

  3. Hi Erick and Nicolas,
    We knew this paper, and it's a mistake we did not cite it. It was definitely not on purpose, we just forgot to cite it. Sorry about that!