Thursday, January 9, 2014
Shrinking gene trees
A particularly interesting example of empirical Bayes in current phylogenetics is provided by the gene-tree species-tree reconciliation methods (Arvestad et al, 2003; Akerborg et al, 2009; see also a related maximum a posteriori approach, Boussau et al, 2013).
There are many reasons to be interested in reconstructing gene trees, and then trying to interpret these trees in terms of gene duplication and loss (and horizontal gene transfer, but let us ignore that for the moment). In particular, it is useful for assessing whether genes are paralogous or orthologous (you just need to look at their last common ancestor in the tree, and see whether this is a duplication or a speciation node).
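The last-common-ancestor rule can be sketched in a few lines of Python. This is only an illustration on a made-up, already reconciled gene tree: the tree, the gene names and the helper functions below are hypothetical, and real reconciliation software will represent trees differently.

```python
# Hypothetical toy gene tree, already reconciled: node -> (parent, event).
# Internal nodes are labelled "duplication" or "speciation"; tips are "leaf".
GENE_TREE = {
    "root": (None, "speciation"),
    "n1": ("root", "duplication"),
    "n2": ("n1", "speciation"),
    "n3": ("n1", "speciation"),
    "human_A": ("n2", "leaf"),
    "mouse_A": ("n2", "leaf"),
    "human_B": ("n3", "leaf"),
    "mouse_B": ("n3", "leaf"),
    "fish": ("root", "leaf"),
}


def ancestors(tree, node):
    """Return the path from node up to the root, node included."""
    path = []
    while node is not None:
        path.append(node)
        node = tree[node][0]
    return path


def last_common_ancestor(tree, a, b):
    """Return the first node on b's path to the root that is also above a."""
    seen = set(ancestors(tree, a))
    for node in ancestors(tree, b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def relationship(tree, a, b):
    """Classify two distinct genes by the event at their last common ancestor."""
    if a == b:
        raise ValueError("expected two distinct genes")
    event = tree[last_common_ancestor(tree, a, b)][1]
    return "paralogs" if event == "duplication" else "orthologs"


print(relationship(GENE_TREE, "human_A", "mouse_A"))  # LCA is n2, a speciation
print(relationship(GENE_TREE, "human_A", "mouse_B"))  # LCA is n1, a duplication
```

In this toy tree, human_A and mouse_A coalesce at a speciation node, whereas human_A and mouse_B only coalesce at the older duplication node.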
However, genes usually have relatively short coding sequences. After aligning genes and trimming poorly aligned regions, we typically end up with something between 100 and 1000 coding positions. This leaves little signal for inferring a phylogenetic tree with high support.
On the other hand, the process of gene duplication and loss along the species tree can be modelled. If the species tree, the duplication rate and the loss rate are known, this process represents a reasonable mechanistic prior distribution over the set of all possible gene phylogenies.
If we do not know the species tree and the duplication and loss rates, we can adopt an empirical Bayes approach, i.e. devise a hierarchical model that will borrow strength across gene families for estimating the global parameters (the species tree and the duplication and loss rates), while inferring gene trees by local application of Bayes rule.
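To make the structure of the hierarchical model explicit, here is one way to write it down. The notation is mine and not taken from the cited papers: D_i is the alignment of gene family i (out of N families), G_i its gene tree, S the species tree, and lambda and mu the duplication and loss rates. Branch lengths and substitution parameters are left out for brevity, and the cited methods differ in how they treat the global parameters, so this is a sketch of the general idea rather than of any specific implementation.

```latex
% Local application of Bayes rule for gene family i,
% with the duplication-loss process acting as the prior on gene trees:
p(G_i \mid D_i, S, \lambda, \mu) \;\propto\; p(D_i \mid G_i)\, p(G_i \mid S, \lambda, \mu)

% Empirical Bayes estimate of the global parameters,
% obtained by maximizing the marginal likelihood over all N families:
(\hat{S}, \hat{\lambda}, \hat{\mu}) \;=\; \arg\max_{S, \lambda, \mu}
  \prod_{i=1}^{N} \sum_{G_i} p(D_i \mid G_i)\, p(G_i \mid S, \lambda, \mu)
```

The sum over G_i is what lets the families share information: each family contributes to the estimate of S, lambda and mu, and these estimates in turn define the prior used for every individual gene tree.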
The Bayesian gene-tree species-tree reconciliation approach can therefore be seen as a typical empirical Bayes shrinkage estimator: gene trees are shrunk toward the species tree. And there is indeed quite some shrinkage here: gene trees are substantially more accurate and better supported than if reconstructed using standard phylogenetic methods that do not consider the species tree (Akerborg et al, 2009, Boussau et al, 2013).
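A toy calculation shows how this shrinkage plays out for a single gene family. All numbers below are invented for illustration: three candidate topologies, a weak sequence likelihood that barely distinguishes them, and a prior (standing in for the duplication-loss model) that favours the topology matching the species tree.

```python
def posterior(likelihoods, prior):
    """Apply Bayes rule over a finite set of candidate gene-tree topologies."""
    joint = {tree: likelihoods[tree] * prior[tree] for tree in likelihoods}
    total = sum(joint.values())
    return {tree: value / total for tree, value in joint.items()}


# Invented numbers: the alignment is short, so the likelihood is nearly flat.
likelihoods = {"((A,B),C)": 0.40, "((A,C),B)": 0.35, "((B,C),A)": 0.25}

# Hypothetical duplication-loss prior favouring the species-tree topology.
reconciliation_prior = {"((A,B),C)": 0.80, "((A,C),B)": 0.10, "((B,C),A)": 0.10}
flat_prior = {tree: 1.0 / 3.0 for tree in likelihoods}

without_species_tree = posterior(likelihoods, flat_prior)
with_species_tree = posterior(likelihoods, reconciliation_prior)

for tree in likelihoods:
    print(tree, round(without_species_tree[tree], 3), round(with_species_tree[tree], 3))
```

With the flat prior, the best topology gets a posterior of only 0.40; with the informative prior, its support rises above 0.8. Whether that added support is deserved depends, of course, on the prior being well estimated.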
Also, clade posterior probabilities (at the level of gene trees) are asymptotically calibrated. In the context of probabilistic orthology analysis (Sennblad and Lagergren, 2009), this means that there is good asymptotic control of the false discovery rate (false orthology/paralogy assignment).
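The link between calibration and false discovery can be made concrete. If posterior probabilities are calibrated, then among all orthology calls made with posterior probability p, a fraction 1 - p is expected to be wrong, so the expected false discovery rate of a set of calls is just the average of 1 - p over that set. The function name and the posterior values below are made up for illustration.

```python
def expected_fdr(posteriors, threshold):
    """Expected fraction of false calls among pairs called at or above threshold.

    Assumes the posterior probabilities are calibrated. Returns 0.0 when no
    pair passes the threshold (a convention chosen for this sketch).
    """
    called = [p for p in posteriors if p >= threshold]
    if not called:
        return 0.0
    return sum(1.0 - p for p in called) / len(called)


# Invented posterior probabilities of orthology for four gene pairs.
orthology_posteriors = [0.99, 0.95, 0.90, 0.60]

# Three pairs pass the 0.9 threshold: (0.01 + 0.05 + 0.10) / 3
print(expected_fdr(orthology_posteriors, 0.9))
```

In practice one would choose the threshold so that this expected rate stays below whatever level is acceptable for the downstream analysis.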
All this is true only asymptotically, but I guess that, for most practical purposes, we can equate "genome-wide" with "asymptotic".
And all this is true only under the assumptions of the model, but good frequentist properties are always conditional on correct model specification.
Still, if we want to question the assumptions of the model: one dubious assumption is the hypothesis that duplication and loss rates are homogeneous across gene families and across the species tree. If needed, however, it is always possible to add some flexibility, such as branch-specific or gene-specific rates, or a combination of both. This simply adds additional levels to the hierarchical model, without representing major theoretical challenges in terms of asymptotic accuracy and calibration.
Another potentially important issue concerning model specification is whether the other aspects of the model (substitution process, relaxed clock, etc) are reasonable. But again, this can be worked out.
Finally, there are still difficult computational challenges on the side of species tree estimation. In this respect, I sometimes have the impression that even the most sophisticated Monte Carlo approaches will not be sufficient to efficiently sample species trees. For this, some computational short-cuts are probably needed. One possibility is to use a maximum a posteriori approach instead of integrating over gene trees (Boussau et al, 2013).
Akerborg, O., Sennblad, B., Arvestad, L., & Lagergren, J. (2009). Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proceedings of the National Academy of Sciences of the United States of America, 106(14), 5714–5719.
Arvestad, L., Berglund, A.-C., Lagergren, J., & Sennblad, B. (2003). Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics, 19(Suppl 1), i7–i15.
Boussau, B., Szöllősi, G. J., Duret, L., Gouy, M., Tannier, E., & Daubin, V. (2013). Genome-scale coestimation of species and gene trees. Genome Research, 23(2), 323–330.
Sennblad, B., & Lagergren, J. (2009). Probabilistic orthology analysis. Systematic Biology, 58(4), 411–424.