Tuesday, 21 June 2011

Bayesian vs Frequentist. A pragmatic bioinformatician's view

Ian Holmes, who I did my graduate studies together in Sanger (two biologists/coders in a room with happy house blearing out in the late 90s. Makes me feel old just thinking about it) was chastising me about my last post about statistical methods not having anything about Bayesian statistics.

Which, let's face it, was an oversight, because Bayesian statistics totally rocks in some scenarios in bioinformatics. At one level Bayesian statistics is straightforward (just write down as a probability model what you believe, sum over any uncertainity, eh voila, your answer), but at another level is profound; almost a way of life for some statisticians.

If you haven't met Bayesian statistics, it is really worth learning about it (I highly recommend the "Biological sequence analysis book" by Durbin, Eddy, Krogh, Mitchenson which introduces probabilistic modelling in sequence bioinformatics if you are a bioinformatician and used to sequences). The rather profound viewpoint of Bayesian statistics is that your confidence in believing something is true is related to the probability you think this will happen given a model. By either having alternative models, or by having different parameters inside of one model one can choose which model "is most likely" (ie, fits the observed data the best).

In contrast, the other view of statistics - so called "Frequentist" statistics - sets up the problem as a multiple trial problem with a null hypothesis. The null hypothesis is usually some interpretation of "random chance". By imaging the probability of a particular data point under the null hypothesis one either accepts (ie, it was reasonably likely to occur by chance) or rejects the null hypothesis; if you reject the null hypothesis there is normal an alternative which represents "means are different" or "correlation exists". The Frequentist viewpoint is to focus on modelling the "null hypothesis".

Better people than I have written about the difference between Bayesian and Frequentist view points - and have provided the arguments that these unify conceptually. Previously this was a almost religious schism in statistics, and now my feeling is there is more agreement that both perspectives are valid. People often get distracted by either the philosophical aspects of Bayesian statistics - for example, Bayesian statistics insists you provide a "prior belief" of your model before you see any data. This sounds like something you don't ask for on the Frequentist side, though Bayesians would argue that the choice of null and alternative hypotheses and in particular the distribution of the null hypothesis is basically all "prior belief". In Bayesian statistics there is a particularly neat business of being able to factor out variables which you know are important, but you don't know how to measure/estimate - in Bayesian statistics you can "just" sum over all possible values of that variable, weighted by your prior belief. (The "just" in the previous sentence hides often quite a bit of magic - doing the straightforward "sum" is often hard to impossible, and much of magic in Bayesian statistics is working out a way to do this sum in maths - ie, as a calculable integral - rather than explicitly. However, Bayesians can always fall back on Markov Chain Monte Carlo - which is in effect randomly sampling and walking around the space of possibilities. It's a bit like the frequentists approach of using permutation to get an effective null).

But the real benefit of Bayesian statistics is that it is fully compatible with models of the data. And those models can be as sophisticated as you like, as long as you can calculate the probability of observing a dataset, given the model (this is called the "likelihood"). In sequence based bioinformatics this is invaluable - all our understanding of sequence evolution on phylogenetic trees, or base substitutions, or codon behaviour, or exon/intron rules we can write down sensible models of this behaviour. If we then had to work out some "frequentist" based null model as background for this it is just plain... weird... to think about some null model, and almost impossible outside of generating "random" sequence (itself a rather non trivial thing to do right) and feeding it into metrics. Basically Bayesian statistics, due to the fact that it's totally at home with explicit models of the data, is a great fit to sequence analysis, and frequentist statistics just doesn't work as well.

In my PhD and the early days of running Ensembl, everything I did had to do with one or other aspect of sequence analysis. In particular Genewise, which is a principled combination of homology matching and gene prediction, was impossible to think about outside of Bayesian style statistics. Genewise was at the core of both Ensembl and in fact Celera genomics models of the human and mouse genome, and this rather venerable algorithm still happily chugs away at the core of a variety of genome annotation tools (including still some parts of Ensembl) although newer methods - such as Exonerate - work far faster with almost no loss in power.

This got me thinking about why I am not using Bayesian (or let's face it, it's poor cousin, maximum likelihood methods; most "Bayesian" methods actually don't do so much summing over uncertainity but rather take a maximum likelihood point. Whenever you get out an alignment, you are doing maximum likelihood) so much at the moment. And that's really because I've shifted into more open analysis of ENCODE data, where I don't trust my own internal models of what is going on - can anyone really "know" what the right model of a promoter or a enhancer is? Or whether even these words "enhancer" and "promoter" are the right things? And when we compare experimental results to predictions, I want to make the least number of assumptions and just ask the simple question - did my prediction process "predict" the outcome of the experiment? Even though sometimes those predictions are set up with a Bayesian style formalism, but often with a very naive model, I want to be as naive as possible in the next part of the test. So - I find myself most comfortable with the non-parameteric statistics (either a Chi-sq - category vs category, or Wilcoxon/Kruskal - category vs continuous, or Spearman's Rho - continuous vs continuous test). As a good Bayesian perhaps one can write down the "most naive models of association as possible" - itself something I don't quite know how to - and then do a likelihood/posterior calculation, but ... it feels like this is going to be the same as the non parametric statistics anyway. And it's far easier to say "Success in the Medaka fish enhancer assays were significantly associated with the prediction class of element (Chi-sq test, Pvalue, 0.0007)" rather than "We then set up a naive Bayesian model for the the association of the Medaka fish enhancer assays to the prediction class of elements, with flat prior beliefs of the association or not, and considering all possible proportions of association; The Bayes factor of 78.9 indicates that there is high confidence in the model with association" - not least because I think it would give the same result.

Perhaps in the future this area of research will come back to Bayesian statistics, but I think this will be because we are confident in our models. Once you want to have more sophisticated models, one is almost forced to be more Bayesian.

But, I should have had Bayesian statistics in my top 5 statistical techniques. Or said the top 6. Because they do totally rock in the right place.

So - you are right Ian :)

Saturday, 18 June 2011

Five statistical things I wished I had been taught 20 years ago

I came through the English educational system, which meant that although I was mathematically minded, because I had chosen biochemistry for my undergraduate, my maths teaching rapidly stopped - in university I took the more challenging "Maths for Chemists" option in my first year, though in retrospect that was probably a mistake because it was all about partial differentiation, and not enough stats. Probably the maths for biologists was a better course, but even that I think spent too much time on things like t-test and ANOVA, and not enough on what you need. To my subsequent regret, no one took my aside and said "listen mate, you're going to be doing alot of statistics, so just get the major statistical tools under your belt now".

Biology is really about stats. Indeed, the foundation of much of frequentist statistics - RA Fisher and colleagues - were totally motivated by biological problems. We just lost the link the heyday of molecular biology when you could get away with n=2 (or n=1!) experiments due the "large effect size" - ie, band is either there or not - style of experiment. But now we're back to working out far more intertwined and subtle things. So - every biologists - molecular or not - is going to have to become a "reasonable" statistician.

These are the pieces of hard won statistical knowledge I wish someone had taught me 20 years ago rather than my meandering, random walk approach.

1. Non parametric statistics. These are statistical tests which make a bare minimum of assumptions of underlying distributions; in biology we are rarely confident that we know the underlying distribution, and hand waving about central limit theorem can only get you so far. Wherever possible you should use a non parameteric test. This is Mann-Whitney (or Wilcoxon if you prefer) for testing "medians" (Medians is in quotes because this is not quite true. They test something which is closely related to the median) of two distributions, Spearman's Rho (rather pearson's r2) for correlation, and the Kruskal test rather than ANOVAs (though if I get this right, you can't in Kruskal do the more sophisticated nested models you can do with ANOVA). Finally, don't forget the rather wonderful Kolmogorov-Smirnov (I always think it sounds like really good vodka) test of whether two sets of observations come from the same distribution. All of these methods have a basic theme of doing things on the rank of items in a distribution, not the actual level. So - if in doubt, do things on the rank of metric, rather than the metric itself.

2. R (or I guess S). R is a cranky, odd statistical language/system with a great scientific plotting package. Its a package written mainly by statisticians for statisticians, and is rather unforgiving the first time you use it. It is defnitely worth persevering. It's basically a combination of excel spreadsheets on steriods (with no data entry. an Rdata frame is really the same logical set as a excel workbook - able to handle millions of points, not 1,000s), a statistical methods compendium (it's usually the case that statistical methods are written first in R, and you can almost guarantee that there are no bugs in the major functions - unlike many other scenarios) and a graphical data exploration tool (in particular lattice and ggplot packages). The syntax is inconsistent, the documentation sometimes wonderful, often awful and the learning curve is like the face of the Eiger. But once you've met p.adjust(), xyplot() and apply(), you can never turn back.

3. The problem of multiple testing, and how to handle it, either with the Expected value, or FDR, and the backstop of many of piece of bioinformatics - large scale permutation. Large scale permutation is sometimes frowned upon by more maths/distribution purists but often is the only way to get a sensible sense of whether something is likely "by chance" (whatever the latter phrase means - it's a very open question) given the complex, hetreogenous data we have. 10 years ago perhaps the lack of large scale compute resources meant this option was less open to people, but these days basically everyone should be working out how to appropriate permute the data to allow a good estimate of "surprisingness" of an observation.

4. The relationship between Pvalue, Effect size, and Sample size - this needs to be drilled into everyone - we're far too trigger happy quoting Pvalues, when we should often be quoting Pvalues and Effect size. Once a Pvalue is significant, it's higher significance is sort of meaningless (or rather it compounds Effect size things with Sample size things, the latter often being about relative frequency). So - if something is significantly correlated/different, then you want to know about how much of an effect this observation has. This is not just about GWAS like statistics - in genomic biology we're all too happy about quoting some small Pvalue not realising that with a million or so points often, even very small deviations will be significant. Quote your r2, Rhos or proportion of variance explained...

5. Linear models and PCA. There is a tendency often to jump to quite complex models - networks, or biologically inspired combinations, when our first instinct should be to crack out the well established lm() (linear model) for prediction and princomp() (PCA) for dimensionality reduction. These are old school techniques - and often if you want to talk about statistical fits one needs to make gaussian assumptions about distributions - but most of the things we do could be either done well in a linear model, and most of the correlation we look at could have been found with a PCA biplot. The fact that these are 1970s bits of statistics doesn't mean they don't work well.

Kudos to Leonid Kruglyak for spotting some minor errors