Almost immediately upon beginning to learn about statistics, students are introduced to the “frequentist vs. Bayesian debate”. The debate (so the story goes) is about whether probabilities refer strictly to outcomes of repeated experiments, or whether they refer more broadly to subjective degrees of belief. Thus, the Bayesian/frequentist debate concerns the philosophical foundation of probability.
There is something strange about this explanation. Usually, when there’s some debate over the foundations of a subject in maths or physics, it has little or no impact on the vast majority of the work going on in that field. Since the different interpretations don’t alter the equations, most practitioners can ignore the debate and carry on as normal.
The debate between frequentist and Bayesian interpretations is different. It is not possible as a practitioner to just ignore it and carry on as normal. This is because the debate is not really about foundations, at least not primarily. It’s about methodology. Here, I’m using methodology in a broad sense: which questions are worth asking, and what constitutes an answer. Even if you’re not interested in philosophical foundations, by posing and answering a research question you are working within a Bayesian or a frequentist paradigm.
So, what are the methodologies favoured by Bayesians and frequentists? Broadly speaking, frequentist statistics is about approximating and optimising, while Bayesian statistics is about conditioning and sampling.
Frequentist verbs: approximating and optimising
Put simply, the goal of frequentist statistics is to give a single (best) estimate of some parameters. Often, this comes up when trying to estimate the parameters of a probability distribution. The best estimate is the one under which the observed data are most probable; in other words, the one that maximises the likelihood. The problem of finding this estimate is known as maximum likelihood estimation (MLE).
If we know how to calculate the likelihood, and possibly its derivatives, the MLE problem can be solved using standard tools from the field of optimisation. This is great news, because numerical optimisation is a highly developed field; many efficient algorithms have been developed, most of which are conveniently wrapped into standard software packages. A few examples are gradient descent, conjugate gradients, Newton’s method and interior-point methods.
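As a minimal sketch of MLE as an optimisation problem, the snippet below fits the rate of an exponential distribution using Newton's method. The data values and the function name `exponential_mle` are made up for illustration, and the hand-written Newton loop stands in for what a library optimiser would normally do. The rate is optimised on the log scale so that it stays positive.

```python
import math


def exponential_mle(data, tol=1e-12, max_iter=100):
    """Maximise the exponential log-likelihood with Newton's method.

    With theta = log(rate), the log-likelihood is
        n * theta - exp(theta) * sum(data),
    which is concave in theta.
    """
    n = len(data)
    total = sum(data)
    theta = 0.0  # starting guess: rate = 1
    for _ in range(max_iter):
        grad = n - math.exp(theta) * total   # first derivative
        hess = -math.exp(theta) * total      # second derivative (always negative)
        step = grad / hess
        theta -= step                        # Newton update
        if abs(step) < tol:
            break
    return math.exp(theta)


# Hypothetical waiting times, used only to exercise the code.
waiting_times = [0.5, 1.2, 0.3, 2.1, 0.9, 1.5, 0.7, 0.4]

rate_hat = exponential_mle(waiting_times)
closed_form = len(waiting_times) / sum(waiting_times)  # known analytic MLE
print(rate_hat, closed_form)
```

For this model the maximum is also available in closed form (the number of observations divided by their sum), which gives a convenient check on the numerical answer. In less tidy models no such formula exists and the optimiser does all the work.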
Bayesian verbs: conditioning and sampling
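Instead of seeking a single best estimate of parameters, the goal of Bayesian statistics is to characterise their distribution given the observed data. This is known as inference, or conditioning. The evidence from observations is combined with a prior distribution to give a posterior distribution (our answer) using Bayes’ rule:
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
Here $\theta$ denotes the parameters and $D$ the observed data; $p(\theta)$ is the prior, $p(D \mid \theta)$ the likelihood, and $p(\theta \mid D)$ the posterior.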
In the simplest case, when working with nice analytical distributions like Gaussians, inference can be carried out exactly. If the posterior has no closed form, it is necessary to use techniques of approximate inference. There are two main families of techniques. One option is to approximate the posterior distribution stochastically using Monte Carlo sampling methods; Markov chain Monte Carlo (MCMC) is probably the most influential of these. The alternative is to try to approximate the distribution using some simpler, analytic family of distributions. This latter approach is known as variational inference.
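To make the contrast between exact and sampled inference concrete, here is a small sketch for the mean of a Gaussian with known noise level and a Gaussian prior. In this case the posterior is available exactly, so it can be compared with a random-walk Metropolis sampler, one simple member of the MCMC family. The data, prior settings, step size and function names are all assumptions chosen for illustration.

```python
import math
import random


def exact_posterior(data, prior_mean, prior_sd, noise_sd):
    """Closed-form Gaussian posterior for the mean (known noise_sd)."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = len(data) / noise_sd ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + sum(data) / noise_sd ** 2)
    return post_mean, math.sqrt(post_var)


def log_posterior(mu, data, prior_mean, prior_sd, noise_sd):
    """Log prior plus log likelihood, up to an additive constant."""
    log_prior = -0.5 * ((mu - prior_mean) / prior_sd) ** 2
    log_lik = sum(-0.5 * ((x - mu) / noise_sd) ** 2 for x in data)
    return log_prior + log_lik


def metropolis(log_target, start, n_samples, step_sd, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    samples = []
    current, current_lp = start, log_target(start)
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step_sd)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


# Hypothetical measurements and prior settings.
data = [4.8, 5.3, 5.1, 4.6, 5.4, 5.0]
prior_mean, prior_sd, noise_sd = 0.0, 10.0, 1.0

exact_mean, exact_sd = exact_posterior(data, prior_mean, prior_sd, noise_sd)

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = metropolis(
    lambda mu: log_posterior(mu, data, prior_mean, prior_sd, noise_sd),
    start=0.0, n_samples=20000, step_sd=0.5, rng=rng,
)
kept = draws[2000:]  # discard early draws as burn-in
sample_mean = sum(kept) / len(kept)
sample_sd = math.sqrt(sum((d - sample_mean) ** 2 for d in kept) / (len(kept) - 1))

print("exact:  ", exact_mean, exact_sd)
print("sampled:", sample_mean, sample_sd)
```

Note that the sampler only ever needs the posterior up to a constant, so the normalising term in Bayes' rule never has to be computed. That is what makes sampling usable when the exact route is closed.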
Interestingly, variational inference works by minimising a measure of the difference between the target and surrogate distributions (specifically, the KL-divergence). In a sense, it is a return to frequentist ideas about optimisation, but in this case carried out one level higher: the optimisation searches over the parameters of an approximating distribution, rather than over values of the model parameters themselves.
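The idea can be sketched in a few lines. Below, a Gaussian surrogate is fitted to a skewed target (a standard Gumbel density, chosen purely as an example) by minimising the KL-divergence from surrogate to target. The divergence is computed by a simple sum over a grid and minimised by gradient descent with finite-difference gradients. Both are shortcuts that only work in one dimension, and practical variational inference uses stochastic gradient estimates instead. The grid range, step size and function names are assumptions.

```python
import math


def log_target(x):
    """Log-density of a standard Gumbel distribution (the example target)."""
    return -x - math.exp(-x)


def kl_objective(mean, log_sd, grid, dx):
    """KL(q || p) for Gaussian q = N(mean, exp(log_sd)^2), summed over a grid."""
    sd = math.exp(log_sd)
    total = 0.0
    for x in grid:
        log_q = (-0.5 * ((x - mean) / sd) ** 2
                 - log_sd - 0.5 * math.log(2.0 * math.pi))
        total += math.exp(log_q) * (log_q - log_target(x)) * dx
    return total


def fit_gaussian(steps=300, lr=0.1, eps=1e-4):
    """Gradient descent on (mean, log_sd) using central finite differences."""
    dx = 0.05
    grid = [-8.0 + i * dx for i in range(321)]  # covers -8 to 8
    mean, log_sd = 0.0, 0.0
    for _ in range(steps):
        grad_mean = (kl_objective(mean + eps, log_sd, grid, dx)
                     - kl_objective(mean - eps, log_sd, grid, dx)) / (2 * eps)
        grad_log_sd = (kl_objective(mean, log_sd + eps, grid, dx)
                       - kl_objective(mean, log_sd - eps, grid, dx)) / (2 * eps)
        mean -= lr * grad_mean
        log_sd -= lr * grad_log_sd
    return mean, math.exp(log_sd)


vi_mean, vi_sd = fit_gaussian()
print(vi_mean, vi_sd)
```

For this particular target the minimiser can be worked out by hand (mean 0.5, standard deviation 1), which gives a check on the numerical result. The fitted Gaussian is symmetric while the target is not, so the answer is a compromise: the price paid for replacing sampling with optimisation.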
I will try to revisit some of these concepts in later posts. Meanwhile, I hope this post has helped to explain what people mean by a Bayesian or a frequentist approach, and what the two look like in practice.