Bayesian Data Science: The What, Why, and How

The choice between frequentist and Bayesian approaches has been one of the great statistical debates of the last century, and Bayesian methods have recently seen a surge in adoption across the sciences.

Number of articles referring to Bayesian statistics on sciencedirect.com (April 2024). Graph by the author.

What’s the difference?

The philosophical difference is actually quite subtle; some even propose that Fisher, the great critic of Bayesian methods, was himself a Bayesian in some regard. While countless articles delve into the differences in formulas, what are the practical benefits? What does Bayesian analysis offer the everyday data scientist that the vast array of widely adopted frequentist methods does not already? This article aims to give a practical introduction to the motivation, formulation, and application of Bayesian methods. Let’s dive in.

Prior Beliefs

While frequentists treat probability as the long-run frequency of outcomes in repeated experiments, the Bayesian viewpoint is more subjective: probability expresses a degree of belief. Subjectivity and statistics?! Yes, the two are actually compatible.

Let’s start with something simple, like a coin flip. Suppose you flip a coin 10 times, and get heads 7 times. What is the probability of heads?

P(heads) = 7/10 (0.7)?

Obviously, the small sample size is a problem here. From a Bayesian point of view, however, we are allowed to encode our beliefs directly, asserting that if the coin is fair, the chances of heads and tails must be equal, i.e. 1/2. While the choice seems pretty obvious in this example, the debate becomes more nuanced when we get to more complex, less obvious phenomena.

Yet, this simple example is a powerful starting point, highlighting both the greatest benefit and shortcoming of Bayesian analysis:

Benefit: dealing with a lack of data. Suppose you are modeling the spread of an infection in a country where data is scarce. Will you derive all your insights from the little data you have? Or would you want to factor commonly seen patterns from similar countries into your model, i.e. informed prior beliefs? Although the choice is clear, it leads directly to the shortcoming.

Shortcoming: the prior belief is hard to formulate. For example, if the coin is not actually fair, it would be wrong to assume that P(heads) = 0.5, and there is almost no way to find the true P(heads) without a long-run experiment. In this case, assuming P(heads) = 0.5 would actually be detrimental to finding the truth. Yet every statistical model (frequentist or Bayesian) must make assumptions at some level, and the ‘statistical inferences’ in the human mind are actually a lot like Bayesian inference, i.e. we construct prior belief systems that factor into our decisions in every new situation. Additionally, formulating a wrong prior belief is often not a death sentence from a modeling perspective either, as long as we can learn from enough data (more on this in later articles).
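To make the coin example concrete, here is a minimal sketch of how a prior belief and the data combine. It assumes a Beta prior on P(heads), which is a common modeling choice but not one this article prescribes; the helper name `posterior_mean` and the prior strength (10 pseudo-heads, 10 pseudo-tails) are purely illustrative.

```python
def posterior_mean(heads, tails, prior_heads=10.0, prior_tails=10.0):
    """Posterior mean of P(heads) under a Beta(prior_heads, prior_tails) prior.

    The Beta prior acts like pseudo-observations: Beta(10, 10) behaves as if
    we had already seen 10 heads and 10 tails, i.e. a belief that the coin is
    roughly fair. (Assumed prior, chosen only for illustration.)
    """
    return (prior_heads + heads) / (prior_heads + prior_tails + heads + tails)


# 7 heads in 10 flips: the raw frequency is 0.7, but the prior pulls the
# estimate back toward 0.5.
small_sample = posterior_mean(heads=7, tails=3)

# The same 70% heads rate with 100x more data: the prior barely matters.
large_sample = posterior_mean(heads=700, tails=300)

print(f"10 flips:   {small_sample:.3f}")   # 17/30, about 0.567
print(f"1000 flips: {large_sample:.3f}")   # 710/1020, about 0.696
```

With only 10 flips the "fair coin" prior dominates. With 1000 flips the estimate sits close to the observed 0.7, which is the sense in which a wrong prior can be overcome by enough data.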

Bayes’ Rule

So what does all this look like mathematically? Bayes’ rule lays the groundwork. Let’s suppose we have a parameter θ that defines some model which could describe our data (e.g. θ could represent the mean, the variance, the slope with respect to a covariate, etc.). Bayes’ rule states that

Thomas Bayes formulated Bayes’ theorem in the 1700s; it was published posthumously. [Image via Wikimedia Commons, licensed under Creative Commons Attribution-Share Alike 4.0 International, unadapted]

P(θ = t | data) ∝ P(data | θ = t) × P(θ = t)

In simpler words,

  • P(θ = t | data) represents the conditional probability that θ is equal to t, given our data (a.k.a. the ‘posterior’).
  • Conversely, P(data | θ = t) represents the probability of observing our data if θ = t (a.k.a. the ‘likelihood’).
  • Finally, P(θ = t) is simply the probability that θ takes the value t before seeing any data (the infamous ‘prior’).

So what’s this mysterious t? It can take many possible values, depending on what θ means. In fact, you want to try a lot of values and check the likelihood of your data for each. This is a key step, and you really hope that you have checked the best possible values for θ, i.e. those that cover the region where the likelihood of seeing your data is highest (the global maximum of the likelihood, for those who care).

And that’s the crux of everything Bayesian inference does!

  1. Form a prior belief for possible values of θ,
  2. Scale it by the likelihood of the observed data at each θ value, and
  3. Return the computed result (normalized so that it sums to 1), i.e. the posterior, which tells you the probability of each tested θ value.
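The three steps above can be sketched in a few lines for the coin example (7 heads in 10 flips). This is only an illustration: the 101-point grid and the triangular "probably fair" prior are assumptions of this sketch, not choices made elsewhere in the article.

```python
from math import comb

# Observed data from the coin example: 7 heads in 10 flips.
heads, flips = 7, 10

# Candidate values t for theta = P(heads): 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]

# Step 1: prior belief for each candidate. This triangular prior peaks at
# 0.5 ("probably fair") and is an assumption made for illustration.
prior = [1 - abs(2 * t - 1) for t in grid]

# Step 2: binomial likelihood of the data at each candidate value.
likelihood = [
    comb(flips, heads) * t**heads * (1 - t) ** (flips - heads) for t in grid
]

# Step 3: multiply, then normalize so the posterior sums to 1 over the grid.
unnormalized = [p * lik for p, lik in zip(prior, likelihood)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

# The most probable candidate lands between the prior's peak (0.5) and the
# raw frequency (0.7).
best = grid[posterior.index(max(posterior))]
print(f"Most probable theta on the grid: {best:.2f}")
```

Because `posterior` holds a probability for every grid point, you can read off more than a single best value, for example the total probability that θ exceeds 0.5.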

Graphically, this looks something like:

Prior (left) scaled by the likelihood (middle) forms the posterior (right) (figures adapted from Andrew Gelman’s book). Here, θ encodes the east-west location coordinate of a plane. The prior belief is that the plane is more towards the east than the west. The data challenges the prior, and the posterior thus lies somewhere in the middle. [image using data generated by author]

This highlights the next big advantages of Bayesian statistics:

  • We have an idea of the entire shape of θ’s distribution (e.g. how wide the peak is, how heavy the tails are, etc.), which can enable more robust inferences. Why? Simply because we can not only better understand but also quantify the uncertainty (as compared to a traditional point estimate with a standard deviation).
  • Since the process is iterative, we can constantly update our beliefs (estimates) as more data flows into our model, making it much easier to build fully online models.
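The iterative updating described in the last point can be sketched as follows. The sketch assumes a Beta(a, b) belief about a coin's P(heads), for which each flip simply increments a count; the `update` helper and the flat Beta(1, 1) starting prior are illustrative choices, not part of any particular library.

```python
def update(belief, flip):
    """Update a Beta(a, b) belief about P(heads) with one flip ('H' or 'T')."""
    a, b = belief
    return (a + 1, b) if flip == "H" else (a, b + 1)


flips = "HHTHHTHHTH"  # 7 heads, 3 tails, arriving one at a time

belief = (1.0, 1.0)  # Beta(1, 1): a flat prior, assumed for illustration
for flip in flips:
    belief = update(belief, flip)  # yesterday's posterior is today's prior
    a, b = belief
    print(f"after {flip}: mean P(heads) = {a / (a + b):.3f}")

# Processing all ten flips in one batch gives exactly the same posterior.
batch = (1.0 + flips.count("H"), 1.0 + flips.count("T"))
print(belief == batch)
```

The online and batch results agree, so a model of this kind can keep learning from a stream of data without revisiting old observations.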

Easy enough! But not quite…

This process involves a lot of computation, since you have to calculate the likelihood for each possible value of θ. Okay, maybe this is easy if θ lies in a small range like [0, 1]. We can just use the brute-force grid method, testing values at discrete intervals (10 intervals of width 0.1, or 100 intervals of width 0.01, or more… you get the idea) to map the entire space at the desired resolution.

But what if the space is huge, and god forbid additional parameters are involved, like in any real-life modeling scenario?

Now we have to test not only the possible parameter values but also all their possible combinations, i.e. the solution space expands exponentially, rendering a grid search computationally infeasible. Luckily, physicists have worked on the problem of efficient sampling, and advanced algorithms exist today (e.g. Metropolis-Hastings MCMC, variational inference) that can efficiently explore high-dimensional parameter spaces and concentrate on the regions of high posterior probability. You don’t have to code these complex algorithms yourself either; probabilistic programming tools like PyMC or Stan make the process highly streamlined and intuitive.
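To give a feel for what such samplers do under the hood, here is a toy random-walk Metropolis sampler, the simplest member of the Metropolis-Hastings family, applied to the coin example (7 heads, 3 tails). The flat prior, step size, number of samples, and burn-in length are all assumptions of this sketch; production tools use far more sophisticated algorithms and diagnostics.

```python
import math
import random


def log_posterior(theta, heads=7, tails=3):
    """Unnormalized log posterior for P(heads) under a flat prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero prior probability outside (0, 1)
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(n_samples=20000, step=0.3, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio); otherwise stay put.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


samples = metropolis()
kept = samples[2000:]  # discard early samples as burn-in (assumed length)
estimate = sum(kept) / len(kept)
print(f"Estimated posterior mean of P(heads): {estimate:.3f}")
```

The sampler never evaluates a full grid. It wanders through parameter space and spends most of its time where the posterior is high, which is why the same idea scales to many parameters. For this model the exact posterior is Beta(8, 4) with mean 2/3, so the estimate can be checked against a known answer.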

Stan

Stan is my favorite, as it interfaces with common data science languages like Python, R, Julia, and MATLAB, which aids adoption. Stan relies on state-of-the-art Hamiltonian Monte Carlo sampling techniques that typically converge in a reasonable time for well-specified models. In my next article, I will cover how to get started with Stan for simple as well as not-so-simple regression models, with a full Python code walkthrough. I will also cover the full Bayesian modeling workflow, which involves model specification, fitting, visualization, comparison, and interpretation.

Follow & stay tuned!