Berkson's Paradox

2025 Jun 18

Berkson's paradox occurs when sampling bias makes two independent variables appear correlated.

In particular, the sampling bias arises because you condition on a "collider variable": an outcome that is influenced by both of the independent variables. Restricting the sample to one value of the collider makes the two independent variables appear correlated.
In Berkson's paradox, conditioning creates false correlation.

Example 1: college admissions

A mid-tier college admits students based on basketball ability OR math scores. What's the probability a student has good math scores, given they're good at basketball?

Counter-intuitively, the two can be negatively correlated among admitted students.

At a low-tier college, students need not be strong in either.
At a high-tier college, students have to be good at both.
But at a mid-tier college, students are probably good at one thing: either math or basketball.

So if you find a basketball player at a mid-tier college, they're likely weak in math. Two independent variables appear correlated due to sampling bias.

Here the collider variable is "admitted to college", while the independent variables are basketball ability and math scores.
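The admissions example can be checked with a small simulation. This is a minimal sketch under assumptions that are not in the original: basketball ability and math score are drawn as independent standard normal values, the mid-tier college admits anyone whose score in either one exceeds a cutoff of 1.0, and the names `pearson` and `simulate_admissions` are made up for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_admissions(n=100_000, cutoff=1.0, seed=0):
    """Return (correlation among all applicants, correlation among admitted)."""
    rng = random.Random(seed)
    # Assumption: the two abilities are independent standard normals.
    applicants = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    # Assumption: admitted if basketball OR math exceeds the cutoff (the collider).
    admitted = [(b, m) for b, m in applicants if b > cutoff or m > cutoff]
    r_all = pearson([b for b, _ in applicants], [m for _, m in applicants])
    r_admitted = pearson([b for b, _ in admitted], [m for _, m in admitted])
    return r_all, r_admitted


r_all, r_admitted = simulate_admissions()
print(f"all applicants: r = {r_all:+.3f}")
print(f"admitted only:  r = {r_admitted:+.3f}")
```

Across all applicants the correlation stays close to zero, while among admitted students it comes out clearly negative, even though nothing in the simulation links the two abilities.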

Example 2: who wrote this bad book?

Consider 4 categories of author:

A&R (Adept & Resilient): 40 books, 75% good
A&NR (Adept & Not Resilient): 5 books, 80% good
NA&R (Not Adept & Resilient): 40 books, 20% good
NA&NR (Not Adept & Not Resilient): 5 books, 20% good

Suppose you find a bad book. What's the probability that each author type wrote it?
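Bayes' rule needs two ingredients: the prior probability of each author type and the overall probability of a bad book. Assuming the prior is simply each type's share of the 90 books, the law of total probability gives:

$$ \begin{align} P(A,R) &= P(NA,R) = 40/90 \approx 0.444 \\ P(A,NR) &= P(NA,NR) = 5/90 \approx 0.0556 \\ P(Bad) &= \sum_{type} P(Bad \mid type) \times P(type) \\ &= (0.25 \times 40 + 0.20 \times 5 + 0.80 \times 40 + 0.80 \times 5)/90 \\ &= 47/90 \approx 0.522 \end{align} $$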

$$ \begin{align} P(A,R \mid Bad) &= P(Bad \mid A,R) \times P(A,R) / P(Bad) \\ &= 0.25 \times 0.444 / 0.522 \\ &\approx 0.21 \end{align} $$

$$ \begin{align} P(A,NR \mid Bad) &= P(Bad \mid A,NR) \times P(A,NR) / P(Bad) \\ &= 0.20 \times 0.0556 / 0.522 \\ &\approx 0.021 \end{align} $$

$$ \begin{align} P(NA,R \mid Bad) &= P(Bad \mid NA,R) \times P(NA,R) / P(Bad) \\ &= 0.80 \times 0.444 / 0.522 \\ &\approx 0.68 \end{align} $$

$$ \begin{align} P(NA,NR \mid Bad) &= P(Bad \mid NA,NR) \times P(NA,NR) / P(Bad) \\ &= 0.80 \times 0.0556 / 0.522 \\ &\approx 0.085 \end{align} $$
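The Bayes' rule arithmetic can be reproduced with exact fractions. This is a sketch rather than a definitive implementation: it assumes the prior for each author type is its share of the 90 books, and the names `AUTHORS` and `posterior_given_bad` are hypothetical.

```python
from fractions import Fraction

# (adept, resilient) -> (books written, fraction of them that are good),
# copied from the four author categories listed in the text.
AUTHORS = {
    (True, True): (40, Fraction(75, 100)),
    (True, False): (5, Fraction(80, 100)),
    (False, True): (40, Fraction(20, 100)),
    (False, False): (5, Fraction(20, 100)),
}


def posterior_given_bad(authors):
    """Return (P(type | Bad) for each type, P(Bad)) via Bayes' rule."""
    total_books = sum(books for books, _ in authors.values())
    # Joint probability P(Bad, type) = P(Bad | type) * P(type).
    joint = {
        kind: (1 - good) * Fraction(books, total_books)
        for kind, (books, good) in authors.items()
    }
    p_bad = sum(joint.values())  # law of total probability
    return {kind: p / p_bad for kind, p in joint.items()}, p_bad


posterior, p_bad = posterior_given_bad(AUTHORS)
print(f"P(Bad) = {float(p_bad):.3f}")
for (adept, resilient), p in posterior.items():
    label = ("A" if adept else "NA") + "&" + ("R" if resilient else "NR")
    print(f"P({label} | Bad) = {float(p):.3f}")

# Marginals among bad books only, to compare with the joint posterior.
p_adept = posterior[(True, True)] + posterior[(True, False)]
p_resilient = posterior[(True, True)] + posterior[(False, True)]
print(f"P(A | Bad) * P(R | Bad) = {float(p_adept * p_resilient):.3f}")
```

The exact posteriors are 10/47, 1/47, 32/47 and 4/47 (about 0.21, 0.021, 0.68 and 0.085), and they sum to 1. Among bad books, the product of the marginals (about 0.209) no longer equals the joint probability of an adept, resilient author (about 0.213), so the two traits are independent across all books but not once you condition on "bad".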

As expected, the not-adept, resilient author stands out. What might be surprising is that the adept, resilient author has a fairly high posterior probability too. Because of their resilience, the A&R author writes many books, and so also churns out quite a few bad ones.

Here the collider variable is "wrote a bad book", while the independent variables are "adeptness" and "resilience".