Our First Causal Question in Real Life
Selection Bias
Access to health care insurance is a huge political issue in the US. Subsidizing the provision and mandating the adoption of insurance were at the core of the heavily debated Affordable Care Act, also known as Obamacare.
Policy: Subsidize and/or mandate health care insurance for the entire population.
Rationale: Increasing access to health care (through insurance) can improve the health outcomes of the population.
Let's look at some data to investigate this rationale.
This is just a random sample of 100 observations from the real dataset. The complete data contains 80634 observations (individuals).
What tools from the course (so far) should we use to look at this data?
Causality: what are we talking about?
We say that X causes Y
if we were to intervene and change the value of X without changing anything else...
then Y would also change as a result.
The key point here is the "without changing anything else" part, often referred to as the other-things-equal assumption (or ceteris paribus if you want to sound fancy).
"Correlation does not equal causation" has become a ubiquitous mantra, but can you explain why it is true?
Some correlations obviously don't imply causation (e.g. the Spurious Correlations website).

But not all correlations are so easy to rule out.
Does smoking cause lung cancer?
Today, we know the answer is YES!
But let's go back to the 1950s.
We are at the start of a big increase in deaths from lung cancer...
... which is happening after a fast growth in cigarette consumption
It's very tempting to claim that smoking causes lung cancer based on this graph.
At the time many people were still skeptical, including some famous statisticians:
Macro confounding factors:
Other macro factors which can cause cancers also changed between 1900 and 1950:
Tarring of roads,
Inhalation of motor exhausts (leaded gasoline fumes),
General greater air pollution.
Self-selection:
Smokers and non-smokers may be different in the first place:
Selection on observable characteristics: age, education, income, etc.
Selection on unobservable characteristics: genes (Fisher's hypothetical confounding genome theory).
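The self-selection argument can be made concrete with a small simulation. The sketch below is a toy world, not real data: every probability in it is made up, and the hidden `trait` is a stand-in for any unobserved characteristic (genes, in the version of the argument above). Smoking has no effect on cancer by construction, yet smokers still show a higher cancer rate.

```python
import random

def simulate_confounded_population(n=200_000, seed=0):
    """Simulate people whose hidden trait raises both smoking and cancer risk.

    Smoking has NO causal effect on cancer in this toy world.
    All probabilities are invented for illustration.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        trait = rng.random() < 0.3                          # unobserved confounder
        smokes = rng.random() < (0.7 if trait else 0.2)     # trait makes smoking more likely
        cancer = rng.random() < (0.10 if trait else 0.02)   # cancer depends on trait only
        people.append((trait, smokes, cancer))
    return people

def cancer_rate(people, smokes, trait=None):
    """Share with cancer among people with the given smoking status.

    If `trait` is set, restrict to people with that trait value as well.
    """
    group = [c for t, s, c in people if s == smokes and (trait is None or t == trait)]
    return sum(group) / len(group)

people = simulate_confounded_population()

# Naive comparison: smokers vs non-smokers, ignoring the hidden trait.
raw_gap = cancer_rate(people, True) - cancer_rate(people, False)

# Same comparison, holding the trait fixed (other things equal).
gap_with_trait = cancer_rate(people, True, True) - cancer_rate(people, False, True)
gap_without_trait = cancer_rate(people, True, False) - cancer_rate(people, False, False)

print(f"raw gap:                 {raw_gap:.3f}")
print(f"gap among trait holders: {gap_with_trait:.3f}")
print(f"gap among the rest:      {gap_without_trait:.3f}")
```

The raw gap is clearly positive, while the gaps within each trait group are close to zero. In real life the trait is unobserved, so the second comparison is not available, which is exactly why the skeptics' objection was hard to dismiss with correlations alone.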
Can we interpret these differences causally?
Are all other things equal between insured and uninsured?
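One way to probe the "other things equal" question is a balance check: compare the average of each background characteristic across the two groups. The sketch below uses synthetic, made-up data (the column names `education` and `health`, and every coefficient, are assumptions for illustration, not values from the course dataset). In this toy world insurance has no effect on health at all.

```python
import random

def group_mean(rows, key, insured):
    """Mean of `key` among rows with the given insurance status."""
    values = [r[key] for r in rows if r["insured"] == insured]
    return sum(values) / len(values)

def balance_table(rows, covariates):
    """Difference in means (insured minus uninsured) for each variable."""
    return {k: group_mean(rows, k, True) - group_mean(rows, k, False)
            for k in covariates}

# Synthetic data: all numbers below are invented for illustration.
rng = random.Random(1)
rows = []
for _ in range(20_000):
    education = rng.gauss(13, 2.5)  # years of schooling
    # More educated people are more likely to be insured (probability clipped to [0, 1]).
    p_insured = min(max(0.2 + 0.05 * education, 0.0), 1.0)
    insured = rng.random() < p_insured
    # Health depends on education only; insurance plays no role by construction.
    health = 3.0 + 0.1 * education + rng.gauss(0, 1)
    rows.append({"insured": insured, "education": education, "health": health})

gaps = balance_table(rows, ["education", "health"])
for name, gap in gaps.items():
    print(f"{name}: insured minus uninsured = {gap:.2f}")
```

The insured come out both more educated and healthier, even though insurance does nothing here. A gap in a background characteristic like education is a warning that the groups were different to begin with, so the health gap cannot be read as the effect of insurance.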
Selection bias is the bias introduced by the selection of individuals, groups, or data for analysis in such a way that proper randomization is not achieved, thereby failing to ensure that the sample obtained is representative of the population intended to be analyzed.
Econometrics textbooks tend to define selection bias in terms of a regression or (as in MM) a randomized controlled trial.
We will start from this more general definition to connect with the concept of conditional expectation.
Then we will connect with regression and experiments.
How would you use conditional expectations to characterize this problem?
Let's start by simplifying the problem: assume that each plane has only two sections. Now define two random variables, both binary (Bernoulli), that indicate whether the plane was damaged in location 1 and in location 2 (DL1: {Not damaged in location 1, Damaged in location 1} → {0, 1}, and likewise for DL2).
We also need to define a random variable for what we are conditioning on. In this case, let's use a binary variable for return (R: {Plane didn't return, Plane returned} → {0, 1}).
One way of characterizing the problem is that the engineers thought they were observing E(DL1) and E(DL2), and concluded that E(DL1) > E(DL2).
But in fact they were observing E(DL1|R=1) and E(DL2|R=1), and most likely E(DL1|R=0) < E(DL2|R=0).
If you don't like the math notation, you can provide the same answer, but in narrative form.
This is called survivorship bias, and is a type of selection bias.
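The gap between the unconditional and the conditional expectations can be checked with a quick simulation. This is a minimal sketch under invented assumptions: both locations are hit equally often, and a hit in location 2 is far more likely to bring the plane down. None of these rates come from historical data.

```python
import random

def simulate_planes(n=100_000, seed=42):
    """Return a list of (DL1, DL2, R) tuples, each entry 0 or 1."""
    rng = random.Random(seed)
    planes = []
    for _ in range(n):
        dl1 = rng.random() < 0.3  # damage in location 1 (made-up rate)
        dl2 = rng.random() < 0.3  # damage in location 2 (same made-up rate)
        # Made-up survival odds: a hit in location 2 is much more lethal.
        p_return = 0.95
        if dl1:
            p_return -= 0.05
        if dl2:
            p_return -= 0.60
        r = rng.random() < p_return
        planes.append((int(dl1), int(dl2), int(r)))
    return planes

def cond_mean(planes, index, returned=None):
    """Sample analogue of E(DL | R = returned); unconditional if `returned` is None."""
    values = [p[index] for p in planes if returned is None or p[2] == returned]
    return sum(values) / len(values)

planes = simulate_planes()

e_dl1, e_dl2 = cond_mean(planes, 0), cond_mean(planes, 1)
e_dl1_ret, e_dl2_ret = cond_mean(planes, 0, 1), cond_mean(planes, 1, 1)
e_dl1_lost, e_dl2_lost = cond_mean(planes, 0, 0), cond_mean(planes, 1, 0)

print(f"E(DL1)     = {e_dl1:.3f}   E(DL2)     = {e_dl2:.3f}")
print(f"E(DL1|R=1) = {e_dl1_ret:.3f}   E(DL2|R=1) = {e_dl2_ret:.3f}")
print(f"E(DL1|R=0) = {e_dl1_lost:.3f}   E(DL2|R=0) = {e_dl2_lost:.3f}")
```

Among returned planes, location 1 looks like the weak spot because damage there shows up more often. Among lost planes the ordering flips, and in the full fleet the two locations are hit equally often. Looking only at the survivors points the armor at the wrong place.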
Characterization of Americans according to foreigners visiting Berkeley.
Characterization of Chinese according to foreigners visiting a specific city.