Multivariate Regression: Statistical Control
Statistical Control for Causal Inference: Good and Bad Controls
Selection Bias: Bias through stratification
Our goals in this session are to use multivariate regression for statistical control, to distinguish good from bad controls (confounders, mediators, and colliders), and to understand how selection bias can arise through stratification.
First, we have another look at the US presidential elections.
load("raw-data/uspresnew.Rdata") # presidential elections
us
## year vote party growth incumbent approval
## 1 1948 52.370 Democrat 3.579 Truman 1948 39
## 2 1952 44.595 Democrat 0.691 Stevenson 1952 32
## 3 1956 57.764 Republican -1.451 Eisenhower 1956 69
## 4 1960 49.913 Republican 0.377 Nixon 1960 49
## 5 1964 61.344 Democrat 5.109 Johnson 1964 74
## 6 1968 49.596 Democrat 5.043 Humphrey 1968 40
## 7 1972 61.789 Republican 5.914 Nixon 1972 56
## 8 1976 48.948 Republican 3.751 Ford 1976 45
## 9 1980 44.697 Democrat -3.597 Carter 1980 33
## 10 1984 59.170 Republican 5.440 Reagan 1984 52
## 11 1988 53.902 Republican 2.178 Bush 1988 51
## 12 1992 46.545 Republican 2.662 Bush 1992 31
## 13 1996 54.736 Democrat 3.121 Clinton 1996 57
## 14 2000 50.265 Democrat 1.219 Gore 2000 59
## 15 2004 51.233 Republican 2.690 Bush 2004 47
Remember our basic bivariate model?
lm(vote ~ growth, data = us)
##
## Call:
## lm(formula = vote ~ growth, data = us)
##
## Coefficients:
## (Intercept) growth
## 49.699 1.127
Let’s estimate the effect of approval while controlling for growth (i.e., holding growth constant), step by step as in the lecture.
It is helpful to think of statistical control in terms of Venn diagrams. Each circle represents the variance of a variable, and the intersection of two circles is their covariance, i.e. the part of the variance in one variable that can be explained by the variance of the other variable.
First, we need the residuals from regressing vote on growth, i.e. the part of the variance in vote that is not explained by growth. With this regression we “remove the effect of growth on vote”.
aux1 <- lm(vote ~ growth, data = us)
Let’s have another look at the Venn diagram. The part colored blue indicates the variance in vote that can be explained by growth.
Now we can get the residuals. The residuals in this model are the part of vote unexplained by growth, in other words the unexplained part of the variance of the dependent variable vote.
resid_aux1 <- residuals(aux1)
Next, we remove the effect of growth on approval. For this, we need the residuals from regressing approval on growth. Again, the residuals would be the unexplained variance in the dependent variable, hence we’re treating approval as the dependent variable here:
aux2 <- lm(approval ~ growth, data = us)
The blue part again corresponds to the variance in approval that is explained by growth, i.e. their covariance.
Now we get the residuals, shown in red in the Venn diagram below.
resid_aux2 <- residuals(aux2)
We see that part of this red area is not explained by any other variable in the plot (it does not intersect with anything), while the other part intersects with vote. This intersection is what we are interested in.
Once we have the residuals from the two regressions, we can now estimate the unconfounded effect of approval on vote.
aux3 <- lm(resid_aux1 ~ resid_aux2)
summary(aux3)
##
## Call:
## lm(formula = resid_aux1 ~ resid_aux2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4076 -1.1555 -0.8039 2.1181 4.2504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.185e-16 6.558e-01 0.000 1
## resid_aux2 3.195e-01 5.383e-02 5.935 4.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.54 on 13 degrees of freedom
## Multiple R-squared: 0.7304, Adjusted R-squared: 0.7097
## F-statistic: 35.23 on 1 and 13 DF, p-value: 4.942e-05
Of course we do not need to do this step-by-step. R knows how to handle multivariate OLS models.
mv <- lm(vote ~ growth + approval, data = us)
mv
##
## Call:
## lm(formula = vote ~ growth + approval, data = us)
##
## Coefficients:
## (Intercept) growth approval
## 34.8293 0.8146 0.3195
summary(mv)
##
## Call:
## lm(formula = vote ~ growth + approval, data = us)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4076 -1.1555 -0.8039 2.1181 4.2504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.82929 2.77295 12.560 2.90e-08 ***
## growth 0.81459 0.27126 3.003 0.011 *
## approval 0.31950 0.05603 5.702 9.88e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.644 on 12 degrees of freedom
## Multiple R-squared: 0.808, Adjusted R-squared: 0.776
## F-statistic: 25.25 on 2 and 12 DF, p-value: 5.01e-05
Let’s check if it worked. Compare the unconfounded effect of approval on vote with the multivariate model.
coef(aux3)[2] == coef(mv)[3]
## resid_aux2
## TRUE
What would we have to do to get the coefficient of growth step-by-step?
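One possible answer, mirroring the steps above (a sketch, assuming us and mv are still in the workspace): residualize both vote and growth with respect to approval, and then regress the residuals on each other. Since the two routes can differ in the last decimal places due to floating-point arithmetic, we compare the coefficients with all.equal() rather than ==.
# remove the effect of approval from vote
res_vote <- residuals(lm(vote ~ approval, data = us))
# remove the effect of approval from growth
res_growth <- residuals(lm(growth ~ approval, data = us))
# regress the residuals on each other
aux_growth <- lm(res_vote ~ res_growth)
# compare with the coefficient from the multivariate model
all.equal(unname(coef(aux_growth)[2]), unname(coef(mv)["growth"]))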
Whenever we run regression models, deciding which variables to include as control variables is a crucial task that requires careful thinking. We learned about three types of variables: confounders, mediators, and colliders.
While we want to control for confounders, we should avoid controlling for mediators and colliders.
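A minimal simulation sketch illustrates this point (the data and variable names here are invented for illustration and are not part of the lecture examples): x has a direct effect of 1 on y and a total effect of 1.5 (part of it runs through the mediator). Controlling for the confounder recovers the total effect, while additionally controlling for the mediator or the collider distorts the estimate.
set.seed(123)
n <- 10000
z_conf <- rnorm(n)                          # confounder: affects x and y
x <- z_conf + rnorm(n)                      # treatment
z_med <- 0.5 * x + rnorm(n)                 # mediator: affected by x, affects y
y <- x + z_conf + z_med + rnorm(n)          # outcome; total effect of x is 1.5
z_coll <- x + y + rnorm(n)                  # collider: affected by x and y

coef(lm(y ~ x))["x"]                        # biased: confounder omitted
coef(lm(y ~ x + z_conf))["x"]               # close to 1.5: the total causal effect
coef(lm(y ~ x + z_conf + z_med))["x"]       # close to 1: mediator controls away the indirect path
coef(lm(y ~ x + z_conf + z_coll))["x"]      # biased: conditioning on the collider opens a spurious path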
A team of researchers wants to study whether attending a Trump rally increases Trump’s popularity among visitors of the rally. The team fields a survey in a town where Trump just held a rally. Respondents were asked whether they like Trump (like_trump), about their ideology (ideology), and whether they attended the rally (attendance). Importantly, they only have data from after the rally, but not from before the rally.
load("raw-data/trump_rally.Rdata")
head(trump_rally)
## ideology attendance like_trump
## 1 -8 0 -0.4118509
## 2 3 0 -1.2290978
## 3 2 0 -0.4148800
## 4 3 0 0.3874313
## 5 8 1 3.4798378
## 6 3 0 0.7476413
summary(trump_rally)
## ideology attendance like_trump
## Min. :-10.000 Min. :0.000 Min. :-9.5866
## 1st Qu.: -5.000 1st Qu.:0.000 1st Qu.:-2.5361
## Median : 0.000 Median :0.000 Median : 0.1024
## Mean : 0.138 Mean :0.229 Mean : 0.1399
## 3rd Qu.: 5.000 3rd Qu.:0.000 3rd Qu.: 2.8918
## Max. : 10.000 Max. :1.000 Max. :10.0000
Let’s have a look at the distributions of the variables:
attendance_table <- table(trump_rally$attendance)
barplot(attendance_table,
main = "Attendance at Trump Rally",
xlab = "",
names.arg = c("was not at rally", "was at rally"),
col = viridis(1),
border = F,
las = 1)
hist(trump_rally$ideology,
main = "Histogram of Ideology",
xlab = "Ideology (Liberal-Conservative)",
col = viridis(1),
border = F,
las = 1)
hist(trump_rally$like_trump,
main = "Histogram of Like/Dislike Trump",
xlab = "Dislike - Like: Trump",
col = viridis(1),
border = F,
las = 1)
Let’s compare those who were at the Trump rally with those who were not at the rally.
attendance_jitter <- jitter(trump_rally$attendance) # What does jitter do?
plot(attendance_jitter,
trump_rally$like_trump,
xlab = "Attended the Trump rally",
ylab = "Dislike - Like: Trump",
col = viridis(1, alpha = 0.5),
pch = 19,
xaxt = "n",
bty = "n")
axis(1,
at = 0:1,
labels = c("was not at rally",
"was at rally"),
tick = F)
The same plots, created with ggplot2:
ggplot(
data = trump_rally,
aes(attendance)
) +
geom_bar(
stat = "count",
color = "white",
fill = viridis(1)
) +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
scale_x_continuous(
"",
breaks = c(0, 1),
labels = c("was not at rally", "was at rally")
) +
labs(
title = "Attendance at Trump Rally",
y = ""
)
ggplot(
data = trump_rally,
aes(ideology)
) +
geom_histogram(
boundary = 10,
binwidth = 2,
color = "white",
fill = viridis(1)
) +
labs(
title = "Histogram of Ideology",
x = "Ideology (Liberal-Conservative)",
y = "Frequency"
) +
theme_minimal() +
scale_x_continuous(
breaks = c(seq(-10, 10, by = 5))
)
ggplot(
data = trump_rally,
aes(like_trump)
) +
geom_histogram(
boundary = 10,
binwidth = 2,
color = "white",
fill = viridis(1)
) +
labs(
title = "Histogram of Like/Dislike Trump",
x = "Dislike - Like: Trump",
y = "Frequency"
) +
theme_minimal() +
scale_x_continuous(
breaks = c(seq(-10, 10, by = 5))
)
The same comparison with ggplot2:
trump_rally$attendance_jitter <- jitter(trump_rally$attendance) # What does jitter do?
ggplot(data = trump_rally, # data used for plotting
mapping = aes(x = attendance_jitter,
y = like_trump)
) +
geom_point(
color = viridis(1, alpha = 0.5),
size = 2
) +
theme_minimal() + # change the appearance
theme(panel.grid.minor = element_blank()) +
labs(
x = "Attended the Trump rally",
y = "Dislike - Like: Trump"
) +
scale_x_continuous(
"",
breaks = c(0,1),
labels = c("was not at rally","was at rally")
)
trump_rally$attendance_jitter <- NULL
A naive regression model:
lm1 <- lm(
like_trump ~ attendance,
data = trump_rally
)
summary(lm1)
##
## Call:
## lm(formula = like_trump ~ attendance, data = trump_rally)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6665 -2.2202 0.0215 2.0731 8.5448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9201 0.1132 -8.126 1.3e-15 ***
## attendance 4.6287 0.2366 19.563 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.144 on 998 degrees of freedom
## Multiple R-squared: 0.2772, Adjusted R-squared: 0.2765
## F-statistic: 382.7 on 1 and 998 DF, p-value: < 2.2e-16
Should we control for ideology? Let’s look at a causal graph.
The DAG indicates that ideology is a confounder: Ideology affects whether voters attend the rally and whether they like or dislike Trump. We should thus control for ideology. Let’s see whether this changes the result.
lm2 <- lm(like_trump ~ attendance + ideology,
data = trump_rally)
summary(lm2)
##
## Call:
## lm(formula = like_trump ~ attendance + ideology, data = trump_rally)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2541 -1.3940 0.0316 1.3660 6.2687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01793 0.07881 0.228 0.820
## attendance 0.23563 0.19724 1.195 0.233
## ideology 0.49297 0.01361 36.222 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.067 on 997 degrees of freedom
## Multiple R-squared: 0.6879, Adjusted R-squared: 0.6873
## F-statistic: 1099 on 2 and 997 DF, p-value: < 2.2e-16
The effect of attending the Trump rally on whether someone likes or dislikes Trump almost entirely disappears. To get an idea why including ideology as a control variable makes such a difference, let’s look at the bivariate relationship between ideology and the popularity of Trump and color the dots according to whether someone was at the rally or not.
col_vec <- viridis(2, alpha = 0.5)
attend_col <- ifelse(trump_rally$attendance == 1,
col_vec[1],
ifelse(trump_rally$attendance == 0,
col_vec[2], NA)) # Why do we nest two ifelse statements?
# instead of nesting two ifelse statements, we can also use case_when from dplyr
# attend_col <- case_when(trump_rally$attendance == 1 ~ col_vec[1],
#                         trump_rally$attendance == 0 ~ col_vec[2],
#                         TRUE ~ NA)
plot(jitter(trump_rally$ideology), # Now that we know what jitter does, let's make good use of it
trump_rally$like_trump,
xlab = "Ideology (Liberal-Conservative)",
ylab = "Dislike - Like: Trump",
pch = 19,
col = attend_col,
bty = "n")
legend("topleft",
col = col_vec,
pch = 19,
legend = c("was at rally",
"was not at rally"),
bty = "n")
col_vec <- viridis(2, alpha = 0.5)
# Why do we nest two ifelse statements?
attend_lab <- ifelse(trump_rally$attendance == 1,
"did attend",
ifelse(trump_rally$attendance == 0,
"did not attend", NA))
# instead of nesting two ifelse statements, we can also use case_when from dplyr
# attend_lab <- case_when(trump_rally$attendance == 1 ~ "did attend",
# trump_rally$attendance == 0 ~ "did not attend",
# TRUE ~ NA) # assign NA if trump_rally$attendance is not 1 or 0
ggplot(data = trump_rally, # data used for plotting
mapping = aes(x = ideology,
y = like_trump,
color = attend_lab)
) +
geom_jitter(
size = 2,
width = 0.25
) +
theme_minimal() + # change the appearance
labs(
x = "Ideology (Liberal-Conservative)",
y = "Dislike - Like: Trump",
title = ""
) +
scale_color_manual(
name = "",
values = c("did attend" = col_vec[1], "did not attend" = col_vec[2])
)
It becomes apparent that people who already like Trump were far more likely to go to the rally than people who dislike Trump. This supports the causal model we specified in the DAG and constitutes what we call selection into treatment. At the same time, ideology affects Trump’s popularity: more conservative people are more in favor of Trump than liberal people. Because ideology affects the independent variable as well as the dependent variable, we must control for it. Otherwise, we compare groups that were systematically different from one another already before the Trump rally, and the regression coefficient will pick up this difference as long as we do not control for it.
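We can also see this selection into treatment in simple group means (a quick sketch using the trump_rally data loaded above): attendees should, on average, be clearly more conservative and more favorable towards Trump than non-attendees.
# mean ideology and mean Trump rating by rally attendance
aggregate(cbind(ideology, like_trump) ~ attendance, data = trump_rally, FUN = mean)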
The next team of researchers wants to find out whether civic education programs can increase turnout. In search of empirical evidence, they conduct an experiment in which they randomly assign participants to attend a civic education program (educ_program = 1) or not (educ_program = 0). After the experiment, they conduct a survey among all participants asking them to indicate on a scale from 0 to 10 how willing they are to cast a vote in the next election (turnout). Furthermore, to understand the effect of the civic training on turnout independent of subjects’ political awareness, they also ask respondents about their political interest (pol_interest).
Let’s have a look at the resulting data.
load("raw-data/experiment.Rdata")
barplot(table(experiment$educ_program),
main = "Assignment to civic education program",
xlab = "",
names.arg = c("Control Group", "Treatment Group"),
col = viridis(1),
border = F,
las = 1)
hist(experiment$pol_interest,
main = "Histogram of political interest",
xlab = "Political Interest",
col = viridis(1),
border = F,
las = 1)
hist(experiment$turnout,
main = "Histogram of willingness to turn out",
xlab = "Willingness to turn out",
col = viridis(1),
border = F,
las = 1)
ggplot(
data = experiment,
aes(educ_program)
) +
geom_bar(
stat = "count",
color = "white",
fill = viridis(1)
) +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
scale_x_continuous(
"",
breaks = c(0,1),
labels = c("Control Group","Treatment Group")
) +
labs(
title = "Assignment to civic education program",
y = ""
)
ggplot(
data = experiment,
aes(pol_interest)
) +
geom_histogram(
boundary = 10,
binwidth = 1,
color = "white",
fill = viridis(1)
) +
labs(
title = "Histogram of political interest",
x = "Political Interest",
y = "Frequency"
) +
theme_minimal() +
scale_x_continuous(
breaks = c(seq(0, 10, by = 2))
)
ggplot(
data = experiment,
aes(turnout)
) +
geom_histogram(
boundary = 10,
binwidth = 1,
color = "white",
fill = viridis(1)
) +
labs(
title = "Histogram of willingness to turn out",
x = "Willingness to turn out",
y = "Frequency"
) +
theme_minimal() +
scale_x_continuous(
breaks = c(seq(0, 10, by = 2))
)
When the team turns to analyzing their data, they are unsure whether they should use the political interest variable as a control variable in their model or not. Let’s see whether this makes a difference:
lm1 <- lm(turnout ~ educ_program,
data = experiment)
summary(lm1)
##
## Call:
## lm(formula = turnout ~ educ_program, data = experiment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4019 -0.8897 0.0038 0.8887 2.5734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.22731 0.05726 73.83 <2e-16 ***
## educ_program 3.26597 0.07956 41.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.257 on 998 degrees of freedom
## Multiple R-squared: 0.628, Adjusted R-squared: 0.6277
## F-statistic: 1685 on 1 and 998 DF, p-value: < 2.2e-16
lm2 <- lm(turnout ~ educ_program + pol_interest,
data = experiment)
summary(lm2)
##
## Call:
## lm(formula = turnout ~ educ_program + pol_interest, data = experiment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07142 -0.55983 -0.00923 0.51489 2.91203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.28361 0.05999 38.064 <2e-16 ***
## educ_program -0.22366 0.09990 -2.239 0.0254 *
## pol_interest 0.77427 0.01929 40.141 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7776 on 997 degrees of freedom
## Multiple R-squared: 0.8578, Adjusted R-squared: 0.8575
## F-statistic: 3008 on 2 and 997 DF, p-value: < 2.2e-16
Which model should they use? Let’s consider a DAG again. Because the civic education program was randomly assigned to participants, political interest cannot affect the ‘education program’ variable. However, if the civic education program worked, it may also affect participants’ levels of political interest. The arrow between ‘education program’ and ‘political interest’ is thus directed from ‘education program’ to ‘political interest’. This results in the following DAG:
The DAG shows that pol_interest is a mediator variable and thus should not be included in the regression model. By including a mediator, we run the risk of controlling away the causal effect we are interested in. In the present case, controlling for the mediator even flips the sign of the estimated causal effect. This is due to unobserved background variables (you can read Montgomery et al. (2018) to learn more about this).
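A simple way to see the mediator at work is to compare group means (a sketch; note that this only shows whether the post-treatment variable differs across the randomized groups, it does not by itself identify the mediation mechanism):
# mean political interest and mean turnout by treatment status
aggregate(cbind(pol_interest, turnout) ~ educ_program, data = experiment, FUN = mean)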
In the final example, a group of university researchers wants to find out how attendance in lectures affects exam grades. They hypothesize that the more lectures a student attends, the better they will do in the final exam. To test their hypothesis, they poll students online on their ‘attendance’ in lectures and ‘grade’ in the exam at the end of the semester.
load("raw-data/studentpoll.Rdata")
glimpse(poll)
## Rows: 3,472
## Columns: 2
## $ grade <dbl> 2.3, 1.0, 2.0, 3.3, 3.7, 1.7, 3.0, 2.3, 1.7, 2.7, 1.7, 2.3,~
## $ attendance <dbl> 8, 13, 10, 13, 7, 12, 9, 13, 11, 9, 8, 9, 11, 9, 11, 11, 14~
The data includes the survey results for each participant, namely their reported grade and attendance.
barplot(
table(poll$grade),
main = "Exam grades",
col = viridis(1),
border = F,
las = 1
)
barplot(
table(poll$attendance),
main = "Number of lectures",
col = viridis(1),
border = F,
las = 1
)
The same plots with ggplot2:
grades <- sort(unique(poll$grade))
ggplot(
data = poll,
aes(grade)
) +
geom_bar(
stat = "count",
color = "white",
fill = viridis(1)
) +
theme_minimal() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
scale_x_continuous("", breaks = grades) +
labs(title = "Exam grades", y = "", x = "")
rm(grades)
ggplot(
data = poll,
aes(attendance)
) +
geom_bar(
stat = "count",
color = "white",
fill = viridis(1)
) +
theme_minimal() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
scale_x_continuous("", breaks = seq(0, 14, 1)) +
labs(title = "Number of lectures attended", y = "", x = "")
After exploring their poll results, the team of researchers starts by estimating a linear regression.
lm1 <- lm(
grade ~ attendance,
data = poll
)
summary(lm1)
##
## Call:
## lm(formula = grade ~ attendance, data = poll)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4768 -0.6764 -0.1403 0.5779 2.6510
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.349015 0.032693 71.850 < 2e-16 ***
## attendance 0.009130 0.003428 2.664 0.00776 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9013 on 3470 degrees of freedom
## Multiple R-squared: 0.002041, Adjusted R-squared: 0.001753
## F-statistic: 7.095 on 1 and 3470 DF, p-value: 0.007765
Surprisingly, the team of researchers does not find any empirical support for their hypothesis. According to their bivariate model, each additional lecture attended increases the numerical value of the exam grade by 0.00913 on average. The model thus predicts that a student who attends all 14 lectures receives a grade that is about 0.128 points higher than that of a student who never attends. Since we are working with the German grading system, where lower numbers are better, this means that attendance is associated with worse performance, which is counterintuitive. A small, but nonetheless unexpected estimated effect.
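We can compute this predicted difference directly from the fitted model (a small sketch; 0 and 14 lectures correspond to the range shown in the attendance barplot above):
# predicted exam grade for a student who attends 0 vs. 14 lectures
predict(lm1, newdata = data.frame(attendance = c(0, 14)))
# the difference is 14 * 0.00913, roughly 0.128 grade points
unname(coef(lm1)["attendance"] * 14)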
Do you have any ideas how this result can be explained?
Luckily, this is simulated data. So let’s have a look at the population to better explain this finding. First, we will take a sample from our population of exactly the same size as our poll. The population consists of all students, whereas the poll only includes students who participated in the online survey.
Perhaps, it is helpful to look at a scatterplot. Due to the scale of our variables, we can use jitter on both x and y to create a scatterplot that somewhat resembles a heatmap.
set.seed(2021)
# create sample from population with same size as poll sample
pop <- population[sample(1:nrow(population), nrow(poll)),]
# or, with dplyr:
# pop <- slice_sample(population, n = nrow(poll))
# Scatterplot: Polled students
plot(jitter(poll$attendance, 2.2),
jitter(poll$grade, 2),
main = "Polled students",
xlab = "Attended lectures",
ylab = "Exam Grade",
xlim = c(0,14),
pch = 19,
col = viridis(1, alpha = 0.1),
bty = "n")
# now let's create the same plot, but on the sample taken from our population
plot(jitter(pop$attendance, 2.2),
jitter(pop$grade, 2),
main = "Student population sample",
xlab = "Attended lectures",
ylab = "Exam Grade",
xlim = c(0,14),
pch = 19,
col = viridis(1, alpha = 0.1),
bty = "n")
Can you spot differences?
ggplot(
data = poll, # data used for plotting
mapping = aes(x = jitter(attendance, 2.2),
y = jitter(grade, 2))
) +
geom_point(
color = viridis(1, alpha = 0.1),
size = 2
) +
theme_minimal() + # change the appearance
labs(x = "Lectures attended",
y = "Exam grades",
title = "Polled students"
)
ggplot(
data = pop, # data used for plotting
mapping = aes(x = jitter(attendance, 2.2),
y = jitter(grade, 2))
) +
geom_point(
color = viridis(1, alpha = 0.1),
size = 2
) +
theme_minimal() + # change the appearance
labs(
x = "Lectures attended",
y = "Exam grades",
title = "Student population sample"
)
# or, after saving the two plots as objects lp and rp, combine them with the patchwork package
# lp + rp
Can you spot differences?
Apparently, the sample is not an accurate reflection of the underlying population of students.
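We can also quantify the difference with a few summary statistics (a sketch comparing the polled students to the sample drawn from the population):
# average grade and attendance: polled students vs. population sample
rbind(
  poll = c(mean_grade = mean(poll$grade), mean_attendance = mean(poll$attendance)),
  population = c(mean_grade = mean(pop$grade), mean_attendance = mean(pop$attendance))
)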
Let’s test the relationship between lecture attendance and exam grades again!
lm2 <- lm(
grade ~ attendance,
data = pop
)
summary(lm2)
##
## Call:
## lm(formula = grade ~ attendance, data = pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.00897 -0.75577 -0.00897 0.99103 2.49744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.008971 0.029352 102.52 <2e-16 ***
## attendance -0.036172 0.003549 -10.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 3470 degrees of freedom
## Multiple R-squared: 0.02906, Adjusted R-squared: 0.02878
## F-statistic: 103.9 on 1 and 3470 DF, p-value: < 2.2e-16
What happened?
We learned that colliders are bad controls and we do not include them in our model to avoid bias. A collider is a variable affected by both X and Y. In other words, both X and Y can be used to explain some variation in the collider. You can read more on colliders in the Appendix.
However, colliders do not have to be included as control variables to introduce bias. Conditioning on a collider in any way can introduce sample selection bias (also called collider stratification bias). In this case, the conditioning happened through students’ self-selection into the online poll: by only using students who participated in the survey, we implicitly “control” for poll participation, because we hold it constant. As a result, the team of researchers polled a sample which does not accurately depict the full population of students.
In our case, attendance affects exam grades, as shown by the regression on the full population. However, attendance also affects whether students respond to the poll, and so do exam grades.
We can also see the respective associations in the data:
# attendance affects poll response
lm3 <- lm(
poll ~ attendance,
data = population
)
lm3
##
## Call:
## lm(formula = poll ~ attendance, data = population)
##
## Coefficients:
## (Intercept) attendance
## 0.10857 0.01593
# Grades affect poll response
lm4 <- lm(
poll ~ grade,
data = population
)
lm4
##
## Call:
## lm(formula = poll ~ grade, data = population)
##
## Coefficients:
## (Intercept) grade
## 0.40124 -0.06736
In consequence, our sample overrepresents good students who attend regularly and underrepresents failing students, especially those who do not attend.
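We can make this visible by tabulating poll response rates in the population (a sketch; given the associations above, the response rate should decline for worse grades and rise with attendance):
# share of students responding to the poll, by exam grade and by attendance
aggregate(poll ~ grade, data = population, FUN = mean)
aggregate(poll ~ attendance, data = population, FUN = mean)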
It’s best to look at it:
col_vec <- viridis(2, alpha = 0.3)
poll_col <- ifelse(pop$poll == 1, col_vec[1], col_vec[2])
par(mar=c(5,5,1,5), xpd=TRUE)
plot(jitter(pop$attendance, 2.2), # Now that we know what jitter does, let's make good use of it
jitter(pop$grade, 2),
main = "Population sample by poll response",
xlab = "Lectures attended",
ylab = "Exam grades",
pch = 19,
col = poll_col,
bty = "n")
legend("topright",
inset=c(-0.2,0),
col = viridis(2, alpha = 1),
pch = 19,
legend = c("polled",
"not polled"),
bty = "n")
The same plot with ggplot2:
col_vec <- viridis(2, alpha = 0.3)
poll_lab <- ifelse(pop$poll == 1,
"polled",
ifelse(pop$poll == 0,
"not polled", NA)) # Why do we nest two ifelse statements?
ggplot(data = pop, # data used for plotting
mapping = aes(x = jitter(attendance, 2.2),
y = jitter(grade, 2),
color = poll_lab)
) +
geom_point(size = 2) +
theme_minimal() + # change the appearance
labs(
x = "Lectures attended",
y = "Exam grades",
title = "Population sample by poll response"
) +
scale_color_manual(
name = "",
values = c("polled" = col_vec[1], "not polled" = col_vec[2])
)
If you select cases for your study conditional on a collider, this can bias your causal estimates. Further, just like controlling for a collider in a regression, it can generate spurious associations between variables that are not related to each other. This sort of sample selection bias arises because when we stratify on a collider, even unintentionally, we also implicitly stratify on the X and Y of interest.
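A minimal simulation sketch makes this concrete (the data here are invented, not the student poll): x and y are completely unrelated, but both increase the chance of being observed in the sample. Within the observed subsample, a spurious negative association between x and y appears.
set.seed(42)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)                             # x and y are independent by construction
observed <- (x + y + rnorm(n)) > 0        # selection depends on both x and y (a collider)
coef(lm(y ~ x))["x"]                      # close to 0 in the full data
coef(lm(y ~ x, subset = observed))["x"]   # clearly negative in the selected sample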
Now it’s your turn! You want to study how the amount of time students devote to studying affects their final grades. You are interested in the causal effect of study time on scores in a final exam: How many more points do students gain on average for one additional hour of weekly studying?
To study this question, you have the following data:
student_id: an identifier for students in the sample
study_hours: average number of weekly study hours
motivation: whether students were motivated to take the course (prior to the semester)
exercises: average weekly number of practice exercises students worked on during their study hours
score: score in the final exam
load("raw-data/ex1.Rdata")
head(ex1)
## student_id study_hours motivation exercises score
## 1 001 5 0 5 46.84805
## 2 002 13 1 6 68.23539
## 3 003 7 0 1 37.57796
## 4 004 11 1 6 63.03814
## 5 005 13 1 6 67.92869
## 6 006 7 0 4 49.44455
# distribution of weekly study hours
hist(ex1$study_hours,
main = "Histogram of study hours",
xlab = "Average weekly study hours",
col = viridis(1),
border = F,
las = 1)
# Distribution of number of weekly exercises
hist(ex1$exercises,
breaks = max(ex1$exercises),
main = "Histogram of exercises",
xlab = "Average weekly number of practice exercises",
col = viridis(1),
border = F,
las = 1)
# Distribution of Motivation
barplot(table(ex1$motivation),
main = "Motivated and unmotivated students",
xlab = "",
names.arg = c("Not motivated",
"Motivated"),
col = viridis(1),
border = F,
las = 1)
# Distribution of final test score
hist(ex1$score,
main = "Histogram of Final Test Score",
xlab = "Final Test Score",
col = viridis(1),
border = F,
las = 1)
It is your task to specify a regression model to estimate the causal effect of study hours on students’ final score. In other words: You have to decide which variables to include as control variables.
Is motivation a confounder, mediator, or collider?
Is exercises a confounder, mediator, or collider?
In your homework you will do a peer review!
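As a starting point, you might fit the bivariate model and then add candidate controls one at a time to see how the coefficient of study_hours changes. Which of these specifications is the right one depends on your answers to the two questions above; the sketch below only shows the mechanics, not the answer.
lm_base <- lm(score ~ study_hours, data = ex1)
lm_mot <- lm(score ~ study_hours + motivation, data = ex1)
lm_exer <- lm(score ~ study_hours + exercises, data = ex1)
# compare the coefficient of interest across specifications
sapply(list(base = lm_base, motivation = lm_mot, exercises = lm_exer),
       function(m) coef(m)["study_hours"])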
In the final example, another team of researchers wants to find out whether education has an effect on the popularity of populist right parties. The team uses survey data to test their hypothesis. Specifically, they consider three survey items: like_pop_right indicates respondents’ rating of a populist right-wing party, educ indicates respondents’ level of formal education, and vote_pop_right indicates whether respondents voted for the populist right party in the last election.
load("raw-data/populism.Rdata")
head(populism)
## like_pop_right educ vote_pop_right
## 1 -0.001767621 1 0
## 2 0.451823262 1 0
## 3 4.878047894 5 1
## 4 1.544476798 6 1
## 5 0.170058582 3 0
## 6 1.925084331 6 1
hist(populism$educ,
breaks = seq(0.5, 7.5, 1),
main = "Histogram of Education",
xlab = "Formal level of education",
col = viridis(1),
border = F,
las = 1)
hist(populism$like_pop_right,
main = "Histogram of like/dislike populist right party",
xlab = "Like/Dislike Populist Right Party",
col = viridis(1),
border = F,
las = 1)
barplot(table(populism$vote_pop_right),
main = "Vote for populist right party",
xlab = "",
names.arg = c("Not voted for\npopulist right party",
"Voted for\npopulist right party"),
col = viridis(1),
border = F,
las = 1)
The team stumbles across a puzzling phenomenon when analysing the data: when they do not control for vote_pop_right, they do not find an effect of education on the popularity of populist right-wing parties. However, when they do control for vote_pop_right, they find a negative effect of education on the popularity of populist right-wing parties. They are unsure whether they should include vote_pop_right as a control variable in their model.
lm1 <- lm(like_pop_right ~ educ,
data = populism)
summary(lm1)
##
## Call:
## lm(formula = like_pop_right ~ educ, data = populism)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0136 -1.7589 0.1305 1.7661 8.4145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.06934 0.17973 0.386 0.700
## educ -0.01500 0.04091 -0.367 0.714
##
## Residual standard error: 2.568 on 998 degrees of freedom
## Multiple R-squared: 0.0001348, Adjusted R-squared: -0.0008671
## F-statistic: 0.1345 on 1 and 998 DF, p-value: 0.7139
lm2 <- lm(like_pop_right ~ educ + vote_pop_right,
data = populism)
summary(lm2)
##
## Call:
## lm(formula = like_pop_right ~ educ + vote_pop_right, data = populism)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7237 -1.2412 0.1366 1.3897 5.8827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.69113 0.13937 4.959 8.33e-07 ***
## educ -0.44509 0.03519 -12.647 < 2e-16 ***
## vote_pop_right 4.32412 0.16221 26.657 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.963 on 997 degrees of freedom
## Multiple R-squared: 0.4162, Adjusted R-squared: 0.4151
## F-statistic: 355.4 on 2 and 997 DF, p-value: < 2.2e-16
Let’s again consider a DAG in order to decide whether vote_pop_right should be included in the model. We suspect that education potentially affects whether someone voted for a populist right party (rather than the other way around), and that whether someone likes a party affects whether they* vote for that party (rather than the other way around). Thus, we draw a directed arrow from ‘education’ to ‘vote’ and another directed arrow from ‘like populist right party’ to ‘vote’. This results in the following DAG:
*‘they’ is a gender-inclusive alternative to the generic ‘he’ or the binary ‘he/she’.
It turns out that vote_pop_right is a collider and thus a bad control variable. Including colliders can generate spurious associations between variables that are not related to each other. When we include colliders, our causal estimates may be biased.
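Note that including vote_pop_right as a covariate and splitting the sample by it are two versions of the same problem: both condition on the collider. A quick sketch of the within-stratum regressions (if the collider logic is right, a negative education coefficient should also appear within each voting group, even though there is no effect overall):
# education coefficient within each stratum of the collider
coef(lm(like_pop_right ~ educ, data = subset(populism, vote_pop_right == 1)))["educ"]
coef(lm(like_pop_right ~ educ, data = subset(populism, vote_pop_right == 0)))["educ"]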
From our DAG in the attendance-grade example, we would assume that attendance affects survey response and so does exam performance. However, the relationship does not have to be direct. We could instead assume that attendance does not affect poll response at all, because the poll is done online. Instead, there may be an unobserved confounder that affects both our outcome, exam grades, and the poll response rate. For example, Internet speed (or access) could affect exam grades, because a poor connection makes preparation and learning harder, and it could affect the poll response rate, because it is harder for students to participate in an online poll. It does not, however, affect attendance, as the lecture is taught only in person.
In this case, survey response is a collider of attendance and exam grades, because Internet speed affects both survey response and exam grades. Students who do better because of their Internet connection are also more likely to respond to the poll, which introduces a spurious relationship between attendance and exam grades, because we only observe a non-representative subsample.
Keep this in mind when you are tasked with data collection, or conduct analysis on samples. Think about which groups of the population you collect data on, and more importantly, which parts of the population are underrepresented or omitted entirely. When you work with existing data, you should assess this retrospectively by studying the documentation on the data collection and processing of the data at hand.