Today we will learn

  1. Multivariate Regression: Statistical Control

  2. Statistical Control for Causal Inference: Good and Bad Controls

  3. Selection Bias: Bias through stratification

In other words, our goals are to:

  • understand what statistical control means
  • look at how to use dummy variables
  • explore bias introduced through stratified samples

1 Multivariate Regression: Statistical Control

First, let's take another look at the US presidential elections.

load("raw-data/uspresnew.Rdata") # presidential elections

us
##    year   vote      party growth       incumbent approval
## 1  1948 52.370   Democrat  3.579     Truman 1948       39
## 2  1952 44.595   Democrat  0.691  Stevenson 1952       32
## 3  1956 57.764 Republican -1.451 Eisenhower 1956       69
## 4  1960 49.913 Republican  0.377      Nixon 1960       49
## 5  1964 61.344   Democrat  5.109    Johnson 1964       74
## 6  1968 49.596   Democrat  5.043   Humphrey 1968       40
## 7  1972 61.789 Republican  5.914      Nixon 1972       56
## 8  1976 48.948 Republican  3.751       Ford 1976       45
## 9  1980 44.697   Democrat -3.597     Carter 1980       33
## 10 1984 59.170 Republican  5.440     Reagan 1984       52
## 11 1988 53.902 Republican  2.178       Bush 1988       51
## 12 1992 46.545 Republican  2.662       Bush 1992       31
## 13 1996 54.736   Democrat  3.121    Clinton 1996       57
## 14 2000 50.265   Democrat  1.219       Gore 2000       59
## 15 2004 51.233 Republican  2.690       Bush 2004       47

Remember our basic bivariate model?

lm(vote ~ growth, data = us)
## 
## Call:
## lm(formula = vote ~ growth, data = us)
## 
## Coefficients:
## (Intercept)       growth  
##      49.699        1.127

Let’s estimate the effect of approval while controlling for growth (i.e., holding growth constant), step by step (as in the lecture).

It is helpful to think of statistical control in terms of Venn diagrams. Each circle represents the variance of a variable, and the intersection of two circles represents their covariance, i.e. the part of the variance in one variable that can be explained by the variance in the other.
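
The Venn diagram is a heuristic, but we can attach numbers to its pieces. A quick sketch using base R (recall that in a bivariate regression, R-squared equals the squared correlation):

var(us$vote)              # variance of vote (the full circle)
cov(us$vote, us$growth)   # covariance of vote and growth (the intersection)
cor(us$vote, us$growth)^2 # share of variance in vote explained by growth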

Vote on Growth

First, we need the residuals from regressing vote on growth, i.e. the part of the variance in vote that is not explained by growth. With this regression, we “remove the effect of growth on vote”.

aux1 <- lm(vote ~ growth, data = us)

Let’s have another look at the Venn diagram. The blue-colored part here indicates the variance in vote that can be explained by growth.

Vote on Growth

Now we can get the residuals. The residuals of this model are the part of vote that is unexplained by growth, in other words, the unexplained variance of the dependent variable vote.

resid_aux1 <- residuals(aux1)

Residuals 1

Next, we remove the effect of growth on approval. For this, we need the residuals from regressing approval on growth. Again, the residuals would be the unexplained variance in the dependent variable, hence we’re treating approval as the dependent variable here:

aux2 <- lm(approval ~ growth, data = us)

The blue part again corresponds to the variance in approval that is explained by growth, i.e. their covariance.

Approval on Growth

Now we get the residuals, shown in red in the Venn diagram below.

resid_aux2 <- residuals(aux2)

residuals 2

We see that part of this red area is not explained by any other variable in the plot (it does not intersect with anything), while the other part intersects with vote. This intersection is what we are interested in.

Once we have the residuals from the two regressions, we can now estimate the unconfounded effect of approval on vote.

aux3 <- lm(resid_aux1 ~ resid_aux2)

Unconfounded effect of approval represented by blue area

summary(aux3)
## 
## Call:
## lm(formula = resid_aux1 ~ resid_aux2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4076 -1.1555 -0.8039  2.1181  4.2504 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.185e-16  6.558e-01   0.000        1    
## resid_aux2  3.195e-01  5.383e-02   5.935 4.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.54 on 13 degrees of freedom
## Multiple R-squared:  0.7304, Adjusted R-squared:  0.7097 
## F-statistic: 35.23 on 1 and 13 DF,  p-value: 4.942e-05

Of course, we do not need to do this step by step. R knows how to handle multivariate OLS models.

mv <- lm(vote ~ growth + approval, data = us)
mv
## 
## Call:
## lm(formula = vote ~ growth + approval, data = us)
## 
## Coefficients:
## (Intercept)       growth     approval  
##     34.8293       0.8146       0.3195

summary(mv)
## 
## Call:
## lm(formula = vote ~ growth + approval, data = us)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4076 -1.1555 -0.8039  2.1181  4.2504 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.82929    2.77295  12.560 2.90e-08 ***
## growth       0.81459    0.27126   3.003    0.011 *  
## approval     0.31950    0.05603   5.702 9.88e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.644 on 12 degrees of freedom
## Multiple R-squared:  0.808,  Adjusted R-squared:  0.776 
## F-statistic: 25.25 on 2 and 12 DF,  p-value: 5.01e-05

Let’s check whether it worked by comparing the unconfounded effect of approval on vote with the corresponding coefficient from the multivariate model.

coef(aux3)[2] == coef(mv)[3]
## resid_aux2 
##       TRUE
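
Note that comparing floating point numbers with == can be fragile; it happens to work here, but a safer check uses all.equal():

isTRUE(all.equal(unname(coef(aux3)[2]), unname(coef(mv)[3])))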

What would we have to do to get the coefficient of growth step-by-step?

Unconfounded effect of growth represented by blue area
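
One way to do it, following the same residual-regression logic as above (the object names here are our own):

aux1_g <- lm(vote ~ approval, data = us)   # remove the effect of approval on vote
aux2_g <- lm(growth ~ approval, data = us) # remove the effect of approval on growth
aux3_g <- lm(residuals(aux1_g) ~ residuals(aux2_g))
coef(aux3_g)[2] # should match coef(mv)[2], the growth coefficient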

2 Statistical Control for Causal Inference: Good and Bad Controls

Whenever we run regression models, deciding which variables to include as control variables is a crucial task that requires careful thinking. We learned about three types of variables: confounders, mediators, and colliders.

Confounder, Mediator, and Collider

While we want to control for confounders, we should avoid controlling for mediators and colliders.

2.1 Confounders

A team of researchers wants to study whether attending a Trump rally increases Trump’s popularity among rally visitors. The team fields a survey in a town where Trump just held a rally. Respondents were asked whether they like Trump (like_trump), about their ideology (ideology), and whether they attended the rally (attendance). Importantly, the researchers only have data from after the rally, not from before it.

load("raw-data/trump_rally.Rdata")

head(trump_rally)
##   ideology attendance like_trump
## 1       -8          0 -0.4118509
## 2        3          0 -1.2290978
## 3        2          0 -0.4148800
## 4        3          0  0.3874313
## 5        8          1  3.4798378
## 6        3          0  0.7476413
summary(trump_rally)
##     ideology         attendance      like_trump     
##  Min.   :-10.000   Min.   :0.000   Min.   :-9.5866  
##  1st Qu.: -5.000   1st Qu.:0.000   1st Qu.:-2.5361  
##  Median :  0.000   Median :0.000   Median : 0.1024  
##  Mean   :  0.138   Mean   :0.229   Mean   : 0.1399  
##  3rd Qu.:  5.000   3rd Qu.:0.000   3rd Qu.: 2.8918  
##  Max.   : 10.000   Max.   :1.000   Max.   :10.0000

Base R

Let’s have a look at the distributions of the variables:

attendance_table <- table(trump_rally$attendance)

barplot(attendance_table,
        main = "Attendance at Trump Rally",
        xlab = "",
        names.arg = c("was not at rally", "was at rally"),
        col = viridis(1),
        border = F,
        las = 1)

hist(trump_rally$ideology,
     main = "Histogram of Ideology",
     xlab = "Ideology (Liberal-Conservative)",
     col = viridis(1),
     border = F,
     las = 1)
     
hist(trump_rally$like_trump,
     main = "Histogram of Like/Dislike Trump",
     xlab = "Dislike - Like: Trump",
     col = viridis(1),
     border = F,
     las = 1)

Let’s compare those who were at the Trump rally with those who were not at the rally.

attendance_jitter <- jitter(trump_rally$attendance) # What does jitter do?

plot(attendance_jitter,
     trump_rally$like_trump,
     xlab = "Attended the Trump rally",
     ylab = "Dislike - Like: Trump",
     col = viridis(1, alpha = 0.5),
     pch = 19,
     xaxt = "n",
     bty = "n")
axis(1,
     at = 0:1,
     labels = c("was not at rally",
                "was at rally"),
     tick = F)

ggplot2

Let’s have a look at the distributions of the variables:

ggplot(
  data = trump_rally,
  aes(attendance)
) +
  geom_bar(
    stat = "count",
    color = "white",
    fill = viridis(1)
  ) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  scale_x_continuous(
    "",
    breaks = c(0, 1),
    labels = c("was not at rally", "was at rally")
  ) +
  labs(
    title = "Attendance at Trump Rally",
    y = ""
  )

ggplot(
  data = trump_rally,
  aes(ideology)
) +
  geom_histogram(
    boundary = 10,
    binwidth = 2,
    color = "white",
    fill = viridis(1)
  ) +
  labs(
    title = "Histogram of Ideology",
    x = "Ideology (Liberal-Conservative)",
    y = "Frequency"
  ) +
  theme_minimal() +
  scale_x_continuous(
    breaks = c(seq(-10, 10, by = 5))
  )

ggplot(
  data = trump_rally,
  aes(like_trump)
) +
  geom_histogram(
    boundary = 10,
    binwidth = 2,
    color = "white",
    fill = viridis(1)
  ) +
  labs(
    title = "Histogram of Like/Dislike Trump",
    x = "Dislike - Like: Trump",
    y = "Frequency"
  ) +
  theme_minimal() +
  scale_x_continuous(
    breaks = c(seq(-10, 10, by = 5))
  )

Let’s compare those who were at the Trump rally with those who were not at the rally.

trump_rally$attendance_jitter <- jitter(trump_rally$attendance) # What does jitter do?

ggplot(data = trump_rally,  # data used for plotting
       mapping = aes(x = attendance_jitter, 
                     y = like_trump)
       ) +
  geom_point(
    color = viridis(1, alpha = 0.5), 
    size = 2
    ) + 
  theme_minimal() + # change the appearance
  theme(panel.grid.minor = element_blank()) +
  labs(
    x = "Attended the Trump rally",
    y = "Dislike - Like: Trump"
    ) +
  scale_x_continuous(
    "", 
    breaks = c(0,1), 
    labels = c("was not at rally","was at rally")
    )


trump_rally$attendance_jitter <- NULL

A naive regression model:

lm1 <- lm(
  like_trump ~ attendance,
  data = trump_rally
)
summary(lm1)
## 
## Call:
## lm(formula = like_trump ~ attendance, data = trump_rally)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6665 -2.2202  0.0215  2.0731  8.5448 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.9201     0.1132  -8.126  1.3e-15 ***
## attendance    4.6287     0.2366  19.563  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.144 on 998 degrees of freedom
## Multiple R-squared:  0.2772, Adjusted R-squared:  0.2765 
## F-statistic: 382.7 on 1 and 998 DF,  p-value: < 2.2e-16

Should we control for ideology? Let’s look at a causal graph.

DAG for Confounder

The DAG indicates that ideology is a confounder: Ideology affects whether voters attend the rally and whether they like or dislike Trump. We should thus control for ideology. Let’s see whether this changes the result.

lm2 <- lm(like_trump ~ attendance + ideology,
          data = trump_rally)
summary(lm2)
## 
## Call:
## lm(formula = like_trump ~ attendance + ideology, data = trump_rally)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2541 -1.3940  0.0316  1.3660  6.2687 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01793    0.07881   0.228    0.820    
## attendance   0.23563    0.19724   1.195    0.233    
## ideology     0.49297    0.01361  36.222   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.067 on 997 degrees of freedom
## Multiple R-squared:  0.6879, Adjusted R-squared:  0.6873 
## F-statistic:  1099 on 2 and 997 DF,  p-value: < 2.2e-16

The effect of attending the Trump rally on whether someone likes or dislikes Trump almost entirely disappears. To get an idea why including ideology as a control variable makes such a difference, let’s look at the bivariate relationship between ideology and the popularity of Trump and color the dots according to whether someone was at the rally or not.

Base R

col_vec <- viridis(2, alpha = 0.5)

attend_col <- ifelse(trump_rally$attendance == 1, 
                     col_vec[1],
                     ifelse(trump_rally$attendance == 0, 
                            col_vec[2], NA)) # Why do we nest two ifelse statements?

# instead of nesting two ifelse statements, we can also use case_when from dplyr
# attend_col <- case_when(trump_rally$attendance == 1 ~ col_vec[1],
#                         trump_rally$attendance == 0 ~ col_vec[2],
#                         TRUE ~ NA) # assign NA if attendance is not 1 or 0

plot(jitter(trump_rally$ideology), # Now that we know what jitter does, let's make good use of it
     trump_rally$like_trump,
     xlab = "Ideology (Liberal-Conservative)",
     ylab = "Dislike - Like: Trump",
     pch = 19,
     col = attend_col,
     bty = "n")
legend("topleft",
       col = col_vec,
       pch = 19,
       legend = c("was at rally",
                  "was not at rally"),
       bty = "n")

ggplot2

col_vec <- viridis(2, alpha = 0.5)

# Why do we nest two ifelse statements?
attend_lab <- ifelse(trump_rally$attendance == 1, 
                     "did attend",
                     ifelse(trump_rally$attendance == 0, 
                            "did not attend", NA)) 

# instead of nesting two ifelse statements, we can also use case_when from dplyr 
# attend_lab <- case_when(trump_rally$attendance == 1 ~ "did attend",
#                         trump_rally$attendance == 0 ~ "did not attend",
#                         TRUE ~ NA) # assign NA if trump_rally$attendance is not 1 or 0

ggplot(data = trump_rally,  # data used for plotting
       mapping = aes(x = ideology, 
                     y = like_trump,
                     color = attend_lab)
       ) +
  geom_jitter(
    size = 2,
    width = 0.25
       ) + 
  theme_minimal() + # change the appearance
  labs(
    x = "Ideology (Liberal-Conservative)",
    y = "Dislike - Like: Trump",
    title = ""
      ) + 
  scale_color_manual(
    name = "", 
    values = c("did attend" = col_vec[1], "did not attend" = col_vec[2])
    )

It becomes apparent that people who already like Trump were far more likely to go to the rally than people who dislike Trump. This supports the causal model we specified in the DAG and constitutes what we call selection into treatment. At the same time, ideology affects Trump’s popularity: more conservative people are more in favor of Trump than liberal people. Because ideology affects the independent variable as well as the dependent variable, we must control for it. Otherwise, we compare groups that were already systematically different from one another before the Trump rally, and the regression coefficient picks up this difference as long as we do not control for it.
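
As a quick numerical check of this selection-into-treatment story, we can compare the average ideology of attendees and non-attendees; we would expect attendees to be considerably more conservative:

tapply(trump_rally$ideology, trump_rally$attendance, mean)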

2.2 Mediators

The next team of researchers wants to find out whether civic education programs can increase turnout. In search of empirical evidence, they conduct an experiment in which they randomly assign participants to attend a civic education program (educ_program = 1) or not (educ_program = 0). After the experiment, they conduct a survey among all participants, asking them to indicate on a scale from 0 to 10 how willing they are to cast a vote in the next election (turnout). Furthermore, to understand the effect of the civic education program on turnout independent of subjects’ political awareness, they also ask respondents about their political interest (pol_interest).

Let’s have a look at the resulting data.

load("raw-data/experiment.Rdata")

Base R

barplot(table(experiment$educ_program),
        main = "Assignment to civic education program",
        xlab = "",
        names.arg = c("Control Group", "Treatment Group"),
        col = viridis(1),
        border = F,
        las = 1)

hist(experiment$pol_interest,
     main = "Histogram of political interest",
     xlab = "Political Interest",
     col = viridis(1),
     border = F,
     las = 1)

hist(experiment$turnout,
     main = "Histogram of willingness to turn out",
     xlab = "Willingness to turn out",
     col = viridis(1),
     border = F,
     las = 1)

ggplot2

ggplot(
  data = experiment,
  aes(educ_program)
) +
  geom_bar(
    stat = "count",
    color = "white",
    fill = viridis(1)
  ) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  scale_x_continuous(
    "", 
    breaks = c(0,1), 
    labels = c("Control Group","Treatment Group")
    ) +
  labs(
    title = "Assignment to civic education program", 
    y = ""
    )

ggplot(
  data = experiment,
  aes(pol_interest)
) +
  geom_histogram(
    boundary = 10,
    binwidth = 1,
    color = "white",
    fill = viridis(1)
  ) +
  labs(
    title = "Histogram of political interest",
    x = "Political Interest",
    y = "Frequency"
  ) +
  theme_minimal() +
  scale_x_continuous(
    breaks = c(seq(0, 10, by = 2))
  )

ggplot(
  data = experiment,
  aes(turnout)
) +
  geom_histogram(
    boundary = 10,
    binwidth = 1,
    color = "white",
    fill = viridis(1)
  ) +
  labs(
    title = "Histogram of willingness to turn out",
    x = "Willingness to turn out",
    y = "Frequency"
  ) +
  theme_minimal() +
  scale_x_continuous(
    breaks = c(seq(0, 10, by = 2))
  )

When the team turns to analyzing their data, they are unsure whether they should use the political interest variable as a control variable in their model or not. Let’s see whether this makes a difference:

lm1 <- lm(turnout ~ educ_program,
          data = experiment)
          
summary(lm1)
## 
## Call:
## lm(formula = turnout ~ educ_program, data = experiment)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4019 -0.8897  0.0038  0.8887  2.5734 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.22731    0.05726   73.83   <2e-16 ***
## educ_program  3.26597    0.07956   41.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.257 on 998 degrees of freedom
## Multiple R-squared:  0.628,  Adjusted R-squared:  0.6277 
## F-statistic:  1685 on 1 and 998 DF,  p-value: < 2.2e-16
lm2 <- lm(turnout ~ educ_program + pol_interest,
          data = experiment)
          
summary(lm2)
## 
## Call:
## lm(formula = turnout ~ educ_program + pol_interest, data = experiment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.07142 -0.55983 -0.00923  0.51489  2.91203 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.28361    0.05999  38.064   <2e-16 ***
## educ_program -0.22366    0.09990  -2.239   0.0254 *  
## pol_interest  0.77427    0.01929  40.141   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7776 on 997 degrees of freedom
## Multiple R-squared:  0.8578, Adjusted R-squared:  0.8575 
## F-statistic:  3008 on 2 and 997 DF,  p-value: < 2.2e-16

Which model should they use? Let’s consider a DAG again. Because the civic education program was randomly assigned to participants, political interest cannot affect the ‘education program’ variable. However, if the civic education program worked, it may also affect participants’ level of political interest. The arrow between ‘education program’ and ‘political interest’ is thus directed from ‘education program’ to ‘political interest’. This results in the following DAG:

DAG for Mediation

The DAG shows that pol_interest is a mediator variable and thus should not be included in the regression model. By including a mediator, we run the risk of controlling away the causal effect of interest. In the present case, controlling for the mediator even flips the sign of the estimated causal effect. This is due to unobserved background variables (you can read Montgomery et al. (2018) to learn more about this).
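
If pol_interest is indeed post-treatment, the randomly assigned treatment should itself predict it. A quick check (we would expect a clear positive coefficient on educ_program):

lm_mediator <- lm(pol_interest ~ educ_program, data = experiment)
coef(lm_mediator)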

2.3 Colliders and sample selection

In the final example, a group of university researchers wants to find out how attendance in lectures affects exam grades. They hypothesize that the more lectures a student attends, the better they will do in the final exam. To test their hypothesis, they poll students online about their attendance in lectures and their grade in the exam at the end of the semester.

load("raw-data/studentpoll.Rdata")
glimpse(poll)
## Rows: 3,472
## Columns: 2
## $ grade      <dbl> 2.3, 1.0, 2.0, 3.3, 3.7, 1.7, 3.0, 2.3, 1.7, 2.7, 1.7, 2.3,~
## $ attendance <dbl> 8, 13, 10, 13, 7, 12, 9, 13, 11, 9, 8, 9, 11, 9, 11, 11, 14~

Base R

The data includes the survey results for each participant, namely their reported grade and attendance.

barplot(
  table(poll$grade),
  main = "Exam grades",
  col = viridis(1),
  border = F,
  las = 1
)

barplot(
  table(poll$attendance),
  main = "Number of lectures",
  col = viridis(1),
  border = F,
  las = 1,
)

ggplot2

The data includes the survey results for each participant, namely their reported grade and attendance.

grades <- sort(unique(poll$grade))

ggplot(
  data = poll,
  aes(grade)
) +
  geom_bar(
    stat = "count",
    color = "white",
    fill = viridis(1)
  ) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  scale_x_continuous("", breaks = grades) +
  labs(title = "Exam grades", y = "", x = "")

rm(grades)

ggplot(
  data = poll,
  aes(attendance)
) +
  geom_bar(
    stat = "count",
    color = "white",
    fill = viridis(1)
  ) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  scale_x_continuous("", breaks = seq(0, 14, 1)) +
  labs(title = "Number of lectures attended", y = "", x = "")

After exploring their poll results, the team of researchers starts by estimating a linear regression.

lm1 <- lm(
  grade ~ attendance,
  data = poll
)

summary(lm1)
## 
## Call:
## lm(formula = grade ~ attendance, data = poll)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4768 -0.6764 -0.1403  0.5779  2.6510 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.349015   0.032693  71.850  < 2e-16 ***
## attendance  0.009130   0.003428   2.664  0.00776 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9013 on 3470 degrees of freedom
## Multiple R-squared:  0.002041,   Adjusted R-squared:  0.001753 
## F-statistic: 7.095 on 1 and 3470 DF,  p-value: 0.007765

Surprisingly, the team of researchers do not find any empirical support for their hypothesis. According to their bivariate model, each additional lecture attended actually increases the numerical value of the exam grade by 0.00913 on average. Our prediction would thus be that attending all 14 lectures increases a student’s numerical exam grade by about 0.128 compared to not attending at all (as we are working with the German grading system, where lower numbers are better, this means that attendance has a negative association with performance, which is counterintuitive). A small, but nonetheless unexpected estimated effect.
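
The 0.128 figure is just the bivariate prediction evaluated at the two extremes; a quick sketch:

pred <- predict(lm1, newdata = data.frame(attendance = c(0, 14)))
diff(pred) # about 14 * 0.00913, i.e. roughly 0.128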

Do you have any ideas how this result can be explained?

Luckily, this is simulated data, so let’s have a look at the population to better explain this finding. First, we will take a sample from our population of exactly the same size as our poll. The population consists of all students, whereas the poll only includes students who participated in the online survey.

Perhaps it is helpful to look at a scatterplot. Due to the scale of our variables, we can jitter both x and y to create a scatterplot that somewhat resembles a heatmap.

set.seed(2021)

# create sample from population with same size as poll sample
pop <- population[sample(1:nrow(population), nrow(poll)), ]
# note that base R's sample() on a data frame samples columns, not rows;
# with dplyr we could instead write:
# pop <- slice_sample(population, n = nrow(poll))

Base R

# Scatterplot: Polled students
plot(jitter(poll$attendance, 2.2),
     jitter(poll$grade, 2),
     main = "Polled students",
     xlab = "Attended lectures",
     ylab = "Exam Grade",
     xlim = c(0,14),
     pch = 19,
     col = viridis(1, alpha = 0.1),
     bty = "n")

# now let's create the same plot, but on the sample taken from our population
plot(jitter(pop$attendance, 2.2),
     jitter(pop$grade, 2),
     main = "Student population sample",
     xlab = "Attended lectures",
     ylab = "Exam Grade",
     xlim = c(0,14),
     pch = 19,
     col = viridis(1, alpha = 0.1),
     bty = "n")

Can you spot differences?

ggplot2


ggplot(
  data = poll,  # data used for plotting
  mapping = aes(x = jitter(attendance, 2.2), 
                y = jitter(grade, 2))
    ) +
  geom_point(
    color = viridis(1, alpha = 0.1), 
    size = 2
    ) + 
  theme_minimal() + # change the appearance
  labs(x = "Lectures attended",
       y = "Exam grades",
       title = "Polled students"
       )

ggplot(
  data = pop,  # data used for plotting
  mapping = aes(x = jitter(attendance, 2.2), 
                y = jitter(grade, 2))
  ) +
  geom_point(
    color = viridis(1, alpha = 0.1), 
    size = 2
  ) + 
  theme_minimal() + # change the appearance
  labs(
    x = "Lectures attended",
    y = "Exam grades",
    title = "Student population sample"
    )

# if we had assigned the two plots to objects (say lp and rp),
# we could combine them with the patchwork package: lp + rp

Can you spot differences?

Apparently, the sample is not an accurate reflection of the underlying population of students.

Let’s test the relationship between lecture attendance and exam grades again!

lm2 <- lm(
  grade ~ attendance,
  data = pop
)

summary(lm2)
## 
## Call:
## lm(formula = grade ~ attendance, data = pop)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.00897 -0.75577 -0.00897  0.99103  2.49744 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.008971   0.029352  102.52   <2e-16 ***
## attendance  -0.036172   0.003549  -10.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.007 on 3470 degrees of freedom
## Multiple R-squared:  0.02906,    Adjusted R-squared:  0.02878 
## F-statistic: 103.9 on 1 and 3470 DF,  p-value: < 2.2e-16

What happened?

We learned that colliders are bad controls, and we do not include them in our model to avoid bias. A collider is a variable affected by both X and Y. In other words, both X and Y can be used to explain some of the variation in the collider. You can read more on colliders in the Appendix.

However, colliders do not have to be included as control variables to introduce bias. Adjusting for a collider can also introduce sample selection bias (or collider stratification bias). In this case, the adjustment happened through students’ self-selection into the online poll: by only observing students who participated in the survey, we implicitly “control” for poll participation (because we hold it constant). As a result, the team of researchers polled a sample that does not accurately depict the full population of students.

In our case, attendance affects exam grades, as shown by the regression on the full population. However, attendance also affects poll response.

We can also see the respective associations in the data:


# attendance affects poll response
lm3 <- lm(
  poll ~ attendance,
  data = population
)
lm3
## 
## Call:
## lm(formula = poll ~ attendance, data = population)
## 
## Coefficients:
## (Intercept)   attendance  
##     0.10857      0.01593

# Grades affect poll response
lm4 <- lm(
  poll ~ grade,
  data = population
)
lm4
## 
## Call:
## lm(formula = poll ~ grade, data = population)
## 
## Coefficients:
## (Intercept)        grade  
##     0.40124     -0.06736

In consequence, our sample overrepresents good students who attend regularly and underrepresents failing students, especially those who do not attend.
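
We can quantify this pattern by looking at the poll response rate for each grade; we would expect the response rate to fall as grades get worse (i.e., as the numerical value increases):

round(tapply(population$poll, population$grade, mean), 2)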

Base R

It’s best to look at it:

col_vec <- viridis(2, alpha = 0.3)

poll_col <- ifelse(pop$poll == 1, col_vec[1], col_vec[2])

par(mar=c(5,5,1,5), xpd=TRUE)
plot(jitter(pop$attendance, 2.2), # Now that we know what jitter does, let's make good use of it
     jitter(pop$grade, 2),
     main = "Population sample by poll response",
     xlab = "Lectures attended",
     ylab = "Exam grades",
     pch = 19,
     col = poll_col,
     bty = "n")
legend("topright",
       inset=c(-0.2,0),
       col = viridis(2, alpha = 1),
       pch = 19,
       legend = c("polled",
                  "not polled"),
       bty = "n")

ggplot2

It’s best to look at it:

col_vec <- viridis(2, alpha = 0.3)

poll_lab <- ifelse(pop$poll == 1, 
                     "polled",
                     ifelse(pop$poll == 0, 
                            "not polled", NA)) # Why do we nest two ifelse statements?

ggplot(data = pop,  # data used for plotting
       mapping = aes(x = jitter(attendance, 2.2), 
                     y = jitter(grade, 2),
                     color = poll_lab)
       ) +
  geom_point(size = 2) + 
  theme_minimal() + # change the appearance
  labs(
    x = "Lectures attended",
    y = "Exam grades",
    title = "Population sample by poll response"
  ) + 
  scale_color_manual(
    name = "", 
    values = c("polled" = col_vec[1], "not polled" = col_vec[2])
  )

If you select cases for your study conditional on a collider, this can bias your causal estimates. Further, just like controlling for colliders, it can generate spurious associations between variables that are not related to each other. This sort of sample selection bias arises because when we stratify on a collider, even unintentionally, we also stratify on the X and Y of interest.
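
Since we have the full population here, we can reproduce the biased poll estimate ourselves by conditioning on poll response, i.e., by re-running the model only on those who participated in the poll:

lm_selected <- lm(grade ~ attendance, data = subset(population, poll == 1))
coef(lm_selected) # should be close to the estimate from the self-selected poll sample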

Exercise Section

Now it’s your turn! You want to study how the amount of time students devote to studying affects their final grades. You are interested in the causal effect of study time on scores in a final exam: How many more points do students gain on average for one additional hour of weekly studying?

To study this question, you have the following data:

  • student_id: an identifier for students in the sample
  • study_hours: average number of weekly study hours
  • motivation: whether students were motivated to take the course (prior to the semester)
  • exercises: average weekly number of practice exercises students worked on during their study hours
  • score: score in the final exam
load("raw-data/ex1.Rdata")

head(ex1)
##   student_id study_hours motivation exercises    score
## 1        001           5          0         5 46.84805
## 2        002          13          1         6 68.23539
## 3        003           7          0         1 37.57796
## 4        004          11          1         6 63.03814
## 5        005          13          1         6 67.92869
## 6        006           7          0         4 49.44455

# distribution of weekly study hours
hist(ex1$study_hours,
     main = "Histogram of study hours",
     xlab = "Average weekly study hours",
     col = viridis(1),
     border = F,
     las = 1)

# Distribution of number of weekly exercises
hist(ex1$exercises,
     breaks = max(ex1$exercises),
     main = "Histogram of exercises",
     xlab = "Average weekly number of practice exercises",
     col = viridis(1),
     border = F,
     las = 1)

# Distribution of Motivation
barplot(table(ex1$motivation),
        main = "Motivated and unmotivated students",
        xlab = "",
        names.arg = c("Not motivated", 
                      "Motivated"),
        col = viridis(1),
        border = F,
        las = 1)

# Distribution of final test score
hist(ex1$score,
     main = "Histogram of Final Test Score",
     xlab = "Final Test Score",
     col = viridis(1),
     border = F,
     las = 1)

It is your task to specify a regression model to estimate the causal effect of study hours on students’ final score. In other words: You have to decide which variables to include as control variables.

  • Try out different model specifications (a starter sketch follows below this list). How does the causal effect estimate differ depending on which variables you control for?
  • Construct a DAG.
  • Is motivation a confounder, mediator, or collider?
  • Is exercises a confounder, mediator, or collider?
  • Which variables should we include in our final model? Interpret the causal effect estimate.
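
To get you started, here is one way to set up the comparison (which of these specifications is defensible is exactly what you need to work out):

m1 <- lm(score ~ study_hours, data = ex1)
m2 <- lm(score ~ study_hours + motivation, data = ex1)
m3 <- lm(score ~ study_hours + exercises, data = ex1)
m4 <- lm(score ~ study_hours + motivation + exercises, data = ex1)

coef(m1); coef(m2); coef(m3); coef(m4)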

Concluding remarks

In your homework you will do a peer review!

Appendix

2.4 Appendix I: Colliders

In this final example, another team of researchers wants to find out whether education has an effect on the popularity of populist right parties. The team uses survey data to test their hypothesis. Specifically, they consider three survey items: like_pop_right indicates respondents’ rating of a populist right-wing party, educ indicates respondents’ level of formal education, and vote_pop_right indicates whether respondents voted for the populist right party in the last election.

load("raw-data/populism.Rdata")
head(populism)
##   like_pop_right educ vote_pop_right
## 1   -0.001767621    1              0
## 2    0.451823262    1              0
## 3    4.878047894    5              1
## 4    1.544476798    6              1
## 5    0.170058582    3              0
## 6    1.925084331    6              1
hist(populism$educ,
     breaks = seq(0.5, 7.5, 1),
     main = "Histogram of Education",
     xlab = "Formal level of education",
     col = viridis(1),
     border = F,
     las = 1)

     
hist(populism$like_pop_right,
     main = "Histogram of like/dislike populist right party",
     xlab = "Like/Dislike Populist Right Party",
     col = viridis(1),
     border = F,
     las = 1)


barplot(table(populism$vote_pop_right),
        main = "Vote for populist right party",
        xlab = "",
        names.arg = c("Not voted for\npopulist right party", 
                      "Voted for\npopulist right party"),
        col = viridis(1),
        border = F,
        las = 1)

The team stumbles across a puzzling phenomenon when analyzing the data: when they do not control for vote_pop_right, they find no effect of education on the popularity of populist right-wing parties. However, when they do control for vote_pop_right, they find a negative effect of education on the popularity of populist right-wing parties. They are unsure whether they should include vote_pop_right as a control variable in their model.

lm1 <- lm(like_pop_right ~ educ,
          data = populism)
          
summary(lm1)
## 
## Call:
## lm(formula = like_pop_right ~ educ, data = populism)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0136 -1.7589  0.1305  1.7661  8.4145 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.06934    0.17973   0.386    0.700
## educ        -0.01500    0.04091  -0.367    0.714
## 
## Residual standard error: 2.568 on 998 degrees of freedom
## Multiple R-squared:  0.0001348,  Adjusted R-squared:  -0.0008671 
## F-statistic: 0.1345 on 1 and 998 DF,  p-value: 0.7139
lm2 <- lm(like_pop_right ~ educ + vote_pop_right,
          data = populism)
          
summary(lm2)
## 
## Call:
## lm(formula = like_pop_right ~ educ + vote_pop_right, data = populism)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7237 -1.2412  0.1366  1.3897  5.8827 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.69113    0.13937   4.959 8.33e-07 ***
## educ           -0.44509    0.03519 -12.647  < 2e-16 ***
## vote_pop_right  4.32412    0.16221  26.657  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.963 on 997 degrees of freedom
## Multiple R-squared:  0.4162, Adjusted R-squared:  0.4151 
## F-statistic: 355.4 on 2 and 997 DF,  p-value: < 2.2e-16

Let’s again consider a DAG in order to decide whether vote_pop_right should be included in the model. We suspect that education potentially affects whether someone voted for a populist right party (rather than the other way around), and that whether someone likes a party affects whether they* vote for that party (rather than the other way around). Thus, we draw a directed arrow from ‘education’ to ‘vote’ and another directed arrow from ‘like populist right party’ to ‘vote’. This results in the following DAG:

*‘they’ is a gender-inclusive alternative to the generic ‘he’ or the binary ‘he/she’.

It turns out that vote_pop_right is a collider and thus a bad control variable. Including colliders can generate spurious associations between variables that are not related to each other. When we include colliders, our causal estimates may be biased.
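
We can see the induced association directly by stratifying on the collider ourselves; we would expect the overall correlation between education and party rating to be near zero, but the within-group correlations to turn negative:

cor(populism$educ, populism$like_pop_right) # overall: near zero
by(populism, populism$vote_pop_right,
   function(d) cor(d$educ, d$like_pop_right)) # within voters and non-voters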

2.5 Appendix II: More complex sample selection

From our DAG in the attendance-grade example, we would assume that attendance affects survey response, and so does exam performance. However, the relationship does not have to be direct. We could instead assume that attendance does not affect poll response, because the poll is conducted online, and that there is an unobserved confounder affecting both our outcome, exam grades, and the poll response rate. For example, internet speed (or access) could affect exam grades, because a poor connection makes preparation and learning harder, and the poll response rate, because it makes participating in an online poll harder. It does not, however, affect attendance, since the lecture is taught in person only.

In this case, survey response is a collider of attendance and exam grades, because internet access affects both survey response and exam grades. Those who do better because of their internet access are also more likely to respond to the poll, which introduces a spurious relationship between attendance and exam grades, because we only observe a non-representative subsample.

Keep this in mind when you are tasked with data collection or conduct analyses on existing samples. Think about which groups of the population you collect data on and, more importantly, which parts of the population are underrepresented or omitted entirely. When you work with existing data, you can assess this retrospectively by studying the documentation of the data collection and processing.