Frontiers in Difference-in-Differences

Alternative Identification Strategies

Brantly Callaway

University of Georgia

October 19, 2023

Introduction

\(\newcommand{\E}{\mathbb{E}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\) Recap:

  • Part 1: [DID] with staggered treatment adoption

  • Part 2: [DID] relax parallel trends by including covariates

  • Part 3: [DID] allow for more complicated treatment regimes

We have been following the high-level strategy of (1) targeting disaggregated parameters and then (2) combining them.

  • Part 4: go back to staggered treatment setting, but use alternative identification strategies

Examples in this part

  1. Lagged Outcome Unconfoundedness

  2. Change-in-Changes

  3. Interactive Fixed Effects

Main high-level takeaway Many of the insights that we have had previously continue to apply using other identification strategies

How do we choose among identifying assumptions?

View #1: Parallel trends as a purely reduced form assumption

  • For example, if you have extra pre-treatment periods, you can assess validity in pre-treatment periods

But this is certainly not the only possibility:

  • In some disciplines (e.g., biostats) it is relatively more common to assume unconfoundedness conditional on lagged outcomes (i.e., the LOU approach)

  • This is also what my undergraduate econometrics students almost always suggest (their judgement is not clouded by having thought about these things too much)

  • Or, alternatively, why not take two differences (closely related to linear trends models) or even more…

In my view, these seem like fair points

Model-based view

View #2: Models that lead to parallel trends assumption. In economics, there are many models (here we’ll focus on untreated potential outcomes) such as \[\begin{align*} Y_{it}(0) = h_t(\eta_i, e_{it}) \end{align*}\] where \(\eta_i\) is unobserved heterogeneity and \(e_{it}\) are idiosyncratic unobservables (you can add observed covariates if you want)

The parallel trends assumption comes from using the same sort of model, but layering on the additional functional form assumption

\[Y_{it}(0) = \theta_t + \eta_i + e_{it}\]

  • which additively separates the time and unit effects.

Additional references: Heckman and Robb (1985), Heckman, Ichimura, and Todd (1997), Athey and Imbens (2006), Blundell and Costa Dias (2009), Chabe-Ferret (2015), Ghanem, Sant’Anna, and Wuthrich (2022), and Marx, Tamer, and Tang (2022) all provide connections between models and different panel data approaches to causal inference

Introduction to Lagged Outcome Unconfoundedness

Intuition for Lagged Outcome identification strategies is to compare:

  • Observed outcomes for treated units to observed outcomes for untreated units conditional on having the same pre-treatment outcome(s)

Rough explanation: This is a version of unconfoundedness where the most important variable(s) to consider are lagged outcome(s)

Lagged Outcome Unconfoundedness with Two Periods


Lagged Outcome Unconfoundedness Assumption

\[\E[Y_{t^*}(0) | Y_{t^*-1}(0), D=1] = \E[Y_{t^*}(0) | Y_{t^*-1}(0), D=0]\]

Explanation: On average, untreated potential outcomes in the 2nd period are the same for the treated group as for the untreated group conditional on having the same pre-treatment outcome

Identification: Under LOU (plus an overlap condition), we can identify \(ATT\): \[ \begin{aligned} ATT &= \E[Y_{t^*} | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{250pt} \end{aligned} \]

Lagged Outcome Unconfoundedness with Two Periods


Lagged Outcome Unconfoundedness Assumption

\[\E[Y_{t^*}(0) | Y_{t^*-1}(0), D=1] = \E[Y_{t^*}(0) | Y_{t^*-1}(0), D=0]\]

Explanation: On average, untreated potential outcomes in the 2nd period are the same for the treated group as for the untreated group conditional on having the same pre-treatment outcome

Identification: Under LOU (plus an overlap condition), we can identify \(ATT\): \[ \begin{aligned} ATT &= \E[Y_{t^*} | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{250pt}\\ &= \E[Y_{t^*} | D=1] - \E\Big[\E[ Y_{t^*}(0) | Y_{t^*-1}(0), D=1] | D=1\Big] \end{aligned} \]

Lagged Outcome Unconfoundedness with Two Periods


Lagged Outcome Unconfoundedness Assumption

\[\E[Y_{t^*}(0) | Y_{t^*-1}(0), D=1] = \E[Y_{t^*}(0) | Y_{t^*-1}(0), D=0]\]

Explanation: On average, untreated potential outcomes in the 2nd period are the same for the treated group as for the untreated group conditional on having the same pre-treatment outcome

Identification: Under LOU (plus an overlap condition), we can identify \(ATT\): \[ \begin{aligned} ATT &= \E[Y_{t^*} | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{250pt}\\ &= \E[Y_{t^*} | D=1] - \E\Big[\E[ Y_{t^*}(0) | Y_{t^*-1}(0), D=1] | D=1\Big]\\ &=\E[Y_{t^*} | D=1] - \E\Big[\E[ Y_{t^*}(0) | Y_{t^*-1}(0), D=0] | D=1\Big] \end{aligned} \]

\(\implies ATT\) is identified can be recovered by the difference in the average outcome for the treated group relative to the average outcome condional on lag for untreated group (this is averaged over the distribution of pre-treatment outcomes for the treated group)

LO Unconfoundedness Estimation

Recall under LO unconfoundedness assumption: \[ATT=\E[Y_{t^*} | D=1] - \E\Big[\underbrace{\E[ Y_{t^*}(0) | Y_{t^*-1}(0), D=0]}_{\textrm{challenging to estimate}} | D=1\Big]\]

  • Simplest approach (regression adjustment), assume linear model: \(Y_{it^*}(0) = \beta_0 + \beta_1 Y_{it^*-1}(0) + e_{it}\). Estimate \(\beta_0\) and \(\beta_1\) using set of untreated observations. Then, \[\begin{align*} \widehat{ATT} = \frac{1}{n_1} \sum_{i=1}^n D_i Y_{it^*} - \frac{1}{n_1} \sum_{i=1}^n D_i(\hat{\beta}_0 + \hat{\beta}_1 Y_{it^*-1}) \end{align*}\]
  • Or you can bring out the heavy artillery: nonparametric 1st step, weighting estimators, doubly robust estimators, machine learning etc.

LOU Identification of \(ATT(g,t)\)


Multiple Period Version of LO Unconfoundedness Assumption

For all groups \(g \in \mathcal{G}\) and for all time periods \(t=2,\ldots,\mathcal{T}\), \[\begin{align*} Y_t(0) \independent G | Y_{t-1}(0) \end{align*}\]


Applying a similar argument as before recursively (and noting that to get this argument to go through, we need full independence rather than only mean independence) \[\begin{align*} ATT(g,t) = \E[Y_t|G=g] - \E\Big[\E[Y_t | Y_{g-1}, U=1] \Big| G=g\Big] \end{align*}\]

[longer explanation]

Minimum Wage Example

devtools::install_github("bcallaway11/pte")
library(pte)
data2$G2 <- data2$G 
# lagged outcomes identification strategy
lo_res <- pte::pte_default(yname="lemp",
                           tname="year",
                           idname="id",
                           gname="G2",
                           data=data2,
                           d_outcome=FALSE,
                           lagged_outcome_cov=TRUE)
did::aggte(lo_res$attgt)
summary(lo_res)
ggpte(lo_res)

LO Unconfoundedness \(ATT(g,t)\)

LO Unconfoundedness \(ATT^O\)


Overall ATT:  
    ATT    Std. Error     [ 95%  Conf. Int.]  
 -0.061        0.0074    -0.0755     -0.0466 *


Dynamic Effects:
 Event Time Estimate Std. Error   [95%  Conf. Band]  
         -2   0.0140     0.0096 -0.0091      0.0370  
         -1   0.0103     0.0081 -0.0093      0.0299  
          0  -0.0242     0.0105 -0.0494      0.0011  
          1  -0.0739     0.0077 -0.0925     -0.0552 *
          2  -0.1290     0.0190 -0.1746     -0.0834 *
          3  -0.1403     0.0234 -0.1967     -0.0839 *
---
Signif. codes: `*' confidence band does not cover 0

LO Unconfoundedness Event Study

Introduction to Change-in-Changes

The idea of change-in-changes comes from Athey and Imbens (2006) and builds on work on estimating non-separable production function models. They consider the case where \[\begin{align*} Y_{it}(0) = h_t(U_{it}) \end{align*}\] where \(h_t\) is a nonparametric, time-varying function. To me, it is helpful to think of \(U_{it} = \eta_i + e_{it}\). This model (for the moment) generalizes the model that we used to rationalize parallel trends: \(Y_{it}(0) = \theta_t + \eta_i + e_{it}\).

Additional Conditions:

  1. \(U_{it} \overset{d}{=} U_{it'} | G\). In words: the distribution of \(U_{it}\) does not change over time given a particular group. However, the distribution of \(U_{it}\) can vary across groups.

  2. \(U_{it}\) is scalar

  3. \(h_t\) is stictly monotonically increasing \(\implies\) we can invert it.

  4. Support condition: \(\mathcal{U}_g \subseteq \mathcal{U}_0\) (support of \(U_{it}\) for the treated group is a subset of the support of \(U_{it}\) for the untreated group)

Change-in-Changes Identification

Under the conditions described above, you can show that

\[\begin{align*} ATT(g,t) = \E[Y_t | G=g] - \E\Big[Q_{Y_t(0)|U=1}\big(F_{Y_{g-1}(0)|U=1}(Y_{g-1}(0))\big) | G=g \Big] \end{align*}\] where \(Q_{Y_t(0)|U=1}(\tau)\) is the \(\tau\)-th quantile of \(Y_t(0)\) for the never-treated group (e.g., if \(\tau=0.5\), it is the median of \(Y_t(0)\) for the never-treated group).

  • [As an interesting side-comment, this is derived in Athey and Imbens (2006), way before recent work on group-time average treatment effects, and it is pretty much exactly analogous to the “first step” that we have been emphasizing]

Intuition for Change-in-Changes

Intuition: Notice that, under parallel trends, we can re-write \[\begin{align*} ATT(g,t) = \E[Y_t|G=g] - \E\left[ \Big(\E[Y_t | U=1] - \E[Y_{g-1} | U=1]\Big) + Y_{g-1} | G=g \right] \end{align*}\] which we can think of as: compare observed outcomes to, (an average of) taking observed outcomes in the pre-treatment period and accounting for how outcomes change over time in the untreated group across the same periods

For CIC, the intuition is the same, except the way that we “account for” how outcomes change over time during the same periods for the untreated group is a different.

Because these are different transformations, DID and CIC are non-nested approaches.

Comments

CIC is a nice approach in many applications

  • In addition, to recovering \(ATT(g,t)\), it is also possible to recover quantile treatment effect parameters in this setting (these can allow you to more effectively study treatment effect heterogeneity and are closely related to social welfare calculations/comparisons)

Though it is less commonly used in empirical work than DID.

  • Need to estimate quantiles

  • Harder to include covariates (due to needing to estimate quantiles). I think (not 100% sure though) that it is not possible (at least not obvious) if one can do a doubly robust version of CIC.

  • Support conditions can have real bite in some applications

  • Not as much software support

Minimum Wage Application

devtools::install_github("bcallaway11/qte")
library(qte)
# change-in-changes
cic_res <- qte::cic2(yname="lemp", 
                     gname="G2",
                     tname="year",
                     idname="id",
                     data=data2,
                     boot_type="empirical",
                     cl=4)
summary(cic_res)
ggpte(cic_res)

\(\widehat{ATT}^O = -0.059\), \(\textrm{s.e}(\widehat{ATT}^O) = 0.009\). (This is very close to our estimate using DID before: \(-0.057\))

Minimum Wage Application

Interactive Fixed Effects

Earlier we discussed this model for untreated potential outcomes \(Y_{it}(0) = h_t(\eta_i, e_{it})\) and argued that it was too general to make much progress on.

An intermediate case is an interactive fixed effects model for untreated potential outcomes: \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it} \end{align*}\]

  • \(\lambda_i\) is often referred to as “factor loading” (notation above implies that this is a scalar, but you can allow for higher dimension)

  • \(F_t\) is often referred to as a “factor”

  • \(e_{it}\) is idioyncratic in the sense that \(\E[e_{it} | G_i=g] = 0\) for all groups

In our context, though, it makes sense to interpret these as

  • \(\lambda_i\) unobserved heterogeneity (e.g., individual’s unobserved skill)

  • \(F_t\) the time-varying “return” unobserved heterogeneity (e.g., return to skill)

Interactive Fixed Effects

Interactive fixed effects models for untreated potential outcomes generalize some other important cases:

Example 1: Suppose we observe \(\lambda_i\), then this amounts to the regression adjustment version of DID with a time-invariant covariate considered earlier

Example 2: Suppose you know that \(F_t = t\), then this leads to a unit-specific linear trend model: \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i t + e_{it} \end{align*}\]

To allow for \(F_t\) to change arbitrarily over time is harder…

Example 3: Interactive fixed effects models also provide a connection to “large-T” approaches such as synthetic control and synthetic DID (Abadie, Diamond, and Hainmueller (2010), Arkhangelsky et al. (2021))

  • e.g., one of the motivations of the SCM in ADH-2010 is that (given large-T) constructing a synthetic control can balance the factor loadings in an interactive fixed effects model for untreated potential outcomes

Interactive Fixed Effects

Interactive fixed effects models allow for violations of parallel trends:

\[\begin{align*} \E[\Delta Y_{it}(0) | G=g] = \Delta \theta_t + \E[\lambda_i|G=g]\Delta F_t \end{align*}\] which can vary across groups.

Example: If \(\lambda_i\) is “ability” and \(F_t\) is increasing over time, then (even in the absence of the treatment) groups with higher mean “ability” will tend to increase outcomes more over time than less skilled groups

How can you recover \(ATT(g,t)\) here?

There are a lot of ideas. Probably the most prominent idea is to directly estimate the model for untreated potential outcomes and impute

  • See Xu (2017) and Gobillon and Magnac (2018) for substantial detail on this front

  • For example, Xu (2017) uses Bai (2009) principal components approach to estimate the model. This is a bit different in spirit from what we have been doing before as this argument requires the number of time periods to be “large”

Alternative Approaches with Fixed-T

Very Simple Case:

  • \(\mathcal{T}=4\)

  • 3 groups: 3, 4, \(\infty\)

  • We will target \(ATT(3,3) = \E[\Delta Y_{i3} | G_i=3] - \underbrace{\E[\Delta Y_{i3}(0) | G_i=3]}_{\textrm{have to figure out}}\)

In this case, given the IFE model for untreated potential outcomes, we have: \[\begin{align*} \Delta Y_{i3}(0) &= \Delta \theta_3 + \lambda_i \Delta F_3 + \Delta e_{i3} \\ \Delta Y_{i2}(0) &= \Delta \theta_2 + \lambda_i \Delta F_3 + \Delta e_{i2} \\ \end{align*}\]

The last equation implies that \[\begin{align*} \lambda_i = \Delta F_2^{-1}\Big( \Delta Y_{i2}(0) - \Delta \theta_2 - \Delta e_{i2} \Big) \end{align*}\] Plugging this back into the first equation (and combining terms), we have \(\rightarrow\)

Alternative Approaches with Fixed-T

From last slide, combining terms we have that

\[\begin{align*} \Delta Y_{i3}(0) = \underbrace{\Big(\Delta \theta_3 - \frac{\Delta F_3}{\Delta F_2} \Delta \theta_2 \Big)}_{=: \theta_3^*} + \underbrace{\frac{\Delta F_3}{\Delta F_2}}_{=: F_3^*} \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \end{align*}\]

Now (momentarily) suppose that we (somehow) know \(\theta_3^*\) and \(F_3^*\). Then,

\[\begin{align*} \E[\Delta Y_{i3}(0) | G_i=3] = \theta_3^* + F_3^* \underbrace{\E[\Delta Y_{i2}(0) | G_i = 3]}_{\textrm{identified}} + \underbrace{\E[v_{i3}|G_i=3]}_{=0} \end{align*}\]

\(\implies\) this term is identified; hence, we can recover \(ATT(3,3)\).

Alternative Approaches with Fixed-T

From last slide, combining terms we have that

\[\begin{align*} \Delta Y_{i3}(0) = \underbrace{\Big(\Delta \theta_3 - \frac{\Delta F_3}{\Delta F_2} \Delta \theta_2 \Big)}_{=: \theta_3^*} + \underbrace{\frac{\Delta F_3}{\Delta F_2}}_{=: F_3^*} \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \end{align*}\]

How can we recover \(\theta_3^*\) and \(F_3^*\)?

Notice: this involves untreated potential outcomes through period 3, and we have groups 4 and \(\infty\) for which we observe these untreated potential outcomes. This suggests using those groups.

  • However, this is not so simple because, by construction, \(\Delta Y_{i2}(0)\) is correlated with \(v_{i3}\) (note: \(v_{i3}\) contains \(\Delta e_{i2} \implies\) they will be correlated by construction)

  • We need some exogenous variation (IV) to recover the parameters \(\rightarrow\)

Alternative Approaches with Fixed-T

There are a number of different ideas here:

  • Make additional assumptions ruling out serial correlation in \(e_{it}\) \(\implies\) can use lags of outcomes as instruments:

    • But this is seen as a strong assumption in many applications (Bertrand, Duflo, Mullainathan (2004))
  • Alternatively can introduce covariates and make auxiliary assumptions about them (Callaway and Karami (2023), Brown and Butts (2023), and Brown, Butts, and Westerlund (2023))
  • However, it turns out that, with staggered treatment adoption, you can recover \(ATT(3,3)\) essentially for free (Callaway and Tsyawo (2023)).

Alternative Approaches with Fixed-T

In particular, notice that, given that we have two distinct untreated groups in period 3: group 4 and group \(\infty\), then we have two moment conditions:

\[\begin{align*} \E[\Delta Y_{i3}(0) | G=4] &= \theta_3^* + F_3^* \E[\Delta Y_{i2}(0) | G=4] \\ \E[\Delta Y_{i3}(0) | G=\infty] &= \theta_3^* + F_3^* \E[\Delta Y_{i2}(0) | G=\infty] \\ \end{align*}\] We can solve these for \(\theta_3^*\) and \(F_3^*\), then use these to recover \(ATT(3,3)\).

  • The main requirement is that \(\E[\lambda_i | G=4] \neq \E[\lambda_i|G=\infty]\) (relevance condition)

  • Can scale this argument up for more periods, groups, and IFEs

  • Relative to other approaches, the main drawback is that can’t recover as many \(ATT(g,t)\)’s; e.g., in this example, we can’t recover \(ATT(3,4)\) or \(ATT(4,4)\) which might be recoverable in other settings

Minimum Wage Application

devtools::install_github("bcallaway11/ife")
library(ife)
# staggered ife
load("data3.RData")
data4 <- subset(data3, G %in% c(2007,2006,0))
sife_res <- ife::staggered_ife2(yname="lemp",
                                gname="G",
                                tname="year",
                                idname="id",
                                data=data4,
                                nife=1)

did::ggdid(sife_res$att_gt)

\(\widehat{ATT}^O = -0.010\), \(\textrm{s.e}(\widehat{ATT}^O) = 0.127\).

[Note: this estimate is fully coming from \(ATT(g=2006,t=2006)\). It is actually not very different from what we get with DID for the same parameter: \(-0.009\), \(\textrm{s.e.}=0.009\) though the standard errors are much bigger]

Minimum Wage Application

Summary

This section has emphasized alternative approaches to DID to recover disaggregated treatment effect parameters:

  • Lagged outcome unconfoundedness

  • Change-in-changes

  • Interactive fixed effects models

We have targeted \(ATT(g,t)\). Moving to more aggregated treatment effect parameters such as \(ATT^{ES}(e)\) or \(ATT^O\) is the same as before.

Summary

I want to emphasize the high-level thought process one last time for using/inventing heterogeneity robust causal inference procedures with panel data:

  • Step 1: target disaggregated parameters directly using whatever approach you think would work well for recovering the \(ATT\) for a fixed “group” and “time”

  • Step 2: if desired, combine those disaggregated parameters into lower dimensional parameter that you may be able to estimate better and report more easily; hopefully, you can provide some motivation for this aggregated parameter

Conclusion

Thank you very much for having me!


Contact Information: brantly.callaway@uga.edu

Code and Slides: Available here

Papers:

  • Callaway (2023, Handbook of Labor, Human Resources and Population Economics), [published version]   [draft version]; the draft version is ungated and very similar to the published version.

  • Today is also based on the not-yet-made-publicly available manuscript Baker, Callaway, Cunningham, Goodman-Bacon, and Sant’Anna (be on the lookout for it very soon)

Appendix

LOU Identification Explanation

Simplest possible non-trivial example: \(ATT(g=2,t=3)\).

Auxiliary condition: for any group \(g\), \(\E[Y_{it}(0) | Y_{it-1}(0), \ldots, Y_{i1}(0), G=g] = \E[Y_{it}(0) | Y_{it-1}(0), G=g]\) (intuition: the right number of lags are included in the model). Then,

\[\begin{align*} \E[Y_{i3}(0) | Y_{i1}(0), G_i=2] &= \E\Big[ \E[Y_{i3}(0) | Y_{i2}(0), Y_{i1}(0), G_i=2] \Big| Y_{i1}(0), G_i=2 \Big] \\ &= \E\Big[ \E[Y_{i3}(0) | Y_{i2}(0), G_i=2] \Big| Y_{i1}(0), G_i=2 \Big] \\ &= \E\Big[ \E[Y_{i3}(0) | Y_{i2}(0), U_i=1] \Big| Y_{i1}(0), G_i=2 \Big] \\ &= \E\Big[ h(Y_{i2}) \Big| Y_{i1}(0), G_i=2 \Big] \\ &= \E\Big[ h(Y_{i2}) \Big| Y_{i1}(0), U_i=1 \Big] \\ &= \E\Big[ \E[Y_{i3}(0) | Y_{i2}(0), U_i=1] \Big| Y_{i1}(0), U_i=1 \Big] \\ &= \E\Big[ \E[Y_{i3}(0) | Y_{i2}(0), Y_{i1}(0), U_i=1] \Big| Y_{i1}(0), U_i=1 \Big] \\ &= \E[Y_{i3}(0) | Y_{i1}(0), U_i=1] \end{align*}\]

LOU Identification Explanation (cont’d)

Thus, we have that \[\begin{align*} ATT(g=2,t=3) &= \E[Y_{i3}|G_i=2] - \E[Y_{i3}(0) | G_i=2] \\ &= \E[Y_{i3}|G_i=2] - \E[Y_{i3}(0) | G_i=2] \\ &= \E[Y_{i3}|G_i=2] - \E\Big[ \E[Y_{i3}(0) | Y_{i1}(0), G_i=2] \Big| G_i=2\Big] \\ &= \E[Y_{i3}|G_i=2] - \E\Big[ \E[Y_{i3}(0) | Y_{i1}(0), U_i=1] \Big| G_i=2\Big] \end{align*}\] done.

[Back]