Relaxing Parallel Trends
October 17, 2023
Including covariates in the parallel trends assumption can often make DID identification strategies more plausible:
Example
Minimum wage example: path of teen employment may depend on a state’s population / population growth / region of the country
Job displacement example: path of earnings may depend on years of education / race / occupation
However, there are a number of new issues that can arise in this setting…
\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)
Limitations of TWFE Regression
Identification with Two Periods
Alternative Estimation Strategies
Multiple Periods
Minimum Wage Application
Dealing with “Bad” Controls
Start with the case with only two time periods
Only need a little bit of new notation here:
\(X_{t^*}\) and \(X_{t^*-1}\) — time-varying covariates
\(Z\) — time-invariant covariates
Conditional Parallel Trends Assumption
\[\E[\Delta Y_{t^*}(0) | X_{t^*}, X_{t^*-1},Z,D=1] = \E[\Delta Y_{t^*}(0) | X_{t^*}, X_{t^*-1},Z,D=0]\]
In words: Parallel trends holds conditional on having the same covariates \((X_{t^*},X_{t^*-1},Z)\).
Minimum wage example: e.g., parallel trends conditional on counties having the same population (like \(X_{t^*}\)) and being in the same region of the country (like \(Z\))
In this setting, it is common to run the following TWFE regression:
\[Y_{it} = \theta_t + \eta_i + \alpha D_{it} + X_{it}'\beta + e_{it}\]
However, there are a number of issues:
Issue 1: Issues related to multiple periods and variation in treatment timing still arise
Issue 2: Hard to allow parallel trends to depend on time-invariant covariates
Issue 3: Hard to allow for covariates that could be affected by the treatment
Issue 4: Linearity results in mixing identification and estimation…e.g., with 2 periods \[\begin{align*} \Delta Y_{it} = \Delta \theta_t + \alpha D_{it} + \Delta X_{it}'\beta + \Delta e_{it} \end{align*}\] \(\implies\) differencing out the unit fixed effects can have implications for what the researcher controls for
This doesn’t matter if the model for untreated potential outcomes is truly linear
However, if we think of the linear model as an approximation, this may have meaningful implications (the simulated sketch below illustrates the point)
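To make Issue 4 concrete, here is a small simulated sketch (all data and variable names here are illustrative and not part of the application below): with two periods, the TWFE estimate of \(\alpha\) is numerically identical to a first-difference regression that controls for \(\Delta X_{it}\) rather than the level of \(X_{it}\).

```r
# Illustrative simulation: two periods, treatment turns on in period 2 for some units
set.seed(1)
n  <- 500
df <- data.frame(id = rep(1:n, each = 2), year = rep(1:2, times = n))
df$d <- rep(rbinom(n, 1, 0.5), each = 2) * (df$year == 2)  # treatment indicator
df$x <- rnorm(2 * n)                                       # time-varying covariate
df$y <- 0.5 * df$x + 0.2 * df$d + rep(rnorm(n), each = 2) + rnorm(2 * n)

# TWFE regression with unit and time fixed effects
twfe <- fixest::feols(y ~ d + x | id + year, data = df)

# Equivalent first-difference regression: note that it controls for the *change* in x
wide <- reshape(df, idvar = "id", timevar = "year", direction = "wide")
fd   <- lm(I(y.2 - y.1) ~ I(d.2 - d.1) + I(x.2 - x.1), data = wide)

coef(twfe)["d"]            # TWFE estimate of alpha
coef(fd)["I(d.2 - d.1)"]   # identical with two periods
```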
Even if none of the previous 4 issues apply, \(\alpha\) will still be equal to a weighted average of underlying (conditional-on-covariates) treatment effect parameters.
The weights can be negative, and they can suffer from “weight reversal” (similar to the issue discussed in Sloczynski (2020) in the context of cross-sectional regressions with covariates)
In other words, \(\alpha\) is a weighted average of \(ATT(X)\). Relative to a baseline of weighting by the distribution of \(X\) in the treated group, the weights put more weight on \(ATT(X)\) at covariate values that are relatively uncommon for the treated group (compared to the untreated group) and less weight on \(ATT(X)\) at covariate values that are relatively common for the treated group
See Caetano and Callaway (2023) for more details
Under conditional parallel trends, we have that \[ \begin{aligned} ATT &= \E[\Delta Y_{t^*} | D=1] - \E[\Delta Y_{t^*}(0) | D=1] \hspace{150pt}\\ &=\E[\Delta Y_{t^*} | D=1] - \E\Big[ \E[\Delta Y_{t^*}(0) | X, D=1] \Big| D=1\Big]\\ &= \E[\Delta Y_{t^*} | D=1] - \E\Big[ \underbrace{\E[\Delta Y_{t^*}(0) | X, D=0]}_{=:m_0(X)} \Big| D=1\Big] \end{aligned} \]
Intuition: (i) Compare path of outcomes for treated group to (conditional on covariates) path of outcomes for untreated group, (ii) adjust for differences in the distribution of covariates between groups.
This argument also requires an overlap condition:
For all possible values of the covariates, \(p(x) := \P(D=1|X=x) < 1\).
In words: for all treated units, we can find untreated units that have the same characteristics
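Overlap can be checked informally by estimating the propensity score and looking for fitted values near one. A minimal sketch, assuming a hypothetical two-period data frame `df` with a treatment indicator `D` and covariates `x1` and `x2`:

```r
# Sketch: informal overlap diagnostic (df, D, x1, x2 are hypothetical names)
ps_fit <- glm(D ~ x1 + x2, family = binomial(link = "logit"), data = df)
df$pscore <- predict(ps_fit, type = "response")
summary(df$pscore[df$D == 1])   # fitted p(x) among treated units
mean(df$pscore > 0.99)          # share of observations with p(x) very close to 1
hist(df$pscore, main = "Estimated propensity scores", xlab = "p(x)")
```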
There are several possible ways to turn this identification result into an estimation strategy \(\rightarrow\)
The challenging term to deal with in the previous expression for \(ATT\) is
\[\E\Big[ \underbrace{\E[\Delta Y_{t^*}(0) | X, D=0]}_{=:m_0(X)} \Big| D=1\Big]\]
The most direct way to proceed is by proposing a model for \(m_0(X)\).
This expression suggests a regression adjustment estimator. For example, if we assume that \(m_0(X) = X'\beta_0\), then we have that
\[ATT = \E[\Delta Y_{t^*} | D=1] - \E[X'|D=1]\beta_0\]
and we can estimate the \(ATT\) by
Step 1: Estimate \(\beta_0\) using untreated group
Step 2: Combine this estimate with estimates of \(\E[\Delta Y_{t^*} | D=1]\) and \(\E[X'|D=1]\) (using the treated group) to estimate the \(ATT\) (a sketch of both steps is given below)
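A minimal sketch of this two-step procedure, assuming a hypothetical data frame `df` that contains the outcome change `dy` \(= \Delta Y_{t^*}\), the treatment indicator `D`, and covariates `x1` and `x2`:

```r
# Sketch of regression adjustment (df, dy, D, x1, x2 are hypothetical names)
m0_fit <- lm(dy ~ x1 + x2, data = subset(df, D == 0))    # Step 1: estimate beta_0 on untreated units
m0_hat <- predict(m0_fit, newdata = subset(df, D == 1))  # implied untreated path for treated units
att_ra <- mean(df$dy[df$D == 1]) - mean(m0_hat)          # Step 2: ATT estimate
att_ra
```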
Alternatively, if we could choose balancing weights \(\nu_0(X)\) such that the distribution of \(X\) was the same in the untreated group as it is in the treated group after applying the balancing weights, then we would have that (from the second term above) \[\begin{align*} \E\Big[ \E[\Delta Y_{t^*}(0) | X, D=0 ] \Big| D=1\Big] &= \E\Big[ \nu_0(X) \E[\Delta Y_{t^*}(0) | X, D=0 ] \Big| D=0\Big] \\ &= \E[\nu_0(X) \Delta Y_{t^*}(0) | D=0] \end{align*}\] where the first equality is due to balancing weights and the second by the law of iterated expectations.
The most common way to re-weight is based on the propensity score; one can show that \[\begin{align*} \nu_0(x) = \frac{p(x)(1-p)}{(1-p(x))p} \end{align*}\] where \(p(x) = \P(D=1|X=x)\) and \(p=\P(D=1)\).
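To see why these weights balance the covariates, note that by Bayes’ rule \(dF_{X|D=0}(x) = \frac{1-p(x)}{1-p}\,dF_X(x)\) and \(dF_{X|D=1}(x) = \frac{p(x)}{p}\,dF_X(x)\), so that \[\begin{align*} \nu_0(x)\, dF_{X|D=0}(x) = \frac{p(x)(1-p)}{(1-p(x))p} \cdot \frac{1-p(x)}{1-p}\, dF_X(x) = \frac{p(x)}{p}\, dF_X(x) = dF_{X|D=1}(x) \end{align*}\] which is exactly the re-weighting of the untreated group needed for the first equality above.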
This is the approach suggested in Abadie (2005). In practice, you need to estimate the propensity score. The most common choices are probit or logit.
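Putting these pieces together, a minimal sketch of the resulting Abadie (2005)-style IPW estimator, using the same hypothetical data frame `df` (with `dy`, `D`, `x1`, `x2`) as above and a logit propensity score:

```r
# Sketch of the IPW estimator (df, dy, D, x1, x2 are hypothetical names)
ps_fit <- glm(D ~ x1 + x2, family = binomial(link = "logit"), data = df)
pX <- predict(ps_fit, type = "response")
p  <- mean(df$D)
att_ipw <- mean(df$D * df$dy) / p -
  mean(pX * (1 - df$D) * df$dy / ((1 - pX) * p))
att_ipw
```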
Alternatively, you can show
\[ATT=\E\left[ \left( \frac{D}{p} - \frac{p(X)(1-D)}{(1-p(X))p} \right)(\Delta Y_{t^*} - \E[\Delta Y_{t^*} | X, D=0]) \right]\]
This requires estimating both \(p(X)\) and \(\E[\Delta Y_{t^*}|X,D=0]\).
Big advantage: this approach is doubly robust, i.e., it is consistent if either the propensity score \(p(X)\) or the outcome regression \(\E[\Delta Y_{t^*}|X,D=0]\) is correctly specified (more on double robustness below)
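A minimal sketch of this doubly robust estimand, combining the two pieces above (same hypothetical `df`, `dy`, `D`, `x1`, `x2`):

```r
# Sketch of the doubly robust estimator from the display above
ps_fit <- glm(D ~ x1 + x2, family = binomial(link = "logit"), data = df)
pX     <- predict(ps_fit, type = "response")
p      <- mean(df$D)
m0_fit <- lm(dy ~ x1 + x2, data = subset(df, D == 0))
m0X    <- predict(m0_fit, newdata = df)
att_dr <- mean((df$D / p - pX * (1 - df$D) / ((1 - pX) * p)) * (df$dy - m0X))
att_dr
```

In practice, estimators along these lines (with appropriate standard errors) are implemented in the DRDID package, which the est_method options of the did package used below draw on.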
Conditional Parallel Trends with Multiple Periods
For all groups \(g \in \bar{\mathcal{G}}\) (all groups except the never-treated group) and for all time periods \(t=2, \ldots, \mathcal{T}\),
\[\E[\Delta Y_t(0) | \mathbf{X}, Z, G=g] = \E[\Delta Y_t(0) | \mathbf{X}, Z, U=1]\]
where \(\mathbf{X} := (X_1,X_2,\ldots,X_\mathcal{T})\).
Under this assumption, using similar arguments to the ones above, one can show that
\[ATT(g,t) = \E\left[ \left( \frac{\indicator{G=g}}{p_g} - \frac{p_g(\mathbf{X},Z)U}{(1-p_g(\mathbf{X},Z))p_g}\right)\Big(Y_t - Y_{g-1} - m_{gt}^0(\mathbf{X},Z)\Big) \right]\]
where \(p_g(\mathbf{X},Z) := \P(G=g|\mathbf{X},Z,\indicator{G=g}+U=1)\) and \(m_{gt}^0(\mathbf{X},Z) := \E[Y_t-Y_{g-1}|\mathbf{X},Z,U=1]\).
Because \(\mathbf{X}\) contains \(X_t\) for all time periods, terms like \(m_{gt}^0(\mathbf{X},Z)\) can be quite high-dimensional (and hard to estimate) in many applications. In many cases, it may be reasonable to replace \(\mathbf{X}\) with a lower-dimensional function of \(\mathbf{X}\) (this is available in the pte package currently and may be added to did soon).
Minimum Wage Application
We’ll allow for the path of outcomes to depend on the region of the country
# run TWFE regression
twfe_x <- fixest::feols(lemp ~ post | id + region^year,
                        data=data2)
modelsummary(twfe_x, gof_omit=".*")

|      | (1)     |
|------|---------|
| post | 0.001   |
|      | (0.008) |
Relative to the previous results, this estimate is much smaller and statistically insignificant; it is similar to the result in Dube et al. (2010).
# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp",
               tname="year",
               idname="id",
               gname="G",
               xformla=~region,
               control_group="nevertreated",
               base_period="universal",
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)
Call:
aggte(MP = cs_x, type = "group")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0273 0.0087 -0.0444 -0.0101 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0436 0.0199 -0.0860 -0.0013 *
2006 -0.0199 0.0087 -0.0384 -0.0013 *
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Doubly Robust
# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp",
               tname="year",
               idname="id",
               gname="G",
               xformla=~region + lpop + lavg_pay,
               control_group="nevertreated",
               base_period="universal",
               est_method="reg",
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)
Call:
aggte(MP = cs_x, type = "group")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0321 0.0082 -0.0482 -0.016 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0596 0.0186 -0.1013 -0.0178 *
2006 -0.0197 0.0074 -0.0363 -0.0031 *
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Outcome Regression
# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp",
               tname="year",
               idname="id",
               gname="G",
               xformla=~region + lpop + lavg_pay,
               control_group="nevertreated",
               base_period="universal",
               est_method="ipw",
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)
Call:
aggte(MP = cs_x, type = "group")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0313 0.0078 -0.0467 -0.0159 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0514 0.0184 -0.0901 -0.0128 *
2006 -0.0222 0.0073 -0.0376 -0.0068 *
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Inverse Probability Weighting
# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp",
               tname="year",
               idname="id",
               gname="G",
               xformla=~region + lpop + lavg_pay,
               control_group="nevertreated",
               base_period="universal",
               est_method="dr",
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)
Call:
aggte(MP = cs_x, type = "group")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0317 0.0077 -0.0467 -0.0167 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0509 0.0201 -0.0921 -0.0096 *
2006 -0.0230 0.0077 -0.0387 -0.0073 *
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Doubly Robust
# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp",
               tname="year",
               idname="id",
               gname="G",
               xformla=~region + lpop + lavg_pay,
               control_group="nevertreated",
               base_period="varying",
               est_method="dr",
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)
Call:
aggte(MP = cs_x, type = "group")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0317 0.0081 -0.0475 -0.0159 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0509 0.0197 -0.0936 -0.0081 *
2006 -0.0230 0.0077 -0.0398 -0.0062 *
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Doubly Robust
# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp",
               tname="year",
               idname="id",
               gname="G",
               xformla=~region + lpop + lavg_pay,
               control_group="notyettreated",
               base_period="universal",
               est_method="dr",
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)
Call:
aggte(MP = cs_x, type = "group")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0312 0.0084 -0.0477 -0.0147 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0493 0.0189 -0.0906 -0.008 *
2006 -0.0230 0.0073 -0.0391 -0.007 *
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Not Yet Treated, Anticipation Periods: 0
Estimation Method: Doubly Robust
# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp",
               tname="year",
               idname="id",
               gname="G",
               xformla=~region + lpop + lavg_pay,
               control_group="nevertreated",
               base_period="universal",
               est_method="dr",
               anticipation=1,
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)
Call:
aggte(MP = cs_x, type = "group")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-4e-04 0.0102 -0.0204 0.0195
Group Effects:
Group Estimate Std. Error [95% Pointwise Conf. Band]
2006 -4e-04 0.0095 -0.019 0.0182
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Never Treated, Anticipation Periods: 1
Estimation Method: Doubly Robust
So far, our discussion has been for the case where the time-varying covariates evolve exogenously.
But in other cases, there can exist covariates that we would like to include in the parallel trends assumption that could be affected by the treatment (this type of covariate is often referred to as a bad control).
The traditional approach in empirical work is to completely drop covariates that could have been affected by the treatment.
To wrap our heads around this, let’s go back to the case with two time periods.
Define treated and untreated potential covariates: \(X_{it}(1)\) and \(X_{it}(0)\). Notice that in the “textbook” two period setting, we observe \[X_{it^*} = D_i X_{it^*}(1) + (1-D_i) X_{it^*}(0) \qquad \textrm{and} \qquad X_{it^*-1} = X_{it^*-1}(0)\]
If the covariates are literally not affected by the treatment at all, then we can write the version of conditional parallel trends that we have been using above in terms of potential outcomes:
Conditional Parallel Trends using Untreated Potential Covariates
\[\E[\Delta Y_{t^*}(0) | X_{t^*}(0), X_{t^*-1}(0), Z, D=1] = \E[\Delta Y_{t^*}(0) | X_{t^*}(0), X_{t^*-1}(0), Z, D=0]\]
One idea is to just ignore that the covariates may have been affected by the treatment:
Alternative Conditional Parallel Trends 1
\[\E[\Delta Y_{t^*}(0) | {\color{red} X_{t^*}}, X_{t^*-1}(0), Z, D=1] = \E[\Delta Y_{t^*}(0) | {\color{red} X_{t^*}}, X_{t^*-1}(0), Z, D=0]\]
The limitations of this approach are well known (even discussed in MHE), and this is not typically the approach taken in empirical work
Job Displacement Example: You would compare paths of outcomes for workers who left a union because they were displaced to paths of outcomes for non-displaced workers who also left a union (e.g., because of a better non-unionized job opportunity)
It is more common in empirical work to drop \(X_t(0)\) entirely from the parallel trends assumption
Alternative Conditional Parallel Trends 2
\[\E[\Delta Y_{t^*}(0) | Z, D=1] = \E[\Delta Y_{t^*}(0) | Z, D=0]\]
In my view, this is not attractive either though. If we believe this assumption, then we have basically solved the bad control problem by assuming that it does not exist.
Job Displacement Example: We have now just assumed that path of earnings (absent job displacement) doesn’t depend on union status
Perhaps a better alternative identifying assumption is the following one
Alternative Conditional Parallel Trends 3
\[\E[\Delta Y_{t^*}(0) | X_{t^*-1}(0), Z, D=1] = \E[\Delta Y_{t^*}(0) | X_{t^*-1}(0), Z, D=0]\]
Intuition: Conditional parallel trends holds after conditioning on pre-treatment time-varying covariates that could have been affected by treatment
Job Displacement Example: Path of earnings (absent job displacement) depends on pre-treatment union status, but not untreated potential union status in the second period
What to do: Since \(X_{t^*-1}(0)\) is observed for all units, we can immediately operationalize this assumption using our arguments from earlier (i.e., the ones without bad controls)
This is difficult to operationalize with a TWFE regression
In practice, you can just include the bad control among the other covariates in did, as in the sketch below
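For example, a hedged sketch with att_gt, treating lpop as the covariate that could be affected by the treatment (that choice is purely illustrative, and the intent, per Alternative 3, is for it to enter through its pre-treatment value):

```r
# Sketch: include the possibly-affected covariate in xformla (lpop is illustrative here;
# per Alternative 3, what we want is for it to enter through its pre-treatment value)
cs_bad <- att_gt(yname="lemp",
                 tname="year",
                 idname="id",
                 gname="G",
                 xformla=~region + lpop,
                 control_group="nevertreated",
                 base_period="universal",
                 est_method="dr",
                 data=data2)
summary(aggte(cs_bad, type="group"))
```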
Going back to the original conditional parallel trends assumption (that includes both \(X_{t^*}(0)\) and \(X_{t^*-1}(0)\))… Using the same sort of arguments as for regression adjustment earlier, it follows that
\[ATT = \E[\Delta Y_{t^*} | D=1] - \E\Big[ \E[\Delta Y_{t^*}(0) | X_{t^*}(0), X_{t^*-1}(0), Z, D=0] \Big| D=1\Big]\]
The second term is the tricky one. Notice that:
The inside conditional expectation is identified — we see untreated potential outcomes and covariates for the untreated group
But the outside expectation is infeasible \(\implies\) without an additional assumption, we are stuck
Covariate Unconfoundedness Assumption
\[X_{t^*}(0) \independent D | X_{t^*-1}(0), Z\]
Intuition: For the treated group, the time-varying covariate would have evolved in the same way over time as it actually did for the untreated group, conditional on \(X_{t^*-1}\) and \(Z\).
Notice that this assumption only concerns untreated potential covariates \(\implies\) it allows for \(X_{t^*}\) to be affected by the treatment
Making an assumption like this indicates that \(X_{t^*}(0)\) is playing a dual role: (i) start by treating it as if it’s an outcome, (ii) have it continue to play a role as a covariate
Under this assumption, one can show that we can recover the \(ATT\):
\[ATT = \E[\Delta Y_{t^*} | D=1] - \E\left[ \E[\Delta Y_{t^*} | X_{t^*-1}, Z, D=0] \Big| D=1 \right]\]
This is the same expression as under Alternative 3 above (a sketch of the argument is given below)
In some cases, it may make sense to condition on other additional variables (e.g., the lagged outcome \(Y_{t^*-1}\)) in the covariate unconfoundedness assumption. In this case, it is still possible to identify \(ATT\), but it is more complicated
It could also be possible to use alternative identifying assumptions besides covariate unconfoundedness; at a high level, we somehow need to recover the distribution of \(X_{t^*}(0)\)
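A sketch of the argument behind the \(ATT\) expression above: for the treated group, \[\begin{align*} \E[\Delta Y_{t^*}(0) | X_{t^*-1}, Z, D=1] &= \E\Big[ \E[\Delta Y_{t^*}(0) | X_{t^*}(0), X_{t^*-1}, Z, D=1] \,\Big|\, X_{t^*-1}, Z, D=1 \Big] \\ &= \E\Big[ \E[\Delta Y_{t^*}(0) | X_{t^*}(0), X_{t^*-1}, Z, D=0] \,\Big|\, X_{t^*-1}, Z, D=1 \Big] \\ &= \E\Big[ \E[\Delta Y_{t^*}(0) | X_{t^*}(0), X_{t^*-1}, Z, D=0] \,\Big|\, X_{t^*-1}, Z, D=0 \Big] \\ &= \E[\Delta Y_{t^*} | X_{t^*-1}, Z, D=0] \end{align*}\] where the first equality is the law of iterated expectations, the second uses conditional parallel trends (in terms of untreated potential covariates), the third uses covariate unconfoundedness, and the last uses iterated expectations again together with \(\Delta Y_{t^*} = \Delta Y_{t^*}(0)\) for the untreated group. Averaging over the distribution of \((X_{t^*-1},Z)\) among the treated group then delivers the expression for \(ATT\) above.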
See Caetano et al. (2023) for more details about bad controls.
To understand double robustness, we can rewrite the expression for \(ATT\) as \[\begin{align*} ATT = \E\left[ \frac{D}{p} \Big(\Delta Y_{t^*} - m_0(X)\Big) \right] - \E\left[ \frac{p(X)(1-D)}{(1-p(X))p} \Big(\Delta Y_{t^*} - m_0(X)\Big)\right] \end{align*}\]
The first term is exactly the same as what comes from regression adjustment
If we correctly specify a model for \(m_0(X)\), then this term by itself will be equal to \(ATT\).
If \(m_0(X)\) is not correctly specified, then, by itself, this term will be biased for \(ATT\)
The second term can be thought of as a de-biasing term
If \(m_0(X)\) is correctly specified, it is equal to 0
If \(p(X)\) is correctly specified, it reduces to \(\E[\Delta Y_{t^*}(0) | D=1] - \E[m_0(X)|D=1]\), which both delivers the counterfactual untreated potential outcomes and removes the (possibly misspecified) \(\E[m_0(X)|D=1]\) term coming from the first term (a short calculation below makes the last two bullets explicit)
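To make the last two bullets explicit, apply the law of iterated expectations (conditioning on \(X\) and using \(\P(D=0|X)=1-p(X)\)) to the de-biasing term: \[\begin{align*} \E\left[ \frac{p(X)(1-D)}{(1-p(X))p} \Big(\Delta Y_{t^*} - m_0(X)\Big)\right] = \E\left[ \frac{p(X)}{p} \Big( \E[\Delta Y_{t^*} | X, D=0] - m_0(X)\Big)\right] \end{align*}\] The right-hand side is zero when \(m_0(X) = \E[\Delta Y_{t^*}|X,D=0]\) (the same conclusion holds if a misspecified propensity score is plugged into the weights), and, because \(\E\big[\tfrac{p(X)}{p}g(X)\big] = \E[g(X)|D=1]\) for any function \(g\), it equals \(\E[\Delta Y_{t^*}(0)|D=1] - \E[m_0(X)|D=1]\) under conditional parallel trends when the propensity score is correctly specified.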
Comments
Even more than in the previous case, the results here are notably different depending on the estimation strategy.