More Complicated Treatment Regimes
October 19, 2023
\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\) The discussion so far (and much of the recent DID literature) has focused on the setting with staggered treatment adoption.
However, this certainly does not cover the full range of possible treatments. In Part 3, we’ll primarily consider two leading extensions:
A treatment that is multi-valued or continuous (e.g., the effect of the length of school closures during Covid on student test scores)
A treatment that can turn on and off (e.g., union status)
A couple of things to notice as we go along:
I’m not going to cover much on TWFE regressions here. In these settings, there are even more ways for them to go wrong.
Try to pay attention to the pattern. Even though the arguments are getting more complicated, we are still following the same idea: (i) target disaggregated parameters, (ii) combine them into lower dimensional objects, and (iii) note that there are some additional interpretation issues here that are worth emphasizing
Potential outcomes notation
Two time periods: \(t^*\) and \(t^*-1\)
Potential outcomes: \(Y_{it^*}(d)\)
Observed outcomes: \(Y_{it^*}\) and \(Y_{it^*-1}\)
\[Y_{it^*}=Y_{it^*}(D_i) \quad \textrm{and} \quad Y_{it^*-1}=Y_{it^*-1}(0)\]
Level Effects (Average Treatment Effect on the Treated)
\[ATT(d|d) := \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d]\]
Interpretation: The average effect of dose \(d\) relative to not being treated local to the group that actually experienced dose \(d\)
This is the natural analogue of \(ATT\) in the binary treatment case
Slope Effects (Average Causal Response on the Treated)
\[ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d}\]
We can view \(ACRT(d|d)\) as the “building block” here. An aggregated version of it (into a single number) is \[\begin{align*} ACRT^O := \E[ACRT(D|D)|D>0] \end{align*}\]
\(ACRT^O\) averages \(ACRT(d|d)\) over the population distribution of the dose
Like \(ATT^O\) for staggered treatment adoption, \(ACRT^O\) is the natural target parameter for the TWFE regression in this case
“Standard” Parallel Trends Assumption
For all \(d\),
\[\E[\Delta Y_{t^*}(0) | D=d] = \E[\Delta Y_{t^*}(0) | D=0]\]
Then,
\[ \begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d]\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[\Delta Y_{t^*}(0) | D=0]\\ &= \E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*} | D=0] \end{aligned} \]
This is exactly what you would expect
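As a minimal sketch of the identification result above, the following Python snippet estimates \(ATT(d|d)\) for each positive dose as the mean change in outcomes for that dose group minus the mean change for the untreated group. It assumes a discrete set of doses; the function name `att_by_dose` and the toy data are hypothetical and only for illustration.

```python
from collections import defaultdict
from statistics import mean


def att_by_dose(doses, delta_y):
    """Estimate ATT(d|d) for each positive dose d as
    mean(dY | D = d) - mean(dY | D = 0), where dY = Y_{t*} - Y_{t*-1}."""
    by_dose = defaultdict(list)
    for d, dy in zip(doses, delta_y):
        by_dose[d].append(dy)
    if 0 not in by_dose:
        raise ValueError("need an untreated (D = 0) comparison group")
    baseline = mean(by_dose[0])  # path of outcomes for the untreated group
    return {d: mean(v) - baseline for d, v in sorted(by_dose.items()) if d != 0}


# Hypothetical toy data: three untreated units and two units at each of two doses
doses = [0, 0, 0, 1, 1, 2, 2]
delta_y = [1.0, 1.2, 0.8, 2.0, 2.4, 4.0, 3.0]
print(att_by_dose(doses, delta_y))
```

Each entry is a separate 2x2 comparison, so each estimate is local to the group that actually experienced that dose.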
Unfortunately, this does not settle everything
Most empirical work with a continuous treatment wants to think about how causal responses vary across dose
There are new issues related to comparing \(ATT(d|d)\) at different doses and interpreting these differences as causal effects
At a high level, these issues arise from a tension between empirical researchers wanting to use a quasi-experimental research design (which delivers “local” treatment effect parameters) but (often) wanting to compare these “local” parameters to each other
Unlike the staggered, binary treatment case: No easy fixes here!
Consider comparing \(ATT(d|d)\) for two different doses
\[ \begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l)\\ & \quad = \E[Y_{t^*}(d_h)-Y_{t^*}(d_l) | D=d_h] + \E[Y_{t^*}(d_l) - Y_{t^*}(0) | D=d_h] - \E[Y_{t^*}(d_l) - Y_{t^*}(0) | D=d_l]\\ & \quad = \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l) | D=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}} \end{aligned} \]
“Standard” Parallel Trends is not strong enough to rule out the selection bias terms here
Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions than standard parallel trends
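To make the decomposition concrete, here is a small numerical sketch in Python. The average potential outcomes in `mean_po` are made-up numbers (in practice the counterfactual entries are not observed); they are chosen so that the high-dose group would benefit more from the low dose than the low-dose group does.

```python
# Hypothetical values of E[Y_{t*}(dose) | D = group]; rows are groups,
# columns are the dose whose potential outcome is being averaged.
mean_po = {
    "d_l": {"0": 10.0, "d_l": 12.0, "d_h": 13.0},
    "d_h": {"0": 10.0, "d_l": 14.0, "d_h": 17.0},
}


def att(dose, group):
    """ATT(dose | group) = E[Y(dose) - Y(0) | D = group]."""
    return mean_po[group][dose] - mean_po[group]["0"]


# What comparing DID estimates across doses delivers
difference = att("d_h", "d_h") - att("d_l", "d_l")

# Causal response of moving from d_l to d_h, for the d_h group
causal_response = mean_po["d_h"]["d_h"] - mean_po["d_h"]["d_l"]

# Selection bias: the two groups respond differently to the same dose d_l
selection_bias = att("d_l", "d_h") - att("d_l", "d_l")

print(difference, causal_response, selection_bias)
assert difference == causal_response + selection_bias
```

In this example the difference in \(ATT\)'s is 5, of which only 3 is a causal response; the remaining 2 comes from the two groups having different effects of the same dose \(d_l\).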
This problem spills over into identifying \(ACRT(d|d)\)
Intuition:
Difference-in-differences identification strategies result in \(ATT(d|d)\) parameters. These are local parameters and difficult to compare to each other
This explanation is similar to thinking about LATEs with two different instruments
Thus, comparing \(ATT(d|d)\) across different values is tricky and not for free
What can you do?
One idea: just recover \(ATT(d|d)\) and interpret it cautiously (on its own, not relative to other values of \(d\))
If you want to compare them to each other, it will come with the cost of additional (structural) assumptions
“Strong” Parallel Trends Assumption
For all doses \(d\) and \(l\),
\[\mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0) | D=l] = \mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d]\]
This is notably different from “Standard” Parallel Trends
It involves potential outcomes for all values of the dose (not just untreated potential outcomes)
All dose groups would have experienced the same path of outcomes had they been assigned the same dose
Strong parallel trends implies a version of treatment effect homogeneity. Notice:
\[ \begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d] \\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=l] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=l] \\ &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=l] = ATT(d|l) \end{aligned} \]
Since this holds for all \(d\) and \(l\), it also implies that \(ATT(d|d) = ATE(d) = \E[Y_{t^*}(d) - Y_{t^*}(0)]\). Thus, under strong parallel trends, we have that
\[ATE(d) = \E[\Delta Y_{t^*}|D=d] - \E[\Delta Y_{t^*}|D=0]\]
The RHS is exactly the same expression as for \(ATT(d|d)\) under “standard” parallel trends, but here (i) the assumptions are different and (ii) the parameter interpretation is different
ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose
\[ \begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{t^*}(d_h) - Y_{t^*}(0)] - \E[Y_{t^*}(d_l) - Y_{t^*}(0)]\\ &= \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l)]}_{\textrm{Causal Response}} \end{aligned} \]
Thus, recovering \(ATE(d)\) side-steps the issues about comparing treatment effects across doses, but it comes at the cost of needing a (potentially very strong) extra assumption
Given that we can compare \(ATE(d)\)’s across dose, we can recover slope effects in this setting
\[ \begin{aligned} ACR(d) := \frac{\partial ATE(d)}{\partial d} \qquad &\textrm{or} \qquad ACR^O := \E[ACR(D) | D>0] \end{aligned} \]
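As a rough sketch of how slope effects could be computed when the dose takes a discrete set of values, the snippet below uses finite differences between adjacent doses as an analog of the derivative in \(ACR(d)\). This is only meaningful if the inputs can be interpreted as \(ATE(d)\), i.e., under strong parallel trends. The function name and the numbers are hypothetical.

```python
def acr_finite_differences(ate_by_dose):
    """Slope of ATE between adjacent doses:
    (ATE(d_j) - ATE(d_{j-1})) / (d_j - d_{j-1}).
    `ate_by_dose` maps dose -> ATE(dose); include dose 0 with ATE 0
    to get the slope from untreated to the lowest dose."""
    doses = sorted(ate_by_dose)
    return {
        (lo, hi): (ate_by_dose[hi] - ate_by_dose[lo]) / (hi - lo)
        for lo, hi in zip(doses, doses[1:])
    }


# Hypothetical ATE(d) estimates; ATE(0) = 0 by definition
ate = {0: 0.0, 1: 1.2, 2: 2.5, 4: 3.5}
print(acr_finite_differences(ate))
```

Averaging these slopes with weights given by the share of treated units at each dose would give a discrete analog of \(ACR^O\).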
Can you relax strong parallel trends?
Positive side-comment: No untreated units
Consider the same TWFE regression (but now \(D_{it}\) is continuous): \[\begin{align*} Y_{it} = \theta_t + \eta_i + \alpha D_{it} + e_{it} \end{align*}\] You can show that \[\begin{align*} \alpha = \int_{\mathcal{D}_+} w(l) m'_\Delta(l) \, dl \end{align*}\] where \(m_\Delta(l) := \E[\Delta Y_{t^*}|D=l] - \E[\Delta Y_{t^*}|D=0]\) and \(w(l)\) are weights
Under standard parallel trends, \(m'_{\Delta}(l) = ACRT(l|l) + \textrm{local selection bias}\)
Under strong parallel trends, \(m'_{\Delta}(l) = ACR(l)\).
Thus, issues related to selection bias continue to show up here
About the weights: they are all positive, but they have some strange properties. For example, they are always maximized at \(l = \E[D]\), even if this is not a common value for the dose
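To see what the TWFE regression computes in the two-period case, the sketch below calculates \(\alpha\) by double-demeaning a balanced two-period panel (with no one treated in the first period) and checks that it matches the slope from regressing \(\Delta Y_{t^*}\) on \(D\). The function names and toy data are hypothetical.

```python
from statistics import mean


def twfe_alpha(y_pre, y_post, dose):
    """TWFE coefficient on D_it in a balanced two-period panel where
    D = 0 for everyone in the first period, via double-demeaning."""
    n = len(dose)
    y = [(y_pre[i], y_post[i]) for i in range(n)]
    d = [(0.0, dose[i]) for i in range(n)]

    def double_demean(x):
        unit_means = [mean(row) for row in x]
        time_means = [mean(row[t] for row in x) for t in range(2)]
        grand = mean(unit_means)  # equals the overall mean in a balanced panel
        return [
            [x[i][t] - unit_means[i] - time_means[t] + grand for t in range(2)]
            for i in range(n)
        ]

    y_dm, d_dm = double_demean(y), double_demean(d)
    num = sum(d_dm[i][t] * y_dm[i][t] for i in range(n) for t in range(2))
    den = sum(d_dm[i][t] ** 2 for i in range(n) for t in range(2))
    return num / den


def first_difference_slope(y_pre, y_post, dose):
    """Slope from regressing the change in outcomes on the dose."""
    dy = [post - pre for pre, post in zip(y_pre, y_post)]
    d_bar, dy_bar = mean(dose), mean(dy)
    num = sum((d - d_bar) * (v - dy_bar) for d, v in zip(dose, dy))
    den = sum((d - d_bar) ** 2 for d in dose)
    return num / den


# Hypothetical toy data
dose = [0, 0, 1, 2, 3]
y_pre = [5.0, 6.0, 5.0, 7.0, 6.0]
y_post = [6.0, 7.0, 8.0, 11.0, 12.0]
print(twfe_alpha(y_pre, y_post, dose), first_difference_slope(y_pre, y_post, dose))
```

The two numbers agree, which is why \(\alpha\) can be written entirely in terms of how \(\E[\Delta Y_{t^*}|D=l]\) varies with \(l\).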
Other issues can arise in more complicated cases
For example, suppose you have a staggered continuous treatment, then you will additionally get issues that are analogous to the ones we discussed earlier for a binary staggered treatment
In general, things get worse for TWFE regressions with more complications
“Scarring” vs. Moving in and out of treatment
Example treatments:
Union status (Vella and Verbeek, 1998)
Whether or not location hit by hurricane (Deryugina, 2017)
Whether or not a district shares the same ethnicity as the president of the country (Burgess, et al., 2015)
Additional Notation:
We can make a lot of progress by redefining our notion of a “group”
Keep track of entire treatment regime \(\mathbf{D}_i := (D_{i1}, \ldots, D_{i\mathcal{T}})'\) and/or treatment history up to period \(t\): \(\mathbf{D}_{it} := (D_{i1}, \ldots, D_{it})'\).
Potential outcomes \(Y_{it}(\mathbf{d}_t)\) where \(\mathbf{d}_t\) is some treatment history up to period \(t\) (this notation imposes “no anticipation” — potential outcomes do not depend on future treatments). Observed outcomes: \(Y_{it}(\mathbf{D}_{it})\)
A little more notation…
\(\mathcal{D}_t \subseteq \{0,1\}^t\) is the set of all possible treatment histories in period \(t\). As earlier, we will exclude units that are treated in the first period (I’ll briefly come back to this later)
\(\mathbf{0}_t\) denotes not participating in the treatment in any period up to period \(t\)
In this case, we’ll define groups by their treatment histories \(\mathbf{d}_t\). Thus, we can consider group-time average treatment effects defined by \[\begin{align*} ATT(\mathbf{d}_t, t) := \E[Y_{it}(\mathbf{d}_t) - Y_{it}(\mathbf{0}_t) | \mathbf{D}_{it} = \mathbf{d}_t] \end{align*}\]
In-and-Out Parallel Trends Assumption:
For all \(t=2,\ldots,\mathcal{T}\), and for all \(\mathbf{d}_t \in \mathcal{D}_t\), \[\begin{align*} \E[\Delta Y_{it}(\mathbf{0}_t) | \mathbf{D}_{it} = \mathbf{d}_t] = \E[\Delta Y_{it}(\mathbf{0}_t) | \mathbf{D}_{it} = \mathbf{0}_t] \end{align*}\]
Identification: In this setting, under the parallel trends assumption, we have that \[\begin{align*} ATT(\mathbf{d}_t, t) = \E[Y_{it} - Y_{i1} | \mathbf{D}_{it} = \mathbf{d}_t] - \E[Y_{it} - Y_{i1} | \mathbf{D}_{it} = \mathbf{0}_t] \end{align*}\]
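A minimal sketch of this estimand in Python: for a given treatment history, compare the long difference \(Y_{it} - Y_{i1}\) among units with exactly that history to the same long difference among units that are untreated through period \(t\). The data layout (a list of `(treatments, outcomes)` pairs) and the function name are assumptions made for illustration.

```python
from statistics import mean


def att_history(panel, history):
    """Estimate ATT(d_t, t) where t = len(history).
    `panel` is a list of (treatments, outcomes) pairs, one per unit,
    each a sequence indexed by period (period 1 is index 0)."""
    t = len(history)
    history = tuple(history)
    untreated = (0,) * t

    def mean_long_diff(target):
        diffs = [y[t - 1] - y[0] for d, y in panel if tuple(d[:t]) == target]
        if not diffs:
            raise ValueError(f"no units with treatment history {target}")
        return mean(diffs)

    return mean_long_diff(history) - mean_long_diff(untreated)


# Hypothetical toy panel with three periods
panel = [
    ((0, 0, 0), (1.0, 1.5, 2.0)),
    ((0, 0, 0), (2.0, 2.5, 3.0)),
    ((0, 1, 0), (1.0, 3.0, 2.5)),
    ((0, 1, 1), (1.5, 3.5, 5.0)),
    ((0, 1, 0), (2.0, 4.0, 3.5)),
]
print(att_history(panel, (0, 1)))     # treated in period 2
print(att_history(panel, (0, 1, 0)))  # treated in period 2, untreated in period 3
print(att_history(panel, (0, 1, 1)))  # treated in periods 2 and 3
```

Note that the group with history \((0,1,1)\) contains a single unit here, which previews the small-group problem that comes with defining groups by full treatment histories.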
This argument is straightforward and analogous to what we have done before. However…
There are a number of additional complications that arise here.
There are way more possible groups here than in the staggered treatment case (you can think of this as leading to a kind of curse of dimensionality)
\(\implies\) small groups \(\implies\) imprecise estimates and (possibly) invalid inferences
also makes it harder to report the results
The previous point provides an additional reason to try to aggregate the group-time average treatment effects. However, this is also not so straightforward.
This is an area of active research (e.g., de Chaisemartin and d’Haultfoeuille (2023) and Yanagi (2023))
Some ideas below…but the literature has not converged here yet
Probably the simplest approach is to just make groups on the basis of the first period when a unit experiences the treatment
We have (kind of) been doing this in our minimum wage application
Lots of papers (e.g., job displacement, hospitalization) have used this idea
Formally, it amounts to averaging over all subsequent treatment decisions (de Chaisemartin and d’Haultfoeuille (2023))
But there are other ideas too. Suppose that you were interested in the average treatment effect of experiencing some cumulative number of treatments over time (e.g., how many years someone was in a union).
Define \(\sigma_t(\mathbf{d}_t) := \displaystyle \sum_{s=1}^t d_s\)
We will target the average treatment effect of having experienced exactly \(\sigma\) treatments by period \(t\).
Towards this end, also define \(\mathcal{D}_t^\sigma = \{\mathbf{d}_t \in \mathcal{D}_t : \sigma_t(\mathbf{d}_t) = \sigma\}\) — this is the set of treatment histories that result in \(\sigma\) cumulative treatments in period \(t\). Then, consider
\[\begin{align*} ATT^{sum}(\sigma, t) = \sum_{\mathbf{d}_t \in \mathcal{D}_t^\sigma} ATT(\mathbf{d}_t, t) \P(\mathbf{D}_{it}=\mathbf{d}_t | \mathbf{D}_{it} \in \mathcal{D}_t^\sigma) \end{align*}\]
This is the average \(ATT(\mathbf{d}_t,t)\) across treatment regimes that lead to exactly \(\sigma\) treatments by period \(t\)
Similar to previous cases, \(ATT^{sum}(\sigma,t)\) is a weighted average of underlying 2x2 DID parameters
Averaging like this reduces the number of groups, and makes the estimation problem discussed above easier (the “effective” number of units is larger)
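The aggregation can be sketched as follows, assuming estimates of \(ATT(\mathbf{d}_t,t)\) and group sizes are already available for every treatment history of a common length \(t\). The dictionary layout and the numbers are hypothetical; the weights are the sample shares of each history among units with \(\sigma\) cumulative treatments.

```python
def att_sum(att_by_history, sigma):
    """Weighted average of ATT(d_t, t) over histories with sum(d_t) == sigma.
    `att_by_history` maps a history tuple -> (ATT estimate, number of units);
    all histories are assumed to have the same length t."""
    selected = [v for h, v in att_by_history.items() if sum(h) == sigma]
    total_units = sum(n for _, n in selected)
    if total_units == 0:
        raise ValueError(f"no units with {sigma} cumulative treatments")
    return sum(att * n / total_units for att, n in selected)


# Hypothetical estimates for t = 3
att_by_history = {
    (0, 1, 0): (0.5, 2),
    (0, 0, 1): (1.0, 6),
    (0, 1, 1): (2.5, 1),
}
print(att_sum(att_by_history, 1))  # averages over (0,1,0) and (0,0,1)
print(att_sum(att_by_history, 2))  # only (0,1,1) has two treatments
```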
Even though \(ATT^{sum}(\sigma,t)\) (possibly substantially) reduces the dimensionality of the underlying group-time average treatment effect parameters, we might want to reduce more.
This is tricky, though, because the composition of the effective groups changes over time (just because two groups have the same number of cumulative treatments in one period doesn’t mean that they have the same number in subsequent periods)
An alternative idea is to just report treatment effect parameters in the last period: \(ATT^{sum}(\sigma,\mathcal{T})\) as a function of \(\sigma\).
Unlike the staggered treatment adoption case, where \(ATT^{ES}(e)\) and \(ATT^O\) seem like good default parameters to report, it is not clear to me what (or if there is) a good default choice here.
Another caution is that (I presume) the issues about interpreting \(ATT\)-type parameters across different amounts of the treatment (here across \(\sigma\)) will introduce selection bias terms except under additional assumptions
Notice that above, we only invoked parallel trends with respect to untreated potential outcomes.
But it seems within the spirit of DID to assume parallel trends for staying at the same treatment over time
Then we can recover group-time average treatment effects for switchers relative to stayers
See de Chaisemartin et al. (2022) and de Chaisemartin and d’Haultfoeuille (2023) for approaches along these lines
This results in many more disaggregated treatment effect parameters
If we engage seriously with differing minimum wages across states, this is related to (but not exactly the same as) either of the two cases considered previously.
Multiple values of the treatment
Amount can change over time
But (in our sample) treatment does not ever turn back off
It is straightforward for us to get \(ATT(\mathbf{d}_t, t)\). This amounts to just estimating treatment effects for each treated state in our data in each time period.
It is less clear how to aggregate them. I will propose an idea, but you could certainly come up with something else.
There is variation in the amount of the minimum wage across states and time periods
per dollar \(\widehat{ATT}^O = -0.058\), \(\textrm{s.e.}=0.018\).
We’ve covered a number of different settings, but we certainly haven’t covered all of them
Using new, heterogeneity-robust approaches typically requires customized approaches in complicated settings (unlike TWFE regressions)
In my view, this is a feature of new approaches (rather than a weakness). As researchers, I think we should grapple with complexity of the problems that we are studying
What should you do?
My goal in this section is to provide at least a recipe for dealing with complicated treatment regimes
Step 1: Target disaggregated parameters
Step 2: If desired, choose an aggregated target parameter suitable to the application, and combine the underlying disaggregated parameters directly to recover this parameter
Some ideas:
Conditioning on some covariates could make strong parallel trends more plausible.
It’s possible to do some versions of DID with a continuous treatment without having access to a fully untreated group.
In this case, it is not possible to recover level effects like \(ATT(d|d)\).
However, notice that \[\begin{aligned}& \E[\Delta Y_t | D=d_h] - \E[\Delta Y_t | D=d_l] \\ &\hspace{50pt}= \Big(\E[\Delta Y_t | D=d_h] - \E[\Delta Y_t(0) | D=d_h]\Big) - \Big(\E[\Delta Y_t | D=d_l]-\E[\Delta Y_t(0) | D=d_l]\Big) \\ &\hspace{50pt}= ATT(d_h|d_h) - ATT(d_l|d_l)\end{aligned}\]
In words: comparing path of outcomes for those that experienced dose \(d_h\) to path of outcomes among those that experienced dose \(d_l\) (and not relying on having an untreated group) delivers the difference between their \(ATT\)’s.
Still face issues related to selection bias / strong parallel trends though
Strategies like binarizing the treatment can still work (though be careful!)
If you classify units as being treated or untreated, you can recover the \(ATT\) of being treated at all.
On the other hand, if you classify units as being “high” treated, “low” treated, or untreated — our arguments imply that selection bias terms can come up when comparing effects for “high” to “low”
That the expressions for \(ATE(d)\) and \(ATT(d|d)\) are exactly the same also means that we cannot use pre-treatment periods to try to distinguish between “standard” and “strong” parallel trends. In particular, the relevant information that we have for testing each one is the same
This is a simplified version of Acemoglu and Finkelstein (2008)
1983 Medicare reform that eliminated labor subsidies for hospitals
Medicare moved to the Prospective Payment System (PPS), which replaced “full cost reimbursement” with “partial cost reimbursement”. This eliminated reimbursements for labor while maintaining reimbursements for capital expenses
Rough idea: This changes relative factor prices which suggests hospitals may adjust by changing their input mix. Could also have implications for technology adoption, etc.
In the paper, we provide some theoretical arguments concerning properties of production functions that suggest that strong parallel trends holds.
Hospital reported data from the American Hospital Association, yearly from 1980-1986
Outcome is capital/labor ratio
proxy using the depreciation share of total operating expenses (avg. 4.5%)
our setup: collapse to two periods by taking average in pre-treatment periods and average in post-treatment periods
Dose is “exposure” to the policy
the number of Medicare patients in the period before the policy was implemented
roughly 15% of hospitals are untreated (have essentially no Medicare patients)
Parallel Trends Assumption for Stayers
For any treatment history \(\mathbf{d}_{t-1}\),
\[\begin{align*} \E[Y_t(d_{t-1},\mathbf{d}_{t-1}) - Y_{t-1}(\mathbf{d}_{t-1}) | \mathbf{D}_{it-1} = \mathbf{d}_{t-1}] = \E[Y_t(d_{t-1},\mathbf{d}_{t-1}) - Y_{t-1}(\mathbf{d}_{t-1}) | \mathbf{D}_{it} = (d_{t-1},\mathbf{d}_{t-1})] \end{align*}\]
In this case, you can recover the \(ATT\) for switchers: (here we are supposing that \(d_{t-1}=0\), but can make an analogous argument in the opposite case) \[\begin{align*} ATT^{switchers}(\mathbf{d}_{t-1},t) &= \E[Y_{it}(1,\mathbf{d}_{t-1}) - Y_{it}(0,\mathbf{d}_{t-1}) | \mathbf{D}_{it} = (1,\mathbf{d}_{t-1})] \\ &\overset{\textrm{PTA}}{=} \E[\Delta Y_{it} | \mathbf{D}_{it}=(1,\mathbf{d}_{t-1})] - \E[\Delta Y_{it} | \mathbf{D}_{it}=(0,\mathbf{d}_{t-1})] \end{align*}\] That is, you can recover \(ATT^{switchers}\) by comparing the paths of outcomes for switchers to the path of outcomes for stayers (exactly what you’d expect!)
Given this sort of assumption, there may be a huge number of \(ATT^{switchers}(\mathbf{d}_{t-1},t)\) in realistic applications.
You could use these to further understand treatment effect heterogeneity
You could also propose some way to aggregate them into a lower dimensional parameter
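A minimal sketch of the switchers-versus-stayers comparison, for the case where the unit was untreated in period \(t-1\): among units sharing the same history through \(t-1\), compare \(\Delta Y_{it}\) for those that switch into treatment in period \(t\) to those that stay untreated. Histories are stored chronologically here, so the current-period treatment is the last entry. The data layout and function name are hypothetical.

```python
from statistics import mean


def att_switchers(panel, prev_history):
    """Estimate ATT^switchers(d_{t-1}, t) for a history ending untreated.
    `panel` is a list of (treatments, outcomes) pairs, one per unit,
    indexed by period (period 1 is index 0); t = len(prev_history) + 1."""
    prev = tuple(prev_history)
    if prev[-1] != 0:
        raise ValueError("this sketch only covers switching into treatment")
    t = len(prev) + 1

    def mean_delta(current):
        target = prev + (current,)
        deltas = [y[t - 1] - y[t - 2] for d, y in panel if tuple(d[:t]) == target]
        if not deltas:
            raise ValueError(f"no units with treatment history {target}")
        return mean(deltas)

    # path of outcomes for switchers minus path of outcomes for stayers
    return mean_delta(1) - mean_delta(0)


# Hypothetical toy panel with three periods
panel = [
    ((0, 0, 0), (1.0, 1.5, 2.0)),
    ((0, 0, 0), (2.0, 2.5, 3.0)),
    ((0, 0, 1), (1.0, 1.5, 4.0)),
    ((0, 0, 1), (3.0, 3.5, 5.0)),
    ((0, 1, 1), (1.5, 3.5, 5.0)),
]
print(att_switchers(panel, (0, 0)))
```

The unit with history \((0,1,1)\) is not used in this comparison because its history through period 2 differs from that of the switchers.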
\(\mu(\mathbf{d}_t) := d_t\) — “how much” treated in this period
\(\varrho(\mathbf{d}_t) := \min\{s : d_s \in \mathbf{d}_t, d_s \neq 0\}\) — first period treated
Building block parameter: Define \(\mathcal{D}_t^{\mu,\varrho} = \{\mathbf{d}_t \in \mathcal{D}_t : \mu(\mathbf{d}_t) = \mu, \varrho(\mathbf{d}_t) = \varrho\}\) — this is the set of states that have a minimum wage equal to \(\mu\) in period \(t\) and first increased their minimum wage in period \(\varrho\). Then, consider
\[ATT^{per}(\mu, \varrho, t) = \sum_{\mathbf{d}_t \in \mathcal{D}_t^{\mu,\varrho}} \frac{ATT(\mathbf{d}_t, t)}{\mu(\mathbf{d}_t)} \P(\mathbf{D}_{it} = \mathbf{d}_t | \mathbf{D}_{it} \in \mathcal{D}_t^{\mu,\varrho})\]
This is the (per-dollar) \(ATT\) of having a minimum wage \(\mu\) in period \(t\) among states that (a) actually had a minimum wage of \(\mu\) in period \(t\) and (b) first increased their minimum wage in period \(\varrho\).
Next, define \(M_t= \{\mu : \mu\}\)
Further consider
\[ATT^{per}(\varrho, t) = \sum_{\mu \in M_t} ATT^{per}(\mu, \varrho, t) \P(\mu(\mathbf{d}_t))\]