R package for Generalized Synthetic Control Method: for Causal Inference with Interactive Fixed Effect Models

R source files can be found on Github. R code used in this demonstration can be downloaded from here.

Main Reference: Xu, Yiqing, 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis, Volume 25, Issue 1, January 2017, pp. 57-76. Available at: http://dx.doi.org/10.1017/pan.2016.2.


Authors: Yiqing Xu (Stanford); Licheng Liu (MIT)

Date: March 6, 2020

Package: gsynth

Version: 1.1.6 (Github version). 1.0.9 (CRAN version). Please report bugs!

Updates in v.1.1.4:

  1. Import function felm from lfe to fit two-way fixed effects model as the starting value for estimation of interactive fixed effects model with unbalanced panel data.
  2. Add cluster bootstrap option for uncertainty estimates.
  3. Add jackknife uncertainty estimates.
  4. Add a new function cumuEff for calculation of sub-group and cumulative treatment effects.

Updates in v.1.0.9:

  1. Function panelView() is removed from gsynth and becomes an independent package panelView. For more information, click here.
  2. Implement the matrix completion method.
  3. Fix bugs.
  4. Change the color scheme.

Updates in v.1.0.8:

  1. Add a function panelView() to visualize raw data and data structure before estimation.
  2. Fix bugs.

Updates in v.1.0.7:

  1. Add “implied weights” of control units for each treated unit to the output of the main function (wgt.implied).
  2. Add a plot to visualize missing data and treatment status (type = "missing").
  3. Accommodate unbalanced panels.

Contents

  1. Installation

  2. Example 1: Simulated Data

  3. Example 2: Election-Day Registration on Voter Turnout

  4. Unbalanced Panels


Installation

You can install gsynth directly from CRAN by typing the following command in the R console:

install.packages('gsynth', type = 'source')

If you plan to use the lastest functionalities, please install the development version of the package from Github by typing the following commands:

install.packages('devtools', repos = 'http://cran.us.r-project.org') # if not already installed
devtools::install_github('xuyiqing/gsynth')

gsynth depends on the following packages, which will be installed AUTOMATICALLY when gsynth is being installed; you can also install them manually:

## for processing C++ code
require(Rcpp) 
## for plotting
require(ggplot2)  
require(GGally) 
## for parallel computing 
require(foreach)  
require(doParallel) 
require(abind) 
require(lfe)

Notes on installation failures

  1. For Rcpp, RcppArmadillo and MacOS “-lgfortran” and “-lquadmath” error, click here for details.
  2. Installation failure related to OpenMP on MacOS, click here for a solution.
  3. To fix these issues, try installing gfortran 6.1 from here.

Example 1: Simulated Data

Two datasets simdata and turnout are shipped with the gsynth package. Load these two datasets:

library(gsynth)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
data(gsynth)
ls()
## [1] "simdata" "turnout"

We start with the first example, a simulated dataset described in the paper. There are 5 treated units, 45 control units, and 30 time periods. The treatment kicks at Period 21 for all treated units.

head(simdata)
##    id time         Y D        X1         X2 eff      error mu       alpha
## 1 101    1  6.210998 0 0.3776736 -0.1732470   0  0.2982276  5 -0.06052364
## 2 101    2  4.027106 0 1.7332009 -0.4945009   0  0.6365697  5 -0.06052364
## 3 101    3  8.877187 0 1.8580159  0.4984432   0 -0.4837806  5 -0.06052364
## 4 101    4 11.515346 0 1.3943369  1.1272713   0  0.5168620  5 -0.06052364
## 5 101    5  5.971526 0 2.3636963 -0.1535215   0  0.3689645  5 -0.06052364
## 6 101    6  8.237905 0 0.5370867  0.8774397   0 -0.2153805  5 -0.06052364
##           xi          F1          L1           F2         L2
## 1  1.1313372  0.25331851 -0.04303273  0.005764186 -0.8804667
## 2 -1.4606401 -0.02854676 -0.04303273  0.385280401 -0.8804667
## 3  0.7399475 -0.04287046 -0.04303273 -0.370660032 -0.8804667
## 4  1.9091036  1.36860228 -0.04303273  0.644376549 -0.8804667
## 5 -1.4438932 -0.22577099 -0.04303273 -0.220486562 -0.8804667
## 6  0.7017843  1.51647060 -0.04303273  0.331781964 -0.8804667

Basic panel data visualization

Before we conduct any statistical analysis, it is helpful to visualize the data structure and/or spot missing values (if there are any). We can easily do so with the help of panelView. The following figure shows that: (1) there are 5 treated units and 45 control units; (2) the treated units start to be treated in period 21; and (3) there are no missing values, which is a rare case.

## devtools::install_github('xuyiqing/panelView')   # if not already installed
library(panelView)
panelView(Y ~ D, data = simdata,  index = c("id","time"), pre.post = TRUE) 

The following line of code visualizes the outcome variable; different colors correspond to different treatment status.

panelView(Y ~ D, data = simdata,  index = c("id","time"), type = "outcome") 

Estimation

We estimate the model only using information of the outcome variable \(Y\), the treatment indicator \(D\), and two observed covariates \(X_{1}\) and \(X_{2}\) (as well as the group indicators \(id\) and \(time\)).

Now we formally run the gsynth algorithm. The first variable on the right-hand-side of the formula is a binary treatment indicator. The rest of the right-hand-side variables serve as controls. The index option specifies the unit and time indicators.

The force option (“none”, “unit”, “time”, and “two-way”) specifies the additive component(s) of the fixed effects included in the model. The default option is “unit” (including additive unit fixed effects). A cross-validation procedure is provided (when CV = TRUE) to select the number of unobserved factors within the interval of r=c(0,5). When cross-validation is switched off, the first element in r will be set as the number of factors.

When the uncertainty estimates are needed, set se = TRUE. When the number of treated units is small, a parametric bootstrap procedure is preferred. Alternatively, one can use a non-parametric bootstrap procedure by setting inference = "nonparametric"; it only works well when the treatment group is relatively large, e.g. \(N_{tr}> 40\)). The number of bootstrap runs is set by nboots.

When dealing with clustered data, for example, panel data with regionally clustered units, one can use the cluster bootstrap procedure by specifying name of the cluster variable in option cl. For example, cl = "state" for city-level data.

system.time(
    out <- gsynth(Y ~ D + X1 + X2, data = simdata, index = c("id","time"), force = "two-way", CV = TRUE, r = c(0, 5), se = TRUE, inference = "parametric", nboots = 1000, parallel = FALSE)
)
## Cross-validating ... 
##  r = 0; sigma2 = 1.84865; IC = 1.02023; PC = 1.74458; MSPE = 2.37280
##  r = 1; sigma2 = 1.51541; IC = 1.20588; PC = 1.99818; MSPE = 1.71743
##  r = 2; sigma2 = 0.99737; IC = 1.16130; PC = 1.69046; MSPE = 1.14540*
##  r = 3; sigma2 = 0.94664; IC = 1.47216; PC = 1.96215; MSPE = 1.15032
##  r = 4; sigma2 = 0.89411; IC = 1.76745; PC = 2.19241; MSPE = 1.21397
##  r = 5; sigma2 = 0.85060; IC = 2.05928; PC = 2.40964; MSPE = 1.23876
## 
##  r* = 2
## 
## 
Simulating errors .............
Bootstrapping ...
## ..........
##    user  system elapsed 
##   5.799   0.065   5.926

Because the core function of gsynth is written in C++, the algorithm runs relatively fast. The entire procedure (including cross-validation and 1,000 bootstrap runs) takes about 6 seconds on an iMac (2016 model).

The algorithm prints out the results automatically. sigma2 stands for the estimated variance of the error term; IC represents the Bayesian Information Criterion; and MPSE is the Mean Squared Prediction Error. The cross-validation procedure selects an \(r\) that minimizes the MSPE.

Users can use the print function to visualize the result or retrieve the numbers by directly accessing the gsynth object. est.att reports the average treatment effect on the treated (ATT) by period; est.avg shows the ATT averaged over all periods; and est.beta reports the coefficients of the time-varying covariates. Treatment effect estimates from each bootstrap run is stored in eff.boot, an array whose dimension = (#time periods * #treated * #bootstrap runs).

print(out)
out$est.att
out$est.avg
out$est.beta

Parallel computing will speed up the bootstrap procedure significantly. When parallel = TRUE (default) and cores options are omitted, the algorithm will detect the number of available cores on your computer automatrically. (Warning: it may consume most of your computer’s computational power if all cores are being used.)

system.time(
out <- gsynth(Y ~ D + X1 + X2, data = simdata, index = c("id","time"), 
    force = "two-way", CV = TRUE, r = c(0, 5), se = TRUE, 
    inference = "parametric", nboots = 1000, parallel = TRUE, cores = 4)
)
## Parallel computing ...
## Cross-validating ... 
##  r = 0; sigma2 = 1.84865; IC = 1.02023; PC = 1.74458; MSPE = 2.37280
##  r = 1; sigma2 = 1.51541; IC = 1.20588; PC = 1.99818; MSPE = 1.71743
##  r = 2; sigma2 = 0.99737; IC = 1.16130; PC = 1.69046; MSPE = 1.14540*
##  r = 3; sigma2 = 0.94664; IC = 1.47216; PC = 1.96215; MSPE = 1.15032
##  r = 4; sigma2 = 0.89411; IC = 1.76745; PC = 2.19241; MSPE = 1.21397
##  r = 5; sigma2 = 0.85060; IC = 2.05928; PC = 2.40964; MSPE = 1.23876
## 
##  r* = 2
## 
## 
Simulating errors ...
Bootstrapping ...
## 
##    user  system elapsed 
##   0.923   0.082   6.170

gsynth also incoporates jackknife method for uncertainty estimates.

out2 <- gsynth(Y ~ D + X1 + X2, data = simdata, index = c("id","time"), 
    force = "two-way", CV = FALSE, r = c(2, 5), se = TRUE, 
    inference = "jackknife", parallel = TRUE, cores = 4)
## Parallel computing ...
## 
 Jackknifing ...
## 

Sometimes researchers may be interested in cumulative treatment effects like cumulative abnormal returns in event study. In this version, we add a simple function cumuEff to calculate cumulative treatment effects. One can specify a period of interest in option period. If left blank, cumulative treatment effects for all post-treatment period will be calculated.

cumu1 <- cumuEff(out, cumu = TRUE, id = NULL, period = c(0,5))
cumu1$est.catt
##           CATT      S.E.   CI.lower   CI.upper p.value
## 0 -0.003308833 0.3338661 -0.5744544  0.6978486   0.980
## 1  1.232653177 0.8616244 -0.5132399  2.9765032   0.166
## 2  2.862917489 1.0840917  0.5496877  4.7759361   0.014
## 3  5.575095192 1.2300781  3.1136086  8.0318369   0.000
## 4  9.041852883 1.5413470  6.0597479 12.0194565   0.000
## 5 14.781985192 1.8350697 11.2034118 18.2506808   0.000

One can also calculate average treatment effect for a sub-group by specifying unit names in option id. By specifying cumu = FALSE, average treatment effects (rather than cumulative effects) at each period will be returned. Note that in this case, parametric bootstrap procedure is needed for uncertainty estimates.

cumu2 <- cumuEff(out, cumu = FALSE, id = c(101, 102, 103), period = c(0,5))
cumu2$est.catt
##         CATT      S.E.   CI.lower  CI.upper p.value
## 0 -0.2277091 0.4121386 -0.9653334 0.6333357   0.558
## 1  2.2491804 0.8366391  0.5931973 3.7783124   0.002
## 2  2.3930760 0.6831834  1.0146396 3.7796219   0.000
## 3  2.3067796 0.7224259  0.9148753 3.7519866   0.000
## 4  2.5812540 0.8282216  0.9021067 4.0855578   0.002
## 5  4.7445071 0.6783848  3.3484131 5.9761241   0.000

Plotting the results

By default, the plot function produce a “gap” plot – as if we type plot(out, type = "gap") – which visualize the estimated ATT by period. For your reference, the true effects in simdata go from 1 to 10 (plus some white noise) from period 21 to 30. You can save the graph, a ggplot2 object, by directing print to an object name.

plot(out) # by default

A black-and-white theme is available, too:

plot(out, theme.bw = TRUE) 

Users can adjust xlim and ylim, and supply the plot title and titles for the two axes.

plot(out, type = "gap", ylim = c(-3,12), xlab = "Period", main = "My GSynth Plot")

The other four type options include "raw", which plots the raw data (outcome) variable as panelView() does; "counterfactual" (or "ct" for short), which plots the estimated counterfactual(s); and "factors" and "loadings", which plot estimated factors and loadings, respectively. The next figure is a raw plot with a black-and-white theme. The treated (pre- and post-treatment) and control units are painted with different colors.

plot(out, type = "raw", theme.bw = TRUE)

We can set the axes limits, remove the legend and title, for example, by typing (figure not shown):

plot(out,type = "raw", legendOff = TRUE, ylim=c(-10,40), main="")

We use the following line of command to plot the estimated counterfactual. raw = "none" (the default option) means that we do not include the raw data in this figure:

plot(out, type = "counterfactual", raw = "none", main="")

One can use shade.post to control the shading in the post-treatment period

plot(out, type = "ct", raw = "none", main = "", shade.post = FALSE)

We can also add two 5 to 95% quantile bands of the treated and control outcomes as references to make sure the estimated counterfactuals are not results of severe extrapolations.

plot(out, type = "counterfactual", raw = "band", xlab = "Time", ylim = c(-5,35), theme.bw = TRUE)

… or simply plot all the raw data (for the outcome variable).

plot(out, type = "counterfactual", raw = "all")

One can also visualize the estimated counterfactual for each treated unit to evaluate the quality of model fit.

plot(out, type = "counterfactual", id = 102)

We can add the reference band, a 5 to 95% quantile band the control outcomes.

plot(out, type = "counterfactual", id = 104, raw = "band", ylim = c(-10, 30))

… or replaced the band with the raw data of the control outcomes.

plot(out, type = "counterfactual", id = 105, raw = "all", legendOff = TRUE)

The next two figures plot the estimated factors and factor loadings, respectively.

plot(out, type = "factors", xlab = "Time")

plot(out, type = "loadings")

EM Method

We also implement an EM algorithm that takes advantage of the treatment group information in the pre-treatment period (Gobillon and Magnac 2016). The estimation takes more time, but the results are very similar to that from the original method – the coefficients will be slightly more precisely estimated.

system.time(
out <- gsynth(Y ~ D + X1 + X2, data = simdata, EM = TRUE, index = c("id","time"), inference = "parametric", se = TRUE, nboots = 500, r = c(0, 5), CV = TRUE, force = "two-way", parallel = TRUE, cores = 4)
)
## Parallel computing ...
## Cross-validating ... 
##  r = 0; sigma2 = 1.87052; IC = 1.03286; PC = 1.76603; MSPE = 2.38012
##  r = 1; sigma2 = 1.50792; IC = 1.20393; PC = 1.90937; MSPE = 1.75038
##  r = 2; sigma2 = 0.99174; IC = 1.16142; PC = 1.57657; MSPE = 1.17430*
##  r = 3; sigma2 = 0.93513; IC = 1.46912; PC = 1.79035; MSPE = 1.24820
##  r = 4; sigma2 = 0.88552; IC = 1.77104; PC = 1.98424; MSPE = 1.43445
##  r = 5; sigma2 = 0.84141; IC = 2.06634; PC = 2.16106; MSPE = 1.61788
## 
##  r* = 2
## 
## 
Bootstrapping ...
## 
##    user  system elapsed 
##   4.250   0.111  39.107
plot(out, main = "Estimated ATT (EM)")

Matrix Completion Method

In v1.0.9 or higher versions, we implement the matrix completion method proposed by Athey et al. (2017) by setting estimator = "mc" (the default is "ife", which represents the interactive fixed effects model). It uses a different regularization scheme and takes advantage of the treatment group information in the pre-treatment period. The option lambda controls the tuning parameter. If it is left blank, a sequence of candidate lambdas will be generated and tested sequentially. Users can set the length of candidate list by setting, for example, nlambda = 10. The option lambda can also receive a user-specified sequence as candidate lambdas. A built in cross-validation procedure will select the optimal hyper parameter when a single lambda is given. Users can specify the number of folds in cross-validation using the option k (e.g. k = 5).

system.time(
out <- gsynth(Y ~ D + X1 + X2, data = simdata, estimator = "mc", index = c("id","time"), se = TRUE, nboots = 500, r = c(0, 5), CV = TRUE, force = "two-way", parallel = TRUE, cores = 4)
)
## Parallel computing ...
## Cross-validating ... 
##  lambda.norm = 1.00000; MSPE = 1.81108
##  lambda.norm = 0.42170; MSPE = 1.31955*
##  lambda.norm = 0.17783; MSPE = 1.60657
##  lambda.norm = 0.07499; MSPE = 1.77476
##  lambda.norm = 0.03162; MSPE = 1.77978
##  lambda.norm = 0.01334; MSPE = 1.79948
##  lambda.norm = 0.00562; MSPE = 1.80589
##  lambda.norm = 0.00237; MSPE = 1.80884
##  lambda.norm = 0.00100; MSPE = 1.81013
##  lambda.norm = 0.00000; MSPE = 1.81108
## 
##  lambda.norm* = 0.4216965
## 
## 
Bootstrapping ...
## 
##    user  system elapsed 
##   0.917   0.081  22.383
plot(out, main = "Estimated ATT (MC)")


Example 2: Election-Day Registration on Voter Turnout

The second example investigates the effect of Election-Day Registration (EDR) reforms on voter turnout in the United States. Note that the treatment kicks in at different times.

data(gsynth)
names(turnout)
## [1] "abb"            "year"           "turnout"        "policy_edr"    
## [5] "policy_mail_in" "policy_motor"

Data Structure

First, we take a look at the data structure. The following figure shows that (1) we have a balanced panel with 9 treated units and (2) the treatment starts at different time periods.

panelView(turnout ~ policy_edr, data = turnout,  index = c("abb","year"), pre.post = TRUE, by.timing = TRUE) 

panelView can also visualize the outcome variable by group (change in treatment status)

panelView(turnout ~ policy_edr, data = turnout, index = c("abb","year"), type = "outcome", main = "EDR Reform and Turnout", by.group = TRUE)

Estimation w/o Factors

When no factor is assumed, the estimates are close to what we obtain from difference-in-differences:

out0 <- gsynth(turnout ~ policy_edr + policy_mail_in + policy_motor, data = turnout, index = c("abb","year"), se = TRUE, inference = "parametric", r = 0, CV = FALSE, force = "two-way", nboots = 1000, seed = 02139)
## Parallel computing ...
## 
Simulating errors ...
Bootstrapping ...
## 
plot(out0, type = "gap", xlim = c(-10, 5))

Estimation w/ Factors

Now we allow the algorithm to find the “correct” number of factors that predicts the pre-treatment data the best:

out <- gsynth(turnout ~ policy_edr + policy_mail_in + policy_motor, data = turnout,  index = c("abb","year"), se = TRUE, inference = "parametric", r = c(0, 5), CV = TRUE, force = "two-way", nboots = 1000, seed = 02139)
## Parallel computing ...
## Cross-validating ... 
##  r = 0; sigma2 = 77.99366; IC = 4.82744; PC = 72.60594; MSPE = 22.13889
##  r = 1; sigma2 = 13.65239; IC = 3.52566; PC = 21.67581; MSPE = 12.03686
##  r = 2; sigma2 = 8.56312; IC = 3.48518; PC = 19.23841; MSPE = 10.31254*
##  r = 3; sigma2 = 6.40268; IC = 3.60547; PC = 18.61783; MSPE = 11.48390
##  r = 4; sigma2 = 4.74273; IC = 3.70145; PC = 16.93707; MSPE = 16.28613
##  r = 5; sigma2 = 4.02488; IC = 3.91847; PC = 17.05226; MSPE = 15.78683
## 
##  r* = 2
## 
## 
Simulating errors ...
Bootstrapping ...
## 

Implied weights

out$wgt.implied (\(N_{co}\times N_{tr}\)) stores the implied weights of all control units for each treated unit. Different from the synthetic control method, the weights can be both positive and negative. Below we show the control unit weights for Wisconsin.

dim(out$wgt.implied)
## [1] 38  9
sort(out$wgt.implied[,8])
##           AL           VA           MS           KY           SC 
## -0.420446734 -0.359404870 -0.358899140 -0.323621805 -0.275363471 
##           NC           GA           MD           FL           TN 
## -0.222140041 -0.214909889 -0.197804828 -0.185884567 -0.162290117 
##           AR           NM           LA           TX           AZ 
## -0.151519318 -0.148094608 -0.127000746 -0.064591440 -0.057043363 
##           MO           DE           OH           OR           VT 
## -0.038932843 -0.001453018  0.010929464  0.029620626  0.049794411 
##           NV           CO           MI           IN           KS 
##  0.073077120  0.079864605  0.090892631  0.091876103  0.093457125 
##           NE           IL           OK           WA           WV 
##  0.104411283  0.133990028  0.158671615  0.159121640  0.181418421 
##           SD           PA           NJ           UT           CA 
##  0.182877791  0.221808796  0.226279566  0.250193486  0.278783695 
##           RI           NY           MA 
##  0.280519519  0.295315162  0.316497707

Plotting

The missing data plot can also be produced after the estimation is performed. In addition, plot allows users to changes axes labels and the graph title (figure not shown).

plot(out, type = "missing", xlab = "Year", ylab = "State", main = "Treatment Status")

The following line of code produces the “gap” plot:

plot(out, type = "gap", xlim = c(-10, 5), ylim=c(-3,10))

If we want to know the treatment effect on a specific state, such as Wisconsin, we can specify the id option:

plot(out, type = "gap", id = "WI", main = "Wisconsin")

Plotting the raw turnout data:

plot(out, type = "raw", xlab = "Year", ylab = "Turnout")

Plotting the estimated average counterfactual. Since EDR reforms kick in at different times, the algorithm realign the x-axis based on the timing of the treatment. raw = all shows the realigned outcomes of the treated units.

plot(out, type = "counterfactual", raw = "all")

We can plot estimated counterfactuals for each treated unit:

plot(out, type="counterfactual", id = "CT")

plot(out, type = "counterfactual", id = "WY", raw = "all", legendOff = TRUE)

plot(out, type = "counterfactual", id = "WI", raw = "none", shade.post = FALSE, ylim = c(0,100),
     legend.labs = c("Wisconsin Actual","Wisconsin Counterfactural"))

Estimated factors and factor loadings::

plot(out, type = "factors", xlab="Year")

plot(out, type = "loadings")


Unbalanced Panels

Starting from v1.0.7, gsynth can accommodate unbalanced panels. To illustrate how it works, we randomly remove 50 observations as well as the first 15 observations of Wyoming from the turnout dataset and then re-estimate the model:

set.seed(123456)
turnout.ub <- turnout[-c(which(turnout$abb=="WY")[1:15], sample(1:nrow(turnout),50,replace=FALSE)),]

Again, before running any regressions, we first plot the data structure and visualize missing values. In the following plot, white cells represent missing values.

panelView(turnout ~ policy_edr + policy_mail_in + policy_motor, data = turnout.ub,  index = c("abb","year"), pre.post = TRUE) 

Estimation

out <- gsynth(turnout ~ policy_edr + policy_mail_in + policy_motor, data = turnout.ub,  index = c("abb","year"), se = TRUE, inference = "parametric", r = c(0, 5), CV = TRUE, force = "two-way", parallel = TRUE, min.T0 = 8, nboots = 1000, seed = 02139)
## Some treated units has too few pre-treatment periods. 
## They will be automatically removed.
## Parallel computing ...
## Cross-validating ... 
##  r = 0; sigma2 = 78.53850; IC = 4.85229; PC = 72.87077; MSPE = 19.00413
##  r = 1; sigma2 = 13.71115; IC = 3.56457; PC = 21.68716; MSPE = 11.91054
##  r = 2; sigma2 = 8.64454; IC = 3.54545; PC = 19.34554; MSPE = 7.61167
##  r = 3; sigma2 = 6.52720; IC = 3.69114; PC = 18.90508; MSPE = 7.56166*
##  r = 4; sigma2 = 4.83105; IC = 3.80135; PC = 17.18456; MSPE = 15.96018
##  r = 5; sigma2 = 4.16136; IC = 4.04774; PC = 17.56161; MSPE = 15.64924
## 
##  r* = 3
## 
## 
Simulating errors ...
Bootstrapping ...
## 
list of removed units: WY

Plotting

The missing plot visualizes the data being used in the regression (similar to that from turning on quick_missing in the main function). It is different from the plot generated from panelView in that observations for Wyoming are now represented by gray cells with red dots to indicate that they are removed from the sample because of insufficient number of pre-treatment periods. The missing data matrix can also be retrieved from out$obs.missing, in which 0, 1, 2, 3 4 represent missing values, control observations, treated (pre-treatment) observations, treated (post-treatment) observations, and treated observations removed from the sample, respectively.

plot(out, type = "missing")

Like in panelView, users can plot part of all units by specifying id and adjust labels on x-axis when the the magnitude of data is too large.

plot(out, type = "missing", xlab = "Year", ylab = "State", main = "Treatment Status", id = out$id.tr, xlim = c(1920,2012),  axis.adjust=TRUE)

We re-produce the “gap” plot with the unbalanced panel:

plot(out, type = "gap")

Matrix Completion Method

Finally, we re-estimate the model using the matrix completion method:

out.mc <- gsynth(turnout ~ policy_edr + policy_mail_in + policy_motor, min.T0 = 8, data = turnout.ub,  index = c("abb","year"), estimator = "mc", se = TRUE, nboots = 1000, seed = 02139)
## Some treated units has too few pre-treatment periods. 
## They will be automatically removed.
## Parallel computing ...
## Cross-validating ... 
##  lambda.norm = 1.00000; MSPE = 111.41839
##  lambda.norm = 0.42170; MSPE = 47.92701
##  lambda.norm = 0.17783; MSPE = 22.54666
##  lambda.norm = 0.07499; MSPE = 14.65412*
##  lambda.norm = 0.03162; MSPE = 15.40772
##  lambda.norm = 0.01334; MSPE = 23.22015
##  lambda.norm = 0.00562; MSPE = 53.92629
##  lambda.norm = 0.00237; MSPE = 107.19545
##  lambda.norm = 0.00100; MSPE = 109.61991
##  lambda.norm = 0.00000; MSPE = 111.41839
## 
##  lambda.norm* = 0.07498942
## 
## 
Bootstrapping ...
## 
list of removed units: WY
plot(out.mc, main = "Estimated ATT (MC)")

Important Notes

  1. Running gsynth with unbalanced panels will take significantly more time than with balanced panels (the magnitude is often 100:1 or even higher). This is because the EM algorithm that fills in missing values (also written in C++) runs many more loops. This means users can often significantly reduce run-time by removing some units or time periods that have many missing values. Knowing the data structure before running any regressions always helps!

  2. Similarly, adding covariates will significantly slow down the algorithm because the IFE/MC model will take more time to converge.

  3. Set min.T0 to a positive number helps. The algorithm will automatically discard treated units that have too few pre-treatment periods. A bigger \(T_{0}\) reduces bias in the causal estimates and the chances of severe extrapolation. If users run cross-validation to select the number of factors, min.T0 needs to be equal to or greater than (r.max+2).


Addtional References:

  1. Gobillon L. and Magnac T., 2016. “Regional Policy Evaluation: Interactive Fixed Effects and Synthetic Controls.” The Review of Economics and Statistics, July 2016, Vol. 98, No. 3, pp. 535–551.

  2. Athey S, Bayati M, Doudchenko N, et al. Matrix completion methods for causal panel data models[J]. arXiv preprint arXiv:1710.10251, 2017.

\(\square\)