gsynth: Generalized Synthetic Control Method

R package for Generalized Synthetic Control Method: for Causal Inference with Interactive Fixed Effect Models

Authors: Yiqing Xu (UCSD); Licheng Liu (Tsinghua)

Date: Dec 9, 2017

Package: gsynth

Version: 1.0.7 (Github version). 1.0.5 (CRAN version). Please report bugs!

Updates:

  1. Add “implied weights” of control units for each treated unit to the ouput of the main function (wgt.implied)

  2. Add an option to visualize missing data before running regressions to the main function (quick_missing = TRUE).

  3. Add a plot to visualize missing data and treatment status (type = "missing")

  4. Accomdate unbalanced panels


R source files can be found on Github. R code used in this demonstration can be downloaded from here.

Reference: Yiqing Xu, 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis, Volume 25, Issue 1, January 2017, pp. 57-76. Available at: http://dx.doi.org/10.1017/pan.2016.2.


Contents

  1. Installation
  2. Example 1: Simulated Data
  3. Example 2: Election-Day Registration on Voter Turnout
  4. Unbalanced Panels

Installation

You can install the gsyth package directly from CRAN by typing the following command in the R console:

install.packages('gsynth', type = 'source')

If you plan to use the lastest functionalities (e.g. dealing with unbalanced panels), please install the deveolopment version of the package from Github by typing the following commands:

install.packages('devtools', repos = 'http://cran.us.r-project.org') # if not already installed
devtools::install_github('xuyiqing/gsynth')

gsynth depends on the following packages, which will be installed AUTOMATICALLY when gsynth is being installed; you can also install them manually:

## for processing C++ code
require(Rcpp) 
## for plotting
require(ggplot2)  
require(GGally) 
## for parallel computing 
require(foreach)  
require(doParallel) 
require(abind) 

Example 1: Simulated Data

Two datasets simdata and turnout are shipped with the gsynth package. Load these two datasets:

library(gsynth)
data(gsynth)
ls()
## [1] "simdata" "turnout"

We start with the first example, a simulated dataset describled in the paper. There are 5 treated units, 45 control units, and 30 time periods. The treatment kicks at Period 21 for all treated units.

head(simdata)
##    id time         Y D        X1         X2 eff      error mu       alpha
## 1 101    1  6.210998 0 0.3776736 -0.1732470   0  0.2982276  5 -0.06052364
## 2 101    2  4.027106 0 1.7332009 -0.4945009   0  0.6365697  5 -0.06052364
## 3 101    3  8.877187 0 1.8580159  0.4984432   0 -0.4837806  5 -0.06052364
## 4 101    4 11.515346 0 1.3943369  1.1272713   0  0.5168620  5 -0.06052364
## 5 101    5  5.971526 0 2.3636963 -0.1535215   0  0.3689645  5 -0.06052364
## 6 101    6  8.237905 0 0.5370867  0.8774397   0 -0.2153805  5 -0.06052364
##           xi          F1          L1           F2         L2
## 1  1.1313372  0.25331851 -0.04303273  0.005764186 -0.8804667
## 2 -1.4606401 -0.02854676 -0.04303273  0.385280401 -0.8804667
## 3  0.7399475 -0.04287046 -0.04303273 -0.370660032 -0.8804667
## 4  1.9091036  1.36860228 -0.04303273  0.644376549 -0.8804667
## 5 -1.4438932 -0.22577099 -0.04303273 -0.220486562 -0.8804667
## 6  0.7017843  1.51647060 -0.04303273  0.331781964 -0.8804667

Before we conduct any statistical analysis, it is helpful to visusalize the data structure and spot missing values (if there are any). The following figure shows that: (1) there are 5 treated units and 45 control units; (2) the treated units start to be treated in period 21; and (3) there are no missing values, which is a rare case. Note that when the quick_missing option is switched on, no estimation will be performed.

gsynth(Y ~ D, data = simdata,  index = c("id","time"), quick_missing = TRUE) 

Estimation

We estimate the model only using information of the outcome variable \(Y\), the treatment indicator \(D\), and two observed covariates \(X_{1}\) and \(X_{2}\) (as well as the group indicators \(id\) and \(time\)).

Now we formally run the gsynth algorithm. The first variable on the right hand side of the formula is a binary treatement indicator. The rest of the right hand side variables serve as controls. The index option specifies the unit and time indicators.

The force option specifies the additive component(s) of the fixed effects included in the model (with four options: “none”, “unit”, “time”, and “two-way”). The default option is “unit” (including additive unit fixed effects). A cross-validation procedure is provided (when CV=TRUE) to select the number of unobserved factors within the interval of r=c(0,5). When cross-validation is switched off, the first item of r will be set as the number of factors.

When the uncertianty estimates are needed, set se=TRUE. When the number of treated units is small, an inferential method based a parametric bootstrap procedure is preferred (the alternative is a non-parametric bootstrap procedure by setting the inference option to nonparametric; this only works when the treatment group is relatively large, e.g. > 40). The number of bootstrap runs is set by nboots.

system.time(
    out <- gsynth(Y ~ D + X1 + X2, data = simdata, 
                  index = c("id","time"), force = "two-way",
                  CV = TRUE, r = c(0, 5), se = TRUE, 
                  inference = "parametric", nboots = 1000,
                  parallel = FALSE)
)
## Cross-validating ... 
##  r = 0; sigma2 = 1.74717; IC = 0.56867; MSPE = 2.37280
##  r = 1; sigma2 = 1.42977; IC = 0.76329; MSPE = 1.71737
##  r = 2; sigma2 = 0.93928; IC = 0.72756; MSPE = 1.14525*
##  r = 3; sigma2 = 0.88977; IC = 1.04715; MSPE = 1.15013
##  r = 4; sigma2 = 0.83864; IC = 1.35103; MSPE = 1.21356
##  r = 5; sigma2 = 0.79605; IC = 1.65129; MSPE = 1.23830
## 
##  r* = 2
## 
## 
Simulating errors .............
Bootstrapping ...
## ..........
##    user  system elapsed 
##  10.662   0.074  11.244

Because the core function is written in C++, the algorithm runs relatively fast. The entire procedure (of 1,000 bootstrap runs) takes about 15 seconds on a 2013 model MacBook Pro laptop.

The algorithm prints out the results automatically. sigma2 stands for the estimated variance of the error term; IC represents the Bayesian Information Criterion; and MPSE is the Mean Squared Prediction Error. The cross-validation procedure selects an \(r\) that minimizes the MSPE.

Users can use the print function to visualize the result or retrieve the numbers by directly accessing the gsynth object. est.att reports the average treatment effect on the treatd (ATT) by period; est.avg shows the ATT of all periods; and est.beta reports the coefficients of the time-varying covariates. Treatment effect estimates from each bootstrap run is stored in eff.boot, an array whose dimension = (#time periods * #treated * #bootstrap runs).

print(out)
out$est.att
out$est.avg
out$est.beta

Parallel computing may speed up the boostrap procedue significantly. When paralllel=TRUE (default) and cores options are omitted, the algorithm will detect the number of available cores automatrically (warning: it may consume most of the computational power of your computor if all cores are being used.)

system.time(
    out <- gsynth(Y ~ D + X1 + X2, data = simdata, 
                  index = c("id","time"), force = "two-way",
                  CV = TRUE, r = c(0, 5), se = TRUE, 
                  inference = "parametric", nboots = 1000,
                  parallel = TRUE, cores = 4)
)
## Parallel computing ...
## Cross-validating ... 
##  r = 0; sigma2 = 1.74717; IC = 0.56867; MSPE = 2.37280
##  r = 1; sigma2 = 1.42977; IC = 0.76329; MSPE = 1.71737
##  r = 2; sigma2 = 0.93928; IC = 0.72756; MSPE = 1.14525*
##  r = 3; sigma2 = 0.88977; IC = 1.04715; MSPE = 1.15013
##  r = 4; sigma2 = 0.83864; IC = 1.35103; MSPE = 1.21356
##  r = 5; sigma2 = 0.79605; IC = 1.65129; MSPE = 1.23830
## 
##  r* = 2
## 
## 
Simulating errors ...
Bootstrapping ...
## 
##    user  system elapsed 
##   1.184   0.075   7.339

Plotting

By default, the plot function produce a gap plot – as if we type plot(out, type = "gap") – which visualize the estimated ATT by period. For your reference, the true effect in simdata go from 1 to 10 (plus some white noise) from period 21 to 30. You can save the graph object by directing print to an object name.

plot(out) # by default

Users can adjust xlim and ylim, and supply the plot title and titles for the two axes.

plot(out, type = "gap", ylim = c(-3,12), xlab = "Period", main = "My GSynth Plot")