gsynth: Generalized Synthetic Control Method

R package for Generalized Synthetic Control Method: for Causal Inference with Interactive Fixed Effect Models

Authors: Yiqing Xu (UCSD); Licheng Liu (Tsinghua)

Date: Sep 30, 2017

Package: gsynth

Version: 1.0.5. Please report bugs!

Updates: (1) Accomdates unbalanced panels (2) Add a plot for visualizing missing data and treatment status (type = "missing")


R source files can be found on Github. R code used in this demonstration can be downloaded from here.

Reference: Yiqing Xu, 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis, Available at: http://dx.doi.org/10.1017/pan.2016.2.


Contents

  1. Installation
  2. First Example: Simulated Data
  3. Second Example: Election-Day Registration on Voter Turnout

Installation

You can install the gsyth package directly from CRAN by typing the following command in the R console:

install.packages('gsynth', type = 'source')

You can also install the deveolopment version of the package from Github by typing the following commands:

install.packages('devtools', repos = 'http://cran.us.r-project.org') # if not already installed
devtools::install_github('xuyiqing/gsynth')

gsynth depends on the following packages, which will be installed AUTOMATICALLY when gsynth is being installed; you can also install them manually:

## for processing C++ code
require(Rcpp) 
## for plotting
require(ggplot2)  
require(GGally) 
## for parallel computing 
require(foreach)  
require(doParallel) 
require(abind) 

First Example (Simulated Data)

Two datasets simdata and turnout are shipped with the gsynth package. Load these two datasets:

library(gsynth)
data(gsynth)
ls()
## [1] "simdata" "turnout"

We start with the first example, a simulated dataset describled in the paper. There are 5 treated units, 45 control units, and 30 time periods. The treatment kicks at Period 21 for all treated units.

head(simdata)
##    id time         Y D        X1         X2 eff      error mu       alpha
## 1 101    1  6.210998 0 0.3776736 -0.1732470   0  0.2982276  5 -0.06052364
## 2 101    2  4.027106 0 1.7332009 -0.4945009   0  0.6365697  5 -0.06052364
## 3 101    3  8.877187 0 1.8580159  0.4984432   0 -0.4837806  5 -0.06052364
## 4 101    4 11.515346 0 1.3943369  1.1272713   0  0.5168620  5 -0.06052364
## 5 101    5  5.971526 0 2.3636963 -0.1535215   0  0.3689645  5 -0.06052364
## 6 101    6  8.237905 0 0.5370867  0.8774397   0 -0.2153805  5 -0.06052364
##           xi          F1          L1           F2         L2
## 1  1.1313372  0.25331851 -0.04303273  0.005764186 -0.8804667
## 2 -1.4606401 -0.02854676 -0.04303273  0.385280401 -0.8804667
## 3  0.7399475 -0.04287046 -0.04303273 -0.370660032 -0.8804667
## 4  1.9091036  1.36860228 -0.04303273  0.644376549 -0.8804667
## 5 -1.4438932 -0.22577099 -0.04303273 -0.220486562 -0.8804667
## 6  0.7017843  1.51647060 -0.04303273  0.331781964 -0.8804667

Estimation

We estimate the model only using information of the outcome variable \(Y\), the treatment indicator \(D\), and two observed covariates \(X_{1}\) and \(X_{2}\) (as well as the group indicators \(id\) and \(time\)).

We run the gsynth algorithm. The first variable on the right hand side of the formula is a binary treatement indicator. The rest of the right hand side variables serve as controls. The index option specifies the unit and time indicators.

The force option specifies the additive component(s) of the fixed effects included in the model (with four options: “none”, “unit”, “time”, and “two-way”). The default option is “unit” (including additive unit fixed effects). A cross-validation procedure is provided (when CV=TRUE) to select the number of unobserved factors within the interval of r=c(0,5). When cross-validation is switched off, the first item of r will be set as the number of factors.

When the uncertianty estimates are need, set se=TRUE. When the number of treated units is small, an inferential method based a parametric bootstrap procedure is preferred (the alternative is a non-parametric bootstrap procedure by setting the inference option to nonparametric; this only works when the treatment group is relatively large, e.g. > 40). The number of bootstrap runs is set by nboots.

system.time(
    out <- gsynth(Y ~ D + X1 + X2, data = simdata, 
                  index = c("id","time"), force = "two-way",
                  CV = TRUE, r = c(0, 5), se = TRUE, 
                  inference = "parametric", nboots = 1000,
                  parallel = FALSE)
)
## Cross-validating ... 
##  r = 0; sigma2 = 1.74717; IC = 0.56867; MSPE = 2.37280
##  r = 1; sigma2 = 1.42977; IC = 0.76329; MSPE = 1.71737
##  r = 2; sigma2 = 0.93928; IC = 0.72756; MSPE = 1.14525*
##  r = 3; sigma2 = 0.88977; IC = 1.04715; MSPE = 1.15013
##  r = 4; sigma2 = 0.83864; IC = 1.35103; MSPE = 1.21356
##  r = 5; sigma2 = 0.79605; IC = 1.65129; MSPE = 1.23830
## 
##  r* = 2
## 
## 
Simulating errors .............
Bootstrapping ...
## ..........
Call:
## gsynth.formula(formula = Y ~ D + X1 + X2, data = simdata, index = c("id", 
##     "time"), force = "two-way", r = c(0, 5), CV = TRUE, se = TRUE, 
##     nboots = 1000, inference = "parametric", parallel = FALSE)
## 
## Average Treatment Effect on the Treated:
##      ATT.avg   S.E. CI.lower CI.upper p.value
## [1,]   5.543 0.2697    4.989    5.998       0
## 
##    ~ by Period (including Pre-treatment Periods):
##          ATT   S.E. CI.lower CI.upper p.value n.Treated
## 1   0.392362 0.5364  -0.6976  1.43781   0.506         0
## 2   0.276705 0.4413  -0.6122  1.06986   0.546         0
## 3  -0.274839 0.5280  -1.3327  0.65810   0.598         0
## 4   0.440915 0.4471  -0.3842  1.39061   0.282         0
## 5  -0.889404 0.4674  -1.7844  0.03404   0.056         0
## 6   0.593545 0.4865  -0.2511  1.59628   0.158         0
## 7   0.527967 0.3882  -0.2906  1.20897   0.238         0
## 8   0.170937 0.5408  -0.9018  1.23759   0.744         0
## 9   0.611304 0.4725  -0.2854  1.51483   0.184         0
## 10  0.170545 0.4556  -0.6430  1.13949   0.634         0
## 11 -0.272413 0.5868  -1.3133  0.89473   0.680         0
## 12  0.094706 0.5018  -0.9205  1.10277   0.804         0
## 13 -0.651881 0.5542  -1.8873  0.30746   0.172         0
## 14  0.573748 0.4840  -0.4206  1.39505   0.338         0
## 15 -0.469730 0.4923  -1.4320  0.54169   0.344         0
## 16 -0.077895 0.5575  -1.2725  0.87512   0.802         0
## 17 -0.141584 0.5615  -1.1158  0.93765   0.842         0
## 18 -0.156434 0.3874  -0.8990  0.64817   0.696         0
## 19 -0.914963 0.5131  -1.8203  0.18181   0.102         0
## 20 -0.003591 0.3468  -0.6055  0.73186   0.964         0
## 21  1.236193 0.7025  -0.1048  2.59920   0.084         5
## 22  1.629796 0.5609   0.4213  2.59323   0.008         5
## 23  2.711476 0.5646   1.5926  3.74436   0.000         5
## 24  3.466495 0.7158   1.9851  4.76607   0.000         5
## 25  5.739896 0.5459   4.5478  6.69967   0.000         5
## 26  5.280624 0.5907   4.1732  6.49077   0.000         5
## 27  8.435863 0.4628   7.4564  9.26511   0.000         5
## 28  7.839314 0.6643   6.2574  8.84660   0.000         5
## 29  9.454987 0.5296   8.4017 10.44941   0.000         5
## 30  9.638255 0.4974   8.6320 10.58700   0.000         5
## 
## Coefficients for the Covariates:
##     beta    S.E. CI.lower CI.upper p.value
## X1 1.021 0.02973   0.9624    1.081       0
## X2 3.052 0.02917   2.9934    3.109       0
##    user  system elapsed 
##  10.602   0.092  10.985

Because the core function is written in C++, the algorithm runs relatively fast. The entire procedure (of 1,000 bootstrap runs) takes about 15 seconds on a 2013 model MacBook Pro laptop.

The algorithm prints out the results automatically. sigma2 stands for the estimated variance of the error term; IC represents the Bayesian Information Criterion; and MPSE is the Mean Squared Prediction Error. The cross-validation procedure selects an \(r\) that minimizes the MSPE.

Users can also use print() or directly access the gsynth object to retrieve the results. est.att reports the average treatment effect on the treatd (ATT) by period; est.avg shows the ATT of all periods; and est.beta reports the coefficients of the time-varying covariates. Treatment effect estimates from each bootstrap run is stored in eff.boot, an array whose dimension = (#time periods * #treated * #bootstrap runs).

print(out)
out$est.att
out$est.avg
out$est.beta

Parallel computing may speed up the boostrap procedue. If paralllel=TRUE (default) and the cores option is omitted, the algorithm will detect the number of available cores automatrically (warning: it may consume most of the computational power of your computor if all cores are being used.)

system.time(
    out <- gsynth(Y ~ D + X1 + X2, data = simdata, 
                  index = c("id","time"), force = "two-way",
                  CV = TRUE, r = c(0, 5), se = TRUE, 
                  inference = "parametric", nboots = 1000,
                  parallel = TRUE, cores = 4)
)
## Parallel computing ...
## Cross-validating ... 
##  r = 0; sigma2 = 1.74717; IC = 0.56867; MSPE = 2.37280
##  r = 1; sigma2 = 1.42977; IC = 0.76329; MSPE = 1.71737
##  r = 2; sigma2 = 0.93928; IC = 0.72756; MSPE = 1.14525*
##  r = 3; sigma2 = 0.88977; IC = 1.04715; MSPE = 1.15013
##  r = 4; sigma2 = 0.83864; IC = 1.35103; MSPE = 1.21356
##  r = 5; sigma2 = 0.79605; IC = 1.65129; MSPE = 1.23830
## 
##  r* = 2
## 
## 
Simulating errors ...
Bootstrapping ...
## 
Call:
## gsynth.formula(formula = Y ~ D + X1 + X2, data = simdata, index = c("id", 
##     "time"), force = "two-way", r = c(0, 5), CV = TRUE, se = TRUE, 
##     nboots = 1000, inference = "parametric", parallel = TRUE, 
##     cores = 4)
## 
## Average Treatment Effect on the Treated:
##      ATT.avg   S.E. CI.lower CI.upper p.value
## [1,]   5.543 0.2806    4.974    6.082       0
## 
##    ~ by Period (including Pre-treatment Periods):
##          ATT   S.E. CI.lower  CI.upper p.value n.Treated
## 1   0.392362 0.5611  -0.7371  1.521582   0.592         0
## 2   0.276705 0.4393  -0.6055  1.118579   0.598         0
## 3  -0.274839 0.5259  -1.3542  0.748188   0.576         0
## 4   0.440915 0.4475  -0.4064  1.325776   0.284         0
## 5  -0.889404 0.5005  -1.9230 -0.006671   0.050         0
## 6   0.593545 0.4857  -0.2499  1.630306   0.214         0
## 7   0.527967 0.4105  -0.2609  1.265716   0.230         0
## 8   0.170937 0.5775  -0.8298  1.344019   0.706         0
## 9   0.611304 0.5015  -0.3620  1.580188   0.230         0
## 10  0.170545 0.4715  -0.7297  1.118308   0.728         0
## 11 -0.272413 0.5836  -1.3864  0.880265   0.708         0
## 12  0.094706 0.4999  -0.9709  1.015215   0.856         0
## 13 -0.651881 0.5358  -1.7196  0.346603   0.194         0
## 14  0.573748 0.4822  -0.4998  1.385880   0.320         0
## 15 -0.469730 0.5110  -1.5119  0.474601   0.318         0
## 16 -0.077895 0.5421  -1.1294  0.987679   0.946         0
## 17 -0.141584 0.5484  -1.1591  1.015823   0.884         0
## 18 -0.156434 0.4032  -0.9020  0.684593   0.778         0
## 19 -0.914963 0.5219  -1.9197  0.031617   0.068         0
## 20 -0.003591 0.3623  -0.6376  0.773544   0.914         0
## 21  1.236193 0.7254  -0.1330  2.630782   0.078         5
## 22  1.629796 0.5662   0.4698  2.723624   0.006         5
## 23  2.711476 0.5354   1.7512  3.821544   0.000         5
## 24  3.466495 0.7092   2.1455  4.831086   0.000         5
## 25  5.739896 0.5842   4.5722  6.841840   0.000         5
## 26  5.280624 0.5579   4.2254  6.444443   0.000         5
## 27  8.435863 0.4878   7.4058  9.327691   0.000         5
## 28  7.839314 0.6548   6.4642  8.997924   0.000         5
## 29  9.454987 0.5272   8.3794 10.381223   0.000         5
## 30  9.638255 0.5235   8.6421 10.663048   0.000         5
## 
## Coefficients for the Covariates:
##     beta    S.E. CI.lower CI.upper p.value
## X1 1.021 0.03023   0.9638    1.082       0
## X2 3.052 0.03021   2.9980    3.116       0
##    user  system elapsed 
##   1.108   0.065   4.824

Plotting

By default, the plot() function produce a gap plot (as if we type plot(out, type = "gap")), or the ATT by period. For your reference, the true effect in simdata go from 1 to 10 (plus some white noise) from period 21 to 30.

plot(out) # by default

Users can adjust xlim and ylim, and supply the plot title and titles for the two axes.

plot(out, type = "gap", ylim = c(-3,12), xlab = "Period", main = "My GSynth Plot")

The other four type options include "raw", which plots the raw data (outcome variable); "counterfactual", which plots the estimated counterfactual(s); and "factors" and "loadings", which plot estimated factors and loadings, respectively. The next figure is a raw plot. The treated (pre and post) and control untis are painted with different colors.

plot(out, type = "raw")