**R** package for *Generalized Synthetic Control Method: for Causal Inference with Interactive Fixed Effect Models*

**Authors:** Yiqing Xu (UCSD); Licheng Liu (Tsinghua)

**Date:** Dec 9, 2017

**Package:** gsynth

**Version:** 1.0.7 (Github version). 1.0.5 (CRAN version). Please report bugs!

**Updates:**

Add “implied weights” of control units for each treated unit to the ouput of the main function (

`wgt.implied`

)Add an option to visualize missing data before running regressions to the main function (

`quick_missing = TRUE`

).Add a plot to visualize missing data and treatment status (

`type = "missing"`

)Accomdate unbalanced panels

**R** source files can be found on Github. R code used in this demonstration can be downloaded from here.

**Reference**: Yiqing Xu, 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis, Volume 25, Issue 1, January 2017, pp. 57-76. Available at: http://dx.doi.org/10.1017/pan.2016.2.

- Installation
- Example 1: Simulated Data
- Example 2: Election-Day Registration on Voter Turnout
- Unbalanced Panels

You can install the **gsyth** package directly from CRAN by typing the following command in the **R** console:

`install.packages('gsynth', type = 'source')`

If you plan to use the lastest functionalities (e.g. dealing with unbalanced panels), please install the deveolopment version of the package from Github by typing the following commands:

```
install.packages('devtools', repos = 'http://cran.us.r-project.org') # if not already installed
devtools::install_github('xuyiqing/gsynth')
```

**gsynth** depends on the following packages, which will be installed AUTOMATICALLY when **gsynth** is being installed; you can also install them manually:

```
## for processing C++ code
require(Rcpp)
## for plotting
require(ggplot2)
require(GGally)
## for parallel computing
require(foreach)
require(doParallel)
require(abind)
```

Two datasets **simdata** and **turnout** are shipped with the **gsynth** package. Load these two datasets:

```
library(gsynth)
data(gsynth)
ls()
```

`## [1] "simdata" "turnout"`

We start with the first example, a simulated dataset describled in the paper. There are 5 treated units, 45 control units, and 30 time periods. The treatment kicks at Period 21 for all treated units.

`head(simdata)`

```
## id time Y D X1 X2 eff error mu alpha
## 1 101 1 6.210998 0 0.3776736 -0.1732470 0 0.2982276 5 -0.06052364
## 2 101 2 4.027106 0 1.7332009 -0.4945009 0 0.6365697 5 -0.06052364
## 3 101 3 8.877187 0 1.8580159 0.4984432 0 -0.4837806 5 -0.06052364
## 4 101 4 11.515346 0 1.3943369 1.1272713 0 0.5168620 5 -0.06052364
## 5 101 5 5.971526 0 2.3636963 -0.1535215 0 0.3689645 5 -0.06052364
## 6 101 6 8.237905 0 0.5370867 0.8774397 0 -0.2153805 5 -0.06052364
## xi F1 L1 F2 L2
## 1 1.1313372 0.25331851 -0.04303273 0.005764186 -0.8804667
## 2 -1.4606401 -0.02854676 -0.04303273 0.385280401 -0.8804667
## 3 0.7399475 -0.04287046 -0.04303273 -0.370660032 -0.8804667
## 4 1.9091036 1.36860228 -0.04303273 0.644376549 -0.8804667
## 5 -1.4438932 -0.22577099 -0.04303273 -0.220486562 -0.8804667
## 6 0.7017843 1.51647060 -0.04303273 0.331781964 -0.8804667
```

Before we conduct any statistical analysis, it is helpful to visusalize the data structure and spot missing values (if there are any). The following figure shows that: (1) there are 5 treated units and 45 control units; (2) the treated units start to be treated in period 21; and (3) there are no missing values, which is a rare case. Note that when the `quick_missing`

option is switched on, no estimation will be performed.

`gsynth(Y ~ D, data = simdata, index = c("id","time"), quick_missing = TRUE) `

We estimate the model only using information of the outcome variable \(Y\), the treatment indicator \(D\), and two observed covariates \(X_{1}\) and \(X_{2}\) (as well as the group indicators \(id\) and \(time\)).

Now we formally run the **gsynth** algorithm. The first variable on the right hand side of the formula is a binary treatement indicator. The rest of the right hand side variables serve as controls. The `index`

option specifies the unit and time indicators.

The force option specifies the additive component(s) of the fixed effects included in the model (with four options: “none”, “unit”, “time”, and “two-way”). The default option is “unit” (including additive unit fixed effects). A cross-validation procedure is provided (when `CV=TRUE`

) to select the number of unobserved factors within the interval of `r=c(0,5)`

. When cross-validation is switched off, the first item of `r`

will be set as the number of factors.

When the uncertianty estimates are needed, set `se=TRUE`

. When the number of treated units is small, an inferential method based a `parametric`

bootstrap procedure is preferred (the alternative is a non-parametric bootstrap procedure by setting the inference option to `nonparametric`

; this only works when the treatment group is relatively large, e.g. > 40). The number of bootstrap runs is set by `nboots`

.

```
system.time(
out <- gsynth(Y ~ D + X1 + X2, data = simdata,
index = c("id","time"), force = "two-way",
CV = TRUE, r = c(0, 5), se = TRUE,
inference = "parametric", nboots = 1000,
parallel = FALSE)
)
```

```
## Cross-validating ...
## r = 0; sigma2 = 1.74717; IC = 0.56867; MSPE = 2.37280
## r = 1; sigma2 = 1.42977; IC = 0.76329; MSPE = 1.71737
## r = 2; sigma2 = 0.93928; IC = 0.72756; MSPE = 1.14525*
## r = 3; sigma2 = 0.88977; IC = 1.04715; MSPE = 1.15013
## r = 4; sigma2 = 0.83864; IC = 1.35103; MSPE = 1.21356
## r = 5; sigma2 = 0.79605; IC = 1.65129; MSPE = 1.23830
##
## r* = 2
##
##
Simulating errors .............
Bootstrapping ...
## ..........
```

```
## user system elapsed
## 10.662 0.074 11.244
```

Because the core function is written in C++, the algorithm runs relatively fast. The entire procedure (of 1,000 bootstrap runs) takes about 15 seconds on a 2013 model MacBook Pro laptop.

The algorithm prints out the results automatically. `sigma2`

stands for the estimated variance of the error term; `IC`

represents the Bayesian Information Criterion; and `MPSE`

is the Mean Squared Prediction Error. The cross-validation procedure selects an \(r\) that minimizes the MSPE.

Users can use the **print** function to visualize the result or retrieve the numbers by directly accessing the gsynth object. `est.att`

reports the average treatment effect on the treatd (ATT) by period; `est.avg`

shows the ATT of all periods; and `est.beta`

reports the coefficients of the time-varying covariates. Treatment effect estimates from each bootstrap run is stored in `eff.boot`

, an array whose dimension = (#time periods * #treated * #bootstrap runs).

```
print(out)
out$est.att
out$est.avg
out$est.beta
```

Parallel computing may speed up the boostrap procedue significantly. When `paralllel=TRUE`

(default) and `cores`

options are omitted, the algorithm will detect the number of available cores automatrically (warning: it may consume most of the computational power of your computor if all cores are being used.)

```
system.time(
out <- gsynth(Y ~ D + X1 + X2, data = simdata,
index = c("id","time"), force = "two-way",
CV = TRUE, r = c(0, 5), se = TRUE,
inference = "parametric", nboots = 1000,
parallel = TRUE, cores = 4)
)
```

```
## Parallel computing ...
## Cross-validating ...
## r = 0; sigma2 = 1.74717; IC = 0.56867; MSPE = 2.37280
## r = 1; sigma2 = 1.42977; IC = 0.76329; MSPE = 1.71737
## r = 2; sigma2 = 0.93928; IC = 0.72756; MSPE = 1.14525*
## r = 3; sigma2 = 0.88977; IC = 1.04715; MSPE = 1.15013
## r = 4; sigma2 = 0.83864; IC = 1.35103; MSPE = 1.21356
## r = 5; sigma2 = 0.79605; IC = 1.65129; MSPE = 1.23830
##
## r* = 2
##
##
Simulating errors ...
Bootstrapping ...
##
```

```
## user system elapsed
## 1.184 0.075 7.339
```

By default, the **plot** function produce a gap plot – as if we type `plot(out, type = "gap")`

– which visualize the estimated ATT by period. For your reference, the true effect in `simdata`

go from 1 to 10 (plus some white noise) from period 21 to 30. You can save the graph object by directing **print** to an object name.

`plot(out) # by default`

Users can adjust `xlim`

and `ylim`

, and supply the plot title and titles for the two axes.

`plot(out, type = "gap", ylim = c(-3,12), xlab = "Period", main = "My GSynth Plot")`