1  Get Started

This chapter provides installation instructions and introduces the datasets used in the tutorial.

1.1 Installation

Source           Version  Date        Features
CRAN             2.4.5    2026-05-30  Default release
GitHub (master)  2.4.5    2026-05-30  Same as CRAN
GitHub (dev)     2.4.5    2026-05-30  Same as CRAN

To install fect from CRAN:

install.packages("fect")

For the most up-to-date stable version on GitHub:

devtools::install_github("xuyiqing/fect")

Bug fixes land on the dev branch first, so it is often ahead of master:

devtools::install_github("xuyiqing/fect@dev")

Check the installed version:

installed.packages()["fect", "Version"]
#> [1] "2.4.5"
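
When a script needs to enforce a minimum version, it can help to compare version objects rather than raw strings, because string comparison orders "2.4.10" before "2.4.5". The following is a minimal base-R sketch; the helper name meets_minimum and its default threshold are illustrative and not part of fect.

```r
# Hypothetical helper: TRUE if `installed` is at least `required`.
meets_minimum <- function(installed, required = "2.4.5") {
  package_version(installed) >= package_version(required)
}

meets_minimum("2.4.5")    # TRUE
meets_minimum("2.4.10")   # TRUE: versions are compared component by component
"2.4.10" >= "2.4.5"       # FALSE: plain string comparison gets this wrong

# In practice, pass the installed version, for example:
# meets_minimum(as.character(packageVersion("fect")))
```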

We also recommend installing panelView, a package for visualizing panel data that is used throughout the tutorial:

devtools::install_github('xuyiqing/panelView')

fect depends on the following packages, which are normally installed automatically along with fect. You can also install them manually:

# Install each package in `packages` that is not already installed.
install_all <- function(packages) {
  installed_pkgs <- installed.packages()[, "Package"]
  for (pkg in packages) {
    if (!pkg %in% installed_pkgs) {
      install.packages(pkg)
    }
  }
}
packages <- c("abind", "doParallel", "doRNG", "fixest", "foreach", "future", 
              "GGally", "ggplot2", "grid", "gridExtra", "MASS", 
              "panelView", "Rcpp")
install_all(packages)

1.2 Datasets

The fect package ships several datasets. Because the package sets LazyData: true, every dataset is available as soon as the package is loaded; the data() calls below copy them into the workspace explicitly.

library(fect)
data(simdata)
data(sim_base)
data(sim_gsynth)
data(sim_linear)
data(sim_trend)
data(sim_region)
data(hh2019)
data(gs2020)
data(turnout)
ls()
#> [1] "gs2020"     "hh2019"     "sim_base"   "sim_gsynth" "sim_linear"
#> [6] "sim_region" "sim_trend"  "simdata"    "turnout"
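
To check which datasets an installed copy of the package actually ships, data(package = ...) lists them without loading anything. The sketch below uses only base R and skips the listing when fect is not installed; the variable names are illustrative.

```r
# List the datasets bundled with fect, if the package is installed.
has_fect <- requireNamespace("fect", quietly = TRUE)
if (has_fect) {
  # `results` is a character matrix with columns Package, LibPath, Item, Title
  shipped <- data(package = "fect")$results[, "Item"]
  print(sort(shipped))
} else {
  message("fect is not installed; see Section 1.1")
}
```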

1.2.1 Simulated datasets

The package includes six simulated panel datasets. Two (simdata and sim_base) are based on the data-generating process (DGP) in Liu, Wang, and Xu (2024). Both have \(N = 200\) units and \(T = 35\) time periods. Treatment switches on and off over time (99 of 150 treated units experience at least one reversal), reflecting a general treatment pattern rather than simple staggered adoption. Three more (sim_trend, sim_region, sim_linear) are block DID designs used to demonstrate CFE model components. The sixth (sim_gsynth) is a no-reversal DGP from Xu (2017) used to illustrate the synthetic-control regime.
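
A treatment reversal here means a unit's treatment indicator drops from 1 back to 0 at some point. As a rough illustration of how such a count can be computed, the sketch below uses a four-unit toy panel rather than simdata itself; the column names id, time, and D and the helper has_reversal are assumptions made for the example.

```r
# Toy long-format panel (hypothetical column names: id, time, D).
toy <- data.frame(
  id   = rep(1:4, each = 5),
  time = rep(1:5, times = 4),
  D    = c(0, 0, 0, 0, 0,   # unit 1: never treated
           0, 0, 1, 1, 1,   # unit 2: staggered adoption, no reversal
           0, 1, 1, 0, 0,   # unit 3: one reversal
           0, 1, 0, 1, 0)   # unit 4: repeated reversals
)
toy <- toy[order(toy$id, toy$time), ]  # diff() below relies on time order

# A reversal is any 1 -> 0 step within a unit's treatment path.
has_reversal <- function(d) any(diff(d) < 0)

ever_treated <- tapply(toy$D, toy$id, max) == 1
reversal     <- tapply(toy$D, toy$id, has_reversal)

c(treated = sum(ever_treated), with_reversal = sum(reversal))
# treated = 3, with_reversal = 2
```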

The full DGP for simdata is:

\[Y_{it} = \tau_{it} D_{it} + X_{1,it} + 3 X_{2,it} + \mu + 3\alpha_i + \xi_t + 2\, \lambda_i' f_t + \varepsilon_{it}\]

where \(\alpha_i \sim N(0,1)\) are unit fixed effects; \(\xi_t\) are time fixed effects that follow an AR(1) process with drift; \(X_{1,it}\) and \(X_{2,it}\) are observed covariates, each drawn from \(N(0,1)\), with coefficients 1 and 3; \(\lambda_i \in \mathbb{R}^2\) are unit-specific factor loadings drawn from \(N(0.5, 1)\); \(f_t \in \mathbb{R}^2\) are latent time factors (one trending, one white noise); and \(\varepsilon_{it} \sim N(0,2)\). The grand mean is \(\mu = 5\). The treatment effect is heterogeneous: \(\tau_{it} \sim N(0.4 \cdot \text{tr\_cum}_{it}/T,\; 0.2)\), where \(\text{tr\_cum}_{it}\) counts cumulative treatment periods.

The factor contribution carries a coefficient of 2, versus 1 in the original DGP of Liu, Wang, and Xu (2024), which gives a factor signal-to-noise ratio of approximately 10.9. This is well above the threshold needed for cross-validated rank selection to recover the true rank reliably on this dataset.

Treatment assignment is correlated with unobservables: the latent index \(D^*_{it}\) depends on the factor component \(5 \lambda_i' f_t\), the unit fixed effect \(2\alpha_i\), the time fixed effect \(2\xi_t\), and an AR(1)-like persistence term \(5 D_{i,t-1}\), passed through a logistic link. Units with larger factor loadings and fixed effects are more likely to be treated, so treatment status is correlated with the same unobserved components that drive the outcome. This is why the FE estimator is biased when factors are present: treated units differ systematically in their factor loadings, so the parallel trends assumption fails.
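
The outcome equation and the treatment assignment rule described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the script in data-raw/ and not a reproduction of simdata: the AR(1) coefficient and drift, the shape of the trending factor, the intercept shift in the latent index, and the reading of \(N(a, b)\) as mean \(a\) and variance \(b\) are all placeholders chosen for the example.

```r
set.seed(1)
N  <- 200   # units
TT <- 35    # periods (T in the text; TT avoids masking R's T)
mu <- 5     # grand mean

# ASSUMPTION: N(a, b) in the text is read as mean a, variance b.
alpha  <- rnorm(N)                                # unit fixed effects
lambda <- matrix(rnorm(N * 2, mean = 0.5), N, 2)  # factor loadings, N(0.5, 1)

# ASSUMPTION: AR(1) coefficient 0.5 and drift 0.1 are placeholders.
xi <- numeric(TT)
xi[1] <- rnorm(1)
for (t in 2:TT) xi[t] <- 0.1 + 0.5 * xi[t - 1] + rnorm(1)

# ASSUMPTION: one linear-trend factor and one white-noise factor.
f  <- cbind(seq_len(TT) / TT, rnorm(TT))          # TT x 2
Lf <- lambda %*% t(f)                             # N x TT, lambda_i' f_t

X1  <- matrix(rnorm(N * TT), N, TT)
X2  <- matrix(rnorm(N * TT), N, TT)
eps <- matrix(rnorm(N * TT, sd = sqrt(2)), N, TT)
xi_mat <- matrix(xi, N, TT, byrow = TRUE)         # same xi_t in every row

# Treatment: logistic link on the latent index, with persistence in D.
# ASSUMPTION: `shift` is a hypothetical intercept; the text does not give one.
shift <- -8
D <- matrix(0L, N, TT)
for (t in seq_len(TT)) {
  lag_D <- if (t == 1) rep(0, N) else D[, t - 1]
  index <- shift + 5 * Lf[, t] + 2 * alpha + 2 * xi[t] + 5 * lag_D
  D[, t] <- rbinom(N, 1, plogis(index))
}

# Heterogeneous effect whose mean grows with cumulative exposure.
tr_cum <- t(apply(D, 1, cumsum))                  # N x TT
tau <- matrix(rnorm(N * TT, mean = as.vector(0.4 * tr_cum / TT),
                    sd = sqrt(0.2)), N, TT)

# Outcome equation, term by term as in the text.
Y <- tau * D + X1 + 3 * X2 + mu + 3 * alpha + xi_mat + 2 * Lf + eps
dim(Y)
```

Because the loadings and fixed effects enter both the latent index and the outcome, units that end up treated in this sketch tend to have larger \(\lambda_i\) and \(\alpha_i\), which is the source of the confounding.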

  • simdata: The main simulated dataset. The outcome includes two latent factors (\(r = 2\)), so the parallel trends assumption is violated and the FE estimator is biased. Because treatment assignment loads on the same factors and fixed effects that enter the outcome—units with larger \(\lambda_i\) and \(\alpha_i\) are more likely to be treated—the confounding is structural and cannot be removed by two-way fixed effects alone. Used in Chapter 4 and Chapter 5 to demonstrate factor-augmented approaches.

  • sim_base: A simplified version of simdata in which the latent factor contributions (\(\lambda_i' f_t\)) are removed from the outcome. The parallel trends assumption holds, and the FE estimator is consistent. Treatment assignment, covariates, fixed effects, and errors are identical to simdata—treatment still correlates with the factor loadings and fixed effects, but this no longer causes bias because the factors do not enter the outcome. Used in Chapter 2 to demonstrate the imputation estimator.

  • sim_trend: A block DID dataset with unit-specific sinusoidal time trends. It has \(N = 200\) units (\(80\) treated, \(120\) control) and \(T = 50\) periods; treatment starts at period 41 for all treated units. The DGP is: \[Y_{it} = \alpha_i + \xi_t + \kappa_i \sin(2\pi t / 2T) + \tau D_{it} + \varepsilon_{it}\] where \(\kappa_i \sim U(0.5, 1.0)\) for treated units and \(\kappa_i \sim U(0.125, 0.375)\) for controls. Treatment is deterministic (block assignment), but treated units load more heavily on the sinusoidal time trend, so \(\kappa_i\) is correlated with \(D_i\) and parallel trends is violated. Used in Chapter 5 to demonstrate nonlinear unit-specific time trends with B-splines.

  • sim_linear: A block DID dataset with unit-specific linear time trends. It has the same structure as sim_trend (\(N = 200\), \(T = 50\), treatment starting at period 41), but the trend is \(\kappa_i \cdot t/T\) rather than sinusoidal. Treated units have slopes \(\kappa_i \sim U(2, 4)\); controls have \(\kappa_i \sim U(0, 0.5)\). Used in Chapter 5 to demonstrate Q.type = "linear".

  • sim_region: An unbalanced panel with region-specific time effects. \(N = 500\) units in 5 regions, \(T = 20\) periods. The DGP is: \[Y_{it}^{0} = \alpha_i + \xi_t + \delta_{g(i),t} + \varepsilon_{it}\] where \(\delta_{g(i),t}\) are region-specific linear time trends. Treatment probability and timing depend on region, and units in higher-numbered regions enter the panel later. Used in Chapter 5 to demonstrate additional fixed effects in the CFE estimator.

  • sim_gsynth: A simulated dataset with no treatment reversal, based on Xu (2017). Used in Chapter 6 to demonstrate the never-treated estimation regime (generalized synthetic control).

The scripts that generate simulated datasets are in data-raw/.
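
To see why the unit-specific trends in sim_trend and sim_linear matter, the following sketch simulates a block DID panel of the same shape and computes a naive difference-in-differences of group means. It is an illustration only, not the data-raw/ code: the helpers sim_block and naive_did are hypothetical, \(\alpha_i\), \(\xi_t\), and the errors are assumed standard normal, \(\tau\) is assumed constant at 1, the sinusoid is read as \(\sin(2\pi t / (2T))\), and the linear design is assumed to have 80 treated units like sim_trend.

```r
set.seed(2)

# Hypothetical helper, not part of fect: simulate a block DID panel whose
# treated units load more heavily on a common time trend.
sim_block <- function(trend = c("sin", "linear"), N = 200, N_tr = 80,
                      TT = 50, T_start = 41, tau = 1) {
  trend   <- match.arg(trend)
  treated <- seq_len(N) <= N_tr
  period  <- seq_len(TT)
  if (trend == "sin") {
    kappa <- ifelse(treated, runif(N, 0.5, 1.0), runif(N, 0.125, 0.375))
    shape <- sin(2 * pi * period / (2 * TT))  # ASSUMPTION: 2*pi*t / (2T)
  } else {
    kappa <- ifelse(treated, runif(N, 2, 4), runif(N, 0, 0.5))
    shape <- period / TT
  }
  # Block assignment: treated units switch on at T_start and stay on.
  D <- outer(as.numeric(treated), as.numeric(period >= T_start))
  # ASSUMPTION: standard normal alpha_i, xi_t, and errors; constant tau.
  Y <- rnorm(N) + matrix(rnorm(TT), N, TT, byrow = TRUE) +
    outer(kappa, shape) + tau * D + matrix(rnorm(N * TT), N, TT)
  list(Y = Y, D = D, treated = treated, post = period >= T_start)
}

# Naive DID of group means, ignoring the unit-specific trends.
naive_did <- function(s) {
  change <- rowMeans(s$Y[, s$post]) - rowMeans(s$Y[, !s$post])
  mean(change[s$treated]) - mean(change[!s$treated])
}

naive_did(sim_block("sin"))     # below tau = 1 in expectation
naive_did(sim_block("linear"))  # above tau = 1 in expectation
```

The unit and time fixed effects cancel in the difference-in-differences, so the remaining gap from \(\tau\) comes from the difference in average \(\kappa_i\) between groups multiplied by the change in the trend between the pre- and post-treatment periods.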

1.2.2 Empirical datasets

  • turnout: Based on Xu (2017), who studies the effect of Election Day Registration (EDR) laws on voter turnout in U.S. states. Used in Chapter 6 alongside sim_gsynth.
  • gs2020: Based on Grumbach and Sahn (2020), who examine the effect of minority candidate presence on the proportion of coethnic donations in U.S. House elections. Used in Chapter 9 and Chapter 11.
  • hh2019: Based on Hainmueller and Hangartner (2019), who study the effect of indirect versus direct democracy on naturalization rates in Switzerland. Used in Chapter 9, Chapter 10, and Chapter 11.