This RMarkdown tutorial replicates the core analyses from Xu and Yao (2015): “Informal Institutions, Collective Action, and Public Investment in Rural China”. The replication is conducted by Jinwen Wu, a predoctoral fellow at Stanford University, under the supervision of Professor Yiqing Xu. It summarizes the main data analyses from the article; for a comprehensive understanding of the ideas presented, please refer to the original paper.

Click the Code button in the top right and select Show All Code to reveal all code used in this RMarkdown. Click Show in paragraphs to reveal the code used to generate a finding. Data and R files that replicate this RMarkdown can be downloaded here.


Informal institutions, such as lineage groups, play a crucial role in rural governance, particularly where formal structures are weak. Xu and Yao (2015) examine the relationship between lineage-based informal institutions and public goods provision in rural China. Analyzing panel data from 220 Chinese villages from 1986 to 2005, they find that village leaders from the two largest family clans significantly increased local public investment, with stronger effects in more cohesive clans. They find that clans helped leaders overcome collective action challenges in financing public goods, but there is little evidence that they helped through improved leader accountability.

1 Conceptual Framework

1.1 Informal Institutions

Following Helmke and Levitsky (2012), Xu and Yao (2015) define informal institutions as “rules and norms that are created and enforced by social groups rather than the state.” The paper primarily focuses on those that may influence public goods provision.

In the absence of strong formal institutions, as seen in rural China, public goods provision must address two fundamental challenges:

  • Collective Action: Convincing resource-constrained community members to contribute.
  • Accountability: Motivating local leaders to initiate projects while preventing issues like embezzlement and corruption.

Xu and Yao (2015) examine whether informal institutions, such as clans, can facilitate public goods provision through one or both of these mechanisms.

1.2 Family Clans

Family clans share several key characteristics that contribute to village governance in rural China.

  1. They serve as a vital social identity for villagers (Freedman 1958; Tsai 2007).

  2. They have historically facilitated collective production. Today, rural entrepreneurs often hire relatives to strengthen clan-based business networks (Watson 1982; Oi 1999).

  3. Clan members share a strong sense of obligation to their group, deeply valuing kinship ties and social bonds (Madsen 1984).

  4. Clan leaders play a crucial role in maintaining order by enforcing social norms and resolving conflicts both within and beyond the clan.

1.3 Potential Causal Path

Government officials from large clans in the region have stronger connections and networks within local communities. As members of robust informal institutions, they are expected to deliver more public goods projects for two reasons, both aligning with the mechanisms discussed in the previous section.

  1. These leaders are more likely to gain support from their clans than those who are either not native to the area or affiliated with only smaller local clans. They can leverage their clans’ influence to better mobilize villagers for public goods projects, potentially alleviating the collective action problem. The authors use financial contributions as a consistent and fair metric to measure participation in social welfare provisions over time.

  2. Leaders from large local clans may feel morally obligated to adhere to their clans’ rules and interests, which is expected to reduce corruption or embezzlement.

If clan linkages enhance public goods provision, the influence of the clans from which leaders originate can shape outcomes by determining: (1) the extent to which leaders can secure local support (through the first mechanism) and (2) the likelihood that local politicians will act more effectively (through the second mechanism).

Xu and Yao (2015) measure clan cohesiveness as a proxy for a leader’s home clan influence. This measurement is based on two indicators: (1) whether large clans have maintained family tree records over time and (2) whether they upheld lineage halls throughout the observation period (1986–2005).

1.4 Research Questions and Implications

This study examines two channels through which informal institutions influence village governance: facilitating collective action for public goods financing and ensuring accountability of village chairpersons (VCs). The findings support the collective action channel, demonstrating that villagers across income levels contributed additional levies for public investment projects. However, the researchers find little evidence that accountability mechanisms impact administrative costs under large clan VCs.

The study acknowledges two unresolved questions: the potential for large clans to dominate grassroots politics—possibly benefiting disproportionately or engaging in corruption—and the interplay between formal and informal institutions.

2 Research Design

2.1 Data

The study uses a panel dataset from the Village Democracy Survey (VDS), which includes data from 220 Chinese villages collected between 1986 and 2005, supplemented by additional data from 2011 and the National Fixed-Point Survey (NFS).

The VDS provides detailed information on electoral reforms, public goods expenditures, and clan structures, sourced from village records and responses from village leaders and elders. Data on clans—including their size, cohesiveness, and activities—were well-corroborated by local consensus.

Merging the two datasets preserves the panel structure and sample size, enabling models that incorporate village and year fixed effects, village- or province-specific time trends, and time-varying covariates.
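As a rough illustration of why the merge preserves the panel, the sketch below joins two toy data frames on the village-year key. Only `vill_id` and `year` are taken from the replication data; which source each illustrative column comes from, and all values, are assumptions made for the example.

```r
# Toy stand-ins for the two sources; values are made up for illustration.
vds <- data.frame(
  vill_id = c(1, 1, 2, 2),
  year    = c(1986, 1987, 1986, 1987),
  vcfirst = c(0, 1, 0, 0)
)
nfs <- data.frame(
  vill_id = c(1, 1, 2),
  year    = c(1986, 1987, 1986),
  logpopl = c(7.1, 7.2, 6.8)
)

# A left join on the village-year key keeps every VDS row, so the panel
# structure and sample size are unchanged; unmatched rows get NA covariates.
panel <- merge(vds, nfs, by = c("vill_id", "year"), all.x = TRUE)
panel <- panel[order(panel$vill_id, panel$year), ]
print(panel)
```

Rows without a match keep their place in the panel and simply carry missing covariates, which is consistent with the smaller sample sizes in models that add controls.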

2.2 Identification Strategies

The study employs two identification strategies. The first relies on a parallel trends assumption and uses two-way fixed effects (TWFE) models. In the replication, the fixed effects counterfactual estimator (FEct) is also used. The treatments are indicators of whether the VC belongs to the largest or second-largest clan in the village, and the outcome is the log of the village-initiated investment amount.
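To make the TWFE idea concrete, here is a minimal simulation sketch in base R. The data, the effect size of 0.4, and all object names are invented for illustration; it is not the authors' specification, only the same structure (unit and year fixed effects plus a binary treatment that can switch on and off).

```r
set.seed(1)
n_vill <- 30
n_year <- 10
sim <- expand.grid(vill_id = 1:n_vill, year = 1:n_year)

alpha <- rnorm(n_vill)                    # village fixed effects
gamma <- rnorm(n_year)                    # year fixed effects
sim$d <- rbinom(nrow(sim), 1, 0.4)        # binary treatment with reversals
true_effect <- 0.4                        # assumed effect, for illustration
sim$y <- true_effect * sim$d + alpha[sim$vill_id] + gamma[sim$year] +
  rnorm(nrow(sim), sd = 0.5)

# TWFE via dummy variables: village and year indicators absorb the fixed effects.
twfe <- lm(y ~ d + factor(vill_id) + factor(year), data = sim)
beta_hat <- unname(coef(twfe)["d"])
print(beta_hat)
```

With a constant treatment effect, as simulated here, the coefficient on `d` should land near the assumed value. `feols()` from fixest estimates the same kind of model more efficiently by absorbing the fixed effects instead of creating dummies.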

The second identification strategy is a regression discontinuity (RD) design, where the running variable is the vote share of a VC candidate from one of the two largest clans in the village relative to a candidate who is not. Because it is an RD design with leader characteristics, the effect needs to be interpreted with caution (Marshall 2024).
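The logic of a sharp RD at a 50% vote-share cutoff can be sketched with a local linear regression and triangular kernel weights. Everything below is simulated, and the bandwidth of 0.2 is hand-picked for the example; `rdrobust` instead selects the bandwidth from the data and applies a bias correction, so this is an illustration of the idea rather than a substitute.

```r
set.seed(2)
n <- 2000
vote <- runif(n)                          # simulated running variable
treated <- as.numeric(vote >= 0.5)        # sharp assignment at the cutoff
y <- 1 + 0.5 * (vote - 0.5) + 0.8 * treated + rnorm(n, sd = 0.5)

h <- 0.2                                  # hand-picked bandwidth (assumption)
x <- vote - 0.5                           # running variable centered at cutoff
w <- pmax(0, 1 - abs(x) / h)              # triangular kernel weights
rd_data <- data.frame(y = y, x = x, treated = treated, w = w)

# Local linear fit with separate slopes on each side of the cutoff;
# the coefficient on `treated` is the estimated jump at the cutoff.
rd_fit <- lm(y ~ treated * x, data = rd_data[rd_data$w > 0, ], weights = w)
rd_est <- unname(coef(rd_fit)["treated"])
print(rd_est)
```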

3 Replicating the Main Findings

3.1 Installing Packages

Several R packages are required for the data analysis and visualization. The code chunk below checks for all required packages and installs the missing ones.

Packages: “tidyr”, “dplyr”, “haven”, “ggplot2”, “paneltools”, “estimatr”, “modelsummary”, “fect”, “fixest”, “kableExtra”, “rdrobust”, “panelView”.

# packages to be installed
packages <- c("tidyr", "dplyr", "haven", "ggplot2", "paneltools", "estimatr", "modelsummary", "fect", "fixest", "kableExtra", "rdrobust", "panelView")

for (pkg in packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

The loop above installs any missing packages and then loads all of them. Next, import the data. The data file, XuYao2015.dta, is located in the Replication folder; the codebook, XuYao2015.pdf, can be found in the same folder.
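The import step is not printed in this tutorial. A minimal sketch follows, assuming the file sits at `Replication/XuYao2015.dta` relative to the working directory and that the data frame is named `df`, as the later chunks expect. Converting to a plain data frame is also an assumption made for compatibility with the panel packages.

```r
# Assumed location relative to the working directory; adjust as needed.
data_path <- file.path("Replication", "XuYao2015.dta")

if (file.exists(data_path)) {
  # haven::read_dta() reads Stata files and keeps the variable labels
  df <- as.data.frame(haven::read_dta(data_path))
  str(df[, c("vill_id", "year")])   # quick check of the panel identifiers
} else {
  message("Data file not found at: ", data_path)
}
```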

3.2 Summary Statistics

First, we generate a summary table of the descriptive statistics for the villages in VDS.

3.2.1 Village-Year Observations

# First set of variables
vars1 <- c("inv", "loginv", "log_levies", "logpopl", "logincome", "logasset", "hhsize","landpc", "logmigration", "logtax", "logtransfer", "share_admin", "postcont", "postopen", "secret_ballot", "proxy_voting", "moving_ballot")

datasummary_skim(df %>% select(all_of(vars1)))

| Variable | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| Public investment projects during the year | 2 | 0 | 0.2 | 0.4 | 0.0 | 0.0 | 1.0 |
| Log amount of public investment (1000 yuan) | 204 | 0 | 1.1 | 2.2 | 0.0 | 0.0 | 10.6 |
| Log amount of levies handed over to the village committee (village mean) | 942 | 71 | 4.2 | 1.9 | 0.0 | 4.7 | 7.1 |
| log village population (persons) | 2134 | 6 | 7.2 | 0.6 | 4.7 | 7.2 | 9.2 |
| log net income per capita (yuan) | 2729 | 6 | 7.2 | 0.8 | 1.9 | 7.3 | 10.4 |
| log assets controlled by the village committee (yuan) | 2666 | 6 | 9.0 | 1.6 | 2.7 | 9.0 | 15.4 |
| village average household size (person) | 3302 | 6 | 3.9 | 0.6 | 2.0 | 4.0 | 6.4 |
| arable land per capita (mu) | 3332 | 6 | 1.7 | 1.9 | 0.0 | 1.2 | 16.2 |
| log number of people migrating out of the village (persons) | 141 | 28 | 2.2 | 1.1 | 0.0 | 2.3 | 5.5 |
| log taxes to the upper-level government (1,000 yuan) | 953 | 32 | 2.3 | 1.9 | 0.0 | 2.6 | 8.8 |
| Log transfers from the upper-level government (1,000 yuan) | 348 | 32 | 1.1 | 1.6 | 0.0 | 0.0 | 7.5 |
| Share of administrative expenditure in total village expenditure | 2484 | 19 | 0.2 | 0.2 | 0.0 | 0.2 | 1.0 |
| post first contested election | 2 | 0 | 0.8 | 0.4 | 0.0 | 1.0 | 1.0 |
| post first election with open nomination | 2 | 0 | 0.7 | 0.5 | 0.0 | 1.0 | 1.0 |
| secret ballot | 2 | 0 | 0.4 | 0.5 | 0.0 | 0.0 | 1.0 |
| proxy voting | 2 | 0 | 0.7 | 0.5 | 0.0 | 1.0 | 1.0 |
| moving ballot boxes | 2 | 0 | 0.7 | 0.5 | 0.0 | 1.0 | 1.0 |

3.2.2 Village Chairperson (by Term)

# Second set of variables
df_elecyr <- df %>% filter(elecyr == 1)
vars2 <- c("vcfirst", "vcsecond", "vcfirst2", grep("^vc_char_", names(df), value = TRUE),"vote", "psfirst2", "vcps_clan", "vcps_person", "vc_pb")

datasummary_skim(df_elecyr %>% select(all_of(vars2)))

| Variable | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| village chief (VC) from the largest clan | 2 | 0 | 0.4 | 0.5 | 0.0 | 0.0 | 1.0 |
| VC from the 2nd largest clan | 2 | 0 | 0.1 | 0.3 | 0.0 | 0.0 | 1.0 |
| VC from the two largest clans | 2 | 0 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 |
| VC characteristics: age when running election | 54 | 9 | 41.6 | 8.7 | 19.0 | 42.0 | 90.0 |
| VC characteristics: family background - poor peasants | 3 | 8 | 0.8 | 0.4 | 0.0 | 1.0 | 1.0 |
| VC characteristics: Communist party member | 3 | 9 | 0.7 | 0.4 | 0.0 | 1.0 | 1.0 |
| VC characteristics: years of formal education | 7 | 8 | 6.4 | 2.3 | 0.0 | 6.0 | 13.0 |
| VC characteristics: former village cadre | 3 | 8 | 0.6 | 0.5 | 0.0 | 1.0 | 1.0 |
| VC characteristics: managerial jobs before election | 3 | 8 | 0.0 | 0.1 | 0.0 | 0.0 | 1.0 |
| VC characteristics: experience of running elections | 3 | 8 | 0.7 | 0.5 | 0.0 | 1.0 | 1.0 |
| VC characteristics: denounced in the Culture Revolution | 3 | 9 | 0.0 | 0.2 | 0.0 | 0.0 | 1.0 |
| relative vote share of VCs of large clans | 254 | 40 | 0.5 | 0.4 | 0.0 | 0.5 | 1.0 |
| village party secretary (VPS) from the two largest clans | 3 | 34 | 0.5 | 0.5 | 0.0 | 1.0 | 1.0 |
| VC and VPS in the same clan | 2 | 0 | 0.2 | 0.4 | 0.0 | 0.0 | 1.0 |
| VC and VPS the same person (one shoulders) | 2 | 0 | 0.1 | 0.3 | 0.0 | 0.0 | 1.0 |
| VC in the party branch | 3 | 37 | 0.6 | 0.5 | 0.0 | 1.0 | 1.0 |

3.2.3 Sample village (as of 2005)

# Third set of variables
df_2005 <- df %>% filter(year == 2005)
vars3 <- c("clan_num", grep("^clansz", names(df), value = TRUE), "largeclan", "family_tree", "ances_hall")

datasummary_skim(df_2005 %>% select(all_of(vars3)))

| Variable | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| No. of clans (surnames) | 66 | 0 | 26.8 | 23.4 | 1.0 | 20.0 | 150.0 |
| population share of the largest clan | 61 | 0 | 0.4 | 0.2 | 0.1 | 0.3 | 1.0 |
| population share of the 2nd largest clan | 38 | 0 | 0.2 | 0.1 | 0.0 | 0.2 | 0.4 |
| population share of the 3rd largest clan | 42 | 9 | 0.1 | 0.1 | 0.0 | 0.1 | 0.3 |
| population share of the 4th largest clan | 44 | 9 | 0.1 | 0.0 | 0.0 | 0.1 | 0.2 |
| sum of the population shares of the two largest clan above median | 2 | 0 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 |
| the 2 largest families has maintained at least a family tree | 3 | 9 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 |
| the 2 largest families has maintained at least a lineage hall | 3 | 9 | 0.2 | 0.4 | 0.0 | 0.0 | 1.0 |

Replicating Table 1 in the article.

3.3 The Dominance of Large Clans

First, Xu and Yao (2015) demonstrate that large kinship groups (based on last names) are common in the sample villages.

# Figure 1(a)
plot(density(df$clansz1,na.rm=T,bw=0.04),ylim=c(0,8),xlab="",ylab="",main="Population of Clans")
mtext("Density", side=2,line=2)
mtext("Population Share", side=1,line=2)
lines(density(df$clansz2,na.rm=T,bw=0.04),lty=5)
lines(density(df$clansz3,na.rm=T,bw=0.04),lty=2)
lines(density(df$clansz4,na.rm=T,bw=0.04),lty=3,lwd=2)
legend("topright", c("Largest", "2nd-largest","3rd-largest", "4th-largest"), 
       lty = c(1,5,2,3), col=1, lwd=c(1,1,1,2), merge = TRUE, cex=1.5,bty="n")  

Replicating Figure 1a in the article.

As shown in the plot above, the combined population of the third-largest and fourth-largest clans often accounts for less than 30% of a village’s total population.

In contrast, the largest and second-largest clans make up an average of 36% and 15% of the population, respectively. In 2005, the typical village in the sample had approximately 1,500 permanent residents, with the largest clan often comprising about 400 individuals, or 100 households.

In the subsequent analysis, the authors assume that local government officials from either of the two largest clans (based on last names) are candidates of the major clans, which serve as vehicles for informal institutions.

3.4 Presence of Government Officials from Large Clans

Similarly, Xu and Yao (2015) plot the likelihood that elected government officials come from the two largest clans.

# Summarize clan data by year
df_clans <- df %>%
  filter(!is.na(year) & !is.na(elecyr)) %>%
  group_by(year) %>%
  summarize(
    largest_clan = sum(vcfirst, na.rm = TRUE),
    second_largest_clan = sum(vcsecond, na.rm = TRUE)
  ) %>%
  ungroup()

# Create a frequency table for total observations by year
freq_table <- table(df$year)
years <- as.numeric(names(freq_table))
freq  <- as.numeric(freq_table)
num_villages_with_elec_vc <- data.frame(year = years, freq = freq)

# Merge clan data with total observations
figure1b <- left_join(df_clans, num_villages_with_elec_vc, by = "year") %>%
  mutate(
    largest_prop = largest_clan / freq,
    second_largest_prop = second_largest_clan / freq
  )

# Plot using dual y-axes
max_freq <- max(figure1b$freq, na.rm = TRUE)

ggplot(figure1b, aes(x = year)) +
  # Left axis: total observations (bars)
  geom_bar(aes(y = freq), stat = "identity", fill = "gray", alpha = 0.5) +
  # Right axis: clan proportions, rescaled to match the bar heights
  geom_line(aes(y = largest_prop * max_freq, color = "Largest Clan"), size = 1) +
  geom_line(aes(y = second_largest_prop * max_freq, color = "Second Largest Clan"), 
            size = 1, linetype = "dashed") +
  scale_y_continuous(
    name = "Number of Villages with Elected VCs",
    limits = c(0, max_freq),
    sec.axis = sec_axis(~ . / max_freq * 100, name = "Percentage %")
  ) +
  scale_color_manual(
    name = "Legend",
    values = c("Largest Clan" = "black", "Second Largest Clan" = "black")
  ) +
  labs(
    x = "Year",
    title = "Number of Elected VCs and Clan Percentages"
  ) +
  theme_minimal()

Replicating Figure 1b in the article.

Figure 1b illustrates the proportion of villages in the sample that have held elections since 1986. By that year, over half of the villages had adopted the rule, and by the mid-1990s, nearly all had conducted at least one election. The solid and dashed lines represent the proportions of VCs elected from the largest and second-largest village clans, respectively. On average, 35% of VCs came from the largest clan, while 13% were from the second-largest clan between 1986 and 2005.

3.5 VCs of Large Clans and Village Public Investment

Before conducting regression analysis, we use the panelView package to examine the treatment assignment schedule. The results indicate that the treatment has reversals.

index = c("vill_id", "year")
panelview(loginv~ vcfirst2, index = index, data = df,  axis.lab="off", ylab = "Year", xlab = "Village", by.timing = TRUE, gridOff = TRUE)
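A treatment reversal means that a village's treatment indicator switches from 1 back to 0 at some point. As a complement to the plot, a check of this kind can be sketched in base R on a toy panel; the data below are invented, and only the column names mirror the replication data.

```r
toy <- data.frame(
  vill_id  = rep(1:3, each = 5),
  year     = rep(2001:2005, times = 3),
  vcfirst2 = c(0, 0, 1, 1, 1,    # village 1: switches on and stays on
               0, 1, 1, 0, 0,    # village 2: switches on, then off
               0, 0, 0, 0, 0)    # village 3: never treated
)
toy <- toy[order(toy$vill_id, toy$year), ]

# A negative first difference within a village marks a switch from 1 to 0.
has_reversal <- tapply(
  toy$vcfirst2, toy$vill_id,
  function(d) any(diff(d) < 0, na.rm = TRUE)
)
print(has_reversal)
```

Reversals matter for estimation because staggered-adoption tools that assume treatment is absorbing do not apply directly, which is one reason to use estimators that accommodate switching.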

3.5.1 Public Investment Sizes

As reviewed in the Conceptual Framework section, the authors aim to analyze how informal institutions contribute to public goods provision. The table below examines the relationship between VCs from large clans and the level of public investment during their tenure.

In the original article, standard errors in all regressions are clustered at the village level. Except for the first model, which includes only year fixed effects, all regressions control for both village and year fixed effects.

# Model 1
model1 <- feols(loginv ~ vcfirst + vcsecond | year, data = df, cluster = ~vill_id)

# Model 2
model2 <- feols(loginv ~ vcfirst + vcsecond | vill_id + year, data = df, cluster = ~vill_id)

# Model 3
model3 <- feols(loginv ~ vcfirst + vcsecond + factor(prov_id):year | vill_id + year , data = df, cluster = ~vill_id)

# Model 4
model4 <- feols(loginv ~ vcfirst + vcsecond + factor(vill_id):year| vill_id + year, data = df, cluster = ~vill_id)

# Model 5
model5 <- feols(loginv ~ vcfirst + vcsecond + hhsize + landpc + logpopl + logincome + logasset + factor(prov_id):year | vill_id + year, data = df, cluster = ~vill_id)

# Model 6
model6 <- feols(loginv ~ vcfirst + vcsecond + hhsize + landpc + logpopl + logincome + logasset + logmigration + logtax + logtransfer + factor(prov_id):year| vill_id + year, data = df, cluster = ~vill_id)

models_table2 <- list(
  "(1)" = model1,
  "(2)" = model2,
  "(3)" = model3,
  "(4)" = model4,
  "(5)" = model5,
  "(6)" = model6
)

modelsummary(
  models_table2,
  fmt = 3,
  stars = TRUE,
  coef_map = c(
    "vcfirst"  = "VC of the largest clan",
    "vcsecond" = "VC of the second-largest clan"
  ),
  # Omit all other terms you do not want in the main coefficient block:
  coef_omit = "Intercept|vill_id|year|prov_id|hhsize|landpc|logpopl|logincome|logasset|logmigration|logtax|logtransfer",
  # Omit unneeded goodness-of-fit stats:
  gof_omit = "AIC|BIC|RMSE|Within|R2",
  add_rows = tribble(
    ~term,                 ~`(1)`, ~`(2)`, ~`(3)`, ~`(4)`, ~`(5)`, ~`(6)`,
    "Prov. linear trends", "",     "",     "x",    "",     "x",    "x",
    "NFS controls",        "",     "",     "",     "",     "x",    "x",
    "Migrants out",        "",     "",     "",     "",     "",     "x",
    "Taxes (upper-level)", "",     "",     "",     "",     "",     "x",
    "Transfers (upper)",   "",     "",     "",     "",     "",     "x",
  ),
  output = "html"
)

| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| VC of the largest clan | 0.332** | 0.412** | 0.379** | 0.359+ | 0.378* | 0.481* |
| | (0.126) | (0.144) | (0.143) | (0.183) | (0.152) | (0.192) |
| VC of the second-largest clan | 0.183 | 0.303* | 0.328* | 0.256 | 0.367* | 0.421+ |
| | (0.151) | (0.144) | (0.140) | (0.187) | (0.150) | (0.218) |
| Num.Obs. | 3742 | 3742 | 3742 | 3742 | 3513 | 2530 |
| Std.Errors | by: vill_id | by: vill_id | by: vill_id | by: vill_id | by: vill_id | by: vill_id |
| FE: year | X | X | X | X | X | X |
| FE: vill_id | | X | X | X | X | X |
| Prov. linear trends | | | x | | x | x |
| NFS controls | | | | | x | x |
| Migrants out | | | | | | x |
| Taxes (upper-level) | | | | | | x |
| Transfers (upper) | | | | | | x |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Replicating Table 2 in the article.

As shown in the table above, the coefficients for both VC dummies are positive across all regressions. The coefficient on the VC of the largest clan dummy is statistically significant at the 5% level in every column except column 4, where it is significant at the 10% level.

In column 2, controlling for year and village fixed effects, the coefficients for the two VC dummies are 0.412 and 0.303. Because the outcome is in logs, these correspond to roughly exp(0.412) - 1 ≈ 51% and exp(0.303) - 1 ≈ 35% more public investment expenditure under VCs from the largest and second-largest clans, respectively. In column 3, which adds provincial linear time trends, the estimates remain stable. Column 4 replaces provincial trends with village-specific linear time trends; the coefficient for VCs from the largest clan is slightly smaller (0.359) and remains statistically significant at the 10% level.

In column 5, provincial linear trends are reinstated along with five time-varying controls from the NFS: log village population, average household size, arable land per capita, log income per capita, and log assets owned by the village committee. These controls capture village size, demographics, agricultural endowment, and economic resources, and the results remain consistent. In column 6, the model further controls for tax revenues, population migration, and intergovernmental transfers to address other potential confounders.
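The percentage interpretation of the column 2 coefficients comes from exponentiating a log-point estimate. A short worked check, using the rounded coefficients reported in the table (so the result is approximate):

```r
# Column 2 coefficients as reported in the table (rounded to three decimals)
b <- c(largest = 0.412, second_largest = 0.303)

# With a logged outcome, a coefficient b implies a 100 * (exp(b) - 1)
# percent change in the outcome when the dummy switches from 0 to 1.
pct <- 100 * (exp(b) - 1)
print(round(pct))
```

For small coefficients the approximation 100 * b is close, but at 0.4 log points the gap is already about ten percentage points, so the exact transformation is the safer one to report.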

An event study plot helps visualize changes in treatment effects over time and assess the plausibility of the parallel trends assumption. For demonstration, VCs from the largest and second-largest clans are evaluated together to examine how those from powerful clans influence public goods provision. Below, we use a dynamic TWFE specification as reviewed in Chiu et al. (2025).

data_cohort <- get.cohort(data=df, index =c("vill_id","year"), D = "vcfirst2", start0 = TRUE)
# Dynamic TWFE
df.twfe <- data_cohort
df.twfe[which(is.na(df.twfe$Time_to_Treatment)),'Time_to_Treatment'] <- 0
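# Interact event time with the treatment only; covariates enter additively
twfe.est <- feols(loginv ~ i(Time_to_Treatment, vcfirst2, ref = -1) + hhsize + landpc + logpopl + logincome + logasset + logmigration + logtax + logtransfer | vill_id + year, data = df.twfe, cluster = "vill_id")
# Keep only the event-time coefficients (covariate rows are dropped)
twfe.output <- as.data.frame(twfe.est$coeftable)
twfe.output <- twfe.output[grepl("^Time_to_Treatment::", rownames(twfe.output)), ]
# Parse the relative period from the coefficient names and shift by one,
# so that the first treated period is plotted as period 1
twfe.output$Time <- as.numeric(sub("^Time_to_Treatment::(-?[0-9]+):.*$", "\\1", rownames(twfe.output))) + 1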
p.twfe <- esplot(twfe.output,Period = 'Time',Estimate = 'Estimate', SE = 'Std. Error', xlim = c(-12,10))
p.twfe

In addition, imputation-based methods can be used to avoid the negative weighting problem when heterogeneous treatment effects (HTE) are present. Here, we use the fect package to estimate the ATT. The results are similar to those obtained from TWFE models.

fect_out <- fect(loginv ~ vcfirst2 +hhsize + landpc + logpopl + logincome + logasset +logmigration + logtax + logtransfer, index = c("vill_id","year"), method = "fe", force = "two-way",se = TRUE, parallel = TRUE, nboots = 200, data = df)
print(fect_out)
## Call:
## fect.formula(formula = loginv ~ vcfirst2 + hhsize + landpc + 
##     logpopl + logincome + logasset + logmigration + logtax + 
##     logtransfer, data = df, index = c("vill_id", "year"), force = "two-way", 
##     method = "fe", se = TRUE, nboots = 200, parallel = TRUE)
## 
## ATT:
##                              ATT   S.E. CI.lower CI.upper p.value
## Tr obs equally weighted   0.3885 0.2057 -0.01467   0.7917 0.05894
## Tr units equally weighted 0.4408 0.2087  0.03173   0.8498 0.03469
## 
## Covariates:
##                   Coef    S.E. CI.lower CI.upper p.value
## hhsize        0.178248 0.27732 -0.36528  0.72178  0.5204
## landpc       -0.007072 0.13419 -0.27009  0.25594  0.9580
## logpopl       0.223525 0.53235 -0.81986  1.26691  0.6746
## logincome     0.032025 0.21585 -0.39104  0.45509  0.8821
## logasset     -0.078000 0.08038 -0.23555  0.07955  0.3319
## logmigration  0.016856 0.07607 -0.13224  0.16596  0.8246
## logtax       -0.029596 0.05486 -0.13712  0.07793  0.5896
## logtransfer   0.018833 0.04637 -0.07205  0.10972  0.6846

We draw an event study plot using fect. The pattern is similar to what was obtained earlier using dynamic TWFE.

plot(fect_out)
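The imputation logic behind FEct can be sketched in three steps with base R: fit the fixed effects model on untreated observations only, predict the untreated outcome for treated observations, and average the differences. The simulation below uses staggered adoption without reversals and an invented effect of 0.4 to keep the example short; `fect()` handles the general case, covariates, and inference.

```r
set.seed(3)
n_vill <- 40
n_year <- 12
sim <- expand.grid(vill_id = 1:n_vill, year = 1:n_year)

# Villages 1-20 adopt treatment in a year between 5 and 9 and keep it;
# villages 21-40 are never treated (simplifying assumption for the sketch).
adopt <- c(sample(5:9, 20, replace = TRUE), rep(Inf, 20))
sim$d <- as.numeric(sim$year >= adopt[sim$vill_id])
alpha <- rnorm(n_vill)
gamma <- rnorm(n_year)
sim$y <- alpha[sim$vill_id] + gamma[sim$year] + 0.4 * sim$d +
  rnorm(nrow(sim), sd = 0.5)
sim$vill_f <- factor(sim$vill_id)
sim$year_f <- factor(sim$year)

# Step 1: estimate village and year effects from untreated observations only.
fit0 <- lm(y ~ vill_f + year_f, data = sim[sim$d == 0, ])

# Step 2: impute the untreated potential outcome for treated observations.
treated_obs <- sim[sim$d == 1, ]
y0_hat <- predict(fit0, newdata = treated_obs)

# Step 3: the ATT is the mean gap between observed and imputed outcomes.
att_hat <- mean(treated_obs$y - y0_hat)
print(att_hat)
```

Because treated observations never enter the model that produces the counterfactual, the estimate is a plain average of individual effects, which is how this approach sidesteps the negative weights that can arise in TWFE regressions.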

3.5.2 Public Investment Project Type

Villages led by VCs from large clans tend to generate more public goods. Below, the tables and plot present investment amounts in various public provisions, including basic infrastructure (roads, sanitation, electricity) and education. The results indicate that schooling and irrigation benefited the most from VCs of large clans.

vars <- grep("^loginv_cat_", names(df.twfe), value = TRUE)

models_feols <- lapply(vars, function(v) {
  feols(
    as.formula(paste0(v, " ~ vcfirst2 | vill_id + year")), 
    data    = df.twfe,
    cluster = "vill_id"
  )
})
names(models_feols) <- c("Schooling", "Road & Sanitation", "Electricity", "Irrigation", "Forestation", "Others") 

modelsummary(
  models_feols,   # already named by project type
  fmt = 3,
  gof_omit = "AIC|BIC|RMSE|Within|R2",
  stars = TRUE,
  coef_rename = c(
    "vcfirst2" = "VC of the two largest clans"
  ),
  title  = "VCs of Large Clans and Village Public Investment: by Project Type",
  output = "default"
)

VCs of Large Clans and Village Public Investment: by Project Type

| | Schooling | Road & Sanitation | Electricity | Irrigation | Forestation | Others |
|---|---|---|---|---|---|---|
| VC of the two largest clans | 0.161** | 0.061 | 0.070+ | 0.148** | 0.014 | 0.057 |
| | (0.060) | (0.064) | (0.040) | (0.053) | (0.029) | (0.054) |
| Num.Obs. | 3742 | 3742 | 3742 | 3742 | 3742 | 3742 |
| Std.Errors | by: vill_id | by: vill_id | by: vill_id | by: vill_id | by: vill_id | by: vill_id |
| FE: vill_id | X | X | X | X | X | X |
| FE: year | X | X | X | X | X | X |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Replicating Table 3 in the article.

The imputation estimator yields similar patterns.

models_fect <- lapply(vars, function(v) {
  fect( formula = as.formula(paste0(v, " ~ vcfirst2")),
    data    = df, index   = c("vill_id","year"),
    force   = "two-way", method = "fe", se = TRUE, nboots  = 200, parallel=TRUE)
})

names(models_fect) <- c("Schooling", "Road & Sanitation", "Electricity", "Irrigation", "Forestation", "Others")

# Loop through models_fect to build a results data frame.
# est.avg columns are ordered: ATT, S.E., CI.lower, CI.upper, p.value
fect_results <- do.call(rbind, lapply(seq_along(models_fect), function(i) {
  m <- models_fect[[i]]
  data.frame(
    Outcome  = names(models_fect)[i],
    Estimate = as.numeric(m$est.avg[1]),   # ATT
    CI.low   = as.numeric(m$est.avg[3]),   # Lower CI
    CI.high  = as.numeric(m$est.avg[4])    # Upper CI
  )
}))
kable(fect_results)

| Outcome | Estimate | CI.low | CI.high |
|---|---|---|---|
| Schooling | 0.1718788 | 0.0372848 | 0.3064728 |
| Road & Sanitation | 0.0452557 | -0.0915775 | 0.1820889 |
| Electricity | 0.1070878 | 0.0169563 | 0.1972193 |
| Irrigation | 0.1778121 | 0.0665254 | 0.2890987 |
| Forestation | 0.0103807 | -0.0401788 | 0.0609403 |
| Others | 0.0187537 | -0.1110712 | 0.1485787 |

Both TWFE and FEct estimators yield positive and significant treatment effect estimates for schooling and irrigation.

extract_feols_info <- function(model, model_name, coef_name = "vcfirst2") {
  cf <- coef(model)[coef_name]
  ci <- confint(model)[coef_name, ]
  data.frame(
    Outcome  = model_name,
    Estimate = as.numeric(cf),
    CI.low   = as.numeric(ci[1]),
    CI.high  = as.numeric(ci[2])
  )
}

twfe_results <- do.call(rbind, lapply(seq_along(models_feols), function(i) {
  extract_feols_info(models_feols[[i]], names(models_feols)[i], "vcfirst2")
}))


twfe_results$Method <- "TWFE"
fect_results$Method <- "FEct"

combined_results <- rbind(twfe_results, fect_results)

combined_results$Outcome <- factor(
  combined_results$Outcome,
  levels = c("Schooling","Road & Sanitation","Electricity","Irrigation","Forestation","Others")
)


ggplot(combined_results, aes(x = Outcome, y = Estimate, color = Method)) +
  geom_point(position = position_dodge(width = 0.5), size = 3) +
  geom_errorbar(
    aes(ymin = CI.low, ymax = CI.high),
    width = 0.2,
    position = position_dodge(width = 0.5)
  ) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "ATT Estimates: TWFE vs. FECT",
    x = NULL,
    y = "ATT"
  ) +
  theme_minimal()

3.5.3 Regression Discontinuity

Results from the RD design, based on the leader characteristic of belonging to a large clan, show that when VCs were from one of the two largest clans, public investment increased. However, the RD design is slightly underpowered. The LATE estimate from rdrobust is statistically significant at the 10% level, consistent with the findings reported in the paper.

df_clean <- df %>%
  filter(!is.na(vote), !is.na(loginv)) %>%
  mutate(loginv_demeaned = loginv - mean(loginv, na.rm = TRUE))

rd_out <- rdrobust(y = df_clean$loginv_demeaned, x = df_clean$vote, c = 0.5)
summary(rd_out)  # prints the estimation summary shown below
## Sharp RD estimates using local polynomial regression.
## 
## Number of Obs.                 2230
## BW type                       mserd
## Kernel                   Triangular
## VCE method                       NN
## 
## Number of Obs.                 1098         1132
## Eff. Number of Obs.             174          128
## Order est. (p)                    1            1
## Order bias  (q)                   2            2
## BW est. (h)                   0.185        0.185
## BW bias (b)                   0.282        0.282
## rho (h/b)                     0.656        0.656
## Unique Obs.                     120          133
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional     0.817     0.492     1.661     0.097    [-0.147 , 1.781]     
##         Robust         -         -     1.746     0.081    [-0.119 , 2.067]     
## =============================================================================
df2 <- df %>%
  filter(!vote %in% c(0, 1)) %>%
  mutate(
    vote_bin = cut(vote, breaks = seq(0, 1, 0.05), include.lowest = TRUE, right = FALSE),
    vote_bin_mid = (as.numeric(vote_bin) - 1) * 0.05 + 0.025
  )


mod_loginv <- feols(loginv ~ factor(prov_id)*year | vill_id + year, data = df2)
df2 <- df2 %>% mutate(loginv_ad = resid(mod_loginv))

# Average the residuals within each vote bin
df_avg <- df2 %>%
  group_by(vote_bin_mid) %>%
  summarise(avg_loginv_ad = mean(loginv_ad, na.rm = TRUE))

# Plot: binned averages with lowess curves by vcfirst2 group, with a dashed dark grey line at x = 0.5
ggplot() +
  geom_point(data = df_avg, aes(x = vote_bin_mid, y = avg_loginv_ad)) +
  geom_smooth(data = filter(df2, vcfirst2 == 0), aes(x = vote, y = loginv_ad), 
              method = "loess", se = TRUE, color = "navy") +
  geom_smooth(data = filter(df2, vcfirst2 == 1), aes(x = vote, y = loginv_ad), 
              method = "loess", se = TRUE, color = "red") +
  geom_vline(xintercept = 0.5, linetype = "dashed", color = "darkgrey") +
  scale_y_continuous(breaks = seq(-1.5, 1.5, 0.5)) +
  labs(title = "Robustness Check: A Regression Discontinuity Design",
       x = "Vote",
       y = "Residualized Public Goods Investment") +
  theme_minimal()

Replicating Figure 6 in the article.
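The binned points in the plot come from assigning each vote share to a 5-percentage-point bin and plotting the bin midpoint. The snippet below applies the same `cut()` logic to a few made-up vote shares to show how the midpoints are derived.

```r
vote <- c(0.02, 0.26, 0.49, 0.51, 0.74, 0.99)   # made-up vote shares

# Left-closed bins [0, 0.05), [0.05, 0.10), ..., with the last bin closed at 1
vote_bin <- cut(vote, breaks = seq(0, 1, 0.05),
                include.lowest = TRUE, right = FALSE)

# The bin index k (1 to 20) maps to the midpoint (k - 1) * 0.05 + 0.025
vote_bin_mid <- (as.numeric(vote_bin) - 1) * 0.05 + 0.025
print(data.frame(vote, vote_bin, vote_bin_mid))
```

Because 0.5 is itself a bin edge, no bin mixes observations from both sides of the cutoff, which keeps the binned averages informative about the discontinuity.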

3.6 Mechanism

The authors examine two main channels to explain how large clans reinforce public goods provision: collective action (whether large-clan VCs can more effectively mobilize villagers to pay levies) and accountability (whether clan ties reduce the misuse of funds).

3.6.1 Collective Action

The authors hypothesize that a well-organized clan helps the VC solve the collective action problem, as evidenced by higher voluntary fees (levies) whenever a VC is drawn from a large clan. If large-clan VCs effectively mobilize villagers, then higher levies—and consequently, more revenue for public goods—should be collected under their leadership.

Models 2 and 3 are adapted from the original article. Instead of using a dummy for public goods investment, the Large-clan VC indicator interacts with the size of public goods investment (for each village and year) to improve precision. Despite slight variations in coefficient size, the sign and significance remain consistent with the original findings.

mod1 <- feols(log_levies ~ vcfirst2| vill_id + year, data = df, cluster = ~vill_id)
mod2 <- feols(log_levies ~ inv| vill_id + year, data = df, cluster = ~vill_id)
mod3 <- feols(log_levies ~ vcfirst2 + inv + vcfirst2:inv| vill_id+year, data = df, cluster = ~vill_id)

models_table4 <- list("Model 1" = mod1, "Model 2" = mod2,"Model 3" = mod3)

modelsummary(
  models_table4,
  fmt = 3,                         # decimal places
  stars = TRUE,                    # significance stars
  coef_rename = c(
    "vcfirst2"     = "VC of large clans",
    "inv"          = "Public Goods Investment",
    "vcfirst2:inv" = "VC of large clans x Public Goods Investment"
  ),
  title   = "VCs of Large Clans and Levies",
  gof_omit = "AIC|BIC|RMSE|Within|R2",
  notes   = "Note: All regressions include village and year fixed effects, with SEs clustered by village."
)

VCs of Large Clans and Levies

| | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| VC of large clans | 0.132 | | 0.110 |
| | (0.186) | | (0.182) |
| Public Goods Investment | | 0.304** | 0.321* |
| | | (0.092) | (0.130) |
| VC of large clans x Public Goods Investment | | | -0.037 |
| | | | (0.169) |
| Num.Obs. | 1080 | 1080 | 1080 |
| Std.Errors | by: vill_id | by: vill_id | by: vill_id |
| FE: vill_id | X | X | X |
| FE: year | X | X | X |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Note: All regressions include village and year fixed effects, with SEs clustered by village.
Replicating Table 4 in the article.

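The coefficient on large-clan VCs is positive but small and statistically insignificant (0.132, SE 0.186 in Model 1), whereas public goods investment is strongly and significantly associated with higher levies (0.304, SE 0.092 in Model 2). The interaction term in Model 3 is close to zero (-0.037, SE 0.169), so villagers' contributions track public goods projects far more closely than they track the clan background of the VC.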
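To make the Model 3 interaction concrete, the short sketch below computes the implied slope of levies on public goods investment with and without a large-clan VC, using only the point estimates reported in the table above. The variable names are illustrative. No standard error is given for the combined slope, because that would require the coefficient covariance, which the table does not report.

```r
# Point estimates copied from the Model 3 column of the table above
b_inv          <- 0.321   # Public Goods Investment
b_interaction  <- -0.037  # VC of large clans x Public Goods Investment
se_interaction <- 0.169   # clustered SE of the interaction term

# Implied slope of log levies on investment, by type of VC
slope_other_vc      <- b_inv
slope_large_clan_vc <- b_inv + b_interaction

# Rough ratio of the interaction estimate to its standard error
z_interaction <- b_interaction / se_interaction

print(c(other = slope_other_vc, large_clan = slope_large_clan_vc))
print(round(z_interaction, 2))
```

The two implied slopes (0.321 and 0.284) are nearly identical, and the interaction estimate is only about 0.22 standard errors below zero, so the table gives no sign that large-clan VCs collect noticeably more levies per unit of investment.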

3.6.2 Accountability

For the second channel, the authors examine whether large-clan VCs reduce administrative costs. They argue that a strong accountability effect would be reflected in a decline in these expenses.

mod1 <- feols(share_admin ~ vcfirst2 | vill_id + year, data = df, cluster = ~vill_id)
mod2 <- feols(log_admin ~ vcfirst2 | vill_id + year, data = df, cluster = ~vill_id)

models_table5 <- list("Share of administrative expenditure in total expenditure" = mod1, "Log administrative expenditure (1,000 yuan)" = mod2)

modelsummary(
  models_table5,
  fmt = 3,             # decimal places
  stars = TRUE,        # significance stars
  coef_rename = c("vcfirst2" = "VC of the largest clan"),
  title = "VCs of Large Clans and Administrative Expenditure",
  gof_omit = "AIC|BIC|RMSE|Within|R2",
  notes = "Note: All regressions include village and year fixed effects, with SEs clustered by village."
)
VCs of Large Clans and Administrative Expenditure

| | Share of administrative expenditure in total expenditure | Log administrative expenditure (1,000 yuan) |
|---|---|---|
| VC of the largest clan | 0.006 | 0.022 |
| | (0.013) | (0.071) |
| Num.Obs. | 3037 | 3037 |
| Std.Errors | by: vill_id | by: vill_id |
| FE: vill_id | X | X |
| FE: year | X | X |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Note: All regressions include village and year fixed effects, with SEs clustered by village.

Replicating Table 5 in the article.

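As shown above, the coefficient on the large-clan VC indicator is close to zero and statistically insignificant for both the share of administrative expenditure (0.006, SE 0.013) and log administrative expenditure (0.022, SE 0.071). Thus, the authors conclude that there is little evidence that clan-based “informal accountability” enhances public goods provision by reducing spending abuses.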
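One way to gauge how small these estimates are is to turn each reported coefficient and standard error into an approximate 95% confidence interval. The sketch below does so with a normal approximation, which is an assumption made here for simplicity. Intervals computed from the fitted models (for example with `confint()`) could differ slightly, and the helper name `ci95` is hypothetical.

```r
# Approximate 95% confidence interval from an estimate and its standard error
# (normal approximation; an assumption, not the tutorial's own calculation)
ci95 <- function(est, se) est + c(-1, 1) * qnorm(0.975) * se

# Values copied from the administrative expenditure table above
ci_share <- ci95(0.006, 0.013)  # share of administrative expenditure
ci_log   <- ci95(0.022, 0.071)  # log administrative expenditure

print(round(ci_share, 3))  # -0.019  0.031
print(round(ci_log, 3))    # -0.117  0.161
```

Both intervals include zero, in line with the conclusion above. Because they also extend in both directions, the data do not rule out modest changes in administrative spending.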

4 Conclusion

By systematically replicating the core analyses of Xu and Yao (2015), this tutorial confirms that local leaders from large, cohesive clans consistently invest more in public goods provision.

The findings highlight that in rural China’s weak institutional environment, cohesive family networks can serve as a powerful force for mobilizing local resources for public goods provision, albeit without demonstrably stronger checks on misconduct.

5 Replication Notes

Using multiple identification strategies, including analyses based on parallel trends and an RD approach, the replication shows that the original results are mostly robust.

Expanded Analyses

  1. In addition to TWFE regressions, the replication uses the fect package to address biases that can arise under heterogeneous treatment effects (HTE). Event study plots based on the imputation estimator are added to visualize dynamic treatment effects.

  2. The RD visualization for near ties in elections is improved by incorporating confidence intervals and using a more refined algorithm.

  3. The tutorial disaggregates the mechanisms into collective action (villagers’ willingness to pay levies) and accountability (reducing administrative costs). Interaction terms are included to examine how levies under large-clan VCs respond to public goods spending, measured in logs rather than as a dichotomous variable.

  4. The main limitation appears to be that the RD analysis is underpowered. The panel analysis may also be underpowered if post-treatment data are divided into single-year slices.
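The imputation logic behind the estimator mentioned in item 1 can be illustrated with a minimal base R sketch on simulated data. This is not the fect implementation and does not use the replication data. The toy panel, its variable names, and the noise-free outcome are all assumptions chosen so that the three steps are easy to follow.

```r
# Toy balanced panel: 6 units, 8 periods; units 1-3 are treated from period 5 on
panel <- expand.grid(id = 1:6, t = 1:8)
panel$d <- as.integer(panel$id <= 3 & panel$t >= 5)

# Outcome = unit effect + time effect + constant treatment effect (no noise)
true_effect <- 2
panel$y <- panel$id + 0.5 * panel$t + true_effect * panel$d

# Step 1: fit a two-way fixed effects model on untreated observations only
fit0 <- lm(y ~ factor(id) + factor(t), data = panel[panel$d == 0, ])

# Step 2: impute the untreated potential outcome for treated observations
treated <- panel[panel$d == 1, ]
y0_hat <- predict(fit0, newdata = treated)

# Step 3: average the gap between observed and imputed outcomes
att_hat <- mean(treated$y - y0_hat)
print(att_hat)
```

Because the simulated outcome has no noise, the estimate recovers the assumed effect of 2 up to floating-point error. With the replication data, fect handles these steps internally, through a call of roughly the form `fect(loginv ~ vcfirst2, data = df, index = c("vill_id", "year"), method = "fe", force = "two-way", se = TRUE)`. The exact arguments used in the tutorial are not shown in this section.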

Notes on Datasets

  1. XuYao2015.dta (loaded into df in the tutorial):
    • Contains a panel of village-level variables, including key public investment measures (loginv), category-specific spending variables, levies, and administrative costs.
    • Includes village leadership indicators, voter share, and essential demographic and economic controls.
  2. lineageorg.dta (found in the .zip file but not directly used in this tutorial):
    • Provides clan-level details such as lineage size and presence (clansz, clan), ceremonial activities (cerem), and the status of lineage halls (citang, citanghis).
    • Each observation can be linked to a village using prov_id and vill_id, allowing potential merges with XuYao2015.dta.
    • This dataset was used to produce Figure 2 in the original article, which is not replicated here.
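The merge described above could look roughly like the sketch below. It uses small made-up data frames in place of the real files, which could be read with `haven::read_dta()`. The toy values, the choice to summarize each village by its largest clan, and the name `max_clansz` are assumptions for illustration only.

```r
# Toy stand-in for XuYao2015.dta: one row per village-year
village_panel <- data.frame(
  prov_id = c(1, 1, 1, 2),
  vill_id = c(101, 101, 102, 201),
  year    = c(1986, 1987, 1986, 1986),
  loginv  = c(0.0, 1.2, 0.5, 2.1)
)

# Toy stand-in for lineageorg.dta: one row per clan within a village
lineage <- data.frame(
  prov_id = c(1, 1, 2),
  vill_id = c(101, 101, 201),
  clansz  = c(120, 80, 60)
)

# Collapse clan-level rows to one row per village (size of the largest clan)
village_clans <- aggregate(clansz ~ prov_id + vill_id, data = lineage, FUN = max)
names(village_clans)[3] <- "max_clansz"

# Left join onto the panel; villages without lineage records get NA
merged <- merge(village_panel, village_clans,
                by = c("prov_id", "vill_id"), all.x = TRUE)
print(merged)
```

Collapsing to the village level before merging keeps the panel at one row per village-year. Merging the clan-level rows directly would duplicate village-years for villages with more than one recorded clan.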

References

Chiu, Albert, Xingchen Lan, Ziyi Liu, and Yiqing Xu. 2025. “Causal Panel Analysis Under Parallel Trends: Lessons from a Large Reanalysis Study.” American Political Science Review.
Freedman, Maurice. 1958. Lineage Organisation in Southeast China: Fukien and Kwangtung. London: University of London; Athlone Press.
Helmke, Gretchen, and Steven Levitsky. 2012. Informal Institutions and Comparative Politics: A Research Agenda. Edward Elgar Publishing.
Madsen, Richard. 1984. Morality and Power in a Chinese Village. Berkeley; Los Angeles, CA: University of California Press.
Marshall, John. 2024. “Can Close Election Regression Discontinuity Designs Identify Effects of Winning Politician Characteristics?” American Journal of Political Science 68 (2): 494–510.
Oi, Jean C. 1999. Rural China Takes Off: Institutional Foundations of Economic Reform. University of California Press.
Tsai, Lily L. 2007. “Solidary Groups, Informal Accountability, and Local Public Goods Provision in Rural China.” American Political Science Review 101 (2): 355–72.
Watson, James L. 1982. “Chinese Kinship Reconsidered: Anthropological Perspectives on Historical Research.” The China Quarterly 92: 589–622.
Xu, Yiqing, and Yang Yao. 2015. “Informal Institutions, Collective Action, and Public Investment in Rural China.” American Political Science Review 109 (2): 371–91.