This RMarkdown tutorial replicates the core analyses from Xu and Yao (2015): “Informal Institutions, Collective Action, and Public Investment in Rural China”. The replication is conducted by Jinwen Wu, a predoctoral fellow at Stanford University, under the supervision of Professor Yiqing Xu. It summarizes the main data analyses from the article; for a comprehensive understanding of the ideas presented, please refer to the original paper.
Click the Code button in the top right and select Show All Code to reveal all code used in this RMarkdown. Click Show in paragraphs to reveal the code used to generate a finding. Data and R files that replicate this RMarkdown can be downloaded here.
Informal institutions, such as lineage groups, play a crucial role in rural governance, particularly where formal structures are weak. Xu and Yao (2015) examine the relationship between lineage-based informal institutions and public goods provision in rural China. Analyzing panel data from 220 Chinese villages from 1986 to 2005, they find that village leaders from the two largest family clans significantly increased local public investment, with stronger effects in more cohesive clans. They find that clans helped leaders overcome collective action challenges in financing public goods, but there is little evidence that they helped through improved leader accountability.
Following Helmke and Levitsky (2012), Xu and Yao (2015) define informal institutions as “rules and norms that are created and enforced by social groups rather than the state.” The paper primarily focuses on those that may influence public goods provision.
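In the absence of strong formal institutions, as seen in rural China, public goods provision must address two fundamental challenges: overcoming the collective action problem in financing public goods, and holding local leaders accountable for how resources are used.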
Xu and Yao (2015) examined whether informal institutions, such as clans, can facilitate public goods provision through one or both of these mechanisms.
Family clans share several key characteristics that contribute to village governance in rural China.
They serve as a vital social identity for villagers (Freedman 1958; Tsai 2007).
They have historically facilitated collective production. Today, rural entrepreneurs often hire relatives to strengthen clan-based business networks (Watson 1982; Oi 1999).
Clan members share a strong sense of obligation to their group, deeply valuing kinship ties and social bonds (Madsen 1984).
Clan leaders play a crucial role in maintaining order by enforcing social norms and resolving conflicts both within and beyond the clan.
Government officials from large clans in the region have stronger connections and networks within local communities. As members of robust informal institutions, they are expected to deliver more public goods projects for two reasons, both aligning with the mechanisms discussed in the previous section.
These leaders are more likely to gain support from their clans than those who are either not native to the area or affiliated with only smaller local clans. They can leverage their clans’ influence to better mobilize villagers for public goods projects, potentially alleviating the collective action problem. The authors use financial contributions as a consistent and fair metric to measure participation in social welfare provisions over time.
Leaders from large local clans may feel morally obligated to adhere to their clans’ rules and interests, which is expected to reduce corruption or embezzlement.
If clan linkages enhance public goods provision, the influence of the clans from which leaders originate can shape outcomes by determining: (1) the extent to which leaders can secure local support (through the first mechanism) and (2) the likelihood that local politicians will act more effectively (through the second mechanism).
Xu and Yao (2015) measure clan cohesiveness as a proxy for a leader’s home clan influence. This measurement is based on two indicators: (1) whether large clans have maintained family tree records over time and (2) whether they upheld lineage halls throughout the observation period (1986–2005).
This study examines two channels through which informal institutions influence village governance: facilitating collective action for public goods financing and ensuring accountability of village chairpersons (VCs). The findings support the collective action channel, demonstrating that villagers across income levels contributed additional levies for public investment projects. However, the researchers find little evidence that accountability mechanisms impact administrative costs under large clan VCs.
The study acknowledges two unresolved questions: the potential for large clans to dominate grassroots politics—possibly benefiting disproportionately or engaging in corruption—and the interplay between formal and informal institutions.
The study uses a panel dataset from the Village Democracy Survey (VDS), which includes data from 220 Chinese villages collected between 1986 and 2005, supplemented by additional data from 2011 and the National Fixed-Point Survey (NFS).
The VDS provides detailed information on electoral reforms, public goods expenditures, and clan structures, sourced from village records and responses from village leaders and elders. Data on clans—including their size, cohesiveness, and activities—were well-corroborated by local consensus.
Merging the two datasets preserves the panel structure and sample size, enabling models that incorporate village and year fixed effects, village- or province-specific time trends, and time-varying covariates.
The study employs two identification strategies. The first is based on parallel trends using two-way fixed effects (TWFE) models. In the replication, the TWFE counterfactual estimator (FEct) is also used. The treatments are indicators of whether the VC leader belongs to the largest or second-largest clan in the village, and the outcome is the log of village-initiated investment amount.
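As a reading aid, the two-way fixed effects specification behind the first strategy can be sketched as follows. The notation is an assumption made for illustration and is not taken from the paper: $Y_{it}$ is the log of village-initiated investment in village $i$ and year $t$, the two indicators mark whether the VC belongs to the largest or second-largest clan, $X_{it}$ collects optional time-varying controls, and $\alpha_i$ and $\delta_t$ are village and year fixed effects.

```latex
Y_{it} = \beta_1\,\mathrm{vcfirst}_{it} + \beta_2\,\mathrm{vcsecond}_{it}
       + X_{it}'\gamma + \alpha_i + \delta_t + \varepsilon_{it}
```

Under the parallel trends assumption, $\beta_1$ and $\beta_2$ capture the change in log investment associated with a VC from each of the two largest clans.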
The second identification strategy is a regression discontinuity (RD) design, where the running variable is the vote share of a VC candidate from one of the two largest clans in the village relative to a candidate who is not. Because it is an RD design with leader characteristics, the effect needs to be interpreted with caution (Marshall 2024).
Several R packages are required for the data analysis and visualization. The code chunk below checks for all required packages and installs the missing ones.
Packages: “tidyr”, “dplyr”, “haven”, “ggplot2”, “paneltools”, “estimatr”, “modelsummary”, “fect”, “fixest”, “kableExtra”, “rdrobust”, “panelView”.
# Packages to be installed
packages <- c("tidyr", "dplyr", "haven", "ggplot2", "paneltools", "modelsummary", "fect", "fixest", "kableExtra", "rdrobust", "panelView")
for (pkg in packages) {
  # Install the package only if it is not already available
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Attach the package for use in later chunks
  library(pkg, character.only = TRUE)
}
After installation, call the code chunk below to load these packages.
Next, import the data. The data file, XuYao2015.dta, is located in the Replication folder. The codebook, XuYao2015.pdf, can also be found in the same folder.
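The import chunk itself does not appear in this text, so the following is only a sketch of how the data could be read into `df`. It assumes that the haven package is installed and that the working directory contains the Replication folder; the helper name `load_xuyao` and the relative path are hypothetical.

```r
# Hypothetical helper: read the Stata file used throughout the tutorial.
# The default path is an assumption; point it to your copy of the Replication folder.
load_xuyao <- function(path = file.path("Replication", "XuYao2015.dta")) {
  if (!file.exists(path)) {
    stop("Data file not found: ", path)
  }
  # haven::read_dta() reads Stata .dta files and keeps their variable labels
  haven::read_dta(path)
}

# Only attempt the import when the file is actually present
if (file.exists(file.path("Replication", "XuYao2015.dta"))) {
  df <- load_xuyao()
}
```

Failing early with a clear message when the file is missing avoids less informative errors in later chunks.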
First, we generate a summary table of the descriptive statistics for the villages in VDS.
# First set of variables
vars1 <- c("inv", "loginv", "log_levies", "logpopl", "logincome", "logasset", "hhsize","landpc", "logmigration", "logtax", "logtransfer", "share_admin", "postcont", "postopen", "secret_ballot", "proxy_voting", "moving_ballot")
datasummary_skim(df %>% select(all_of(vars1)))
 | Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram |
---|---|---|---|---|---|---|---|---|
Public investment projects during the year | 2 | 0 | 0.2 | 0.4 | 0.0 | 0.0 | 1.0 | |
Log amount of public investment (1000 yuan) | 204 | 0 | 1.1 | 2.2 | 0.0 | 0.0 | 10.6 | |
Log amount of levies handed over to the village committee (village mean) | 942 | 71 | 4.2 | 1.9 | 0.0 | 4.7 | 7.1 | |
log village population (persons) | 2134 | 6 | 7.2 | 0.6 | 4.7 | 7.2 | 9.2 | |
log net income per capita (yuan) | 2729 | 6 | 7.2 | 0.8 | 1.9 | 7.3 | 10.4 | |
log assets controlled by the village committee (yuan) | 2666 | 6 | 9.0 | 1.6 | 2.7 | 9.0 | 15.4 | |
village average household size (person) | 3302 | 6 | 3.9 | 0.6 | 2.0 | 4.0 | 6.4 | |
arable land per capita (mu) | 3332 | 6 | 1.7 | 1.9 | 0.0 | 1.2 | 16.2 | |
log number of people migrating out of the village (persons) | 141 | 28 | 2.2 | 1.1 | 0.0 | 2.3 | 5.5 | |
log taxes to the upper-level government (1,000 yuan) | 953 | 32 | 2.3 | 1.9 | 0.0 | 2.6 | 8.8 | |
Log transfers from the upper-level government (1,000 yuan) | 348 | 32 | 1.1 | 1.6 | 0.0 | 0.0 | 7.5 | |
Share of administrative expenditure in total village expenditure | 2484 | 19 | 0.2 | 0.2 | 0.0 | 0.2 | 1.0 | |
post first contested election | 2 | 0 | 0.8 | 0.4 | 0.0 | 1.0 | 1.0 | |
post first election with open nomination | 2 | 0 | 0.7 | 0.5 | 0.0 | 1.0 | 1.0 | |
secret ballot | 2 | 0 | 0.4 | 0.5 | 0.0 | 0.0 | 1.0 | |
proxy voting | 2 | 0 | 0.7 | 0.5 | 0.0 | 1.0 | 1.0 | |
moving ballot boxes | 2 | 0 | 0.7 | 0.5 | 0.0 | 1.0 | 1.0 |
# Second set of variables
df_elecyr <- df %>% filter(elecyr == 1)
vars2 <- c("vcfirst", "vcsecond", "vcfirst2", grep("^vc_char_", names(df), value = TRUE),"vote", "psfirst2", "vcps_clan", "vcps_person", "vc_pb")
datasummary_skim(df_elecyr %>% select(all_of(vars2)))
 | Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram |
---|---|---|---|---|---|---|---|---|
village chief (VC) from the largest clan | 2 | 0 | 0.4 | 0.5 | 0.0 | 0.0 | 1.0 | |
VC from the 2nd largest clan | 2 | 0 | 0.1 | 0.3 | 0.0 | 0.0 | 1.0 | |
VC from the two largest clans | 2 | 0 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 | |
VC characteristics: age when running election | 54 | 9 | 41.6 | 8.7 | 19.0 | 42.0 | 90.0 | |
VC characteristics: family background - poor peasants | 3 | 8 | 0.8 | 0.4 | 0.0 | 1.0 | 1.0 | |
VC characteristics: Communist party member | 3 | 9 | 0.7 | 0.4 | 0.0 | 1.0 | 1.0 | |
VC characteristics: years of formal education | 7 | 8 | 6.4 | 2.3 | 0.0 | 6.0 | 13.0 | |
VC characteristics: former village cadre | 3 | 8 | 0.6 | 0.5 | 0.0 | 1.0 | 1.0 | |
VC characteristics: managerial jobs before election | 3 | 8 | 0.0 | 0.1 | 0.0 | 0.0 | 1.0 | |
VC characteristics: experience of running elections | 3 | 8 | 0.7 | 0.5 | 0.0 | 1.0 | 1.0 | |
VC characteristics: denounced in the Culture Revolution | 3 | 9 | 0.0 | 0.2 | 0.0 | 0.0 | 1.0 | |
relative vote share of VCs of large clans | 254 | 40 | 0.5 | 0.4 | 0.0 | 0.5 | 1.0 | |
village party secretary (VPS) from the two largest clans | 3 | 34 | 0.5 | 0.5 | 0.0 | 1.0 | 1.0 | |
VC and VPS in the same clan | 2 | 0 | 0.2 | 0.4 | 0.0 | 0.0 | 1.0 | |
VC and VPS the same person (one shoulders) | 2 | 0 | 0.1 | 0.3 | 0.0 | 0.0 | 1.0 | |
VC in the party branch | 3 | 37 | 0.6 | 0.5 | 0.0 | 1.0 | 1.0 |
# Third set of variables
df_2005 <- df %>% filter(year == 2005)
vars3 <- c("clan_num", grep("^clansz", names(df), value = TRUE), "largeclan", "family_tree", "ances_hall")
datasummary_skim(df_2005 %>% select(all_of(vars3)))
 | Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram |
---|---|---|---|---|---|---|---|---|
No. of clans (surnames) | 66 | 0 | 26.8 | 23.4 | 1.0 | 20.0 | 150.0 | |
population share of the largest clan | 61 | 0 | 0.4 | 0.2 | 0.1 | 0.3 | 1.0 | |
population share of the 2nd largest clan | 38 | 0 | 0.2 | 0.1 | 0.0 | 0.2 | 0.4 | |
population share of the 3rd largest clan | 42 | 9 | 0.1 | 0.1 | 0.0 | 0.1 | 0.3 | |
population share of the 4th largest clan | 44 | 9 | 0.1 | 0.0 | 0.0 | 0.1 | 0.2 | |
sum of the population shares of the two largest clan above median | 2 | 0 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 | |
the 2 largest families has maintained at least a family tree | 3 | 9 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 | |
the 2 largest families has maintained at least a lineage hall | 3 | 9 | 0.2 | 0.4 | 0.0 | 0.0 | 1.0 |
Replicating Table 1 in the article.
First, Xu and Yao (2015) demonstrate that large kinship groups (based on last names) are common in the sample villages.
# Figure 1(a)
plot(density(df$clansz1,na.rm=T,bw=0.04),ylim=c(0,8),xlab="",ylab="",main="Population of Clans")
mtext("Density", side=2,line=2)
mtext("Population Share", side=1,line=2)
lines(density(df$clansz2,na.rm=T,bw=0.04),lty=5)
lines(density(df$clansz3,na.rm=T,bw=0.04),lty=2)
lines(density(df$clansz4,na.rm=T,bw=0.04),lty=3,lwd=2)
legend("topright", c("Largest", "2nd-largest","3rd-largest", "4th-largest"),
lty = c(1,5,2,3), col=1, lwd=c(1,1,1,2), merge = TRUE, cex=1.5,bty="n")
Replicating Figure 1a in the article.
As shown in the plot above, the combined population of the third-largest and fourth-largest clans often accounts for less than 30% of a village’s total population.
In contrast, the largest and second-largest clans make up an average
of 36% and 15% of the population, respectively. In 2005, the typical
village in the sample had approximately 1,500 permanent residents, with
the largest clan often comprising about 400 individuals, or 100
households.
In the subsequent analysis, the authors assume that local government
officials from either of the two largest clans (based on last names) are
candidates of the major clans, which serve as vehicles for informal
institutions.
Similarly, Xu and Yao (2015) plot the likelihood that elected government officials come from the two largest clans.
# Summarize clan data by year
df_clans <- df %>%
filter(!is.na(year) & !is.na(elecyr)) %>%
group_by(year) %>%
summarize(
largest_clan = sum(vcfirst, na.rm = TRUE),
second_largest_clan = sum(vcsecond, na.rm = TRUE)
) %>%
ungroup()
# Create a frequency table for total observations by year
freq_table <- table(df$year)
years <- as.numeric(names(freq_table))
freq <- as.numeric(freq_table)
num_villages_with_elec_vc <- data.frame(year = years, freq = freq)
# Merge clan data with total observations
figure1b <- left_join(df_clans, num_villages_with_elec_vc, by = "year") %>%
mutate(
largest_prop = largest_clan / freq,
second_largest_prop = second_largest_clan / freq
)
# Plot using dual y-axes
max_freq <- max(figure1b$freq, na.rm = TRUE)
ggplot(figure1b, aes(x = year)) +
# Left axis: total observations (bars)
geom_bar(aes(y = freq), stat = "identity", fill = "gray", alpha = 0.5) +
# Right axis: clan proportions, rescaled to match the bar heights
geom_line(aes(y = largest_prop * max_freq, color = "Largest Clan"), size = 1) +
geom_line(aes(y = second_largest_prop * max_freq, color = "Second Largest Clan"),
size = 1, linetype = "dashed") +
scale_y_continuous(
name = "Number of Villages with Elected VCs",
limits = c(0, max_freq),
sec.axis = sec_axis(~ . / max_freq * 100, name = "Percentage %")
) +
scale_color_manual(
name = "Legend",
values = c("Largest Clan" = "black", "Second Largest Clan" = "black")
) +
labs(
x = "Year",
title = "Number of Elected VCs and Clan Percentages"
) +
theme_minimal()
Replicating Figure 1b in the article.
Figure 1b illustrates the proportion of villages in the sample that have held elections since 1986. By that year, over half of the villages had adopted the rule, and by the mid-1990s, nearly all had conducted at least one election. The solid and dashed lines represent the proportions of elected VCs who came from the largest and second-largest village clans, respectively. On average, 35% of VCs came from the largest clan, while 13% were from the second-largest clan between 1986 and 2005.
Before conducting regression analysis, we use the panelView package to examine the treatment assignment schedule. The plot shows that the treatment has reversals: villages switch into and out of large-clan leadership over time.
index = c("vill_id", "year")
panelview(loginv~ vcfirst2, index = index, data = df, axis.lab="off", ylab = "Year", xlab = "Village", by.timing = TRUE, gridOff = TRUE)
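The presence of reversals can also be checked numerically. The sketch below uses a small made-up panel rather than the replication data; the helper `has_reversal` is hypothetical, and the column names simply mirror those used in the tutorial.

```r
# Toy panel (made-up values): village 2 switches back from treated to untreated
toy <- data.frame(
  vill_id  = rep(1:3, each = 5),
  year     = rep(2001:2005, times = 3),
  vcfirst2 = c(0, 0, 1, 1, 1,
               0, 1, 1, 0, 0,
               0, 0, 0, 0, 0)
)

# A reversal is any year-to-year change of the treatment from 1 to 0
has_reversal <- function(d) any(diff(d) == -1, na.rm = TRUE)

# Sort by village and year so that diff() compares consecutive years
toy <- toy[order(toy$vill_id, toy$year), ]
reversals <- tapply(toy$vcfirst2, toy$vill_id, has_reversal)
reversals
```

In this toy example only village 2 is flagged. Applied to `df`, the same check would indicate how many villages experience at least one reversal.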
As reviewed in the Conceptual Framework section, the authors aim to analyze how informal institutions contribute to public goods provision. The table below examines the relationship between VCs from large clans and the level of public investment during their tenure.
In the original article, standard errors in all regressions are clustered at the village level. Except for the first model, which includes only year fixed effects, all regressions control for both village and year fixed effects.
# Model 1
model1 <- feols(loginv ~ vcfirst + vcsecond | year, data = df, cluster = ~vill_id)
# Model 2
model2 <- feols(loginv ~ vcfirst + vcsecond | vill_id + year, data = df, cluster = ~vill_id)
# Model 3
model3 <- feols(loginv ~ vcfirst + vcsecond + factor(prov_id):year | vill_id + year , data = df, cluster = ~vill_id)
# Model 4
model4 <- feols(loginv ~ vcfirst + vcsecond + factor(vill_id):year| vill_id + year, data = df, cluster = ~vill_id)
# Model 5
model5 <- feols(loginv ~ vcfirst + vcsecond + hhsize + landpc + logpopl + logincome + logasset + factor(prov_id):year | vill_id + year, data = df, cluster = ~vill_id)
# Model 6
model6 <- feols(loginv ~ vcfirst + vcsecond + hhsize + landpc + logpopl + logincome + logasset + logmigration + logtax + logtransfer + factor(prov_id):year| vill_id + year, data = df, cluster = ~vill_id)
models_table2 <- list(
"(1)" = model1,
"(2)" = model2,
"(3)" = model3,
"(4)" = model4,
"(5)" = model5,
"(6)" = model6
)
modelsummary(
models_table2,
fmt = 3,
stars = TRUE,
coef_map = c(
"vcfirst" = "VC of the largest clan",
"vcsecond" = "VC of the second-largest clan"
),
# Omit all other terms you do not want in the main coefficient block:
coef_omit = "Intercept|vill_id|year|prov_id|hhsize|landpc|logpopl|logincome|logasset|logmigration|logtax|logtransfer",
# Omit unneeded goodness-of-fit stats:
gof_omit = "AIC|BIC|RMSE|Within|R2",
add_rows = tribble(
~term, ~`(1)`, ~`(2)`, ~`(3)`, ~`(4)`, ~`(5)`, ~`(6)`,
"Prov. linear trends", "", "", "x", "", "x", "x",
"NFS controls", "", "", "", "", "x", "x",
"Migrants out", "", "", "", "", "", "x",
"Taxes (upper-level)", "", "", "", "", "", "x",
"Transfers (upper)", "", "", "", "", "", "x",
),
output = "html"
)
 | (1) | (2) | (3) | (4) | (5) | (6) |
---|---|---|---|---|---|---|
VC of the largest clan | 0.332** | 0.412** | 0.379** | 0.359+ | 0.378* | 0.481* |
 | (0.126) | (0.144) | (0.143) | (0.183) | (0.152) | (0.192) |
VC of the second-largest clan | 0.183 | 0.303* | 0.328* | 0.256 | 0.367* | 0.421+ |
 | (0.151) | (0.144) | (0.140) | (0.187) | (0.150) | (0.218) |
Num.Obs. | 3742 | 3742 | 3742 | 3742 | 3513 | 2530 |
Std.Errors | by: vill_id | by: vill_id | by: vill_id | by: vill_id | by: vill_id | by: vill_id |
FE: year | X | X | X | X | X | X |
FE: vill_id | | X | X | X | X | X |
Prov. linear trends | | | x | | x | x |
NFS controls | | | | | x | x |
Migrants out | | | | | | x |
Taxes (upper-level) | | | | | | x |
Transfers (upper) | | | | | | x |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Replicating Table 2 in the article.
As shown in the table above, the coefficients for both VC dummies are positive across all regressions. The coefficient for the VC of the largest clan is statistically significant at the 5% level in every column except column 4, where it is significant at the 10% level.
In column 2, controlling for year and village fixed effects, the coefficients for the two VC dummies are 0.412 and 0.303. This result suggests that VCs from the two largest clans are associated with 35%–51% more public investment expenditure. In column 3, which adds provincial linear time trends, the estimates remain stable. Column 4 replaces provincial trends with village-specific linear time trends; the coefficient for VCs from the largest clan falls slightly to 0.359 and remains statistically significant at the 10% level.
In column 5, provincial linear trends are reinstated along with five time-varying controls from the NFS: log village population, average household size, arable land per capita, log income per capita, and log assets owned by the village committee. These controls capture village size, demographics, agricultural endowment, and economic resources, and the results remain consistent. In column 6, the model further controls for tax revenues, population migration, and intergovernmental transfers to address other potential confounders.
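The 35%–51% range quoted for column 2 comes from converting the coefficients into percentage changes. A minimal sketch of that arithmetic, treating the coefficients as log points (an approximation when the outcome contains zeros), is:

```r
# Column 2 coefficients from the table above (log points)
b <- c(largest = 0.412, second_largest = 0.303)

# Implied percentage change in investment: 100 * (exp(b) - 1)
pct <- 100 * (exp(b) - 1)
round(pct)
# largest: 51, second_largest: 35
```

For small coefficients the log-point value and the percentage change are close, but at this magnitude the exponentiated figure is noticeably larger than the raw coefficient.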
An event study plot helps visualize changes in treatment effects over time and assess the plausibility of the parallel trends assumption. For demonstration, VCs from the largest and second-largest clans are evaluated together to examine how those from powerful clans influence public goods provision. Below, we use a dynamic TWFE specification as reviewed in Chiu et al. (2025).
data_cohort <- get.cohort(data=df, index =c("vill_id","year"), D = "vcfirst2", start0 = TRUE)
# Dynamic TWFE
df.twfe <- data_cohort
df.twfe[which(is.na(df.twfe$Time_to_Treatment)),'Time_to_Treatment'] <- 0
twfe.est <- feols(loginv ~ i(Time_to_Treatment, vcfirst2 + hhsize + landpc + logpopl + logincome + logasset +logmigration + logtax + logtransfer, ref = -1)| vill_id +year, data = df.twfe, cluster = "vill_id")
twfe.output <- as.matrix(twfe.est$coeftable)
twfe.output <- as.data.frame(twfe.output)
twfe.output$Time <- c(c(-12:-2),c(0:18))+1
p.twfe <- esplot(twfe.output,Period = 'Time',Estimate = 'Estimate', SE = 'Std. Error', xlim = c(-12,10))
p.twfe
In addition, imputation-based methods can be used to avoid the negative weighting problem when heterogeneous treatment effects (HTE) are present. Here, we use the fect package to estimate the ATT. The results are similar to those obtained from TWFE models.
fect_out <- fect(loginv ~ vcfirst2 +hhsize + landpc + logpopl + logincome + logasset +logmigration + logtax + logtransfer, index = c("vill_id","year"), method = "fe", force = "two-way",se = TRUE, parallel = TRUE, nboots = 200, data = df)
print(fect_out)
## Call:
## fect.formula(formula = loginv ~ vcfirst2 + hhsize + landpc +
## logpopl + logincome + logasset + logmigration + logtax +
## logtransfer, data = df, index = c("vill_id", "year"), force = "two-way",
## method = "fe", se = TRUE, nboots = 200, parallel = TRUE)
##
## ATT:
## ATT S.E. CI.lower CI.upper p.value
## Tr obs equally weighted 0.3885 0.2057 -0.01467 0.7917 0.05894
## Tr units equally weighted 0.4408 0.2087 0.03173 0.8498 0.03469
##
## Covariates:
## Coef S.E. CI.lower CI.upper p.value
## hhsize 0.178248 0.27732 -0.36528 0.72178 0.5204
## landpc -0.007072 0.13419 -0.27009 0.25594 0.9580
## logpopl 0.223525 0.53235 -0.81986 1.26691 0.6746
## logincome 0.032025 0.21585 -0.39104 0.45509 0.8821
## logasset -0.078000 0.08038 -0.23555 0.07955 0.3319
## logmigration 0.016856 0.07607 -0.13224 0.16596 0.8246
## logtax -0.029596 0.05486 -0.13712 0.07793 0.5896
## logtransfer 0.018833 0.04637 -0.07205 0.10972 0.6846
We draw an event study plot using fect. The pattern is similar to what was obtained earlier using dynamic TWFE.
plot(fect_out)
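To make the imputation logic concrete, the sketch below reproduces the basic idea in base R on simulated data: fit unit and year fixed effects on untreated observations, impute the untreated outcome for treated observations, and average the gaps. It is an illustration only, not the fect implementation: it has no covariates and no bootstrap inference. The data and the helper `impute_att` are made up for the example.

```r
# Simulated balanced panel with a known effect of 0.5 (all values are made up)
set.seed(42)
n_units <- 30
n_years <- 10
sim <- expand.grid(id = 1:n_units, t = 1:n_years)
alpha <- rnorm(n_units)   # unit fixed effects
xi    <- rnorm(n_years)   # year fixed effects
# Units 1-15 are treated from year 6 onward; units 16-30 are never treated
sim$D <- as.numeric(sim$id <= 15 & sim$t >= 6)
sim$Y <- alpha[sim$id] + xi[sim$t] + 0.5 * sim$D + rnorm(nrow(sim), sd = 0.1)

impute_att <- function(data) {
  # Step 1: fit unit and year fixed effects on untreated observations only
  fit0 <- lm(Y ~ factor(id) + factor(t), data = data[data$D == 0, ])
  # Step 2: predict the untreated potential outcome Y(0) for treated observations
  treated <- data[data$D == 1, ]
  y0_hat <- predict(fit0, newdata = treated)
  # Step 3: average the gaps between observed and imputed outcomes
  mean(treated$Y - y0_hat)
}

impute_att(sim)
```

Because the fixed effects are estimated without using treated observations, the estimate is not contaminated by comparisons between already-treated units, which is the source of the negative weighting problem in TWFE.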
Villages led by VCs from large clans tend to generate more public goods. Below, the tables and plot present investment amounts in various public provisions, including basic infrastructure (roads, sanitation, electricity) and education. The results indicate that schooling and irrigation benefited the most from VCs of large clans.
vars <- grep("^loginv_cat_", names(df.twfe), value = TRUE)
models_feols <- lapply(vars, function(v) {
feols(
as.formula(paste0(v, " ~ vcfirst2 | vill_id + year")),
data = df.twfe,
cluster = "vill_id"
)
})
names(models_feols) <- c("Schooling", "Road & Sanitation", "Electricity", "Irrigation", "Forestation", "Others")
modelsummary(
  models_feols,
  fmt = 3,
  gof_omit = "AIC|BIC|RMSE|Within|R2",
  stars = TRUE,
  # vcfirst2 indicates a VC from either of the two largest clans
  coef_rename = c(
    "vcfirst2" = "VC of the two largest clans"
  ),
  title = "VCs of Large Clans and Village Public Investment: by Project Type",
  output = "default"
)
 | Schooling | Road & Sanitation | Electricity | Irrigation | Forestation | Others |
---|---|---|---|---|---|---|
VC of the two largest clans | 0.161** | 0.061 | 0.070+ | 0.148** | 0.014 | 0.057 |
 | (0.060) | (0.064) | (0.040) | (0.053) | (0.029) | (0.054) |
Num.Obs. | 3742 | 3742 | 3742 | 3742 | 3742 | 3742 |
Std.Errors | by: vill_id | by: vill_id | by: vill_id | by: vill_id | by: vill_id | by: vill_id |
FE: vill_id | X | X | X | X | X | X |
FE: year | X | X | X | X | X | X |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Replicating Table 3 in the article.
The imputation estimator yields similar patterns.
models_fect <- lapply(vars, function(v) {
fect( formula = as.formula(paste0(v, " ~ vcfirst2")),
data = df, index = c("vill_id","year"),
force = "two-way", method = "fe", se = TRUE, nboots = 200, parallel=TRUE)
})
names(models_fect) <- c("Schooling", "Road & Sanitation", "Electricity", "Irrigation", "Forestation", "Others")
# Loop through models_fect to build a results data frame.
fect_results <- do.call(rbind, lapply(seq_along(models_fect), function(i) {
m <- models_fect[[i]]
data.frame(
Outcome = names(models_fect)[i],
Estimate = as.numeric(m$est.avg[1]), # ATT
CI.low = as.numeric(m$est.avg[3]), # Lower CI
CI.high = as.numeric(m$est.avg[4]) # Upper CI
)
}))
kable(fect_results)
Outcome | Estimate | CI.low | CI.high |
---|---|---|---|
Schooling | 0.1718788 | 0.0372848 | 0.3064728 |
Road & Sanitation | 0.0452557 | -0.0915775 | 0.1820889 |
Electricity | 0.1070878 | 0.0169563 | 0.1972193 |
Irrigation | 0.1778121 | 0.0665254 | 0.2890987 |
Forestation | 0.0103807 | -0.0401788 | 0.0609403 |
Others | 0.0187537 | -0.1110712 | 0.1485787 |
Both TWFE and FEct estimators yield positive and significant treatment effect estimates for schooling and irrigation.
extract_feols_info <- function(model, model_name, coef_name = "vcfirst2") {
cf <- coef(model)[coef_name]
ci <- confint(model)[coef_name, ]
data.frame(
Outcome = model_name,
Estimate = as.numeric(cf),
CI.low = as.numeric(ci[1]),
CI.high = as.numeric(ci[2])
)
}
twfe_results <- do.call(rbind, lapply(seq_along(models_feols), function(i) {
extract_feols_info(models_feols[[i]], names(models_feols)[i], "vcfirst2")
}))
twfe_results$Method <- "TWFE"
fect_results$Method <- "FEct"
combined_results <- rbind(twfe_results, fect_results)
combined_results$Outcome <- factor(
combined_results$Outcome,
levels = c("Schooling","Road & Sanitation","Electricity","Irrigation","Forestation","Others")
)
ggplot(combined_results, aes(x = Outcome, y = Estimate, color = Method)) +
geom_point(position = position_dodge(width = 0.5), size = 3) +
geom_errorbar(
aes(ymin = CI.low, ymax = CI.high),
width = 0.2,
position = position_dodge(width = 0.5)
) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "ATT Estimates: TWFE vs. FEct",
x = NULL,
y = "ATT"
) +
theme_minimal()
Results from the RD design, based on the leader characteristic of belonging to a large clan, show that when VCs were from one of the two largest clans, public investment increased. However, the RD design is slightly underpowered. The LATE estimate from rdrobust is statistically significant at the 10% level, consistent with the findings reported in the paper.
df_clean <- df %>%
filter(!is.na(vote), !is.na(loginv)) %>%
mutate(loginv_demeaned = loginv - mean(loginv, na.rm = TRUE))
rd_out <- rdrobust(y = df_clean$loginv_demeaned, x = df_clean$vote, c = 0.5)
# summary() prints the rdrobust results directly; it does not return a table for kable()
summary(rd_out)
## Sharp RD estimates using local polynomial regression.
##
## Number of Obs. 2230
## BW type mserd
## Kernel Triangular
## VCE method NN
##
## Number of Obs. 1098 1132
## Eff. Number of Obs. 174 128
## Order est. (p) 1 1
## Order bias (q) 2 2
## BW est. (h) 0.185 0.185
## BW bias (b) 0.282 0.282
## rho (h/b) 0.656 0.656
## Unique Obs. 120 133
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional 0.817 0.492 1.661 0.097 [-0.147 , 1.781]
## Robust - - 1.746 0.081 [-0.119 , 2.067]
## =============================================================================
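For intuition about the conventional estimate, a sharp RD point estimate can be obtained from a weighted local linear regression with a triangular kernel. The sketch below uses simulated data and a fixed bandwidth; the helper `rd_local_linear` is hypothetical. It omits what rdrobust adds on top, namely data-driven bandwidth selection, bias correction, and robust inference.

```r
# Hypothetical helper: sharp RD estimate from a weighted local linear regression
rd_local_linear <- function(y, x, cutoff, h) {
  dat <- data.frame(y = y, xc = x - cutoff)
  dat$above <- as.numeric(dat$xc >= 0)
  # Triangular kernel: weight falls linearly to zero at distance h from the cutoff
  dat$w <- pmax(0, 1 - abs(dat$xc) / h)
  dat <- dat[dat$w > 0, ]
  # Separate intercepts and slopes on each side; the jump is the coefficient on `above`
  fit <- lm(y ~ above * xc, data = dat, weights = w)
  unname(coef(fit)["above"])
}

# Simulated data with a true jump of 0.8 at a vote share of 0.5 (made-up values)
set.seed(7)
vote_sim <- runif(2000)
y_sim <- 1 + 0.8 * (vote_sim >= 0.5) + 2 * (vote_sim - 0.5) + rnorm(2000, sd = 0.5)
rd_local_linear(y_sim, vote_sim, cutoff = 0.5, h = 0.185)
```

The bandwidth of 0.185 is borrowed from the output above purely for illustration; with real data it should be chosen by a procedure such as the one rdrobust implements.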
df2 <- df %>%
filter(!vote %in% c(0, 1)) %>%
mutate(
vote_bin = cut(vote, breaks = seq(0, 1, 0.05), include.lowest = TRUE, right = FALSE),
vote_bin_mid = (as.numeric(vote_bin) - 1) * 0.05 + 0.025
)
mod_loginv <- feols(loginv ~ factor(prov_id)*year | vill_id + year, data = df2)
df2 <- df2 %>% mutate(loginv_ad = resid(mod_loginv))
# Average the residuals within each vote bin
df_avg <- df2 %>%
group_by(vote_bin_mid) %>%
summarise(avg_loginv_ad = mean(loginv_ad, na.rm = TRUE))
# Plot: binned averages with loess curves by vcfirst2 group, with a dashed dark grey line at x = 0.5
ggplot() +
geom_point(data = df_avg, aes(x = vote_bin_mid, y = avg_loginv_ad)) +
geom_smooth(data = filter(df2, vcfirst2 == 0), aes(x = vote, y = loginv_ad),
method = "loess", se = TRUE, color = "navy") +
geom_smooth(data = filter(df2, vcfirst2 == 1), aes(x = vote, y = loginv_ad),
method = "loess", se = TRUE, color = "red") +
geom_vline(xintercept = 0.5, linetype = "dashed", color = "darkgrey") +
scale_y_continuous(breaks = seq(-1.5, 1.5, 0.5)) +
labs(title = "Robustness Check: A Regression Discontinuity Design",
x = "Vote",
y = "Residualized Public Goods Investment") +
theme_minimal()
Replicating Figure 6 in the article.
The authors examine two main channels to explain how large clans reinforce public goods provision: collective action (whether large-clan VCs can more effectively mobilize villagers to pay levies) and accountability (whether clan ties reduce the misuse of funds).
The authors hypothesize that a well-organized clan helps the VC solve the collective action problem, as evidenced by higher voluntary fees (levies) whenever a VC is drawn from a large clan. If large-clan VCs effectively mobilize villagers, then higher levies—and consequently, more revenue for public goods—should be collected under their leadership.
Models 2 and 3 are adapted from the original article. Instead of using a dummy for public goods investment, the Large-clan VC indicator interacts with the size of public goods investment (for each village and year) to improve precision. Despite slight variations in coefficient size, the sign and significance remain consistent with the original findings.
mod1 <- feols(log_levies ~ vcfirst2| vill_id + year, data = df, cluster = ~vill_id)
mod2 <- feols(log_levies ~ inv| vill_id + year, data = df, cluster = ~vill_id)
mod3 <- feols(log_levies ~ vcfirst2 + inv + vcfirst2:inv| vill_id+year, data = df, cluster = ~vill_id)
models_table4 <- list("Model 1" = mod1, "Model 2" = mod2,"Model 3" = mod3)
modelsummary(
models_table4,
fmt = 3, # decimal places
stars = TRUE, # significance stars
coef_rename = c(
"vcfirst2" = "VC of large clans",
"inv" = "Public Goods Investment",
"vcfirst2:inv" = "VC of large clans x Public Goods Investment"
),
title = "VCs of Large Clans and Levies",
gof_omit = "AIC|BIC|RMSE|Within|R2",
notes = "Note: All regressions include village and year fixed effects, with SEs clustered by village."
)
 | Model 1 | Model 2 | Model 3 |
---|---|---|---|
VC of large clans | 0.132 | | 0.110 |
 | (0.186) | | (0.182) |
Public Goods Investment | | 0.304** | 0.321* |
 | | (0.092) | (0.130) |
VC of large clans x Public Goods Investment | | | -0.037 |
 | | | (0.169) |
Num.Obs. | 1080 | 1080 | 1080 |
Std.Errors | by: vill_id | by: vill_id | by: vill_id |
FE: vill_id | X | X | X |
FE: year | X | X | X |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Note: All regressions include village and year fixed effects, with SEs clustered by village.
Replicating Table 4 in the article.
Large-clan VCs show a positive but statistically insignificant association with levies, whereas public goods projects are strongly associated with higher contributions and appear to be the primary drivers of what villagers pay.
For the second channel, the authors examine whether large-clan VCs reduce administrative costs. They argue that a strong accountability effect would be reflected in a decline in these expenses.
mod1 <- feols(share_admin ~ vcfirst2| vill_id + year, data = df, cluster = ~vill_id)
mod2 <- feols(log_admin ~ vcfirst2| vill_id + year, data = df, cluster = ~vill_id)
models_table5 <- list("Share of administrative expenditure in total expenditure" = mod1, "Log administrative expenditure (1,000 yuan)" = mod2)
modelsummary(
models_table5,
fmt = 3, # decimal places
stars = TRUE, # significance stars
coef_rename = c("vcfirst2" = "VC of the two largest clans"),
title = "VCs of Large Clans and Administrative Expenditure",
gof_omit = "AIC|BIC|RMSE|Within|R2",
notes = "Note: All regressions include village and year fixed effects, with SEs clustered by village."
)
 | Share of administrative expenditure in total expenditure | Log administrative expenditure (1,000 yuan) |
---|---|---|
VC of the two largest clans | 0.006 | 0.022 |
 | (0.013) | (0.071) |
Num.Obs. | 3037 | 3037 |
Std.Errors | by: vill_id | by: vill_id |
FE: vill_id | X | X |
FE: year | X | X |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Note: All regressions include village and year fixed effects, with SEs clustered by village.
Replicating Table 5 in the article.
As shown above, administrative costs show little change when a large-clan VC is in office. Thus, the authors conclude that there is little evidence that clan-based “informal accountability” enhances public goods provision by reducing spending abuses.
By systematically replicating Xu and Yao (2015)’s core analyses, this tutorial confirms that local leaders from large, cohesive clans consistently invest more in public goods provision.
The findings highlight that in rural China’s weak institutional environment, cohesive family networks can serve as a powerful force for mobilizing local resources for public goods provision, albeit without demonstrably stronger checks on misconduct.
Using multiple identification strategies, including analyses based on parallel trends and an RD approach, the replication shows that the original results are mostly robust.
Expanded Analyses
In addition to TWFE regressions, the replication uses fect to address biases that can arise when treatment effects are heterogeneous. Event study plots based on the imputation estimator are added to visualize dynamic treatment effects.
The RD visualization for near ties in elections is improved by incorporating confidence intervals and using a more refined algorithm.
The tutorial disaggregates the mechanisms into collective action (villagers’ willingness to pay levies) and accountability (reducing administrative costs). Interaction terms are included to examine how large-clan village committees respond to public goods spending, measured in logs rather than as a dichotomous variable.
The main limitation appears to be that the RD analysis is underpowered. The panel analysis may also be underpowered if post-treatment data are divided into single-year slices.
Notes on Datasets
XuYao2015.dta (loaded into df in the tutorial): contains the main outcome (loginv), category-specific spending variables, levies, and administrative costs.
lineageorg.dta (found in the .zip file but not directly used in this tutorial): contains clan variables (clansz, clan), ceremonial activities (cerem), and the status of lineage halls (citang, citanghis). It includes prov_id and vill_id, allowing potential merges with XuYao2015.dta.