Causal Inference Analysis
In this document we perform a causal inference analysis to estimate the effect of multiple variables on the complaint rate.
1 Load data
Here we load the dataset and clean the column names for handling in R code.
file_path <- "DataDNA Dataset Challenge - Consumer Financial Complaints Dataset - October 2025.xlsx"
dataset <- readxl::read_excel(file_path, sheet = 2)
2 Causal Inference Analysis
Let’s estimate the causal effect of enforcement history on the complaint rate (complaints_per_1pct_share, modeled on a log scale).
2.1 Full Linear Model
Let’s start with the model that includes all relevant covariates.
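Written out, the specification fitted in this section has the form below. This is only a sketch: the subscripted names mirror the R variables, and the symbols y_i, beta and epsilon are notation chosen here for readability rather than taken from the original analysis.

```latex
\log\bigl(1 + y_i\bigr)
  = \beta_0
  + \beta_1\,\text{enforcement}_i
  + \beta_2\,\text{company\_size}_i
  + \beta_3\,\text{reputation\_score}_i
  + \beta_4\,\text{timely\_response\_rate}_i
  + \varepsilon_i
```

Here y_i stands for complaints_per_1pct_share, enforcement_i is 1 when enforcement_history is "Yes" and 0 otherwise, and company_size_i takes the values 1, 2, 3 for Small, Medium, Large. Coding the tiers as 1, 2, 3 means the model assumes the same shift in the log outcome for each one-tier step.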
library(dplyr)
library(janitor)
library(lmtest)
library(sandwich)

# Standardize column names to snake_case
dataset <- dataset |>
  janitor::clean_names()

# Keep rows with a complaint rate and encode the predictors numerically
df <- dataset |>
  filter(!is.na(complaints_per_1pct_share)) |>
  mutate(
    enforcement = ifelse(enforcement_history == "Yes", 1, 0),
    company_size = case_when(
      company_size_tier == "Small" ~ 1,
      company_size_tier == "Medium" ~ 2,
      company_size_tier == "Large" ~ 3,
      TRUE ~ NA_real_
    ),
    log_complaints_rate = log1p(complaints_per_1pct_share)
  )

model <- lm(log_complaints_rate ~ enforcement + company_size +
              reputation_score + timely_response_rate, data = df)

# Robust standard errors
robust_coefs <- coeftest(model, vcov = vcovHC(model, type = "HC1"))

# Convert to dataframe
robust_summary <- data.frame(
  Variable = rownames(robust_coefs),
  Estimate = robust_coefs[, 1],
  Std_Error = robust_coefs[, 2],
  t_value = robust_coefs[, 3],
  p_value = robust_coefs[, 4],
  row.names = NULL
)

# Add metrics
model_fit <- data.frame(
  r_squared = summary(model)$r.squared,
  adj_r_squared = summary(model)$adj.r.squared,
  f_statistic = summary(model)$fstatistic[1],
  p_value = pf(summary(model)$fstatistic[1],
               summary(model)$fstatistic[2],
               summary(model)$fstatistic[3],
               lower.tail = FALSE)
)
robust_summary
              Variable    Estimate   Std_Error     t_value       p_value
1          (Intercept)  7.99167516 0.636558704  12.5544983  7.767644e-34
2          enforcement -0.03870143 0.053817884  -0.7191183  4.722242e-01
3         company_size -1.15762563 0.023847562 -48.5427242 2.697904e-273
4     reputation_score  0.00184525 0.001421788   1.2978374  1.946214e-01
5 timely_response_rate  0.41870190 0.668402473   0.6264218  5.311712e-01
model_fit
      r_squared adj_r_squared f_statistic       p_value
value 0.5498326     0.5481592    328.5555 9.720546e-185
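Company size is the only statistically significant predictor in the full model (p < 0.001). The coefficients for enforcement history, reputation score, and timely response rate are not distinguishable from zero at conventional levels (p = 0.47, 0.19, and 0.53, respectively). Let’s explore a reduced model with only company size.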
2.2 Reduced Linear Model
Here we fit a reduced model that includes only company size as a predictor.
library(dplyr)
library(lmtest)
library(sandwich)
library(janitor)

# Standardize column names to snake_case
dataset <- dataset |>
  clean_names()

# Keep rows with a complaint rate and encode the predictors numerically
df <- dataset |>
  filter(!is.na(complaints_per_1pct_share)) |>
  mutate(
    enforcement = ifelse(enforcement_history == "Yes", 1, 0),
    company_size = case_when(
      company_size_tier == "Small" ~ 1,
      company_size_tier == "Medium" ~ 2,
      company_size_tier == "Large" ~ 3,
      TRUE ~ NA_real_
    ),
    log_complaints_rate = log1p(complaints_per_1pct_share)
  )

model <- lm(log_complaints_rate ~ company_size, data = df)

# Robust standard errors
robust_coefs <- coeftest(model, vcov = vcovHC(model, type = "HC1"))

# Convert to dataframe
robust_summary <- data.frame(
  Variable = rownames(robust_coefs),
  Estimate = robust_coefs[, 1],
  Std_Error = robust_coefs[, 2],
  t_value = robust_coefs[, 3],
  p_value = robust_coefs[, 4],
  row.names = NULL
)

# Add metrics
model_fit <- data.frame(
  r_squared = summary(model)$r.squared,
  adj_r_squared = summary(model)$adj.r.squared,
  f_statistic = summary(model)$fstatistic[1],
  p_value = pf(summary(model)$fstatistic[1],
               summary(model)$fstatistic[2],
               summary(model)$fstatistic[3],
               lower.tail = FALSE)
)
robust_summary
      Variable  Estimate  Std_Error   t_value       p_value
1  (Intercept)  8.514505 0.04889703 174.13134  0.000000e+00
2 company_size -1.157146 0.02378334 -48.65364 2.444387e-274
model_fit
      r_squared adj_r_squared f_statistic       p_value
value 0.5487796     0.5483614    1312.292 1.142991e-188
The reduced model’s R-squared of 0.549 is essentially unchanged from the full model’s 0.550, so company size alone accounts for about 55% of the variance in the log complaint rate. The negative coefficient (-1.157) indicates that larger companies tend to have lower complaint rates than smaller ones.
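Because the outcome is log1p(complaints_per_1pct_share), the company_size coefficient is easier to read once it is exponentiated. The short sketch below does that arithmetic with the estimate from the reduced-model output. The variable names are illustrative, and the interpretation assumes the 1, 2, 3 tier coding used above.

```r
# Slope on company_size, copied from the reduced-model output
beta_size <- -1.157146

# Multiplicative change in (1 + complaints_per_1pct_share) per one-tier step
ratio_per_tier <- exp(beta_size)

# The same quantity expressed as a percentage change
pct_change_per_tier <- (ratio_per_tier - 1) * 100

# Implied ratio when moving two tiers, from Small (1) to Large (3)
ratio_small_to_large <- exp(2 * beta_size)

print(round(c(
  ratio_per_tier       = ratio_per_tier,
  pct_change_per_tier  = pct_change_per_tier,
  ratio_small_to_large = ratio_small_to_large
), 3))
```

Under these assumptions, each step up in size tier is associated with a value of (1 + complaint rate) roughly 69% lower, and a Large company with a value about one tenth that of a Small one. This describes an association in the fitted model. It is not, by itself, evidence of a causal effect.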
Next, we include both code chunks in the Power BI report.
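Both chunks repeat the same robust-summary and fit-metric steps. If the scripts are also maintained outside Power BI, one way to keep them consistent is a small shared helper. The sketch below is only an illustration: summarise_robust() is a hypothetical function that is not part of the original analysis, and the toy data stands in for df.

```r
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC()

# Hypothetical helper (not in the original analysis): returns the HC1-robust
# coefficient table and the overall fit metrics for a fitted lm object.
summarise_robust <- function(model) {
  robust_coefs <- coeftest(model, vcov = vcovHC(model, type = "HC1"))
  fit <- summary(model)
  f_stat <- fit$fstatistic  # named vector: value, numdf, dendf

  list(
    coefficients = data.frame(
      Variable  = rownames(robust_coefs),
      Estimate  = robust_coefs[, 1],
      Std_Error = robust_coefs[, 2],
      t_value   = robust_coefs[, 3],
      p_value   = robust_coefs[, 4],
      row.names = NULL
    ),
    fit = data.frame(
      r_squared     = fit$r.squared,
      adj_r_squared = fit$adj.r.squared,
      f_statistic   = f_stat[["value"]],
      p_value       = pf(f_stat[["value"]], f_stat[["numdf"]],
                         f_stat[["dendf"]], lower.tail = FALSE)
    )
  )
}

# Toy data purely for illustration; the real input would be `df`.
set.seed(1)
toy <- data.frame(company_size = rep(1:3, each = 20))
toy$log_complaints_rate <- 8.5 - 1.2 * toy$company_size + rnorm(60, sd = 0.5)

result <- summarise_robust(lm(log_complaints_rate ~ company_size, data = toy))
print(result$coefficients)
print(result$fit)
```

With such a helper, each chunk would reduce to building df, fitting the model, and calling summarise_robust(model). A Power BI R script runs on its own, so the function definition would still need to be included in each script.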