Causal Inference Analysis

Author

Aleksei Prishchepo

Published

October 22, 2025

In this document we perform a causal inference analysis to estimate the effect of several company characteristics on the complaint rate (the complaints_per_1pct_share column, modelled on a log scale).

1 Load data

Here we load the dataset from the second sheet of the workbook. The column names are cleaned in the next section so that they are easier to handle in R code.

file_path <- "DataDNA Dataset Challenge - Consumer Financial Complaints Dataset - October 2025.xlsx"
dataset <- readxl::read_excel(file_path, sheet = 2)
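
To illustrate what the name-cleaning step does, here is a minimal sketch on a toy data frame. The raw header spellings below are hypothetical, since the workbook's original headers are not shown in this report; only the cleaned names used later (such as company_size_tier) come from the analysis itself.

```r
library(janitor)

# Hypothetical raw headers (assumption: the real workbook may spell them differently)
raw <- data.frame(
  "Company Size Tier" = c("Small", "Large"),
  "Enforcement History" = c("Yes", "No"),
  check.names = FALSE
)

# clean_names() converts headers to lower-case snake_case
cleaned <- clean_names(raw)
names(cleaned)
```

Cleaning the names once, right after loading, keeps every later chunk free of backtick-quoted column names.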

2 Causal Inference Analysis

Let’s estimate the causal effect of enforcement history on the complaint rate, controlling for company size, reputation score, and timely response rate.

2.1 Full Linear Model

Let’s start with the model that includes all relevant covariates.

library(dplyr)
library(janitor)
library(lmtest)
library(sandwich)

dataset <- dataset |>
  janitor::clean_names()

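# Keep rows with a known complaint rate and derive the model variables
df <- dataset |>
  filter(!is.na(complaints_per_1pct_share)) |>
  mutate(
    # Binary indicator: 1 if the company has an enforcement history
    enforcement = ifelse(enforcement_history == "Yes", 1, 0),
    # Ordinal size score; treats the steps between tiers as equally spaced
    company_size = case_when(
      company_size_tier == "Small" ~ 1,
      company_size_tier == "Medium" ~ 2,
      company_size_tier == "Large" ~ 3,
      TRUE ~ NA_real_
    ),
    # log(1 + x) keeps zero rates defined and compresses large values
    log_complaints_rate = log1p(complaints_per_1pct_share)
  )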

model <- lm(log_complaints_rate ~ enforcement + company_size +
  reputation_score + timely_response_rate, data = df)

# Robust standard errors
robust_coefs <- coeftest(model, vcov = vcovHC(model, type = "HC1"))

# Convert to dataframe
robust_summary <- data.frame(
  Variable = rownames(robust_coefs),
  Estimate = robust_coefs[, 1],
  Std_Error = robust_coefs[, 2],
  t_value = robust_coefs[, 3],
  p_value = robust_coefs[, 4],
  row.names = NULL
)

# Add metrics
model_fit <- data.frame(
  r_squared = summary(model)$r.squared,
  adj_r_squared = summary(model)$adj.r.squared,
  f_statistic = summary(model)$fstatistic[1],
  p_value = pf(summary(model)$fstatistic[1],
               summary(model)$fstatistic[2],
               summary(model)$fstatistic[3],
               lower.tail = FALSE)
)
robust_summary

              Variable    Estimate   Std_Error     t_value       p_value
1          (Intercept)  7.99167516 0.636558704  12.5544983  7.767644e-34
2          enforcement -0.03870143 0.053817884  -0.7191183  4.722242e-01
3         company_size -1.15762563 0.023847562 -48.5427242 2.697904e-273
4     reputation_score  0.00184525 0.001421788   1.2978374  1.946214e-01
5 timely_response_rate  0.41870190 0.668402473   0.6264218  5.311712e-01

model_fit

      r_squared adj_r_squared f_statistic       p_value
value 0.5498326     0.5481592    328.5555 9.720546e-185
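
The robust standard errors above come from the HC1 estimator, which is the White sandwich estimator scaled by n / (n - k). As a rough sketch of what that computation involves, the block below rebuilds it by hand on simulated data and compares it with sandwich::vcovHC. The simulated variables x and y and their coefficients are assumptions for illustration only, not the complaints dataset.

```r
library(sandwich)

set.seed(1)
n <- 200
x <- sample(1:3, n, replace = TRUE)
# Simulated outcome whose noise grows with x (heteroskedastic by construction)
y <- 8.5 - 1.15 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

X <- model.matrix(fit)          # design matrix, n x k
e <- residuals(fit)             # OLS residuals
k <- ncol(X)

bread <- solve(crossprod(X))    # (X'X)^-1
meat <- crossprod(X * e)        # X' diag(e^2) X
hc1_manual <- (n / (n - k)) * bread %*% meat %*% bread

# Robust standard errors are the square roots of the diagonal
robust_se <- sqrt(diag(hc1_manual))

# Should agree with the packaged estimator up to numerical tolerance
matches_sandwich <- isTRUE(all.equal(
  hc1_manual, vcovHC(fit, type = "HC1"),
  check.attributes = FALSE
))
```

The point estimates are unchanged by this choice; only the standard errors, t-values, and p-values differ from the classical summary(model) output.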

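Company size is the only statistically significant variable in the full model (p < 0.001). Enforcement history, reputation score, and timely response rate are not significant at the 5% level (all p > 0.19), so the data give no evidence of an effect of enforcement history on the complaint rate. Let’s explore a reduced model with company size only.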
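
One way to check whether the three remaining covariates can be dropped together, rather than judging each p-value separately, is a joint Wald test with the same robust covariance. The sketch below uses simulated data with the same variable names as the model above; the simulated values and coefficients are assumptions for illustration, and this test is not part of the original analysis.

```r
library(lmtest)
library(sandwich)

set.seed(42)
n <- 300
# Simulated stand-in for the complaints data (values are made up)
sim <- data.frame(
  company_size = sample(1:3, n, replace = TRUE),
  enforcement = rbinom(n, 1, 0.3),
  reputation_score = rnorm(n, mean = 50, sd = 10),
  timely_response_rate = runif(n, min = 0.8, max = 1)
)
sim$log_complaints_rate <- 8.5 - 1.15 * sim$company_size + rnorm(n)

full <- lm(log_complaints_rate ~ enforcement + company_size +
  reputation_score + timely_response_rate, data = sim)
reduced <- lm(log_complaints_rate ~ company_size, data = sim)

# Joint test that the three dropped coefficients are all zero,
# using HC1 robust standard errors
wald <- waldtest(reduced, full,
  vcov = function(m) vcovHC(m, type = "HC1"))
wald
```

A large p-value in the second row would support the reduced model; a small one would suggest that at least one dropped covariate matters jointly.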

2.2 Reduced Linear Model

Here we are going to fit a reduced model that only includes company size as a predictor.

library(dplyr)
library(lmtest)
library(sandwich)
library(janitor)

dataset <- dataset |>
  clean_names()

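# Keep rows with a known complaint rate and derive the model variables
df <- dataset |>
  filter(!is.na(complaints_per_1pct_share)) |>
  mutate(
    # Binary indicator: 1 if the company has an enforcement history
    enforcement = ifelse(enforcement_history == "Yes", 1, 0),
    # Ordinal size score; treats the steps between tiers as equally spaced
    company_size = case_when(
      company_size_tier == "Small" ~ 1,
      company_size_tier == "Medium" ~ 2,
      company_size_tier == "Large" ~ 3,
      TRUE ~ NA_real_
    ),
    # log(1 + x) keeps zero rates defined and compresses large values
    log_complaints_rate = log1p(complaints_per_1pct_share)
  )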

model <- lm(log_complaints_rate ~ company_size, data = df)

# Robust standard errors
robust_coefs <- coeftest(model, vcov = vcovHC(model, type = "HC1"))

# Convert to dataframe
robust_summary <- data.frame(
  Variable = rownames(robust_coefs),
  Estimate = robust_coefs[, 1],
  Std_Error = robust_coefs[, 2],
  t_value = robust_coefs[, 3],
  p_value = robust_coefs[, 4],
  row.names = NULL
)

# Add metrics
model_fit <- data.frame(
  r_squared = summary(model)$r.squared,
  adj_r_squared = summary(model)$adj.r.squared,
  f_statistic = summary(model)$fstatistic[1],
  p_value = pf(summary(model)$fstatistic[1],
               summary(model)$fstatistic[2],
               summary(model)$fstatistic[3],
               lower.tail = FALSE)
)
robust_summary

      Variable  Estimate  Std_Error   t_value       p_value
1  (Intercept)  8.514505 0.04889703 174.13134  0.000000e+00
2 company_size -1.157146 0.02378334 -48.65364 2.444387e-274

model_fit

      r_squared adj_r_squared f_statistic       p_value
value 0.5487796     0.5483614    1312.292 1.142991e-188

The reduced model’s R-squared (about 0.549) is almost identical to that of the full model (about 0.550), so company size alone explains roughly 55% of the variance in the log complaint rate. The negative coefficient (about -1.16 per size tier) suggests that larger companies tend to have lower complaint rates than smaller ones.
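
Because the outcome is log1p of the complaint rate, the coefficient is easier to read after back-transforming. The sketch below plugs in the reduced-model estimates reported above; the back-transformed values are naive point predictions on the original scale, not estimates of the mean rate, and the variable names are illustrative.

```r
# Estimates copied from the reduced-model output above
b0 <- 8.514505    # intercept
b1 <- -1.157146   # change in log1p(rate) per size tier

tier <- c(Small = 1, Medium = 2, Large = 3)

# Fitted values on the log1p scale, then back on the original scale
fitted_log <- b0 + b1 * tier
fitted_rate <- expm1(fitted_log)

# Multiplicative change in (1 + rate) when moving up one tier
step_ratio <- exp(b1)
pct_change <- 100 * (step_ratio - 1)

fitted_rate
pct_change
```

Under the model's equal-spacing assumption for tiers, each step up in size multiplies (1 + rate) by the same factor, exp(b1).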

Next, we include both code chunks in the Power BI report.
