file_path <- "DataDNA Dataset Challenge - Consumer Financial Complaints Dataset - October 2025.xlsx"
complaints <- readxl::read_excel(file_path, sheet = 1)
companies <- readxl::read_excel(file_path, sheet = 2)
complaints <- clean_names(complaints)
companies <- clean_names(companies)Exploratory Data Analysis of the Dataset
In this document, we will perform exploratory data analysis (EDA) on the provided dataset to understand its structure, distributions and correlations, clean data, and prepare it for further analysis.
1 Load Data
2 Basic Descriptives
complaints |> skim()| Name | complaints |
| Number of rows | 62516 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 12 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| submitted_via | 0 | 1.00 | 3 | 12 | 0 | 7 | 0 |
| state | 0 | 1.00 | 2 | 2 | 0 | 51 | 0 |
| product | 0 | 1.00 | 8 | 76 | 0 | 9 | 0 |
| sub_product | 7 | 1.00 | 4 | 42 | 0 | 46 | 0 |
| issue | 0 | 1.00 | 11 | 80 | 0 | 76 | 0 |
| sub_issue | 10858 | 0.83 | 11 | 85 | 0 | 158 | 0 |
| company_public_response | 2175 | 0.97 | 53 | 119 | 0 | 6 | 0 |
| company_response_to_consumer | 0 | 1.00 | 6 | 31 | 0 | 5 | 0 |
| timely_response | 1494 | 0.98 | 2 | 3 | 0 | 2 | 0 |
| census_region | 0 | 1.00 | 4 | 9 | 0 | 4 | 0 |
| census_division | 0 | 1.00 | 7 | 18 | 0 | 9 | 0 |
| company_id_1081 | 0 | 1.00 | 9 | 9 | 0 | 1081 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| complaint_id | 0 | 1 | 4512642.47 | 1442917.08 | 2471340.00 | 3254020.25 | 4178582.00 | 5771283.50 | 7458912.00 | ▇▆▃▃▃ |
| date_submitted | 0 | 1 | 44159.67 | 675.23 | 42856.00 | 43607.00 | 44257.00 | 44756.00 | 45166.00 | ▅▅▅▇▇ |
| date_received | 0 | 1 | 44160.90 | 675.09 | 42856.00 | 43613.00 | 44258.00 | 44757.00 | 45166.00 | ▅▅▅▇▇ |
| state_latitude | 0 | 1 | 36.89 | 5.02 | 21.09 | 33.73 | 36.12 | 40.35 | 61.37 | ▂▇▇▁▁ |
| state_longitude | 0 | 1 | -93.96 | 18.47 | -157.50 | -119.68 | -84.67 | -78.17 | -69.38 | ▁▁▅▂▇ |
| company_response_date | 0 | 1 | 44174.76 | 675.29 | 42857.00 | 43622.00 | 44272.00 | 44771.00 | 45190.00 | ▅▅▆▇▇ |
| response_time_days | 0 | 1 | 15.09 | 8.91 | 0.00 | 7.00 | 15.00 | 23.00 | 30.00 | ▇▇▇▇▇ |
companies |> skim()| Name | companies |
| Number of rows | 1081 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| company_id_1081 | 0 | 1 | 9 | 9 | 0 | 1081 | 0 |
| enforcement_history | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| company_size_tier | 0 | 1 | 5 | 6 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| market_share_percent | 0 | 1 | 0.09 | 0.09 | 0.01 | 0.03 | 0.06 | 0.13 | 0.64 | ▇▂▁▁▁ |
| reputation_score | 0 | 1 | 75.07 | 15.08 | 50.00 | 62.00 | 75.00 | 88.00 | 100.00 | ▇▆▇▆▇ |
| complaint_count | 0 | 1 | 57.83 | 7.75 | 32.00 | 52.00 | 57.00 | 63.00 | 83.00 | ▁▃▇▃▁ |
| timely_response_rate | 0 | 1 | 0.94 | 0.03 | 0.81 | 0.92 | 0.94 | 0.96 | 1.00 | ▁▁▅▇▃ |
| avg_response_time_days | 0 | 1 | 15.09 | 1.21 | 11.40 | 14.30 | 15.09 | 15.89 | 18.37 | ▁▃▇▅▁ |
| complaints_per_1pct_share | 0 | 1 | 1671.79 | 1810.41 | 78.21 | 457.57 | 895.80 | 2080.00 | 8080.81 | ▇▂▁▁▁ |
complaints |> glimpse()Rows: 62,516
Columns: 19
$ complaint_id <dbl> 4848023, 3621464, 5818349, 7233015, 58202…
$ submitted_via <chr> "Referral", "Web", "Web", "Referral", "Re…
$ date_submitted <dbl> 44493, 43945, 44769, 45117, 44769, 44888,…
$ date_received <dbl> 44496, 43945, 44769, 45118, 44770, 44888,…
$ state <chr> "NY", "FL", "CA", "CA", "VA", "CA", "TX",…
$ state_latitude <dbl> 42.16573, 27.76628, 36.11620, 36.11620, 3…
$ state_longitude <dbl> -74.94805, -81.68678, -119.68156, -119.68…
$ product <chr> "Mortgage", "Money transfer, virtual curr…
$ sub_product <chr> "Conventional home mortgage", "Refund ant…
$ issue <chr> "Applying for a mortgage or refinancing a…
$ sub_issue <chr> NA, NA, "Account information incorrect", …
$ company_public_response <chr> "Company has responded to the consumer an…
$ company_response_to_consumer <chr> "Closed with explanation", "Closed with m…
$ timely_response <chr> "Yes", "Yes", "Yes", NA, "Yes", "Yes", "Y…
$ census_region <chr> "Northeast", "South", "West", "West", "So…
$ census_division <chr> "Middle Atlantic", "South Atlantic", "Pac…
$ company_response_date <dbl> 44499, 43964, 44797, 45131, 44779, 44895,…
$ company_id_1081 <chr> "COMP-0820", "COMP-0115", "COMP-0408", "C…
$ response_time_days <dbl> 6, 19, 28, 14, 10, 7, 28, 20, 6, 25, 18, …
companies |> glimpse()Rows: 1,081
Columns: 9
$ company_id_1081 <chr> "COMP-0001", "COMP-0002", "COMP-0003", "COMP…
$ market_share_percent <dbl> 0.0385, 0.0781, 0.0169, 0.0271, 0.0273, 0.05…
$ reputation_score <dbl> 75, 82, 63, 95, 69, 60, 56, 62, 88, 91, 55, …
$ enforcement_history <chr> "No", "No", "No", "No", "No", "No", "Yes", "…
$ company_size_tier <chr> "Small", "Small", "Small", "Small", "Small",…
$ complaint_count <dbl> 50, 63, 61, 56, 70, 68, 63, 58, 49, 48, 54, …
$ timely_response_rate <dbl> 0.9000000, 0.9206349, 0.9672131, 0.9464286, …
$ avg_response_time_days <dbl> 14.60000, 13.28571, 13.27869, 16.16071, 15.2…
$ complaints_per_1pct_share <dbl> 1298.7013, 806.6581, 3609.4675, 2066.4207, 2…
3 Data Cleaning
The date variables in Excel sheets are stored as numeric values representing the number of days since December 31, 1899 (why?). We need to convert them to proper Date objects.
complaints <- complaints |>
mutate(
date_submitted =
as.Date(date_submitted, origin = "1899-12-30"),
date_received =
as.Date(date_received, origin = "1899-12-30"),
company_response_date =
as.Date(company_response_date, origin = "1899-12-30")
)There is also openxlsx::convertToDate() function that can be used to convert Excel date values to R Date objects.
4 Join Tables
We will join the two tables to create a single dataset for analysis.
dataset <- complaints |> left_join(companies, by = c("company_id_1081"))
dataset |> glimpse()Rows: 62,516
Columns: 27
$ complaint_id <dbl> 4848023, 3621464, 5818349, 7233015, 58202…
$ submitted_via <chr> "Referral", "Web", "Web", "Referral", "Re…
$ date_submitted <date> 2021-10-24, 2020-04-24, 2022-07-27, 2023…
$ date_received <date> 2021-10-27, 2020-04-24, 2022-07-27, 2023…
$ state <chr> "NY", "FL", "CA", "CA", "VA", "CA", "TX",…
$ state_latitude <dbl> 42.16573, 27.76628, 36.11620, 36.11620, 3…
$ state_longitude <dbl> -74.94805, -81.68678, -119.68156, -119.68…
$ product <chr> "Mortgage", "Money transfer, virtual curr…
$ sub_product <chr> "Conventional home mortgage", "Refund ant…
$ issue <chr> "Applying for a mortgage or refinancing a…
$ sub_issue <chr> NA, NA, "Account information incorrect", …
$ company_public_response <chr> "Company has responded to the consumer an…
$ company_response_to_consumer <chr> "Closed with explanation", "Closed with m…
$ timely_response <chr> "Yes", "Yes", "Yes", NA, "Yes", "Yes", "Y…
$ census_region <chr> "Northeast", "South", "West", "West", "So…
$ census_division <chr> "Middle Atlantic", "South Atlantic", "Pac…
$ company_response_date <date> 2021-10-30, 2020-05-13, 2022-08-24, 2023…
$ company_id_1081 <chr> "COMP-0820", "COMP-0115", "COMP-0408", "C…
$ response_time_days <dbl> 6, 19, 28, 14, 10, 7, 28, 20, 6, 25, 18, …
$ market_share_percent <dbl> 0.0099, 0.0334, 0.0256, 0.0099, 0.0099, 0…
$ reputation_score <dbl> 92, 89, 70, 91, 88, 92, 57, 79, 71, 73, 7…
$ enforcement_history <chr> "No", "No", "No", "No", "No", "No", "No",…
$ company_size_tier <chr> "Small", "Small", "Small", "Small", "Smal…
$ complaint_count <dbl> 56, 66, 55, 72, 54, 72, 66, 61, 55, 63, 4…
$ timely_response_rate <dbl> 0.9107143, 0.9090909, 0.9272727, 0.916666…
$ avg_response_time_days <dbl> 12.87500, 13.30303, 15.69091, 16.00000, 1…
$ complaints_per_1pct_share <dbl> 5656.5657, 1976.0479, 2148.4375, 7272.727…
Using latitude and longitude variables for predictions doesn’t make much sense, so we should remove them, along with IDs, which serve only as identifiers.
dataset <- dataset |>
select(-state_latitude, -state_longitude, -complaint_id, -company_id_1081)5 Univariate Distributions
Let’s plot the distributions of numeric variables in the dataset.
dataset |>
select(where(is.numeric)) |>
pivot_longer(everything()) |>
ggplot(aes(x = value)) +
facet_wrap(~name, scales = "free", ncol = 4) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
theme(axis.title = element_blank())6 Correlations
Studying correlations between numeric variables can help identify relationships and potential multicollinearity issues.
library(corrplot)
cor_data <- dataset |>
select_if(is.numeric) |>
cor(use = "pairwise.complete.obs")
cor_data |>
corrplot(
type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45
)7 Save dataset
Finally, we will save the cleaned dataset to a CSV file for future use.
write_csv(dataset, "dataset.csv")