Exploratory Data Analysis of the Dataset

Author

Aleksei Prishchepo

Published

October 21, 2025

In this document, we will perform exploratory data analysis (EDA) on the provided dataset to understand its structure, distributions and correlations, clean data, and prepare it for further analysis.

1 Load Data

file_path <- "DataDNA Dataset Challenge - Consumer Financial Complaints Dataset - October 2025.xlsx"

complaints <- readxl::read_excel(file_path, sheet = 1)
companies <- readxl::read_excel(file_path, sheet = 2)

complaints <- clean_names(complaints)
companies <- clean_names(companies)

2 Basic Descriptives

complaints |> skim()

Data summary
Name	complaints
Number of rows	62516
Number of columns	19
_______________________
Column type frequency:
character	12
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
submitted_via	0	1.00	3	12	7
state	0	1.00	2	2	51
product	0	1.00	8	76	9
sub_product	7	1.00	4	42	46
issue	0	1.00	11	80	76
sub_issue	10858	0.83	11	85	158
company_public_response	2175	0.97	53	119	6
company_response_to_consumer	0	1.00	6	31	5
timely_response	1494	0.98	2	3	2
census_region	0	1.00	4	9	4
census_division	0	1.00	7	18	9
company_id_1081	0	1.00	9	9	1081

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
complaint_id	1	4512642.47	1442917.08	2471340.00	3254020.25	4178582.00	5771283.50	7458912.00	▇▆▃▃▃
date_submitted	1	44159.67	675.23	42856.00	43607.00	44257.00	44756.00	45166.00	▅▅▅▇▇
date_received	1	44160.90	675.09	42856.00	43613.00	44258.00	44757.00	45166.00	▅▅▅▇▇
state_latitude	1	36.89	5.02	21.09	33.73	36.12	40.35	61.37	▂▇▇▁▁
state_longitude	1	-93.96	18.47	-157.50	-119.68	-84.67	-78.17	-69.38	▁▁▅▂▇
company_response_date	1	44174.76	675.29	42857.00	43622.00	44272.00	44771.00	45190.00	▅▅▆▇▇
response_time_days	1	15.09	8.91	0.00	7.00	15.00	23.00	30.00	▇▇▇▇▇

companies |> skim()

Data summary
Name	companies
Number of rows	1081
Number of columns	9
_______________________
Column type frequency:
character	3
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
company_id_1081	1	9	9	1081
enforcement_history	1	2	3	2
company_size_tier	1	5	6	3

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
market_share_percent	1	0.09	0.09	0.01	0.03	0.06	0.13	0.64	▇▂▁▁▁
reputation_score	1	75.07	15.08	50.00	62.00	75.00	88.00	100.00	▇▆▇▆▇
complaint_count	1	57.83	7.75	32.00	52.00	57.00	63.00	83.00	▁▃▇▃▁
timely_response_rate	1	0.94	0.03	0.81	0.92	0.94	0.96	1.00	▁▁▅▇▃
avg_response_time_days	1	15.09	1.21	11.40	14.30	15.09	15.89	18.37	▁▃▇▅▁
complaints_per_1pct_share	1	1671.79	1810.41	78.21	457.57	895.80	2080.00	8080.81	▇▂▁▁▁

complaints |> glimpse()

Rows: 62,516
Columns: 19
$ complaint_id                 <dbl> 4848023, 3621464, 5818349, 7233015, 58202…
$ submitted_via                <chr> "Referral", "Web", "Web", "Referral", "Re…
$ date_submitted               <dbl> 44493, 43945, 44769, 45117, 44769, 44888,…
$ date_received                <dbl> 44496, 43945, 44769, 45118, 44770, 44888,…
$ state                        <chr> "NY", "FL", "CA", "CA", "VA", "CA", "TX",…
$ state_latitude               <dbl> 42.16573, 27.76628, 36.11620, 36.11620, 3…
$ state_longitude              <dbl> -74.94805, -81.68678, -119.68156, -119.68…
$ product                      <chr> "Mortgage", "Money transfer, virtual curr…
$ sub_product                  <chr> "Conventional home mortgage", "Refund ant…
$ issue                        <chr> "Applying for a mortgage or refinancing a…
$ sub_issue                    <chr> NA, NA, "Account information incorrect", …
$ company_public_response      <chr> "Company has responded to the consumer an…
$ company_response_to_consumer <chr> "Closed with explanation", "Closed with m…
$ timely_response              <chr> "Yes", "Yes", "Yes", NA, "Yes", "Yes", "Y…
$ census_region                <chr> "Northeast", "South", "West", "West", "So…
$ census_division              <chr> "Middle Atlantic", "South Atlantic", "Pac…
$ company_response_date        <dbl> 44499, 43964, 44797, 45131, 44779, 44895,…
$ company_id_1081              <chr> "COMP-0820", "COMP-0115", "COMP-0408", "C…
$ response_time_days           <dbl> 6, 19, 28, 14, 10, 7, 28, 20, 6, 25, 18, …

companies |> glimpse()

Rows: 1,081
Columns: 9
$ company_id_1081           <chr> "COMP-0001", "COMP-0002", "COMP-0003", "COMP…
$ market_share_percent      <dbl> 0.0385, 0.0781, 0.0169, 0.0271, 0.0273, 0.05…
$ reputation_score          <dbl> 75, 82, 63, 95, 69, 60, 56, 62, 88, 91, 55, …
$ enforcement_history       <chr> "No", "No", "No", "No", "No", "No", "Yes", "…
$ company_size_tier         <chr> "Small", "Small", "Small", "Small", "Small",…
$ complaint_count           <dbl> 50, 63, 61, 56, 70, 68, 63, 58, 49, 48, 54, …
$ timely_response_rate      <dbl> 0.9000000, 0.9206349, 0.9672131, 0.9464286, …
$ avg_response_time_days    <dbl> 14.60000, 13.28571, 13.27869, 16.16071, 15.2…
$ complaints_per_1pct_share <dbl> 1298.7013, 806.6581, 3609.4675, 2066.4207, 2…

3 Data Cleaning

The date variables in Excel sheets are stored as numeric values representing the number of days since December 31, 1899 (why?). We need to convert them to proper Date objects.

complaints <- complaints |>
  mutate(
    date_submitted =
      as.Date(date_submitted, origin = "1899-12-30"),
    date_received =
      as.Date(date_received, origin = "1899-12-30"),
    company_response_date =
      as.Date(company_response_date, origin = "1899-12-30")
  )

There is also openxlsx::convertToDate() function that can be used to convert Excel date values to R Date objects.

4 Join Tables

We will join the two tables to create a single dataset for analysis.

dataset <- complaints |> left_join(companies, by = c("company_id_1081"))
dataset |> glimpse()

Rows: 62,516
Columns: 27
$ complaint_id                 <dbl> 4848023, 3621464, 5818349, 7233015, 58202…
$ submitted_via                <chr> "Referral", "Web", "Web", "Referral", "Re…
$ date_submitted               <date> 2021-10-24, 2020-04-24, 2022-07-27, 2023…
$ date_received                <date> 2021-10-27, 2020-04-24, 2022-07-27, 2023…
$ state                        <chr> "NY", "FL", "CA", "CA", "VA", "CA", "TX",…
$ state_latitude               <dbl> 42.16573, 27.76628, 36.11620, 36.11620, 3…
$ state_longitude              <dbl> -74.94805, -81.68678, -119.68156, -119.68…
$ product                      <chr> "Mortgage", "Money transfer, virtual curr…
$ sub_product                  <chr> "Conventional home mortgage", "Refund ant…
$ issue                        <chr> "Applying for a mortgage or refinancing a…
$ sub_issue                    <chr> NA, NA, "Account information incorrect", …
$ company_public_response      <chr> "Company has responded to the consumer an…
$ company_response_to_consumer <chr> "Closed with explanation", "Closed with m…
$ timely_response              <chr> "Yes", "Yes", "Yes", NA, "Yes", "Yes", "Y…
$ census_region                <chr> "Northeast", "South", "West", "West", "So…
$ census_division              <chr> "Middle Atlantic", "South Atlantic", "Pac…
$ company_response_date        <date> 2021-10-30, 2020-05-13, 2022-08-24, 2023…
$ company_id_1081              <chr> "COMP-0820", "COMP-0115", "COMP-0408", "C…
$ response_time_days           <dbl> 6, 19, 28, 14, 10, 7, 28, 20, 6, 25, 18, …
$ market_share_percent         <dbl> 0.0099, 0.0334, 0.0256, 0.0099, 0.0099, 0…
$ reputation_score             <dbl> 92, 89, 70, 91, 88, 92, 57, 79, 71, 73, 7…
$ enforcement_history          <chr> "No", "No", "No", "No", "No", "No", "No",…
$ company_size_tier            <chr> "Small", "Small", "Small", "Small", "Smal…
$ complaint_count              <dbl> 56, 66, 55, 72, 54, 72, 66, 61, 55, 63, 4…
$ timely_response_rate         <dbl> 0.9107143, 0.9090909, 0.9272727, 0.916666…
$ avg_response_time_days       <dbl> 12.87500, 13.30303, 15.69091, 16.00000, 1…
$ complaints_per_1pct_share    <dbl> 5656.5657, 1976.0479, 2148.4375, 7272.727…

Using latitude and longitude variables for predictions doesn’t make much sense, so we should remove them, along with IDs, which serve only as identifiers.

dataset <- dataset |>
  select(-state_latitude, -state_longitude, -complaint_id, -company_id_1081)

5 Univariate Distributions

Let’s plot the distributions of numeric variables in the dataset.

dataset |>
  select(where(is.numeric)) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free", ncol = 4) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") + 
  theme(axis.title = element_blank())

Figure 1: Distributions of numeric variables in the dataset

6 Correlations

Studying correlations between numeric variables can help identify relationships and potential multicollinearity issues.

library(corrplot)

cor_data <- dataset |>
  select_if(is.numeric) |>
  cor(use = "pairwise.complete.obs")

cor_data |>
  corrplot(
    type = "upper", order = "hclust",
    tl.col = "black", tl.srt = 45
  )

Figure 2: Correlation matrix of numeric variables

7 Save dataset

Finally, we will save the cleaned dataset to a CSV file for future use.

write_csv(dataset, "dataset.csv")