Exploratory Data Analysis of the Dataset

Author

Aleksei Prishchepo

Published

October 21, 2025

In this document, we will perform exploratory data analysis (EDA) on the provided dataset to understand its structure, distributions and correlations, clean data, and prepare it for further analysis.

1 Load Data

file_path <- "DataDNA Dataset Challenge - Consumer Financial Complaints Dataset - October 2025.xlsx"

complaints <- readxl::read_excel(file_path, sheet = 1)
companies <- readxl::read_excel(file_path, sheet = 2)

complaints <- clean_names(complaints)
companies <- clean_names(companies)

2 Basic Descriptives

complaints |> skim()
Data summary
Name complaints
Number of rows 62516
Number of columns 19
_______________________
Column type frequency:
character 12
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
submitted_via 0 1.00 3 12 0 7 0
state 0 1.00 2 2 0 51 0
product 0 1.00 8 76 0 9 0
sub_product 7 1.00 4 42 0 46 0
issue 0 1.00 11 80 0 76 0
sub_issue 10858 0.83 11 85 0 158 0
company_public_response 2175 0.97 53 119 0 6 0
company_response_to_consumer 0 1.00 6 31 0 5 0
timely_response 1494 0.98 2 3 0 2 0
census_region 0 1.00 4 9 0 4 0
census_division 0 1.00 7 18 0 9 0
company_id_1081 0 1.00 9 9 0 1081 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
complaint_id 0 1 4512642.47 1442917.08 2471340.00 3254020.25 4178582.00 5771283.50 7458912.00 ▇▆▃▃▃
date_submitted 0 1 44159.67 675.23 42856.00 43607.00 44257.00 44756.00 45166.00 ▅▅▅▇▇
date_received 0 1 44160.90 675.09 42856.00 43613.00 44258.00 44757.00 45166.00 ▅▅▅▇▇
state_latitude 0 1 36.89 5.02 21.09 33.73 36.12 40.35 61.37 ▂▇▇▁▁
state_longitude 0 1 -93.96 18.47 -157.50 -119.68 -84.67 -78.17 -69.38 ▁▁▅▂▇
company_response_date 0 1 44174.76 675.29 42857.00 43622.00 44272.00 44771.00 45190.00 ▅▅▆▇▇
response_time_days 0 1 15.09 8.91 0.00 7.00 15.00 23.00 30.00 ▇▇▇▇▇
companies |> skim()
Data summary
Name companies
Number of rows 1081
Number of columns 9
_______________________
Column type frequency:
character 3
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
company_id_1081 0 1 9 9 0 1081 0
enforcement_history 0 1 2 3 0 2 0
company_size_tier 0 1 5 6 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
market_share_percent 0 1 0.09 0.09 0.01 0.03 0.06 0.13 0.64 ▇▂▁▁▁
reputation_score 0 1 75.07 15.08 50.00 62.00 75.00 88.00 100.00 ▇▆▇▆▇
complaint_count 0 1 57.83 7.75 32.00 52.00 57.00 63.00 83.00 ▁▃▇▃▁
timely_response_rate 0 1 0.94 0.03 0.81 0.92 0.94 0.96 1.00 ▁▁▅▇▃
avg_response_time_days 0 1 15.09 1.21 11.40 14.30 15.09 15.89 18.37 ▁▃▇▅▁
complaints_per_1pct_share 0 1 1671.79 1810.41 78.21 457.57 895.80 2080.00 8080.81 ▇▂▁▁▁
complaints |> glimpse()
Rows: 62,516
Columns: 19
$ complaint_id                 <dbl> 4848023, 3621464, 5818349, 7233015, 58202…
$ submitted_via                <chr> "Referral", "Web", "Web", "Referral", "Re…
$ date_submitted               <dbl> 44493, 43945, 44769, 45117, 44769, 44888,…
$ date_received                <dbl> 44496, 43945, 44769, 45118, 44770, 44888,…
$ state                        <chr> "NY", "FL", "CA", "CA", "VA", "CA", "TX",…
$ state_latitude               <dbl> 42.16573, 27.76628, 36.11620, 36.11620, 3…
$ state_longitude              <dbl> -74.94805, -81.68678, -119.68156, -119.68…
$ product                      <chr> "Mortgage", "Money transfer, virtual curr…
$ sub_product                  <chr> "Conventional home mortgage", "Refund ant…
$ issue                        <chr> "Applying for a mortgage or refinancing a…
$ sub_issue                    <chr> NA, NA, "Account information incorrect", …
$ company_public_response      <chr> "Company has responded to the consumer an…
$ company_response_to_consumer <chr> "Closed with explanation", "Closed with m…
$ timely_response              <chr> "Yes", "Yes", "Yes", NA, "Yes", "Yes", "Y…
$ census_region                <chr> "Northeast", "South", "West", "West", "So…
$ census_division              <chr> "Middle Atlantic", "South Atlantic", "Pac…
$ company_response_date        <dbl> 44499, 43964, 44797, 45131, 44779, 44895,…
$ company_id_1081              <chr> "COMP-0820", "COMP-0115", "COMP-0408", "C…
$ response_time_days           <dbl> 6, 19, 28, 14, 10, 7, 28, 20, 6, 25, 18, …
companies |> glimpse()
Rows: 1,081
Columns: 9
$ company_id_1081           <chr> "COMP-0001", "COMP-0002", "COMP-0003", "COMP…
$ market_share_percent      <dbl> 0.0385, 0.0781, 0.0169, 0.0271, 0.0273, 0.05…
$ reputation_score          <dbl> 75, 82, 63, 95, 69, 60, 56, 62, 88, 91, 55, …
$ enforcement_history       <chr> "No", "No", "No", "No", "No", "No", "Yes", "…
$ company_size_tier         <chr> "Small", "Small", "Small", "Small", "Small",…
$ complaint_count           <dbl> 50, 63, 61, 56, 70, 68, 63, 58, 49, 48, 54, …
$ timely_response_rate      <dbl> 0.9000000, 0.9206349, 0.9672131, 0.9464286, …
$ avg_response_time_days    <dbl> 14.60000, 13.28571, 13.27869, 16.16071, 15.2…
$ complaints_per_1pct_share <dbl> 1298.7013, 806.6581, 3609.4675, 2066.4207, 2…

3 Data Cleaning

The date variables in Excel sheets are stored as numeric values representing the number of days since December 31, 1899 (why?). We need to convert them to proper Date objects.

complaints <- complaints |>
  mutate(
    date_submitted =
      as.Date(date_submitted, origin = "1899-12-30"),
    date_received =
      as.Date(date_received, origin = "1899-12-30"),
    company_response_date =
      as.Date(company_response_date, origin = "1899-12-30")
  )

There is also openxlsx::convertToDate() function that can be used to convert Excel date values to R Date objects.

4 Join Tables

We will join the two tables to create a single dataset for analysis.

dataset <- complaints |> left_join(companies, by = c("company_id_1081"))
dataset |> glimpse()
Rows: 62,516
Columns: 27
$ complaint_id                 <dbl> 4848023, 3621464, 5818349, 7233015, 58202…
$ submitted_via                <chr> "Referral", "Web", "Web", "Referral", "Re…
$ date_submitted               <date> 2021-10-24, 2020-04-24, 2022-07-27, 2023…
$ date_received                <date> 2021-10-27, 2020-04-24, 2022-07-27, 2023…
$ state                        <chr> "NY", "FL", "CA", "CA", "VA", "CA", "TX",…
$ state_latitude               <dbl> 42.16573, 27.76628, 36.11620, 36.11620, 3…
$ state_longitude              <dbl> -74.94805, -81.68678, -119.68156, -119.68…
$ product                      <chr> "Mortgage", "Money transfer, virtual curr…
$ sub_product                  <chr> "Conventional home mortgage", "Refund ant…
$ issue                        <chr> "Applying for a mortgage or refinancing a…
$ sub_issue                    <chr> NA, NA, "Account information incorrect", …
$ company_public_response      <chr> "Company has responded to the consumer an…
$ company_response_to_consumer <chr> "Closed with explanation", "Closed with m…
$ timely_response              <chr> "Yes", "Yes", "Yes", NA, "Yes", "Yes", "Y…
$ census_region                <chr> "Northeast", "South", "West", "West", "So…
$ census_division              <chr> "Middle Atlantic", "South Atlantic", "Pac…
$ company_response_date        <date> 2021-10-30, 2020-05-13, 2022-08-24, 2023…
$ company_id_1081              <chr> "COMP-0820", "COMP-0115", "COMP-0408", "C…
$ response_time_days           <dbl> 6, 19, 28, 14, 10, 7, 28, 20, 6, 25, 18, …
$ market_share_percent         <dbl> 0.0099, 0.0334, 0.0256, 0.0099, 0.0099, 0…
$ reputation_score             <dbl> 92, 89, 70, 91, 88, 92, 57, 79, 71, 73, 7…
$ enforcement_history          <chr> "No", "No", "No", "No", "No", "No", "No",…
$ company_size_tier            <chr> "Small", "Small", "Small", "Small", "Smal…
$ complaint_count              <dbl> 56, 66, 55, 72, 54, 72, 66, 61, 55, 63, 4…
$ timely_response_rate         <dbl> 0.9107143, 0.9090909, 0.9272727, 0.916666…
$ avg_response_time_days       <dbl> 12.87500, 13.30303, 15.69091, 16.00000, 1…
$ complaints_per_1pct_share    <dbl> 5656.5657, 1976.0479, 2148.4375, 7272.727…

Using latitude and longitude variables for predictions doesn’t make much sense, so we should remove them, along with IDs, which serve only as identifiers.

dataset <- dataset |>
  select(-state_latitude, -state_longitude, -complaint_id, -company_id_1081)

5 Univariate Distributions

Let’s plot the distributions of numeric variables in the dataset.

dataset |>
  select(where(is.numeric)) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free", ncol = 4) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") + 
  theme(axis.title = element_blank())
Figure 1: Distributions of numeric variables in the dataset

6 Correlations

Studying correlations between numeric variables can help identify relationships and potential multicollinearity issues.

library(corrplot)

cor_data <- dataset |>
  select_if(is.numeric) |>
  cor(use = "pairwise.complete.obs")

cor_data |>
  corrplot(
    type = "upper", order = "hclust",
    tl.col = "black", tl.srt = 45
  )
Figure 2: Correlation matrix of numeric variables

7 Save dataset

Finally, we will save the cleaned dataset to a CSV file for future use.

write_csv(dataset, "dataset.csv")

8 References