Exploratory Data Analysis

Author

Aleksei Prishchepo

Published

November 16, 2025

1 Load Data

The dataset is provided in an Excel file with three sheets: events, products, and customers. We will load each sheet into a separate data frame for analysis.

file_name <- "DataDNA Dataset Challenge - E-commerce Dataset - November 2025.xlsx"

df_events <- readxl::read_xlsx(file_name, sheet = 1)
df_products <- readxl::read_xlsx(file_name, sheet = 2)
df_customers <- readxl::read_xlsx(file_name, sheet = 3)

2 Data Dictionary

We are provided with the data dictionary. Let’s have a look at it to understand the fields better.

data_dictionary <- readxl::read_xlsx(file_name, sheet = 4)
data_dictionary |> na.omit() |>  knitr::kable()
Column Description
customer_id Unique ID for each customer.
signup_date When the customer signed up.
region State/province/region.
currency_preference Preferred currency if the customer chose one.
segment Customer segment (e.g., Student, SMB, Enterprise).
acquisition_channel How the customer first found or was acquired by the business.
age_band Age group (e.g., 18–24).
currency Currency used for pricing on events.
country Customer’s country.
country_latitude Customer’s country centroid latitude coordinates
country_longitude Customer’s country centroid longitude coordinates
product_id Unique ID for each product/plan.
product_name Product/plan name.
category Product category (e.g., Analytics, Collaboration).
is_subscription Subscription vs one-time item.
billing_cycle Monthly, Annual, or One-time.
base_price_usd List price in USD before discounts/taxes.
first_release_date Approximate “first availability” of the product family/plan.
event_id Unique ID for each transaction-like record.
event_type order (new sale) or invoice (renewal/billing).
event_date Timestamp when the event occurred.
customer_id Customer on the event.
product_id Product/plan on the event.
country / region Customer location on the event.
channel Sales channel (Website, Direct Sales, etc.).
payment_method Payment type (Credit Card, Invoice, etc.).
currency Currency used for this event.
fx_rate_to_usd Conversion rate used for USD amounts.
quantity Number of seats/units.
unit_price_local Unit list price in local currency.
discount_code Code applied to the event.
discount_local Discount amount in local currency.
tax_local Tax/VAT in local currency.
net_revenue_local Net amount in local currency.
net_revenue_usd Net amount converted to USD.
is_refunded True/False if refunded.
refund_datetime When refund happened.
refund_reason Reason for refund.

3 Data Glimpse

Now let’s take a quick look at the values in the fields of each dataset.

3.1 Events

df_events |> glimpse()
Rows: 48,000
Columns: 23
$ event_id          <chr> "DB4F1A096586", "3550A69EEA36", "0C3CAE65E90C", "7DD…
$ event_type        <chr> "order", "order", "order", "invoice", "order", "orde…
$ event_date        <dttm> 2024-12-09 00:11:27, 2025-08-11 21:31:03, 2025-09-3…
$ customer_id       <chr> "CUST0001946", "CUST0003776", "CUST0001105", "CUST00…
$ product_id        <chr> "PROD0083", "PROD0049", "PROD0075", "PROD0017", "PRO…
$ country           <chr> "United States", "Australia", "Canada", "United Stat…
$ latitude          <chr> "39.8283", "-25.2744", "56.1304", "39.8283", "39.828…
$ longitude         <chr> "-98.5795", "133.7751", "-106.3468", "-98.5795", "-9…
$ region            <chr> NA, "APAC", NA, NA, NA, NA, "APAC", "EU", NA, "LATAM…
$ channel           <chr> "Website", "Website", "Website", "Partner", "Website…
$ payment_method    <chr> "Credit Card", "Invoice", "Credit Card", "Credit Car…
$ currency          <chr> "USD", "AUD", "USD", "USD", "USD", "USD", "AUD", "EU…
$ quantity          <dbl> 3, 5, 1, 10, 3, 25, 3, 20, 1, 1, 3, 3, 10, 5, 5, 3, …
$ unit_price_local  <chr> "36.7", "19.7", "20.12", "19.0", "29.38", "289.0", "…
$ discount_code     <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "N/…
$ discount_local    <chr> "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.…
$ tax_local         <chr> "0.0", "9.85", "1.01", "0.0", "0.0", "0.0", "13.18",…
$ net_revenue_local <chr> "110.1", "108.35", "21.13", "190.0", "88.14", "7225.…
$ fx_rate_to_usd    <chr> "1.0", "0.66", "1.0", "1.0", "1.0", "1.0", "0.66", "…
$ net_revenue_usd   <chr> "110.1", "71.51", "21.13", "190.0", "88.14", "7225.0…
$ is_refunded       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ refund_datetime   <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ refund_reason     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

3.2 Customers

df_customers |> glimpse()
Rows: 4,000
Columns: 10
$ customer_id         <chr> "CUST0000001", "CUST0000002", "CUST0000003", "CUST…
$ signup_date         <dttm> 2025-07-28 15:21:00, 2024-09-01 00:03:00, 2024-06…
$ region              <chr> "EU", "APAC", "EU", "APAC", NA, NA, "EU", "APAC", …
$ currency_preference <chr> "GBP", "AUD", "GBP", "AUD", "USD", "USD", "EUR", "…
$ segment             <chr> "Consumer", "Consumer", "Consumer", "Consumer", "C…
$ acquisition_channel <chr> "Affiliate", "Organic", "Paid Search", "Email", "P…
$ age_band            <chr> "55+", "55+", "35-44", "55+", "35-44", "25-34", "3…
$ country             <chr> "United Kingdom", "Australia", "United Kingdom", "…
$ country_latitude    <chr> "55.3781", "-25.2744", "55.3781", "-25.2744", "39.…
$ country_longitude   <chr> "-3.436", "133.7751", "-3.436", "133.7751", "-98.5…

3.3 Products

df_products |> glimpse()
Rows: 101
Columns: 14
$ product_id          <chr> "PROD0001", "PROD0002", "PROD0003", "PROD0004", "P…
$ product_name        <chr> "Microsoft 365 Business Standard Monthly", "Micros…
$ category            <chr> "Productivity Suite", "Productivity Suite", "Produ…
$ is_subscription     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FA…
$ billing_cycle       <chr> "Monthly", "Annual", "Monthly", "Annual", "Monthly…
$ base_price_usd      <chr> "49.0", "499.0", "29.0", "299.0", "15.0", "149.0",…
$ first_release_date  <dttm> 2023-11-24, 2023-02-27, 2023-01-13, 2023-05-21, 2…
$ vendor              <chr> "Microsoft", "Microsoft", "Adobe", "Adobe", "AI To…
$ resale_model        <chr> "Direct resale", "Direct resale", "Direct resale",…
$ brand_safe_name     <chr> "Office Suite 365 Business Standard Monthly", "Off…
$ product_name_orig   <chr> "Microsoft 365 Business Standard Monthly", "Micros…
$ base_price_usd_orig <chr> "49.0", "499.0", "29.0", "299.0", "15.0", "149.0",…
$ base_key            <chr> "microsoft 365 business standard", "microsoft 365 …
$ product_version     <chr> "v3.0", "v3.0", "v3.0", "v3.0", "v3.0", "v3.0", "v…

4 Field Types

We need to set correct data type for each field.

df_events <- df_events |>
  mutate(
    event_date = as.POSIXct(event_date,
      format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
    ),
    country = as.factor(country),
    region = as.factor(region),
    channel = as.factor(channel),
    payment_method = as.factor(payment_method),
    currency = as.factor(currency),
    latitude = as.numeric(latitude),
    longitude = as.numeric(longitude),
    quantity = as.numeric(quantity),
    unit_price_local = as.numeric(unit_price_local),
    discount_local = as.numeric(discount_local),
    tax_local = as.numeric(tax_local),
    net_revenue_local = as.numeric(net_revenue_local),
    fx_rate_to_usd = as.numeric(fx_rate_to_usd),
    net_revenue_usd = as.numeric(net_revenue_usd),
    refund_datetime = as.POSIXct(refund_datetime,
      format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
    ),
    refund_reason = as.factor(refund_reason)
  )
df_customers <- df_customers |>
  mutate(
    signup_date = as.POSIXct(signup_date,
      format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
    ),
    region = as.factor(region),
    country = as.factor(country),
    currency_preference = as.factor(currency_preference),
    segment = as.factor(segment),
    acquisition_channel = as.factor(acquisition_channel),
    age_band = factor(age_band,
      levels = c("18-24", "25-34", "35-44", "45-54", "55+")
    ),
    country = as.factor(country),
    country_latitude = as.numeric(country_latitude),
    country_longitude = as.numeric(country_longitude)
  )
df_products <- df_products |>
  mutate(
    product_name = as.factor(product_name),
    category = as.factor(category),
    billing_cycle = factor(billing_cycle,
      levels = c("One-time", "Monthly", "Annual")
    ),
    first_release_date = as.POSIXct(first_release_date,
      format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
    ),
    vendor = as.factor(vendor),
    resale_model = as.factor(resale_model),
    base_price_usd_orig = as.numeric(base_price_usd_orig)
  )

5 Summary Statistics

5.1 Events

Events dataset consists of 48000 rows and 23 columns. The event_date field ranges from 2024-04-22 00:09:01 to 2025-10-21 23:56:20.

skimr::skim(df_events)
Data summary
Name df_events
Number of rows 48000
Number of columns 23
_______________________
Column type frequency:
character 5
factor 6
logical 1
numeric 9
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
event_id 0 1 12 12 0 48000 0
event_type 0 1 5 7 0 2 0
customer_id 0 1 11 11 0 4000 0
product_id 0 1 8 8 0 101 0
discount_code 0 1 3 13 0 14 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
country 0 1.00 FALSE 10 Uni: 14735, Uni: 7262, Can: 4844, Aus: 4836
region 19579 0.59 FALSE 3 EU: 19040, APA: 7306, LAT: 2075
channel 0 1.00 FALSE 5 Web: 21549, Dir: 9792, Res: 6914, Par: 4884
payment_method 0 1.00 FALSE 4 Cre: 26340, Inv: 12055, Pay: 7215, Wir: 2390
currency 0 1.00 FALSE 4 USD: 24124, EUR: 11778, GBP: 7262, AUD: 4836
refund_reason 46995 0.02 FALSE 4 Bil: 269, Acc: 265, Dup: 238, Ser: 233

Variable type: logical

skim_variable n_missing complete_rate mean count
is_refunded 0 1 0.02 FAL: 46995, TRU: 1005

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
latitude 0 1 35.51 25.61 -25.27 39.83 39.83 55.38 56.13 ▂▁▁▅▇
longitude 0 1 -22.98 79.32 -106.35 -98.58 -3.44 5.29 133.78 ▇▁▇▁▃
quantity 0 1 6.03 6.25 1.00 1.00 3.00 10.00 25.00 ▇▁▁▁▁
unit_price_local 0 1 106.67 113.83 4.10 15.81 49.00 186.89 756.06 ▇▂▁▁▁
discount_local 0 1 26.52 96.08 0.00 0.00 0.00 7.34 2353.75 ▇▁▁▁▁
tax_local 0 1 66.09 155.25 0.00 0.00 8.49 47.54 2353.75 ▇▁▁▁▁
net_revenue_local 0 1 668.31 1214.80 4.00 55.24 202.79 675.86 15121.20 ▇▁▁▁▁
fx_rate_to_usd 0 1 1.01 0.14 0.66 1.00 1.00 1.06 1.22 ▁▁▁▇▂
net_revenue_usd 0 1 663.17 1188.44 4.00 54.34 199.71 675.46 14970.13 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
event_date 0 1.00 2024-04-22 00:09:01 2025-10-21 23:56:20 2025-01-20 11:45:33 47976
refund_datetime 46995 0.02 2024-04-25 21:38:12 2025-12-11 16:52:56 2025-02-23 06:50:48 1005

5.2 Impute Missing Region Values

We can see that there are 19579 rows with missing region values. Let’s check which countries these rows belong to.

df_events |>
  filter(is.na(region)) |>
  group_by(country) |>
  summarise(count = n()) |>
  arrange(desc(count))
# A tibble: 2 × 2
  country       count
  <fct>         <int>
1 United States 14735
2 Canada         4844

We can impute the missing region values based on the country field since each country belongs to a specific region. Missing region values correspond to countries in North America (US, CA). Therefore, we will impute the missing region values with “NOAM”.

df_events <- df_events |>
  mutate(region = ifelse(is.na(region), "NOAM", as.character(region)))
df_events <- df_events |> mutate(region = as.factor(region))

5.3 Customers

skimr::skim(df_customers)
Data summary
Name df_customers
Number of rows 4000
Number of columns 10
_______________________
Column type frequency:
character 1
factor 6
numeric 2
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
customer_id 0 1 11 11 0 4000 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
region 1628 0.59 FALSE 3 EU: 1583, APA: 617, LAT: 172
currency_preference 0 1.00 FALSE 4 USD: 2006, EUR: 990, GBP: 593, AUD: 411
segment 0 1.00 FALSE 4 Con: 2529, SOH: 761, SMB: 597, Ent: 113
acquisition_channel 0 1.00 FALSE 6 Org: 1121, Pai: 869, Soc: 708, Ema: 664
age_band 0 1.00 FALSE 5 25-: 1425, 35-: 1099, 45-: 617, 18-: 451
country 0 1.00 FALSE 10 Uni: 1213, Uni: 593, Can: 415, Aus: 411

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
country_latitude 0 1 35.41 25.75 -25.27 39.83 39.83 55.38 56.13 ▂▁▁▅▇
country_longitude 0 1 -22.61 79.63 -106.35 -98.58 -3.44 5.29 133.78 ▇▁▇▁▃

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
signup_date 0 1 2024-01-01 05:27:00 2025-08-30 15:03:00 2024-11-04 13:14:00 3989

5.4 Impute Missing Region Values

The Customers dataset also has missing values in the region field. Let’s check which countries these rows belong to.

df_customers |>
  filter(is.na(region)) |>
  group_by(country) |>
  summarise(count = n()) |>
  arrange(desc(count))
# A tibble: 2 × 2
  country       count
  <fct>         <int>
1 United States  1213
2 Canada          415

We can impute the missing region values based on the country field since each country belongs to a specific region. Missing region values correspond to countries in North America (US, CA). Therefore, we will impute the missing region values with “NOAM”.

df_customers <- df_customers |>
  mutate(region = ifelse(is.na(region), "NOAM", as.character(region)))
df_customers <- df_customers |> mutate(region = as.factor(region))

5.5 Products

skimr::skim(df_products)
Data summary
Name df_products
Number of rows 101
Number of columns 14
_______________________
Column type frequency:
character 6
factor 5
logical 1
numeric 1
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
product_id 0 1 8 8 0 101 0
base_price_usd 0 1 3 6 0 67 0
brand_safe_name 0 1 20 51 0 101 0
product_name_orig 0 1 15 39 0 70 0
base_key 0 1 8 35 0 36 0
product_version 0 1 4 4 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
product_name 0 1 FALSE 101 Ado: 1, Ado: 1, Ado: 1, Ado: 1
category 0 1 FALSE 17 Add: 12, Des: 12, Ana: 10, Dev: 10
billing_cycle 0 1 FALSE 3 Mon: 50, Ann: 49, One: 2
vendor 0 1 FALSE 17 Mic: 24, AI : 14, Oth: 11, Ado: 10
resale_model 0 1 FALSE 3 Dir: 58, Mar: 28, Aff: 15

Variable type: logical

skim_variable n_missing complete_rate mean count
is_subscription 0 1 0.98 TRU: 99, FAL: 2

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
base_price_usd_orig 0 1 99.32 104.26 5 15 50 160 499 ▇▃▂▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
first_release_date 0 1 2023-01-13 2024-03-01 2024-02-01 14

The Products dataset has no missing values.

6 Univariate Distributions

6.1 Events Categorical Variables

# Categorical distributions
df_events |>
  select(-longitude, -latitude) |>
  select(where(is.factor)) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free", ncol = 3) +
  geom_bar(fill = "steelblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank())
Figure 1: Bar Plots of Categorical Variables in Events Dataset

6.2 Events Numeric Variables

Before doing bivariate analysis, let’s convert local currency to USD for all relevant fields in the Events dataset.

local_fields <- df_events |>
  colnames() |>
  str_subset("_local$")

for (field in local_fields) {
  usd_field <- str_replace(field, "_local$", "_usd")
  df_events <- df_events |>
    mutate(!!sym(usd_field) := !!sym(field) / fx_rate_to_usd)
}

# Drop local currency fields
df_events <- df_events |>
  select(-all_of(local_fields))
# Numeric distributions
df_events |>
  select(-longitude, -latitude) |>
  select(where(is.numeric)) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free", ncol = 3) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  theme(axis.title = element_blank())
Figure 2: Histograms of Numeric Variables in Events Dataset

There is no need to do correlation analysis for the monetary fields.

6.3 Check Integrity of Monetary Fields

Some variables in the Events dataset need clarification:

  • quantity: This field represents the number of units sold in each event.
  • unit_price_usd: This field represents the price per unit in USD.
  • discount_usd: This field represents the discount applied in USD.
  • tax_usd: This field represents the tax applied in USD.
  • net_revenue_usd: This field represents the net revenue after discounts and taxes in USD.

Let’s check the integrity of these fields.

df_events <- df_events |>
  mutate(
    calculated_net_revenue = (quantity * unit_price_usd) - discount_usd +
      tax_usd
  )
integrity_issues <- df_events |>
  filter(abs(calculated_net_revenue - net_revenue_usd) > 0.01)
integrity_issues_count <- nrow(integrity_issues)
integrity_issues_count
[1] 0

It seems that there are 0 rows with integrity issues in the monetary fields.

6.4 Customers Categorical Variables

# Categorical distributions
df_customers |>
  select(where(is.factor)) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free", ncol = 3) +
  geom_bar(fill = "steelblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank())
Figure 3: Bar Plots of Categorical Variables in Customers Dataset

6.5 Customers Numeric Variables

Only country_latitude and country_longitude are numeric variables in the Customers dataset. We are not going to do correlation analysis for these fields.

6.6 Products Categorical Variables

# Categorical distributions
df_products |>
  select(where(is.factor)) |>
  select(-product_name) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free", ncol = 2) +
  geom_bar(fill = "steelblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank())
Figure 4: Bar Plots of Categorical Variables in Products Dataset

6.7 Products Numeric Variables

There is only one numeric variable in the Products dataset: base_price_usd_orig. We will plot its distribution.

# Numeric distributions
df_products |>
  select(where(is.numeric)) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free", ncol = 1) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  theme(axis.title = element_blank())
Figure 5: Histograms of Numeric Variables in Products Dataset

7 References