Exploratory Data Analysis

1 Load Data

The dataset is provided in an Excel file with three sheets: events, products, and customers. We will load each sheet into a separate data frame for analysis.

file_name <- "DataDNA Dataset Challenge - E-commerce Dataset - November 2025.xlsx"

df_events <- readxl::read_xlsx(file_name, sheet = 1)
df_products <- readxl::read_xlsx(file_name, sheet = 2)
df_customers <- readxl::read_xlsx(file_name, sheet = 3)

2 Data Dictionary

We are provided with the data dictionary. Let’s have a look at it to understand the fields better.

data_dictionary <- readxl::read_xlsx(file_name, sheet = 4)
data_dictionary |> na.omit() |>  knitr::kable()

Column	Description
customer_id	Unique ID for each customer.
signup_date	When the customer signed up.
region	State/province/region.
currency_preference	Preferred currency if the customer chose one.
segment	Customer segment (e.g., Student, SMB, Enterprise).
acquisition_channel	How the customer first found or was acquired by the business.
age_band	Age group (e.g., 18–24).
currency	Currency used for pricing on events.
country	Customer’s country.
country_latitude	Customer’s country centroid latitude coordinates
country_longitude	Customer’s country centroid longitude coordinates
product_id	Unique ID for each product/plan.
product_name	Product/plan name.
category	Product category (e.g., Analytics, Collaboration).
is_subscription	Subscription vs one-time item.
billing_cycle	Monthly, Annual, or One-time.
base_price_usd	List price in USD before discounts/taxes.
first_release_date	Approximate “first availability” of the product family/plan.
event_id	Unique ID for each transaction-like record.
event_type	order (new sale) or invoice (renewal/billing).
event_date	Timestamp when the event occurred.
customer_id	Customer on the event.
product_id	Product/plan on the event.
country / region	Customer location on the event.
channel	Sales channel (Website, Direct Sales, etc.).
payment_method	Payment type (Credit Card, Invoice, etc.).
currency	Currency used for this event.
fx_rate_to_usd	Conversion rate used for USD amounts.
quantity	Number of seats/units.
unit_price_local	Unit list price in local currency.
discount_code	Code applied to the event.
discount_local	Discount amount in local currency.
tax_local	Tax/VAT in local currency.
net_revenue_local	Net amount in local currency.
net_revenue_usd	Net amount converted to USD.
is_refunded	True/False if refunded.
refund_datetime	When refund happened.
refund_reason	Reason for refund.

3 Data Glimpse

Now let’s take a quick look at the values in the fields of each dataset.

3.1 Events

df_events |> glimpse()

Rows: 48,000
Columns: 23
$ event_id          <chr> "DB4F1A096586", "3550A69EEA36", "0C3CAE65E90C", "7DD…
$ event_type        <chr> "order", "order", "order", "invoice", "order", "orde…
$ event_date        <dttm> 2024-12-09 00:11:27, 2025-08-11 21:31:03, 2025-09-3…
$ customer_id       <chr> "CUST0001946", "CUST0003776", "CUST0001105", "CUST00…
$ product_id        <chr> "PROD0083", "PROD0049", "PROD0075", "PROD0017", "PRO…
$ country           <chr> "United States", "Australia", "Canada", "United Stat…
$ latitude          <chr> "39.8283", "-25.2744", "56.1304", "39.8283", "39.828…
$ longitude         <chr> "-98.5795", "133.7751", "-106.3468", "-98.5795", "-9…
$ region            <chr> NA, "APAC", NA, NA, NA, NA, "APAC", "EU", NA, "LATAM…
$ channel           <chr> "Website", "Website", "Website", "Partner", "Website…
$ payment_method    <chr> "Credit Card", "Invoice", "Credit Card", "Credit Car…
$ currency          <chr> "USD", "AUD", "USD", "USD", "USD", "USD", "AUD", "EU…
$ quantity          <dbl> 3, 5, 1, 10, 3, 25, 3, 20, 1, 1, 3, 3, 10, 5, 5, 3, …
$ unit_price_local  <chr> "36.7", "19.7", "20.12", "19.0", "29.38", "289.0", "…
$ discount_code     <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "N/…
$ discount_local    <chr> "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.…
$ tax_local         <chr> "0.0", "9.85", "1.01", "0.0", "0.0", "0.0", "13.18",…
$ net_revenue_local <chr> "110.1", "108.35", "21.13", "190.0", "88.14", "7225.…
$ fx_rate_to_usd    <chr> "1.0", "0.66", "1.0", "1.0", "1.0", "1.0", "0.66", "…
$ net_revenue_usd   <chr> "110.1", "71.51", "21.13", "190.0", "88.14", "7225.0…
$ is_refunded       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ refund_datetime   <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ refund_reason     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

3.2 Customers

df_customers |> glimpse()

Rows: 4,000
Columns: 10
$ customer_id         <chr> "CUST0000001", "CUST0000002", "CUST0000003", "CUST…
$ signup_date         <dttm> 2025-07-28 15:21:00, 2024-09-01 00:03:00, 2024-06…
$ region              <chr> "EU", "APAC", "EU", "APAC", NA, NA, "EU", "APAC", …
$ currency_preference <chr> "GBP", "AUD", "GBP", "AUD", "USD", "USD", "EUR", "…
$ segment             <chr> "Consumer", "Consumer", "Consumer", "Consumer", "C…
$ acquisition_channel <chr> "Affiliate", "Organic", "Paid Search", "Email", "P…
$ age_band            <chr> "55+", "55+", "35-44", "55+", "35-44", "25-34", "3…
$ country             <chr> "United Kingdom", "Australia", "United Kingdom", "…
$ country_latitude    <chr> "55.3781", "-25.2744", "55.3781", "-25.2744", "39.…
$ country_longitude   <chr> "-3.436", "133.7751", "-3.436", "133.7751", "-98.5…

3.3 Products

df_products |> glimpse()

Rows: 101
Columns: 14
$ product_id          <chr> "PROD0001", "PROD0002", "PROD0003", "PROD0004", "P…
$ product_name        <chr> "Microsoft 365 Business Standard Monthly", "Micros…
$ category            <chr> "Productivity Suite", "Productivity Suite", "Produ…
$ is_subscription     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FA…
$ billing_cycle       <chr> "Monthly", "Annual", "Monthly", "Annual", "Monthly…
$ base_price_usd      <chr> "49.0", "499.0", "29.0", "299.0", "15.0", "149.0",…
$ first_release_date  <dttm> 2023-11-24, 2023-02-27, 2023-01-13, 2023-05-21, 2…
$ vendor              <chr> "Microsoft", "Microsoft", "Adobe", "Adobe", "AI To…
$ resale_model        <chr> "Direct resale", "Direct resale", "Direct resale",…
$ brand_safe_name     <chr> "Office Suite 365 Business Standard Monthly", "Off…
$ product_name_orig   <chr> "Microsoft 365 Business Standard Monthly", "Micros…
$ base_price_usd_orig <chr> "49.0", "499.0", "29.0", "299.0", "15.0", "149.0",…
$ base_key            <chr> "microsoft 365 business standard", "microsoft 365 …
$ product_version     <chr> "v3.0", "v3.0", "v3.0", "v3.0", "v3.0", "v3.0", "v…

4 Field Types

We need to set correct data type for each field.

df_events <- df_events |>
  mutate(
    event_date = as.POSIXct(event_date,
      format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
    ),
    country = as.factor(country),
    region = as.factor(region),
    channel = as.factor(channel),
    payment_method = as.factor(payment_method),
    currency = as.factor(currency),
    latitude = as.numeric(latitude),
    longitude = as.numeric(longitude),
    quantity = as.numeric(quantity),
    unit_price_local = as.numeric(unit_price_local),
    discount_local = as.numeric(discount_local),
    tax_local = as.numeric(tax_local),
    net_revenue_local = as.numeric(net_revenue_local),
    fx_rate_to_usd = as.numeric(fx_rate_to_usd),
    net_revenue_usd = as.numeric(net_revenue_usd),
    refund_datetime = as.POSIXct(refund_datetime,
      format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
    ),
    refund_reason = as.factor(refund_reason)
  )
df_customers <- df_customers |>
  mutate(
    signup_date = as.POSIXct(signup_date,
      format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
    ),
    region = as.factor(region),
    country = as.factor(country),
    currency_preference = as.factor(currency_preference),
    segment = as.factor(segment),
    acquisition_channel = as.factor(acquisition_channel),
    age_band = factor(age_band,
      levels = c("18-24", "25-34", "35-44", "45-54", "55+")
    ),
    country = as.factor(country),
    country_latitude = as.numeric(country_latitude),
    country_longitude = as.numeric(country_longitude)
  )
df_products <- df_products |>
  mutate(
    product_name = as.factor(product_name),
    category = as.factor(category),
    billing_cycle = factor(billing_cycle,
      levels = c("One-time", "Monthly", "Annual")
    ),
    first_release_date = as.POSIXct(first_release_date,
      format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
    ),
    vendor = as.factor(vendor),
    resale_model = as.factor(resale_model),
    base_price_usd_orig = as.numeric(base_price_usd_orig)
  )

5 Summary Statistics

5.1 Events

Events dataset consists of 48000 rows and 23 columns. The event_date field ranges from 2024-04-22 00:09:01 to 2025-10-21 23:56:20.

skimr::skim(df_events)

Data summary
Name	df_events
Number of rows	48000
Number of columns	23
_______________________
Column type frequency:
character	5
factor	6
logical	1
numeric	9
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
event_id	1	12	12	48000
event_type	1	5	7	2
customer_id	1	11	11	4000
product_id	1	8	8	101
discount_code	1	3	13	14

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
country	0	1.00	FALSE	10	Uni: 14735, Uni: 7262, Can: 4844, Aus: 4836
region	19579	0.59	FALSE	3	EU: 19040, APA: 7306, LAT: 2075
channel	0	1.00	FALSE	5	Web: 21549, Dir: 9792, Res: 6914, Par: 4884
payment_method	0	1.00	FALSE	4	Cre: 26340, Inv: 12055, Pay: 7215, Wir: 2390
currency	0	1.00	FALSE	4	USD: 24124, EUR: 11778, GBP: 7262, AUD: 4836
refund_reason	46995	0.02	FALSE	4	Bil: 269, Acc: 265, Dup: 238, Ser: 233

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
is_refunded	0	1	0.02	FAL: 46995, TRU: 1005

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
latitude	1	35.51	25.61	-25.27	39.83	39.83	55.38	56.13	▂▁▁▅▇
longitude	1	-22.98	79.32	-106.35	-98.58	-3.44	5.29	133.78	▇▁▇▁▃
quantity	1	6.03	6.25	1.00	1.00	3.00	10.00	25.00	▇▁▁▁▁
unit_price_local	1	106.67	113.83	4.10	15.81	49.00	186.89	756.06	▇▂▁▁▁
discount_local	1	26.52	96.08	0.00	0.00	0.00	7.34	2353.75	▇▁▁▁▁
tax_local	1	66.09	155.25	0.00	0.00	8.49	47.54	2353.75	▇▁▁▁▁
net_revenue_local	1	668.31	1214.80	4.00	55.24	202.79	675.86	15121.20	▇▁▁▁▁
fx_rate_to_usd	1	1.01	0.14	0.66	1.00	1.00	1.06	1.22	▁▁▁▇▂
net_revenue_usd	1	663.17	1188.44	4.00	54.34	199.71	675.46	14970.13	▇▁▁▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
event_date	0	1.00	2024-04-22 00:09:01	2025-10-21 23:56:20	2025-01-20 11:45:33	47976
refund_datetime	46995	0.02	2024-04-25 21:38:12	2025-12-11 16:52:56	2025-02-23 06:50:48	1005

5.2 Impute Missing Region Values

We can see that there are 19579 rows with missing region values. Let’s check which countries these rows belong to.

df_events |>
  filter(is.na(region)) |>
  group_by(country) |>
  summarise(count = n()) |>
  arrange(desc(count))

# A tibble: 2 × 2
  country       count
  <fct>         <int>
1 United States 14735
2 Canada         4844

We can impute the missing region values based on the country field since each country belongs to a specific region. Missing region values correspond to countries in North America (US, CA). Therefore, we will impute the missing region values with “NOAM”.

df_events <- df_events |>
  mutate(region = ifelse(is.na(region), "NOAM", as.character(region)))
df_events <- df_events |> mutate(region = as.factor(region))

5.3 Customers

skimr::skim(df_customers)

Data summary
Name	df_customers
Number of rows	4000
Number of columns	10
_______________________
Column type frequency:
character	1
factor	6
numeric	2
POSIXct	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
customer_id	0	1	11	11	0	4000	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
region	1628	0.59	FALSE	3	EU: 1583, APA: 617, LAT: 172
currency_preference	0	1.00	FALSE	4	USD: 2006, EUR: 990, GBP: 593, AUD: 411
segment	0	1.00	FALSE	4	Con: 2529, SOH: 761, SMB: 597, Ent: 113
acquisition_channel	0	1.00	FALSE	6	Org: 1121, Pai: 869, Soc: 708, Ema: 664
age_band	0	1.00	FALSE	5	25-: 1425, 35-: 1099, 45-: 617, 18-: 451
country	0	1.00	FALSE	10	Uni: 1213, Uni: 593, Can: 415, Aus: 411

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
country_latitude	0	1	35.41	25.75	-25.27	39.83	39.83	55.38	56.13	▂▁▁▅▇
country_longitude	0	1	-22.61	79.63	-106.35	-98.58	-3.44	5.29	133.78	▇▁▇▁▃

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
signup_date	0	1	2024-01-01 05:27:00	2025-08-30 15:03:00	2024-11-04 13:14:00	3989

5.4 Impute Missing Region Values

The Customers dataset also has missing values in the region field. Let’s check which countries these rows belong to.

df_customers |>
  filter(is.na(region)) |>
  group_by(country) |>
  summarise(count = n()) |>
  arrange(desc(count))

# A tibble: 2 × 2
  country       count
  <fct>         <int>
1 United States  1213
2 Canada          415

We can impute the missing region values based on the country field since each country belongs to a specific region. Missing region values correspond to countries in North America (US, CA). Therefore, we will impute the missing region values with “NOAM”.

df_customers <- df_customers |>
  mutate(region = ifelse(is.na(region), "NOAM", as.character(region)))
df_customers <- df_customers |> mutate(region = as.factor(region))

5.5 Products

skimr::skim(df_products)

Data summary
Name	df_products
Number of rows	101
Number of columns	14
_______________________
Column type frequency:
character	6
factor	5
logical	1
numeric	1
POSIXct	1
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
product_id	1	8	8	101
base_price_usd	1	3	6	67
brand_safe_name	1	20	51	101
product_name_orig	1	15	39	70
base_key	1	8	35	36
product_version	1	4	4	2

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
product_name	1	FALSE	101	Ado: 1, Ado: 1, Ado: 1, Ado: 1
category	1	FALSE	17	Add: 12, Des: 12, Ana: 10, Dev: 10
billing_cycle	1	FALSE	3	Mon: 50, Ann: 49, One: 2
vendor	1	FALSE	17	Mic: 24, AI : 14, Oth: 11, Ado: 10
resale_model	1	FALSE	3	Dir: 58, Mar: 28, Aff: 15

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
is_subscription	0	1	0.98	TRU: 99, FAL: 2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
base_price_usd_orig	0	1	99.32	104.26	5	15	50	160	499	▇▃▂▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
first_release_date	0	1	2023-01-13	2024-03-01	2024-02-01	14

The Products dataset has no missing values.

6 Univariate Distributions

6.1 Events Categorical Variables

# Categorical distributions
df_events |>
  select(-longitude, -latitude) |>
  select(where(is.factor)) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free", ncol = 3) +
  geom_bar(fill = "steelblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank())

Figure 1: Bar Plots of Categorical Variables in Events Dataset

6.2 Events Numeric Variables

Before doing bivariate analysis, let’s convert local currency to USD for all relevant fields in the Events dataset.

local_fields <- df_events |>
  colnames() |>
  str_subset("_local$")

for (field in local_fields) {
  usd_field <- str_replace(field, "_local$", "_usd")
  df_events <- df_events |>
    mutate(!!sym(usd_field) := !!sym(field) / fx_rate_to_usd)
}

# Drop local currency fields
df_events <- df_events |>
  select(-all_of(local_fields))

# Numeric distributions
df_events |>
  select(-longitude, -latitude) |>
  select(where(is.numeric)) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free", ncol = 3) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  theme(axis.title = element_blank())

Figure 2: Histograms of Numeric Variables in Events Dataset

There is no need to do correlation analysis for the monetary fields.

6.3 Check Integrity of Monetary Fields

Some variables in the Events dataset need clarification:

quantity: This field represents the number of units sold in each event.
unit_price_usd: This field represents the price per unit in USD.
discount_usd: This field represents the discount applied in USD.
tax_usd: This field represents the tax applied in USD.
net_revenue_usd: This field represents the net revenue after discounts and taxes in USD.

Let’s check the integrity of these fields.

df_events <- df_events |>
  mutate(
    calculated_net_revenue = (quantity * unit_price_usd) - discount_usd +
      tax_usd
  )
integrity_issues <- df_events |>
  filter(abs(calculated_net_revenue - net_revenue_usd) > 0.01)
integrity_issues_count <- nrow(integrity_issues)
integrity_issues_count

[1] 0

It seems that there are 0 rows with integrity issues in the monetary fields.

6.4 Customers Categorical Variables

# Categorical distributions
df_customers |>
  select(where(is.factor)) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free", ncol = 3) +
  geom_bar(fill = "steelblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank())

Figure 3: Bar Plots of Categorical Variables in Customers Dataset

6.5 Customers Numeric Variables

Only country_latitude and country_longitude are numeric variables in the Customers dataset. We are not going to do correlation analysis for these fields.

6.6 Products Categorical Variables

# Categorical distributions
df_products |>
  select(where(is.factor)) |>
  select(-product_name) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free", ncol = 2) +
  geom_bar(fill = "steelblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank())

Figure 4: Bar Plots of Categorical Variables in Products Dataset

6.7 Products Numeric Variables

There is only one numeric variable in the Products dataset: base_price_usd_orig. We will plot its distribution.

# Numeric distributions
df_products |>
  select(where(is.numeric)) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = value)) +
  facet_wrap(~name, scales = "free", ncol = 1) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  theme(axis.title = element_blank())

Figure 5: Histograms of Numeric Variables in Products Dataset

1 Load Data

2 Data Dictionary

3 Data Glimpse

3.1 Events

3.2 Customers

3.3 Products

4 Field Types

5 Summary Statistics

5.1 Events

5.2 Impute Missing Region Values

5.3 Customers

5.4 Impute Missing Region Values

5.5 Products

6 Univariate Distributions

6.1 Events Categorical Variables

6.2 Events Numeric Variables

6.3 Check Integrity of Monetary Fields

6.4 Customers Categorical Variables

6.5 Customers Numeric Variables

6.6 Products Categorical Variables

6.7 Products Numeric Variables

7 References