Refund Prediction Model

Author

Aleksei Prishchepo

Published

November 16, 2025

1 Load data

dataset <- read.csv("Events.csv")

file_name <- "DataDNA Dataset Challenge - E-commerce Dataset - November 2025.xlsx"
products <- readxl::read_excel(file_name, sheet = 2)
dataset <- dataset |> left_join(products, by = "product_id")
dataset |> glimpse()
Rows: 48,000
Columns: 40
$ event_id                <chr> "ABFBC39EA034", "9EC5B0C5C929", "11E3C1E01029"…
$ event_type              <chr> "order", "order", "order", "invoice", "order",…
$ event_date              <chr> "13.06.2024 05:22:41", "04.07.2024 02:06:29", …
$ customer_id             <chr> "CUST0000001", "CUST0000001", "CUST0000001", "…
$ product_id              <chr> "PROD0094", "PROD0085", "PROD0002", "PROD0050"…
$ country                 <chr> "United Kingdom", "United Kingdom", "United Ki…
$ latitude                <dbl> 55.3781, 55.3781, 55.3781, 55.3781, 55.3781, 5…
$ longitude               <dbl> -3.4360, -3.4360, -3.4360, -3.4360, -3.4360, -…
$ region                  <chr> "EU", "EU", "EU", "EU", "EU", "EU", "EU", "EU"…
$ channel                 <chr> "Direct Sales", "Website", "Reseller", "Market…
$ payment_method          <chr> "Credit Card", "Invoice", "Invoice", "Credit C…
$ currency                <chr> "GBP", "GBP", "GBP", "GBP", "GBP", "GBP", "GBP…
$ quantity                <int> 1, 1, 1, 15, 15, 15, 3, 1, 25, 5, 1, 1, 20, 3,…
$ discount_code           <chr> "N/A", "N/A", "N/A", "N/A", "SAVE5", "BFCM10",…
$ fx_rate_to_usd          <dbl> 1.22, 1.22, 1.22, 1.22, 1.22, 1.22, 1.22, 1.22…
$ net_revenue_usd         <dbl> 189.426230, 9.991803, 402.311475, 1572.196721,…
$ is_refunded             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ refund_datetime         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ refund_reason           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Date                    <chr> "2024-06-13", "2024-07-04", "2024-07-15", "202…
$ unit_price_usd          <dbl> 157.852459, 8.327869, 335.262295, 87.344262, 1…
$ discount_usd            <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 134.74…
$ tax_usd                 <dbl> 31.573770, 1.663934, 67.049180, 262.032787, 53…
$ purchase_sequence       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4,…
$ time_to_second_purchase <dbl> NA, 20.86375, NA, NA, NA, NA, NA, NA, NA, NA, …
$ time_between_purchases  <dbl> NA, 20.863750, 10.940058, 44.105428, 1.342627,…
$ time_to_refund          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ product_name            <chr> "Adobe Firefly Creative AI Annual Business", "…
$ category                <chr> "Design", "Productivity", "Productivity Suite"…
$ is_subscription         <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ billing_cycle           <chr> "Annual", "Monthly", "Annual", "Annual", "Annu…
$ base_price_usd          <chr> "234.95", "12.4", "499.0", "130.0", "267.4", "…
$ first_release_date      <dttm> 2024-03-01, 2024-03-01, 2023-02-27, 2024-02-0…
$ vendor                  <chr> "Adobe", "Notion", "Microsoft", "Zendesk", "AI…
$ resale_model            <chr> "Direct resale", "Marketplace", "Direct resale…
$ brand_safe_name         <chr> "Creative Suite Firefly Creative AI Annual Bus…
$ product_name_orig       <chr> "Adobe Firefly Creative AI Annual", "Notion AI…
$ base_price_usd_orig     <chr> "200.0", "10.0", "499.0", "130.0", "250.0", "2…
$ base_key                <chr> "adobe firefly creative ai", "notion ai", "mic…
$ product_version         <chr> "v4.0", "v4.0", "v3.0", "v4.0", "v4.0", "v4.0"…

Let’s select meaningful variables for our analysis.

dataset <- dataset |> select(
  event_type, country, channel, payment_method, quantity, discount_code,
  is_refunded, refund_reason, category
)

dataset <- dataset |>
  mutate(across(where(is.character), as.factor))

dataset <- dataset |> mutate(
  is_refunded = ifelse(is_refunded, "Yes", "No"),
  is_refunded = factor(is_refunded, levels = c("No", "Yes"))
)

dataset |> glimpse()
Rows: 48,000
Columns: 9
$ event_type     <fct> order, order, order, invoice, order, invoice, order, or…
$ country        <fct> United Kingdom, United Kingdom, United Kingdom, United …
$ channel        <fct> Direct Sales, Website, Reseller, Marketplace, Partner, …
$ payment_method <fct> Credit Card, Invoice, Invoice, Credit Card, Invoice, In…
$ quantity       <int> 1, 1, 1, 15, 15, 15, 3, 1, 25, 5, 1, 1, 20, 3, 1, 1, 1,…
$ discount_code  <fct> N/A, N/A, N/A, N/A, SAVE5, BFCM10, BFCM20, N/A, LOYALTY…
$ is_refunded    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No,…
$ refund_reason  <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ category       <fct> Design, Productivity, Productivity Suite, Support, AI T…

We’re going to predict only refunds with the reason “Service dissatisfaction”.

dataset <- dataset |>
  filter(is_refunded == "Yes" & refund_reason == "Service dissatisfaction" |
    is_refunded == "No") |>
  select(-refund_reason)

2 Preprocess Data

We’re going to split dataset to train and test sets.

set.seed(123)
train_index <- createDataPartition(dataset$is_refunded, p = 0.7, list = FALSE)
train <- dataset[train_index, ]
test  <- dataset[-train_index, ]

3 Logistic Regression

We’re going to use selected at the previous step variables.

glm_model <- glm(is_refunded ~ .,
  family = "binomial", data = train
)
summary(glm_model)

Call:
glm(formula = is_refunded ~ ., family = "binomial", data = train)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 -5.06413    0.50445 -10.039   <2e-16 ***
event_typeorder              0.19202    0.17818   1.078   0.2812    
countryBrazil                0.33136    0.38479   0.861   0.3892    
countryCanada                0.09030    0.32625   0.277   0.7819    
countryFrance               -0.22634    0.39586  -0.572   0.5675    
countryGermany               0.17590    0.34518   0.510   0.6103    
countryNetherlands          -0.47082    0.47279  -0.996   0.3193    
countryPhilippines          -0.43500    0.47289  -0.920   0.3576    
countrySpain                -0.21574    0.44700  -0.483   0.6294    
countryUnited Kingdom        0.13450    0.29758   0.452   0.6513    
countryUnited States        -0.35852    0.28904  -1.240   0.2148    
channelMarketplace          -0.07718    0.27891  -0.277   0.7820    
channelPartner               0.02116    0.26970   0.078   0.9375    
channelReseller             -0.11573    0.25191  -0.459   0.6459    
channelWebsite              -0.47222    0.20570  -2.296   0.0217 *  
payment_methodInvoice       -0.18648    0.20486  -0.910   0.3627    
payment_methodPayPal         0.27573    0.20744   1.329   0.1838    
payment_methodWire           0.34143    0.32185   1.061   0.2888    
quantity                    -0.02239    0.01373  -1.631   0.1028    
discount_codeBFCM20         -0.56288    0.55921  -1.007   0.3141    
discount_codeEDU10         -13.16926  630.51146  -0.021   0.9833    
discount_codeEDU15           0.54904    1.06139   0.517   0.6050    
discount_codeEDU20         -13.25528  589.15440  -0.022   0.9821    
discount_codeLOYALTY15      -0.08949    0.53100  -0.169   0.8662    
discount_codeN/A             0.06503    0.34795   0.187   0.8518    
discount_codeNEWCUSTOMER10   0.44071    0.42915   1.027   0.3045    
discount_codeSALE15         -0.03058    0.79713  -0.038   0.9694    
discount_codeSAVE5          -0.35222    0.52874  -0.666   0.5053    
discount_codeSTUDENT10       1.31090    0.79067   1.658   0.0973 .  
discount_codeSTUDENT15     -13.24478  672.47914  -0.020   0.9843    
discount_codeSTUDENT20       0.56259    1.06164   0.530   0.5962    
discount_codeWELCOME10      -0.12528    0.48757  -0.257   0.7972    
categoryAI Productivity     -0.33907    0.40599  -0.835   0.4036    
categoryAI Tools            -0.45539    0.44340  -1.027   0.3044    
categoryAnalytics           -0.59033    0.40574  -1.455   0.1457    
categoryCollaboration        0.08662    0.38040   0.228   0.8199    
categoryDesign               0.21795    0.30835   0.707   0.4797    
categoryDeveloper Tools      0.03595    0.33514   0.107   0.9146    
categoryFile Storage        -0.52895    0.74528  -0.710   0.4779    
categoryInfrastructure      -0.08313    0.62341  -0.133   0.8939    
categoryMonitoring          -0.51181    0.74512  -0.687   0.4922    
categoryPerpetual            0.26837    0.74758   0.359   0.7196    
categoryProductivity         0.32209    0.31455   1.024   0.3059    
categoryProductivity Suite  -0.23860    0.50421  -0.473   0.6361    
categoryProject Management   0.48066    0.34103   1.409   0.1587    
categorySecurity            -0.46766    0.74518  -0.628   0.5303    
categoryServices            -0.52622    1.02876  -0.512   0.6090    
categorySupport              0.27772    0.36141   0.768   0.4422    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2067.6  on 33060  degrees of freedom
Residual deviance: 2013.7  on 33013  degrees of freedom
AIC: 2109.7

Number of Fisher Scoring iterations: 17

4 Model Evaluation

glm_probs <- predict(glm_model, newdata = test, type = "response")

glm_pred <- ifelse(glm_probs > 0.5, "Yes", "No") |> factor(levels = c("No", "Yes"))

# Confusion matrix
confusionMatrix(glm_pred, test$is_refunded, positive = "Yes")
Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  14098    69
       Yes     0     0
                                          
               Accuracy : 0.9951          
                 95% CI : (0.9938, 0.9962)
    No Information Rate : 0.9951          
    P-Value [Acc > NIR] : 0.532           
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : 2.695e-16       
                                          
            Sensitivity : 0.00000         
            Specificity : 1.00000         
         Pos Pred Value :     NaN         
         Neg Pred Value : 0.99513         
             Prevalence : 0.00487         
         Detection Rate : 0.00000         
   Detection Prevalence : 0.00000         
      Balanced Accuracy : 0.50000         
                                          
       'Positive' Class : Yes             
                                          

5 Conclusion

The logistic regression model couldn’t predict refunds linked to service dissatisfaction, so any insights from slicing the data in the dashboard would be unreliable. We’ve decided not to move forward with more complex approaches as they’re not suitable for visualization in the dashboard.

6 References