Time-Series Clustering with R’s dtwclust Package

The dtwclust package for R (see its vignette) provides a flexible framework for time-series clustering, letting you implement and compare several algorithms, particularly those built on Dynamic Time Warping (DTW). This article walks through a practical example with dtwclust: data preparation, clustering, visualization, and evaluation.

1 What is Dynamic Time Warping (DTW)?

Dynamic Time Warping (DTW) is a prominent distance measure used in shape-based time-series clustering. Unlike Euclidean distance, which compares points at the same time index, DTW allows for “warping” or stretching/compressing the time axis of one series to find an optimal alignment with another. This enables it to accurately measure similarity between time-series that may vary in speed, length, or have phase shifts, but exhibit similar overall shapes.

Sample alignment performed by the DTW algorithm between two series. The dashed blue lines exemplify how some points are mapped to each other, which shows how they can be warped in time. Note that the vertical position of each series was artificially altered for visualization. Credits: Alexis Sardá-Espinosa
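
To make the warping idea concrete, here is a minimal sketch of the classic DTW recurrence in base R. The function name dtw_distance is hypothetical, the local cost is the absolute difference, and there is no window constraint or normalization. It is meant only as an illustration; dtwclust relies on its own optimized implementations.

```r
# Minimal DTW sketch (illustrative, not the dtwclust implementation)
dtw_distance <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # cost[i + 1, j + 1] is the cheapest cumulative cost of aligning x[1:i] with y[1:j]
  cost <- matrix(Inf, nrow = n + 1, ncol = m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      local <- abs(x[i] - y[j])
      cost[i + 1, j + 1] <- local + min(
        cost[i, j + 1], # step in x only (y[j] is matched again)
        cost[i + 1, j], # step in y only (x[i] is matched again)
        cost[i, j]      # step in both series
      )
    }
  }
  cost[n + 1, m + 1]
}

a <- c(0, 0, 1, 2, 1, 0, 0)
b <- c(0, 1, 2, 1, 0, 0, 0) # same bump, one step earlier

sum(abs(a - b))    # point-by-point comparison at equal time indices
dtw_distance(a, b) # comparison after optimal alignment
```

The two toy series contain the same bump shifted by one step, so the point-by-point distance is positive while the aligned distance drops to zero.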

2 Data Preparation

We will use daily closing prices of various cryptocurrencies from the dYdX exchange. The data will be fetched using the httr2 package.

Note

dYdX Exchange is a decentralized finance (DeFi) platform that allows users to trade perpetual derivatives, margin, and spot crypto assets without a centralized intermediary.

Let’s start by fetching the list of available perpetual markets from the dYdX API. We will then filter for popular cryptocurrencies and retrieve their daily closing prices.

library(httr2)
library(tibble)
library(dplyr)
library(tidyr)

base_url <- "https://indexer.dydx.trade/v4"
full_url <- paste0(base_url, "/perpetualMarkets")

req <- request(full_url) |>
  req_headers(Accept = "application/json")

resp <- req_perform(req)
body <- resp_body_json(resp, simplifyVector = TRUE)

markets <- bind_rows(body$markets) |> as_tibble()
markets <- markets |> mutate(volume24H = as.numeric(volume24H))
markets <- markets |> arrange(desc(volume24H)) 

markets |> select(ticker, oraclePrice, priceChange24H, volume24H) |> head(5)

# A tibble: 5 × 4
  ticker   oraclePrice  priceChange24H  volume24H
  <chr>    <chr>        <chr>               <dbl>
1 ETH-USD  4585.44      -10.614039     150676770.
2 BTC-USD  112855.34    2015.74841      43359731.
3 SOL-USD  212.08       7.72            18677423.
4 XRP-USD  2.9988498849 -0.0133507301    3201248.
5 LINK-USD 23.7155      -0.694490999     2551624.

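Let’s select 30 markets: the first 31 tickers in the API response, minus MATIC-USD, which has missing data. Note that the code below indexes body$markets, which keeps the API’s original order rather than the volume-sorted order of the markets tibble above.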

tickers <- lapply(body$markets[1:31], function(market) {
  market$ticker
})
# remove MATIC-USD due to missing data
tickers <- tickers[!tickers %in% c("MATIC-USD")]
tickers <- tickers |>
  unlist() |>
  array()

Below are helper functions that convert date-times to the required format and fetch daily candle data from the dYdX API. At the time of writing, the API appears to ignore the fromIso parameter, so we filter the data after fetching it.

to_iso8601_ns_utc <- function(datetime) {
  datetime_utc <- as.POSIXct(datetime, tz = "UTC")
  # R documents %OSn only for n between 0 and 6, so request 6 fractional digits
  format(datetime_utc, format = "%Y-%m-%dT%H:%M:%OS6Z", usetz = FALSE)
}

get_dydx_daily_candles <- function(
    market,
    from_iso = NULL,
    to_iso = NULL,
    limit = NULL) {
  base_url <- "https://indexer.dydx.trade/v4"
  full_url <- paste0(base_url, "/candles/perpetualMarkets/", market)

  req <- request(full_url) |>
    req_headers(Accept = "application/json") |>
    req_url_query(resolution = "1DAY")

  from_iso <- if (!is.null(from_iso)) to_iso8601_ns_utc(from_iso) else NULL
  to_iso <- if (!is.null(to_iso)) to_iso8601_ns_utc(to_iso) else NULL

  if (!is.null(from_iso)) req <- req |> req_url_query(fromIso = from_iso)
  if (!is.null(to_iso)) req <- req |> req_url_query(toIso = to_iso)
  if (!is.null(limit)) req <- req |> req_url_query(limit = limit)

  resp <- req_perform(req)
  body <- resp_body_json(resp, simplifyVector = TRUE)

  df <- body$candles
  df <- df |> mutate(
    startedAt = as.POSIXct(startedAt,
      format = "%Y-%m-%dT%H:%M:%OSZ",
      tz = "UTC"
    ),
    open = as.numeric(open),
    high = as.numeric(high),
    low = as.numeric(low),
    close = as.numeric(close)
  )
  df |> as_tibble()
}

get_dydx_daily_candles("BTC-USD",
  from_iso = "2025-01-01T00:00:00Z",
  to_iso = "2025-08-26T00:00:00Z"
) |> head()

# A tibble: 6 × 13
  startedAt           ticker  resolution    low   high   open  close
  <dttm>              <chr>   <chr>       <dbl>  <dbl>  <dbl>  <dbl>
1 2025-08-28 00:00:00 BTC-USD 1DAY       110841 113400 111243 112888
2 2025-08-27 00:00:00 BTC-USD 1DAY       110395 112674 111809 111260
3 2025-08-26 00:00:00 BTC-USD 1DAY       108713 112397 110124 111797
4 2025-08-25 00:00:00 BTC-USD 1DAY       109296 113679 113537 110132
5 2025-08-24 00:00:00 BTC-USD 1DAY       110550 115662 115405 113539
6 2025-08-23 00:00:00 BTC-USD 1DAY       114569 117024 116948 115406
# ℹ 6 more variables: baseTokenVolume <chr>, usdVolume <chr>, trades <int>,
#   startingOpenInterest <chr>, orderbookMidPriceOpen <chr>,
#   orderbookMidPriceClose <chr>
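
The output above contains rows dated after the requested to_iso bound, so at least in this run the server-side date bounds cannot be relied on. A minimal sketch of enforcing both bounds on the client side follows. The helper clip_candles and the toy data are hypothetical and only illustrate the filtering step.

```r
# Hypothetical helper: keep only rows whose timestamp lies in [from, to]
clip_candles <- function(df, from, to, time_col = "startedAt") {
  from <- as.POSIXct(from, tz = "UTC")
  to <- as.POSIXct(to, tz = "UTC")
  keep <- df[[time_col]] >= from & df[[time_col]] <= to
  df[keep, , drop = FALSE]
}

# Toy candles standing in for an API response (values are made up)
candles <- data.frame(
  startedAt = as.POSIXct("2024-12-30", tz = "UTC") + 86400 * 0:5,
  close = c(10, 11, 12, 13, 14, 15)
)

clipped <- clip_candles(candles, from = "2025-01-01", to = "2025-01-03")
clipped
```

Filtering locally keeps the analysis window explicit and reproducible regardless of how the API treats its query parameters.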

Now, we will fetch daily closing prices for our selection of popular cryptocurrencies.

cryptos_list <- lapply(tickers, function(coin) {
  # cat("Fetching data for:", coin, "\n")
  df <- get_dydx_daily_candles(coin,
    from_iso = "2025-01-01T00:00:00Z",
    to_iso = "2025-08-26T00:00:00Z"
  )
  df <- df |> filter(startedAt >= as.POSIXct("2025-01-01", tz = "UTC"))
  df <- df |> select(ds = startedAt, y = close) |> arrange(ds)
  # z-score normalization
  df <- df |> mutate(y = (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE))
  df$coin <- toupper(coin) %>% gsub("-USD", "", .)
  return(df)
})

cryptos_list <- bind_rows(cryptos_list)
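
The z-score step matters because the coins trade at very different price levels, while shape-based clustering should compare trajectories rather than magnitudes. The sketch below uses two made-up price paths (not real market data) to show that series with the same shape but different scales become identical after normalization. The helper name z_score is introduced only for this illustration.

```r
# Same normalization as in the loop above, wrapped in a helper for illustration
z_score <- function(y) (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)

# Two made-up price paths with the same shape but very different price levels
cheap <- c(1.0, 1.2, 0.9, 1.5, 1.1)
pricey <- cheap * 50000 + 20000

range(cheap)
range(pricey)

# After normalization both series have mean 0, sd 1, and the same values
all.equal(z_score(cheap), z_score(pricey))
```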

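Next, we will reshape the data into a wide format, where each row is a date and each column holds the normalized closing prices of one cryptocurrency. After dropping the date column, as.list() turns the table into a named list of numeric vectors, one per coin, which is a format tsclust() accepts.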

cryptos_list_wide <- cryptos_list  |> 
  pivot_wider(names_from = coin, values_from = y) |> 
  arrange(ds) |> 
  select(-ds) |>  
  as.list()
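
As a cross-check of what the reshaping produces, the following sketch builds the same kind of named list from a toy long-format table using base R’s split(). The toy data are made up. Unlike pivot_wider(), split() orders the list alphabetically by coin, and the final check matters because pivot_wider() fills dates that are missing for a coin with NA.

```r
# Toy long-format data: one row per (date, coin) pair, values are made up
long <- data.frame(
  ds = rep(as.Date("2025-01-01") + 0:2, times = 2),
  coin = rep(c("BTC", "ETH"), each = 3),
  y = c(-1, 0, 1, 1, 0, -1)
)

# Sort by coin and date, then split the values into one vector per coin
long <- long[order(long$coin, long$ds), ]
series_list <- split(long$y, long$coin)
str(series_list)

# Sanity checks before clustering: no missing values, equal lengths
anyNA(unlist(series_list))
length(unique(lengths(series_list))) == 1
```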

3 Performing Hierarchical Clustering

We will perform hierarchical clustering using the DTW distance and the “ward.D2” agglomeration method. Hierarchical clustering builds a full hierarchy of groups, so the number of clusters does not have to be fixed in advance (k only determines where the tree is cut), and the process is deterministic. The call below uses the following arguments:

k = 4 specifies the desired number of clusters.
type = "hierarchical" sets the clustering algorithm type.
distance = "dtw" uses the Dynamic Time Warping distance.
seed = 42 fixes the random seed for reproducibility (it has little effect here, since hierarchical clustering is deterministic).
control = hierarchical_control(method = "ward.D2") specifies the linkage method.
args = tsclust_args(dist = list(window.size = 7)) sets the DTW window constraint, so each daily observation can only be matched to observations about 7 days before or after it.

library(dtwclust)

# Perform hierarchical clustering
hc_4_ward <- tsclust(cryptos_list_wide,
  k = 4,
  type = "hierarchical",
  distance = "dtw",
  seed = 42,
  control = hierarchical_control(method = "ward.D2"),
  args = tsclust_args(dist = list(window.size = 7))
)

# View the clustering summary
hc_4_ward

hierarchical clustering with 4 clusters
Using dtw distance
Using PAM (Hierarchical) centroids
Using method ward.D2 

Time required for analysis:
   user  system elapsed 
  2.958   0.047   3.005 

Cluster sizes with average intra-cluster distance:

  size  av_dist
1    4 87.52806
2    2 80.73321
3   12 66.03449
4   12 51.73463

The output provides details about the clustering, including the distance measure, centroid method, linkage method, and cluster sizes with their average intra-cluster distances.
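
To illustrate what the window.size argument used above does, here is a minimal base R sketch of DTW with a band constraint around the diagonal. The function dtw_windowed is hypothetical and assumes equal-length series, an absolute-difference local cost, and a window counted in steps on either side of the diagonal. It is not the implementation dtwclust uses.

```r
# DTW restricted to cells with |i - j| <= window (illustrative sketch)
dtw_windowed <- function(x, y, window) {
  stopifnot(length(x) == length(y), window >= 0)
  n <- length(x)
  cost <- matrix(Inf, nrow = n + 1, ncol = n + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    # cells outside the band keep their Inf cost and can never be used
    for (j in max(1, i - window):min(n, i + window)) {
      cost[i + 1, j + 1] <- abs(x[i] - y[j]) +
        min(cost[i, j + 1], cost[i + 1, j], cost[i, j])
    }
  }
  cost[n + 1, n + 1]
}

a <- c(0, 0, 1, 2, 1, 0, 0)
b <- c(0, 1, 2, 1, 0, 0, 0) # same bump, one step earlier

# Distance for window sizes 0, 1 and 2
window_effect <- sapply(0:2, function(w) dtw_windowed(a, b, w))
window_effect
```

With a window of 0 only the diagonal is allowed, so the result equals the point-by-point distance. Once the window is at least as wide as the shift between the series, the alignment can absorb it. A narrower window also means fewer cells to evaluate, which is one reason to constrain it.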

4 Accessing Clustering Results

The tsclust() function returns an S4 object of class TSClusters. You can access its slots, such as the cluster assignments, using the @ operator.

# View cluster assignments for each time series
hc_4_ward@cluster

 BTC  ETH LINK  CRV  SOL  ADA AVAX  FIL  LTC DOGE ATOM  DOT  UNI  BCH  TRX NEAR 
   1    2    3    2    3    3    3    4    3    3    4    4    3    1    1    4 
 MKR  XLM  ETC COMP  WLD  APE  APT  ARB BLUR  LDO   OP PEPE  SEI SHIB 
   1    3    3    4    4    4    4    3    4    3    4    4    3    4
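
The assignment vector is easier to read when turned into a list of members per cluster. The sketch below uses the first eight assignments copied from the output above so that it runs on its own. In the live session you would pass hc_4_ward@cluster instead of the hand-typed vector cl.

```r
# First eight assignments copied from the printed output above
cl <- c(BTC = 1, ETH = 2, LINK = 3, CRV = 2, SOL = 3, ADA = 3, AVAX = 3, FIL = 4)

# Group the series names by cluster label
members <- split(names(cl), cl)
members

# Number of series per cluster
lengths(members)
```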

5 Visualizing Clustering Results

The plot() method for TSClusters objects offers various visualization types.

5.1 Dendrogram

A dendrogram visually represents the hierarchy of clusters.

par(mar = c(0, 4, 2, 2)) # Adjust margins for better plot
plot(hc_4_ward,
  xlab = "", ylab = "", sub = "",
  main = "Hierarchical Clustering Dendrogram (DTW, Ward.D2)"
)

Figure 1: Dendrogram of hierarchical clustering using DTW distance and Ward.D2 linkage.

5.2 Series and Centroids

Visualize the time series grouped by cluster, along with their representative prototypes (centroids). By default, prototypes for hierarchical clustering with PAM centroids are actual series from the data.

plot(hc_4_ward, type = "sc") # sc = series + centroids

Figure 2: Time series clustered with their centroids using DTW distance and Ward.D2 linkage.
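
A PAM-style centroid is a medoid: the member of a cluster with the smallest total distance to the other members. The sketch below shows that selection on a toy distance matrix built from made-up one-dimensional points. The helper medoid_index is hypothetical, and dtwclust’s own routine may differ in details such as tie handling.

```r
# Return the index of the element with the smallest total distance to all others
medoid_index <- function(d) {
  d <- as.matrix(d)
  which.min(rowSums(d))
}

# Made-up points; the last one is an outlier
pts <- c(a = 0, b = 1, c = 2, d = 3, e = 10)
med <- medoid_index(dist(pts))
med
```

Unlike a mean, the medoid is always one of the observed series, so the prototype stays an actual price path.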

You can also plot a specific centroid, and even customize its appearance.

plot(hc_4_ward,
  type = "centroids", clus = 1,
  linetype = 1, size = 1, alpha = 0.8
)

Figure 3: Specific centroid (cluster 1).

6 Comparing Multiple Clustering Solutions and Evaluation

In practice, choosing the optimal number of clusters (k) and other parameters is crucial. dtwclust allows you to test multiple configurations simultaneously and evaluate them using Cluster Validity Indices (CVIs).

Note

Cluster Validity Indices are quantitative metrics used to assess the quality and “purity” of clustering results. Since clustering is often an unsupervised process, CVIs provide an objective way to evaluate performance, especially when comparing different clustering algorithms or configurations.
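
As an illustration of how an internal CVI works, the following sketch computes the mean silhouette width directly from a distance matrix in base R. The function mean_silhouette is hypothetical, scores singleton clusters as 0 (a common convention), and is not guaranteed to match dtwclust’s implementation in every detail.

```r
# Mean silhouette width from a distance matrix and cluster labels (sketch)
mean_silhouette <- function(d, cluster) {
  d <- as.matrix(d)
  ids <- unique(cluster)
  stopifnot(length(ids) >= 2)
  widths <- vapply(seq_along(cluster), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0) # convention: singleton clusters score 0
    # a: mean distance to the other members of the own cluster (d[i, i] is 0)
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: mean distance to the nearest other cluster
    b <- min(vapply(
      setdiff(ids, cluster[i]),
      function(k) mean(d[i, cluster == k]),
      numeric(1)
    ))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

# Made-up points forming two obvious groups
d <- dist(c(1, 2, 10, 11))
good <- mean_silhouette(d, c(1, 1, 2, 2)) # groups match the data
bad <- mean_silhouette(d, c(1, 2, 1, 2))  # groups cut across the data
c(good = good, bad = bad)
```

A partition that matches the structure of the data scores close to 1, while one that cuts across it scores below 0.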

To accelerate the process, especially when testing many combinations, parallelization is highly recommended. dtwclust integrates with the foreach and doParallel packages for this purpose.

library(bigmemory)
library(doParallel)

# Define a range of k values and agglomeration methods to test
k_values <- 3:6
linkage_methods <- c("ward.D2", "average", "single", "complete")

# Initialize a parallel backend
# Use detectCores() - 1 to leave one core free; detectCores() can return NA,
# so fall back to a single core in that case
num_cores <- max(1, detectCores() - 1, na.rm = TRUE)

cl <- makeCluster(num_cores)
registerDoParallel(cl)

# Perform multiple hierarchical clusterings in parallel
hc_par <- tsclust(cryptos_list_wide,
  k = k_values,
  type = "hierarchical",
  distance = "dtw",
  seed = 42,
  control = hierarchical_control(method = linkage_methods),
  args = tsclust_args(dist = list(window.size = 7)),
  trace = TRUE
)


Calculating distance matrix...
Performing hierarchical clustering...
Extracting centroids...

    Elapsed time is 13.749 seconds.

# Stop the parallel cluster and revert to sequential computation
stopCluster(cl)
registerDoSEQ()

6.1 Evaluate the results using internal CVIs

We’ll use Silhouette (Sil), Dunn (D), and Calinski-Harabasz (CH) indices. Higher values generally indicate better clustering for these indices.

cvi_results <- lapply(hc_par, cvi, type = c("Sil", "D", "CH")) %>%
  do.call(rbind, .)

# Find the configuration that maximizes each CVI
optimal_indices <- apply(cvi_results, MARGIN = 2, FUN = which.max)

6.1.1 Display CVI results and optimal configurations

print(cvi_results)

            Sil         D        CH
 [1,] 0.3463459 0.1355412 21.828708
 [2,] 0.5144728 0.3810831  9.673635
 [3,] 0.1888799 0.2514864  2.709671
 [4,] 0.3369159 0.2050711 18.657689
 [5,] 0.2989078 0.1745498 13.778813
 [6,] 0.3477774 0.3091404  8.406351
 [7,] 0.3689743 0.3451739  6.165031
 [8,] 0.3290618 0.2245861 14.349598
 [9,] 0.2838572 0.2257579 12.063620
[10,] 0.3054947 0.3091404  6.059278
[11,] 0.3441495 0.4739975  5.473468
[12,] 0.2518630 0.2031676 11.466633
[13,] 0.2731625 0.2257579 10.492873
[14,] 0.3439047 0.3098938 10.261007
[15,] 0.2602655 0.4210699  7.605540
[16,] 0.2297632 0.2559787  9.802860

print(optimal_indices)

Sil   D  CH 
  2  11   1
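
The CVI matrix has no row labels, so the row numbers need to be mapped back to configurations. The summaries printed in the following subsections (row 1 is k = 3 with ward.D2, row 2 is k = 3 with average, row 11 is k = 5 with single) are consistent with the linkage method varying fastest within each k. Under that assumption, which is inferred from this article’s output rather than taken from the package documentation, the mapping can be sketched with expand.grid(). In a live session it is safer to inspect each element of hc_par directly.

```r
k_values <- 3:6
linkage_methods <- c("ward.D2", "average", "single", "complete")

# Assumed ordering: the first factor (method) varies fastest
config_grid <- expand.grid(
  method = linkage_methods,
  k = k_values,
  stringsAsFactors = FALSE
)

# Winning row numbers copied from the printed optimal_indices above
optimal_indices <- c(Sil = 2, D = 11, CH = 1)

winners <- data.frame(
  index = names(optimal_indices),
  config_grid[optimal_indices, ],
  row.names = NULL
)
winners
```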

6.1.2 Retrieve the best clustering based on Silhouette index

Let’s extract the clustering that achieved the highest Silhouette score.

best_clustering_sil <- hc_par[[optimal_indices["Sil"]]]
best_clustering_sil

hierarchical clustering with 3 clusters
Using dtw distance
Using PAM (Hierarchical) centroids
Using method average 

Time required for analysis:
   user  system elapsed 
  1.364   0.122  13.749 

Cluster sizes with average intra-cluster distance:

  size  av_dist
1    4 80.06529
2   25 84.14893
3    1  0.00000

6.1.3 Retrieve the best clustering based on Dunn index

Similarly, we can extract the clustering that achieved the highest Dunn score.

best_clustering_dunn <- hc_par[[optimal_indices["D"]]]
best_clustering_dunn

hierarchical clustering with 5 clusters
Using dtw distance
Using PAM (Hierarchical) centroids
Using method single 

Time required for analysis:
   user  system elapsed 
  1.364   0.122  13.749 

Cluster sizes with average intra-cluster distance:

  size  av_dist
1    3 62.58062
2    1  0.00000
3   24 78.78400
4    1  0.00000
5    1  0.00000

6.1.4 Retrieve the best clustering based on Calinski-Harabasz index

Finally, we can extract the clustering that achieved the highest Calinski-Harabasz score.

best_clustering_ch <- hc_par[[optimal_indices["CH"]]]
best_clustering_ch

hierarchical clustering with 3 clusters
Using dtw distance
Using PAM (Hierarchical) centroids
Using method ward.D2 

Time required for analysis:
   user  system elapsed 
  1.364   0.122  13.749 

Cluster sizes with average intra-cluster distance:

  size   av_dist
1    6 106.59632
2   12  66.03449
3   12  51.73463

This output helps you objectively compare different clustering outcomes and select the most suitable solution for your data based on various validity metrics.

7 Performing Clustering with Optimal Configuration

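The three indices disagree here. The Silhouette and Dunn winners both contain singleton clusters (sizes 4/25/1 and 3/1/24/1/1), which isolate a few outlying series and leave most coins in one large group, whereas the Calinski-Harabasz winner (k = 3, Ward.D2) yields more balanced clusters of 6, 12, and 12 series. Let’s take a closer look at that configuration.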

hc_3_ward <- tsclust(cryptos_list_wide,
  k = 3,
  type = "hierarchical",
  distance = "dtw",
  seed = 42,
  control = hierarchical_control(method = "ward.D2"),
  args = tsclust_args(dist = list(window.size = 7))
)

# View the clustering summary
hc_3_ward

hierarchical clustering with 3 clusters
Using dtw distance
Using PAM (Hierarchical) centroids
Using method ward.D2 

Time required for analysis:
   user  system elapsed 
  2.910   0.106   3.017 

Cluster sizes with average intra-cluster distance:

  size   av_dist
1    6 106.59632
2   12  66.03449
3   12  51.73463

# View cluster assignments for each time series
hc_3_ward@cluster

 BTC  ETH LINK  CRV  SOL  ADA AVAX  FIL  LTC DOGE ATOM  DOT  UNI  BCH  TRX NEAR 
   1    1    2    1    2    2    2    3    2    2    3    3    2    1    1    3 
 MKR  XLM  ETC COMP  WLD  APE  APT  ARB BLUR  LDO   OP PEPE  SEI SHIB 
   1    2    2    3    3    3    3    2    3    2    3    3    2    3
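
Because the k = 4 and k = 3 solutions share the same distance and Ward.D2 linkage, the k = 3 partition should simply be the k = 4 partition with two clusters merged. A cross-tabulation makes this visible. The vectors below are copied by hand from the two assignment outputs printed in this article so that the sketch runs on its own. In the live session you would use hc_4_ward@cluster and hc_3_ward@cluster.

```r
coins <- c(
  "BTC", "ETH", "LINK", "CRV", "SOL", "ADA", "AVAX", "FIL", "LTC", "DOGE",
  "ATOM", "DOT", "UNI", "BCH", "TRX", "NEAR", "MKR", "XLM", "ETC", "COMP",
  "WLD", "APE", "APT", "ARB", "BLUR", "LDO", "OP", "PEPE", "SEI", "SHIB"
)

# Assignments copied from the printed outputs (k = 4 and k = 3, Ward.D2)
k4 <- c(1, 2, 3, 2, 3, 3, 3, 4, 3, 3, 4, 4, 3, 1, 1, 4,
        1, 3, 3, 4, 4, 4, 4, 3, 4, 3, 4, 4, 3, 4)
k3 <- c(1, 1, 2, 1, 2, 2, 2, 3, 2, 2, 3, 3, 2, 1, 1, 3,
        1, 2, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 2, 3)

# Rows: clusters at k = 4, columns: clusters at k = 3
tab <- table(k4 = k4, k3 = k3)
tab

# Coins in the small k = 4 cluster that gets absorbed at k = 3
coins[k4 == 2]
```

Each k = 4 cluster falls entirely inside one k = 3 cluster: clusters 1 and 2 merge into the new cluster 1, while the two clusters of 12 series are unchanged.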

Plot the dendrogram for this clustering.

# Plot the dendrogram
par(mar = c(0, 4, 2, 2))
plot(hc_3_ward,
  xlab = "", ylab = "", sub = "",
  main = "Hierarchical Clustering Dendrogram (DTW, Ward.D2)"
)

The line plots of the time series grouped by cluster show very distinct patterns.

# Plot time series and their centroids
plot(hc_3_ward, type = "sc")

Figure 4: Time series clustered using DTW distance and Ward.D2 linkage.

8 Conclusion

dtwclust provides a modular and efficient framework for time-series clustering in R, implementing various algorithms (especially DTW-related ones) and allowing for customization and comparison. It serves as a strong starting point for time-series clustering tasks.