= pd.read_csv("data/customers.csv")
customers = pd.read_csv("data/emails.csv")
emails = pd.read_csv("data/phones.csv") phones
Introduction
Traditional relational databases and spreadsheets fall short in capturing complex relationships among customers. Enter graph theory – a powerful framework for representing and analyzing interconnected data. By visualizing customer relationships as a graph, we can uncover hidden patterns, identify clusters, and improve data quality.
Motivation
Over time, customers records can become fragmented and duplicated. For example, a customer may use multiple email addresses or phone numbers when interacting with a company. Creating a new record for each email or phone can lead to duplicate records for the same customer. This is especially common in B2B scenarios, where customers often have multiple representatives. Furthermore, some customers represent multiple companies, and their records may be duplicated across different companies.
Doing any type of marketing analysis on such dataset can lead to incorrect results. We cannot be sure about the latest purchase, the total amount spent, or the number of orders. Is this customer a loyal one or not? Is that customer a new one or not? Is this customer going to leave us or they just started buying from another company? Do we need to send a discount to this customer or not? To answer these questions, we need to have customers database defragmented and deduplicated.
Merging records manually can be time-consuming and error-prone. By using graphs, we can represent the relationships between customers, emails, and phones and find groups of connected customers. This can help us identify duplicate records and perform actions depending on our business logic.
Sample Data
We have three datasets: customers
, emails
, and phones.
Each customer can have multiple emails and phones. The names, emails, and phones are generated randomly and do not correspond to real people, though the structure of the data is similar to what you might find in a real-world scenario. In fact, it is the sample taken from the real data, but the names and other personal information are generated randomly to replace the actual ones.
Take a look at the data.
3) customers.head(
customer_id | name | |
---|---|---|
0 | 330087 | William Sparks |
1 | 443237 | Joseph Williams |
2 | 329867 | Eddie Porter |
Length: 1000 Unique: 1000
3) emails.head(
customer_id | ||
---|---|---|
0 | 599100 | brian12@example.net |
1 | 330087 | emyers@example.com |
2 | 25494 | cindymurphy@example.net |
Length: 957 Unique: 626 Duplicated: 331 Empty: 0
3) phones.head(
customer_id | phone | |
---|---|---|
0 | 15962 | 876.997.0254 |
1 | 99723 | 001-706-213-0362 |
2 | 99723 | 886.527.4420x90003 |
Length: 855 Unique: 524 Duplicated: 331 Empty: 0
Icons for Nodes
Next chunk of code creates a dictionary of icons for different types of nodes in the graph. It will be used later to visualize the subgraphs.
import PIL
= {
icons "customer": "icons/customer.png",
"phone": "icons/phone.png",
"email": "icons/email.png",
}
= {k: PIL.Image.open(fname) for k, fname in icons.items()} images
Creating a Graph
Let’s create graph and add nodes. Each node will represent a customer, email, or phone. We will use the images dictionary to assign an image to each node, but it’s not necessary for the procedure, as well as setting the type of the node.
= nx.Graph()
G
= []
nodes
for x in emails["email"].dropna().unique():
=images["email"], type="email")
G.add_node(x, image
for x in phones["phone"].dropna().unique():
=images["phone"], type="phone")
G.add_node(x, image
for x in customers["customer_id"].unique():
=images["customer"], type="customer") G.add_node(x, image
Next, we will add edges to the graph. The edges will connect customers with their emails and phones.
= []
edges
for x in customers[["customer_id"]].merge(emails).values:
edges.append(x)
for x in customers[["customer_id"]].merge(phones).values:
edges.append(x)
G.add_edges_from(edges)
Finding Groups of Connected Customers
Customers that share the same email or phone will be connected by the edges. Let’s find groups of connected customers.
= list(nx.connected_components(G))
groups print("Groups:", len(groups))
Groups: 559
Visualizing the Graph
The number of connected components is rather large to visualize all of them, and most of the groups will contain only a few nodes. Let’s find the groups with the largest number of nodes and visualize them.
= pd.DataFrame([groups]).T
df = ["group"]
df.columns "size"] = df["group"].apply(len)
df["size"].hist(bins=20, log=True)
df["Group Size Distribution")
plt.title(;
plt.show() plt.close()
The simplest way to visualize the graph is to use the draw function from the networkx
library. We will use the nx.draw
function to visualize the graph. We will create a grid of subplots and visualize the top groups. Parameter seed
is set to 42 to make the layout reproducible.
= plt.subplots(3, 4, figsize=(8, 6))
fig, axes
= list(
top_groups
df.sort_values("size",
=False,
ascending
)len(axes.flatten()))
.head(
.index
)
for i, g in enumerate(top_groups):
= axes.flatten()[i]
ax = G.subgraph(groups[g])
subgraph = nx.spring_layout(subgraph, seed=42)
pos
nx.draw(
subgraph,=pos,
pos=False,
with_labels=25,
node_size=ax,
ax
)f"Group {g}");
ax.set_title(
plt.tight_layout();
plt.show() plt.close()
There are literally constellations of different shapes and sizes. Let’s visualize some of them in more detail.
Visualizing Subgraphs
Let’s visualize one of the largest group in more detail. We will use the nx.draw_networkx_edges
function to draw the edges and the imshow
function to display the icons of the nodes. We will also add the customer id to the customers’ nodes. The value of parameter seed
is set to the same value as in the previous chunk to keep the layout. You can change it to see different layouts.
= G.subgraph(groups[91])
subgraph = plt.subplots(figsize=(8, 8))
fig, ax
= nx.spring_layout(subgraph, seed=42)
pos
nx.draw_networkx_edges(
subgraph,=pos,
pos=ax,
ax=True,
arrows="-",
arrowstyle=10,
min_source_margin=10,
min_target_margin
)
= ax.transData.transform
tr_figure = fig.transFigure.inverted().transform
tr_axes
= (ax.get_xlim()[1] - ax.get_xlim()[0]) * 0.015
icon_size = icon_size / 2.0
icon_center
for n in subgraph.nodes:
= tr_figure(pos[n])
xf, yf = tr_axes((xf, yf))
xa, ya = plt.axes([xa - icon_center, ya - icon_center, icon_size, icon_size])
a "image"])
a.imshow(subgraph.nodes[n][if G.nodes[n]["type"] == "customer":
a.text(0.5,
0.5,
n,="center",
ha="center",
va=8,
fontsize="red",
color="white",
backgroundcolor=dict(color="white", facecolor="white", alpha=0.5),
bbox
)"off")
a.axis(
=True, bottom=True)
sns.despine(left
;
plt.show() plt.close()
Why Graphs and Not SQL?
We can see here that the customers in this group form pretty complex relationships. A customer may be connected to another one by the phone numbers, and the other one may be connected to the third one by the email, forming a chain of connections. I believe that it is nearly impossible to find this kind of relationship using SQL. The more complex the relationships are, the more time and effort it will take to find them using SQL. For example, if we have a chain of 10 customers, where each customer is connected to the next one by the phone number, it will take 10 joins to find this chain using SQL. If we have 100 customers in the chain, it will take 100 joins to find it using SQL, and the query will probably never complete. But it takes fractions of a second to find it using the graph.
Assigning Groups to Customers
Finally, we will assign a group to each customer. For that, we will expand the groups list and create a new DataFrame with the group_id
and customer_id
columns.
= pd.DataFrame([groups]).T
df = ["customer_id"]
df.columns = df.explode("customer_id")
df "group_id"] = df.index
df[3) df.tail(
customer_id | group_id | |
---|---|---|
557 | 601053 | 557 |
558 | 571.212.7377x69843 | 558 |
558 | 590385 | 558 |
Note that customer_id
column contains phone numbers and emails as well as customer ids, but when we merge the data, there will remain only the customer ids.
= customers.merge(df)
customers 3) customers.head(
customer_id | name | group_id | |
---|---|---|---|
0 | 330087 | William Sparks | 1 |
1 | 443237 | Joseph Williams | 4 |
2 | 329867 | Eddie Porter | 6 |
Let’s check the number of customers and unique customer ids to make sure that we didn’t lose any customers neither we added duplicates.
len(customers), len(customers["customer_id"].unique())
(1000, 1000)
Looks good. Now we can save the data to the file.
"data/customers_grouped.csv", index=False) customers.to_csv(
Conclusion
In this article, we explored how to merge customer records using graphs. We created a graph of customers, emails, and phones and found groups of connected customers. We assigned a group to each customer and saved the data to a file. This approach can help us identify duplicate records and perform actions depending on our business logic. We also visualized the graph and subgraphs to better understand the relationships between customers. This can be useful for marketing analysis, customer segmentation, and other tasks that require a deep understanding of customer relationships.