How we handled profile ID reconciliation using Treasure Data Unification and SQL

Author

Nurbakyt Kulbatshayeva

Senior Data Engineer at Artefact Germany

In this article we explain the challenges of ID reconciliation and demonstrate our approach to create a unified profile ID in Customer Data Platform, specifically Treasure Data.

Data was gathered from an offline Data Warehouse and online web site tracking.

Customer Data Platform

Customer Data Platform (CDP) allows businesses to collect, manage, and analyze customer data from various sources in a unified and centralized manner. The goal of a CDP is to create a comprehensive and cohesive view of each customer by aggregating data from multiple touchpoints, such as websites, mobile apps, social media, email interactions, and more.

Treasure Data

Treasure Data is a cloud-based Customer Data Platform. Here are some common business use cases for Treasure Data:

Customer Data Analysis. Treasure Data can be used to aggregate and analyze customer interaction data. This helps businesses create a unified view of customer behavior and preferences.
Personalized Marketing. By leveraging the customer inisghts gathered through Treasure Data, businesses can create personalized marketing campaigns. This includes targeted messaging, personalized recommendations, and tailored offers based on individual customer profiles.

ID reconciliation

ID reconciliation is a process in which data related to a person is compared and matched across different systems or databases to ensure accuracy, consistency, and a unified view of that person.

This is particularly important in scenarios where multiple systems or data sources store information about the same individuals, but the data may be incomplete or inconsistent.

The challenge of ID reconciliation

ID reconciliation encounters numerous challenges, especially with extensive datasets and diverse information sources. These challenges include inconsistent data formats, quality issues leading to false matches, and the management of duplicate records resulting from errors or system changes. The key challenge is the lack of a unique ID representing the user accross the different sources of information, tables and databases.

Organizations address these challenges by investing in advanced data tools like Treasure Data.

ID reconciliation implementation

We implemented ID reconciliation for one of our clients using Treasure Data Unification functionality. In this article we want to share our approach and implementation logic.

Given conditions: in CDP databases there were available tables with information about the user journey on the website, as well as customer information from the CRM (Customer Relationship Management) database in data warehouse.

The goal is to have a unique profile identifier in each CDP table. This will allow us to create a Master Segment with a 360 degree view on the customer and its journey.

Online data available: views on website pages, form submissions on website, other website specific information, such as product page details that are stored in the separate table.

Offline data available: CRM data from the data warehouse.

Overview of steps to create a unique profile ID:

Unification of data from website tracking
Create master table that contains both online and offline data
Enrich other tables with profile ID
Create audience table that contains profile_ids from all tables to have a 360 degree view of the customer

Web tables unification

In order to unify web data we will leverage the capability of Treasure Data Unification functionality. It requires defining the source tables and the columns that need to be unified in YAML-formatted configuration. The step-by-step guide on how to implement Unification in Treasure Data can be found in Treasure Data Documentation.

As a result it creates canonical ID in each table that is used in unification. The goal of canonical ID is to create an ID that represents each web user. It is useful when applied for web data, where it is challenging to identify user sessions and unify web data across different sessions.

We defined that canonical ID is based on the following merge keys:

canonical_ids:
– name: td_canonical_id
merge_by_keys: [email, phone, td_client_id] source_tables: [web_form_submitted, web_product_detail, web_page_visits]

The order of merge keys defines the priority. The first merge key identifies all web data with the same email (emails provided in the web form submitted in the website) as from the same user.

Second priority is phone number (submitted by visitors on the website): data with the same phone number is also identified as from the same user. Third priority is td_client_id — First-Party Cookie ID. The First-Party Cookie ID is persistent across the same domains. Different domains will have different td_client_id. Td_client_id changes if the website visitor clears the browser cookies.

The result of the unification is shown in the examples below.

Example 1: Treasure Data defined those 3 inputs as the same user ID (td_canonical_id), since they have the same email. Different phone numbers don’t affect the result because we defined that email has higher priority than phone number:

Example 2: when email is not provided, same phone number represents same user ID (td_canonical_id):

Example 3: when higher priority matching fields are not defined, then first party cookie ID is used. Same td_client_id = same user ID (td_canonical_id). Such unidentified visitors without any Personal Identifiable Information (PII) are common in web site visitors tables. Td_client_id is not fully reliable since it can change after cookie clearing, but in the absence of PII data, it is the best shot we have.

If we assume that only form submission on a website contains email and form, then what happens to the tables, where no PII is available, for example web_page_visits? Let’s assume website visits have only First-Party Cookie ID — td_client_id. In the first example web_page_visits table with those td_client_id ‘asd-123-qwe-456’, ‘zxc-234-ert-345’ and ‘567-tyu-ghj-234’ will have the td_canonical_id = ‘abcd1234’. Since web_form_submitted table has already defined that those 3 cookie ids belong to the same user, web_page_visits will rely on this information in the absence of email or phone.

Same logic applies to other web data tables that do not have PII data.

Create master table that joins online and offline data

Now that web data is unified across each other, it is time to unify online (web) and offline data. By offline data we mean CRM data that contains PII. From online data we will take usage of PII data (email, phone) in the web_form_submitted table.

In an ideal case data engineers can add a CRM table into a unification yml file, but we had a case when client’s condition on unifying online and offline data was more complex than automatic unification can provide. Thus we rely here on the power of SQL.

We performed FULL OUTER JOIN of web tables with form submission data and CRM tables ON email or phone. First and last names are not unique and therefore not reliable.

CREATE master_table AS
SELECT t.crm_client_id, d.td_canonical_id
FROM crm_user_table t
FULL OUTER JOIN web_form_submitted d
ON (t.email = d.email
OR t.phone = d.phone)

It is also advised to define a logic for matching score, in case several forms are submitted with the same email or phone number. For example, the web forms whose first and last names are matched with CRM customer’s names have higher priority than those who don’t.

Enrich other tables with profile_id

In this section we define the final profile_id that is unique for each user, identified and unidentified.

Web tables profile_id is the td_canonical_id, that has 1-to-many relation, as it is shown in the example tables – several web data rows can be related to the same profile.
When there is a match between web form data and CRM customer data, web table’s profile_id is the crm_client_id.
For online data with no match with offline data, profile_id stays td_canonical_id.

In summary, profile_id = COALESCE(crm_client_id, td_canonical_id).

For web tables crm_client_id are provided per web key (td_canonical_id) from the master table created in step 2. Meaning, if users submitted a form on a website, and they provided email or phone, we can identify them in the CRM database as known customers (in case they are already a customer of the company). As a result, we can say that those web page visits with the same cookie ID as submitted web form also belong to the identified CRM customer.

Select COALESCE(d.crm_client_id, t.td_canonical_id) as profile_id, t.*
from web_page_visits t
left join master_table d
on t.td_canonical_id = d.td_canonical_id
and d.rn_web_key = 1

rn_web_key is defined as follows:

row_number() over (
partition by td_canonical_id
order by match_score,
time desc) as rn_web_key

Its role is to avoid causing duplicates in the destination table as all web tables have many-to-many relationships. Moreover, it shares crm_client_id with the profile that has the best match between offline and online data. For each td_canonical_id it takes the one with the highest match score, and in case match score is the same, then the newest entry.

Create audience table that contains profile from all tables

Now that all tables have a profile id, it is time to create a Master Segment. Master Segment (Parent Segment) is a feature of Treasure Data that unifies all information regarding the user, including user interaction with the website and product. It allows Business Analysts, Marketers and other users to see in Audience Studio all identified and unidentified profiles and create various segments with them without writing any SQL queries.

To create a Master Segment we need to have an audience table that will be used to join all attribute and behavior tables. The goal of the audience table is to have all profiles — online and offline. And it will contain only profile_id, as it should be enough to join it with all other tables (crm_user_table, web_page_visits, web_form_submitted, web_product_detail).

The implementation is as simple as union of all tables that contain profile_id, and deduplication of all profile_ids.

Medium Blog by Artefact.

This article was initially published on Medium.com.
Follow us on our Medium Blog !

Read Our ArticleRead Our Article