DUENDERS

Case Study

Get a unified view of your customers

Record linking using Fuzzy Matching and LLMs on Azure Databricks

Overview

Entity matching is a common challenge in data engineering.

Consider a bank with branches across continents. Database 1 tracks customer transactions in Europe, while Database 2 logs transactions in Asia. A customer might strategically move funds between accounts in both regions, staying just below the reporting threshold in each individual database. By examining databases separately, the bank might miss this suspicious behavior. But by merging the data, patterns of consistent, strategic fund transfers emerge, flagging potential money laundering activities. A unified view ensures that banks can identify and report suspicious activities, crucial for AML compliance.

Similar scenarios apply to pretty much every industry. In this article, we present a project we did for a pharmaceutical company who was trying to match information about doctors among different third-party data providers.

Problem

The pharmaceutical industry often relies on data from multiple third-party sources about doctors and hospitals. Each of these sources might have its own way of representing information, leading to discrepancies.

Data Duplication: With multiple sources, there's a high chance of having duplicate records for the same doctor, leading to redundancy and potential misinformation.
Inconsistent Data Representation: A doctor might be listed with a middle name in one source and without it in another. Such inconsistencies make it hard to ascertain if two records refer to the same individual.
On the other hand, it is common for different people to have the same full name, think about James Smith, Thomas Muller or Mario Rossi.
Misspellings: many records are manually inserted even today, which means the names of people might include misspellings due to human error, missing accents on keyboard (think about umlaut in German) or phonetic similarity (Sofia or Sophia?).
Data Integrity: Mismatched records can lead to errors in communication, misallocation of resources, and potential missed opportunities in reaching out to healthcare professionals.
Complexity of Data Sources: Different platforms might provide data in varied formats, adding layers of complexity to the matching process.

Approach

Entity Matching with Zingg on Azure Databricks

Zingg's prowess in entity resolution combined with the computational power of Azure Databricks offers a robust and scalable solutions.

Large Language Models (LLMs) allowed us to increase performances thank to their understanding of the world.

Key Components:

Zingg: Specialized in entity resolution, Zingg can identify and link records referring to the same entity across different data sources using fuzzy matching, machine learning and an active learning framework which improves over time with
Azure Databricks: Zingg has support for both Databricks (beside Snowflake and on-premise), arguably the two most common data platforms on the cloud. Databricks provided the enterprise-grade platform for preprocessing and Zingg computations.
LLMs: yes they are everywhere, below you'll see how we used them for entity matching.

Solution

Data Integration: Ingest data from all third-party sources into Azure Databricks.

Data Preprocessing: Utilize the Medallion architecture, which consists of three layers:

Bronze Layer: This is the raw data ingestion layer where data is stored in its original state.
Silver Layer: In this layer, data undergoes cleansing, enrichment, and is stored in an intermediate state, ready for further refinement.
Gold Layer: This is the final layer where data is presented in its most refined state, having undergone all necessary transformations. It ensures that the data is in an optimal format for entity matching.

Record Linking with Fuzzy Matching: Zingg stands out as a premier solution for entity resolution, especially in scenarios demanding large-scale data handling. Here are some of its distinctive features and advantages:

Smart Blocking: Zingg employs an intelligent blocking mechanism, ensuring that only relevant records are compared, optimizing the matching process.
Interactive Labeling: One of Zingg's unique offerings is its interactive approach. It actively involves users in the labeling process, ensuring that the system learns and refines its matching criteria based on user feedback.
Scalability with Spark: Zingg operates on Spark, making it adept at handling massive datasets without compromising on performance. This scalability is particularly beneficial for enterprises dealing with extensive data sources.
Continuous Improvement: Zingg is not static. It's continuously evolving, with new features being rolled out regularly. This dynamic nature ensures that it remains at the forefront of entity resolution technology.

The output from Zingg is presented in clusters, each containing a score that represents the similarity between each record from dataset1 and the corresponding matched records from dataset2. These scores range between 0 and 1.

LLMs matching

We utilized the score generated by Zingg as an indicator of the model's certainty. A higher score indicates greater confidence from the model that a particular pair is a match.

We established two thresholds: t1 and t2, where t1 is set at a high value (for instance, 0.9) and t1 exceeds t2.

If the Zingg score is greater than t1, we accept Zingg's prediction.
We dismiss any pair with a score less than t2.
For pairs with a Zingg score between t1 and t2, we send the pair to the LLM. We then inquire whether the LLM perceives the two doctors as the same individual and seek its reasoning. This approach is potent because LLMs can grasp concepts that fuzzy matching might overlook. For instance, LLMs can discern whether a last name actually exists (based on their training data) or is likely a typographical error.

Moreover, there were known issues within the datasets that fuzzy matching couldn't rectify. With LLMs, we could simply incorporate data quirks and biases into the prompt, enabling the LLM to account for them.

Results

Unified Data: Achieved a single, consistent view of each doctor, eliminating duplicates and ensuring data integrity.

Efficiency and Cost Savings: Reduced the time and resources previously spent on manual data reconciliation.

Improved Outreach: With a unified view, the pharma company could tailor its outreach and communication strategies more effectively, ensuring that doctors received relevant information without redundancy.

The use LLMs with fuzzy matching allowed us to reach pretty much human performance. The only cases where the system failed were also cases where also a human couldn't clarify the ambiguity between a pair of doctors.

The reader might be wondering why not using LLMs for all pairs. It is technically possible but not a good solution because of cost and latency issues.

Entity Matching is not just limited to the pharmaceutical sector.

Want to unify customer data across your retail platforms?
Need to link records in your financial databases?
Looking to consolidate user profiles in your tech product?

The applications are vast and varied. Customers, employee, suppliers, companies. Tons of information is spread around in different databases and sources. Book a free meeting to discuss how Entity Matching can enable new insights by creating a unified view of all the stakeholders in your business.