Getting Ahead of Data Matching: Building the Right Strategies for Today and Tomorrow

January 22, 2022

Harald Smith

This article on data matching was originally published on Dataversity.

Every organization understands the importance of matching and connecting records for the same entity (and the role of data quality capabilities in doing so), but too often they take a limited view, building data matching processes that address the situation of the moment without taking future demands into account.

But not keeping future sustainability in mind is a mistake. Not only are organizations dealing with growing volumes and varieties of data, they’re also confronting increasing compliance requirements introduced by HIPAA, GDPR, CCPA, and other pending legislation. Although you may connect two records together today, there will always be new evidence to consider, including data that did not previously exist and sparse data records, and that’s why you need plans in place to address these updates going forward.

What is a sustainable strategy for data matching? Often, I hear questions from customers such as: How do we know that we put the right algorithms in place? How do we know that the algorithms we put in place are still good/valid? Or, how do we adjust or respond to new data?

To build a sustainable data matching strategy, first you need four building blocks: data literacy around data matching; data profiling to understand the data in use; business context for the data; and a data quality tool that supports a broad set of data matching functionality for batch and real-time systems.

Second, you need three additional capabilities to sustain and adapt to change: a broadly defined Master Data Management solution; a knowledge repository; and data matching available “as a service.”

Starting with the building blocks

Fundamentally, the data you have about a given entity, whether a person, a household, a product or asset, are how an organization (yours as well as other third parties) has represented or modeled that individual person or object – it’s never the actual person or object. The data are descriptive attributes or features that are used to identify the person or entity, and so the first foundational question you need to ask is: what data is enough to uniquely describe this person or that object?

For a person, there might be one piece of data that is sufficient, such as a National ID, a Social Security #, or a Loyalty Account #. For a household, it may be the address. However, you usually need a name coupled with some other data such as date of birth, address, phone, or email to be accurate. Each permutation you identify defines how you need to match the data.
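One way to make these permutations explicit is to record each one as a rule: a set of attributes that must all be populated before that comparison is attempted. The sketch below is illustrative only. The rule list, the field names, and the helper `applicable_rules` are assumptions made for the example, not part of any particular tool.

```python
# Hypothetical match rules: each rule is a set of attributes that, taken
# together, is assumed to be enough to identify a person.
MATCH_RULES = [
    frozenset({"national_id"}),
    frozenset({"loyalty_account"}),
    frozenset({"name", "date_of_birth"}),
    frozenset({"name", "email"}),
    frozenset({"name", "address"}),
]


def populated_fields(record):
    """Return the names of fields that hold a usable (non-empty) value."""
    return {key for key, value in record.items() if value not in (None, "")}


def applicable_rules(record_a, record_b):
    """Return the rules whose attributes are populated in both records."""
    shared = populated_fields(record_a) & populated_fields(record_b)
    return [rule for rule in MATCH_RULES if rule <= shared]
```

A rule being applicable only means the two records can be compared on those attributes. Whether they match is decided by the algorithms chosen for each attribute.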

However, since entering and capturing data is not perfect, there is a second foundational question you need to ask: what conditions occur with this data, and how does that inform the choice of matching algorithms?

We use data profiling and business rules to identify possible issues and validate the state of these data attributes. Problems to address include, but are not limited to:

Compound Data Domains: The issue of word order (e.g. first name/last name, product descriptions)
Mishearing: The issue of spelling and phonetics (e.g. Jon, John, Jan, Jean, Gene)
Mistyping: Handling keystroke issues (e.g. reversed digits in an ID or phone #)
Non-standard Formats: The issue of global or organizational variations (e.g. month-day-year or day-month-year)
Proximity: Handling issues with the location of devices (e.g. use of a centroid vs. precise coordinates)
Defaults: The issue of imprecise data (e.g. use of January 1 of the current year)
Constancy and Currency: Whether the data remains constant (a date of birth), or if it can change (such as address, phone, or even last name); and if the latter whether the data is current
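Several of the conditions above map onto well-known comparison techniques. The following is a minimal sketch using only the Python standard library: a token-sorted key for word order, a classic Soundex code for phonetic variation, an edit distance that counts a swapped pair of adjacent characters as one error, and a simple check for January 1 defaults. The function names are hypothetical, and a production data quality tool would offer far more refined versions of each.

```python
from datetime import date

# Standard Soundex consonant groups.
_SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def token_key(text):
    """Word order: compare names as a sorted bag of lowercase tokens."""
    return " ".join(sorted(text.lower().replace(",", " ").split()))


def soundex(name):
    """Mishearing: four-character Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":  # H and W do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


def osa_distance(a, b):
    """Mistyping: edit distance where an adjacent swap costs one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def looks_like_default(d):
    """Defaults: flag dates that fall on January 1 for closer review."""
    return d.month == 1 and d.day == 1
```

With this Soundex, Jon, John, Jan, and Jean all encode to J500, while Gene encodes to G500, so a phonetic code alone would still miss the Jean/Gene pair. That is one reason to combine several algorithms rather than rely on one.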

Some issues you’ll want to address through data cleansing with appropriate transformations and business rule validations. But once those are resolved, you also want to ensure that you can apply useful matching techniques, keeping in mind that different matching algorithms address different issues. Your data quality tool needs to provide a broad range of algorithms to handle these “fuzzy” differences, keystroke errors, distances, and defaults.

Ensuring business context

Matching algorithms produce scores or grades that together provide a standardized result. Scores above a threshold indicate a match, while scores below it indicate no match. It is up to you to apply business context or meaning to that result and find the relevant threshold.
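As a rough illustration of how per-attribute scores can roll up into a single result, the sketch below combines string similarities into a weighted score and compares it to a threshold. The weights, the threshold value, and the field names are hypothetical placeholders. Choosing them is exactly the business-context decision described above. The similarity measure is `difflib.SequenceMatcher` from the Python standard library, used here only as a stand-in for whatever algorithms your tool provides.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold; tune these against reviewed examples.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}
THRESHOLD = 0.85


def similarity(a, b):
    """Return a 0.0-1.0 similarity between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b):
    """Weighted sum of the per-field similarities."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def is_match(rec_a, rec_b, threshold=THRESHOLD):
    """Treat scores at or above the threshold as a match."""
    return match_score(rec_a, rec_b) >= threshold
```

Raising `THRESHOLD` moves the process toward under-matching, and lowering it moves toward over-matching.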

If your scoring is too strict (e.g. you use exact character-based matching) or threshold too high, you under-match the data. Customer or household data isn’t brought together, resulting in “duplicate” data and poor customer engagement. Or, if product data isn’t matched, your product catalog ends up with too many of the same thing, creating issues with inventory management or procurement. Often there is tangible business cost that can be identified for duplicated data, and with focus now on data-driven analytics and machine learning, that “cost” ripples downstream through your organization.

If your scoring is too “loose” or threshold too low, you over-match. In the past, for some situations such as mass-market mailings to households, this was ok, but with growing regulations such as the California Consumer Privacy Act, even this instance becomes problematic. Over-matching can have severe consequences: commingling different people’s accounts in financial services; linking different patients together in healthcare; or simply “losing” data for assets or products.

If you must err, err on the side of under-matching, but ideally you do neither: you match with the highest degree of precision. To get there, you must weigh evidence both for and against possible outcomes. Keys or distinct identifiers are ideal to connect data, but these are also the pieces of data that criminals can use to commit fraud. Where keys don’t exist, you need as many relevant attributes applied as you can get so that you can decide whether a match is valid or not. And that comes back to business context and domain expertise to ensure effective decisions.

The evidence in favor of a decision is that relevant data matches or closely matches based on the algorithm used. The business question is whether there are enough data elements that match that you’d reasonably consider it good or precise. If not, you should consider whether other sources of information are available that could append more detail.

The evidence against a decision requires business context to know what pieces of data you should use to keep records apart. Different key values, middle initials, generational suffixes (e.g. Sr. vs. Jr.), and directional attributes for streets (e.g. North Main vs. South Main) are all common examples for customer data. Specifications such as length, size, and quantity are relevant for product data. Matching algorithms need to include weighting or scoring factors that ensure these distinguishing differences count against a match.
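One simple way to account for evidence against a match is to check a short list of distinguishing fields and reduce the score whenever both records carry a value and the values disagree. This is a sketch under stated assumptions: the field names, the halving penalty, and the helper names are hypothetical, and a missing value on either side is deliberately not treated as a conflict so that sparse records are not penalized.

```python
# Hypothetical fields whose disagreement should keep records apart.
DISTINGUISHING_FIELDS = ("suffix", "middle_initial", "street_direction")


def _normalize(value):
    """Lowercase and drop surrounding spaces and trailing periods."""
    return value.strip().rstrip(".").lower()


def conflicts(rec_a, rec_b, fields=DISTINGUISHING_FIELDS):
    """Return fields that are populated in both records but disagree."""
    found = []
    for name in fields:
        a, b = rec_a.get(name), rec_b.get(name)
        if a and b and _normalize(a) != _normalize(b):
            found.append(name)
    return found


def adjusted_score(base_score, rec_a, rec_b, penalty=0.5):
    """Multiply the score by the penalty once per conflicting field."""
    return base_score * (penalty ** len(conflicts(rec_a, rec_b)))
```

A stricter variant could veto the match outright on any conflict. Which behavior is right depends on how reliable each distinguishing field is in your data.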

Any data attribute that contributes to your matching results should be identified as a Critical Data Element: each has a direct, immediate impact on your ability to get a single, consolidated picture of your master data. You need to understand how that data is changing over time and put processes and business rules in place to continually monitor and validate the data both before and after any cleansing and matching processes, to ensure that you can adapt your match strategy to changing conditions and data.
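Monitoring can start with very basic profiling measures tracked over time. The sketch below computes completeness and distinct-value counts for a chosen set of fields. The function name `profile` and the two metrics are illustrative assumptions, and a real profiling tool would report many more measures.

```python
def profile(records, fields):
    """Return completeness and distinct-value counts for each field."""
    total = len(records)
    report = {}
    for name in fields:
        # Treat None and empty strings as missing values.
        populated = [r.get(name) for r in records if r.get(name) not in (None, "")]
        report[name] = {
            "completeness": len(populated) / total if total else 0.0,
            "distinct": len(set(populated)),
        }
    return report
```

Running the same profile before and after cleansing, and again on each new load, gives a simple baseline for spotting drift in the attributes your match rules depend on.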

Building a future-looking matching strategy

Data matching is not a static activity, nor does it occur at only one place or one point in time within an organization. Matching is never “done”; it is an ongoing, critical, central process within all IT systems.

New data arrives regularly, whether through customer orders, patient visits, support calls, changes of address, or catalog updates. Names, addresses, and descriptions can all change. Sometimes there are new attributes that should be accounted for (e.g. mobile device IDs and IP addresses) and added into existing match algorithms. With new regulations such as CCPA, websites must capture opt-out information and ensure it is applied to all consumer data. This means that any future-looking strategy must handle updates in batch and real-time and do so consistently with the same criteria and algorithms.
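One way to keep batch and real-time results consistent is to route both through a single scoring function. The sketch below assumes a hypothetical in-memory `master` dictionary and a name-only comparison to stay short. The point is only that `match_batch` reuses `match_one`, so the criteria cannot diverge between the two paths.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.9  # hypothetical; shared by every entry point


def score(a, b):
    """Single scoring rule used by both the batch and real-time paths."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def match_one(incoming, master, threshold=THRESHOLD):
    """Real-time path: return the id of the best master record, or None."""
    best_id, best_score = None, 0.0
    for master_id, candidate in master.items():
        s = score(incoming, candidate)
        if s > best_score:
            best_id, best_score = master_id, s
    return best_id if best_score >= threshold else None


def match_batch(incoming_records, master, threshold=THRESHOLD):
    """Batch path: reuse match_one so both paths apply identical criteria."""
    return [match_one(record, master, threshold) for record in incoming_records]
```

When a new attribute such as a device ID needs to be added, changing `score` updates both paths at once.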

Master Data Management is a central piece of this strategy. Whether part of a formal MDM system, a Customer Information Management (CIM) or Customer Relationship Management (CRM) system, or a less formal approach, the ability to store the existing evidence, merge records when new evidence supports doing so, and split apart inappropriately linked data is critical to achieving precision. Logs of these merges and splits become inputs into evaluating current algorithms and where changes may be required.
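To make the merge, split, and logging ideas concrete, here is a minimal in-memory sketch. The `EntityStore` class, its method names, and the log fields are all hypothetical. A real MDM system would persist this information and carry much richer evidence with each decision.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EntityStore:
    # Maps an entity id to the set of source record ids linked to it.
    clusters: dict = field(default_factory=dict)
    # Append-only history of merge and split decisions.
    log: list = field(default_factory=list)

    def _log(self, action, **details):
        entry = {"action": action, "at": datetime.now(timezone.utc), **details}
        self.log.append(entry)

    def add(self, entity_id, record_id):
        """Link a source record to an entity, creating the entity if needed."""
        self.clusters.setdefault(entity_id, set()).add(record_id)

    def merge(self, keep_id, drop_id, reason):
        """Fold one entity into another when new evidence supports it."""
        dropped = self.clusters.pop(drop_id)
        self.clusters[keep_id].update(dropped)
        self._log("merge", kept=keep_id, dropped=drop_id, reason=reason)

    def split(self, entity_id, record_id, new_entity_id, reason):
        """Move an inappropriately linked record into its own entity."""
        self.clusters[entity_id].remove(record_id)
        self.add(new_entity_id, record_id)
        self._log("split", source=entity_id, target=new_entity_id,
                  record=record_id, reason=reason)
```

Reviewing the `reason` values in the log over time can show which rules lead to merges that are later split, which is a useful signal that an algorithm or threshold needs revisiting.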

As organizations look to leverage advanced analytics, AI, and machine learning, data is increasingly deployed into data lakes or onto cloud platforms. Ideally this includes core master data for analysts and data scientists to use. However, data matching processes have not necessarily been deployed along with the data (or are reliant on different tools for data “preparation” that do not offer the same, consistent matching algorithms), opening up gaps that can negatively impact any new initiative. To put a future-looking match strategy in place, you need to consider delivering “data matching” as a service: a capability that allows downstream users to leverage your enterprise data matching functionality in batch or real-time wherever their data resides.

To apply business context effectively and address changing requirements, you need to ensure that everyone working with this data has also achieved a level of data literacy around data matching and can access that information. The strategies and algorithms used, including criteria to support or reject a match, need to be clear and understood. Business glossaries are one tool that can be used to support this, although you should consider a Center of Excellence with broader documentation, examples, and instructions for leveraging available data matching services.

Conclusion

IT and business leaders need to ask: what is a sustainable process for data matching to ensure the highest level of precision and accuracy? Foundational skills, business context, and data quality tools are the building blocks, but to be effective as new needs and data emerge, you need to ensure that a broader set of consistent data matching services are available and understood within the organization.

As effective practices are established, with decisions and data that are monitored and recorded, further analytics and machine learning may be leveraged to optimize the processing and encapsulate the knowledge for the broader organization, leading to a sustainable data matching strategy and program.

Precisely offers data matching and entity resolution solutions to help you achieve the highest quality data for deeper, trusted insights, effective governance and compliance.

Also, make sure to check out our white paper: Data Quality Gets Smart