
Getting Ahead of Data Matching: Building the Right Strategies for Today and Tomorrow – Part 2

January 23, 2020

Harald Smith

This article on Data Matching was originally published on Dataversity by Harald Smith. Part 1 started with a background on data matching and how to get started.

Ensuring business context

Matching algorithms produce scores or grades that together provide a standardized result. Scores above a threshold indicate a match, while scores below it indicate no match. It is up to you to apply business context or meaning to that result and find the relevant threshold.

If your scoring is too strict (e.g. you use exact character-based matching) or your threshold is too high, you undermatch the data. Customer or household data isn’t brought together, resulting in “duplicate” data and poor customer engagement. Or, if product data isn’t matched, your product catalog ends up with too many entries for the same thing, creating issues with inventory management or procurement. Often there is a tangible business cost that can be identified for duplicated data, and with the focus now on data-driven analytics and machine learning, that “cost” ripples downstream through your organization.

If your scoring is too “loose” or your threshold is too low, you overmatch. In the past, for some situations such as mass-market mailings to households, this was acceptable, but with growing regulations such as the California Consumer Privacy Act (CCPA), even this instance becomes problematic. Overmatching can have severe consequences: commingling different people’s accounts in financial services; linking different patients together in healthcare; or simply “losing” data for assets or products.
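To make the effect of the threshold concrete, here is a minimal sketch. It assumes a simple character-based similarity from Python's standard `difflib` module as a stand-in for whatever scoring your matching tool actually uses; the names, the helper functions, and the threshold values are illustrative only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_match(a: str, b: str, threshold: float) -> bool:
    """Treat a score at or above the threshold as a match."""
    return similarity(a, b) >= threshold


# Illustrative pairs (made-up names, not real data).
pairs = [
    ("Jon Smith", "John Smith"),    # plausibly the same person, spelled differently
    ("John Smith", "Joan Smythe"),  # plausibly two different people
]

# Try a strict, a moderate, and a loose threshold on the same pairs.
for threshold in (1.0, 0.9, 0.6):
    decisions = [is_match(a, b, threshold) for a, b in pairs]
    print(f"threshold={threshold}: {decisions}")
```

With these inputs, the strict threshold of 1.0 rejects both pairs (undermatching the likely duplicate), 0.9 accepts only the first pair, and 0.6 accepts both (overmatching two likely different people). Where the right cut-off sits for your data is a business decision, not something the score alone can tell you.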

Err on the side of undermatching, but ideally you don’t do either: you match with the highest degree of precision. To get there, you must weigh evidence both for and against possible outcomes. Keys or distinct identifiers are ideal to connect data, but these are also the pieces of data that criminals can use to commit fraud. Where keys don’t exist, you need as many relevant attributes applied as you can get so that you can decide whether a match is valid or not. And that comes back to business context and domain expertise to ensure effective decisions.

The evidence in favor of a decision is that relevant data matches or closely matches based on the algorithm used. The business question is whether enough data elements match that you’d reasonably consider the result good or precise. If not, you should consider whether other available sources of information could append more detail.

The evidence against a decision requires business context to know what pieces of data you should use to keep records apart. Different key values, middle initials, generational suffixes (e.g. Sr. vs. Jr.), and directional attributes for streets (e.g. North Main vs. South Main) are all common examples for customer data. Specifications such as length, size, and quantity are relevant for product data. Matching algorithms need to include weighting or scoring factors that ensure such distinct differences count against a match.
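One way to picture evidence for and against a match is a score that adds weight when attributes agree and subtracts weight when a distinguishing attribute clearly disagrees. The sketch below is an assumption-laden illustration, not a description of any particular product: the field names, weights, penalties, and threshold are all hypothetical.

```python
# Hypothetical weights: agreement on these fields is evidence FOR a match.
AGREE_WEIGHTS = {
    "last_name": 4.0,
    "first_name": 3.0,
    "street_name": 3.0,
    "postal_code": 2.0,
}

# Hypothetical penalties: a clear difference here is evidence AGAINST a match.
DISAGREE_PENALTIES = {
    "suffix": -6.0,            # e.g. Sr. vs. Jr.
    "street_direction": -5.0,  # e.g. North Main vs. South Main
    "middle_initial": -3.0,
}

MATCH_THRESHOLD = 8.0  # illustrative cut-off only


def norm(value) -> str:
    """Normalize a value for comparison; missing values become an empty string."""
    return (value or "").strip().lower()


def score_pair(rec_a: dict, rec_b: dict) -> float:
    """Sum the evidence for and against two records being the same entity."""
    total = 0.0
    for field, weight in AGREE_WEIGHTS.items():
        a, b = norm(rec_a.get(field)), norm(rec_b.get(field))
        if a and a == b:
            total += weight
    for field, penalty in DISAGREE_PENALTIES.items():
        a, b = norm(rec_a.get(field)), norm(rec_b.get(field))
        # Penalize only when both values are present and differ;
        # a missing value is treated as no evidence either way.
        if a and b and a != b:
            total += penalty
    return total


senior = {"first_name": "James", "last_name": "Lee", "suffix": "Sr.",
          "street_name": "Main", "postal_code": "02139"}
junior = {"first_name": "James", "last_name": "Lee", "suffix": "Jr.",
          "street_name": "Main", "postal_code": "02139"}

print(score_pair(senior, junior) >= MATCH_THRESHOLD)
```

Here the two records agree on every name and address field (12 points), but the differing suffix subtracts 6, which leaves the pair below the illustrative threshold and keeps father and son apart.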

Any data attribute that contributes to your matching results should be identified as a Critical Data Element, because it has a direct, immediate impact on your ability to get a single, consolidated picture of your master data. You need to understand how that data is changing over time. Put processes and business rules in place to continually monitor and validate the data both before and after any cleansing and matching processes, so that you can adapt your match strategy to changing conditions and data.
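Monitoring a Critical Data Element can start with something as simple as comparing basic profile statistics between two snapshots. The following is a minimal sketch under stated assumptions: records are plain dictionaries of text attributes, and the function names and the 5% tolerance are hypothetical choices for illustration.

```python
def profile(records: list, critical_fields: list) -> dict:
    """Return the fill rate and distinct-value count for each critical field."""
    total = len(records)
    stats = {}
    for name in critical_fields:
        # Treat None and blank strings as missing (assumes text attributes).
        values = [str(r.get(name) or "").strip() for r in records]
        present = [v for v in values if v]
        stats[name] = {
            "fill_rate": len(present) / total if total else 0.0,
            "distinct": len(set(present)),
        }
    return stats


def fill_rate_drops(before: dict, after: dict, max_drop: float = 0.05) -> list:
    """List fields whose fill rate fell by more than max_drop between two profiles."""
    return [
        name for name in before
        if before[name]["fill_rate"] - after[name]["fill_rate"] > max_drop
    ]


# Illustrative snapshots (made-up data): one postal code is lost in the later one.
last_month = [
    {"postal_code": "02139", "suffix": "Jr."},
    {"postal_code": "02140", "suffix": ""},
]
this_month = [
    {"postal_code": "02139", "suffix": "Jr."},
    {"postal_code": None, "suffix": ""},
]

fields = ["postal_code", "suffix"]
print(fill_rate_drops(profile(last_month, fields), profile(this_month, fields)))
```

A drop like this on an attribute that feeds your match rules is a signal to investigate the source before the matching results quietly degrade.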

Building a future-looking matching strategy

Data matching is not a static activity, nor does it occur at only one place or one point in time within an organization. Matching is never “done”; it is an ongoing, critical, central process within all IT systems.

New data arrives regularly, whether through customer orders, patient visits, support calls, changes of address, or catalog updates. Names, addresses, and descriptions can all change. Sometimes there are new attributes that should be accounted for (e.g. mobile device IDs and IP addresses) and added into existing match algorithms. With new regulations such as CCPA, websites must capture opt-out information and ensure it is applied to all consumer data. This means that any future-looking strategy must handle updates both in batch and in real time, and do so consistently with the same criteria and algorithms.
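A simple way to keep batch and real-time processing consistent is to route both through one shared match rule. The sketch below assumes a deliberately simplistic, hypothetical rule (same email, or same name and postal code); the point is the structure, where both paths call the same function, rather than the rule itself.

```python
def normalize(record: dict) -> dict:
    """Lowercase and trim every non-missing attribute so both paths compare alike."""
    return {k: str(v).strip().lower() for k, v in record.items() if v is not None}


def is_same_entity(a: dict, b: dict) -> bool:
    """Single shared match rule (hypothetical): same email, or same name and postal code."""
    a, b = normalize(a), normalize(b)
    if a.get("email") and a.get("email") == b.get("email"):
        return True
    same_name = bool(a.get("name")) and a.get("name") == b.get("name")
    same_postal = bool(a.get("postal_code")) and a.get("postal_code") == b.get("postal_code")
    return same_name and same_postal


def match_one(incoming: dict, known_records: list) -> list:
    """Real-time path: find known records that match a single incoming record."""
    return [known for known in known_records if is_same_entity(incoming, known)]


def match_batch(incoming_records: list, known_records: list) -> list:
    """Batch path: apply exactly the same rule to every incoming record."""
    return [match_one(record, known_records) for record in incoming_records]


# Illustrative data only.
known = [{"id": "crm:1", "name": "Ana Diaz", "email": "ana@example.com", "postal_code": "94105"}]
update = {"name": "ANA DIAZ ", "email": None, "postal_code": "94105"}

print(match_one(update, known) == match_batch([update], known)[0])
```

Because the batch path is only a loop over the real-time path, a change to the rule takes effect in both places at once.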

Master Data Management is a central piece of this strategy. Whether part of a formal MDM system, a Customer Information Management (CIM) or Customer Relationship Management (CRM) system, or a less formal approach, the ability to store the existing evidence, merge records when new evidence supports doing so, and split apart inappropriately linked data is critical to achieving precision. Logs of these merges and splits become inputs into evaluating current algorithms and where changes may be required.
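The merge, split, and log capabilities described above can be sketched in a few lines. This is an illustrative data structure only, assuming an in-memory mapping from source record IDs to master IDs; the class name, method names, and record IDs are hypothetical and do not reflect how any specific MDM product works.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class MasterIndex:
    """Maps source record IDs to master IDs and logs every merge and split."""
    master_of: dict = field(default_factory=dict)  # record ID -> master ID
    log: list = field(default_factory=list)        # audit trail of decisions
    _next_id: count = field(default_factory=lambda: count(1))

    def add(self, record_id: str) -> int:
        """Register a record under its own new master ID if it is not known yet."""
        if record_id not in self.master_of:
            self.master_of[record_id] = next(self._next_id)
        return self.master_of[record_id]

    def merge(self, record_a: str, record_b: str, reason: str) -> None:
        """Link everything under record_b's master to record_a's master."""
        keep, drop = self.add(record_a), self.add(record_b)
        if keep == drop:
            return  # already linked, nothing to do or log
        for rec, master in self.master_of.items():
            if master == drop:
                self.master_of[rec] = keep
        self.log.append(("merge", record_a, record_b, reason))

    def split(self, record_id: str, reason: str) -> None:
        """Detach one inappropriately linked record into its own master."""
        if record_id not in self.master_of:
            raise KeyError(record_id)
        self.master_of[record_id] = next(self._next_id)
        self.log.append(("split", record_id, reason))


index = MasterIndex()
index.merge("crm:1", "web:7", reason="same email address")
index.merge("crm:1", "erp:3", reason="same tax ID")
index.split("web:7", reason="different generational suffix")

for entry in index.log:
    print(entry)
```

The log is the interesting part: a run of splits that all cite the same reason suggests the match rules are missing a piece of evidence against a match and may need to change.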

As organizations look to leverage advanced analytics, AI, and machine learning, data is increasingly deployed into data lakes or onto cloud platforms. Ideally this includes core master data for analysts and data scientists to use. However, data matching processes have not necessarily been deployed along with the data, or they rely on different data “preparation” tools that do not offer the same, consistent matching algorithms. This opens up gaps that can negatively impact any new initiative. To put a future-looking match strategy in place, you need to consider delivering “data matching” as a service: a capability that allows downstream users to leverage your enterprise data matching functionality in batch or in real time wherever their data resides.

To apply business context effectively and address changing requirements, you need to ensure that everyone working with this data has also achieved a level of data literacy around data matching and can access that information. The strategies and algorithms used, including criteria to support or reject a match, need to be clear and understood. Business glossaries are one tool that can be used to support this, although you should consider a Center of Excellence with broader documentation, examples, and instructions for leveraging available data matching services.

Conclusion

IT and business leaders need to ask: what is a sustainable process for data matching to ensure the highest level of precision and accuracy? Foundational skills, business context, and Data Quality tools are the building blocks, but to be effective as new needs and data emerge, you need to ensure that a broader set of consistent data matching services is available and understood within the organization. As effective practices are established, with decisions and data that are monitored and recorded, further analytics and machine learning may be leveraged to optimize the processing and encapsulate the knowledge for the broader organization, leading to a sustainable data matching strategy and program.
