Data-driven decision-making is critical in today’s business landscape. But even if you collect vast amounts of data, it’s only as valuable as your ability to understand and draw insights from it.

Organizations in all industries face the challenge of “messy” data, with records stored across multiple systems. Bringing this data together so that it can be effectively managed and analyzed is easier said than done, since variations in the way records are captured across the sources (e.g. spelling mistakes, abbreviations, and even extra spaces) can complicate the process of synthesizing them.

At best, this obstacle costs organizations opportunities to gain meaningful insights. At worst, it leads to high-level mistakes, as leaders place unwarranted confidence in fragmented, unreliable data, and those mistakes can cost millions.

How do you put all the pieces together and ensure you have an accurate, up-to-date, and complete view of your most important information?

RecordLinker is a machine-learning-based solution to this problem, known as record linkage (aka entity resolution). Our clients use this system to merge data on, for instance, 150 million insurance policy records from thousands of different databases, despite all manner of variations in the spelling of policy coverage and insurance company names. (Guess how many unique, case-insensitive spellings of ‘Philadelphia Indemnity Insurance Company’ we found? 800.)

Given our success, we now help clients across diverse industries get a handle on their disparate data sources. But what if you wanted to create your own system? Where would you begin?

In this post, we aim to share what we have learned about in-house record linkage in order to help you on your journey. We’ll take you through several use cases of how to build your own record linkage system and outline best practices for your development team to follow.

The Problem: Multiple Databases

Let’s say you have several databases or systems that hold important data. Some of these are likely internal but spread across departments. Others may be managed by vendors or SaaS (Software-as-a-Service) companies. You might be trying to:

  • Migrate data from one system to another
  • Fix consistency issues in an information hub, such as a data mart, that draws on copies of records from multiple sources
  • Produce a reliable set of KPIs by integrating data from more than one database

In all these scenarios, the challenge you are facing is that records may have different IDs, business codes, and names across systems. For example, if you’re an insurance company, your General Liability coverage type might be named “General Liab” in one system and “Liability-Commercial” in another. Or, misspellings may come into play, such as when “Ibuprofen” is mistakenly logged as “Ibeprofen” by your data entry team.

If you tried to match these different records manually, it would be both time-consuming and prone to error. And if you wrote an explicit script to try and automate the process, you would still have to account for the thousands of different ways your data might vary. That’s where a flexible record linkage system comes in.
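To make that flexibility concrete, here is a toy sketch using Python’s standard `difflib`. The canonical names and the cutoff value are our own illustrative choices, and a real record linkage system needs far more than this, but it shows the kind of tolerance for spelling variation that an exact-match script lacks.

```python
from difflib import get_close_matches

# Hypothetical canonical coverage-type names (illustrative only)
canonical = ["General Liability", "Commercial Property", "Workers Compensation"]

def best_match(raw: str, cutoff: float = 0.6):
    """Return the closest canonical name for a raw input value, or None."""
    matches = get_close_matches(raw, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(best_match("General Liab"))      # abbreviation still resolves
print(best_match("Genral Liability"))  # survives a typo
print(best_match("zzzz"))              # no plausible match: None
```

An exact-match lookup would fail on the first two inputs; a similarity-based matcher recovers them, at the price of needing a tuned cutoff and human review of borderline cases.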

The Solution: Canonical Record Linkage

The first step in migrating data, or cleaning up a messy data mart, is to identify what the “correct” values should look like. For instance, an insurance company might decide that their General Liability policies should always be referred to in records as “General Liability” (i.e. not “General Liab” nor “Liability-Commercial”).

In this post, we will call these “correct values” canonical records. You may have seen the term “master data records” elsewhere; it is simply another name for the same concept. The process of standardizing your existing records so that they match the canonical records is called normalization (aka ‘record linkage’ or ‘entity resolution’).

But before we take you through the process of implementing your own record linkage system, let’s take a look at a popular, seemingly ready-made solution: Microsoft Excel.

Though Excel is a powerful program for many business tasks and offers the advantages of being widely available and user-friendly, it has serious limitations as a record linkage tool.

Microsoft Excel: the WRONG tool for record linkage

When organizations decide to get a handle on their data, the first solution they come up with is often this:

“Let’s build an Excel file where we link input records (things that we need to migrate) to our canonical records (the ‘correct’ values).”

Such an Excel spreadsheet has two columns: the input value (the record as it appears in a source system) and the canonical value it should map to.

When the canonical values are filled in, each input row is paired with its chosen canonical record.

Unfortunately, using Excel for record linkage presents two major challenges, and solving them is so tedious and difficult to manage that Excel is almost untenable for the job. Here’s why.

Problem 1: The problem of populating canonical record values

Organizations tend to use one of two approaches to deal with the second column in the Excel file (selecting the corresponding canonical record).

Drop-down lists

One approach is to use Excel’s Data Validation function to create a dropdown of the canonical record values. You can find a comprehensive guide to this process here. However, there are three problems with this strategy.

Firstly, someone has to manually create a list of canonical record values and keep it up-to-date. If, in two weeks’ time, you want to add, rename, or merge certain values, the process has to be repeated. In our observation, these types of changes to canonical values are needed more often than organizations tend to expect.

Secondly, a dropdown list with 5 or 10 values may not sound too cumbersome, but in a real data situation, your canonical records can consist of 300 or even 10,000 values. These ever-growing dropdowns can quickly become a source of frustration. Imagine scrolling through 5,000 rows just to find the value you need!

Thirdly, in some cases your canonical records are composite objects, i.e. they have two or more ‘parts’ (or dimensions). How do you fit a multi-part canonical record into a single-value drop-down list?

VLookup

Alternatively, some organizations try to use VLookup (see a guide at https://www.simplilearn.com/tutorials/excel-tutorial/vlookup-in-excel) to find matches between records in multiple sheets. VLookup has many useful applications, but fails miserably when applied to record linkage because it requires entries to be written exactly the same way in order to match. Since your goal is to link records in spite of different spellings and formats, this is obviously a huge limitation.

In the end, most organizations choose to populate the canonical values column using a drop-down, or even manually. As they grow and evolve, so too does their list of canonical values, and this process becomes increasingly difficult to manage. At worst, this creates a high potential for error; at best, it takes up valuable employee time.

Problem 2: The trouble with versions

Finally, dealing with new versions of the linkage Excel file can cause a great deal of confusion and error.

Suppose an employee creates Linkages-version1.xlsx, then emails that file to the company’s dev team. A developer copies the data from that file. Then, a few days later, the initial creator of the file realizes that some records were logged incorrectly, and sends Linkages-version2.xlsx to the dev team. The developer has to repeat the whole data transfer process, because figuring out what was changed between two versions would take even more time.

Typically, there is a near-constant flow of successive versions of the linkage Excel file, with an increasing risk of miscommunication.

Can you imagine a conversation like this? “Wait, I thought we agreed to use version 8 of the Excel file for this release! I emailed that one last Friday, but you’ve used version 7!”

Emailing these files, or storing them in a shared location such as Dropbox, always leads to pain and frustration. Simply don’t do it.

Building an In-house Record Linkage Solution

Implementing your own machine learning-based record linkage solution may sound challenging, and it certainly requires an initial investment of time and resources. But the data clarity you can achieve with a robust, reliable system will pay dividends over the long term, in the form of more efficient processes, deeper and more accurate insights, and reduced employee stress.

Prior to starting RecordLinker, we built an in-house solution for a client. It took our team of 5 people around 3 weeks to get the first prototype ready, another 6 months to make it production-ready, and an additional 6 months of subsequent iterations and improvements.

Your in-house record linking solution will consist of three major components: a database (central repository for the canonical records), a user interface, and machine learning capabilities. To create these elements and bring them all together, you will need a development team that knows how to build web-based UI, work with databases, and build and train machine learning models.

We compiled the following guidelines based on our own experience of building an in-house system. Though they are somewhat simplified here, sharing them with your development team will help the project start off on the right foot.

Database storage

Your database will house your canonical records, input records, and linkages between them, serving as a single source of truth for your organization’s critical linkage data.

Your database should contain the following elements:

CANONICAL_RECORD table:

  • ID – (PK) internal ID of the canonical record
  • VALUE – the canonical name (this is what you will be normalizing to)
  • BUSINESS_CODE – oftentimes, the canonical values include short business abbreviations (codes)

INPUT_RECORD table:

  • ID – (PK) internal ID of the input record
  • VALUE – string value of the input record (the value you are trying to normalize)

LINK table:

  • INPUT_RECORD_ID – FK to INPUT_RECORD.ID
  • CANONICAL_RECORD_ID – nullable, a FK to CANONICAL_RECORD.ID

Side note: you can merge INPUT_RECORD and LINK into a single table, but over time you might want more than one link for the same input record (for example, one ‘live’ link and a ‘draft-mode’ link pointing to a different canonical record). We leave this decision to you.

User Interface (UI) layer

Now, you need to build a UI that allows users to view, navigate, and edit the records in the database. There are many ways to go about building a UI, but whichever method you choose, focus on enabling users to easily perform the following tasks:

  • Authenticating users
    • Authentication ensures that only authorized users can access and edit the canonical records. This is an important element of data security.
    • Single sign-on (SSO) is our preferred authentication method, as it allows users to access records without hassle.
  • Managing canonical records
    • Users should be able to easily create, rename, and delete a canonical record.
    • Users should be able to merge several canonical records into one (when this happens, you have to find and update relevant records in the LINK table to point to a new canonical record ID).
    • The system should keep an audit trail of who renamed, merged, or deleted canonical records.
  • Importing data to be normalized
    • This process should be designed with business (non-technical) users in mind. The UI should enable the user to upload an Excel file and easily identify which column needs to be normalized. The code then uses that selection to populate the INPUT_RECORD table.
  • Linking to canonical records
    • The business user must be able to see all input records that are not yet linked, and efficiently link them to canonical records. You may choose to have them do this one by one. However, adding machine learning functionality will enable the system to automatically link those records that have a high confidence score, or, at the very least, to automatically suggest the top 10 candidates for each input record.
  • Exporting linkages as an Excel file (and via API)
  • Undoing certain linkages or restoring to a previous checkpoint
  • Allowing “draft” mode of linkage creation
    • In most real applications, teams prefer to have a “current” set of linkages. If the data team is preparing an upgrade to a new canonical model set, they should be able to keep such changes separate (and invisible) from the users of the current linkages.
  • Managing data access rights and permissions
    • As your data management team grows, you will want to be able to restrict access and permissions to some subsets of your data to avoid accidental changes that might have an impact on other teams.
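Of the tasks above, merging canonical records is the one that is easiest to get subtly wrong, because every affected row in LINK must be re-pointed before the merged records are deleted. A sketch of that operation, assuming the table and column names from the database section (illustration only, not production code):

```python
import sqlite3

def merge_canonical(conn: sqlite3.Connection, keep_id: int, merge_ids: list) -> None:
    """Merge several canonical records into the one identified by keep_id:
    re-point every affected LINK row, then delete the merged canonical rows.
    Wrapped in a transaction so a failure cannot leave dangling links."""
    placeholders = ",".join("?" * len(merge_ids))
    with conn:  # sqlite3: commits on success, rolls back on exception
        conn.execute(
            f"UPDATE link SET canonical_record_id = ? "
            f"WHERE canonical_record_id IN ({placeholders})",
            [keep_id, *merge_ids],
        )
        conn.execute(
            f"DELETE FROM canonical_record WHERE id IN ({placeholders})",
            merge_ids,
        )
```

In a real system this is also where you would write the audit-trail entry recording who performed the merge and which records were absorbed.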

Implementing Machine Learning (no, it’s not ‘just use ChatGPT’)

Machine learning is what will enable your record linkage solution to improve over time. It will grow more accurate and sophisticated as it is given more data to work with. Gradually, it will become better at connecting canonical records to new input records that contain mistakes or deviations in formatting.

We recommend that your team use AWS SageMaker infrastructure to implement the machine learning for your record linkage solution. Since AWS has a large library of tutorials (see https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker), we will not go into the specifics of implementation in this whitepaper.

However, we would like to share one valuable takeaway from our own implementation experience. We observed that building the initial model and plugging it into the UI on a developer’s machine only accounted for about 5% of the effort we spent on machine learning.

The remaining 95% was in creating robust processes and infrastructure to support the following objectives:

  • Keep strict control over deployment of the model to the production environment. This includes being able to test a pre-release version of the model, and knowing in advance what will change when it is eventually deployed to production.
  • Ensure you can roll back the model in production to a specific prior version (and revert the linkages it performed automatically).
  • Maintain a strict test harness that quantifiably reports any deterioration in predictions compared to a ‘gold standard’ input set.
  • Maintain a strict test harness for training-set upgrades: the model logic stays the same, but when you extend the training set with human-made linkages (say, once per week), the prediction results should not get worse.
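The gold-standard gate in the last two bullets boils down to comparing accuracies on a fixed, human-verified input set. A minimal sketch, in which all function names and shapes are our own illustrative assumptions:

```python
# Release gate: a new model version may ship only if it does not score
# worse than the current one on a fixed, human-verified gold-standard set.

def accuracy(predict, gold: dict) -> float:
    """Fraction of gold-standard inputs the model links correctly.
    `predict` maps a raw input string to a canonical value (or None)."""
    hits = sum(1 for raw, expected in gold.items() if predict(raw) == expected)
    return hits / len(gold)

def gate_release(new_predict, old_predict, gold: dict,
                 tolerance: float = 0.0) -> bool:
    """Allow deployment only if the new model is not measurably worse."""
    return accuracy(new_predict, gold) >= accuracy(old_predict, gold) - tolerance
```

Running this check automatically on every model or training-set change is what turns “the model should not get worse” from a hope into an enforced invariant.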

Factor these considerations into your machine learning implementation, and you will be well on your way to creating an adaptable, responsive system that can grow with your organization.

Wrapping It Up

Implementing an in-house record linkage system takes planning, patience, a keen understanding of your organization’s data management needs, and a team with the right specialized skills and expertise.

In the end, it will be worth it. A record linkage solution will go a long way toward helping you make the most of your data, and you will see the benefits in nearly every aspect of your business. Your analytics and predictions will grow more accurate, your business processes will become more efficient, and your employees will be free to focus on doing their jobs well rather than chasing records.

If you plan to start implementing a record linkage solution yourself, we are rooting for you. You are heroes, and doing a critical job for your company. If you have any questions, email ‘dev’ ‘at’ recordlinker.com. We are happy to share our knowledge.

If you decide to try RecordLinker and have your data normalized in a few weeks rather than a year, request a demo (click the blue button below) and mention this post for a discount.

Happy Record Linking!