
Automatically discovering and merging duplicate organizations in the address book

Introduction

One of the major strengths of Cerb5 is that it provides you with a consolidated record of your contacts and customers. Your collected data can be enhanced with features like Broadcast and Virtual Attendants to supplement processes like Customer Relationship Management (CRM). This information comes from a variety of sources (e.g. customer data entry in contact forms, worker data entry, imports, integrations) and is prone to duplication. When your address book contains many duplicates, the information you base important decisions on becomes fragmented and less reliable.

For example, if you distinguish paid customers from trial users (e.g. "freemium", evaluators) with a custom field on organization records, the main record for a company will likely be categorized properly. However, if duplicates exist for some of those organizations then you face one of two problems: (1) if the duplicates are categorized properly then your customer segment numbers are inflated, and the members of those organizations are split across several records; (2) if the duplicates are categorized improperly then you're likely to approach paying customers as if they were trial users, which fails to show the appreciation you want to convey. It can be insulting for a customer to spend months of evaluation time, and potentially a large portion of their budget, over several conversations, only to be treated as a stranger a few weeks later.

Here's what a duplicated organization might look like in your address book:

  • WebGroup Media, LLC.
  • Webgroup Media
  • WebGroupMedia
  • WebGroup Media LLC

Duplicates like this are especially frustrating when the interface "autocompletes" a list of suggested organizations based on what you start typing. Which of these organizations is the right one?

While it is possible to merge these duplicates every time you encounter them, that process requires you to stop what you're doing and switch tasks, and the cost of that context switching can be significant if you have to do it several dozen times per day.

We've created an optional plugin to help automatically discover duplication in your address book. Using these tools you can quickly examine duplicate records to verify they belong to the same company before merging them.

If your data is more reliable then your productivity and efficiency are also likely to improve.

Instructions

To use the Organization Dupe Finder plugin you must be using Cerb5 version 5.6.1 or later.

Download the Organization Dupe Finder plugin from GitHub. (These instructions will assume you have console access, but you can also download a ZIP file from GitHub and extract it to the same directory.)

  1. cd /path/to/cerb5/storage/plugins/

  2. git clone git://github.com/cerb5-plugins/wgm.org_dupe_finder.git

Activate the plugin in the web interface from Setup->Plugins.

Click the address book link in the top navigation menu, then select the new Find Dupes tab.

For smaller address books (e.g. a few thousand contacts), you can click on the Find Similar Orgs button to discover potential duplicates using all organizations. With a very large address book it's often more efficient and less tedious to distribute the work among several people. Using the "Starts with:" option you can divide alphabetic ranges between several people -- e.g. "You take A-E, I'll handle F-J, ...".

The world is a big place. It is possible that two or more organizations in your address book will have the exact same name and still be separate companies. For this reason, it is a good practice to click the peek icon to the right of an organization. You can compare known contact information, as well as the email addresses of existing members in the People tab.

Technical notes

The discovery of duplicate organizations is accomplished through the following process:

  • Punctuation and spaces are removed.
  • Common corporate suffixes (e.g. Inc, Pty, LLC, Oy, BV) are removed.
  • The Soundex algorithm converts the remaining text into phonetic notation.
  • The phonetic results are filtered to a minimum length and then grouped by similarity up to a maximum length.
  • Organization names that are entirely contained within the names of other organizations are grouped with them.
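
To make the steps above concrete, here is a minimal Python sketch of the first four. It is not the plugin's code (the plugin runs on PHP/MySQL). The suffix list, the function names, the `min_len` and `max_len` defaults, the simplified untruncated Soundex variant, and the choice to group by exact match on the truncated code are all assumptions made for illustration. The containment step is left out for brevity.

```python
import re
from collections import defaultdict

# Hypothetical suffix list; the plugin's real list is not shown in this article.
SUFFIXES = {"inc", "pty", "llc", "oy", "bv"}

# Standard Soundex letter-to-digit table; vowels, h, w and y have no code.
CODES = {
    ch: digit
    for letters, digit in (
        ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
        ("l", "4"), ("mn", "5"), ("r", "6"),
    )
    for ch in letters
}


def normalize(name):
    """Lowercase, drop trailing corporate suffixes, remove punctuation and spaces."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    while len(words) > 1 and words[-1] in SUFFIXES:
        words.pop()
    return "".join(words)


def soundex(text, max_len=8):
    """Simplified Soundex: literal first letter plus digits, cut at max_len."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return out[:max_len]


def group_candidates(names, min_len=4, max_len=8):
    """Group names whose phonetic keys match; skip keys shorter than min_len."""
    groups = defaultdict(list)
    for name in names:
        key = soundex(normalize(name), max_len)
        if len(key) >= min_len:
            groups[key].append(name)
    # Only groups with more than one member are potential duplicates.
    return [members for members in groups.values() if len(members) > 1]
```

Given the four sample names from the introduction plus an unrelated "Acme Corp", `group_candidates` returns a single group containing the four WebGroup Media variants, which a worker would then inspect before merging.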

Strengths:

  • This process is capable of finding many representations for the same company that are separated by subtle spelling, spacing, or formal suffixes. These are the most likely differences in your address book entries.

  • The process is fast; tens of thousands of organizations can be compared within seconds. Execution speed is favored over accuracy in edge cases in accordance with the Pareto Principle (i.e. 80% of the dupes can be discovered with 20% of the effort, and finding the remaining 20% would likely take roughly 4X more execution time). Outliers can still be merged manually.

Limitations:

  • The Soundex algorithm (in its PHP/MySQL implementation) keeps the literal first letter of each phrase. For example, Soundex("Craft")=C613 and Soundex("Kraft")=K613. This means that, by default, the algorithm is not suitable for finding duplicates where the first letter is replaced by a different letter with a similar sound. This can be offset by adding the same prefix to every line of text being compared, although this also has the potential to introduce more false positives. We believe that misspellings at the beginning of an organization name are likely to be rare, and that the hassle of extra false positives outweighs the benefit, so we have elected to leave this limitation in place.

  • The Soundex algorithm may provide many false positive groupings for companies that are similar but distinct. For this reason we do not automatically merge duplicates, and instead tools are provided so that workers can perform a quick inspection of the records when in doubt.

  • The Soundex algorithm was designed around English pronunciation and may not produce optimal results for other languages.

  • This process won't find every duplicate entry in your address book, although it has demonstrated great success in finding the majority of them.
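
The first-letter limitation and the prefix workaround described above can be illustrated with a short Python sketch. It uses a classic four-character Soundex written for this example (not the plugin's PHP/MySQL code), and the "A" prefix is an arbitrary, hypothetical choice.

```python
# Standard Soundex letter-to-digit table; vowels, h, w and y have no code.
CODES = {
    ch: digit
    for letters, digit in (
        ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
        ("l", "4"), ("mn", "5"), ("r", "6"),
    )
    for ch in letters
}


def soundex(text):
    """Classic Soundex: literal first letter plus three digits, zero-padded."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (out + "000")[:4]


PREFIX = "A"  # hypothetical shared prefix, so the first letter is always the same

# Without a prefix, the differing first letters keep these names apart.
plain = [soundex(n) for n in ("Craft", "Kraft", "Graft")]

# With a prefix, the real first letter is encoded as a digit like any other.
prefixed = [soundex(PREFIX + n) for n in ("Craft", "Kraft", "Graft")]
```

Here `plain` is `["C613", "K613", "G613"]`, while `prefixed` is `["A261", "A261", "A261"]`. The prefix lets "Craft" and "Kraft" match, but it also pulls in the unrelated "Graft", which is the kind of additional false positive this trade-off is meant to avoid.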


Properties ID: 000083   Views: 5202   Updated: 3 years ago