
The concept of identifying duplicates is fairly simple: do two records have the same values? If yes, then they are duplicates. However, when dealing with names and addresses the process can be much more complicated. It is obvious to the human mind that “Dr. John Smith” and “Smith, John Dr” are duplicates; but to the computer, and to many duplicate testing approaches, they will not be identified as duplicates. Addresses are even more challenging because of their free-form nature. We would recognize “123 – 456 Main St E”, “123 – 456 Main Street East” and “456 Main St East #123” as duplicate addresses; but, again, the computer might not.
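To see why, here is a minimal sketch in Python (the helper function is purely illustrative, not part of any particular tool): an exact comparison fails on the name example above, while even a crude normalization that lowercases, strips punctuation and ignores word order treats the two spellings as the same.

```python
# Illustrative only: why exact matching misses duplicates a human spots at once.

def rough_name_key(name: str) -> str:
    """Crude normalization: drop punctuation, lowercase, and sort the words
    so that word order no longer matters."""
    cleaned = "".join(ch for ch in name if ch.isalnum() or ch.isspace())
    return " ".join(sorted(cleaned.lower().split()))

a, b = "Dr. John Smith", "Smith, John Dr"
print(a == b)                                    # False -- exact match misses it
print(rough_name_key(a) == rough_name_key(b))    # True  -- normalized keys agree
```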

If humans are better at this, why not perform the duplicate test manually? Firstly, thousands of records would take many days to review manually. Secondly, the order of presentation is important. Unless the potential duplicates were listed one after the other, making comparison easier, we would not notice that record #2,950 (“Dr. John Smith”) and record #104,552 (“Smith, John Dr”) are duplicates. This means that we have to rely on the computer to assist us with our duplicate testing.

Before I discuss the merits of different approaches and algorithms for duplicate testing, I would like you to consider something: “Do you need to find all of the duplicates?” Your first reaction may be “yes”, but if your objective is to determine whether the controls preventing and detecting duplicates are working, then you only have to find ‘some’ of the duplicates. Finding 3,000 duplicates that were not prevented (entry controls) or detected (processing or output controls) allows you to conclude that the controls are not adequate and effective. The fact that there may be an additional ‘x’ duplicates that you did not find is not relevant to the binary (Yes/No) question “Are the controls over duplicates working?” Ideally, you would like to identify the majority of the duplicates so that you can attach a level of significance and impact to the analysis, and support the requirement to fix the controls. If you only found one duplicate you could legitimately say the controls over duplicates had failed, but the significance and impact of a single failure may not support the additional work required to tighten the controls.

The identification of duplicates ultimately comes down to finding a balance between the False Positives, False Negatives and True Duplicates. The ideal situation is one where the number of False Positives (records identified as duplicates which are not duplicates) is low; the number of False Negatives (records identified as not duplicates when they are duplicates) is low; and the number of True Duplicates (correctly identified duplicates) is high. This balance is directly affected by the criteria you select for the duplicates analysis. If you tighten your criteria for what constitutes a duplicate (e.g. using “Last Name, First Name and Initials” instead of “Last Name and First Name”), you will reduce the number of False Positives, but you will also increase the number of False Negatives. Conversely, if you loosen the criteria (e.g. using “Last Name” instead of “Last Name and First Name”), you will get fewer False Negatives, but more False Positives. So the selection of the criteria directly affects the identification of duplicates, and the number of False Positives and False Negatives. Where the balance point lies for your organization will depend on: the objective of your test for duplicates (why you are trying to identify duplicates); the impact of having duplicates; the risk tolerance level for duplicates; and the amount of time and effort you are willing to spend on identifying and validating the duplicate test results.
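As a rough illustration of this trade-off, the hypothetical records below (the field names are invented for the example) show how the count of flagged records changes as the matching key is tightened from “Last Name” alone to “Last Name, First Name and Initial”.

```python
from collections import Counter

# Invented example records; "last", "first" and "initial" are hypothetical field names.
records = [
    {"last": "Smith", "first": "John", "initial": "A"},
    {"last": "Smith", "first": "John", "initial": "B"},  # genuinely a different person
    {"last": "Smith", "first": "Jon",  "initial": "A"},  # same person as the first, with a typo
    {"last": "Jones", "first": "Mary", "initial": "C"},
]

def flagged_as_duplicates(fields):
    """Count records sharing a key built from the chosen criteria fields."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] > 1)

print(flagged_as_duplicates(["last"]))                      # 3 -- loose: also flags the different person (False Positive)
print(flagged_as_duplicates(["last", "first"]))             # 2 -- the typo'd true duplicate is already missed
print(flagged_as_duplicates(["last", "first", "initial"]))  # 0 -- tight: the true duplicate becomes a False Negative
```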

After selecting your criteria, the next most important task is to cleanse the data. Cleaner and more consistent data will produce fewer False Negatives and fewer False Positives. Without cleansing the data, comparing “Dr John Smith” and “Smith John Dr.” would result in a “not duplicate” assessment – a False Negative. However, the test would be more effective if all records were cleansed into the format “Last Name, First Name, Salutation”: both records would become “Smith, John, Dr.” and would be correctly identified as a duplicate.
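A sketch of what that cleansing step might look like is below. It assumes Western-style names and a short list of known salutations, and it uses the position of the salutation as a crude hint to word order, so treat it as illustrative rather than robust.

```python
# Minimal sketch: cleanse free-form names into one "Last Name, First Name,
# Salutation" layout so that a plain equality test then works.
SALUTATIONS = {"dr", "mr", "mrs", "ms"}

def normalize_name(raw: str) -> str:
    tokens = [t.strip(".,").lower() for t in raw.split()]
    names = [t for t in tokens if t not in SALUTATIONS]
    salutation = next((t for t in tokens if t in SALUTATIONS), "")
    if tokens and tokens[-1] in SALUTATIONS:   # "Smith John Dr." style: surname first
        last, first = names[0], names[-1]
    else:                                      # "Dr John Smith" style: given name first
        first, last = names[0], names[-1]
    return f"{last.title()}, {first.title()}, {salutation.title()}"

print(normalize_name("Dr John Smith"))   # Smith, John, Dr
print(normalize_name("Smith John Dr."))  # Smith, John, Dr
```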

So after “What criteria should I use?”, the next question should be “How can I cleanse (normalize) my criteria?” A closely related question is “How much time and effort should I spend on cleansing (normalizing) my data?”

If you go back to the factors that influence the answer to this question, you can arrive at a better idea of how much time and effort should be expended. For example, if the objective of the duplicates analysis was to “Determine the size of a known fraud”, then the False Positives and the False Negatives would have a significant impact. Fortunately, in this situation, you are likely to be dealing with a smaller number of criteria values (e.g. an employee or a vendor) and you can manually cleanse the variations in the data (e.g. “A.B.C. Inc”, “ABC Inc”, “A B C Inc” and “ABC Incorporated” can all easily be manually cleansed to “ABC Inc”), producing a better duplicate result.
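In code, that manual cleansing can be as simple as a hand-built lookup table over the handful of names in scope; the table below just reuses the example variations above.

```python
# With only a handful of vendors in scope, a hand-maintained mapping is enough.
MANUAL_CLEANSE = {
    "A.B.C. Inc": "ABC Inc",
    "A B C Inc": "ABC Inc",
    "ABC Incorporated": "ABC Inc",
}

def cleanse_vendor(name: str) -> str:
    # Fall back to the original value when no manual rule has been defined.
    return MANUAL_CLEANSE.get(name.strip(), name.strip())

for v in ("A.B.C. Inc", "ABC Inc", "A B C Inc", "ABC Incorporated"):
    print(cleanse_vendor(v))   # all four print "ABC Inc"
```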

If the objective was to “Test controls over duplicates”, you would still want to cleanse the data to decrease the number of False Negatives. This can be done by the computer, automatically dealing with common variations in the data (e.g. “Ltd.”, “Ltd” and “Limited” can all be normalized to “Limited”). However, other variations can make the normalization process more difficult. Ultimately, the amount of time and effort spent fully normalizing the data should be kept to a minimum unless the impact of having duplicates is high.
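For the automated case, a small set of substitution rules goes a long way. The sketch below uses a few illustrative regular-expression rules; it is nowhere near an exhaustive list.

```python
import re

# Illustrative substitution rules, not an exhaustive list.
RULES = [
    (re.compile(r"\bltd\b\.?", re.IGNORECASE), "Limited"),
    (re.compile(r"\binc\b\.?|\bincorporated\b", re.IGNORECASE), "Inc"),
    (re.compile(r"\bst\b\.?|\bstreet\b", re.IGNORECASE), "Street"),
]

def normalize(value: str) -> str:
    for pattern, replacement in RULES:
        value = pattern.sub(replacement, value)
    return " ".join(value.split())   # collapse repeated whitespace

for v in ("ABC Ltd.", "ABC Ltd", "ABC Limited"):
    print(normalize(v))              # all three print "ABC Limited"
```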

 

David Coderre is President of Computer Assisted Analysis Techniques and Solutions (CAATS). David was awarded a lifetime achievement award by the Canadian body of the Institute of Internal Auditors for his contributions to the field of Internal Audit and CAATs. He has published numerous books on the topic of Internal Audit and CAATs, including Internal Audit: Efficiency through Automation and Computer Aided Fraud Prevention and Detection: A Step by Step Guide.