Audimation Services has been acquired by Caseware International Learn More.

X
Icon


Blog Image

Tech Tip: Fuzzy Duplicates & Fuzzy Match


CaseWare IDEA® Version 10 introduced an Advanced Fuzzy Duplicate task, which identifies multiple similar records for up to three selected character fields. The output produces databases, including or excluding fuzzy matches with varying degrees of similarity to detect data entry errors, multiple data conventions for recording information and fraud.
 
This option can be found within the “Analysis” tab and under “Duplicate Key”.
 
Duplicate-Keys
 
The image below showcases the default selections, scanning for COMPANY names with 80% or greater similarity, including exact duplicates.
 
Fuzzy-Match-Keys
This is great for smaller files, but users have found the process to be CPU intensive when scanning files with several thousand of records. In addition, if they pick a percentage too low, they can end up with false positives and must run the process again.
 
Values such as “JOHN SMITH” and “JOHN J SMITH” will have 95% similarity, while “JAMES SMITH” and “JIM SMITH” have an 82% similarity. You should be familiar with your data to determine what similarity degree is best for the analysis.
 
Before IDEA 10.0 introduced “Fuzzy Duplicates”, Audimation developers utilized a different approach that supported files of several thousand records.
 
 
 
In circumstances such as seeking duplicate addresses between a Vendor/Customer database and an employee database or duplicate names within a single database, consider this approach:

  1. APPEND FIELD: PREC_NO to the file or files to be compared.
    1. Type: Numeric
    2. Size: 0 decimal places
    3. Parameter: @PRecNo()
    4. Description: Physical Record Number
  1. ADDRESS COMPARE? APPEND a KEY_FIELD consisting of the leading numbers & the 5 digit ZIP CODE to each file
    1. Type: Character
    2. Size: 50
    3. Parameter: @Str(@JustNumbersLeading( STREET_ADDRESS ),0,0) + “_” + @Left(ZIP_CODE,5)
    4. Description: Leading numbers of Street Address + Zip 5
  1. NAME COMPARE? APPEND a KEY_FIELD consisting of the 1st character of the NAME field
    1. Type: Character
    2. Size: 1
    3. Parameter:  @Left(NAME,1)
    4. Description: 1st character of Vendor or Customer Name
  1. If comparing 1 file against itself – perform a DIRECT EXTRACTION to create a copy for the next step.
  1. Next, use VISUAL CONNECTOR* to join the 2 tables you wish to compare on the KEY_FIELD you added.
  2. Finally, append a field SIMILARITY_DEGREE to compare the 2 values, NAMES or ADDRESSES
    1. Type: Numeric
    2. Size: 4 decimal places
    3. Parameter: @SimilarPhrase( NAME_1 ,  NAME_2 ) or @SimilarPhrase( ADDRESS_1 ,  ADDRESS_2 ) **
    4. Description: Fuzzy compare
  1. Use the CRITERIA window to the right under PROPERTIES to filter and determine the best percent value to use. As previously stated, a value somewhere between .82 and .95 will probably work best. Start with SIMILARITY_DEGREE > .82 and apply. Increase the value until you are confident you have eliminated what you might consider “false duplicates”, 1 = 100% match. If comparing 1 file against itself – include:    .AND. PREC_NO < PREC_NO1

Equation Editor

  1. When you have determined the criteria ideal for your data set, perform a DIRECT EXTRACTION to create a table of your “fuzzy matches”.

 
 
SUPPORTING RESOURCES
 
Tech Tip: Understanding Join & Visual Connector

 

IDEA Function: @SimilarPhrase

 

@SimilarPhrase measures the similarity between two specified phrases or Character fields. A phrase can be a character expression split by white space. A phrase can also include multiple words where the internal word sequence is not important. It returns the similarity degree between the two phrases as a numeric value ranging between 0 and 1, up to six decimal places. A similarity degree of 0 indicates that the phrases are completely different and a similarity degree of 1 indicates that they are identical. The higher the numeric value of the similarity degree, the more similar the two strings are.

 

Two phrases are considered identical only if they contain the same words. As mentioned prevously, @SimilarPhrase ignores the internal word sequence in each phrase; therefore, the sequence of the words can be different. For example, “Jupiter planet” and “planet Jupiter” would be considered identical.


Best Practices , Tech Tip



Posted By

By Kris Willison
Kris joined the Professional Services team in January of 2015 as a Solutions Specialist. She has an extensive background in Software and Database Development accumulated from thirty years in IT support with twenty years’ experience in database development, cleanup, audit and migration using Microsoft Access. In her time with Audimation, she has received client praise for her “Top Tier” engagement on Monitor and Scripting projects. Kris enjoys looking at problems from new angles to determine the most efficient means of meeting the clients’ needs. Kris has been breeding/showing purebred Balinese cats since 1972 and Oriental Longhairs since 1996. She also hosts one of the largest online pedigree database sites for Siamese and related breeds with nearly 600 users worldwide.


Related Posts
The Power of Field Statistics
Mar 07 IDEA’s Field Statistics can save you a wealth of time. It should be the first thing you go to after importing a file. Why? Every control total is available...
Using Data Analytics to Prepare for Fieldwork
Feb 20 Abraham Lincoln once said that if he had eight hours to chop down a tree, he would spend six sharpening his axe. Proper preparation can save hours of effort and...
No Image
Sep 19 Henry Ford once said that if he had asked people what they wanted, they would have said faster horses. Constant innovation keeps businesses thriving. Ford Motor...
BROWSER NOT SUPPORTED

This website has been designed for modern browsers. Please update. Update my browser now

×