Audimation Services has been acquired by Caseware International Learn More.


Blog Image

Tech Tip: Fuzzy Duplicates & Fuzzy Match

CaseWare IDEA® Version 10 introduced an Advanced Fuzzy Duplicate task, which identifies multiple similar records for up to three selected character fields. The output produces databases, including or excluding fuzzy matches with varying degrees of similarity to detect data entry errors, multiple data conventions for recording information and fraud.
This option can be found within the “Analysis” tab and under “Duplicate Key”.
The image below showcases the default selections, scanning for COMPANY names with 80% or greater similarity, including exact duplicates.
This is great for smaller files, but users have found the process to be CPU intensive when scanning files with several thousand of records. In addition, if they pick a percentage too low, they can end up with false positives and must run the process again.
Values such as “JOHN SMITH” and “JOHN J SMITH” will have 95% similarity, while “JAMES SMITH” and “JIM SMITH” have an 82% similarity. You should be familiar with your data to determine what similarity degree is best for the analysis.
Before IDEA 10.0 introduced “Fuzzy Duplicates”, Audimation developers utilized a different approach that supported files of several thousand records.
In circumstances such as seeking duplicate addresses between a Vendor/Customer database and an employee database or duplicate names within a single database, consider this approach:

  1. APPEND FIELD: PREC_NO to the file or files to be compared.
    1. Type: Numeric
    2. Size: 0 decimal places
    3. Parameter: @PRecNo()
    4. Description: Physical Record Number
  1. ADDRESS COMPARE? APPEND a KEY_FIELD consisting of the leading numbers & the 5 digit ZIP CODE to each file
    1. Type: Character
    2. Size: 50
    3. Parameter: @Str(@JustNumbersLeading( STREET_ADDRESS ),0,0) + “_” + @Left(ZIP_CODE,5)
    4. Description: Leading numbers of Street Address + Zip 5
  1. NAME COMPARE? APPEND a KEY_FIELD consisting of the 1st character of the NAME field
    1. Type: Character
    2. Size: 1
    3. Parameter:  @Left(NAME,1)
    4. Description: 1st character of Vendor or Customer Name
  1. If comparing 1 file against itself – perform a DIRECT EXTRACTION to create a copy for the next step.
  1. Next, use VISUAL CONNECTOR* to join the 2 tables you wish to compare on the KEY_FIELD you added.
  2. Finally, append a field SIMILARITY_DEGREE to compare the 2 values, NAMES or ADDRESSES
    1. Type: Numeric
    2. Size: 4 decimal places
    3. Parameter: @SimilarPhrase( NAME_1 ,  NAME_2 ) or @SimilarPhrase( ADDRESS_1 ,  ADDRESS_2 ) **
    4. Description: Fuzzy compare
  1. Use the CRITERIA window to the right under PROPERTIES to filter and determine the best percent value to use. As previously stated, a value somewhere between .82 and .95 will probably work best. Start with SIMILARITY_DEGREE > .82 and apply. Increase the value until you are confident you have eliminated what you might consider “false duplicates”, 1 = 100% match. If comparing 1 file against itself – include:    .AND. PREC_NO < PREC_NO1

Equation Editor

  1. When you have determined the criteria ideal for your data set, perform a DIRECT EXTRACTION to create a table of your “fuzzy matches”.

Tech Tip: Understanding Join & Visual Connector


IDEA Function: @SimilarPhrase


@SimilarPhrase measures the similarity between two specified phrases or Character fields. A phrase can be a character expression split by white space. A phrase can also include multiple words where the internal word sequence is not important. It returns the similarity degree between the two phrases as a numeric value ranging between 0 and 1, up to six decimal places. A similarity degree of 0 indicates that the phrases are completely different and a similarity degree of 1 indicates that they are identical. The higher the numeric value of the similarity degree, the more similar the two strings are.


Two phrases are considered identical only if they contain the same words. As mentioned prevously, @SimilarPhrase ignores the internal word sequence in each phrase; therefore, the sequence of the words can be different. For example, “Jupiter planet” and “planet Jupiter” would be considered identical.

Best Practices , Tech Tip

Posted By

By Kris Willison
Kris joined the Professional Services team in January of 2015 as a Solutions Specialist. She has an extensive background in Software and Database Development accumulated from thirty years in IT support with twenty years’ experience in database development, cleanup, audit and migration using Microsoft Access. In her time with Audimation, she has received client praise for her “Top Tier” engagement on Monitor and Scripting projects. Kris enjoys looking at problems from new angles to determine the most efficient means of meeting the clients’ needs. Kris has been breeding/showing purebred Balinese cats since 1972 and Oriental Longhairs since 1996. She also hosts one of the largest online pedigree database sites for Siamese and related breeds with nearly 600 users worldwide.

Related Posts
Tech Tip: Creating Separate Files for Summarizations
Sep 25 Follow these simple steps to create and export separate files based upon your summarization: Open the original file, in this case SAMPLE-CUSTOMERS Fro...
Using Data Analytics to Prepare for Fieldwork
Feb 20 Abraham Lincoln once said that if he had eight hours to chop down a tree, he would spend six sharpening his axe. Proper preparation can save hours of effort and...
7 Critical Activities of World-Class Internal Auditors
Aug 07 As an internal auditor, your role involves many moving parts. Deliver relevant, timely, and meaningful information to key stakeholders? Check. Build and maintai...

This website has been designed for modern browsers. Please update. Update my browser now