Tech Tip: Fuzzy Duplicates & Fuzzy Match

CaseWare IDEA® Version 10 introduced an Advanced Fuzzy Duplicate task, which identifies groups of similar records based on up to three selected character fields. The task outputs databases that include or exclude fuzzy matches with varying degrees of similarity, helping to detect data entry errors, inconsistent conventions for recording information, and fraud.

This option can be found within the “Analysis” tab and under “Duplicate Key”.

[Image: Duplicate-Keys]

The image below showcases the default selections, scanning for COMPANY names with 80% or greater similarity, including exact duplicates.

[Image: Fuzzy-Match-Keys]

This works well for smaller files, but users have found the process to be CPU-intensive when scanning files with several thousand records. In addition, if they pick a similarity percentage that is too low, they can end up with false positives and must run the process again.

Values such as “JOHN SMITH” and “JOHN J SMITH” will have 95% similarity, while “JAMES SMITH” and “JIM SMITH” have an 82% similarity. You should be familiar with your data to determine what similarity degree is best for the analysis.
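To get a feel for how a similarity degree behaves before settling on a threshold, a rough Python sketch can help. It uses the standard library's `difflib.SequenceMatcher`, which is not the algorithm IDEA uses, so its scores will differ from the percentages quoted above. The `similarity` helper is a hypothetical name introduced only for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio (difflib's measure, not IDEA's)."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

pairs = [("JOHN SMITH", "JOHN J SMITH"), ("JAMES SMITH", "JIM SMITH")]

# Score each pair so the relative ordering can be inspected.
scores = {pair: similarity(*pair) for pair in pairs}
for (a, b), score in scores.items():
    print(f"{a!r} vs {b!r}: {score:.2f}")
```

As in the IDEA example, the pair that differs only by a middle initial scores higher than the pair with a different first name, which is the kind of ordering to look for when choosing a cutoff.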

Before IDEA 10.0 introduced “Fuzzy Duplicates”, Audimation developers utilized a different approach that supported files of several thousand records.

In circumstances such as seeking duplicate addresses between a Vendor/Customer database and an employee database or duplicate names within a single database, consider this approach:

APPEND FIELD: PREC_NO to the file or files to be compared.
1. Type: Numeric
2. Size: 0 decimal places
3. Parameter: @PRecNo()
4. Description: Physical Record Number

ADDRESS COMPARE? APPEND a KEY_FIELD consisting of the leading numbers & the 5-digit ZIP CODE to each file
1. Type: Character
2. Size: 50
3. Parameter: @Str(@JustNumbersLeading( STREET_ADDRESS ),0,0) + "_" + @Left(ZIP_CODE,5)
4. Description: Leading numbers of Street Address + Zip 5
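Outside IDEA, the same address key can be sketched in a few lines of Python. This is only an approximation of the equation above: `address_key` is a hypothetical helper, and returning an empty prefix when the address has no leading digits is an assumption rather than documented IDEA behavior.

```python
import re

def address_key(street_address: str, zip_code: str) -> str:
    """Build '<leading digits>_<ZIP5>' as a coarse matching key."""
    # Capture digits at the start of the address, ignoring leading spaces.
    match = re.match(r"\s*(\d+)", street_address)
    leading = match.group(1) if match else ""  # assumption: blank if none
    # Keep only the first five characters of the ZIP code (drops ZIP+4).
    return f"{leading}_{zip_code.strip()[:5]}"

print(address_key("123 Main St", "77002-1234"))  # prints 123_77002
```

Two records only reach the fuzzy comparison if their keys are equal, so "123 Main St" and "123 Main Street" in the same ZIP code are paired, while addresses with different house numbers are never compared.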

NAME COMPARE? APPEND a KEY_FIELD consisting of the 1st character of the NAME field
1. Type: Character
2. Size: 1
3. Parameter: @Left(NAME,1)
4. Description: 1st character of Vendor or Customer Name
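A likely reason the key field helps with larger files is that it limits how many pairs have to be compared. The Python sketch below counts the comparisons with and without a first-character key. The sample names and the `name_key` helper are made up for illustration.

```python
from collections import Counter

def name_key(name: str) -> str:
    """First character of the name, upper-cased, used as the join key."""
    return name.strip()[:1].upper()

names = ["Acme Corp", "ACME Corporation", "Apex Ltd",
         "Baker Inc", "Baker Incorporated", "Zenith LLC"]

# Without a key, every name is compared against every other name.
all_pairs = len(names) * (len(names) - 1) // 2

# With a key, only names sharing the same first character are compared.
group_sizes = Counter(name_key(n) for n in names)
blocked_pairs = sum(size * (size - 1) // 2 for size in group_sizes.values())

print(f"{all_pairs} comparisons without keys, {blocked_pairs} with keys")
# prints: 15 comparisons without keys, 4 with keys
```

The trade-off is that names whose first characters differ (for example, a typo in the first letter) are never compared, so the key should be chosen with the data in mind.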

If comparing 1 file against itself – perform a DIRECT EXTRACTION to create a copy for the next step.

Next, use VISUAL CONNECTOR* to join the 2 tables you wish to compare on the KEY_FIELD you added.
Finally, append a field SIMILARITY_DEGREE to compare the 2 values, NAMES or ADDRESSES
1. Type: Numeric
2. Size: 4 decimal places
3. Parameter: @SimilarPhrase( NAME_1 , NAME_2 ) or @SimilarPhrase( ADDRESS_1 , ADDRESS_2 ) **
4. Description: Fuzzy compare
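The join-then-score step can be pictured with the following Python sketch. It assumes each table already carries a KEY_FIELD, uses made-up sample rows, and substitutes `difflib.SequenceMatcher` for @SimilarPhrase, so the scores are illustrative rather than what IDEA would report. `join_on_key` is a hypothetical helper.

```python
from collections import defaultdict
from difflib import SequenceMatcher

vendors = [
    {"KEY_FIELD": "123_77002", "ADDRESS": "123 MAIN STREET"},
    {"KEY_FIELD": "9_10001", "ADDRESS": "9 ELM AVE"},
]
employees = [
    {"KEY_FIELD": "123_77002", "ADDRESS": "123 MAIN ST"},
    {"KEY_FIELD": "123_77002", "ADDRESS": "123 OAK LANE"},
    {"KEY_FIELD": "45_60601", "ADDRESS": "45 PINE RD"},
]

def join_on_key(left, right):
    """Pair rows that share KEY_FIELD and score each pair."""
    by_key = defaultdict(list)
    for row in right:
        by_key[row["KEY_FIELD"]].append(row)
    joined = []
    for l_row in left:
        for r_row in by_key.get(l_row["KEY_FIELD"], []):
            score = SequenceMatcher(
                None, l_row["ADDRESS"], r_row["ADDRESS"]).ratio()
            joined.append({
                "ADDRESS_1": l_row["ADDRESS"],
                "ADDRESS_2": r_row["ADDRESS"],
                "SIMILARITY_DEGREE": round(score, 4),  # 4 decimal places
            })
    return joined

joined = join_on_key(vendors, employees)
for row in joined:
    print(row)
```

Only the vendor at key 123_77002 finds partners, producing two candidate pairs. The other rows have no matching key and drop out of the comparison.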

Use the CRITERIA window to the right under PROPERTIES to filter the results and determine the best similarity value to use. As the earlier examples suggest, a value somewhere between .82 and .95 will probably work best. Start with SIMILARITY_DEGREE > .82 and apply it, then increase the value until you are confident you have eliminated what you would consider “false duplicates” (1 = 100% match). If comparing 1 file against itself, add .AND. PREC_NO < PREC_NO1 so that a record is not matched to itself and each pair appears only once.
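The effect of the criteria, including the record-number condition for a self-comparison, can be sketched in Python as follows. The sample names are invented, all three share the same first-character key, and `difflib.SequenceMatcher` again stands in for @SimilarPhrase, so the actual scores in IDEA will differ.

```python
from difflib import SequenceMatcher
from itertools import product

names = ["ACME SUPPLY CO", "ACME SUPPLY COMPANY", "APEX TOOLS"]
# PREC_NO mirrors the physical record number appended earlier.
records = [{"PREC_NO": i, "NAME": n} for i, n in enumerate(names, start=1)]

THRESHOLD = 0.82  # starting point; raise it to prune false duplicates

matches = []
for left, right in product(records, records):
    # PREC_NO < PREC_NO1: skips self-matches and mirrored (B, A) pairs.
    if left["PREC_NO"] < right["PREC_NO"]:
        score = SequenceMatcher(None, left["NAME"], right["NAME"]).ratio()
        if score > THRESHOLD:
            matches.append((left["PREC_NO"], right["PREC_NO"], round(score, 4)))

print(matches)
```

Without the record-number condition, the same data would yield each record matched to itself at 1.0 and every genuine pair listed twice.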

[Image: Equation Editor]

When you have determined the criteria ideal for your data set, perform a DIRECT EXTRACTION to create a table of your “fuzzy matches”.

SUPPORTING RESOURCES

Tech Tip: Understanding Join & Visual Connector

IDEA Function: @SimilarPhrase

@SimilarPhrase measures the similarity between two specified phrases or Character fields. A phrase can be a character expression split by white space. A phrase can also include multiple words where the internal word sequence is not important. It returns the similarity degree between the two phrases as a numeric value ranging between 0 and 1, up to six decimal places. A similarity degree of 0 indicates that the phrases are completely different and a similarity degree of 1 indicates that they are identical. The higher the numeric value of the similarity degree, the more similar the two strings are.

Two phrases are considered identical only if they contain the same words. As mentioned previously, @SimilarPhrase ignores the internal word sequence in each phrase; therefore, the sequence of the words can be different. For example, “Jupiter planet” and “planet Jupiter” would be considered identical.
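One way to approximate this word-order-insensitive behavior outside IDEA is to sort the words in each phrase before comparing them. The Python sketch below does that with `difflib.SequenceMatcher`. It is an illustration of the idea only, not IDEA's implementation, and `token_sort_similarity` is a hypothetical name.

```python
from difflib import SequenceMatcher

def token_sort_similarity(a: str, b: str) -> float:
    """Compare two phrases after sorting their words, ignoring case."""
    def normalize(phrase: str) -> str:
        # Split on white space, sort the words, and rejoin them.
        return " ".join(sorted(phrase.upper().split()))
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

same_words = token_sort_similarity("Jupiter planet", "planet Jupiter")
plain = SequenceMatcher(None, "Jupiter planet", "planet Jupiter").ratio()

print(same_words)  # prints 1.0
print(round(plain, 2))
```

The plain character-by-character comparison scores the reordered phrase below 1, while the word-sorted comparison treats the two phrases as identical.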

By Kris Willison
Kris joined the Professional Services team in January of 2015 as a Solutions Specialist. She has an extensive background in Software and Database Development accumulated from thirty years in IT support with twenty years’ experience in database development, cleanup, audit and migration using Microsoft Access. In her time with Audimation, she has received client praise for her “Top Tier” engagement on Monitor and Scripting projects. Kris enjoys looking at problems from new angles to determine the most efficient means of meeting the clients’ needs. Kris has been breeding/showing purebred Balinese cats since 1972 and Oriental Longhairs since 1996. She also hosts one of the largest online pedigree database sites for Siamese and related breeds with nearly 600 users worldwide.
