Chapter 21: Fuzzy Duplicate

CaseWare IDEA 10 Tutorials

Chapter 21: Fuzzy Duplicate Transcript

Hello. Welcome to this video about Fuzzy Duplicates. This video is brought to you by CaseWare Analytics.

One of our goals at CaseWare is to help our clients maximize the return on their investment in IDEA through continuing education.

Fuzzy Duplicate is a special kind of Duplicate Key detection that is used to identify similar records in Character fields. The keyword here is similar: there are already lots of tools for finding exact matches. Once records are identified as having similarities, they are gathered into groups called Fuzzy Groups and ranked according to their similarity degree. The more similar the data, the higher the similarity degree.

An exact match, such as Chan and Chan, is considered to have a similarity degree of 1.0000, while something like Schmidt and Schmitt has a similarity degree of about 0.85. A Fuzzy Group is made up of a core and its matches. The core is the record with the largest number of fuzzy matches.
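To make the idea of a similarity degree and a core more concrete, here is a minimal sketch in Python. It does not reproduce IDEA's own scoring method, which this transcript does not describe. As a stand-in, it assumes the `SequenceMatcher` ratio from Python's standard library. The helper names `similarity` and `fuzzy_core`, the sample names and the 0.80 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a score between 0 and 1; 1.0 means the strings are identical.

    Stand-in measure only (difflib), not IDEA's own algorithm.
    """
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_core(values: list[str], threshold: float) -> tuple[str, list[str]]:
    """Return the value with the most matches at or above the threshold,
    together with those matches. Assumes the values are distinct."""
    matches = {v: [] for v in values}
    for a, b in combinations(values, 2):
        if similarity(a, b) >= threshold:
            matches[a].append(b)
            matches[b].append(a)
    core = max(values, key=lambda v: len(matches[v]))
    return core, matches[core]


# Hypothetical sample data and threshold.
names = ["Schmidt", "Schmitt", "Schmid", "Chan"]

print(round(similarity("Chan", "Chan"), 4))        # 1.0
print(round(similarity("Schmidt", "Schmitt"), 4))  # 0.8571
print(fuzzy_core(names, threshold=0.80))           # ('Schmidt', ['Schmitt', 'Schmid'])
```

With this stand-in measure, Schmidt and Schmitt score about 0.86, which is in the same range as the figure quoted above, and Schmidt ends up as the core because it matches two other names while each of the others matches at most one.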

Fuzzy Duplicate is good for comparing fields that contain single words, such as stock market IDs or foreign exchange codes. It is also good for character strings where the sequence is important, such as a phone number made up of a national code, an area or regional code, and the local number.

It is also good for comparing phrases or short sentences where word order is important, such as a business name or an address.

In practical terms, it means that this tool can help identify entries with slight differences, such as a spelling error or variations introduced during data entry.

An example of a data entry variation can be seen in the words Road and Street when used in an address. It is common to see both of these words abbreviated, but the use of a period after the abbreviation is often inconsistent.

So even though we recognize these entries as simply variations that represent the same entity, a minor difference like this is enough for IDEA to treat each variation as a unique key in the field.
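As a rough illustration of why an exact comparison misses these variations, the sketch below compares three made-up spellings of one address. Again, this is not IDEA itself: the addresses are hypothetical, and the score comes from Python's `difflib`, used here only as a stand-in similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical address entries that a person would read as the same place.
addresses = ["12 Main St.", "12 Main St", "12 Main Street"]

# Exact comparison: each spelling counts as a separate key.
distinct_keys = len(set(addresses))
print(distinct_keys)  # 3

# Fuzzy comparison against the first entry (stand-in measure, not IDEA's).
base = addresses[0]
scores = [SequenceMatcher(None, base, other).ratio() for other in addresses[1:]]
for other, score in zip(addresses[1:], scores):
    print(f"{base!r} vs {other!r}: {score:.4f}")
```

An exact test sees three different keys, while the fuzzy scores stay high, so entries like these could be pulled into one group for review.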

These kinds of duplicates are notoriously difficult to find, but the Fuzzy Duplicates tool makes finding them much easier.
