CaseWare IDEA® Version 10 introduced an Advanced Fuzzy Duplicate task, which identifies similar records across up to three selected character fields. The task produces output databases that include or exclude fuzzy matches with varying degrees of similarity, which helps detect data entry errors, inconsistent conventions for recording information, and fraud.
This option is on the “Analysis” tab, under “Duplicate Key”.
The image below showcases the default selections, scanning for COMPANY names with 80% or greater similarity, including exact duplicates.
This works well for smaller files, but users have found the process to be CPU intensive when scanning files with several thousand records. In addition, if they pick too low a percentage, they can end up with false positives and must run the process again.
Values such as “JOHN SMITH” and “JOHN J SMITH” have a 95% similarity, while “JAMES SMITH” and “JIM SMITH” have an 82% similarity. You need to be familiar with your data to determine which similarity degree is best for the analysis.
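To see how the similarity threshold changes the result set, here is a minimal Python sketch. It is not IDEA's algorithm: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, so its scores will not match the percentages IDEA reports. The function name `similar_pairs` and the sample names are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(values, threshold):
    """Return (a, b, score) for every pair scoring at or above threshold.

    Assumption: SequenceMatcher.ratio() stands in for IDEA's similarity
    measure; the real scores in IDEA will differ.
    """
    pairs = []
    # Every value is compared with every other value, so the number of
    # comparisons grows roughly with the square of the record count.
    for a, b in combinations(values, 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


names = ["JOHN SMITH", "JOHN J SMITH", "JAMES SMITH", "JIM SMITH", "ACME CORP"]

strict = similar_pairs(names, 0.90)  # fewer matches, fewer false positives
loose = similar_pairs(names, 0.60)   # more matches, more to review

for a, b, score in loose:
    print(f"{a!r} ~ {b!r}: {score}")
```

Lowering the threshold can only add pairs, never remove them, which is why a percentage set too low tends to bury real matches in false positives. The pairwise loop in this sketch is also one plausible reason such comparisons become expensive as record counts grow.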
Before IDEA 10.0 introduced “Fuzzy Duplicates”, Audimation developers utilized a different approach that supported files of several thousand records.
In circumstances such as seeking duplicate addresses between a Vendor/Customer database and an employee database or duplicate names within a single database, consider this approach:
Tech Tip: Understanding Join & Visual Connector
IDEA Function: @SimilarPhrase
@SimilarPhrase measures the similarity between two specified phrases or Character fields. A phrase can be a character expression split by white space. A phrase can also include multiple words where the internal word sequence is not important. It returns the similarity degree between the two phrases as a numeric value ranging between 0 and 1, up to six decimal places. A similarity degree of 0 indicates that the phrases are completely different and a similarity degree of 1 indicates that they are identical. The higher the numeric value of the similarity degree, the more similar the two strings are.
Two phrases are considered identical only if they contain the same words. As mentioned previously, @SimilarPhrase ignores the internal word sequence in each phrase; therefore, the sequence of the words can be different. For example, “Jupiter planet” and “planet Jupiter” would be considered identical.
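The order-insensitive behavior described above can be illustrated with a short Python sketch. This is not the implementation of @SimilarPhrase, whose internal algorithm is not documented here. It is a simple word-overlap (Jaccard) measure that shares the same properties: a result between 0 and 1, rounded to six decimal places, with word order ignored. The function name `phrase_similarity`, the case folding, and the treatment of repeated words are all assumptions made for illustration.

```python
def phrase_similarity(a: str, b: str) -> float:
    """Word-overlap similarity between two phrases, from 0 to 1.

    Assumptions (not taken from IDEA): words are compared without regard
    to case, and repeated words within a phrase count once.
    """
    # Split on white space and collect the words into sets, which
    # discards the word order within each phrase.
    words_a = set(a.upper().split())
    words_b = set(b.upper().split())
    if not words_a and not words_b:
        return 1.0  # two empty phrases are treated as identical
    shared = words_a & words_b
    combined = words_a | words_b
    return round(len(shared) / len(combined), 6)


print(phrase_similarity("Jupiter planet", "planet Jupiter"))  # 1.0
print(phrase_similarity("JOHN SMITH", "JOHN J SMITH"))        # 0.666667
print(phrase_similarity("ACME CORP", "JOHN SMITH"))           # 0.0
```

Because this sketch compares whole words only, “JAMES SMITH” and “JIM SMITH” share just one of three distinct words. A measure that also compares characters within words would score such pairs higher.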