Applying Fuzzy Logic to Acquire Clear Results

3 Practical Tips from Professionals Using IDEA

Fuzzy logic techniques are an effective way to normalize data to identify potential matches, duplicates, errors or fraud. Here are some tips and techniques from experienced IDEA users to help apply fuzzy logic:

TIP #1 Identifying Duplicates in Single & Multiple Databases

A company with five offices, each with individual customer databases containing 10 years of data, wanted to eliminate duplicate information and improve consistency by applying a conversion method. Since each customer clerk had their own way of entering information into the system, identical duplicates are rare. The fuzzy logic is true, false or possible, and the goal is to identify possible duplicates in each database, then identify possible duplicates across all five databases.

Steps to Identify Possible Duplicates:

Append a new field using @justnumbers (Customer Address + Zipcode) to include only numbers from addresses and zip codes
Detect duplicates on the new field from step 1 and generate a duplicate file with possible duplicates
Append a new field using @justnumbers (telephone) to include numbers from Telephone field (some telephone field includes brackets or hyphens
Detect duplicates on the new field from step 3 and generate a duplicate file with possible duplicates
Append a new field using @remove all special characters from Customer Name field, such as hyphens, periods, commas, “&”, and so on.
Append a new field using @left (the new customer name field from step 5, 15) to return 15 characters from the new customer name field. (Depending on the nature of the customer name, you may choose more or less characters to return, or use @right to return characters from the right)
Detect duplicates on the new field from step 6 and generate a duplicate file with possible duplicates

This process resulted in three files with possible duplicates, which were reviewed. Possible duplicates that were identified as non-duplicates were deleted from the database.

Actual duplicates were sent to the customer department for cleanup. The process was replicated for each database for company-wide consistency. Due to the massive records in each database and the duplicates in the same database, the visual connector and @similarphrase were not used in this case.

TIP #2 Matching Customer and Employee Data

Data inconsistencies can make it virtually impossible to compare data. Variances in how names, addresses and numbers are entered or abbreviated within a database can quickly complicate the analysis process. However fuzzy matching can be applied to return a percentage match by reading data forwards and backward for a certain length.

For example “Westheimer Boulevard” can appear as “Westheimer,” “Westheimer Blvd.” Using a forwards match of the first five characters, approximately 80% of potential matches were identified, and the remainder was manageable enough to be manually identified and corrected.

A backward fuzzy match is effective when trying to match customer and employee names. Formatting examples include:

First Name Middle Name Last Name
First Name Middle Initial Last Name
First Name Last Name
First Name Last Name Suffix
Prefix First Name Last Name

Combined with inconsistent formatting and spelling, and the desire to find potential relatives of employees, we realized doing the fuzzy match backwards would target the last name Applying Fuzzy Logic to Acquire Clear Results 3 Practical Tips from Professionals Using IDEA New Levenshtein Distance Techniques Available in IDEA V9.2 The latest version of IDEA simplifies joining data sets together to identify possible infractions between employee and vendor master files using new @functions. Watch the video! no matter what the first/middle names looked like. So “John George Richards”, “Jonathan Richards”, “J G Richards and “Mary Beth Richards” would all show up as matches using the backwards fuzzy match.

Recommendations:
A method to accomplish this would be to use the @similarphrase function in IDEA to compare the employee names. The first step would be to append a new column using the @strip (lastname + firstname + middlename). Apply this in both the employee file and the vendor file.

To start the process of identifying duplicates and family members, it might help to sort the files on the new column creating a new database (INDEXing rearranges the records as a VIEW, SORTing the records will physically change the order of the records to match your criteria – and create a new database).

Run a simple duplicate analysis at this time to identify exact matches.
Use the following function to compare the sorted record value in this new column with the value below it and give a numeric indicator of the similarity: @similarphrase(NewName, @getnextvalue(“NewName”) – note that the field name must be in quote marks. This will provide a numeric indicator between 0 and 1 of how similar the first name is to the name that follows in the sorted database.
Apply this to the Vendor and Employee databases.

To test the Vendor name against the Employee name use Visual Connector to join the two files using the first x number of characters in the new name field as the connecting field or “where clause.” You can decide how many characters to use or do multiple runs using different numbers of characters until you find a “sweet spot.”

First, create a new column in each data set by using the @left(new column name, x) to pull out only x characters.
Next choose Visual Connector from the Analysis tab [Note: Have the dataset with the most records in it open when starting Visual Connector.]
Add the second recordset to the panel and, left-click the new shortened name column in the primary table, drag it over the new name column in the secondary table and release the mouse button to create the join. Click OK.
Next select “Matches Only” and name your new file. This type of Visual Connector join will result in a dataset that contains every record in the secondary file that matches the record in the primary. A 1-tomany match.
Click OK and after the new file is created append a new column using the following function: @similarphrase(New vendor name column, New employee name column). The column should be of the numeric type with 6 decimal places.
Use the vendor name column that contains the whole name in this function, not the shortened x character version for the comparison.
The result will be a number between 0 and 1 with 1 being identical matches and 0 being no match.

TIP #3 Comparing Vendor and Merchant Names

A county agency compares vendor names from their monthly contract report to merchant names on a p-card transaction list to see if departments are purchasing items from contractors using p-cards rather than following the requisition/purchase order process. Merchant names can be slightly different from vendor names within the financial system, so matches are not always obvious.

County auditors also compare vendor names from monthly contract reports to the payee name on the non-PO voucher listing. This enables the Compliance Officer to see if departments are making payments to vendors and bypassing the procurement process.

Fuzzy matches are performed using the @strip and @upper function to match up vendor, merchant and payee names. Then run visual connector to identify matches. You can also use the @similarphrase function to perform this operation and it will provide a percentage of the matches found. These standard @functions within IDEA help normalize and narrow the data population to quickly identify potential matches.

Need help?

Contact the IDEA Help Desk with questions: 888.641.2800 select option 4 or email, [email protected]

Special thanks to the professionals who contributed these helpful tips: Emma Zhang, Chris Ripps, Lisa Currier and Scott Smith.

CaseWare IDEA , News , Tech Tip

By Audimation Team

Common @Functions in IDEA

Jul 12 FUNCTIONS FUNCTIONALITY EXAMPLES @age( ) Returns the number of days between two specified dates. Note: both dates must be a "Date" field typ...

The Numbers Never Lie

Jul 20 An Audit Division That Has Mastered Its Use Employee fraud costs the U.S. $300 billion a year. Fraudsters are becoming more sophisticated and clever in their...

Explore the IDEAScript Vault

Oct 20 The IDEAScript Vault has over 80 IDEAScripts available to any supported user. They are intended as samples and instructional materials, so you should be sure to...