Fuzzy logic techniques are an effective way to normalize data to identify potential matches, duplicates, errors or fraud. Here are some tips and techniques from experienced IDEA users to help apply fuzzy logic:
A company with five offices, each with individual customer databases containing 10 years of data, wanted to eliminate duplicate information and improve consistency by applying a conversion method. Since each customer clerk had their own way of entering information into the system, identical duplicates are rare. The fuzzy logic is true, false or possible, and the goal is to identify possible duplicates in each database, then identify possible duplicates across all five databases.
This process resulted in three files with possible duplicates, which were reviewed. Possible duplicates that were identified as non-duplicates were deleted from the database.
Actual duplicates were sent to the customer department for cleanup. The process was replicated for each database for company-wide consistency. Due to the massive records in each database and the duplicates in a same database, the visual connector and @similarphrase were not used in this case.
Available in IDEA V9.2
The latest version of IDEA simplifies joining data sets together to identify possible infractions between employee and vendor master files using new @functions.
Watch the Video!
Data inconsistencies can make it virtually impossible to compare data. Variances in how names, addresses and numbers are entered or abbreviated within a database can quickly complicate the analysis process. However fuzzy matching can be applied to return a percentage match by reading data forwards and backwards for a certain length.
For example “Westheimer Boulevard” can appear as “Westheimer,” “Westheimer Blvd.” Using a forwards match of the first five characters, approximately 80% of potential matches were identified, and the remainder was manageable enough to be manually identified and corrected.
A backwards fuzzy match is effective when trying to match customer and employee names. Formatting examples include:
Combined with inconsistent formatting and spelling, and the desire to find potential relatives of employees, we realized doing the fuzzy match backwards would target the last name Applying Fuzzy Logic to Acquire Clear Results 3 Practical Tips from Professionals Using IDEA New Levenshtein Distance Techniques Available in IDEA V9.2 The latest version of IDEA simplifies joining data sets together to identify possible infractions between employee and vendor master files using new @functions. Watch the video! no matter what the first/middle names looked like. So “John George Richards”, “Jonathan Richards”, “J G Richards and “Mary Beth Richards” would all show up as matches using the backwards fuzzy match.
A method to accomplish this would be to use the @similarphrase function in IDEA to compare the employee names. The first step would be to append a new column using the @strip (lastname + firstname + middlename). Apply this in both the employee file and the vendor file.
To start the process of identifying duplicates and family members, it might help to sort the files on the new column creating a new database (INDEXing rearranges the records as a VIEW, SORTing the records will physically change the order of the records to match your criteria – and create a new database).
To test the Vendor name against the Employee name use Visual Connector to join the two files using the first x number of characters in the new name field as the connecting field or “where clause.” You can decide how many characters to use or do multiple runs using different numbers of characters until you find a “sweet spot.”
A county agency compares vendor names from their monthly contract report to merchant names on a p-card transaction list to see if departments are purchasing items from contractors using p-cards rather than following the requisition/purchase order process. Merchant names can be slightly different from vendor names within the financial system, so matches are not always obvious.
County auditors also compare vendor names from monthly contract reports to the payee name on the non-PO voucher listing. This enables the Compliance Officer to see if departments are making payments to vendors and bypassing the procurement process.
Fuzzy matches are performed using the @strip and @upper function to match up vendor, merchant and payee names. Then run visual connector to identify matches. You can also use the @similarphrase function to perform this operation and it will provide a percentage of the matches found. These standard @functions within IDEA help normalize and narrow the data population to quickly identify potential matches.
Contact the IDEA Help Desk with questions: 888.641.2800 select option 4 or email, [email protected]
Special thanks to the professionals who contributed these helpful tips: Emma Zhang, Chris Ripps, Lisa Currier and Scott Smith.