Custom duplicate invoice patterns
With version 2.1.0 (see Release Notes May 2023), we make the algorithm logic customizable. It is now possible to:
Define custom patterns that determine a match.
Specify a combination of columns to be checked per pattern, such as five columns.
Perform more than one fuzzy match per pattern, for example, three exact and two fuzzy matches.
Tweak the parameters per fuzzy match; for example, increase the date range to 30 days.
These pattern customizations are done in the Knowledge Model (ML) of a given package and will extend the configuration in the ML Sensor. If customizations are used, the ML Sensor will still determine the knowledge model and the use case that should be applied (“Duplicate Checking Invoices”). The ML Sensor will also still provide the logs. However, everything previously defined in the ML Sensor: the record, the columns to be checked and written back, the reversal flag, and filters, will now be defined in the Knowledge Model (KM). All this is done using a custom object.
Custom object
A custom object in the Knowledge Model stores the parameters formerly defined in the ML Sensor parameters and the newly added search patterns that determine a match. The custom object should be structured as follows:
Under “customObjects”, add an object with an arbitrary “id” and “displayName”. Under “customAttributes” define the key “record” and specify the record id used for all attributes defined in the custom object. With keys “invoiceDate”, “invoiceValue”, “invoiceReference” and “invoiceVendorName”, define the attributes that should be persisted in the result table. With the key “invoiceReversalFlag”, add the reversal flag. The parameter “excludeExactFromFuzzy” determines how fuzzy matches are handled with regard to exact values. Under the key “invoiceFilters”, list the filter IDs that should be applied. Under “searchPatterns”, define the patterns. It is possible to define multiple custom search patterns.
customObjects: - id: ALGORITHM_CUSTOMIZATIONS displayName: Custom Matching Rules customAttributes: record: INVOICE # Required invoiceDate: INVOICE_DATE # Required invoiceValue: INVOICE_VALUE # Required invoiceReference: INVOICE_REFERENCE # Required invoiceVendorName: INVOICE_VENDOR_NAME # Required invoiceReversalFlag: INVOICE_REVERSAL_FLAG # Optional invoiceFilters: # Optional - FITLER_DUPLICATE_CHECKER_SHORT_REF - FITLER_DUPLICATE_CHECKER_INTERNAL_VENDOR excludeExactFromFuzzy: True # Optional (omitted by default) searchPatterns: # Required REFERENCE_FUZZY: ...
Persisted data fields
The algorithm will always persist four columns in the result table. Any additional columns defined in the search pattern will be checked but not persisted in the result table. The four columns to be persisted have to be defined under the fixed keys “invoiceDate”, “invoiceValue”, “invoiceReference” and “invoiceVendorName”.
Each attribute defined there has to be part of the record selected (“INVOICE” by default). The behavior of configuration changes remains consistent with the ML Sensor setup. For instance, changing an attribute results in the renaming of the column in the “DUPLICATE_INVOICES” table based on the attribute’s ID. Make sure to account for that in the variables being used. In case a result table already exists that was previously populated by the ML Sensor configuration, make sure that an attribute is persisted in the same column (e.g. “INVOICE_DATE” should be specified under “invoiceDate”). The data type of an attribute must align with the data type of the corresponding column in the table. In case of a mismatch, it is necessary to recreate
Important
Even if less than four attributes are checked, it is still necessary to specify four attributes to be persisted.
Fuzzy Handling of Exact Values
The “excludeExactFromFuzzy” parameter determines the handling of fuzzy matches for exact values. It can be set to either “True” or “False”. When set to “True,” a fuzzy comparer considers exact values as non-matches. In other words, a match is only found if values are different, i.e., fuzzy. To match cases with exact values, it is necessary to explicitly define a pattern that uses the “Exact” comparer. When set to “False,” a fuzzy comparer also considers exact values as matches. The parameter can be omitted and then set to “True” by default.
Custom search patterns
The way potential duplicate groups are formed is determined by the patterns. Each search pattern consists of a combination of columns and comparers. A single pattern is defined in the following format:
REFVENDORDATEFUZZY: # Required. Name of the pattern. Unique but arbitrary. columns: # Required. - columnName: INOVICE_REFERENCE # Required. Under the key “columnName", specify the id of the record attribute to be checked. comparer: # Required. comparerName: Reference # Required. Under the key “comparerName”, specify the comparer name that should be used. parameters: # Optional. Tweak parameters of the comparer. numDays: 7 # Optional. Parameter name and parameter value - ....
The pattern name can be freely defined, but it is recommended to use short names (e.g., A1 or REF_FUZZY) in case there are more than four patterns. This ensures we can persist all applicable pattern names in the “PATTERN” column in the DUPLICATE_INVOICES table. By default, the width for this column is 200 characters. If you run into problems with the width, you can increase it with an “ALTER TABLE” SQL statement after the result table creation.
Multiple attributes (“columnName”
) can be specified per pattern. Each attribute defined in a given pattern has to be part of the record selected (“INVOICE”
by default). Please note that the default patterns are applied if no search pattern (key “searchPattern”
) has been defined in the knowledge model.
Note
The default patterns are applied, if no search pattern (key “searchPattern”) has been defined in the knowledge model. If “searchPattern” has been defined, only the patterns explicitly specified are applied, i.e., the default patterns are not applied automatically.
Important
The current constraint of a search pattern is that it must contain at least one exact column using the “Exact” comparer.
Comparers and parameters
Within the search pattern definition, there is the option to use different comparers and tweak certain parameters. Currently, six comparers can be used:
One exact comparer:
Exact
One different comparer:
Diff
Four fuzzy comparers:
CompanyName
Date
Reference
Value
Each comparer applies a different matching logic and consists of a list of parameters that can be tweaked.
Important
Please be aware that comparers might have additional parameters by default that cannot be customized. For example, for the “Date” comparer, a check for commonly swapped months and swapped days and months is always performed. Check the Standard Patterns for the entire pattern logic. Furthermore, not specifying the parameters in the pattern definition, results in the default parameters being applied.
The parameters that can be customized for a given comparer are defined as follows:
Exact: Designed to find matches of the same value.
Diff: Designed to find matches of different values (i.e., not exact).
CompanyName: Designed to find similar company names but can be used as a general string comparer.
CompanyPatterns: A list of company suffixes to be removed before comparing.
companyPatterns = [ " gmbh", " ag", " llc", … ]
Method: The method used for calculating the string similarity score. Choose from “jaro” or “levenshtein”.
"method": "jaro"
Threshold: The threshold of the score above which a fuzzy match is considered. The higher, the stricter.
"threshold": 0.85
Date: Designed to find similar invoice dates but can be used as a general date comparer.
NumDays: The maximum allowed day difference between two dates.
"numDays": 7
Reference: Designed to find similar invoice references.
ConfusionChars: Pairs of letters and numbers that can be mistakenly recognized as the same, for example, “8” and “b.” Use two lists of characters where each position in the character list forms a pair (“8” and “b” form the first pair).
"confusionChars": ["86i10oqsz", "bgllddd52"]
CharMaxErrors: The maximum allowed number of characters skipped (or errors) in one of the strings so that it is equal to the other string. For example, “abcd1234” and “abcd124”. If we skip character “3” in the first string, the two strings would be identical. This constitutes one error. A string of “abc123” and “abca12z3” would constitute two errors since “a” (the second) and “z” can be skipped in the second string to make the two strings match. A string of “abc123” and “abg123” would not be considered a match since skipping “c” does not result in the two strings matching (“ab123” vs. “abg123”), and skipping “g” does not result in the two strings to match (“abc123” vs “ab123”).
"charMaxErrors": 3
TurnerMaxErrors: The maximum number of swapped character pairs (turners). For instance, there are two turners in “ABC” and “BAC” where “A and B” and “B and A” represent a pair of inverted characters. Two characters need to be swapped to turn “ABC” into “BAC” and vice versa. This requires the two strings to have the same characters.
"turnerMaxErrors": 3
Note
If there is a numerical inversion such as “01” and “10”, we automatically consider this not to be a match. For instance, the references “ABC01” and “ABC10” do not match. This prevents false positives with recurring invoices, often with incremental numbers.
Value: Designed to find similarities between invoice values but can be used as a general numeric comparer.
MaxPriceLimit: The maximum allowed absolute difference between two numbers.
"maxPriceLimit": 80.0
TurnerMaxErrors: The maximum number of swapped number pairs (turners). For instance, there are four turners in “21500” and “12005”, where “2 and 1”, “1 and 2”, “5 and 0” and “0 and 5” represent a pair of inverted characters. To turn “21500” into “12005” and vice versa, four digits need to be swapped. This requires two numbers to have the same digits.
"turnerMaxErrors": 5
Pattern Definition - Best Practices
When introducing or modifying custom patterns, it is highly recommended to test the new configuration in a non-production environment initially. This testing environment should include a separate package and data model to avoid interference with the productive setup. Since the behavior of configuration changes remains consistent with the ML Sensor setup, already checked documents will not be re-evaluated with a new custom pattern configuration. To re-check all documents, a manual reset of the result tables is necessary (see Troubleshooting).
Pattern Application
Standard Patterns outline that a pattern is always applied to a pair of documents. However, it can be that different custom patterns that are applicable for a given document pair have been defined simultaneously. For instance, consider two patterns with a fuzzy date check - one having a threshold of 30 days and another with 90 days.
The only reason to define the stricter pattern with a 30-day threshold is to better distinguish the group formation in the front end. With both the 30-day and 90-day patterns listed in the “PATTERN” column, this information can be used to display a more specific group pattern in the view, for example, “Similar Date (short).” However, this clear distinction only works with groups of two documents. Groups with more than two documents can be formed with multiple patterns (see Default patterns), potentially both the 30- and 90-day thresholds.
Pattern Determination
The default group pattern shown to the user is determined by comparing the checked columns and counting distinct values within the group. For instance, if documents within the group share the same vendor name, date and value, but have more than one reference, this translates to the group pattern “Similar Reference”. When defining custom patterns with new or additional columns checked, adjust the variable “GROUP_DUPLICATE_PATTERN”. In case of overlapping patterns, it might also be necessary to incorporate the “PATTERN” column. Returning to the example with the 30 and 90 days threshold and assuming a fifth column is checked, the additional PQL could look like this:
WHEN ... PU_COUNT_DISTINCT ( "GROUPS_DUPLICATE_INVOICES" , "DUPLICATE_INVOICES"."INVOICE_DATE" , INDEX_ORDER ( "GROUPS_DUPLICATE_INVOICES"."GROUP_UUID" , ORDER BY ( "DUPLICATE_INVOICES"."GROUP_CREATION_DATE" ASC ) , PARTITION BY ( "GROUPS_DUPLICATE_INVOICES"."GROUP_UUID" , "DUPLICATE_INVOICES"."INVOICE_ID" ) ) = 1 ) > 1 AND PU_COUNT_DISTINCT ( "GROUPS_DUPLICATE_INVOICES" , BIND ( "DUPLICATE_INVOICES", PU_LAST ( "DISTINCT_INVOICES" , "CASE_TABLE"."EXTRA_COLUMN")), INDEX_ORDER ( "GROUPS_DUPLICATE_INVOICES"."GROUP_UUID" , ORDER BY ( "DUPLICATE_INVOICES"."GROUP_CREATION_DATE" ASC ) , PARTITION BY ( "GROUPS_DUPLICATE_INVOICES"."GROUP_UUID" , "DUPLICATE_INVOICES"."INVOICE_ID" ) ) = 1 ) = 1 AND PU_LAST ( "GROUPS_DUPLICATE_INVOICES", "DUPLICATE_INVOICES"."PATTERN") LIKE '%DOC_DATE_SHORT%' THEN 'Similar Date (short)'...
Performance Considerations
With custom patterns, there are trade-offs and performance considerations that you should be aware of:
The pattern complexity, in combination with the number of documents checked, determines the resource usage.
Exact column checks are less resource-intensive than fuzzy checks.
The more exact columns checked, the less memory is used. A pattern with three exact columns uses less memory than a pattern with only one exact column.
Checking fewer columns and applying less strict patterns naturally results in larger groups. For example, using multiple patterns with only one Exact and one CompanyName comparer, set to a very low threshold, can lead to the creation of groups containing thousands of documents, which should be avoided.
Each additional fuzzy check per pattern significantly increases memory usage, especially the “Reference” comparer. If possible, avoid adding multiple fuzzy checks with the “Reference” comparer for a single pattern.
We recommend a maximum of 7 columns checked per pattern as they have a medium impact on memory usage.
We recommend a maximum of 15 search patterns as they have a minor impact on memory usage. This is because the patterns are independent of each other.
Having a “resource intensive” pattern configuration beyond these recommendations is, of course, possible but comes with the trade-off of being able to check fewer documents simultaneously.
Configuration examples
Default configuration
The configuration of the default search pattern (applied when using the default ML Sensor setup). Removing the parameters results in the default parameters being applied.
""" customObjects: - id: ALGORITHM_CUSTOMIZATIONS displayName: Custom Matching Rules customAttributes: record: INVOICE invoiceDate: INVOICE_DATE invoiceValue: INVOICE_VALUE invoiceReference: INVOICE_REFERENCE invoiceVendorName: INVOICE_VENDOR_NAME invoiceReversalFlag: INVOICE_REVERSAL_FLAG invoiceFilters: - FITLER_DUPLICATE_CHECKER_SHORT_REF - FITLER_DUPLICATE_CHECKER_INTERNAL_VENDOR searchPatterns: REFERENCE_FUZZY: columns: - columnName: INVOICE_REFERENCE comparer: comparerName: Reference parameters: charMaxErrors: 3 turnerMaxErrors: 3 confusionChars: ["86i10oqsz", "bgllddd52"] - columnName: INVOICE_DATE comparer: comparerName: Exact - columnName: INVOICE_VALUE comparer: comparerName: Exact - columnName: INVOICE_VENDOR_NAME comparer: comparerName: Exact DOC_DATE_FUZZY: columns: - columnName: INVOICE_REFERENCE comparer: comparerName: Exact - columnName: INVOICE_DATE comparer: comparerName: Date parameters: numDays: 7 - columnName: INVOICE_VALUE comparer: comparerName: Exact - columnName: INVOICE_VENDOR_NAME comparer: comparerName: Exact VALUE_FUZZY: columns: - columnName: INVOICE_REFERENCE comparer: comparerName: Exact - columnName: INVOICE_DATE comparer: comparerName: Exact - columnName: INVOICE_VALUE comparer: comparerName: Value parameters: maxPriceLimit: 80.0 turnerMaxErrors: 5 - columnName: INVOICE_VENDOR_NAME comparer: comparerName: Exact VENDOR_FUZZY: columns: - columnName: INVOICE_REFERENCE comparer: comparerName: Exact - columnName: INVOICE_DATE comparer: comparerName: Exact - columnName: INVOICE_VALUE comparer: comparerName: Exact - columnName: INVOICE_VENDOR_NAME comparer: comparerName: CompanyName parameters: threshold: 0.85 method: jaro companyPatterns: [ " gmbh", " ag", " llc", " inc", " ltd", " limited", " sdn", " bhd", " se", " corporation", " corp", " sl", " coltd", " group", " mbh", " co", " kg", " ltda", " sa", " sro", " des", " sas", " sasu", " zoo", " sp", " sau", " cokg", ] EXACT: columns: - columnName: INVOICE_REFERENCE comparer: comparerName: Exact - columnName: INVOICE_DATE comparer: comparerName: Exact - columnName: INVOICE_VALUE comparer: comparerName: Exact - columnName: INVOICE_VENDOR_NAME comparer: comparerName: Exact """
Additional column checked
Including the due date (“CFIELD_DUE_DATE”) as an additional column to be checked using a fuzzy comparer. Ensure that the attribute “CFIELD_DUE_DATE” is part of the INVOICE record.
""" searchPatterns: DUE_DATE_FUZZY: columns: - columnName: INVOICE_REFERENCE comparer: comparerName: Exact - columnName: INVOICE_DATE comparer: comparerName: Exact - columnName: INVOICE_VALUE comparer: comparerName: Exact - columnName: INVOICE_VENDOR_NAME comparer: comparerName: Exact - columnName: CFIELD_DUE_DATE comparer: comparerName: Date """
Multiple fuzzy matching
Only set the invoice value to be exact and apply fuzzy matches to the other three columns.
""" searchPatterns: VALUE_EXACT: columns: - columnName: INVOICE_REFERENCE comparer: comparerName: Reference - columnName: INVOICE_DATE comparer: comparerName: Date - columnName: INVOICE_VALUE comparer: comparerName: Exact - columnName: INVOICE_VENDOR_NAME comparer: comparerName: CompanyName """
Date comparer parameters
Loosen the days' constraint of the date fuzzy match to 60 days.
""" searchPatterns: LOOSE_DATE_FUZZY: columns: - columnName: INVOICE_REFERENCE comparer: comparerName: Exact - columnName: INVOICE_DATE comparer: comparerName: Date parameters: numDays: 60 - columnName: INVOICE_VALUE comparer: comparerName: Exact - columnName: INVOICE_VENDOR_NAME comparer: comparerName: Exact """
CompanyName comparer parameters
Make the vendor name fuzzy matching stricter by increasing the threshold to 0.9.
""" searchPatterns: TIGHT_VENDOR_FUZZY: columns: - columnName: INVOICE_REFERENCE comparer: comparerName: Exact - columnName: INVOICE_DATE comparer: comparerName: Exact - columnName: INVOICE_VALUE comparer: comparerName: Exact - columnName: INVOICE_VENDOR_NAME comparer: comparerName: CompanyName parameters: threshold: 0.9 """
Reference comparer parameters
Make the reference fuzzy matching stricter by setting the maximum character errors to 2.
""" searchPatterns: TIGHT_REFERENCE_FUZZY: columns: - columnName: INVOICE_REFERENCE comparer: comparerName: Reference parameters: charMaxErrors: 2 - columnName: INVOICE_DATE comparer: comparerName: Exact - columnName: INVOICE_VALUE comparer: comparerName: Exact - columnName: INVOICE_VENDOR_NAME comparer: comparerName: Exact """
Value comparer parameters
Make the value fuzzy matching stricter by setting the maximum allowed value difference to be 30.
""" searchPatterns: TIGHT_VALUE_FUZZY: columns: - columnName: INVOICE_REFERENCE comparer: comparerName: Exact - columnName: INVOICE_DATE comparer: comparerName: Exact - columnName: INVOICE_VALUE comparer: comparerName: Value parameters: max_price_limit: 30 - columnName: INVOICE_VENDOR_NAME comparer: comparerName: Exact """