Skip to main content

Celonis Product Documentation

MATCH_STRINGS
Description

MATCH_STRINGS finds the top-k best matching strings from another column for each string in the input column.

Warning

In some settings, this operator may require excessive CPU time. If the execution time exceeds 10 minutes, the execution is stopped and an error is reported.

MATCH_STRINGS matches strings from an input column against strings in a match column based on their edit distance (see CLUSTER_STRINGS for an explanation of the edit distance). The best top-k matches are returned separated by a given separator.

Warning

Computation Times In some settings, this operator may require excessive CPU time. If the execution time exceeds 10 minutes, the execution is stopped and an according error is reported.

Syntax
 MATCH_STRINGS (table1.input_column, table2.match_column [, TOP_K ( top_k )] [, SEPARATOR ( separator ) ])
  • input_column: String column.

  • match_column: String column. This column might be from a different table than the input column.

  • top_k: Positive integer that is recommended to not exceed 10 and must not exceed 100. The number of matches MATCH_STRINGS should find. Defaults to 1.

  • separator: All found strings will be separated by this string. Defaults to ', '.

Multiple occurrences

Multiple occurrences of the same string value in the match column are ignored and the result of MATCH_STRINGS will be the same as when called with a match column containing this string value only once.

NULL Handling

MATCH_STRINGS ignores NULL values. If the column contains only NULL values, the result will also be a column containing only NULL values. If there are NULL values and non NULL values, it is guaranteed that no non NULL value is matched with a NULL value.

Execution Time Limit

MATCH_STRINGS has currently a set time limit to around 10 minutes until it will interrupt its execution. High execution times might be due to a high TOP_K value and lowering it will improve the operator runtime.

Examples

[1]

For every entry in the input column MATCH_STRINGS finds the top_k best matching strings from the match column. Here top_k is the default value 1.

Query

Column1

         MATCH_STRINGS ( "TABLE1"."Input" , "TABLE2"."Match" )
        

Input

Output

TABLE1

Input : string

'Shirt'

'Pants'

TABLE2

Match : string

'T-Shirt'

'Sweatshirt'

'Short pants'

'Sweatpants'

Result

Column1 : string

'T-Shirt'

'Sweatpants'

[2]

Example of MATCH_STRINGS with non default values. Since top_k is two and the separator is '##' the result is a list of two strings separated by '##'.

Query

Column1

         MATCH_STRINGS ( "TABLE1"."Input" , "TABLE2"."Match" , TOP_K ( 2 ) , SEPARATOR ( '##' ) )
        

Input

Output

TABLE1

Input : string

'Shirt'

'Pants'

TABLE2

Match : string

'T-Shirt'

'Sweatshirt'

'Short pants'

'Sweatpants'

Result

Column1 : string

'T-Shirt##Sweatshirt'

'Sweatpants##Short pants'

[3]

There has to be at least one common token between the string in the input column and the string in the match column. Otherwise MATCH_STRINGS will find less than top_k matches.

Query

Column1

         MATCH_STRINGS ( "TABLE1"."Input" , "TABLE2"."Match" , TOP_K ( 2 ) )
        

Input

Output

TABLE1

Input : string

'xyz'

'T-Shirt'

'abc'

TABLE2

Match : string

'Shirt'

'Sweatshirt'

Result

Column1 : string

''

'Shirt, Sweatshirt'

'Sweatshirt'

[4]

Multiple occurrences of a string value in the match column do not effect the result.

Query

Column1

         MATCH_STRINGS ( "TABLE1"."Input" , "TABLE2"."Match" , TOP_K ( 2 ) )
        

Input

Output

TABLE1

Input : string

'Shirt'

'Pants'

TABLE2

Match : string

'T-Shirt'

'T-Shirt'

'Sweatpants'

'Sweatpants'

Result

Column1 : string

'T-Shirt, Sweatpants'

'Sweatpants, T-Shirt'