MATCH_STRINGS
Description
MATCH_STRINGS finds the top-k best matching strings from another column for each string in the input column.
Warning
In some settings, this operator may require excessive CPU time. If the execution time exceeds 10 minutes, the execution is stopped and an error is reported.
MATCH_STRINGS
matches strings from an input column against strings in a match column based on their edit distance (see CLUSTER_STRINGS for an explanation of the edit distance). The best top-k matches are returned separated by a given separator.
Warning
Computation Times In some settings, this operator may require excessive CPU time. If the execution time exceeds 10 minutes, the execution is stopped and an according error is reported.
Syntax
MATCH_STRINGS (table1.input_column, table2.match_column [, TOP_K ( top_k )] [, SEPARATOR ( separator ) ])
input_column: String column.
match_column: String column. This column might be from a different table than the input column.
top_k: Positive integer that is recommended to not exceed 10 and must not exceed 100. The number of matches
MATCH_STRINGS
should find. Defaults to 1.separator: All found strings will be separated by this string. Defaults to ', '.
Multiple occurrences
Multiple occurrences of the same string value in the match column are ignored and the result of MATCH_STRINGS
will be the same as when called with a match column containing this string value only once.
NULL Handling
MATCH_STRINGS
ignores NULL values. If the column contains only NULL values, the result will also be a column containing only NULL values. If there are NULL values and non NULL values, it is guaranteed that no non NULL value is matched with a NULL value.
Execution Time Limit
MATCH_STRINGS
has currently a set time limit to around 10 minutes until it will interrupt its execution. High execution times might be due to a high TOP_K
value and lowering it will improve the operator runtime.
Examples
[1] For every entry in the | |||||||||||||||
| |||||||||||||||
|
[2] Example of | |||||||||||||||
| |||||||||||||||
|
[3] There has to be at least one common token between the string in the input column and the string in the match column. Otherwise | |||||||||||||||
| |||||||||||||||
|
[4] Multiple occurrences of a string value in the match column do not effect the result. | |||||||||||||||
| |||||||||||||||
|