BUCKET_UPPER_BOUND - BUCKET_LOWER_BOUND
Description
BUCKET_UPPER_BOUND
and BUCKET_LOWER_BOUND
calculate histogram buckets and identify the buckets in which the input values reside.
BUCKET_UPPER_BOUND
and BUCKET_LOWER_BOUND
can be used with INT and FLOAT input columns. The resulting column is of INT or FLOAT type, respectively.
The operator generates a temporary table which is joined to the input column. Depending on which operator is used (BUCKET_UPPER_BOUND
or BUCKET_LOWER_BOUND
), the result of the operator is a column of upper or lower bucket boundaries (respectively). The result column contains upper/lower bucket boundaries of the owner buckets (buckets into which input elements are sorted). Additionally, all upper/lower boundaries of all histogram buckets are appended, regardless of whether they are empty or contain input column elements.
Syntax
BUCKET_UPPER_BOUND ( table.column [, SUGGESTED_WIDTH(suggested_width)] [, SUGGESTED_TICKS[suggested_ticks] ] [, MAX_COUNT(max_count) ] )
BUCKET_UPPER_BOUND ( table.column [, SUGGESTED_COUNT(suggested_count)] [, SUGGESTED_TICKS[suggested_ticks] ] [, MAX_COUNT(max_count) ] )
BUCKET_LOWER_BOUND ( table.column [, SUGGESTED_WIDTH(suggested_width)] [, SUGGESTED_TICKS[suggested_ticks] ] [, MAX_COUNT(max_count) ] )
BUCKET_LOWER_BOUND ( table.column [, SUGGESTED_COUNT(suggested_count)] [, SUGGESTED_TICKS[suggested_ticks] ] [, MAX_COUNT(max_count) ] )
If the query already contains an operator call with an input column and an (optional) histogram configuration, the other operator calls in the same query can be empty:
BUCKET_UPPER_BOUND ( )
BUCKET_LOWER_BOUND ( )
table.column: The input column that the histogram buckets calculation is based on.
suggested_width: Preferred bucket width. This value is taken into consideration when calculating the bucket width. The value of suggested_width defined in the query or a value close to that value is used. The suggested_width must be greater than 0 and cannot be NULL. The type of suggested_width depends on the data type of the input column:
In case a max_count is configured and suggested_width contradicts its value, suggested_width will be ignored.
suggested_count: Preferred bucket count. This value is taken into consideration when calculating the number of buckets in the histogram. The value of suggested_count defined in the query or a bucket count close to the preferred count is chosen. The suggested_count must be of type INT and greater than 0. The value of suggested_count cannot be NULL. In case a max_count is configured and suggested_count contradicts its value, suggested_count will be ignored.
suggested_ticks: A list of desired bucket boundaries. If any of the suggested ticks are outside of the input column range, they will be ignored. In case max_count is not configured, tick values are not just suggestions like suggested_width and suggested_count, it is mandatory that all ticks (which are inside the input column range) appear as bucket boundaries in the end result. In case max_count is configured and the biggest possible step defined by suggested_ticks results in a number of buckets greater than max_count, all ticks will be ignored apart from the first one in the list (unsorted). NULL values are not allowed in the list of suggested ticks. The possible data type of the suggested_ticks depends on the data type of the input column, the same as suggested_width:
INT input column: suggested can be of type INT (example: 4, 5, 6). suggested_ticks can also be of type INT represented as FLOAT (example: 4.0, 5.0, 6.0). The list suggested_ticks can contain a mix of these values.
FLOAT input column: suggested_ticks can be of type FLOAT or INT. The list suggested_ticks can contain a mix of these values.
max_count: Maximum bucket count. This value can be used to limit the number of buckets in the end result. In case a suggested_count is configured and it contradicts max_count, the suggested count will be ignored. In case a suggested_width is configured and it contradicts max_count, the suggested width will be ignored. In case suggested_ticks are defined and the biggest step that they allow results in a number of buckets greater than max_count, all of the suggested_ticks will be ignored, apart from the first one mentioned in the list (unsorted). The resulting number of buckets will always be smaller or equal to max_count. Null buckets and outlier buckets are not counted towards max_count. It can happen that the final number of buckets is smaller than max_count if this is needed to make the bucket boundaries user-friendly and easily readable. The max_count must be of type INT and greater than 0. The value of max_count cannot be NULL.
Limitations
Nested operator calls of BUCKET_UPPER_BOUND
/BUCKET_LOWER_BOUND
are not allowed.
The maximum number of SUGGESTED_TICKS
is limited to 10.
The maximum absolute value allowed for the elements of the input column, elements in SUGGESTED_TICKS
or SUGGESTED_WIDTH
is 10^15.
The number of buckets in the histogram (either defined in the query using SUGGESTED_COUNT
or calculated by the algorithm) will not exceed 10^4.
NULL and empty input column handling
Rows from input column, which are NULL, are also NULL in the output column. The BUCKET_UPPER_BOUND
/BUCKET_LOWER_BOUND
operators ignore NULL value rows when doing the histogram calculation.
BUCKET_LOWER_BOUND
and BUCKET_UPPER_BOUND
on an empty column result in an empty result column, when no additional configuration is specified.
Configuration propagation
In the case when there are multiple calls to BUCKET_UPPER_BOUND
/BUCKET_LOWER_BOUND
, it is possible to leave all calls empty apart from one, like so: BUCKET_UPPER_BOUND ( )
or BUCKET_LOWER_BOUND ( )
and the bucketing configuration (SUGGESTED_WIDTH
, SUGGESTED_COUNT
, SUGGESTED_TICKS
, MAX_COUNT
) will be propagated from the operator which contains the configuration to all the empty calls.
Multiple calls to BUCKET_UPPER_BOUND
/BUCKET_LOWER_BOUND
with histogram configurations that differ from one another can not exist, since a common table for these operators does not exist.
Examples
[1] BUCKET_LOWER_BOUND and BUCKET_UPPER_BOUND on an INT column with COUNT_TABLE. The result shows the calculated histogram buckets and the number of input column elements belonging to each bucket. | ||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||
|
[2] BUCKET_LOWER_BOUND and BUCKET_UPPER_BOUND on an INT column with null elements. Each of the two operators results in one column. The first five values are the lower/upper bucket boundaries of the buckets that the non-null input column elements belong to. The NULL element is placed into the [null - null] bucket. These values are joined to the corresponding input values. Then, the lower/upper bucket boundaries of all histogram buckets are appended, including the [null-null] bucket. The appended values are not joined to any value from the input column. In this example we can also notice that one of the values (50) is "far away" from where the majority of the elements are. This results in an outlier bucket being created, [20 - (+inf)] which we mark as [20 - null]. Additionally, we can see that this value influences the histogram bucket calculation - now the bucket width is 4, whereas in the previous example it was 2: | |||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||
|
[3] BUCKET_LOWER_BOUND on a FLOAT column with SUGGESTED_WIDTH. The first two values are the lower bucket boundaries of the buckets that input column elements belong to. These values are joined to the corresponding input values. Then, the lower bucket boundaries of all histogram buckets are appended. The appended values are not joined to any value from the input column.The resulting bucket width matches the value used in the query: | |||||||||||||||
| |||||||||||||||
|
[4] BUCKET_LOWER_BOUND on a FLOAT column with SUGGESTED_WIDTH. Here we can see that the SUGGESTED_WIDTH used in the query (4) is just taken as a suggestion, the resulting bucket width is different (5): | ||||||||||||||||||
| ||||||||||||||||||
|
[5] BUCKET_LOWER_BOUND on an INT column with SUGGESTED_COUNT. The first five values are the lower bucket boundaries of the buckets that input column elements belong to. Then, the lower bucket boundaries of all histogram buckets are appended. The resulting bucket count is the same as the one used in the query, not counting the outlier buckets: | |||||||||||||||||||||||||||
| |||||||||||||||||||||||||||
|
[6] Example of input column propagation. BUCKET_UPPER_BOUND will use the same input column as BUCKET_LOWER_BOUND : | |||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||
|
[7] In this example, we filter based on the values in the FilterColumn. The filter only keeps edges where the | ||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||
|
[8] In this example, we filter based on the values in the FilterColumn. The FILTER_TO_NULL is applied to the input column, which means that only the values where FilterColumn equals 1 are kept and used to calculate histogram buckets: | |||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||
|
[9] Example with MAX_COUNT. SUGGESTED_WIDTH contradicts MAX_COUNT and is ignored in favor of the MAX_COUNT value. Null buckets and outlier buckets are not counted towards MAX_COUNT. | ||||||||||||||||||||||
| ||||||||||||||||||||||
|