SQL best practice checklist
When managing your SQL, use the following best practice checklist:
Category | Rule | Comment |
CREATE TABLE/VIEW | Use Views if they are not/seldom accessed in the transformations | The use of views should be limited to DM tables only. For large data model tables with complex definitions (e.g. multiple joins) a table may be created for DM instead, if DM load runtimes are longer than expected. |
Use Tables if many transformations access it | Temporary tables (especially large tables) should not be created as views when used by subsequent transformations. | |
Combine tables if possible | Avoid extra tables that carry the same data, e.g. case table and TMP table | |
Order a table that is often joined by the columns used in this JOIN and segment by one | This will sort the default projection by those columns and will improve performance of subsequent queries that use the table | |
Ensure Table Statistics are refreshed for optimal performance of queries. | Run SELECT ANALYZE_STATISTICS(‘TABLENAME’); after the creation of every temporary/custom table. | |
Field sizes are significant from a performance aspect. Review your tableschemas and reduce the size of the fields if it is not used. | Example: A reduction from VARCHAR(200) to VARCHAR(20) might have a significantperformance impact. | |
SELECT | Avoid DISTINCT | Consider whether a better designed query logic makes the use of DISTINCT unnecessary. |
Reduce SELECT * to needed columns | The fewer columns carried over (in temp tables or DM tables), the more performant the pipeline will be. | |
Avoid (nearly) unused JOINS | e.g. if only used for one column that 'might' be used in an analysis | |
JOINS | Tables should be INNER JOINED in the order of ascending size | The sorted hash table needed for the INNER join will thus be created based on the smaller table. If Table Statistics have been properly refreshed, attention to the join order is not required. |
JOINS fields > 500 characters are significantly slower than JOINS on fields < 500 characters. | Make sure to optimize the field size, example: VARCHAR(60) offers better performance than using VARCHAR(600). | |
Avoid INNER JOIN with the purpose of limiting the data set | using WHERE EXISTS in such cases may help to avoid duplications of records in the resulting data set that a JOIN operation would cause, forcing the user to resort to a use of DISTINCT, negatively affecting performance | |
Don’t modify columns with functions in the join | e.g. avoid the use of functions such as substring(), right() or cast() in JOIN conditions | |
UNION | Move conditions to extraction if possible | If same conditions are applied on a table throughout the pipeline, limiting the table contents during extraction will improve runtimes throughout the pipeline, as well as provide APC savings |
Use UNION ALL if possible | Using UNION ALL instead of UNION avoids DISTINCT operations in the background | |
CONDITIONS | Move WHERE-conditions directly into inner joins | Any WHERE conditions that are applied on tables that are part of INNER JOINS of a query can be incorporated in the JOIN conditions instead. |
BETWEEN is better than AND | DATE≥1970 AND DATE≤1980 is slower than DATE BETWEEN 1970 AND 1980 |