How to Use Partition Pruning in Hive Queries
Table partitioning is an optimization practice for processing big data by splitting it into smaller parts. In such cases, it is possible to avoid a full table scan of the large tables such as claims. Partition pruning is a performance optimization that enables a database engine (Hive in the case of the Centralized Data Repository (CDR)) to scan only necessary partitions. The Hive engine requires definite partition values in the execution plan to narrow down partitions to be scanned.
It is strongly recommended to run an EXPLAIN statement to display the execution plan using the %hive_explain(sql) macro to evaluate if the partitioning pruning is effective BEFORE running the actual query. Additional details about the macro can be found in Section 14 of the SAS Viya Global Macros list.
For the five examples below, the row count differences may change over time and were representative at the time this document was created.
- Example 1 shows a query that fails to use partition pruning.
- Example 2 shows a query with partial partition pruning.
- Example 3 shows a query with effective partition pruning by including hard-coded values.
- Example 4 shows a query with effective partition pruning through use of a common table expression (CTE).
- Example 5 shows a query that fails to use partition pruning because the partition values were included from a table.
Notice the row count differences between the five queries that is identified by running %hive_explain macro prior to running the actual query.