Page tree



Back to Knowledge Base

How to Use Partition Pruning in Hive Queries

Table partitioning is an optimization practice for processing big data by splitting it into smaller parts. In such cases, it is possible to avoid a full table scan of the large tables such as claims. Partition pruning is a performance optimization that enables a database engine (Hive in the case of the Centralized Data Repository (CDR)) to scan only necessary partitions. The Hive engine requires definite partition values in the execution plan to narrow down partitions to be scanned. 

It is strongly recommended to run an EXPLAIN statement to display the execution plan using the %hive_explain(sql) macro to evaluate if the partitioning pruning is effective BEFORE running the actual query. Additional details about the macro can be found in Section 14 of the SAS Viya Global Macros list. 

For the five examples below, the row count differences may change over time and were representative at the time this document was created.  

  • Example 1 shows a query that fails to use partition pruning. 
  • Example 2 shows a query with partial partition pruning. 
  • Example 3 shows a query with effective partition pruning by including hard-coded values.
  • Example 4 shows a query with effective partition pruning through use of a common table expression (CTE).
  • Example 5 shows a query that fails to use partition pruning because the partition values were included from a table. 

Notice the row count differences between the five queries that is identified by running %hive_explain macro prior to running the actual query. 

Example 1

This query performs a full table scan because there is no partition pruning. This EXECUTION plan returns a row count of nch.hosp_header: 15,985,838 and row count of nch.hosp_line: 205,525,366. 

Sample SAS Code without Partition Pruning
%LET sql=%STR(
CREATE TABLE &myschema.show_the_plan AS 
	SELECT header.clm_thru_dt, header.claim_sk
	FROM nch.hosp_header AS header
	INNER JOIN nch.hosp_line AS line
		ON header.claim_sk = line.claim_sk);

/* Run EXPLAIN plan */
%hive_explain(&sql);

Sample HiveQL Code without Partition Pruning
EXPLAIN
	CREATE TABLE myschema.show_the_plan AS
		SELECT header.clm_thru_dt, header.claim_sk
		FROM nch.hosp_header AS header
		INNER JOIN nch.hosp_line AS line
			ON header.claim_sk = line.claim_sk;

Example 2

In this query, only a partition value for the header table is specified. Pruning occurs with the header table however the line table still performs a full table scan. This query returns a row count of nch.hosp_header: 1,819,756 and row count of nch.hosp_line: 205,525,366. 

Sample SAS Code to Demonstrate Partial Partition Pruning
%LET sql=%STR(
CREATE TABLE &myschema.show_the_plan AS   
	SELECT header.clm_thru_dt, header.claim_sk
	FROM nch.hosp_header AS header
	INNER JOIN nch.hosp_line AS line
		ON header.claim_sk = line.claim_sk
	WHERE header.clm_thru_dt > '2020-01-01');

/* Run EXPLAIN plan */
%hive_explain(&sql);

Sample HiveQL Code to Demonstrate Partial Partition Pruning
EXPLAIN
  	CREATE TABLE myschema.show_the_plan AS
		SELECT header.clm_thru_dt, header.claim_sk
		FROM nch.hosp_header AS header
		INNER JOIN nch.hosp_line AS line
			ON header.claim_sk = line.claim_sk
		WHERE header.clm_thru_dt > '2020-01-01';

Example 3

In this query, partition values are coded into the query, and users can see that partition pruning is working when they review the EXECUTION plan. This query returns a row count of nch.hosp_header: 1,819,756 and row count of nch.hosp_line: 20,762,051. 

Sample SAS Code to Demonstrate Effective Partition Pruning
%LET sql=%STR(
CREATE TABLE &myschema.show_the_plan AS     
	SELECT header.clm_thru_dt, header.claim_sk
	FROM nch.hosp_header AS header
	INNER JOIN nch.hosp_line AS line
		ON header.claim_sk = line.claim_sk
	WHERE header.clm_thru_dt > '2020-01-01'
			AND line.clm_thru_dt > '2020-01-01');

/* Run EXPLAIN plan */
%hive_explain(&sql);

Sample HiveQL Code to Demonstrate Effective Partition Pruning
EXPLAIN
	CREATE TABLE myschema.show_the_plan AS
		SELECT header.clm_thru_dt, header.claim_sk
		FROM nch.hosp_header AS header
		INNER JOIN nch.hosp_line AS line
			ON header.claim_sk = line.claim_sk
		WHERE header.clm_thru_dt > '2020-01-01'
			AND line.clm_thru_dt > '2020-01-01';

Example 4

In this query, partition values are included in the query through use of a common table expression (CTE) that calculates a partition value based on the current date. Users can see that partition pruning is working when they review the EXECUTION plan. This query returns a row count of nch.hosp_header: 1,819,756 and row count of nch.hosp_line: 20,762,051. 

Sample SAS Code to Demonstrate Partition Pruning using CTE
%LET sql=%STR(
CREATE TABLE &myschema.show_the_plan AS
	WITH 
	CURRENT_DT AS 
		(
		SELECT ADD_MONTHS(CURRENT_DATE, -16) AS date_last_year
		)
	SELECT header.clm_thru_dt, header.claim_sk
	FROM nch.hosp_header AS header
	INNER JOIN nch.hosp_line AS line
		ON header.claim_sk = line.claim_sk
	INNER JOIN CURRENT_DT ly
	WHERE header.clm_thru_dt > ly.date_last_year
		AND line.clm_thru_dt > ly.date_last_year);

/* Run EXPLAIN plan */
%hive_explain(&sql);

Sample HiveQL Code to Demonstrate Partition Pruning using CTE
EXPLAIN
	CREATE TABLE myschema.show_the_plan AS
	WITH CURRENT_DT AS 
		(
			SELECT ADD_MONTHS(CURRENT_DATE, -16) AS date_last_year
		)
	SELECT header.clm_thru_dt, header.claim_sk
	FROM nch.hosp_header AS header
	INNER JOIN nch.hosp_line AS line
		ON header.claim_sk = line.claim_sk
	INNER JOIN CURRENT_DT ly
	WHERE header.clm_thru_dt > ly.date_last_year
		AND line.clm_thru_dt > ly.date_last_year;

Example 5

In this query, partition values are included from a table. Hive is not able to interpret the partition value and therefore it performs a full table scan. Partition pruning is not working when we review the EXECUTION plan. This query returns a row count of nch.hosp_header: 15,985,838 and row count of nch.hosp_line: 205,525,366. 

Sample SAS Code to Demonstrate Partition Pruning using CTE
%LET myschema = yourprojectdb;

%hive_drop_table(&myschema,date_last_year);

%LET sql= %STR(
CREATE TABLE &myschema..date_last_year AS
	SELECT add_months(CURRENT_DATE, -16) AS date_last_year));
%hive_exec_sql(&sql);

%LET sql=%STR(
CREATE TABLE &myschema..show_the_plan AS    
	SELECT header.clm_thru_dt, header.claim_sk
	FROM nch.hosp_header AS header
	INNER JOIN nch.hosp_line AS line
		ON header.claim_sk = line.claim_sk
	INNER JOIN &myschema..date_last_year ly
	WHERE header.clm_thru_dt > ly.date_last_year
		AND line.clm_thru_dt > ly.date_last_year);

/* Run EXPLAIN plan */
%hive_explain(&sql);

Sample HiveQL Code to Demonstrate Partition Pruning using CTE
CREATE TABLE damod.pgrivas_date_last_year AS
	SELECT add_months(CURRENT_DATE, -16) AS date_last_year

EXPLAIN
 	CREATE TABLE myschema.show_the_plan AS
		SELECT header.clm_thru_dt, header.claim_sk
		FROM nch.hosp_header AS header
		INNER JOIN nch.hosp_line AS line
			ON header.claim_sk = line.claim_sk
		INNER JOIN damod.pgrivas_date_last_year ly
		WHERE header.clm_thru_dt > ly.date_last_year
			AND line.clm_thru_dt > ly.date_last_year;
  
  • No labels