Performance Impacting Hive Functions (i.e., Distinct, Order By, Partition By)
Distinct and Order By
Hive operates differently, as each query is converted into a series of mapping and reducing processes. These usually operate in parallel and extremely fast.
DISTINCT and ORDER BY are sorting operations and are performed by only a single reducer so the last step of query is not parallelized.
If users observe their query initially using up many containers and then dropping down to two and staying there for a long time (>30 minutes), it might be sorting data. Consider eliminating DISTINCT and ORDER BY.
The number of containers for a queue is variable based on available system resources. The number of containers used depends on the process:
- 1 - for writing to disk, housekeeping, small query, etc.
- >2 - parallel processing their query
- 2 (after an initial larger number) - ORDERing and DISTINCTing
“Queue %” should be used to determine how busy the queue is
Users can use Hive windowing functions as a DISTINCT alternative. This can reduce the time a "distinct" takes from hours to minutes.
Partition By
The PARTITION BY clause can be used to separate data into distinct groups like a GROUP BY.
Step 1. Create a row column for each group as follows:
ROW_NUMBER() OVER (PARTITION BY key_column_1, key_column_2) rownum
→ With this, each partition group (a unique key for a group of rows ) now has a row number assigned so the first occurrence of that key is 1, the second is 2, etc.
Step 2. Then filter for only the first occurrence of rows with that key:
rownum=1
→ Thereby replicating a DISTINCT on the data but with parallelizing and no requirement to channel the data through a single reducer.
Related Article: How to Terminate a SAS and Hive Job