Using SQL window functions instead of join and groupBy
Window functionsSUM
or
MAX
, operate on a group of rows and calculate a single
return value for every group. Window functions are useful for processing tasks, such as
calculating a moving average or accessing the value of rows based on the relative position
of the current row.
The Spark
groupBy
function collects data into groups and performs aggregate functions on
the grouped data. When you use groupBy
, Spark partly aggregates
the data first and then shuffles the reduced dataset. When you use window functions, the
entire dataset is shuffled. Window functions can provide faster runtimes.
In very large datasets, if the cardinality of the column is large, then window functions are recommended. However, if the cardinality of the column is small, data aggregation is small, and the aggregated result can be broadcasted in the join.