Group Rank - Amazon Kinesis Data Analytics SQL Reference

Group Rank

This function applies a RANK() function to logical groups of rows and optionally delivers the group in sorted order.

Applications of group_rank include the following:

  • To sort results of a streaming GROUP BY.

  • To determine a relationship within the results of a group.

Group Rank can do the following actions:

  • Apply rank to a specified input column.

  • Supply either sorted or unsorted output.

  • Enable the user to specify a period of inactivity for data flush.

SQL Declarations

The functional attributes and DDL are described in the sections that follow.

Functional Attributes for Group_Rank

This function acts as follows:

  • Gathers rows until either a rowtime change is detected or a specified idle-time limit is exceeded.

  • Accepts any streaming rowset.

  • Uses any column with a basic SQL data type of INTEGER, CHAR, VARCHAR as the column by which to do the ranking.

  • Orders the output rows either in the order received or in ascending or descending order of values in the selected column.

DDL for Group_Rank

group_rank(c cursor, rankByColumnName VARCHAR(128),    rankOutColumnName VARCHAR(128), sortOrder VARCHAR(10), outputOrder VARCHAR(10),    maxIdle INTEGER, outputMax INTEGER)  returns table(c.*, "groupRank" INTEGER)

The parameters to the function are listed in the following table.

Parameter Description
c CURSOR to Streaming Result Set

rankByColumnName

String naming the column to use for ranking the group.

rankOutColumnName

String naming the column to use for returning the rank.

This string must match the name of the groupRank column in the RETURNS clause of the CREATE FUNCTION statement.

sortOrder

Controls ordering of rows for rank assignment.

Valid values are as follows:

  • 'asc' - Ascending based on the rank.

  • 'desc' - Descending based on the rank.

outputOrder

Controls ordering of output. Valid values are as follows:

  • 'asc' - Ascending based on the rank.

  • 'desc' - Descending based on the rank.

maxIdle

Time limit in milliseconds for holding a group for ranking.

When maxIdle expires, the current group is released to the stream. A value of zero indicates no idle timeout.

outputMax

Maximum number of rows the function outputs in a given group.

A value of 0 indicates no limit.

Example

Example Dataset

The following example is based on the sample stock dataset that is part of the Getting Started Exercise in the Amazon Kinesis Data Analytics Developer Guide. To run each example, you need an Amazon Kinesis Data Analytics application that has the sample stock ticker input stream. To learn how to create an analytics application and configure the sample stock ticker input stream, see Getting Started in the Amazon Kinesis Data Analytics Developer Guide.

The sample stock dataset has the following schema:

(ticker_symbol VARCHAR(4), sector VARCHAR(16), change REAL, price REAL)

Example 1: Sort the Results of a GROUP BY Clause

In this example, the aggregate query has a GROUP BY clause on ROWTIME that groups the stream into finite rows. The GROUP_RANK function then sorts the rows returned by the GROUP BY clause.

CREATE OR REPLACE STREAM "ticker_grouped" ( "group_time" TIMESTAMP, "ticker" VARCHAR(65520), "ticker_count" INTEGER); CREATE OR REPLACE STREAM "destination_sql_stream" ( "group_time" TIMESTAMP, "ticker" VARCHAR(65520), "ticker_count" INTEGER, "group_rank" INTEGER); CREATE OR REPLACE PUMP "ticker_pump" AS INSERT INTO "ticker_grouped" SELECT STREAM FLOOR(SOURCE_SQL_STREAM_001.ROWTIME TO SECOND), "TICKER_SYMBOL", COUNT(TICKER_SYMBOL) FROM SOURCE_SQL_STREAM_001 GROUP BY FLOOR(SOURCE_SQL_STREAM_001.ROWTIME TO SECOND), TICKER_SYMBOL; CREATE OR REPLACE PUMP DESTINATION_SQL_STREAM_PUMP AS INSERT INTO "destination_sql_stream" SELECT STREAM "group_time", "ticker", "ticker_count", "groupRank" FROM TABLE( GROUP_RANK( CURSOR(SELECT STREAM * FROM "ticker_grouped"), 'ticker_count', 'groupRank', 'desc', 'asc', 5, 0));

Results

The preceding examples output a stream similar to the following.

Data table showing ROWTIME, group_time, ticker, ticker_count, and group_rank columns with sample values.

Operational Overview

Rows are buffered from the input cursor for each group (that is, rows with the same rowtimes). Ranking of the rows is done either after the arrival of a row with a different rowtime (or when the idle timeout occurs). Rows continue to be read while ranking is performed on the group of rows with the same rowtime.

The outputMax parameter specifies the maximum number of rows to be returned for each group after ranks are assigned.

By default, group_rank supports column pass through, as the example illustrates by using c.* as the standard shortcut directing pass through of all input columns in the order presented. You can, instead, name a subset using the notation "c.columName", allowing you to reorder the columns. However, using specific column names ties the UDX to a specific input set, whereas using the c.* notation allows the UDX to handle any input set.

The rankOutColumnName parameter specifies the output column used to return ranks. This column name must match the column name specified in the RETURNS clause of the CREATE FUNCTION statement.