# Group by aggregate functions¶

Data Prep aggregate functions let you combine sets of rows and compute a specific function on the referenced column.

These operations are called aggregate because they find matching rows in the dataset and then combine them into one row. A matching row is defined as one that, excluding the reference column, shares the same values in a column-by-column examination. The reference column is excluded from the column-by-column examination because its value are submitted to the aggregate function in order to produce the reference column value in the single-row result.

The following sections describe the aggregate functions available when you use **Group By** shaping operations:

Function | Purpose |
---|---|

array | Condenses duplicate rows into a single row of data and assembles the reference column data into a single, comma-separated string. |

average | Condenses duplicate rows into a single row of data and displays the average of the numbers in the reference column. |

count | Condenses duplicate rows into a single row of data and displays the number of duplicate rows in the reference column. |

count (numbers only) | Condenses duplicate rows into a single row of data and displays the number of duplicate rows but only counts numeric values. |

count distinct | Condenses duplicate rows into a single row of data and displays the number of unique values in the reference column. |

first | Condenses duplicate rows into a single row of data and displays the first value that appeared for the duplicate rows. |

last | Condenses duplicate rows into a single row of data and displays the last value that appeared for the duplicate rows. |

max | Condenses duplicate rows into a single row of data and displays the largest of the numbers in the reference column. |

min | Condenses duplicate rows into a single row of data and displays the smallest of the numbers in the reference column. |

median | Condenses duplicate rows into a single row of data and displays the median of the numbers in the reference column. |

mode | Condenses duplicate rows into a single row of data and displays the mode of the numbers in the reference column. |

stdev | Condenses duplicate rows into a single row of data and displays the standard deviation of the numeric values in the reference column. |

stdevp | Condenses duplicate rows into a single row of data and displays the standard deviation for a population contained within the reference column. |

sum | Condenses duplicate rows into a single row of data and displays the sum of the numbers in the reference column. |

var | Condenses duplicate rows into a single row of data and displays the variance of the numeric values in the reference column. |

varp | Condenses duplicate rows into a single row of data and displays the variance of a population contained within the reference column. |

## array¶

Of the available aggregate functions available, **array** is unique in that it operates on both text and numeric values. Rather than perform a mathematical operation on collapsed rows, all values in the reference column (the column to which **array** is applied) are temporarily stored. As the unique single row is created, **array** assembles the reference column data from the set into a single, comma-separated string within the column.

Excluding the column to which the **array** is being applied, all rows are examined column-by-column in order to find identical rows. The header of the reference column receives a name change to become *Array of *.

**Examples**

The small dataset below will be used to show how **array** operates.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

two | two | 7 |

1 | two | 4 |

**Example 1**

Applying the **array** function to *Column C* reduces the row count from four to two. The value in the column *Array of Column C* shows the sum of of the *Column C* values in the duplicate rows that were collapsed during the operation.

Column A | Column B | Column C |
---|---|---|

1 | two | 5.0, 6.0, 4.0 |

two | two | 7.0 |

**Example 2**

Applying the **array** function to *Column A* results in no row count reduction since examination of *Column B* and *Column C* values in each row reveals that each is already unique. The value in the column *Array of Column A* therefore displays each row with its original value—however, numbers have been converted into text.

Column A | Column B | Column C |
---|---|---|

1.0 | two | 5 |

two | two | 7 |

1.0 | two | 6 |

1.0 | two | 4 |

## average¶

The **average** aggregate function finds an average of the numbers in the reference column (the column to which it is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which **average** is being applied, all rows are examined column-by-column in order to find identical rows.

A mathematical average (also called an “arithmetic mean” or simply “mean”) is calculated by adding all of the numeric values in a set and then dividing the resulting sum by the number of items the set contained. Note that attempts to apply **average** to a text value in the reference column will result in a 0 for that row. The header of the reference column receives a name change to become *Average of *.

Related aggregate functions are median and mode.

**Examples**

The small dataset below will be used to show how **average** operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

two | two | 7 |

1 | two | 4 |

**Example 1**

Applying the **average** function to *Column C* reduces the row count from four to two. The value in the column *Average of Column C* shows the average of of the *Column C* values in the duplicate rows that were collapsed during the operation: (5 + 6 + 4) ÷ 3 = 5 while 7 ÷ 1 = 7.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

two | two | 7 |

**Example 2**

Applying the average function to *Column A* results in no row count reduction since examination of *Column B* and *Column C* values in each row reveals that each is already unique. The value in the column *Average of Column A* therefore displays each row with its original numeric value (in this case, a number 1) since the **average** function found no duplicate rows that could be involved in a mathematical operation. The 0 replaces the text “two” because the **average** operation cannot be applied to text values.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

1 | two | 4 |

0 | two | 7 |

## count¶

The **count** aggregate function returns the number of duplicate rows in a dataset. Excluding the column to which it is being applied (the reference column), all rows are examined column-by-column in order to find duplicate rows. Those rows that contain duplicate data are collapsed into a single unique row. The reference column receives a name change to become *Count of * and the number that appears in the column indicates the number of duplicate rows that have been collapsed.

**Examples**

The small dataset below will be used to show how **count** operates.

Column A | Column B | Column C |
---|---|---|

one | two | 5 |

one | two | 6 |

two | two | 7 |

one | two | 4 |

**Example 1**

Applying the **count** function to *Column C* reduces the row count from four to two. The value in the column *Count of Column C* shows the count of (number of times) duplicate rows that were collapsed during the operation.

Column A | Column B | Column C |
---|---|---|

one | two | 3 |

two | two | 1 |

**Example 2**

Applying the **count** function to *Column A* results in no row count reduction since examination of *Column B* and *Column C* values in each row reveals that each is already unique. The value in the column *Count of Column A* therefore shows a value of 1 for each of the four rows.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

1 | two | 7 |

1 | two | 4 |

## count (numbers only)¶

The **count (numbers only)** aggregate function operates exactly the same as the count function. However, **count (numbers only)** only counts numeric values and ignores text values during the counting process.

## count distinct¶

The **count distinct** aggregate function returns the number of unique values in the column being counted, as opposed to the count function which counts all values.

## first¶

The **first** aggregate function searches the data for duplicate rows (based on the columns chosen by the user) and condenses them into a single row of data. The **first** function will then display the first value within the data that appeared for the duplicate rows. Any values following the first value will be lost during this process.

**Example**

The small dataset below will be used to show how **first** operates.

Column A | Column B | Column C |
---|---|---|

one | two | 5 |

one | two | 6 |

two | two | 7 |

one | two | 4 |

Applying the **first** function to *Column C* reduces the row count from four to two. The value in the column *First of Column C* shows the first value from *Column C* of the duplicate rows that were collapsed during the operation.

Column A | Column B | Column C |
---|---|---|

one | two | 5 |

two | two | 7 |

## last¶

The **last** aggregate function searches the data for duplicate rows (based on the columns chosen by the user) and condenses them into a single row of data. The **last** function will then display the last value within the data that appeared for the duplicate rows. Any values prior to the last value will be lost during this process.

**Example**

The small dataset below will be used to show how **last** operates.

Column A | Column B | Column C |
---|---|---|

one | two | 5 |

one | two | 6 |

two | two | 7 |

one | two | 4 |

Applying the Last function to *Column C* reduces the row count from four to two. The value in the column *Last of Column C* shows the last value from *Column C* of the duplicate rows that were collapsed during the operation.

Column A | Column B | Column C |
---|---|---|

one | two | 4 |

two | two | 7 |

# max¶

The **max** aggregate function returns the largest of the numbers in the reference column (the column to which it is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which **max** is being applied, all rows are examined column-by-column in order to find identical rows.

The counterpart to this function is min (minimum).

**Example**

The small dataset below will be used to show how **max** operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

two | two | 7 |

1 | two | 4 |

Applying the **max** function to *Column C* reduces the row count from four to two. The value in the column *Max of Column C* shows the maximum of of the *Column C* values in the duplicate rows that were collapsed during the operation.

In the dataset that is returned by the **max** function (shown below) number 6 in the first row resulted from the set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when *Column A* and *Column B* were examined. (*Column C* was excluded from this examination because it is the reference column.) In this set of three numbers, 6 is the largest—therefore it became the value shown in the reference column.

The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is both the minimum and maximum of the single-row set, the function returns a 7 for that row.

## min¶

The **min** aggregate function returns the smallest of the numbers in the reference column (the column to which it is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the **min** is being applied, all rows are examined column-by-column in order to find identical rows.

The counterpart to this function is max (maximum).

**Example**

The small dataset below will be used to show how **min** operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

two | two | 7 |

1 | two | 4 |

Applying the **min** function to *Column C* reduces the row count from four to two. The value in the column *Min of Column C* shows the minimum of of the *Column C* values in the duplicate rows that were collapsed during the operation.

In the dataset that is returned by the **min** function (shown below) number 4 in the first row resulted from the set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when *Column A* and *Column B* were examined. (*Column C* was excluded from this examination because it is the reference column.) In this set of three numbers, 4 is the smallest—therefore it became the value shown in the reference column.

The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is both the minimum and maximum of the single-row set, the function returns a 7 for that row.

Column A | Column B | Column C |
---|---|---|

1 | two | 4 |

two | two | 7 |

## median¶

The **median** aggregate function finds the median of the numbers in the reference column (the column to which median is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which **median** is being applied, all rows are examined column-by-column in order to find identical rows.

A median value is one in the middle of a range of numbers ordered from lowest to highest value. This means that half of the numbers are to the “right” of the value returned and half of the numbers are to the “left” of the value returned. Where there is an even set of numbers (i.e., no single number in the middle of the collection), the function calculates an average on the pair of numbers in the middle of the range (i.e., the two numbers on either side of the mid-point).

Note that attempts to apply **median** to a text value in the reference column will result in a error for that row. The header of the reference column receives a name change to become *Median of *.

Related aggregate functions are average and mode.

**Example**

The small dataset below will be used to show how **median** operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

one | two | 5 |

one | two | 6 |

two | two | 7 |

one | two | 4 |

Applying the **median** function to *Column C* reduces the row count from four to two. The value in the column *Median of Column C* shows the median of the *Column C* values in the duplicate rows that were collapsed during the operation.

In the dataset that is returned by the **median** function (shown below) number 5 in the first row resulted from the ordered set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when *Column A* and *Column B* were examined. (*Column C* was excluded from this examination because it is the reference column.) In this set of three numbers, 5 is the middle value with one number in the set on either side of it.

The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is in the middle of the set (there are zero numbers on either side of it) the function returns a 7 for that row.

Column A | Column B | Column C |
---|---|---|

one | two | 5 |

two | two | 7 |

## mode¶

A mode is the value that occurs most frequently in a set of numbers. The **mode** aggregate function finds the most frequently occurring number in the reference column (the column to which **mode** is applied) among those rows that are identical. All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which **mode** is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of a set against which the **mode** function operates.

Although **mode** will still combine rows that have text in the reference column, it ignores text in finding the actual mode for the set. The header of the reference column receives a name change to become *Mode of *.

Important

Where there is a “tie” for the mode of a set (i.e., where multiple numbers have an equal number of occurrences and there is no other number that appears more frequently) the result of **mode** is unpredictable. Although one of the “tied” numbers will appear in the column *Mode of * for the appropriate row, it is not possible to determine which of the values will occur.

Related aggregate functions are average and median.

**Example**

The small dataset below will be used to show how **mode** operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

one | two | 3 |

one | two | 6 |

two | two | 7 |

one | two | 3 |

Applying the **mode** function to *Column C* reduces the row count from four to two. The value in the column *Mode of Column C* shows the mode of the *Column C* values in the duplicate rows that were collapsed during the operation.

In the resulting dataset (shown below) number 3 in the first row resulted from the ordered set of numbers {3, 6, 3} where each member of the set was contributed by row that was duplicate and then collapsed into a single row. In this set of three numbers, 3 occurs most frequently (twice in the set of three numbers) and is therefore the mode for the set.

The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is the only number in the set, it is therefore also the most frequently occurring.

Column A | Column B | Column C |
---|---|---|

one | two | 3 |

two | two | 7 |

## stdev¶

The **stdev** (standard deviation) aggregate function computes the standard deviation—the amount of variation from the average—that exists within a sample set of data. This aggregate function calculates the standard deviation of the numeric values in the reference column (the column to which **stdev** is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which **stdev** is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the standard deviation calculation. The header of the reference column receives a name change to become *Stdev of *.

If there are text values in the reference column, they are ignored within the **stdev** calculation. Note also that the aggregate **stdev** function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

The standard deviation for data is the square root of its variance. If the set under analysis represents all data points (referred to as a “population”), use of stdevp is recommended for a more accurate result. A related function that deals with statistical variance is varp.

**Example**

The dataset below will be used to show how **stdev** operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

one | two | 0.2 |

one | two | 0.1 |

one | two | 1.1 |

one | two | 0.2 |

one | two | 0.6 |

one | one | 0.2 |

one | one | 0.27 |

one | two | 0.2 |

one | two | 0.4 |

As shown in the table below, applying the **stdev** function to *Column C* reduces the row count from nine to two. The value in the column *Stdev of Column C* shows the standard deviation of the *Column C* sample data values in the duplicate rows that were collapsed during the operation.

Column A | Column B | Column C |
---|---|---|

one | two | 0.3511884584284246 |

one | one | 0.049497474683058325 |

## stdevp¶

The **stdevp** (standard deviation for a population) aggregate function computes the standard deviation—the amount of variation from the average—that exists within an entire set of data (a population). This aggregate function calculates the standard deviation for a population using the numeric values in the reference column (the column to which **stdevp** is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which **stdevp** is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the standard deviation calculation for the population. The header of the reference column receives a name change to become *StdevP of *.

If there are text values in the reference column, they are ignored within the **stdevp** calculation. Note also that the aggregate **stdevp** function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

If the set under analysis represents only a sample of data, use of stdev is recommended for a more accurate result. Other related functions that deal with statistical variance are var and varp.

**Example**

The dataset below will be used to show how **stdevp** operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

one | two | 0.2 |

one | two | 0.1 |

one | two | 1.1 |

one | two | 0.2 |

one | two | 0.6 |

one | one | 0.2 |

one | one | 0.27 |

one | two | 0.2 |

one | two | 0.4 |

As shown in the table below, applying the **stdevp** function to *Column C* reduces the row count from nine to two. The value in the column *StdevP of Column C* shows the standard deviation of the population contained in *Column C* among the duplicate rows that were collapsed during the operation.

Column A | Column B | Column C |
---|---|---|

one | two | 0.32513733362117264 |

one | one | 0.034999999999999996 |

## sum¶

The **sum** aggregate function executes an addition operation on the numbers in the reference column (the column to which it is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which **sum** is being applied, all rows are examined column-by-column in order to find identical rows. Note that attempts to apply **sum** to a text value in the reference column will result in a 0 for that row. The header of the reference column receives a name change to become *Sum of *.

**Examples**

The small dataset below will be used to show how **sum** operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

two | two | 7 |

1 | two | 4 |

**Example 1**

Applying the **sum** function to *Column C* reduces the row count from four to two. The value in the column *Sum of Column C* shows the sum of the *Column C* values in the duplicate rows that were collapsed during the operation: 5 + 6 + 4 = 15 while 7 + 0 = 7.

Column A | Column B | Column C |
---|---|---|

1 | two | 15 |

two | two | 7 |

**Example 2**

Applying the **sum** function to *Column A* results in no row count reduction since examination of *Column B* and *Column C* values in each row reveals that each is already unique. The value in the column *Sum of Column A* therefore displays each row with its original numeric value (in this case, a number 1) since the **sum** function found no duplicate rows that could be involved in an adding operation. The 0 replaces the text “two” because the **sum** operation cannot be applied to text values.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

1 | two | 4 |

0 | two | 7 |

## var¶

The **var** (variance) aggregate function estimates dispersion—how much the values are spread out)—within a sample set of data. This aggregate function calculates the variance of the numeric values in the reference column (the column to which it is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which **var** is being applied) before they are collapsed into un ique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the variance calculation. The header of the reference column receives a name change to become *Var of *.

If there are text values in the reference column, they will be ignored within the **var** calculation. Note also that the aggregate **var** function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

If the set under analysis represents all data points (referred to as a “population”) use of varp is recommended for a more accurate result. Related functions that deal with statistical variance are stdev and stdevp.

**Example**

The dataset below will be used to show how **var** operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

one | two | 0.2 |

one | two | 0.1 |

one | two | 1.1 |

one | two | 0.2 |

one | two | 0.6 |

one | one | 0.2 |

one | one | 0.27 |

one | two | 0.2 |

one | two | 0.4 |

As shown in the table below, applying the **var** function to *Column C* reduces the row count from nine to two. The value in the column *Var of Column C* shows the variance of the *Column C* sample data values in the duplicate rows that were collapsed during the operation.

Column A | Column B | Column C |
---|---|---|

one | two | 0.12333333333333334 |

one | one | 0.00245 |

## varp¶

The **varp** (variance for a population) aggregate function computes dispersion—how much the values are spread out—for an entire set of data (a population). This aggregate function calculates the variance for a population contained within the reference column (the column to which it is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which **varp** is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the variance calculation for the population. The header of the reference column receives a name change to become *VarP of *.

If there are text values in the reference column, they will be ignored within the **varp** calculation. Note also that the aggregate **varp** function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

If the set under analysis represents only a sample set of data, use of **var** is recommended for a more accurate result. Related functions that deal with statistical variance are stdev and stdevp.

**Example**

The dataset below will be used to show how **varp** operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

one | two | 0.2 |

one | two | 0.1 |

one | two | 1.1 |

one | two | 0.2 |

one | two | 0.6 |

one | one | 0.2 |

one | one | 0.27 |

one | two | 0.2 |

one | two | 0.4 |

As shown in the table below, applying the **varp** function to *Column C* reduces the row count from nine to two. The value in the column *VarP of Column C* shows the variance of the population contained in *Column C* among the duplicate rows that were collapsed during the operation.

Column A | Column B | Column C |
---|---|---|

one | two | 0.10571428571428572 |

one | one | 0.001225 |