Transformations > Expression transformation > Window functions

Window functions

In advanced mode, you can use a window function to concisely express stateful computations. A window function takes a small subset of a larger data set for processing and analysis.

Window functions operate on a group of rows and calculate a return value for every input row.

Use window functions to perform the following tasks:

•Retrieve data from upstream or downstream rows.
•Calculate a cumulative sum based on a group of rows.
•Calculate a cumulative average based on a group of rows.

Before you define a window function, configure the following window properties on the Window tab:

Frame
Partition Keys: Separates the input rows into different partitions.
Order Keys: Defines how rows in a partition are ordered.

You cannot parameterize an expression that contains a window function. If the expression is parameterized, you cannot specify a window function in the mapping task.

Frame

The frame determines which rows are included in the calculation for the current input row based on the rows' relative position to the current row. Configure a frame if you use an aggregate function as a window function.

The start offset and end offset describe the number of rows that appear before and after the current input row. An offset of "0" represents the current input row. For example, a start offset of -3 and an end offset of 0 describe a frame including the current input row and the three rows before the current row.

The following image shows a frame with a start offset of -1 and an end offset of 1:

The Sports row is the current input row. The Arcade, Sports, and Adventure rows are part of the window and have corresponding revenues of 1000, 2000, and 3000.

For every input row, the function performs an aggregate operation on the rows inside the frame. If you configure an aggregate expression like SUM on the frame in the previous image, the expression calculates the sum of the values within the frame and returns a value of 6000 for the input row.

You can also specify a frame that does not include the current input row. For example, a start offset of 10 and an end offset of 15 describe a frame that includes six total rows, from the tenth to the fifteenth row after the current row.

Offsets of All Preceding Rows and All Following Rows represent the first row of the partition and the last row of the partition. For example, if the start offset is All Preceding Rows and the end offset is -1, the frame includes one row before the current row and all rows before that.

The following image shows a frame with a start offset of 0 and an end offset of All Following Rows:

The current input row is Country. All Following Rows include the following rows: Country, Ethnic, Pop, Rock, Classical, EDM, Hip Hop, and Punk.

If the frame offsets are outside the partition, the aggregate function ignores the frame. If the offsets of a frame are not within the partition or table, the aggregate function processes only the rows within the partition. The aggregate function lists the skipped rows in the log file. The function returns one of the following responses for the skipped rows: a default value of ERROR('transformation error'), NULL, or a predefined constant.

For example, you partition a table by seller ID and you order by quantity. You set the start offset to -3 and the end offset to 4.

The following image shows the partition and frame for the current input row:

The row with seller ID of 3 and quanity of 15 is the current input row. The rows with seller IDS of 3 and quantity of 10, 15, 20, and 30 are part of the window. Rows with seller IDs of 2,3, and 4 are part of the frame.

The frame includes eight total rows, but the calculation remains within the partition. If you define an AVG function with this frame, the function calculates the average of the quantities inside the partition and returns 18.75.

Consider the following rules and guidelines when you define a frame:

•LEAD and LAG use the frame that you specify in the function arguments and ignore the frame that you configure on the Window tab.
•The start offset must be less than or equal to the end offset.

Partition and order keys

Configure partition and order keys to form groups of rows and define the order or sequence of rows within each partition.

Use the following keys to group and order the rows in a window:

Partition keys: Configure partition keys to define partition boundaries rather than performing the calculation across all rows.
Order keys: Use order keys to determine how rows in a partition are ordered. Order keys define the position of a particular row in a partition.

Consider the following rules and guidelines when you define window properties for partition and order keys:

•You cannot use hierarchical fields as partition or order keys.
•Define unique fields as partition and order keys.

Example

You are the owner of a coffee and tea shop. You want to calculate the best-selling and second best-selling coffee and tea products.

The following table lists the products, the corresponding product categories, and the revenue from each product:

Product	Category	Revenue
Espresso	Coffee	600
Black	Tea	550
Cappuccino	Coffee	500
Americano	Coffee	600
Oolong	Tea	250
Macchiato	Coffee	300
Green	Tea	450
White	Tea	650

You partition the data by category and order the data by descending revenue.

The following table shows the data grouped into two partitions according to category. Within each partition, the revenue is organized in descending order:

Product	Category	Revenue
Espresso	Coffee	600
Americano	Coffee	600
Cappuccino	Coffee	500
Macchiato	Coffee	300
White	Tea	650
Black	Tea	550
Green	Tea	450
Oolong	Tea	250

You can run the MAX function within each partition to determine that the two best-selling coffees are espresso and Americano, and the two best-selling teas are white and black.

Example: Use a window to calculate expiration dates

You are a banker with information about the financial plans of two of your customers. Each plan has an associated start date.

For each customer, you want to know the expiration date for the current plan based on the activation date of the next plan. The previous plan ends when a new plan starts, so the end date for the previous plan is the start date of the next plan minus one day.

The following table lists the customer codes, the associated plan codes, and the start date of each plan:

CustomerCode	PlanCode	StartDate
C1	00001	2014-10-01
C2	00002	2014-10-01
C2	00002	2014-11-01
C1	00004	2014-10-25
C1	00001	2014-09-01
C1	00003	2014-10-10

To calculate expiration dates, complete the following tasks:

1Define partition and order keys.

You configure the following window properties to partition the data by customer code and order the data by ascending start date:

Property	Value	Description
Frame	Not specified.	The LEAD function will access rows based on the offset argument and ignore the frame.
Partition key	CustomerCode	Groups the rows according to customer code so that calculations are based on individual customers.
Order key	StartDate Ascending	Arranges the data chronologically by ascending start date.

The following table lists the data grouped by customer code and ordered by start date:

CustomerCode	PlanCode	StartDate
C1	00001	2014-09-01
C1	00002	2014-10-01
C1	00003	2014-10-10
C1	00004	2014-10-25
C2	00001	2014-10-01
C2	00002	2014-11-01

2Define a window function.

You define a LEAD function to access the subsequent row for every input.

You configure an expression field that performs the following calculation:

LEAD ( StartDate, 1, '01-Jan-2100' )

For more information about the LEAD function, see Function Reference.

3Define an ADD_TO_DATE function.

You use an ADD_TO_DATE function to subtract one day from the date you accessed.

You configure an expression field that performs the following calculation:

ADD_TO_DATE ( LEAD ( StartDate, 1, '01-Jan-2100' ), 'DD', -1, )

By subtracting one day from the start date of the next plan, you find the end date of the current plan.

The following table lists the end dates of each plan:

CustomerCode	PlanCode	StartDate	EndDate
C1	00001	2014-09-01	2014-09-30
C1	00002	2014-10-01	2014-10-09
C1	00003	2014-10-10	2014-10-24
C1	00004	2014-10-25	2099-12-31*
C2	00001	2014-10-01	2014-10-31
C2	00002	2014-11-01	2099-12-31*

*The LEAD function returned the default value because these plans have not yet ended. The rows were outside the partition, so the ADD_TO_DATE function subtracted one day from 01-Jan-2100, returning 2099-12-31.

Example: Use a window to flag GPS pings

Your organization receives GPS pings from vehicles that include trip and event IDs and a time stamp. You want to calculate the time difference between each ping and flag the row as skipped if the time difference with the previous row is less than 60 seconds.

You order the events chronologically and partition the events by trip. You define a window function that accesses the event time from the previous row, and you use an ADD_TO_DATE function to calculate the time difference between the two events.

Window properties

Property	Value	Description
Frame	Not specified	Window functions access rows based on the offset argument and ignore the frame.
Partition key	trip_id	Groups the rows according to trip ID so that calculations are based on events from the same trip.
Order key	_event_id Ascending	Arranges the data chronologically by ascending event ID.

Window function

You define the following LAG function to get the event time from the previous row:

LAG ( _event_time, 1, NULL )

For more information about the LAG function, see Function Reference.

You define the following DATE_DIFF function to calculate the length of time between the two dates:

DATE_DIFF ( _event_time, LAG ( _event_time, 1, NULL ), 'ss' )

You flag the row as skipped if the DATE_DIFF is less than 60 seconds, or if the _event_time is NULL:

IIF ( DATE_DIFF < 60 or ISNULL ( _event_time ), 'Skip', 'Valid' )

Output

The transformation produces the following outputs:

Trip ID	Event ID	Event Time	Time Difference	Flag
101	1	2017-05-03 12:00:00	NULL*	Skip
101	2	2017-05-03 12:00:34	34	Skip
101	3	2017-05-03 12:02:00	86	Valid
101	4	2017-05-03 12:02:23	23	Skip
102	1	2017-05-03 12:00:00	NULL*	Skip
102	2	2017-05-03 12:01:56	116	Valid
102	3	2017-05-03 12:02:00	4	Skip
102	4	2017-05-03 13:00:00	3480	Valid
103	1	2017-05-03 12:00:00	NULL*	Skip
103	2	2017-05-03 12:00:12	12	Skip
103	3	2017-05-03 12:01:12	60	Valid

*The rows preceding these rows are outside the bounds of the partition, so the LAG function produces NULL values.

Example: Run an aggregate function on a window

You want to compare the salary of each employee with the average salary for the corresponding department.

The following table lists the department names, the employee identification number, and the employee salary:

Department	Employee	Salary
Development	11	5200
Development	7	4200
Development	9	4500
Development	8	6000
Development	10	5200
Personnel	5	3500
Personnel	2	3900
Sales	3	4800
Sales	1	5000
Sales	4	4800

You set an unbounded frame to include all employees in the calculation, and you define an aggregate function to calculate the difference between the salary of each employee and the average salary in the department.

Window Properties

Property	Value	Description
Start offset	All Preceding Rows	Describes the number of rows that appear before the current input row.
End offset	All Following Rows	Describes the number of rows that appear after the current input row.
Order key	Salary Ascending	Arranges the data by increasing salary.
Partition key	Department	Groups the rows according to department.

When you select All Preceding Rows and All Following Rows, the function includes all partition rows. For example, suppose the current row is the third row. The third row is in the "Development" partition, so the frame includes the third row in addition to all rows before and after the third row in the "Development" partition.

Window Function

Because you configure window properties in the Expression transformation, you can use an aggregate function as a window function.

You define the following aggregate function to calculate the difference between the salary of each employee and the average salary in the corresponding department:

Salary - AVG ( Salary ) = Salary_Diff

Output

The transformation produces the following salary differences:

Department	Employee	Salary	Salary_Diff
Development	11	5200	-820
Development	7	4200	-520
Development	9	4500	180
Development	8	6000	180
Development	10	5200	980
Personnel	5	3500	200
Personnel	2	3900	200
Sales	3	4800	-66
Sales	1	5000	-66
Sales	4	4800	134

You can identify which employees are making less or more than the average salary within the same department. Based on this information, you can add other transformations to learn more about the data. For example, you might add a Rank transformation to produce a numerical rank for each employee within the same department.