Transformations > Expression transformation > Window functions
  

Window functions

In advanced mode, you can use a window function to concisely express stateful computations. A window function takes a small subset of a larger data set for processing and analysis.
Window functions operate on a group of rows and calculate a return value for every input row.
Use window functions to perform the following tasks:
Before you define a window function, configure the following window properties on the Window tab:
Frame
Defines the rows that are included in the frame for the current input row, based on physical offsets from the position of the current input row.
You configure a frame if you use an aggregate function as a window function. The window functions LEAD and LAG reference individual rows and ignore the frame.
Partition Keys
Separates the input rows into different partitions.
If you do not define partition keys, all rows belong to a single partition.
Order Keys
Defines how rows in a partition are ordered.
The fields you choose determine the position of a row within a partition. The order key can be ascending or descending. If you do not define order keys, the rows have no particular order.
You cannot parameterize an expression that contains a window function. If the expression is parameterized, you cannot specify a window function in the mapping task.

Frame

The frame determines which rows are included in the calculation for the current input row based on the rows' relative position to the current row. Configure a frame if you use an aggregate function as a window function.
The start offset and end offset describe the number of rows that appear before and after the current input row. An offset of "0" represents the current input row. For example, a start offset of -3 and an end offset of 0 describe a frame including the current input row and the three rows before the current row.
The following image shows a frame with a start offset of -1 and an end offset of 1:
The Sports row is the current input row. The Arcade, Sports, and Adventure rows are part of the window and have corresponding revenues of 1000, 2000, and 3000.
For every input row, the function performs an aggregate operation on the rows inside the frame. If you configure an aggregate expression like SUM on the frame in the previous image, the expression calculates the sum of the values within the frame and returns a value of 6000 for the input row.
You can also specify a frame that does not include the current input row. For example, a start offset of 10 and an end offset of 15 describe a frame that includes six total rows, from the tenth to the fifteenth row after the current row.
Offsets of All Preceding Rows and All Following Rows represent the first row of the partition and the last row of the partition. For example, if the start offset is All Preceding Rows and the end offset is -1, the frame includes one row before the current row and all rows before that.
The following image shows a frame with a start offset of 0 and an end offset of All Following Rows:
The current input row is Country. All Following Rows include the following rows: Country, Ethnic, Pop, Rock, Classical, EDM, Hip Hop, and Punk.
If the frame offsets are outside the partition, the aggregate function ignores the frame. If the offsets of a frame are not within the partition or table, the aggregate function processes only the rows within the partition. The aggregate function lists the skipped rows in the log file. The function returns one of the following responses for the skipped rows: a default value of ERROR('transformation error'), NULL, or a predefined constant.
For example, you partition a table by seller ID and you order by quantity. You set the start offset to -3 and the end offset to 4.
The following image shows the partition and frame for the current input row:
The row with seller ID of 3 and quanity of 15 is the current input row. The rows with seller IDS of 3 and quantity of 10, 15, 20, and 30 are part of the window. Rows with seller IDs of 2,3, and 4 are part of the frame.
The frame includes eight total rows, but the calculation remains within the partition. If you define an AVG function with this frame, the function calculates the average of the quantities inside the partition and returns 18.75.
Consider the following rules and guidelines when you define a frame:

Partition and order keys

Configure partition and order keys to form groups of rows and define the order or sequence of rows within each partition.
Use the following keys to group and order the rows in a window:
Partition keys
Configure partition keys to define partition boundaries rather than performing the calculation across all rows.
If you do not specify partition keys, all the data is included in the same partition.
Order keys
Use order keys to determine how rows in a partition are ordered. Order keys define the position of a particular row in a partition.
You must also arrange the data in ascending or descending order. If you do not specify order keys, the rows in a partition are arranged randomly.
Consider the following rules and guidelines when you define window properties for partition and order keys:

Example

You are the owner of a coffee and tea shop. You want to calculate the best-selling and second best-selling coffee and tea products.
The following table lists the products, the corresponding product categories, and the revenue from each product:
Product
Category
Revenue
Espresso
Coffee
600
Black
Tea
550
Cappuccino
Coffee
500
Americano
Coffee
600
Oolong
Tea
250
Macchiato
Coffee
300
Green
Tea
450
White
Tea
650
You partition the data by category and order the data by descending revenue.
The following table shows the data grouped into two partitions according to category. Within each partition, the revenue is organized in descending order:
Product
Category
Revenue
Espresso
Coffee
600
Americano
Coffee
600
Cappuccino
Coffee
500
Macchiato
Coffee
300
White
Tea
650
Black
Tea
550
Green
Tea
450
Oolong
Tea
250
You can run the MAX function within each partition to determine that the two best-selling coffees are espresso and Americano, and the two best-selling teas are white and black.

Example: Use a window to calculate expiration dates

You are a banker with information about the financial plans of two of your customers. Each plan has an associated start date.
For each customer, you want to know the expiration date for the current plan based on the activation date of the next plan. The previous plan ends when a new plan starts, so the end date for the previous plan is the start date of the next plan minus one day.
The following table lists the customer codes, the associated plan codes, and the start date of each plan:
CustomerCode
PlanCode
StartDate
C1
00001
2014-10-01
C2
00002
2014-10-01
C2
00002
2014-11-01
C1
00004
2014-10-25
C1
00001
2014-09-01
C1
00003
2014-10-10
To calculate expiration dates, complete the following tasks:
  1. 1Define partition and order keys.
  2. You configure the following window properties to partition the data by customer code and order the data by ascending start date:
    Property
    Value
    Description
    Frame
    Not specified.
    The LEAD function will access rows based on the offset argument and ignore the frame.
    Partition key
    CustomerCode
    Groups the rows according to customer code so that calculations are based on individual customers.
    Order key
    StartDate Ascending
    Arranges the data chronologically by ascending start date.
    The following table lists the data grouped by customer code and ordered by start date:
    CustomerCode
    PlanCode
    StartDate
    C1
    00001
    2014-09-01
    C1
    00002
    2014-10-01
    C1
    00003
    2014-10-10
    C1
    00004
    2014-10-25
    C2
    00001
    2014-10-01
    C2
    00002
    2014-11-01
  3. 2Define a window function.
  4. You define a LEAD function to access the subsequent row for every input.
    You configure an expression field that performs the following calculation:
    LEAD ( StartDate, 1, '01-Jan-2100' )
    For more information about the LEAD function, see Function Reference.
  5. 3Define an ADD_TO_DATE function.
  6. You use an ADD_TO_DATE function to subtract one day from the date you accessed.
    You configure an expression field that performs the following calculation:
    ADD_TO_DATE ( LEAD ( StartDate, 1, '01-Jan-2100' ), 'DD', -1, )
    By subtracting one day from the start date of the next plan, you find the end date of the current plan.
    The following table lists the end dates of each plan:
    CustomerCode
    PlanCode
    StartDate
    EndDate
    C1
    00001
    2014-09-01
    2014-09-30
    C1
    00002
    2014-10-01
    2014-10-09
    C1
    00003
    2014-10-10
    2014-10-24
    C1
    00004
    2014-10-25
    2099-12-31*
    C2
    00001
    2014-10-01
    2014-10-31
    C2
    00002
    2014-11-01
    2099-12-31*
    *The LEAD function returned the default value because these plans have not yet ended. The rows were outside the partition, so the ADD_TO_DATE function subtracted one day from 01-Jan-2100, returning 2099-12-31.

Example: Use a window to flag GPS pings

Your organization receives GPS pings from vehicles that include trip and event IDs and a time stamp. You want to calculate the time difference between each ping and flag the row as skipped if the time difference with the previous row is less than 60 seconds.
You order the events chronologically and partition the events by trip. You define a window function that accesses the event time from the previous row, and you use an ADD_TO_DATE function to calculate the time difference between the two events.

Window properties

Property
Value
Description
Frame
Not specified
Window functions access rows based on the offset argument and ignore the frame.
Partition key
trip_id
Groups the rows according to trip ID so that calculations are based on events from the same trip.
Order key
_event_id Ascending
Arranges the data chronologically by ascending event ID.

Window function

You define the following LAG function to get the event time from the previous row:
LAG ( _event_time, 1, NULL )
For more information about the LAG function, see Function Reference.
You define the following DATE_DIFF function to calculate the length of time between the two dates:
DATE_DIFF ( _event_time, LAG ( _event_time, 1, NULL ), 'ss' )
You flag the row as skipped if the DATE_DIFF is less than 60 seconds, or if the _event_time is NULL:
IIF ( DATE_DIFF < 60 or ISNULL ( _event_time ), 'Skip', 'Valid' )

Output

The transformation produces the following outputs:
Trip ID
Event ID
Event Time
Time Difference
Flag
101
1
2017-05-03 12:00:00
NULL*
Skip
101
2
2017-05-03 12:00:34
34
Skip
101
3
2017-05-03 12:02:00
86
Valid
101
4
2017-05-03 12:02:23
23
Skip
102
1
2017-05-03 12:00:00
NULL*
Skip
102
2
2017-05-03 12:01:56
116
Valid
102
3
2017-05-03 12:02:00
4
Skip
102
4
2017-05-03 13:00:00
3480
Valid
103
1
2017-05-03 12:00:00
NULL*
Skip
103
2
2017-05-03 12:00:12
12
Skip
103
3
2017-05-03 12:01:12
60
Valid
*The rows preceding these rows are outside the bounds of the partition, so the LAG function produces NULL values.

Example: Run an aggregate function on a window

You want to compare the salary of each employee with the average salary for the corresponding department.
The following table lists the department names, the employee identification number, and the employee salary:
Department
Employee
Salary
Development
11
5200
Development
7
4200
Development
9
4500
Development
8
6000
Development
10
5200
Personnel
5
3500
Personnel
2
3900
Sales
3
4800
Sales
1
5000
Sales
4
4800
You set an unbounded frame to include all employees in the calculation, and you define an aggregate function to calculate the difference between the salary of each employee and the average salary in the department.

Window Properties

Property
Value
Description
Start offset
All Preceding Rows
Describes the number of rows that appear before the current input row.
End offset
All Following Rows
Describes the number of rows that appear after the current input row.
Order key
Salary Ascending
Arranges the data by increasing salary.
Partition key
Department
Groups the rows according to department.
When you select All Preceding Rows and All Following Rows, the function includes all partition rows. For example, suppose the current row is the third row. The third row is in the "Development" partition, so the frame includes the third row in addition to all rows before and after the third row in the "Development" partition.

Window Function

Because you configure window properties in the Expression transformation, you can use an aggregate function as a window function.
You define the following aggregate function to calculate the difference between the salary of each employee and the average salary in the corresponding department:
Salary - AVG ( Salary ) = Salary_Diff

Output

The transformation produces the following salary differences:
Department
Employee
Salary
Salary_Diff
Development
11
5200
-820
Development
7
4200
-520
Development
9
4500
180
Development
8
6000
180
Development
10
5200
980
Personnel
5
3500
200
Personnel
2
3900
200
Sales
3
4800
-66
Sales
1
5000
-66
Sales
4
4800
134
You can identify which employees are making less or more than the average salary within the same department. Based on this information, you can add other transformations to learn more about the data. For example, you might add a Rank transformation to produce a numerical rank for each employee within the same department.