Big Data Management User Guide > Stateful Computing on the Spark Engine > Windowing Examples
  

Windowing Examples

The examples in this section demonstrate how to use LEAD, LAG, and other aggregate functions as window functions in an Expression transformation.

Financial Plans Example

You are a banker with information about the financial plans of two of your customers. Each plan has an associated start date.
For each customer, you want to know the expiration date for the current plan based on the activation date of the next plan. The previous plan ends when a new plan starts, so the end date for the previous plan is the start date of the next plan minus one day.
The following table lists the customer codes, the associated plan codes, and the start date of each plan:
CustomerCode
PlanCode
StartDate
C1
00001
2014-10-01
C2
00002
2014-10-01
C2
00002
2014-11-01
C1
00004
2014-10-25
C1
00001
2014-09-01
C1
00003
2014-10-10
Define partition and order keys
You partition the data by customer code and order the data by ascending start date.
Property
Description
Order key
StartDate Ascending. Arranges the data chronologically by ascending start date.
Partition key
CustomerCode. Groups the rows according to customer code so that calculations are based on individual customers.
Frame
Not specified. Window functions access rows based on the offset argument and ignore the frame specification.
On the Windowing tab, StartDate is selected as the order key in ascending order. CustomerCode is selected as the partition key. The frame specification is not configured.
The following table lists the data grouped by customer code and ordered by start date:
CustomerCode
PlanCode
StartDate
C1
00001
2014-09-01
C1
00002
2014-10-01
C1
00003
2014-10-10
C1
00004
2014-10-25
C2
00001
2014-10-01
C2
00002
2014-11-01
The start dates for each customer are arranged in ascending order so that the dates are chronological.
Define a window function
You define a LEAD function to access the subsequent row for every input.
You define the following function on the Ports tab of the Expression transformation:
LEAD ( StartDate, 1, '01-Jan-2100' )
Where:
Define an ADD_TO_DATE function
You use an ADD_TO_DATE function to subtract one day from the date you accessed.
You define the following expression on the Ports tab of the Expression transformation:
ADD_TO_DATE ( LEAD ( StartDate, 1, '01-Jan-2100' ), 'DD', -1, )
By subtracting one day from the start date of the next plan, you find the end date of the current plan.
The following table lists the end dates of each plan:
CustomerCode
PlanCode
StartDate
EndDate
C1
00001
2014-09-01
2014-09-30
C1
00002
2014-10-01
2014-10-09
C1
00003
2014-10-10
2014-10-24
C1
00004
2014-10-25
2099-12-31*
C2
00001
2014-10-01
2014-10-31
C2
00002
2014-11-01
2099-12-31*
*The LEAD function returned the default value because these plans have not yet ended. The rows were outside the partition, so the ADD_TO_DATE function subtracted one day from 01-Jan-2100, returning 2099-12-31.

GPS Pings Example

Your organization receives GPS pings from vehicles that include trip and event IDs and a time stamp. You want to calculate the time difference between each ping and flag the row as skipped if the time difference with the previous row is less than 60 seconds.
You order the events chronologically and partition the events by trip. You define a window function that accesses the event time from the previous row, and you use an ADD_TO_DATE function to calculate the time difference between the two events.

Windowing Properties

Property
Description
Order key
_event_id Ascending. Arranges the data chronologically by ascending event ID.
Partition key
trip_id. Groups the rows according to trip ID so calculations are based on events from the same trip.
Frame
Not specified. Window functions access rows based on the offset argument and ignore the frame specification.
On the Windowing tab, _event_id is selected as the order key and trip_id is selected as the partition key. The frame specification is not defined.

Window Function

You define the following LAG function to get the event time from the previous row:
LAG ( _event_time, 1, NULL )
Where:
You define the following DATE_DIFF function to calculate the length of time between the two dates:
DATE_DIFF ( _event_time, LAG ( _event_time, 1, NULL ), 'ss' )
You flag the row as skipped if the DATE_DIFF is less than 60 seconds, or if the _event_time is NULL:
IIF ( DATE_DIFF < 60 or ISNULL ( _event_time ), 'Skip', 'Valid' )
The LAG function, the DATE_DIFF function, and the IIF flag are defined on the Ports tab in the Expression transformation.

Output

The transformation produces the following outputs:
Trip ID
Event ID
Event Time
Time Difference
Flag
101
1
2017-05-03 12:00:00
NULL*
Skip
101
2
2017-05-03 12:00:34
34
Skip
101
3
2017-05-03 12:02:00
86
Valid
101
4
2017-05-03 12:02:23
23
Skip
102
1
2017-05-03 12:00:00
NULL*
Skip
102
2
2017-05-03 12:01:56
116
Valid
102
3
2017-05-03 12:02:00
4
Skip
102
4
2017-05-03 13:00:00
3480
Valid
103
1
2017-05-03 12:00:00
NULL*
Skip
103
2
2017-05-03 12:00:12
12
Skip
103
3
2017-05-03 12:01:12
60
Valid
*The rows preceding these rows are outside the bounds of the partition, so the transformation produced NULL values.

Aggregate Function as Window Function Example

You work for a human resources group and you want to compare each of your employees' salaries with the average salary in his or her department:
The following table lists the department names, the employee identification number, and the employee's salary:
Department
Employee
Salary
Development
11
5200
Development
7
4200
Development
9
4500
Development
8
6000
Development
10
5200
Personnel
5
3500
Personnel
2
3900
Sales
3
4800
Sales
1
5000
Sales
4
4800
You set an unbounded frame to include all employees in the calculation, and you define an aggregate function to calculate the difference between each employee's salary and the average salary in the department.

Windowing Properties

Property
Description
Order key
Salary Ascending. Arranges the data by increasing salary.
Partition key
Department. Groups the rows according to department.
Start offset
All Rows Preceding
End offset
All Rows Following
With an unbounded frame, the aggregate function includes all partition rows in the calculation.
For example, suppose the current row is the third row. The third row is in the "Development" partition, so the frame includes the third row in addition to all rows before and after the third row in the "Development" partition.
On the Windowing tab, ascending salary is selected as the order key and department is selected as the partition key. The frame is All Rows Preceding and All Rows Following.

Window Function

An aggregate function acts as a window function when you configure the transformation for windowing.
You define the following aggregate function to calculate the difference between each employee's salary and the average salary in his or her department:
Salary - AVG ( Salary ) = Salary_Diff

Output

The transformation produces the following salary differences:
Department
Employee
Salary
Salary_Diff
Development
11
5200
-820
Development
7
4200
-520
Development
9
4500
180
Development
8
6000
180
Development
10
5200
980
Personnel
5
3500
200
Personnel
2
3900
200
Sales
3
4800
-66
Sales
1
5000
-66
Sales
4
4800
134
You can identify which employees are making less or more than the average salary for his or her department. Based on this information, you can add other transformations to learn more about your data. For example, you might add a Rank transformation to produce a numerical rank for each employee within his or her department.