Joining Data from the Same Source
You can join data from the same source if you want to perform a calculation on part of the data and join the transformed data with the original data.
When you join data from the same source, you can maintain the original data and transform parts of that data within one mapping. You can join data from the same source in the following ways:
- Join two branches of the same pipeline.
- Join two instances of the same source.
Joining Two Branches of the Same Pipeline
When you join data from the same source, you can create two branches of the pipeline.
When you branch a pipeline, you must add a transformation between the mapping input and the Joiner transformation in at least one branch of the pipeline. You must join sorted data and configure the Joiner transformation for sorted input.
For example, you have a source with the following ports:
- Employee
- Department
- Total Sales
In the target, you want to view the employees who generated sales that were greater than the average sales for their departments. To do this, you create a mapping with the following transformations:
- Sorter transformation. Sorts the data.
- Sorted Aggregator transformation. Averages the sales data and groups it by department. This aggregation discards the data for individual employees. To keep the employee data, pass one branch of the pipeline to the Aggregator transformation and pass a second branch with the same data directly to the Joiner transformation. When you join both branches of the pipeline, you join the aggregated data with the original data.
- Sorted Joiner transformation. Joins the sorted aggregated data with the original data.
- Filter transformation. Compares the average sales for each department against the sales for each employee and filters out employees whose sales are not above the department average.
The figure for this mapping identifies the following objects:
1. Employees_West source
2. Pipeline branch 1
3. Pipeline branch 2
4. Sorted Joiner transformation
5. Filter transformation that removes employees whose sales are not above the department average
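The logic of this example mapping can be sketched outside the product. The following plain Python is only an illustration of the sort, aggregate, join, and filter steps, not Informatica code. The sample rows, the function name `above_department_average`, and the tuple layout are assumptions made for the sketch.

```python
from itertools import groupby
from operator import itemgetter
from statistics import mean

# Hypothetical sample rows: (employee, department, total_sales)
rows = [
    ("Ana", "East", 300),
    ("Bo", "West", 100),
    ("Cy", "East", 100),
    ("Di", "West", 500),
    ("Ed", "West", 300),
]


def above_department_average(rows):
    # Sorter step: sort once by department, before the pipeline branches.
    sorted_rows = sorted(rows, key=itemgetter(1))

    # Branch 1 (Aggregator step): one average per department.
    # Employee-level detail is lost in this branch.
    averages = [
        (dept, mean(r[2] for r in group))
        for dept, group in groupby(sorted_rows, key=itemgetter(1))
    ]

    # Branch 2 carries sorted_rows unchanged to the join.
    # Joiner step: merge join of two inputs that are both sorted by department.
    joined = []
    avg_iter = iter(averages)
    current = next(avg_iter, None)
    for emp, dept, sales in sorted_rows:
        while current is not None and current[0] < dept:
            current = next(avg_iter, None)
        if current is not None and current[0] == dept:
            joined.append((emp, dept, sales, current[1]))

    # Filter step: keep only employees whose sales exceed the average.
    return [(emp, dept, sales) for emp, dept, sales, avg in joined if sales > avg]


result = above_department_average(rows)
print(result)
```

Because both inputs to the join are sorted on the join key, the join can advance through them in a single pass, which is the reason the text requires sorted input when you join two branches.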
Joining two branches might decrease performance if the Joiner transformation receives data from one branch much later than the other branch. The Joiner transformation caches all the data from the first branch and writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk when it receives the data from the second branch.
Joining Two Instances of the Same Source
You can join data from the same source by creating a second instance of the source.
After you create the second source instance, you can join the pipelines from the two source instances. If you want to join unsorted data, you must create two instances of the same source and join the pipelines.
The following figure shows two instances of the same source joined with a Joiner transformation:
When you join two instances of the same source, the Data Integration Service reads the source data once for each source instance. As a result, performance can be slower than when you join two branches of a pipeline.
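The cost of the second read can be illustrated with a small sketch. The following plain Python is not Informatica code. `SOURCE`, `read_source`, and `read_count` are hypothetical names, and the lookup-based join stands in for an unsorted join under the assumption that one input fits in memory.

```python
from collections import defaultdict

# Hypothetical unsorted source rows: (employee, department, total_sales)
SOURCE = [
    ("Bo", "West", 100),
    ("Ana", "East", 300),
    ("Di", "West", 500),
    ("Cy", "East", 100),
]

read_count = 0


def read_source():
    """Hypothetical reader: each call stands for one full read of the source."""
    global read_count
    read_count += 1
    return list(SOURCE)


# Source instance 1: aggregate sales by department.
sales_by_dept = defaultdict(list)
for _, dept, sales in read_source():
    sales_by_dept[dept].append(sales)
averages = {dept: sum(v) / len(v) for dept, v in sales_by_dept.items()}

# Source instance 2: detail rows joined to the averages by department.
# The dictionary lookup does not depend on row order, so no sort is needed.
joined = [(emp, dept, sales, averages[dept]) for emp, dept, sales in read_source()]

print(read_count)  # the source was read once per instance
print(joined)
```

The sketch reads the source twice, once per instance, which mirrors the extra read described above. In exchange, neither input has to be sorted.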
Guidelines for Joining Data from the Same Source
Use the following guidelines when you decide whether to join branches of a pipeline or join two instances of a source:
- Join two branches of a pipeline when you have a large source or when you can read the source data only once.
- Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you use a Sorter transformation to sort the data, branch the pipeline after you sort the data.
- Join two instances of a source when you need to add a blocking transformation to the pipeline between the source and the Joiner transformation.
- Join two instances of a source if one pipeline might process data more slowly than the other pipeline.
- Join two instances of a source if you need to join unsorted data.