Select source and target catalog sources and schemas to link and generate lineage.
Generate automated lineage with CLAIRE or define rules to use name-based matching or construct an inclusion rule with expressions. Save and run the configuration to start a lineage generation job.
Step 1. Register general information
Provide general information about the configuration on the Registration tab.
1. In Metadata Command Center, go to the Configure page.
2. Select the Lineage tab and then select the Link Catalog Sources tab.
3. Click the Add icon.
The Registration tab of the Link Catalog Sources page appears.
4. In the General Information area, enter a name and an optional description for the configuration.
5. Click Next.
The Configuration tab appears.
Step 2. Configure source and target catalog sources
Select source and target catalog sources on the Configuration tab.
1. In the Source Catalog Source area of the Configuration tab, select a source catalog source from which you want to link and generate lineage.
The Select Catalog Source dialog box appears.
The following image shows the Select Catalog Source dialog box:
2. Choose a source catalog source and click Select.
The overview and related assets of the catalog source appear on the preview pane.
You can filter the list based on the catalog source type and name.
The following image shows a selected source catalog source on the Select Catalog Source dialog box:
3. Choose one of the following options from which you want to link and generate lineage:
- Root Directory. Select a root directory of the source catalog source. This field appears if you selected a file system-based catalog source.
- Schema. Select a schema of the source catalog source. This field appears if you selected a relational database-based catalog source.
The following image shows a selected schema of the source catalog source on the Select Schema dialog box:
4. Optional. In the Filters area, define one or more filters to apply.
If you selected a relational database-based catalog source, perform the following steps:
a. From the Include or Exclude metadata list, choose to include or exclude metadata based on the filter parameters.
b. From the Object type list, select All, Tables, or Views.
c. Enter a value to specify the object location.
Filters can contain the following wildcards:
▪ Question mark. Represents a single character.
▪ Asterisk. Represents multiple characters.
For object hierarchies, use a dot as a separator. Enclose filter values in double quotes if you use a space or a dot in a single segment.
The following image shows the filter condition options:
For example:
▪ To include or exclude metadata from all tables with names that begin with 'Table', select Tables as the object type and enter Table* in the value field.
▪ To include or exclude metadata from all columns in the Table1 table, select Tables as the object type and enter Table1.* in the value field.
▪ To include or exclude all object types from all tables with names that begin with 'Table' followed by a single character, select All as the object type and enter Table? in the value field.
▪ To include or exclude metadata from the 'Table with space' table, select Tables as the object type and enter "Table with space" in the value field.
d. To define an additional filter with an OR condition, click the Add icon.
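The wildcard and separator rules for relational filters can be sketched in Python as follows. This only illustrates the rules described above and is not how Metadata Command Center evaluates filters. The function names split_segments, segment_to_regex, and matches are hypothetical, and the sketch assumes that an asterisk matches zero or more characters, that matching is case-sensitive, and that wildcards keep their meaning inside double quotes.

```python
import re


def split_segments(value: str) -> list:
    """Split a filter value on dots, keeping double-quoted segments whole."""
    segments, current, in_quotes = [], [], False
    for ch in value:
        if ch == '"':
            in_quotes = not in_quotes       # quotes delimit a single segment
        elif ch == "." and not in_quotes:
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    segments.append("".join(current))
    return segments


def segment_to_regex(segment: str) -> str:
    """Translate the ? and * wildcards of one segment into a regex."""
    parts = []
    for ch in segment:
        if ch == "?":
            parts.append(".")               # exactly one character
        elif ch == "*":
            parts.append(".*")              # assumption: zero or more characters
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def matches(filter_value: str, object_path: list) -> bool:
    """Check a filter value against an object hierarchy such as [table, column]."""
    segments = split_segments(filter_value)
    if len(segments) != len(object_path):
        return False
    return all(
        re.fullmatch(segment_to_regex(seg), name)
        for seg, name in zip(segments, object_path)
    )
```

With these assumptions, matches("Table1.*", ["Table1", "id"]) holds for every column of Table1, while matches("Table?", ["Table12"]) does not, because the question mark stands for exactly one character.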
If you selected a file system-based catalog source, perform the following steps:
a. From the Include or Exclude metadata list, choose to include or exclude metadata based on the filter parameters.
b. Enter a value to specify the object location.
Filters are case-insensitive.
Filters can contain an asterisk as a wildcard to represent multiple characters.
Use the following rules when you enter filter values:
▪ Use an asterisk as a path placeholder as shown in the following example: folder1/*/folder3. Here, the asterisk stands for any folder directly under folder1.
▪ Use two asterisks in the path filter to indicate zero or more levels of folders. The pattern with two asterisks is recursive. The processing time is longer as the data volume increases.
▪ For path hierarchies, use '/' as a separator. You can provide a path in the folder and path filters but not in the file filter.
▪ To include or exclude metadata from a file, specify only the file name.
The following image shows the filter condition options:
Path filters apply to the files and folders in the path that you filter. The path filter is non-recursive. If you provide only the file or folder names, the path filters apply on the first level files or directories.
For example:
▪ To include or exclude metadata from files and folders with names that start with 'Item1' in the first level directory, enter Item1* in the value field.
▪ To include or exclude metadata from the 'File1' file in the 'Folder1' folder, enter Folder1/File1 in the value field.
▪ To include or exclude metadata from files or folders with names that contain the word 'Subfolder' in the 'Folder1' folder, enter Folder1/*Subfolder* in the value field.
▪ To include or exclude metadata from files or folders with the name 'File1' in any subfolder of the 'Folder1' folder, enter Folder1/*/File1 in the value field.
▪ To include or exclude metadata from all files and subfolders in the 'Folder1' folder, enter Folder1/* in the value field.
▪ To include or exclude metadata from files or folders with the name 'File1' located at any level in the 'Folder1' folder, enter Folder1/**/File1 in the value field. This is a recursive search, and therefore the processing time can be longer.
c. To define an additional filter with an OR condition, click the Add icon.
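The path filter rules for file system-based catalog sources can be approximated with a small Python sketch. It is an illustration under stated assumptions, not the product's matching logic: the function names path_filter_to_regex and path_matches are hypothetical, a single asterisk is assumed to match zero or more characters within one path level, and a trailing pair of asterisks is assumed to match everything below that folder.

```python
import re


def path_filter_to_regex(pattern: str):
    """Compile a path filter into a case-insensitive regular expression."""
    parts = pattern.split("/")
    regex = ""
    for index, part in enumerate(parts):
        is_last = index == len(parts) - 1
        if part == "**":
            # Two asterisks: zero or more folder levels (recursive).
            regex += ".*" if is_last else "(?:[^/]+/)*"
            continue
        # One asterisk: any run of characters that stays inside one level.
        regex += "".join("[^/]*" if ch == "*" else re.escape(ch) for ch in part)
        if not is_last:
            regex += "/"
    return re.compile(regex, re.IGNORECASE)   # filters are case-insensitive


def path_matches(pattern: str, path: str) -> bool:
    """Return True if the whole path satisfies the filter."""
    return path_filter_to_regex(pattern).fullmatch(path) is not None
```

For instance, path_matches("Folder1/*/File1", "Folder1/A/File1") holds, but the same filter rejects "Folder1/A/B/File1" because a single asterisk covers only one level. The filter "Folder1/**/File1" accepts both, as well as "Folder1/File1".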
Note: If you add a filter that includes metadata from all objects, or if you don't add a filter, Metadata Command Center generates additional lineage for a few objects. These objects might include parameter containers, result sets, stages, and other objects that belong to the core.DataSet super class within the metadata model.
5. In the Target Catalog Source area, select a target catalog source and schema or root directory to which you want to link and generate lineage. Optionally, you can add a filter.
6. Click Next.
The Linking Method tab appears.
Step 3. Perform rule-based or automated linking, save, and run the configuration
Generate automated lineage with CLAIRE or define rules to use name-based matching or construct an inclusion rule with expressions on the Linking Method tab.
1. On the Linking Method tab, choose to either generate automated lineage with CLAIRE or define rules to generate catalog source links between assets of the source and target catalog sources.
2. To refresh catalog source links whenever the source or target catalog source job is run, click Refresh Lineage.
3. Choose one of the following linking methods:
- Rule-based Linking. Define rules to use name-based matching or construct an inclusion rule with expressions.
- Automated Linking. Generate lineage automatically with CLAIRE.
Note: This linking method is not applicable for file system-based source or target catalog sources.
4. If you choose the Automated Linking option, you can either automatically accept CLAIRE-generated lineage recommendations or manually accept them.
The following table describes the properties that you can enter for automated linking:
Property | Description
--- | ---
Enable auto-acceptance | Select to automatically accept CLAIRE-generated lineage recommendations. If disabled, you must manually accept the lineage recommendations.
Confidence Score Threshold for Auto-Acceptance | If you enable auto-acceptance, specify a threshold limit based on which the CLAIRE-generated lineage recommendations are automatically accepted. Specify a percentage from 80 to 100. If the confidence score of the catalog source links generated between a source and target asset is higher than the configured threshold limit, the recommended links are automatically accepted. Default is 95%.
Stakeholders of the source and target catalog sources can reject the auto-accepted and manually accepted catalog source links generated by CLAIRE in Data Governance and Catalog.
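The auto-acceptance behavior described above can be expressed as a short Python sketch. The function name should_auto_accept and its parameters are hypothetical. The strict comparison follows the wording "higher than the configured threshold limit", and whether a score exactly equal to the threshold is accepted in the product is not stated here.

```python
def should_auto_accept(confidence: float, enabled: bool, threshold: float = 95.0) -> bool:
    """Decide whether a recommended link would be accepted automatically.

    confidence and threshold are percentages; 95 is the documented default.
    """
    if not 80 <= threshold <= 100:
        raise ValueError("threshold must be a percentage from 80 to 100")
    if not enabled:
        return False          # recommendations must be accepted manually
    return confidence > threshold
```

Under this reading, a recommendation with a confidence score of 97 is accepted with the default threshold, while one with a score of 90 waits for manual review.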
5. If you choose the Rule-based Linking option, choose the rule type.
- Name Matching. Ignores specified prefixes and suffixes of an asset name and matches the rest of the asset name to generate catalog source links.
- Expression. Constructs an inclusion rule using expressions. Use a combination of attributes, operators, functions, or comments to build an inclusion rule.
6. If you choose the Name Matching rule type, select the asset types to specify prefix and suffix strings to ignore.
The following table describes the properties that you can enter for name matching:
Property | Description
--- | ---
Source Data Set - Ignore Prefix | Specify the prefix of source data set names to ignore and match the rest of the source data set names with target data set names.
Source Data Set - Ignore Suffix | Specify the suffix of source data set names to ignore and match the rest of the source data set names with target data set names.
Target Data Set - Ignore Prefix | Specify the prefix of target data set names to ignore and match the rest of the target data set names with source data set names.
Target Data Set - Ignore Suffix | Specify the suffix of target data set names to ignore and match the rest of the target data set names with source data set names.
Source Data Element - Ignore Prefix | Specify the prefix of source data element names to ignore and match the rest of the source data element names with target data element names.
Source Data Element - Ignore Suffix | Specify the suffix of source data element names to ignore and match the rest of the source data element names with target data element names.
Target Data Element - Ignore Prefix | Specify the prefix of target data element names to ignore and match the rest of the target data element names with source data element names.
Target Data Element - Ignore Suffix | Specify the suffix of target data element names to ignore and match the rest of the target data element names with source data element names.
Prefixes and suffixes that you specify can contain alphanumeric characters, underscore (_), and hyphen (-).
For example:
- To match the source data set, "STG_CUSTOMER", with the target data set, "CUSTOMER", specify "STG_" in the Ignore Prefix field for the source data set.
- To match the target data set, "TMP_ACCOUNT_STG", with the source data set, "ACCOUNT", specify "TMP_" in the Ignore Prefix and "_STG" in the Ignore Suffix fields for the target data set.
- To match the source data element, "CUSTOMER_LND", with the target data element, "CUSTOMER", specify "_LND" in the Ignore Suffix field for the source data element.
- To match the target data element, "TMP_CUSTOMER_LND", with the source data element, "CUSTOMER", specify "TMP_" in the Ignore Prefix and "_LND" in the Ignore Suffix fields for the target data element.
Note: If you don't select an asset type, you can't enter a prefix or suffix. In such cases, the lineage generation job searches for and matches exact source and target asset names.
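The name matching rule can be pictured with the following Python sketch, which strips the ignored prefix and suffix from each name and compares what remains. The helper names strip_affixes and names_match are hypothetical. The sketch assumes a case-sensitive comparison and ASCII alphanumeric characters, and the lineage generation job might behave differently on both points.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits.
_ALLOWED_AFFIX = re.compile(r"[A-Za-z0-9_-]*")


def strip_affixes(name: str, prefix: str = "", suffix: str = "") -> str:
    """Remove an ignored prefix and suffix from an asset name, if present."""
    for affix in (prefix, suffix):
        if not _ALLOWED_AFFIX.fullmatch(affix):
            raise ValueError("affix may contain only alphanumerics, '_' and '-'")
    if prefix and name.startswith(prefix):
        name = name[len(prefix):]
    if suffix and name.endswith(suffix):
        name = name[:-len(suffix)]
    return name


def names_match(source: str, target: str,
                src_prefix: str = "", src_suffix: str = "",
                tgt_prefix: str = "", tgt_suffix: str = "") -> bool:
    """Compare source and target names after ignoring the configured affixes."""
    return (strip_affixes(source, src_prefix, src_suffix)
            == strip_affixes(target, tgt_prefix, tgt_suffix))
```

For example, names_match("STG_CUSTOMER", "CUSTOMER", src_prefix="STG_") holds. With no prefix or suffix supplied, the function reduces to an exact name comparison, which mirrors the case where no asset type is selected.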
7. If you choose the Expression rule type, construct an inclusion rule using expressions.
You can use a combination of attributes, operators, functions, and comments to define an inclusion rule. You can type your expressions directly and view autocompleted suggestions as you enter your expression in the editor. Expressions are created using a Spark SQL-based language. Expression values cannot exceed 5000 characters.
You can use the following components to construct an inclusion rule:
- Attributes: Attributes can be values that you obtain from the catalog. Values are case-sensitive.
You can use the following attributes:
▪ Source data set. For example: srcDataSet.name == 'Customer'
▪ Source data element. For example: srcDataElement.name == 'email'
▪ Source data set relative path.
For example: srcDataSet.relativePath == 'Folder1/Folder2/File.csv'
▪ Target data set. For example: tgtDataSet.name == 'employee'
▪ Target data element. For example: tgtDataElement.name == 'Age'
▪ Target data set relative path.
For example: tgtDataSet.relativePath == 'Folder1/Folder2/File.csv'
- Operators: Use operators to compare values of columns. For example, you can use an equality operator to check if the names of two columns are the same.
For example: tgtDataElement.name == ('MANAGER' || 'ID')
- Functions: Use functions to calculate values and manipulate data. For example, you can change a name to all uppercase or lowercase with the upper or lower functions.
Other supported functions include, but are not limited to:
▪ replace
▪ regexp_replace
▪ regexp_match
▪ substring
▪ length
For example: replace('EMPLoyee', 'oyee', 'OYEE')
- Comments: Use comments to summarize the constructed inclusion rule.
For example: /* source data element is changed to lowercase */
Example of a valid inclusion rule:
srcDataElement.name == tgtDataElement.name and srcDataSet.name == tgtDataSet.name
/* The source data element name must be the same as the target data element name, and the source data set name must be the same as the target data set name. */
Important: Construct expressions with both data sets and data elements to avoid generating unnecessary catalog source links.
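To see why an inclusion rule benefits from conditions on both data sets and data elements, consider this Python sketch. It models the rule as a predicate over every source and target column pair. The Column class and the functions element_only_rule, full_rule, and candidate_links are hypothetical and only mimic the example rule. They do not reproduce the expression language or the lineage generation job.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Column:
    data_set: str   # stands in for srcDataSet.name or tgtDataSet.name
    name: str       # stands in for srcDataElement.name or tgtDataElement.name


def element_only_rule(src: Column, tgt: Column) -> bool:
    """Rule that compares only data element names."""
    return src.name == tgt.name


def full_rule(src: Column, tgt: Column) -> bool:
    """Rule that compares data element names and data set names."""
    return src.name == tgt.name and src.data_set == tgt.data_set


def candidate_links(sources, targets, rule):
    """Return every source and target pair that satisfies the rule."""
    return [(s, t) for s, t in product(sources, targets) if rule(s, t)]


sources = [Column("Customer", "id"), Column("Order", "id")]
targets = [Column("Customer", "id"), Column("Order", "id")]

loose = candidate_links(sources, targets, element_only_rule)
strict = candidate_links(sources, targets, full_rule)
```

In this toy data, the element-only rule links every id column to every other id column, including Customer.id to Order.id, whereas the rule that also compares data set names keeps only the two links within matching data sets.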
8. Click Validate to validate your expression.
If the validation is successful, a success message appears.
9. To save and run the configuration, click Save and then Run.
A Lineage Generation job is created to link catalog sources and to generate catalog source links. Check the status of the job on the Monitor page.