When you configure the Apache Atlas catalog source, you define the settings for the metadata extraction capability.
The metadata extraction capability extracts source metadata from external source systems. You can also configure other capabilities that the catalog source includes.
You can save the catalog source configuration at any point after you enter the connection information. After you save the catalog source, you can choose to run the catalog source job. To run the job once, click Run. To run metadata extraction and other capabilities on a recurring schedule, configure schedules on the Schedule tab.
Configure metadata extraction
When you configure the Apache Atlas catalog source, you choose a runtime environment, define filters, and enter configuration parameters for metadata extraction.
1. In the Connection and Runtime area, choose a serverless runtime environment or the Secure Agent group where you want to run catalog source jobs.
Note: Serverless runtime environment options are available if the catalog source works with a serverless runtime environment.
2. From the Metadata Change Option list, choose whether to retain, delete, or deprecate catalog objects that are deleted from the source system.
- Retain. Retains in the catalog the objects that are deleted from the source system. If you update or add a filter, the catalog retains objects extracted from the previous job and extracts additional objects that match the current filter. Objects deleted from the source system are not deleted from the catalog. Enrichments added on deleted objects and relationships are retained.
- Delete. Deletes metadata from the catalog based on objects deleted from the source system and changes you make to the filter. Enrichments added on deleted objects and relationships are also permanently lost. Objects renamed in the source system are removed and recreated in the catalog.
- Deprecate. The lifecycle of objects imported into the catalog moves to Obsolete based on objects deleted from the source system and changes you make to the filter. This does not impact enrichments added on deprecated objects and relationships. Objects renamed in the source system are removed and recreated in the catalog. When you run the catalog source job again for other capabilities such as data classification, relationship discovery, or glossary association, the job doesn't consider obsolete objects. Obsolete objects remain in the catalog until they are purged when you run a Purge Obsolete Objects job on the Explore page.
Note: You can also change the configured metadata change option when you run a catalog source.
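The three metadata change options can be compared with a small model. The following is a minimal sketch, not the product's implementation: the function name, the dictionary representation of the catalog, and the "Active" lifecycle label are hypothetical, and enrichments, renamed objects, and filter changes are not modeled. Only the "Obsolete" lifecycle value comes from the description above.

```python
from enum import Enum


class ChangeOption(Enum):
    RETAIN = "Retain"
    DELETE = "Delete"
    DEPRECATE = "Deprecate"


def apply_change_option(catalog, extracted, option):
    """Return the catalog state after a job run (illustrative model only).

    catalog   -- dict of object name -> lifecycle label before the run
    extracted -- set of object names that the current run found in the source
    option    -- the ChangeOption chosen for the run
    """
    # Objects that still exist in the source keep their lifecycle;
    # newly extracted objects get the hypothetical "Active" label.
    result = {name: catalog.get(name, "Active") for name in extracted}
    for name, lifecycle in catalog.items():
        if name in extracted:
            continue
        if option is ChangeOption.RETAIN:
            result[name] = lifecycle      # stays in the catalog unchanged
        elif option is ChangeOption.DEPRECATE:
            result[name] = "Obsolete"     # stays until it is purged
        # ChangeOption.DELETE: the object is dropped from the catalog
    return result


before = {"HR.EMP": "Active", "HR.SAL": "Active"}
found = {"HR.EMP"}  # HR.SAL was deleted from the source system
print(apply_change_option(before, found, ChangeOption.DEPRECATE))
```

In this model, HR.SAL remains in the catalog with Retain and Deprecate and is absent with Delete.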
3. In the Filters area, define one or more filter conditions to apply for metadata extraction:
a. From the Include or Exclude metadata list, choose to include or exclude metadata based on the filter parameters.
b. From the Object type list, select Hive Database, HDFS Path, or HBase Namespace.
c. Enter the filter values.
Filters can contain the following wildcards:
▪ Question mark. Represents a single character.
▪ Asterisk. Represents multiple characters or empty text.
The following image shows the filter condition options:
d. To define an additional filter with an OR condition, click the Add icon.
The following image shows that the filter includes metadata related to Hive tables in the HR database with names that start with EMP followed by a single character, includes metadata related to the table named HbaseTable located in the HbaseNS namespace, and excludes metadata related to all files in the hdfsfolder1 folder and its subfolders:
Exclude filter conditions are considered if the assets in the include filter conditions are not related or linked through lineage to the excluded assets. For example, add a filter condition to include metadata related to all tables with the name EMP across all databases (*.EMP) and then add another filter condition to exclude metadata related to the EMP table located in the HR database (HR.EMP). Here, the exclude filter condition is considered because the assets are not related or linked through lineage.
Exclude filter conditions are not considered if the assets in the include filter conditions are related or linked through lineage to the excluded assets. For example, add a filter condition to include metadata related to the EMP table in the HR database (HR.EMP) and then add another filter condition to exclude metadata related to the SAL table in the same HR database (HR.SAL). Here, the exclude filter condition is not considered because lineage links exist between the EMP and SAL tables.
If you add a filter condition to include metadata from a table deleted from the Apache Atlas source system, Metadata Command Center ignores the filter condition.
If the value of the HDFS Path filter contains special characters, replace the special characters with an asterisk wildcard character. For example, replace /Test$~^!()*<>_Folder with /Test*Folder.
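The wildcard rules for filter values can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the matching logic that Metadata Command Center uses: the helper names are hypothetical, matching is assumed to be case-sensitive and anchored to the whole value, and the lineage-based handling of exclude conditions is not modeled.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a filter value into a compiled regular expression.

    '?' stands for exactly one character and '*' for any run of
    characters, including empty text. Every other character is literal.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, value):
    """Return True if the whole value matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None


# Tables in HR whose names are EMP followed by a single character.
print(matches("HR.EMP?", "HR.EMP1"))
# A path with special characters, covered by an asterisk as described above.
print(matches("/Test*Folder", "/Test$~^!()*<>_Folder"))
```

Because the question mark requires exactly one character, HR.EMP? matches HR.EMP1 but not HR.EMP or HR.EMP12 in this sketch.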
4. In the Configuration Parameters area, enter configuration properties.
Note: Click Show Advanced to view all configuration parameters.
The following table describes the properties that you can enter:
| Property | Description |
|---|---|
| Lineage Direction | The direction of data flow between assets that you extract from Apache Atlas with the direction parameter of the LineageRESTAPI. Select one of the following options: BOTH extracts both input and output data flow between assets. INPUT extracts only input data flow between assets. OUTPUT extracts only output data flow between assets. |
| Lineage Depth | The number of lineage hops to extract from Apache Atlas for filtered assets with the depth parameter of the LineageRESTAPI. Default is 3. |
| Page Result Limit | Advanced parameter. The maximum number of search result entries per page from a fetch using the limit parameter of the DiscoveryRESTAPI. Default is 1000. |
| Entity Bulk Fetch Count | Advanced parameter. The maximum number of entities to include in a bulk fetch when you use the BulkEntityRESTAPI. Default is 100. |
| Connection Timeout | Advanced parameter. The maximum amount of time, in milliseconds, that the Secure Agent waits to set up an HTTP connection and get a response from the Apache Atlas server. Default is -1, which disables the timeout. |
| Parallel Lineage Fetch Count | Advanced parameter. The maximum number of LineageRESTAPI calls that can run simultaneously to retrieve lineage data. Default is 5. |
5. Optional. In the Configuration Parameters area, enter additional settings.
Note: The Additional Settings section appears when you click Show Advanced.
The following table describes the property that you enter for additional settings:
| Property | Description |
|---|---|
| Expert Parameters | Enter additional configuration options to be passed at runtime. Required if you need to troubleshoot the catalog source job. |
Caution: Use expert parameters only when Informatica Global Customer Support recommends them.
6. To configure additional capabilities for the catalog source, click the corresponding tabs.
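To make the Lineage Direction, Lineage Depth, and Entity Bulk Fetch Count parameters more concrete, the following sketch shows how such values could map onto Apache Atlas REST requests. It assumes the Atlas v2 lineage endpoint (/api/atlas/v2/lineage/{guid}) with its direction and depth query parameters. The host name, the GUID, and the helper functions are hypothetical, and the way the catalog source job actually issues its calls is not described in this document.

```python
from urllib.parse import urlencode

VALID_DIRECTIONS = ("BOTH", "INPUT", "OUTPUT")


def lineage_url(base_url, guid, direction="BOTH", depth=3):
    """Build a lineage request URL for one entity.

    The default depth mirrors the documented default of 3 hops.
    base_url and guid are placeholders supplied by the caller.
    """
    if direction not in VALID_DIRECTIONS:
        raise ValueError(f"direction must be one of {VALID_DIRECTIONS}")
    query = urlencode({"direction": direction, "depth": depth})
    return f"{base_url.rstrip('/')}/api/atlas/v2/lineage/{guid}?{query}"


def bulk_batches(guids, bulk_fetch_count=100):
    """Split entity GUIDs into batches no larger than the bulk fetch count."""
    for start in range(0, len(guids), bulk_fetch_count):
        yield guids[start:start + bulk_fetch_count]


# Hypothetical Atlas host and entity GUID, for illustration only.
print(lineage_url("http://atlas.example.com:21000", "abc-123", "INPUT", 2))
print([len(batch) for batch in bulk_batches(list(range(250)))])
```

A larger Lineage Depth or the BOTH direction generally means more data per lineage call, so these values are worth adjusting together with Parallel Lineage Fetch Count.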
Configure lineage discovery
Enable the lineage discovery capability and use CLAIRE to build complete lineage by recommending endpoint catalog source objects to assign to reference catalog source connections.
1. Click the Lineage Discovery tab.
2. Select Enable Lineage Discovery.
3. In the Filters area, define one or more filter conditions to apply for lineage discovery.
To define filters, you can select catalog source types or asset groups, enter a catalog source name, or search a list of catalog sources.
a. Select Yes to view filter options.
The filter options appear.
b. From the Include/Exclude list, choose to include or exclude catalog sources for lineage discovery based on the filter parameters.
c. From the filter type list, select catalog source type, catalog source name, or asset group.
d. In the filter value field, select the required catalog source types, or click the Search button and select catalog sources or asset groups.
Filters can contain the asterisk wildcard to represent multiple characters or empty text.
Examples:
▪ To include or exclude all Oracle catalog sources, select Catalog Source Type as the filter type and select Oracle in the filter value field.
▪ To include or exclude the 'Oracle_Retail' catalog source, select Catalog Source Name as the filter type and search for the catalog source or enter Oracle_Retail in the filter value field.
▪ To include or exclude all catalog sources with names that start with 'Oracle', select Catalog Source Name as the filter type and search for the catalog source or enter Oracle* in the filter value field.
▪ To include or exclude all catalog sources with names that end with 'Retail', select Catalog Source Name as the filter type and search for the catalog source or enter *Retail in the filter value field.
▪ To include or exclude all catalog sources with names that contain 'Ret', select Catalog Source Name as the filter type and search for the catalog source or enter *Ret* in the filter value field.
▪ To include or exclude all catalog sources that are part of the 'Financial Group' asset group, select Asset Group as the filter type and search for Financial Group in the filter value field.
Note: You can't add more than one include or exclude filter for the same filter type.
e. Optionally, to define an additional filter with an AND condition, click the Add icon.
For more information about lineage discovery, see Lineage discovery.
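The lineage discovery filter rules can be modeled as follows. This is a simplified sketch, not the product's logic: all class and function names are hypothetical, each filter holds a single value, the asterisk wildcard is applied to catalog source names only, and the note about repeated filters is read as allowing at most one include and one exclude filter per filter type.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogSource:
    name: str
    source_type: str
    asset_groups: frozenset = frozenset()


@dataclass(frozen=True)
class DiscoveryFilter:
    include: bool      # True for Include, False for Exclude
    filter_type: str   # "type", "name", or "asset_group"
    value: str


def asterisk_match(pattern, text):
    """Match text against a pattern where '*' is any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, text, re.DOTALL) is not None


def hits(flt, source):
    """Return True if the catalog source matches the filter value."""
    if flt.filter_type == "type":
        return source.source_type == flt.value
    if flt.filter_type == "name":
        return asterisk_match(flt.value, source.name)
    if flt.filter_type == "asset_group":
        return flt.value in source.asset_groups
    raise ValueError(f"unknown filter type: {flt.filter_type}")


def validate(filters):
    """Reject a repeated include or exclude filter for the same filter type."""
    seen = set()
    for flt in filters:
        key = (flt.include, flt.filter_type)
        if key in seen:
            raise ValueError(f"duplicate filter for {key}")
        seen.add(key)


def selected(source, filters):
    """Combine the filters with AND: every condition must hold."""
    validate(filters)
    return all(hits(flt, source) == flt.include for flt in filters)


retail = CatalogSource("Oracle_Retail", "Oracle", frozenset({"Financial Group"}))
print(selected(retail, [DiscoveryFilter(True, "name", "Oracle*")]))
print(selected(retail, [DiscoveryFilter(True, "name", "*Ret*"),
                        DiscoveryFilter(False, "asset_group", "Financial Group")]))
```

In the second call, the name condition holds but the exclude condition on the asset group does not, so the AND combination leaves the catalog source out.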