
AWS Glue

AWS Glue is an extract, transform, and load (ETL) service in the Amazon Web Services ecosystem that moves data between different data stores. Glue captures the metadata of multiple data stores that are part of the Amazon Web Services ecosystem.

Objects Extracted

The Glue resource extracts metadata from the following assets in a Glue data source:

•Databases
•Tables
•Columns
•Derived metadata from Glue Catalog. You can also view the lineage between Glue assets and derived assets.

Note: If the derived asset is a struct data type, the Catalog displays the asset as a field along with the existing struct hierarchy. For example, a person.json file includes first name, last name, phone, and address information. The address includes nested fields such as city, state, and pin code. Enterprise Data Catalog extracts the person asset as a field along with the nested fields.

•Jobs

You can extract Glue job metadata, but Enterprise Data Catalog does not parse that metadata for lineage.
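The note about struct data types can be pictured with a short Python sketch that flattens a nested record into field paths. The sample values, the dotted-path naming, and the function name flatten_struct are illustrative assumptions. They do not describe how Enterprise Data Catalog stores or names the fields internally.

```python
def flatten_struct(record, prefix=""):
    """Return dotted paths for every field in a nested dict, parents included."""
    paths = []
    for name, value in record.items():
        path = f"{prefix}.{name}" if prefix else name
        paths.append(path)
        if isinstance(value, dict):
            # A nested struct contributes its own fields under the parent path.
            paths.extend(flatten_struct(value, path))
    return paths


# Hypothetical contents of person.json, based on the fields named in the note.
person = {
    "first_name": "Ana",
    "last_name": "Silva",
    "phone": "555-0100",
    "address": {"city": "Pune", "state": "MH", "pin_code": "411001"},
}

for field in flatten_struct(person, "person"):
    print(field)
```

The address struct appears once as a field in its own right and again as the parent of city, state, and pin_code, which mirrors the hierarchy that the note describes.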

Permissions to Configure the Resource

To access Glue, make sure that you perform one of the following steps before you configure the resource:

•Attach the AWSGlueServiceRole managed policy to the Amazon Web Services IAM user.
•Create a custom policy with the following permissions to the Glue service, and then assign the custom policy to an Amazon Web Services IAM user:

- GetConnection
- GetConnections
- GetDatabase
- GetDatabases
- GetJob
- GetJobs
- GetPartition
- GetPartitions
- GetTable
- GetTables
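As a minimal sketch, the custom policy might be generated as an IAM policy document with the Python script below. The glue: action prefix and the 2012-10-17 policy version follow the standard IAM policy format. The statement ID, the function name, and the "*" resource scope are assumptions made for illustration, so narrow the resource to specific catalog ARNs if your security rules require it.

```python
import json

# Glue permissions listed in this section.
GLUE_ACTIONS = [
    "GetConnection", "GetConnections",
    "GetDatabase", "GetDatabases",
    "GetJob", "GetJobs",
    "GetPartition", "GetPartitions",
    "GetTable", "GetTables",
]


def build_glue_scan_policy(resource="*"):
    """Build an IAM policy document that allows the listed read-only Glue actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EdcGlueScannerReadOnly",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": [f"glue:{action}" for action in GLUE_ACTIONS],
                "Resource": resource,  # assumption: all Glue resources
            }
        ],
    }


policy_json = json.dumps(build_glue_scan_policy(), indent=2)
print(policy_json)
```

You can paste the printed JSON into the policy editor of the IAM console when you create the custom policy, and then attach the policy to the IAM user.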

Connect to a Glue Data Source Enabled for SSL

To connect to a Glue data source enabled for SSL, perform the following steps:

1. Download the Glue SSL certificates using a web browser.

Note: Make sure that you import the Glue Trust Services certificate in the Certificates directory.

2. Copy the certificates to the <INFA_HOME>/services/shared/security/ directory.
3. Go to the <INFA_HOME>/source/java/jre/bin directory, and then run the following keytool command to import each copied certificate as a trusted certificate into the Informatica domain keystore:

keytool -import -file <INFA_HOME>/services/shared/security/<certificate>.cer -alias <alias name> -keystore <INFA_HOME>/services/shared/security/infa_truststore.jks -storepass <Informatica domain keystore password>
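Because the keytool command must run once for each certificate, a small helper can assemble the command lines. The following Python sketch only builds and prints the commands and does not run them. The installation path /opt/informatica, the certificate file names, and the aliases are hypothetical placeholders.

```python
import shlex
from pathlib import PurePosixPath


def keytool_import_command(infa_home, cert_file, alias, storepass):
    """Return the keytool import command from step 3 as an argument list."""
    security_dir = PurePosixPath(infa_home) / "services" / "shared" / "security"
    return [
        "keytool", "-import",
        "-file", str(security_dir / cert_file),
        "-alias", alias,
        "-keystore", str(security_dir / "infa_truststore.jks"),
        "-storepass", storepass,
    ]


# Hypothetical aliases and certificate files copied in step 2.
certificates = {
    "glue_root": "glue_root.cer",
    "glue_intermediate": "glue_intermediate.cer",
}

for alias, cert_file in certificates.items():
    command = keytool_import_command(
        "/opt/informatica", cert_file, alias, "<keystore password>"
    )
    # shlex.join quotes arguments that contain spaces or shell metacharacters.
    print(shlex.join(command))
```

Printing the commands first lets you review the paths and aliases before you run the commands from the keytool directory.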

Basic Information

The General tab includes the following basic information about the resource:

Information	Description
Name	The name of the resource.
Description	The description of the resource.
Resource type	The type of the resource.
Execute On	You can choose to execute on the default catalog server or offline.

Resource Connection Properties

The General tab includes the following properties:

Property	Description
Role-based Authentication	Option to use the Amazon Elastic Compute Cloud (Amazon EC2) instance profile credentials when Enterprise Data Catalog is installed on an Amazon EC2 instance.
AWS Access Key	Access Key of the Amazon Web Services account.
AWS Secret Key	Secret key of the Amazon Web Services account.
AWS Region	Amazon Web Services region from where you want to scan the Glue Catalog.

The following table describes the properties that you can configure in the Source Metadata section of the Metadata Load Settings tab:

Property	Description
Enable Source Metadata	Enables metadata extraction from the data source.
Database Filter	Filter that enables you to include or exclude databases in the resource run. You can also specify a regular expression that represents databases you want to include or exclude.
Table Filter	Filter that enables you to include or exclude specific tables in the resource run. You can enter a combination of regular expressions and wildcard characters to match the tables, or enter the table names directly. Use a semicolon to separate the wildcard patterns and table names.
Enable Reference Resources	Option to extract metadata about assets that are not included in this resource, but referred to in the resource. Examples include source and target tables in PowerCenter mappings, and source tables and files from Tableau reports.
Create Athena Resources	Indicates whether or not to create an Athena data source.
Retain Unresolved Reference Assets	Option to retain unresolved reference assets in the catalog after you assign connections. Retaining unresolved reference assets helps you view the complete lineage. The unresolved assets include deleted files, temporary tables, and other assets that are not present in the primary resource.
Auto Assign Connections	Indicates whether the connections must be assigned automatically.
Memory	The memory value required to run a scanner job. Specify one of the following memory values: Low, Medium, or High. Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Doc Portal.
Custom Options	JVM parameters that you can set to configure the scanner container. Use the following arguments to configure the parameters:
- -Dscannerloglevel=<DEBUG/INFO/ERROR>. Changes the log level of the scanner to a value such as DEBUG, INFO, or ERROR. The default value is INFO.
- -Dscanner.container.core=<No. of core>. Increases the number of cores for the scanner container. The value must be a number.
- -Dscanner.yarn.app.environment=<key=value>. Key-value pairs that you need to set in the YARN environment. Use a comma to separate multiple key-value pairs.
- -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0>. Increases the scanner container memory when pmem is enabled. The default value is 1.
Track Data Source Changes	View metadata source change notifications in Enterprise Data Catalog.
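The Custom Options arguments described in the table can be combined into a single value. The following Python sketch shows one way to assemble that value. It assumes that multiple arguments are separated by spaces, which this section does not state, and the function name and sample values are hypothetical. The log level check covers only the three values that the table names.

```python
def build_custom_options(log_level="INFO", cores=None, yarn_env=None, pmem_ratio=None):
    """Assemble a Custom Options string from the documented -D arguments."""
    # Only the three levels named in the table are accepted in this sketch.
    if log_level not in {"DEBUG", "INFO", "ERROR"}:
        raise ValueError(f"Unsupported log level: {log_level}")
    options = [f"-Dscannerloglevel={log_level}"]

    if cores is not None:
        # The table states that the core value must be a number.
        if not isinstance(cores, int) or cores < 1:
            raise ValueError("cores must be a positive integer")
        options.append(f"-Dscanner.container.core={cores}")

    if yarn_env:
        # Multiple key-value pairs are separated by commas.
        pairs = ",".join(f"{key}={value}" for key, value in yarn_env.items())
        options.append(f"-Dscanner.yarn.app.environment={pairs}")

    if pmem_ratio is not None:
        options.append(
            f"-Dscanner.pmem.enabled.container.memory.jvm.memory.ratio={pmem_ratio}"
        )

    # Assumption: arguments are separated by spaces in the Custom Options field.
    return " ".join(options)


print(build_custom_options("DEBUG", cores=2, yarn_env={"TZ": "UTC"}, pmem_ratio=2.0))
```

Validating the values before you paste them into the resource configuration catches typing errors, such as a non-numeric core count, before the scanner job starts.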