AWS Glue
AWS Glue is an extract, transform, and load (ETL) service in the Amazon Web Services ecosystem that moves data between different data stores. Glue also captures the metadata of multiple data stores that are part of the Amazon Web Services ecosystem.
Objects Extracted
The Glue resource extracts metadata from the following assets in a Glue data source:
- Databases
- Tables
- Columns
- Derived metadata from Glue Catalog. You can also view the lineage between Glue assets and derived assets.
Note: If the derived asset is a struct data type, the Catalog displays the asset as a field along with the existing struct hierarchy. For example, a person.json file includes first name, last name, phone, and address information. The address includes nested fields such as city, state, and pin code. Enterprise Data Catalog extracts the person asset as a field along with the nested fields.
- Jobs
You can extract Glue job metadata, but Enterprise Data Catalog does not parse that metadata for lineage.
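The struct hierarchy described in the note can be pictured with a small sketch. The Python snippet below is illustrative only: the key names and sample values in `person` are hypothetical, and the dotted path notation is an assumption used to show nesting, not the format that Enterprise Data Catalog displays.

```python
# Hypothetical contents of person.json; key names and values are illustrative.
person = {
    "first_name": "Jane",
    "last_name": "Doe",
    "phone": "555-0100",
    "address": {
        "city": "Springfield",
        "state": "IL",
        "pin_code": "62701",
    },
}


def field_paths(struct, prefix=""):
    """Yield a dotted path for every field, descending into nested structs."""
    for key, child in struct.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        if isinstance(child, dict):
            # A nested struct contributes its own fields under the parent path.
            yield from field_paths(child, path)


if __name__ == "__main__":
    for path in field_paths(person, "person"):
        print(path)
```

In this sketch the address struct appears once as a field and again as the parent of city, state, and pin code, which mirrors the idea of a field shown along with its nested fields.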
Permissions to Configure the Resource
To access Glue, make sure that you perform one of the following steps before you configure the resource:
- Attach the AWSGlueServiceRole managed policy to the Amazon Web Services IAM user.
- Create a custom policy with the following permissions to the Glue service, and then assign the custom policy to an Amazon Web Services IAM user:
  - GetConnection
  - GetConnections
  - GetDatabase
  - GetDatabases
  - GetJob
  - GetJobs
  - GetPartition
  - GetPartitions
  - GetTable
  - GetTables
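The custom policy option can be sketched as an IAM policy document. The Python snippet below is a minimal sketch that builds such a document from the permissions listed above. The `glue:` prefix is the IAM namespace for Glue actions. The function name, the `Sid` value, and the wildcard `Resource` are assumptions for illustration, and you would likely narrow the resource to specific catalog ARNs.

```python
import json

# Permissions listed in this section; IAM expects them with the "glue:" prefix.
GLUE_ACTIONS = [
    "GetConnection", "GetConnections",
    "GetDatabase", "GetDatabases",
    "GetJob", "GetJobs",
    "GetPartition", "GetPartitions",
    "GetTable", "GetTables",
]


def build_glue_read_policy(resource="*"):
    """Return an IAM policy document that allows the Glue actions above.

    The Sid and the default wildcard resource are illustrative assumptions.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EdcGlueMetadataRead",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": [f"glue:{name}" for name in GLUE_ACTIONS],
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON that you could paste into the IAM policy editor.
    print(json.dumps(build_glue_read_policy(), indent=2))
```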
Connect to a Glue Data Source Enabled for SSL
To connect to a Glue data source enabled for SSL, perform the following steps:
1. Download the Glue SSL certificates using a web browser.
Note: Make sure that you import the Glue Trust Services certificate in the Certificates directory.
2. Copy the certificates to the <INFA_HOME>/services/shared/security/ directory.
3. Go to the <INFA_HOME>/source/java/jre/bin directory, and then run the following keytool command to import each copied certificate as a trusted certificate into the Informatica domain keystore:
keytool -import -file <INFA_HOME>/services/shared/security/<certificate>.cer -alias <alias name> -keystore <INFA_HOME>/services/shared/security/infa_truststore.jks -storepass <Informatica domain keystore password>
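Because the keytool command is repeated for each copied certificate, it can help to generate the commands rather than type them. The Python snippet below is a rough sketch that builds one keytool argument list per .cer file in the security directory, using the same flags as the command above. The function name is hypothetical, and deriving the alias from the certificate file name is an assumption; use whatever alias naming your environment requires.

```python
from pathlib import Path


def keytool_import_commands(infa_home, storepass):
    """Build one keytool argument list per .cer file in the security directory."""
    security_dir = Path(infa_home) / "services" / "shared" / "security"
    keystore = security_dir / "infa_truststore.jks"
    commands = []
    for cert in sorted(security_dir.glob("*.cer")):
        commands.append([
            "keytool", "-import",
            "-file", str(cert),
            "-alias", cert.stem,  # assumption: alias taken from the file name
            "-keystore", str(keystore),
            "-storepass", storepass,
        ])
    return commands
```

Each returned list can be printed for review or passed to `subprocess.run` from the <INFA_HOME>/source/java/jre/bin directory. Note that keytool may ask you to confirm that you trust each certificate before it imports it.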
Basic Information
The General tab includes the following basic information about the resource:
Information | Description |
---|---|
Name | The name of the resource. |
Description | The description of the resource. |
Resource type | The type of the resource. |
Execute On | You can choose to execute on the default catalog server or offline. |
Resource Connection Properties
The General tab includes the following properties:
Property | Description |
---|---|
Role-based Authentication | Option to use the Amazon Elastic Compute Cloud (Amazon EC2) instance profile credentials when Enterprise Data Catalog is installed on an Amazon EC2 instance. |
AWS Access Key | Access Key of the Amazon Web Services account. |
AWS Secret Key | Secret key of the Amazon Web Services account. |
AWS Region | Amazon Web Services region from where you want to scan the Glue Catalog. |
The following table describes the properties that you can configure in the Source Metadata section of the Metadata Load Settings tab:
Property | Description |
---|---|
Enable Source Metadata | Enables metadata extraction. |
Database Filter | Filter that enables you to include or exclude databases in the resource run. You can also specify a regular expression that represents databases you want to include or exclude. |
Table Filter | Filter that enables you to include or exclude specific tables in the resource run. Enter a combination of regular expressions and wildcard characters to match the tables that you want to include or exclude. You can also enter table names to include them in the resource run. Use a semicolon to separate the wildcard patterns and table names. |
Enable Reference Resources | Option to extract metadata about assets that are not included in this resource, but referred to in the resource. Examples include source and target tables in PowerCenter mappings, and source tables and files from Tableau reports. |
Create Athena Resources | Indicates whether or not to create an Athena data source. |
Retain Unresolved Reference Assets | Option to retain unresolved reference assets in the catalog after you assign connections. Retaining unresolved reference assets helps you view the complete lineage. The unresolved assets include deleted files, temporary tables, and other assets that are not present in the primary resource. |
Auto Assign Connections | Indicates whether the connections must be assigned automatically. |
Memory | The memory value required to run a scanner job. Specify one of the following memory values: Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How To-Library Articles tab in the Informatica Doc Portal. |
Custom Options | JVM parameters that you can set to configure the scanner container. Use the following arguments to configure the parameters: -Dscannerloglevel=<DEBUG/INFO/ERROR>. Changes the log level of the scanner to a value such as DEBUG, INFO, or ERROR. The default value is INFO. -Dscanner.container.core=<No. of core>. Increases the number of cores for the scanner container. The value must be a number. -Dscanner.yarn.app.environment=<key=value>. Key-value pair that you need to set in the Yarn environment. Use a comma to separate multiple key-value pairs. -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0>. Increases the scanner container memory when pmem is enabled. The default value is 1. |
Track Data Source Changes | View metadata source change notifications in Enterprise Data Catalog. |
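To make the Custom Options arguments concrete, the Python snippet below is a minimal sketch that assembles a value from the documented -D parameters. The function name is hypothetical, and joining the arguments with spaces follows the usual convention for JVM arguments, which is an assumption; the table above does not state how multiple arguments are separated in the field.

```python
def build_custom_options(log_level="INFO", cores=None, yarn_env=None, pmem_ratio=None):
    """Assemble a Custom Options string from the documented -D parameters.

    Joining arguments with spaces is an assumption based on the usual
    JVM argument convention.
    """
    options = [f"-Dscannerloglevel={log_level.upper()}"]
    if cores is not None:
        # The documentation requires the core value to be a number.
        if not isinstance(cores, int) or cores < 1:
            raise ValueError("cores must be a positive integer")
        options.append(f"-Dscanner.container.core={cores}")
    if yarn_env:
        # Multiple key=value pairs are separated by commas.
        pairs = ",".join(f"{key}={value}" for key, value in yarn_env.items())
        options.append(f"-Dscanner.yarn.app.environment={pairs}")
    if pmem_ratio is not None:
        options.append(
            f"-Dscanner.pmem.enabled.container.memory.jvm.memory.ratio={pmem_ratio}"
        )
    return " ".join(options)


if __name__ == "__main__":
    print(build_custom_options("debug", cores=2))
```

Under these assumptions, calling the function with a DEBUG log level and two cores yields a value that contains both -Dscannerloglevel=DEBUG and -Dscanner.container.core=2.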