Amazon S3
Amazon S3 is a simple storage service offered by Amazon Web Services (AWS) that provides object storage through a web service interface.
Objects Extracted
Enterprise Data Catalog extracts metadata from files in an Amazon S3 data source. You can also extract metadata from an Amazon S3 compatible storage such as Scality RING.
Note: In Enterprise Data Catalog version 10.4.1.2 or later, you can use the Amazon S3 resource to connect to an Amazon S3 data source by using a temporary session token.
Permissions to Configure the Resource
Configure the read permission on the Amazon S3 data source for the user account that you use to access the data source.
Configure the access permission for the user account if you use a user account that is different from the user account used to create the Amazon S3 data source.
Configure the GetObject and ListBucket permissions to perform metadata and profile scans.
In the Amazon EC2 domain, if the user account is configured for an IAM role, specify only the AWS bucket name and AWS URL. Do not include the AWS Secret Access Key when you perform a test connection. For Endpoint detection and response (EDR), do not specify the AWS access key and secret key.
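The GetObject and ListBucket permissions above can be expressed as an IAM policy. The following Python sketch builds one; the bucket name my-edc-bucket is a placeholder, not a value from this document.

```python
import json

# Sketch of an IAM policy granting the GetObject and ListBucket
# permissions described above. "my-edc-bucket" is a placeholder.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-edc-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::my-edc-bucket",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Note that s3:ListBucket applies to the bucket ARN itself, while s3:GetObject applies to the objects under it.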
Prerequisites
Before you use the Amazon S3 resource, perform the following steps to run the certificates-importer.jar utility to import the required root certificates:
1. Disable the Catalog Service.
2. Navigate to the directory where the utility is located: <Informatica installation directory>/services/CatalogService/ScannerBinaries/certificates-importer
3. From the certificates-importer directory, download the following JAR file: certificates-importer.jar
4. At the command prompt, run the following command: java -jar certificates-importer.jar
5. Provide the following arguments:
   a. Path to the directory that contains the certificates.
   b. Path to the Informatica truststore.
   c. Your keystore password.
   d. Optionally, -f. Use this argument if you want the utility to import all certificates without your interaction.
6. Enable the Catalog Service.
Note: If the proxy server used to connect to the data source is SSL enabled, you must download the proxy server certificates on the Informatica domain machine.
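The command line from steps 4 and 5 can be sketched as follows. All paths and the keystore password are examples only; the sketch assembles and prints the command rather than running it, so you can verify the arguments first.

```python
from pathlib import Path

# Example arguments only; adjust to your environment.
cert_dir = Path("/tmp/root-certificates")
truststore = Path("/opt/informatica/services/shared/security/infa_truststore.jks")
keystore_password = "changeit"  # placeholder

# Command from step 4 with the arguments from step 5.
# The optional -f flag imports all certificates without prompting.
cmd = [
    "java", "-jar", "certificates-importer.jar",
    str(cert_dir), str(truststore), keystore_password, "-f",
]
print(" ".join(cmd))
# To actually run it from the certificates-importer directory:
# subprocess.run(cmd, check=True)
```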
Supported File Types
The Amazon S3 resource enables you to extract metadata from structured, unstructured, and extended unstructured files.
The resource supports the following structured file types:
- Avro files
- Delimited files
- Text files
- JSON files
- Parquet files
- XML files
The resource supports the following unstructured file types:
- Apple files
- Compressed files
- Email
- Microsoft Excel
- Microsoft PowerPoint
- Microsoft Word
- Apache OpenOffice files
- PDF files
- Web pages
Assign read permission to the files to extract metadata.
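A scan sorts each file into one of these categories before extraction. The following Python sketch models that classification by file extension; the extension-to-category mapping is illustrative and may not match the product's internal detection.

```python
from pathlib import Path

# Illustrative extension mapping based on the file-type lists above;
# the product's actual detection logic may differ.
STRUCTURED = {".avro", ".csv", ".txt", ".json", ".parquet", ".xml"}
UNSTRUCTURED = {".xlsx", ".pptx", ".docx", ".pdf", ".html", ".eml", ".zip"}

def category(name: str) -> str:
    """Classify a file as structured, unstructured, or other."""
    ext = Path(name).suffix.lower()
    if ext in STRUCTURED:
        return "structured"
    if ext in UNSTRUCTURED:
        return "unstructured"
    return "other"
```

Files that fall into the "other" bucket correspond to the Other File Types property described later, which captures only basic metadata such as file size, path, and time stamp.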
Basic Information
The General tab includes the following basic information about the resource:
Information | Description |
---|---|
Name | The name of the resource. |
Description | The description of the resource. |
Resource type | The type of the resource. |
Execute On | You can choose to execute on the default catalog service or offline. |
Resource Connection Properties
The General tab includes the following properties:
Property | Description |
---|---|
S3 Supported Type | Indicates if the data source is an Amazon S3 compatible storage such as Scality RING. |
REST Endpoint URL | The REST endpoint URL of the Amazon S3 compatible storage from which to extract metadata. |
Amazon Web Services Bucket URL | Amazon Web Services URL to access a bucket. |
Are Credentials Temporary | Enables the resource to connect and extract metadata from an Amazon S3 data source by using temporary credentials. |
AWS Access Key ID | Amazon Web Services access key ID to sign requests that you send to Amazon Web Services. Note: Specify the Amazon S3 compatible storage access key ID to sign requests that you send to the Amazon S3 compatible storage. |
AWS Secret Access Key | Amazon Web Services secret access key to sign requests that you send to Amazon Web Services. Note: Specify the Amazon S3 compatible storage secret access key to sign requests that you send to the Amazon S3 compatible storage. |
AWS Session Token | The temporary session token to connect to the Amazon S3 data source. |
Amazon Web Services Bucket Name | Amazon Web Services bucket name that Enterprise Data Catalog needs to scan. |
Source Directory | The source directory from where metadata must be extracted. |
Connect through a proxy server | Proxy server to connect to the data source. Default is Disabled. |
Proxy Host | Host name or IP address of the proxy server. |
Proxy Port | Port number of the proxy server. |
Proxy User Name | Required for authenticated proxy. Authenticated user name to connect to the proxy server. |
Proxy Password | Required for authenticated proxy. Password for the authenticated user name. |
The Metadata Load Settings tab includes the following properties:
Property | Description |
---|---|
Enable Source Metadata | Extracts metadata from the data source. |
File Types | Select one of the following options to specify the file types from which you want to extract metadata: - All. Extracts metadata from all file types.
- Select. Extracts metadata from specific file types. Perform the following steps to specify the file types:
1. Click Select. The Select Specific File Types dialog box appears.
2. Select the required file types from the following options:
   - Extended unstructured formats. Extracts metadata from file types such as audio files, video files, image files, and ebooks.
   - Structured file types. Extracts metadata from file types such as Avro, Parquet, JSON, XML, text, and delimited files.
   - Unstructured file types. Extracts metadata from file types such as Microsoft Excel, Microsoft PowerPoint, Microsoft Word, web pages, compressed files, emails, and PDF.
3. Click Select.
Note: You can select the Specific File Types option in the dialog box to select files under all the categories.
|
Other File Types | Extracts basic file metadata, such as file size, path, and time stamp, from file types not listed in the File Types property. |
Treat Files Without Extension As | Select the option that determines how files without an extension are identified. |
Enter File Delimiter | Specify the file delimiter if the file from which you extract metadata uses a delimiter other than the following delimiters: - Comma (,)
- Horizontal tab (\t)
- Semicolon (;)
- Colon (:)
- Pipe symbol (|)
Enclose each delimiter in single quotes. For example, '$'. Use a comma to separate multiple delimiters. For example, '$','%','&'. |
First Level Directory | Specify a directory or a list of directories under the source directory. If you leave this option blank, Enterprise Data Catalog imports all the files from the specified source directory. To specify a directory or a list of directories, perform the following steps: 1. Click Select.... The Select First Level Directory dialog box appears.
2. Use one of the following options to select the required directories:
   - Select from list. Select the required directories from a list of directories.
   - Select using regex. Provide an SQL regular expression to select directories that match the expression.
Note: If you want to select multiple directories, separate the directories with a semicolon (;). |
Enable Exclusion Filter | Filter to exclude folders from the data source during the metadata extraction phase. This option appears when you choose Amazon S3 V2 as the resource type. |
Filter Condition | Filter condition to exclude folders from the data source. Select the filter condition from the following list: - Starting With. Excludes all folders that start with the keyword.
- Ending With. Excludes all folders that end with the keyword.
- Contains. Excludes all folders that contain the keyword.
- Named. Excludes all folders that are named as the keyword.
This option appears when you choose Amazon S3 V2 as the resource type. |
Filter Value | Filter value or pattern for the filter condition. Specify the value or pattern within double quotes. Use a comma to separate multiple values. This option appears when you choose Amazon S3 V2 as the resource type. |
Is Filter Case Sensitive | Specify if the filter value is case sensitive. Default is True. This option appears when you choose Amazon S3 V2 as the resource type. |
Incremental Scan | Scans only the files that you added or modified after the last resource run. Incremental profiling picks up the files received from metadata extraction. Note: You can use incremental profiling only when source metadata extraction is enabled. For more information about incremental scanning, see the FAQ: What is an incremental scan? Knowledge Base article. |
Recursive Scan | Recursively scans the subdirectories under the selected first-level directories. Recursive scan is required for partitioned file discovery. |
Enable Partitioned File Discovery | Identifies and publishes horizontally partitioned files under the same directory and files organized in hierarchical Hive-style directory structures as a single partitioned file. |
Non Strict Mode | Detects partitions in parquet files when compatible schemas are identified in the files. |
Case Sensitive | Specifies whether the resource is configured as case sensitive. Select one of the following values: - True. Select the check box to configure the resource as case sensitive.
- False. Clear the check box to configure the resource as case insensitive.
Default is True. |
Memory | The memory required to run the scanner job. Select a value based on the size of the imported data set. Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How To-Library Articles tab in the Informatica Doc Portal. |
Custom Options | JVM parameters that you can set to configure the scanner container. Use the following arguments to configure the parameters: - -Dscannerloglevel=<DEBUG/INFO/ERROR>. Changes the scanner log level to DEBUG, ERROR, or INFO. Default is INFO.
- -Dscanner.container.core=<number of cores>. Increases the number of cores for the scanner container. The value must be a number.
- -Dscanner.yarn.app.environment=<key=value>. Key-value pair to set in the YARN environment. Use a comma to separate multiple key-value pairs.
- -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0>. Increases the scanner container memory when pmem is enabled. Default is 1.
- -DmaxPartFilesToValidatePerTable=<number>. Validates the specified number of part files in the partitioned table. Default is 10.
- -DmaxPartFilesToValidatePerPartition=<number>. Validates the specified number of part files for each partition in the partitioned table. Default is 5.
- -DexcludePatterns=<comma-separated regex patterns>. Excludes files while parsing partition tables based on the regex patterns. By default, file names that start with a period or an underscore are excluded.
|
Track Data Source Changes | View metadata source change notifications in Enterprise Data Catalog. |
Custom Partition Configuration File | Detects custom partitions in the data source. Select the configuration file in JSON format. This option appears when you choose Amazon S3 V2 as the resource type. |
Pruned Partition Configuration File | Specify the configuration file in JSON format for partition pruning. This option appears when you choose Amazon S3 V2 as the resource type. |
Disable Partition Pruning | Option to disable partition pruning. This option appears when you choose Amazon S3 V2 as the resource type. |
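The exclusion filter options in the table above (Filter Condition, Filter Value, Is Filter Case Sensitive) behave like simple string matches on folder names. The following Python sketch models those semantics; the function is illustrative, not product code.

```python
def excluded(folder: str, condition: str, value: str, case_sensitive: bool = True) -> bool:
    """Illustrative model of the four filter conditions; not product code."""
    if not case_sensitive:
        folder, value = folder.lower(), value.lower()
    if condition == "Starting With":
        return folder.startswith(value)
    if condition == "Ending With":
        return folder.endswith(value)
    if condition == "Contains":
        return value in folder
    if condition == "Named":
        return folder == value
    raise ValueError(f"unknown condition: {condition}")

# Example: exclude folders starting with "tmp", ignoring case.
folders = ["tmp_stage", "sales_2023", "archive", "TMP_backup"]
kept = [f for f in folders if not excluded(f, "Starting With", "tmp", case_sensitive=False)]
# kept -> ["sales_2023", "archive"]
```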
You can enable data discovery for an Amazon S3 resource. For more information, see the Enable Data Discovery topic.
You can enable composite data domain discovery for an Amazon S3 resource. For more information, see the Composite Data Domain Discovery topic.
You can use Amazon S3 and AWS Databricks Delta tables to run column profiles and discover data domains on both native and AWS Databricks run-time environments.
Profile Avro files
You can extract metadata, discover Avro partitions, and run profiles on Avro files with multiple-level hierarchy using an Amazon S3 resource on the Spark engine. When you run profiles on Avro files, the data types of assets appear in the profiling results of the Enterprise Data Catalog tool.
The following asset data types appear in the profiling results:
- Arrays with primitive data types. You can view the primitive data type of an array in the System Attributes section of the Overview tab of the asset.
- Arrays with complex data types. You can expand the list to view the data types of arrays with complex data types in the Fields tab of the asset.
- Unions with multiple primitive data types. You can expand the list to view the data types of unions with multiple primitive data types that are not null in the Fields tab of the asset. All the data types in the union appear in the list.
- Unions with null and a primitive or complex data type appear as the primitive or complex data type, respectively, in the catalog.
- Maps. You can expand the list to view the data types of maps with keys and values in the Fields tab of the asset.
- Only primitive data types appear in the catalog. Logical data types do not appear in the catalog.
When you select Non Strict Mode on the Metadata Load Settings tab of the resource to detect partitions in Avro files, partition discovery still happens in strict mode.
If a partition folder contains more than 10 subfolders, or a subfolder contains more than 10 files, some folders are not detected as potential partitions. To avoid this issue, use the -DmaxChildPathsToValidate JVM option to override the default value and increase the number of folders to be validated.
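For example, to raise that limit, the Custom Options property on the Metadata Load Settings tab could include the flag with a higher value. The flag name comes from the paragraph above; the value 50 is only an example:

```
-DmaxChildPathsToValidate=50
```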
You cannot profile Avro files that contain any of the following data types:
- Union of multiple primitive data types
- Enum
- Map with complex values
Note: An Avro file that includes any of the above data types fails during profiling.
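For illustration, the following sketch shows a minimal Avro schema in which one field is a union of multiple non-null primitive types, the first unsupported case above. The record and field names are made up.

```python
# Hypothetical Avro schema as a Python dict. The "code" field is a union
# of two non-null primitives (int and string), so a file with this schema
# cannot be profiled per the note above.
schema = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "id", "type": "long"},                # plain primitive: supported
        {"name": "note", "type": ["null", "string"]},  # null + primitive: appears as string
        {"name": "code", "type": ["int", "string"]},   # union of multiple primitives: unsupported
    ],
}

# Detect fields whose type is a union with more than one non-null branch.
multi_primitive_unions = [
    f["name"] for f in schema["fields"]
    if isinstance(f["type"], list) and len([t for t in f["type"] if t != "null"]) > 1
]
# multi_primitive_unions -> ["code"]
```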