Amazon S3

Amazon Simple Storage Service (Amazon S3) is an object storage service offered by Amazon Web Services (AWS) and accessed through a web service interface.

Objects Extracted

Enterprise Data Catalog extracts files from an Amazon S3 data source. You can also extract metadata from Amazon S3 compatible storage, such as Scality RING.
Note: In Enterprise Data Catalog version 10.4.1.2, you can use the Amazon S3 resource to connect to an Amazon S3 data source by using a temporary session token.

Permissions to Configure the Resource

Configure the read permission on the Amazon S3 data source for the user account that you use to access the data source. If this user account is different from the account that created the Amazon S3 data source, also configure the access permission for it.
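As an illustration of what read permission could look like on the AWS side, the following Python sketch builds an IAM policy document that allows listing one bucket and reading its objects. This is an assumption about how the permission might be granted, not a statement of what Enterprise Data Catalog requires: the function name, the bucket name, and the prefix are hypothetical, and your environment may need additional actions.

```python
import json


def read_only_policy(bucket, prefix=""):
    """Build an IAM policy document granting list and read access to one bucket."""
    object_arn = f"arn:aws:s3:::{bucket}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {   # Reading applies to the objects under the prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [object_arn],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and source directory, for illustration only.
    print(json.dumps(read_only_policy("example-bucket", "landing/"), indent=2))
```

The resulting JSON can be attached to the user account whose access key ID and secret access key you enter in the resource connection properties.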

Supported File Types

The Amazon S3 resource enables you to extract metadata from structured and unstructured files.
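The structured file types supported include Avro, Parquet, JSON, XML, text, and delimited files.
The unstructured file types supported include Microsoft Excel, Microsoft PowerPoint, Microsoft Word, web pages, compressed files, emails, and PDF.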
Assign read and write permissions to the files to extract metadata.

Resource Connection Properties

The General tab includes the following properties:
Property
  Description
Amazon Web Services Bucket URL
  Amazon Web Services URL to access a bucket.
Amazon Web Services Access Key ID
  Amazon Web Services access key ID to sign requests that you send to Amazon Web Services.
  Note: For Amazon S3 compatible storage, specify the access key ID of that storage to sign the requests that you send to it.
Amazon Web Services Secret Access Key
  Amazon Web Services secret access key to sign requests that you send to Amazon Web Services.
  Note: For Amazon S3 compatible storage, specify the secret access key of that storage to sign the requests that you send to it.
Amazon Web Services Bucket Name
  Name of the Amazon Web Services bucket that Enterprise Data Catalog needs to scan.
Source Directory
  The source directory from which metadata must be extracted.
The following image shows sample connection properties on the General tab:
[Image: connection settings for an Amazon S3 resource]
The Metadata Load Settings tab includes the following properties:
Property
  Description
Enable Source Metadata
  Extracts metadata from the data source.
File Types
  Select one of the following options to specify the file types from which you want to extract metadata:
    • All. Use this option to extract metadata from all file types.
    • Select. Use this option to extract metadata from specific file types. Perform the following steps to specify the file types:
      1. Click Select. The Select Specific File Types dialog box appears.
      2. Select the required file types from the following options:
        • Extended unstructured formats. Use this option to extract metadata from file types such as audio files, video files, image files, and ebooks.
        • Structured file types. Use this option to extract metadata from file types such as Avro, Parquet, JSON, XML, text, and delimited files.
        • Unstructured file types. Use this option to extract metadata from file types such as Microsoft Excel, Microsoft PowerPoint, Microsoft Word, web pages, compressed files, emails, and PDF.
      3. Click Select.
      Note: You can select the Specific File Types option in the dialog box to select files under all the categories.
Other File Types
  Extracts basic file metadata, such as file size, path, and time stamp, from file types that are not present in the File Types property.
Treat Files Without Extension As
  Select one of the following options to identify files without an extension:
    • None
    • Avro
    • Parquet
Enter File Delimiter
  Specify the file delimiter if the file from which you extract metadata uses a delimiter other than one of the following delimiters:
    • Comma (,)
    • Horizontal tab (\t)
    • Semicolon (;)
    • Colon (:)
    • Pipe symbol (|)
  Enclose each delimiter in single quotes. For example, '$'. Use a comma to separate multiple delimiters. For example, '$','%','&'.
First Level Directory
  Specify a directory or a list of directories under the source directory. If you leave this option blank, Enterprise Data Catalog imports all the files from the specified source directory.
  To specify a directory or a list of directories, perform the following steps:
    1. Click Select.... The Select First Level Directory dialog box appears.
    2. Use one of the following options to select the required directories:
      • Select from list: select the required directories from a list of directories.
      • Select using regex: provide an SQL regular expression to select the directories that match the expression.
  Note: If you want to select multiple directories, you must separate the directories with a semicolon (;).
Recursive Scan
  Recursively scans the subdirectories under the selected first-level directories. Recursive scan is required for partitioned file discovery.
Enable Partitioned File Discovery
  Identifies and publishes horizontally partitioned files under the same directory and files organized in hierarchical Hive-style directory structures as a single partitioned file.
Case Sensitive
  Specifies that the resource is configured for case sensitivity. Select one of the following values:
    • True. Select this check box to specify that the resource is configured as case sensitive.
    • False. Clear this check box to specify that the resource is configured as case insensitive.
  The default value is True.
Memory
  The memory required to run the scanner job. Select one of the following values based on the size of the imported data set:
    • Low
    • Medium
    • High
  Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Doc Portal.
JVM Options
  JVM parameters that you can set to configure the scanner container. Use the following arguments to configure the parameters:
    • -Dscannerloglevel=<DEBUG/INFO/ERROR>. Changes the log level of the scanner to DEBUG, INFO, or ERROR. Default value is INFO.
    • -Dscanner.container.core=<No. of core>. Increases the number of cores for the scanner container. The value must be a number.
    • -Dscanner.yarn.app.environment=<key=value>. Key-value pairs that you need to set in the YARN environment. Use a comma to separate the key-value pairs.
    • -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0>. Increases the scanner container memory when pmem is enabled. Default value is 1.
    • -DmaxPartFilesToValidatePerTable=<number>. Validates the specified number of part files in the partitioned table. Default value is 10.
    • -DmaxPartFilesToValidatePerPartition=<number>. Validates the specified number of part files for each partition in the partitioned table. Default value is 5.
    • -DexcludePatterns=<comma separated regex patterns>. Excludes the files that match the regex patterns while parsing partitioned tables. By default, file names that start with a period or an underscore are excluded.
    • -DisS3SupportedType='true'. Enables metadata extraction from Amazon S3 compatible storage, such as Scality RING. Default value is false.
    • -DS3RestEndPoint='s3.<computer name>.com'. Specifies the endpoint of the Amazon S3 compatible storage from which to extract metadata.
    • -DareS3CredentialsTemporary='true'. Enables the resource to connect to and extract metadata from an Amazon S3 data source by using a temporary session token. Default value is false.
    • -DawsSessionToken='<Temporary session token>'. Specifies the temporary session token to connect to the Amazon S3 data source.
Track Data Source Changes
  View metadata source change notifications in Enterprise Data Catalog.
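The Enter File Delimiter and JVM Options properties described above take free-text values with a strict format. The following Python sketch shows one way to assemble such values before you paste them into the resource configuration. The helper names (format_delimiters, build_jvm_options), the endpoint, and the token handling are hypothetical and not part of Enterprise Data Catalog. The sketch also assumes that JVM arguments are separated by single spaces, which this guide does not state.

```python
# Delimiters the scanner already recognizes, per the Enter File Delimiter property.
BUILT_IN_DELIMITERS = {",", "\t", ";", ":", "|"}


def format_delimiters(delimiters):
    """Return the custom delimiters as single-quoted, comma-separated text."""
    custom = [d for d in delimiters if d not in BUILT_IN_DELIMITERS]
    return ",".join(f"'{d}'" for d in custom)


def build_jvm_options(endpoint=None, session_token=None, log_level="INFO"):
    """Assemble a JVM Options value from arguments listed in this guide.

    Assumption: arguments are separated by single spaces.
    """
    if log_level not in {"DEBUG", "INFO", "ERROR"}:
        raise ValueError(f"unsupported log level: {log_level}")
    options = [f"-Dscannerloglevel={log_level}"]
    if endpoint is not None:
        # Amazon S3 compatible storage needs the flag and the endpoint together.
        options.append("-DisS3SupportedType='true'")
        options.append(f"-DS3RestEndPoint='{endpoint}'")
    if session_token is not None:
        # Temporary credentials need the flag and the session token together.
        options.append("-DareS3CredentialsTemporary='true'")
        options.append(f"-DawsSessionToken='{session_token}'")
    return " ".join(options)


if __name__ == "__main__":
    print(format_delimiters(["$", "%", ",", "&"]))
    # Hypothetical endpoint, for illustration only.
    print(build_jvm_options(endpoint="s3.example-host.com"))
```

Generating the strings this way keeps the paired arguments together, so that an endpoint is never supplied without -DisS3SupportedType and a session token is never supplied without -DareS3CredentialsTemporary.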
You can enable data discovery for an Amazon S3 resource. For more information, see the Enable Data Discovery topic.
You can enable composite data domain discovery for an Amazon S3 resource. For more information, see the Composite Data Domain Discovery topic.
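To make the Enable Partitioned File Discovery property and the -DexcludePatterns argument more concrete, the following Python sketch groups object keys that follow a Hive-style layout (directories named key=value) under one table path and skips file names that start with a period or an underscore. It is an illustration only: the function name, the sample keys, and the grouping rule are hypothetical, and the scanner's actual detection logic may differ.

```python
import re
from collections import defaultdict

# Assumed default exclusion: file names that start with a period or an underscore.
DEFAULT_EXCLUDE = (r"^[._]",)
# Assumed shape of a Hive-style partition directory, for example year=2020.
PARTITION_DIR = re.compile(r"^[^/=]+=[^/=]+$")


def group_partitioned_files(keys, exclude_patterns=DEFAULT_EXCLUDE):
    """Map each table root to the part files found under its partition directories."""
    excludes = [re.compile(p) for p in exclude_patterns]
    tables = defaultdict(list)
    for key in keys:
        *dirs, file_name = key.split("/")
        if any(rx.search(file_name) for rx in excludes):
            continue  # marker or hidden file, not a part file
        # The table root is everything before the first key=value directory.
        first = next((i for i, d in enumerate(dirs) if PARTITION_DIR.match(d)), None)
        if first is None:
            continue  # not under a Hive-style partition directory
        tables["/".join(dirs[:first])].append(key)
    return dict(tables)


if __name__ == "__main__":
    sample_keys = [
        "sales/year=2020/month=01/part-0000.parquet",
        "sales/year=2020/month=02/part-0000.parquet",
        "sales/year=2020/month=02/_SUCCESS",
        "logs/app.log",
    ]
    for table, parts in group_partitioned_files(sample_keys).items():
        print(table, len(parts))
```

In this sketch, the two Parquet files under sales are reported as one partitioned table, while the _SUCCESS marker and the unpartitioned log file are ignored.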