Amazon S3

Amazon Simple Storage Service (Amazon S3) is an object storage service offered by Amazon Web Services (AWS).

Objects extracted

Metadata Command Center extracts the following metadata from an Amazon S3 source system:

•Document
•File
•Folder
•Bucket
•Iceberg Table
•Iceberg Field
•PDF
•Text

You can extract metadata from S3 data sources. By default, Metadata Command Center extracts metadata from Amazon S3 storage. You can also extract metadata from S3 Scality storage or MinIO.

File types

You can extract metadata from the following file types:

•AVRO
•Compressed files

- TAR
- ZIP

•Delimited files

- CSV
- TSV

•Document files

- DOC
- DOCX

•JSON
•Markdown
•Microsoft Excel files

- Excel 97-2003 Workbook with XLS extension
- Excel Workbook with XLSX extension
- Excel Macro-Enabled Workbook with XLSM extension

•ORC
•Parquet
•PDF
•TXT
•XML
•XSD

The following table lists the partition structures supported for each file type:

File Type	Partition structure
AVRO	Single partition, multiple partitions, schema merge
CSV	Single partition, multiple partitions, schema merge
JSON	Single partition, multiple partitions, schema merge
Parquet	Single partition, multiple partitions, schema merge
XML	Single partition, multiple partitions, schema merge

You can extract workbooks, worksheets, and columns from Microsoft Excel files.

You can extract metadata from XML and XSD files as XML file objects. Metadata Command Center extracts only elements and attributes from XML and XSD files. If an XML file is larger than 100 KB, Metadata Command Center extracts metadata only from the first 100 KB of the file. For XSD files, Metadata Command Center extracts the complete metadata regardless of file size.
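The effect of the 100 KB limit can be illustrated with a minimal Python sketch. This is not the Metadata Command Center implementation. The function name `scan_xml_prefix`, the reading of 100 KB as 100 * 1024 bytes, and the `element@attribute` naming are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Assumption: "100 KB" is read here as 100 * 1024 bytes.
XML_SCAN_LIMIT_BYTES = 100 * 1024


def scan_xml_prefix(data: bytes, limit: int = XML_SCAN_LIMIT_BYTES):
    """Collect element and attribute names found in the first `limit` bytes."""
    parser = ET.XMLPullParser(events=("start",))
    # Feed only the prefix and never call close(), so a document that is
    # cut off partway through does not raise a parse error.
    parser.feed(data[:limit])

    elements, attributes = [], []
    for _event, element in parser.read_events():
        if element.tag not in elements:
            elements.append(element.tag)
        for name in element.attrib:
            qualified = f"{element.tag}@{name}"  # hypothetical naming scheme
            if qualified not in attributes:
                attributes.append(qualified)
    return elements, attributes


sample = b'<orders region="EU"><order id="1"><item sku="A"/></order></orders>'
elements, attributes = scan_xml_prefix(sample)
print(elements)
print(attributes)
```

Under these assumptions, elements and attributes that start after the limit are simply never seen, which is why a large XML file can show fewer fields in the catalog than the file actually contains.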

Prerequisites for configuring the Amazon S3 catalog source

Use the Amazon S3 v2 connector to connect to the Amazon S3 source system. For information about configuring a connection in Administrator, see Connections in the Cloud Common Services help.

Configure permissions

To extract metadata and run profiles, you need account access and permissions on the Amazon S3 source system.

Permissions to extract metadata: Configure the following permissions on the Amazon S3 catalog source:
Permissions to run data profiles: To perform data profiling, grant the following permissions to the user:

Data profiling for Amazon S3 objects

Configure data profiling to run profiles on the metadata extracted from an Amazon S3 source system. You can run data profiles on the following Amazon S3 objects:

•AVRO
•CSV
•Parquet

You can view the profiling statistics in Data Governance and Catalog. The data profiling task runs profiles on the following data types for AVRO, CSV, and Parquet file formats:

File format	Data type
Avro	INT, STRING
CSV	STRING
Parquet	BOOLEAN, INT32, INT64, FLOAT, DOUBLE, DATE, DECIMAL, STRING

Sampling type: You can run the data profiling task on all rows for an Amazon S3 source system.

Note:

To run a profile on an Avro or Parquet file, connect to an advanced cluster. For more information about advanced clusters, see the Advanced Clusters help.

Data classification for Amazon S3 objects

Configure data classification for Amazon S3 catalog sources to classify and organize data in your organization. You can view the data classification results in Data Governance and Catalog.

Connection properties

When you configure a connection to the Amazon S3 source system in Administrator, you can view the connection properties for that connection on the Registration page in Metadata Command Center.

The following table describes the Amazon S3 connection properties:

Property	Description
Runtime Environment	The execution platform that runs tasks. The runtime environment is either a Secure Agent or a serverless runtime environment.
Access Key	Amazon Web Services access key ID to sign requests that you send to Amazon Web Services.
Secret Key	Amazon Web Services secret access key to sign requests that you send to Amazon Web Services.
Bucket Name	Name of the Amazon S3 bucket that you want to scan.
Region	A named set of Amazon resources in the same geographical area.

Note:

To authenticate to Amazon S3 source systems, you can create and use an access key and a secret key. Alternatively, you can configure Identity and Access Management (IAM) authentication in your AWS user account. If you use IAM authentication, you do not need to provide the access key ID and secret access key explicitly in the Amazon S3 connection. For more information about managing IAM authentication for Amazon S3, see Amazon S3 Connector in the Connectors help.
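The two authentication options can be sketched as a small validation routine. The class `S3ConnectionSettings` and its `auth_mode` method are hypothetical and exist only for illustration. The sketch assumes that leaving both keys empty means the connection relies on IAM authentication, and that supplying only one of the two keys is a configuration error.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class S3ConnectionSettings:
    """Hypothetical container that mirrors the connection properties table."""

    bucket_name: str
    region: str
    access_key: Optional[str] = None
    secret_key: Optional[str] = None

    def auth_mode(self) -> str:
        """Return which authentication option the settings imply."""
        has_access = bool(self.access_key)
        has_secret = bool(self.secret_key)
        if has_access and has_secret:
            return "access-key"
        if not has_access and not has_secret:
            # Assumption: no explicit keys means IAM authentication is used.
            return "iam"
        raise ValueError("Provide both Access Key and Secret Key, or neither.")


# Placeholder values for illustration only.
print(S3ConnectionSettings("example-bucket", "us-east-1").auth_mode())
```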

Configuration parameters for metadata extraction

Expand the Catalog Source Configuration Options in the Metadata Extraction tab of the Configuration page. Configure the following parameters for extracting metadata from an Amazon S3 source system:

Parameter	Description
Extract Group Elements from Hierarchical Files	Select one of the following options to extract group or leaf elements from hierarchical files: - Yes. Extracts group elements from hierarchical files with the complete hierarchy of hierarchical fields. You can view the hierarchy of hierarchical files in the Hierarchy tab of assets in Data Governance and Catalog. - No. Extracts only leaf elements from hierarchical files without the complete hierarchy of hierarchical fields. You can extract group elements from hierarchical files for the following file types: - AVRO. Extracts and groups hierarchical files and hierarchical fields. - Parquet. Extracts and groups hierarchical files and hierarchical fields. - JSON. Extracts and groups hierarchical files and hierarchical fields. - XML. Extracts and groups elements and attributes. For XML file types, a maximum depth of 1000 elements is permitted within a single element in the hierarchy. - XSD. Extracts and groups elements and attributes. If you change the value of this parameter and run the catalog source again without purging it, the asset page doesn't display the hierarchical elements in the correct hierarchy groups. Purge the catalog source before you run it again.
Enable Extension-Based File Type Detection	Select one of the following options to detect file types by file extensions or by parsing the file contents: - Yes. Detects file types by file extensions. - No. Parses the file contents to detect file types. Note: You can detect file types by file extensions for the following file types: - CSV - TSV - TXT - XML
Use First Row as Header of Delimited Files	Select one of the following options to use the first row as the header or detect headers automatically for delimited files: - Yes. Detects column headers based on the following rules: - Duplicate headers get suffixed with ‘#’ followed by a number, for example, ABC#1, ABC#2. The detection is not case-sensitive. - Empty column header values appear as UnknownColumn<position>, for example UnknownColumn2. - The header row in the file is detected even if it has a different number of columns than the data rows. - No. Detects headers automatically for delimited files.
Headers of Delimited Files	Specify values to determine headers of delimited files. Separate multiple values by commas. If any value from the list is found in the first row of the delimited file, then the first row is used as the header. Note: This parameter appears only if you choose No for the Use First Row as Header of Delimited Files parameter.
Treat Files Without Extension As	Select one of the following options to identify files without an extension: - Parquet - Avro - JSON
Enter File Delimiter	Specify the file delimiter if the file from which you extract metadata uses a delimiter other than the following delimiters: - Comma (,) - Horizontal tab (\t) - Semicolon (;) - Colon (:) - Pipe symbol (|) Enclose the delimiter in single quotes. For example, '$'. Use a comma to separate multiple delimiters.
Files to be excluded during partition discovery	Specify the regular expression of the files that you want to exclude during partition discovery. Enclose each regular expression in double quotes. For example, ".json","Customer.csv","Parquet.". Use a comma to separate multiple regular expressions.
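The header rules listed for the Use First Row as Header of Delimited Files parameter can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation. The function name `normalize_headers` is hypothetical. The sketch assumes that the first occurrence of a header keeps its name, that later duplicates are numbered from 1, and that column positions are counted from 1.

```python
def normalize_headers(raw_headers):
    """Apply the duplicate and empty-header rules to a list of header values."""
    seen = {}  # lowercased header -> number of occurrences so far
    result = []
    for position, name in enumerate(raw_headers, start=1):
        name = name.strip()
        if not name:
            # Empty header values appear as UnknownColumn<position>.
            # Assumption: positions are 1-based.
            result.append(f"UnknownColumn{position}")
            continue
        key = name.lower()  # duplicate detection is not case-sensitive
        count = seen.get(key, 0)
        seen[key] = count + 1
        # Assumption: the first occurrence is left unchanged and later
        # duplicates receive the suffixes #1, #2, and so on.
        result.append(name if count == 0 else f"{name}#{count}")
    return result


print(normalize_headers(["ABC", "", "abc", "Abc"]))
```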

Expand the Partitioned File Configuration section in the Metadata Extraction tab of the Configuration page. The following table shows the parameters that you can configure for custom partition detection and partition pruning:

Parameter	Description
Custom Partition Detection Configuration File	Upload a configuration file in the JSON format to specify the pattern of the non-Hive-style custom partitions in the source system. Metadata Command Center detects only those custom partitions that have non-Hive-style date patterns.
Partitioned Pruning Configuration File	Upload a configuration file in the JSON format. In the file, specify additional information to identify partitions for determining the relevant schema in the source system.
Partition Detection	Select one of the following options: - Yes. Enables partition detection. - No. Disables partition detection.
Partition Pruning	Applicable if you enable partition detection. Select one of the following options: - Yes. Enables partition pruning. - No. Disables partition pruning. Default is Yes.

The following code shows a sample partition pruning configuration file in the JSON format:

{
  "SortablePartitionKeys": [
    {
      "Key": "Region",
      "PrunedValues": [
        "USA",
        "IND"
      ]
    }
  ]
}

In this example, only the USA and IND values of the Region partition key are used to determine the schema.
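The following Python sketch shows one way to read such a file and see which partitions it would select. It is an illustration only. The helper names `load_pruned_values` and `select_partitions`, the sample paths, and the Hive-style `key=value` folder layout are assumptions, and the way Metadata Command Center applies the file internally might differ.

```python
import json

config_text = """
{
  "SortablePartitionKeys": [
    {"Key": "Region", "PrunedValues": ["USA", "IND"]}
  ]
}
"""


def load_pruned_values(text):
    """Map each partition key to the set of values kept for schema detection."""
    config = json.loads(text)
    return {
        entry["Key"]: set(entry["PrunedValues"])
        for entry in config["SortablePartitionKeys"]
    }


def select_partitions(paths, pruned):
    """Keep paths whose key=value segments match every pruned key.

    Assumption: partitions use a Hive-style layout such as Region=USA.
    """
    selected = []
    for path in paths:
        segments = dict(
            segment.split("=", 1)
            for segment in path.strip("/").split("/")
            if "=" in segment
        )
        if all(segments.get(key) in values for key, values in pruned.items()):
            selected.append(path)
    return selected


# Hypothetical object paths in the scanned bucket.
paths = [
    "sales/Region=USA/part-0.parquet",
    "sales/Region=FRA/part-0.parquet",
    "sales/Region=IND/part-0.parquet",
]
print(select_partitions(paths, load_pruned_values(config_text)))
```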

Data lineage

Lineage data is available for Amazon S3 assets that connect to the following source systems:

•Databricks Notebook
•Informatica products, services, and tools
•Microsoft Azure Data Factory
•Microsoft Power BI
•Tableau

For more information about data lineage, see the Asset Discovery help.