You can use the Amazon S3 catalog source to extract metadata from an Amazon S3 source system. You can also extract metadata from a third-party storage provider such as Scality RING or MinIO.
Amazon Simple Storage Service (Amazon S3) is an object storage service offered by Amazon Web Services (AWS).
Extracted metadata
Metadata Command Center extracts the following metadata from an Amazon S3 source system:
•File
•Folder
•Bucket
•Iceberg Table
•Iceberg Field
You can extract metadata from S3 data sources. By default, Metadata Command Center extracts metadata from Amazon S3 storage, but you can also extract metadata from Scality or MinIO storage.
You can extract metadata from the following file types:
•AVRO
•Compressed files
- TAR
- ZIP
•Delimited files
- CSV
- TSV
•JSON
•Microsoft Excel files
- Excel 97-2003 Workbook with XLS extension
- Excel Workbook with XLSX extension
- Excel Macro-Enabled Workbook with XLSM extension
•ORC
•Parquet
•TXT
•XML
•XSD
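The list above can be sketched as a lookup from file extension to file type, for example to check in advance which objects in a bucket fall into a supported type. This is a minimal illustration, not how Metadata Command Center detects file types. Only the XLS, XLSX, and XLSM extensions are stated in this topic; every other extension in the mapping, and the function name classify_object_key, are assumptions.

```python
from pathlib import PurePosixPath

# Assumed extension-to-type mapping. Only the XLS, XLSX, and XLSM extensions
# come from the documentation; the others are conventional guesses.
SUPPORTED_EXTENSIONS = {
    ".avro": "AVRO",
    ".tar": "Compressed (TAR)",
    ".zip": "Compressed (ZIP)",
    ".csv": "Delimited (CSV)",
    ".tsv": "Delimited (TSV)",
    ".json": "JSON",
    ".xls": "Microsoft Excel (XLS)",
    ".xlsx": "Microsoft Excel (XLSX)",
    ".xlsm": "Microsoft Excel (XLSM)",
    ".orc": "ORC",
    ".parquet": "Parquet",
    ".txt": "TXT",
    ".xml": "XML",
    ".xsd": "XSD",
}


def classify_object_key(key):
    """Return the supported file type for an S3 object key, or None.

    Hypothetical helper: matches on the lowercased extension only and does
    not inspect the file content.
    """
    suffix = PurePosixPath(key).suffix.lower()
    return SUPPORTED_EXTENSIONS.get(suffix)
```

For example, classify_object_key("sales/2024/orders.CSV") returns "Delimited (CSV)", while a key with an extension outside the list returns None.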
The following table lists the partition structures associated with the file types that you can extract metadata from:
File type | Partition structure
AVRO | Single partition, multiple partitions, schema merge
CSV | Single partition, multiple partitions, schema merge
JSON | Single partition, multiple partitions, schema merge
Parquet | Single partition, multiple partitions, schema merge
XML | Single partition, multiple partitions, schema merge
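This topic does not define schema merge. One common reading is that the fields found in the individual partition files are combined into a single schema. The sketch below illustrates that reading only. The function name merge_schemas, the dictionary representation of a schema, and the choice to raise an error on conflicting types are all assumptions, not documented product behavior.

```python
def merge_schemas(partition_schemas):
    """Combine per-partition schemas into one, keeping first-seen field order.

    Each schema is a dict that maps a field name to a type name.
    Assumption: a field that appears with two different types is an error.
    """
    merged = {}
    for schema in partition_schemas:
        for name, type_name in schema.items():
            if name in merged and merged[name] != type_name:
                raise ValueError(
                    f"Field {name!r} has conflicting types: "
                    f"{merged[name]} and {type_name}"
                )
            merged.setdefault(name, type_name)
    return merged
```

With two partitions whose schemas are {"id": "INT", "name": "STRING"} and {"id": "INT", "city": "STRING"}, the merged schema contains the fields id, name, and city, in that order.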
You can extract workbooks, worksheets, and columns from Microsoft Excel files.
You can extract metadata from XML and XSD file formats as XML file objects. Metadata Command Center extracts only elements and attributes from XML and XSD files. If an XML file is larger than 100 KB, Metadata Command Center extracts metadata only from the first 100 KB of the file. For XSD files, Metadata Command Center extracts the complete metadata.
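To get a sense of what elements and attributes means for a given XML file, you can list them yourself with the Python standard library. This is a minimal sketch and not the extraction logic of Metadata Command Center. The function names are hypothetical, and the size check assumes that 1 KB is 1024 bytes, which this topic does not specify.

```python
import xml.etree.ElementTree as ET

# Assumption: 100 KB is interpreted as 100 * 1024 bytes.
SIZE_LIMIT_BYTES = 100 * 1024


def list_elements_and_attributes(xml_text):
    """Return (element name, sorted attribute names) pairs in document order."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        found.append((element.tag, sorted(element.attrib)))
    return found


def exceeds_size_limit(xml_bytes):
    """Return True if the file content is larger than the assumed limit."""
    return len(xml_bytes) > SIZE_LIMIT_BYTES
```

For the document `<orders region="EU"><order id="1" status="open"><item>pen</item></order></orders>`, the first function returns the elements orders, order, and item together with their attribute names. Element text such as "pen" is not part of the result.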
Data profiling for Amazon S3 objects
Configure data profiling to run profiles on the metadata extracted from an Amazon S3 source system. You can view the profiling statistics in Data Governance and Catalog.
You can run data profiles on the following objects:
•Avro
•CSV
•JSON
•Parquet
To run profiles on Avro, JSON, or Parquet files, connect to an advanced cluster.
For more information about advanced clusters, see Advanced Clusters in the Cloud Common Services help.
You can also run profiles on a Secure Agent that is installed on an Amazon EC2 instance and uses the Instance Metadata Service Version 2 (IMDSv2) method.
The data profiling task runs profiles on the following data types for Avro, CSV, JSON, and Parquet file formats:
File format | Data types
Avro | INT, STRING
CSV | STRING
JSON | ARRAY, DOUBLE, INTEGER, OBJECT, STRING
Parquet | BOOLEAN, INT32, INT64, FLOAT, DOUBLE, DATE, DECIMAL, STRING
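The data types above, together with the advanced cluster requirement for Avro, JSON, and Parquet files, can be captured in a small lookup, for example to check a column list before you enable profiling. This is a sketch that encodes only what this topic states. The names PROFILED_TYPES, is_profiled, and needs_advanced_cluster are hypothetical and are not part of any Informatica API.

```python
# Data types that the data profiling task profiles, per file format,
# as listed in this topic.
PROFILED_TYPES = {
    "avro": {"INT", "STRING"},
    "csv": {"STRING"},
    "json": {"ARRAY", "DOUBLE", "INTEGER", "OBJECT", "STRING"},
    "parquet": {
        "BOOLEAN", "INT32", "INT64", "FLOAT",
        "DOUBLE", "DATE", "DECIMAL", "STRING",
    },
}

# File formats that require an advanced cluster for profiling.
ADVANCED_CLUSTER_FORMATS = {"avro", "json", "parquet"}


def is_profiled(file_format, data_type):
    """Return True if the data type is profiled for the given file format."""
    types = PROFILED_TYPES.get(file_format.lower(), set())
    return data_type.upper() in types


def needs_advanced_cluster(file_format):
    """Return True if profiling this file format requires an advanced cluster."""
    return file_format.lower() in ADVANCED_CLUSTER_FORMATS
```

For example, is_profiled("Parquet", "INT64") is True, is_profiled("Avro", "DOUBLE") is False, and needs_advanced_cluster("CSV") is False.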
Sampling type
You can run the data profiling task on all rows for an Amazon S3 source system.
Compatible connectors
Before you configure an Amazon S3 catalog source, you must connect to the Amazon S3 source system.
Use the Amazon S3 v2 connector to connect to the Amazon S3 source system.
For information about configuring a connection, see Connections.