
Introduction to Amazon S3 catalog sources

You can use Metadata Command Center to extract metadata from a source system.
A source system is any system that contains data or metadata. For example, Amazon S3 is a source system from which you can extract metadata through an Amazon S3 catalog source. A catalog source is an object that represents and contains metadata from the source system.
Before you extract metadata from a source system, you first create and register a catalog source that represents the source system. Then you configure capabilities for the catalog source. A capability is a task that Metadata Command Center can perform, such as metadata extraction, data profiling, data classification, or glossary association.
When Metadata Command Center extracts metadata, Data Governance and Catalog displays the extracted metadata and its attributes as technical assets. You can then perform tasks such as analyzing the assets, viewing lineage, and creating links between those assets and their business context.
The following table describes the capabilities of the catalog source:
Capability
Description
Incremental metadata extraction
An incremental metadata extraction extracts only the objects that are new or changed since the last catalog source job run. It doesn't remove deleted objects from the catalog and, where applicable, doesn't extract metadata of code-based objects.
Serverless Runtime Environment
A serverless runtime environment is an advanced serverless deployment solution that doesn't require downloading, installing, configuring, or maintaining a Secure Agent or Secure Agent group. You can use a serverless runtime environment in the same way that you use a Secure Agent when you configure a catalog source.
Data Profiling and Quality
  • Data Profiling. Assesses source metadata and analyzes the collected statistics to discover content and structure, such as value distribution, patterns, and data types.
  • Data Quality. Measures the reliability of the data and enables data usage.
  • Data Observability. Identifies anomalies in the characteristics of the data.
Data Classification
Data classification is the process of identifying and organizing data into relevant categories based on the functional meaning of the data. Classifying data can help your organization manage risks, compliance, and data security.
Relationship Discovery
The relationship discovery capability identifies pairs of similar columns and relationships between tables within a catalog source.
Glossary Association
You can associate terms that are in the glossary with technical assets to provide user-friendly business names to technical assets. Glossary Association automatically associates glossary terms with technical assets or recommends glossary terms that you can manually associate with technical assets in Data Governance and Catalog.
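The behavior described for incremental metadata extraction can be sketched in a few lines of Python. This is only an illustration of the stated semantics (new and changed objects are picked up, deleted objects stay in the catalog). The `SourceObject` class, the `incremental_extract` function, and the timestamp comparison are assumptions made for the example and are not part of Metadata Command Center.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceObject:
    """Hypothetical snapshot of one source object (not a product type)."""
    path: str
    last_modified: int  # epoch seconds, assumed change indicator


def incremental_extract(catalog, source_objects, last_run):
    """Return a new catalog after a simulated incremental run.

    Only objects modified after `last_run` are (re)extracted. Objects that
    no longer exist in the source are intentionally left in the catalog.
    """
    updated = dict(catalog)
    for obj in source_objects:
        if obj.last_modified > last_run:
            updated[obj.path] = obj
    return updated


catalog = {
    "bucket/a.csv": SourceObject("bucket/a.csv", 100),
    "bucket/b.csv": SourceObject("bucket/b.csv", 100),
}
# Since the last run at t=150: a.csv changed, b.csv was deleted, c.csv is new.
source_now = [
    SourceObject("bucket/a.csv", 200),
    SourceObject("bucket/c.csv", 180),
]
result = incremental_extract(catalog, source_now, last_run=150)
print(sorted(result))  # ['bucket/a.csv', 'bucket/b.csv', 'bucket/c.csv']
```

In this sketch the deleted `b.csv` remains in the result, which mirrors the statement that an incremental run does not remove deleted objects from the catalog.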

Extraction and view process

To extract metadata from a source system, configure the catalog source and run the extraction job in Metadata Command Center. Then view the results in Data Governance and Catalog.
The following image shows the process to extract metadata from a source system:
The image shows the process of metadata extraction from a source system that begins with prerequisites verification, continues with the creation of the catalog source, and ends with viewing the extraction results.
After you verify prerequisites, perform the following tasks to extract metadata from Amazon S3:
  1. Register a catalog source. Create a catalog source object, select Amazon S3, and then select and test the connection.
  2. Configure the catalog source. Specify the runtime environment and configure parameters for metadata extraction. Optionally, add filters to include or exclude source system assets from metadata extraction. You can also configure other capabilities such as data profiling and quality, data classification, or glossary association.
  3. Optionally, associate stakeholders. Associate users with technical assets, giving the users permission to perform actions determined by their roles.
  4. Run or schedule the catalog source job.
After you run the catalog source job, you view the results in Data Governance and Catalog.
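To make the idea of include and exclude filters in the configuration step more concrete, the following Python sketch applies wildcard patterns to a list of object paths. It is a rough illustration only: the `apply_filters` function, the pattern syntax (shell-style wildcards from the standard `fnmatch` module), and the sample paths are assumptions, and the filter syntax that Metadata Command Center accepts may differ.

```python
from fnmatch import fnmatchcase


def apply_filters(paths, include=(), exclude=()):
    """Keep paths that match any include pattern (or every path when no
    include pattern is given) and that match no exclude pattern."""
    selected = []
    for path in paths:
        included = not include or any(fnmatchcase(path, p) for p in include)
        excluded = any(fnmatchcase(path, p) for p in exclude)
        if included and not excluded:
            selected.append(path)
    return selected


# Hypothetical object paths in a source system.
paths = [
    "sales-bucket/raw/orders.csv",
    "sales-bucket/raw/tmp/scratch.csv",
    "sales-bucket/archive/old.csv",
]
selected = apply_filters(
    paths,
    include=["sales-bucket/raw/*"],
    exclude=["*/tmp/*"],
)
print(selected)  # ['sales-bucket/raw/orders.csv']
```

The sketch applies exclusions after inclusions, so an asset that matches both kinds of pattern is left out. That ordering is a design choice of the example, not documented product behavior.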

About the Amazon S3 catalog source

You can use the Amazon S3 catalog source to extract metadata from an Amazon S3 source system. You can also extract metadata from a third-party storage provider such as Scality RING or MinIO.
Amazon Simple Storage Service (Amazon S3) is an object storage service offered by Amazon Web Services (AWS).

Extracted metadata

Metadata Command Center extracts files, folders, and buckets from an Amazon S3 source system.
By default, metadata is extracted from Amazon S3 storage, but you can also extract metadata from S3 Scality storage or MinIO.
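Amazon S3 stores objects under flat keys, and folders are usually inferred from the `/`-separated prefixes of those keys. The following Python sketch shows one way to derive bucket, folder, and file assets from a list of keys. The `derive_assets` function and the sample keys are hypothetical, and the sketch does not claim to reproduce how Metadata Command Center builds its asset hierarchy.

```python
def derive_assets(bucket, keys):
    """Split a flat list of object keys into folder and file assets."""
    folders, files = set(), set()
    for key in keys:
        parts = [p for p in key.split("/") if p]
        if not parts:
            continue  # skip empty keys
        if not key.endswith("/"):
            # A key that doesn't end with "/" is treated as a file.
            files.add("/".join(parts))
            parts = parts[:-1]
        # Every remaining prefix of the key is treated as a folder.
        for depth in range(1, len(parts) + 1):
            folders.add("/".join(parts[:depth]))
    return {"bucket": bucket, "folders": sorted(folders), "files": sorted(files)}


assets = derive_assets(
    "sales-bucket",
    ["sales/2024/orders.csv", "sales/2024/returns.csv", "logs/"],
)
print(assets["folders"])  # ['logs', 'sales', 'sales/2024']
print(assets["files"])    # ['sales/2024/orders.csv', 'sales/2024/returns.csv']
```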
You can extract workbooks, worksheets, and columns from Microsoft Excel files.
The following table lists the structures associated with the file types that you can extract metadata from:
File type    Partition structure
AVRO         Single partition, multiple partitions, schema merge
CSV          Single partition, multiple partitions, schema merge
JSON         Single partition
Parquet      Single partition, multiple partitions, schema merge
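The table above mentions schema merge without defining it. One common reading, assumed here, is that the columns found in the individual partition files are combined into a single schema. The Python sketch below illustrates that reading. The `merge_schemas` function, the conflict rule, and the sample schemas are assumptions for illustration, not the product's algorithm.

```python
# File types listed with "schema merge" in the table above.
SCHEMA_MERGE_FILE_TYPES = {"AVRO", "CSV", "Parquet"}


def merge_schemas(partition_schemas):
    """Union the columns of several partition files in first-seen order.

    Raises ValueError when the same column appears with different types,
    which is an assumed conflict rule for this sketch.
    """
    merged = {}
    for schema in partition_schemas:
        for column, dtype in schema.items():
            if column in merged and merged[column] != dtype:
                raise ValueError(
                    f"conflicting types for {column!r}: "
                    f"{merged[column]} vs {dtype}"
                )
            merged.setdefault(column, dtype)
    return merged


# Two hypothetical Parquet partitions; the second one adds a column.
partitions = [
    {"id": "INT64", "name": "STRING"},
    {"id": "INT64", "name": "STRING", "email": "STRING"},
]
merged = merge_schemas(partitions)
print(list(merged))  # ['id', 'name', 'email']
```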
You can extract metadata from TSV and TXT file types.
You can extract metadata from XML and XSD file formats as XML file objects. Metadata Command Center extracts only elements and attributes from XML and XSD files. If the size of the XML file exceeds 100 KB, Metadata Command Center extracts metadata from the initial 100 KB of the file. However, for XSD file types, Metadata Command Center extracts complete metadata.
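The 100 KB limit for XML files can be pictured with a short Python sketch that reads only the start of a document and collects element and attribute names. It uses the standard library's `xml.etree.ElementTree.XMLPullParser`, which can process input that ends before the document is complete. The `extract_xml_structure` function is hypothetical, and reading 100 KB as 102,400 bytes is an assumption, because the text does not say how the size is measured.

```python
import xml.etree.ElementTree as ET

MAX_XML_BYTES = 100 * 1024  # assumption: 100 KB means 102,400 bytes


def extract_xml_structure(data: bytes, limit: int = MAX_XML_BYTES):
    """Collect element and attribute names from the first `limit` bytes."""
    parser = ET.XMLPullParser(events=("start",))
    parser.feed(data[:limit])  # anything past the limit is never parsed
    elements, attributes = [], []
    try:
        for _event, elem in parser.read_events():
            if elem.tag not in elements:
                elements.append(elem.tag)
            for name in elem.attrib:
                if name not in attributes:
                    attributes.append(name)
    except ET.ParseError:
        # Malformed input: keep whatever was collected before the error.
        pass
    return elements, attributes


sample = b'<orders region="EU"><order id="1"><item sku="A"/></order></orders>'
elements, attributes = extract_xml_structure(sample)
print(elements)    # ['orders', 'order', 'item']
print(attributes)  # ['region', 'id', 'sku']

# With a small limit, elements beyond the cutoff are not reported.
truncated_elements, _ = extract_xml_structure(sample, limit=30)
```

Because the parser is never closed, input that stops in the middle of a tag does not raise an error. The incomplete tail is ignored.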
You can extract metadata from the following Microsoft Excel file types:

Data profiling for Amazon S3 objects

Configure data profiling to run profiles on the metadata extracted from an Amazon S3 source system. You can view the profiling statistics in Data Governance and Catalog.
You can run data profiles on the following objects:
The data profiling task runs profiles on the following data types for AVRO, CSV, JSON, and Parquet file formats:
File format    Data types
AVRO           INT, STRING
CSV            STRING
JSON           ARRAY, DOUBLE, INTEGER, OBJECT, STRING
Parquet        BOOLEAN, INT32, INT64, FLOAT, DOUBLE, DATE, DECIMAL, STRING
Sampling type
You can run the data profiling task on all rows for an Amazon S3 source system.
Note: To run a profile on an Avro or Parquet file, connect to an advanced cluster. For more information about advanced clusters, see Advanced Clusters in the Cloud Common Services help.
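As a quick reference, the supported data types and the advanced cluster requirement described above can be captured in a small lookup. The Python sketch below only restates the information from this section. The names `PROFILED_TYPES`, `can_profile`, and `needs_advanced_cluster` are hypothetical helpers and not part of any product API.

```python
# Data types that the profiling task covers, per file format (from this section).
PROFILED_TYPES = {
    "AVRO": {"INT", "STRING"},
    "CSV": {"STRING"},
    "JSON": {"ARRAY", "DOUBLE", "INTEGER", "OBJECT", "STRING"},
    "PARQUET": {
        "BOOLEAN", "INT32", "INT64", "FLOAT",
        "DOUBLE", "DATE", "DECIMAL", "STRING",
    },
}

# Formats whose profiles require a connection to an advanced cluster.
ADVANCED_CLUSTER_FORMATS = {"AVRO", "PARQUET"}


def can_profile(file_format: str, data_type: str) -> bool:
    """Return True if the data type is listed for the given file format."""
    return data_type.upper() in PROFILED_TYPES.get(file_format.upper(), set())


def needs_advanced_cluster(file_format: str) -> bool:
    """Return True if profiling this format requires an advanced cluster."""
    return file_format.upper() in ADVANCED_CLUSTER_FORMATS


print(can_profile("Parquet", "DECIMAL"))  # True
print(can_profile("CSV", "INT"))          # False
print(needs_advanced_cluster("Avro"))     # True
```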

Compatible connectors

Before you configure an Amazon S3 catalog source, you must connect to the Amazon S3 source system.
Use the Amazon S3 v2 connector to connect to the Amazon S3 source system.
For information about configuring a connection, see Connections.