You can use Metadata Command Center to extract metadata from a source system.
A source system is any system that contains data or metadata. For example, Apache Hive is a source system from which you can extract metadata through an Apache Hive catalog source with Metadata Command Center. A catalog source is an object that represents and contains metadata from the source system.
Before you extract metadata from a source system, create and register a catalog source that represents the source system. Then configure capabilities for the catalog source. A capability is a task that Metadata Command Center can perform, such as metadata extraction, data profiling, data classification, or glossary association.
When Metadata Command Center extracts metadata, Data Governance and Catalog displays the extracted metadata and its attributes as technical assets. You can then perform tasks such as analyzing the assets, viewing lineage, and creating links between those assets and their business context.
The following table describes the capabilities of the catalog source:
Capability: Data Profiling and Quality
Description:
- Data Profiling. Assesses source metadata and analyzes the collected statistics to discover content and structure, such as value distribution, patterns, and data types.
- Data Quality. Measures the reliability of the data and enables data usage.
- Data Observability. Identifies anomalies in the characteristics of the data.
Capability: Data Classification
Description: Data classification is the process of identifying and organizing data into relevant categories based on the functional meaning of the data. Classifying data can help your organization manage risks, compliance, and data security.
Capability: Glossary Association
Description: You can associate terms that are in the glossary with technical assets to provide user-friendly business names to technical assets. Glossary Association automatically associates glossary terms with technical assets or recommends glossary terms that you can manually associate with technical assets in Data Governance and Catalog.
Extraction and view process
To extract metadata from a source system, configure the catalog source and run the extraction job in Metadata Command Center. Then view the results in Data Governance and Catalog.
The following image shows the process to extract metadata from a source system:
After you verify prerequisites, perform the following tasks to extract metadata from Apache Hive:
1. Register a catalog source. Create a catalog source object, select Apache Hive, and then select and test the connection.
2. Configure the catalog source. Specify the runtime environment and configure parameters for metadata extraction. Optionally, add filters to include or exclude source system assets from metadata extraction. You can also configure other capabilities such as data profiling and quality, data classification, or glossary association.
3. Optionally, associate stakeholders. Associate users with technical assets, giving the users permission to perform actions determined by their roles.
4. Run or schedule the catalog source job.
5. Optionally, if the catalog source job generates referenced asset objects, you can assign a connection to referenced source system assets.
You can view lineage with object references without assigning a connection. After you assign a connection, you can view the referenced objects.
After you run the catalog source job, you can view the results in Data Governance and Catalog.
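The task order above can be summarized as a simple checklist. The following Python sketch is illustrative only: the `Step` class, the `EXTRACTION_STEPS` list, and the `required_steps` helper are hypothetical names made up for this example and are not part of any Metadata Command Center API. The sketch only records which tasks are required and which are optional.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One task in the extraction workflow (hypothetical model, not a product API)."""
    number: int
    name: str
    optional: bool = False


# Tasks in the order listed above; the optional flags mirror the word "Optionally".
EXTRACTION_STEPS = [
    Step(1, "Register a catalog source"),
    Step(2, "Configure the catalog source"),
    Step(3, "Associate stakeholders", optional=True),
    Step(4, "Run or schedule the catalog source job"),
    Step(5, "Assign a connection to referenced source system assets", optional=True),
]


def required_steps(steps):
    """Return the names of the tasks that cannot be skipped, in numeric order."""
    ordered = sorted(steps, key=lambda step: step.number)
    return [step.name for step in ordered if not step.optional]


if __name__ == "__main__":
    for name in required_steps(EXTRACTION_STEPS):
        print(name)
```

Running the script prints the three required tasks in order, which can serve as a minimal pre-run checklist.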
About the Apache Hive catalog source
You can use the Apache Hive catalog source to extract metadata from an Apache Hive source system.
Apache Hive is data warehouse software built on Apache Hadoop that you can use to query large datasets stored in various databases and file systems.
Extracted metadata
Metadata Command Center extracts the following metadata from the Apache Hive source system:
• Database
• Schema
• Table
• Column
• External Table
• External Column
• View
• View Column
• Materialized View
Note: Objects of the Materialized View type appear as View in Data Governance and Catalog.
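If you reconcile extracted objects against what appears in Data Governance and Catalog, the note above amounts to a small type mapping. The following Python sketch is a hypothetical helper, not a product API. It assumes that every listed type other than Materialized View appears under its own name, which the list above implies but does not state.

```python
# Object types that Metadata Command Center extracts from Apache Hive,
# as listed above.
EXTRACTED_TYPES = (
    "Database",
    "Schema",
    "Table",
    "Column",
    "External Table",
    "External Column",
    "View",
    "View Column",
    "Materialized View",
)


def display_type(extracted_type: str) -> str:
    """Return the type name expected in Data Governance and Catalog.

    Assumption: only Materialized View is renamed; all other types keep their names.
    """
    if extracted_type not in EXTRACTED_TYPES:
        raise ValueError(f"Not an extracted Apache Hive object type: {extracted_type!r}")
    if extracted_type == "Materialized View":
        return "View"
    return extracted_type
```

For example, `display_type("Materialized View")` returns `"View"`, while `display_type("External Table")` returns the name unchanged.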
Data profiling for Apache Hive objects
Configure data profiling to run profiles on the metadata extracted from an Apache Hive source system. You can view the profiling statistics in Data Governance and Catalog.
You can run data profiles on the following objects:
• Views
• Tables
• External tables created in the following file formats:
  - AVRO
  - Parquet
  - Delimited
  - JSON
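The eligibility rules above can be expressed as a short check, for example when you plan which objects to include in a profiling run. This is a minimal Python sketch under stated assumptions: the function name, the object-kind strings, and the case-insensitive comparison are choices made for this example and do not come from Metadata Command Center.

```python
from typing import Optional

# File formats listed above for external tables, lowercased for comparison.
PROFILABLE_EXTERNAL_FORMATS = frozenset({"avro", "parquet", "delimited", "json"})


def can_profile(object_kind: str, file_format: Optional[str] = None) -> bool:
    """Return True if the object falls in a category listed as profilable above.

    object_kind: "view", "table", or "external table" (hypothetical labels).
    file_format: only consulted for external tables.
    """
    kind = object_kind.strip().lower()
    if kind in ("view", "table"):
        return True
    if kind == "external table":
        return (
            file_format is not None
            and file_format.strip().lower() in PROFILABLE_EXTERNAL_FORMATS
        )
    # Any other object kind is not in the list above.
    return False
```

An external table without a known file format is treated as not profilable here, which is a conservative choice for the sketch rather than documented product behavior.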
The data profiling task runs profiles on the following data types:
• bigint
• boolean
• char
• date
• decimal
• double
• float
• integer
• smallint
• string
• timestamp
• tinyint
• varchar
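To see in advance which columns fall within the data types listed above, you can compare each column's declared type against the list. The following Python sketch is a hypothetical helper, not part of the product. It assumes that type parameters such as `(10,2)` can be ignored and that `int` should be treated as `integer` because Hive accepts both spellings. Neither assumption is stated in this document.

```python
import re

# Data types listed above as supported by the data profiling task.
PROFILABLE_TYPES = frozenset({
    "bigint", "boolean", "char", "date", "decimal", "double", "float",
    "integer", "smallint", "string", "timestamp", "tinyint", "varchar",
})

# Assumption: Hive accepts INT as a synonym for INTEGER, so map it to the listed name.
TYPE_ALIASES = {"int": "integer"}


def base_type(hive_type: str) -> str:
    """Strip parameters such as (10,2) or <string> and normalize case."""
    name = re.split(r"[(<]", hive_type.strip(), maxsplit=1)[0].strip().lower()
    return TYPE_ALIASES.get(name, name)


def is_profilable(hive_type: str) -> bool:
    """Return True if the column type appears in the list above."""
    return base_type(hive_type) in PROFILABLE_TYPES
```

With this sketch, `is_profilable("DECIMAL(10,2)")` and `is_profilable("varchar(50)")` return `True`, while complex types such as `array<string>` return `False` because they are not in the list.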
Compatible connectors
Before you configure an Apache Hive catalog source, you must connect to the Apache Hive source system.
Use Hive Connector to connect to the Apache Hive source system.
For information about configuring a connection, see Connections.