Microsoft Azure Data Lake Storage Gen2

Microsoft Azure Data Lake is a scalable data storage and analytics service that is hosted on Azure. Configure an Azure Data Lake Store (ADLS) catalog source to extract metadata from the data lake store.

Objects extracted

Metadata Command Center extracts the following metadata from an Azure Data Lake Storage Gen2 source system:

File types

You can extract metadata from the following file types:

The following table lists the structures associated with the file types that you can extract metadata from:

File Type	Partition structure
Avro	Single partition, multiple partitions, schema merge
CSV	Single partition, multiple partitions, schema merge
JSON	Single partition, multiple partitions, schema merge
Parquet	Single partition, multiple partitions, schema merge
XML	Single partition, multiple partitions, schema merge

You can extract workbooks, worksheets, and columns from Microsoft Excel files.

You can extract metadata from XML and XSD file formats as XML file objects. Metadata Command Center extracts only elements and attributes from XML and XSD files. If an XML file is larger than 100 KB, Metadata Command Center extracts metadata only from the first 100 KB of the file. For XSD files, Metadata Command Center extracts the complete metadata regardless of file size.
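The 100 KB limit for XML files can be illustrated with a short Python sketch. This is not how Metadata Command Center is implemented. It only shows one way to collect element and attribute names from the leading part of a file. The function name extract_xml_structure, the constant MAX_XML_BYTES, and the reading of 100 KB as 100 * 1024 bytes are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: "100 KB" is read here as 100 * 1024 bytes.
MAX_XML_BYTES = 100 * 1024


def extract_xml_structure(data: bytes, limit: int = MAX_XML_BYTES):
    """Return (elements, attributes) found in the first `limit` bytes of `data`.

    Hypothetical helper: element names are listed in order of first
    appearance, attributes as "element@attribute".
    """
    # A pull parser accepts incomplete input, so a document cut off at
    # the limit still yields every start tag that was fully read.
    parser = ET.XMLPullParser(events=("start",))
    elements, attributes = [], []
    try:
        parser.feed(data[:limit])
        for _event, elem in parser.read_events():
            if elem.tag not in elements:
                elements.append(elem.tag)
            for name in elem.attrib:
                key = f"{elem.tag}@{name}"
                if key not in attributes:
                    attributes.append(key)
    except ET.ParseError:
        # Malformed input: keep whatever was collected before the error.
        pass
    return elements, attributes


sample = b'<catalog region="eu"><book id="1"><title>A</title></book></catalog>'
elements, attributes = extract_xml_structure(sample)
```

Elements and attributes that start after the limit are not reported, which mirrors the behavior described for XML files larger than 100 KB.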

Data profiling for Microsoft Azure Data Lake Storage Gen2 objects

Configure data profiling to run profiles on the metadata extracted from a Microsoft Azure Data Lake Storage Gen2 source system. You can run data profiles on the following Microsoft Azure Data Lake Storage Gen2 objects:

You can view the profiling statistics in Data Governance and Catalog. The data profiling task runs profiles on the following data types for AVRO, CSV, JSON, and Parquet file formats:

File format	Supported data types
AVRO	INT, STRING
CSV	STRING
JSON	ARRAY, BOOLEAN, NUMBER (double), NUMBER (float), NUMBER (int), NUMBER (long), OBJECT, STRING
Parquet	BOOLEAN, INT32, INT64, FLOAT, DOUBLE, DATE, DECIMAL, STRING
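As a quick reference, the table above can be expressed as a lookup. The following Python sketch copies the table values as listed. The names PROFILED_TYPES and is_profiled are hypothetical and are not part of any Informatica API.

```python
# Data types that the profiling task covers, copied from the table above.
PROFILED_TYPES = {
    "AVRO": {"INT", "STRING"},
    "CSV": {"STRING"},
    "JSON": {
        "ARRAY", "BOOLEAN", "NUMBER (double)", "NUMBER (float)",
        "NUMBER (int)", "NUMBER (long)", "OBJECT", "STRING",
    },
    "PARQUET": {
        "BOOLEAN", "INT32", "INT64", "FLOAT", "DOUBLE",
        "DATE", "DECIMAL", "STRING",
    },
}


def is_profiled(file_format: str, data_type: str) -> bool:
    """Return True if the table lists `data_type` for `file_format`.

    Matching ignores case. Formats that the table does not list
    return False.
    """
    supported = PROFILED_TYPES.get(file_format.strip().upper(), set())
    wanted = data_type.strip().casefold()
    return any(wanted == listed.casefold() for listed in supported)
```

For example, is_profiled("Parquet", "DECIMAL") is true, while is_profiled("CSV", "INT") is false because the table lists only STRING for CSV files.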

Note:

To run a profile on Avro or Parquet files, use a Secure Agent that is configured with an advanced cluster. The same agent can also process delimited files. If the agent does not have advanced cluster capabilities, the data profiling task runs profiles only on delimited files. For more information about advanced clusters, see the Advanced Clusters help.
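The rule in this note can be sketched as a small decision function in Python. The function name can_profile is hypothetical, and the sketch assumes that CSV is the delimited format the note refers to. The note does not say how other formats, such as JSON, behave on an agent with an advanced cluster, so the sketch returns None in that case instead of guessing.

```python
from typing import Optional

# Formats that the note says need an advanced cluster.
CLUSTER_FORMATS = {"AVRO", "PARQUET"}
# Assumption: "delimited files" refers to CSV files.
DELIMITED_FORMATS = {"CSV"}


def can_profile(file_format: str, has_advanced_cluster: bool) -> Optional[bool]:
    """Apply the rule from the note to one file format.

    Returns True or False when the note decides the case, and None
    when the note does not cover it.
    """
    fmt = file_format.strip().upper()
    if fmt in DELIMITED_FORMATS:
        # Delimited files are processed with or without an advanced cluster.
        return True
    if not has_advanced_cluster:
        # Without an advanced cluster, only delimited files are profiled.
        return False
    if fmt in CLUSTER_FORMATS:
        return True
    # Not stated in the note for agents with an advanced cluster.
    return None
```

Returning None keeps the sketch honest about what the note leaves open. A caller can treat it as "check the product documentation".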

Data classification for Microsoft Azure Data Lake Storage Gen2 objects

Configure data classification for Microsoft Azure Data Lake Storage Gen2 catalog sources to classify and organize data in your organization. You can view the data classification results in Data Governance and Catalog.

Data lineage

Lineage data is available for Microsoft Azure Data Lake Storage Gen2 assets that connect to the following source systems: