Hadoop Distributed File System

Hadoop Distributed File System is a distributed file system that runs on commodity hardware and handles large data sets.

The Hadoop Distributed File System catalog source is available on the following clusters:

•Cloudera Data Platform (CDP)
•Amazon EMR
•Google Dataproc
•Azure HDInsight

Objects extracted

Hadoop Distributed File System supports the Azure HDInsight, Amazon EMR, Cloudera Data Platform, and Google Dataproc distributions.

Metadata Command Center extracts the following objects from a Hadoop Distributed File System source system:

•File System
•Folder
•File
•Flat File
•Hierarchical File
•Flat Field
•Hierarchical Field
•XML File
•XSD File
•Attribute
•Element

Supported file types

You can extract metadata from the following file types:

•AVRO
•CSV
•JSON
•Microsoft Excel files
  - Excel 97-2003 Workbook with XLS extension
  - Excel Workbook with XLSX extension
  - Excel Macro-Enabled Workbook with XLSM extension
•Parquet
•TSV
•TXT
•XML (XML and XSD files)
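
As an illustration of the list above, the following Python sketch maps file extensions to the supported file types. It is not part of Metadata Command Center: the names SUPPORTED_EXTENSIONS and classify_hdfs_path are hypothetical, and detecting the type from the extension alone is an assumption made for this example.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical lookup table built from the supported file types listed above.
SUPPORTED_EXTENSIONS = {
    ".avro": "AVRO",
    ".csv": "CSV",
    ".json": "JSON",
    ".xls": "Microsoft Excel",
    ".xlsx": "Microsoft Excel",
    ".xlsm": "Microsoft Excel",
    ".parquet": "Parquet",
    ".tsv": "TSV",
    ".txt": "TXT",
    ".xml": "XML",
    ".xsd": "XML",
}


def classify_hdfs_path(path: str) -> Optional[str]:
    """Return the supported file type for a path, or None if unsupported.

    Assumption: the type is inferred from the file extension only.
    """
    suffix = PurePosixPath(path).suffix.lower()
    return SUPPORTED_EXTENSIONS.get(suffix)
```

A check like this can help you estimate, before you run a catalog source job, which files in a folder are likely to be in scope for metadata extraction.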

The following table lists the partition structures supported for each file type:

File Type	Partition structure
AVRO	Single partition, multiple partitions, schema merge
CSV	Single partition, multiple partitions, schema merge
JSON	Single partition, multiple partitions, schema merge
Parquet	Single partition, multiple partitions, schema merge
XML	Single partition, multiple partitions, schema merge
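
The table above does not define schema merge. Assuming it means combining the fields found in several partition files into one schema, similar to schema merging for Parquet in Apache Spark, the idea can be sketched in Python as follows. The function merge_schemas and the field-name-to-type dictionaries are hypothetical, and raising an error on conflicting types is a choice made for this sketch, not documented product behavior.

```python
from typing import Dict, Iterable


def merge_schemas(partition_schemas: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Combine per-partition schemas (field name -> type) into one schema.

    Assumption: a merged schema is the union of all fields, and a field
    that appears in several partitions must have the same type in each.
    """
    merged: Dict[str, str] = {}
    for schema in partition_schemas:
        for field, dtype in schema.items():
            # Keep the first type seen for a field; later partitions must agree.
            existing = merged.setdefault(field, dtype)
            if existing != dtype:
                raise ValueError(
                    f"Field {field!r} has conflicting types: {existing} and {dtype}"
                )
    return merged
```

For example, merging {"id": "int", "name": "string"} with {"id": "int", "region": "string"} produces a schema that contains the id, name, and region fields.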

You can extract workbooks, worksheets, and columns from Microsoft Excel files.

You can extract metadata from XML and XSD files as XML file objects. Metadata Command Center extracts only elements and attributes from XML and XSD files. If an XML file is larger than 100 KB, Metadata Command Center extracts metadata from only the first 100 KB of the file. For XSD files, Metadata Command Center extracts the complete metadata regardless of file size.
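
To show what the 100 KB limit can mean in practice, the following Python sketch collects element and attribute names from only the first part of an XML document. It is an illustration under stated assumptions, not the Metadata Command Center implementation: the name scan_xml_prefix is hypothetical, 100 KB is taken as 102,400 bytes, and the file content is passed in as bytes instead of being read from HDFS.

```python
import xml.etree.ElementTree as ET
from typing import Set, Tuple

# Assumption: 100 KB is interpreted as 100 * 1024 bytes.
XML_SCAN_LIMIT = 100 * 1024


def scan_xml_prefix(
    data: bytes, limit: int = XML_SCAN_LIMIT
) -> Tuple[Set[str], Set[str]]:
    """Collect element and attribute names from the first `limit` bytes."""
    # A pull parser accepts incomplete documents as long as close() is not called.
    parser = ET.XMLPullParser(events=("start",))
    elements: Set[str] = set()
    attributes: Set[str] = set()
    try:
        parser.feed(data[:limit])
        for _event, elem in parser.read_events():
            elements.add(elem.tag)
            attributes.update(elem.attrib)
    except ET.ParseError:
        # Malformed input: keep whatever was collected before the error.
        pass
    return elements, attributes
```

Elements and attributes that first appear after the limit are not reported, which is why metadata for a large XML file can be incomplete while metadata for the XSD file that describes it is complete.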