HDFS
Use an HDFS resource to import metadata from files stored in HDFS, such as CSV, XML, and JSON files.
Objects Extracted
The HDFS resource extracts metadata from files in an HDFS data source.
Permissions to Configure the Resource
Configure read permission on the HDFS data source for the user account that you use to access the data source.
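For example, you can verify and grant read access from the HDFS command line. The following is a minimal sketch, assuming the source directory is /data/landing and that you run the commands as an HDFS superuser; the path is illustrative:

```
# Confirm that the source directory exists and list its permissions
hdfs dfs -ls /data/landing

# Grant read and traverse access to other users, including the scanner account
hdfs dfs -chmod -R o+rx /data/landing
```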
Supported File Types
The HDFS resource enables you to extract metadata from structured and unstructured files.
The structured files supported are:
- Avro files
- Delimited files
- Text files
- JSON files
- Parquet files
- XML files
The unstructured files supported are:
- Apple files
- Compressed files
- Email files
- Microsoft Excel files
- Microsoft PowerPoint files
- Microsoft Word files
- OpenOffice files
- PDF files
- Web pages
- Extended unstructured formats
Assign read and write permissions to the files to extract metadata.
Resource Connection Properties
The General tab includes the following properties:
- Cluster Configuration Details. Select one of the following options to specify the details to configure the resource:
  - Provide Configuration Details
  - Load from Configuration Archive File
- Storage Type. Select one of the following options to specify the type of storage from which you want to extract metadata:
  - DFS. Distributed File System.
  - WASB. Windows Azure Blob Storage. Configure the following options if you select WASB:
    - Azure Storage Account URI. The fully qualified URI to access data stored in WASB.
    - Azure Storage Account Name. Name of the storage account.
    - Azure Storage Account Key. Key to access the storage account.
  - ABFS. Azure Blob File System. Configure the following options if you select ABFS:
    - Azure Storage Account URI. The fully qualified URI to access data stored in ABFS.
    - Azure Storage Account Name. Name of the storage account.
    - Azure Storage Account Key. Key to access the storage account.
- Name Node URI 1. URI to the active HDFS NameNode. The active HDFS NameNode manages all the client operations in the cluster.
- HA Cluster. Select Yes if the cluster is configured for high availability, and configure the following properties:
  - Name Node URI 2. URI to the secondary HDFS NameNode. The secondary HDFS NameNode stores modifications to HDFS as a log file appended to a native file system file.
  - HDFS Service Name. The service name configured for HDFS.
- Distribution Type. Select one of the following Hadoop distribution types for the HDFS resource:
  - Cloudera
  - Hortonworks
  - IBM BigInsights
  - Azure HDInsight
  - Amazon EMR
  - MapR FS. Complete the prerequisites before you select MapR FS as the distribution type. See the Prerequisites for MapR FS Distribution Type section for more details.
  Note: This property applies if you select the Load from Configuration Archive File option to configure the resource.
- MAPR Home. The MapR client installation path. Note: This property applies if you selected MapR FS as the Hadoop distribution type.
- User Name/User Principal. User name to connect to HDFS. Specify the Kerberos principal if the cluster is enabled for Kerberos.
- Source Directory. The source location from which metadata must be extracted.
- Configuration Archive File. A ZIP file that contains the resource configuration properties in the following XML files:
  - core-site.xml
  - hdfs-site.xml
  - mapred-site.xml
  - yarn-site.xml
  Note: This property applies if you select the Load from Configuration Archive File option to configure the resource.
- HDFS Transparent Encryption. Select Yes if transparent encryption is enabled for HDFS. Provide the fully qualified URI to the Key Management Server key provider in the Key Management Server Provider URI box. If you select Yes, you must import the SSL certificate into the Informatica domain infa_truststore.jks truststore file located in the <INFA_HOME>/services/shared/security directory.
- Kerberos Cluster. Select Yes if the cluster is enabled for Kerberos, and provide the Kerberos configuration details.
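The following sketch shows illustrative values for some of these properties. The host, account, and container names are assumptions, not defaults; substitute the values for your cluster:

```
# Name Node URI 1: the active NameNode (8020 is a common RPC port)
#   hdfs://nn1.example.com:8020
# Azure Storage Account URI for WASB:
#   wasb://mycontainer@myaccount.blob.core.windows.net
# Azure Storage Account URI for ABFS:
#   abfs://mycontainer@myaccount.dfs.core.windows.net

# Configuration Archive File: bundle the client configuration files into a ZIP
zip hdfs-config.zip core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml
```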
For more information about configuring Azure storage access for the Azure HDInsight distribution type, see the Data Engineering Integration Guide.
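If you enable HDFS Transparent Encryption, the truststore import might look like the following sketch. It assumes that the Key Management Server SSL certificate is saved locally as kms.crt; the certificate file name and alias are illustrative:

```
# Import the KMS SSL certificate into the Informatica domain truststore
keytool -importcert -alias kms_ssl -file kms.crt \
  -keystore <INFA_HOME>/services/shared/security/infa_truststore.jks
```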
The Metadata Load Settings tab includes the following properties:
- Enable Source Metadata. Extracts metadata from the data source.
- File Types. Select one of the following options to specify the file types from which you want to extract metadata:
  - All. Extract metadata from all file types.
  - Select. Extract metadata from specific file types. Perform the following steps to specify the file types:
    1. Click Select. The Select Specific File Types dialog box appears.
    2. Select the required file types from the following options:
       - Extended unstructured formats. Extracts metadata from file types such as audio files, video files, image files, and ebooks.
       - Structured file types. Extracts metadata from file types such as JSON, Avro, Parquet, XML, text, and delimited files.
       - Unstructured file types. Extracts metadata from file types such as Microsoft Excel, Microsoft PowerPoint, Microsoft Word, web pages, compressed files, emails, and PDF.
    3. Click Select.
  Note: You can select the Specific File Types option in the dialog box to select files under all the categories.
- Other File Types. Extracts basic file metadata, such as file size, path, and timestamp, from file types that are not listed in the File Types property.
- Treat Files Without Extension As. Select an option to specify how files without an extension must be identified.
- Enter File Delimiter. Specify the file delimiter if the file from which you extract metadata uses a delimiter other than the following delimiters:
  - Comma (,)
  - Horizontal tab (\t)
  - Semicolon (;)
  - Colon (:)
  - Pipe symbol (|)
  Enclose each delimiter in single quotes. For example, '$'. Use a comma to separate multiple delimiters. For example, '$','%','&'.
- First Level Directory. Specifies that all the directories must be selected. If you want to select specific directories, use the Select Directory option. This option is disabled if you selected the Include Subdirectories option on the General tab.
- Include Subdirectory. Type the required directories in the text box, or click Select... to choose the required directories. This option is disabled if you selected the Include Subdirectories option on the General tab or the Select all Directories option listed above.
- Case Sensitive. Specifies whether the resource is configured as case sensitive. Select one of the following values:
  - True. Select this check box to specify that the resource is case sensitive.
  - False. Clear this check box to specify that the resource is case insensitive.
  Default is True.
- Memory. The memory required to run the scanner job. Select a value based on the size of the imported data set. Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Documentation Portal.
- JVM Options. JVM parameters that you can set to configure the scanner container. Use the following arguments to configure the parameters:
  - -Dscannerloglevel=<DEBUG/INFO/ERROR>. Changes the log level of the scanner to DEBUG, ERROR, or INFO. Default is INFO.
  - -Dscanner.container.core=<number of cores>. Increases the number of cores for the scanner container. The value must be a number.
  - -Dscanner.yarn.app.environment=<key=value>. Key-value pairs to set in the YARN environment. Use a comma to separate multiple key-value pairs.
  - -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0>. Increases the scanner container memory when pmem is enabled. Default is 1.
  - -Djava.security.auth.login.config={MAPR_HOME}/conf/mapr.login.conf. Required to extract metadata from a MapR Hadoop distribution that accesses data from an HDFS data source. The MapR client installed on the Informatica domain and the cluster must be version 6.1.0, and the MapR Hadoop distribution version must be MEP 6.x.
- Track Data Source Changes. View metadata source change notifications in Enterprise Data Catalog.
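For example, a JVM Options value that raises the scanner log level and allocates more cores to the scanner container might look like the following line; the core count is illustrative:

```
-Dscannerloglevel=DEBUG -Dscanner.container.core=4
```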
You can enable data discovery for an HDFS resource. For more information about enabling data discovery, see the Enable Data Discovery topic.
You can enable composite data domain discovery for an HDFS resource. For more information about enabling composite data domain discovery, see the Composite Data Domain Discovery topic.
Running an HDFS Resource on a Kerberos-enabled Cluster
If you want to run an HDFS resource scanner on a Kerberos-enabled cluster, perform the following steps:
1. Copy the krb5.conf file to the following location: <Install Directory>/data/ldmbcmev/Informatica/LDM20_309/source/services/shared/security/krb5.conf
2. Copy the krb5.conf file to the /etc directory on all the clusters where the Catalog Service is running.
3. Copy the keytab file to the /opt directory in the following locations:
   - A common location for all clusters where the Catalog Service is running.
   - The domain machine.
   - The Kerberos cluster machine.
4. Add the machine details of the KDC host to the /etc/hosts file on the domain machine and the cluster machine where the Catalog Service is running.
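The following is a minimal command-line sketch of these steps. The file names, host names, IP address, and security directory path are illustrative assumptions; use the locations that apply to your installation:

```
# Steps 1 and 2: copy krb5.conf to the Informatica security directory
# and to /etc on each cluster node that runs the Catalog Service
cp krb5.conf <INFA_HOME>/services/shared/security/krb5.conf
scp krb5.conf catalog-node1:/etc/krb5.conf

# Step 3: copy the keytab file to the /opt directory on the required machines
scp edc.keytab catalog-node1:/opt/

# Step 4: add the KDC host entry to /etc/hosts on the domain and cluster machines
echo "10.0.0.5  kdc.example.com" | sudo tee -a /etc/hosts
```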
Prerequisites for MapR FS Distribution Type
Complete the following prerequisites before you select MapR FS as a distribution type:
- Verify that the MapR client is installed on the Informatica Cluster Service nodes.
- Verify that the MapR client is installed in the same location across the Informatica Cluster Service nodes.
- Configure the MapR client to generate the mapr-clusters.conf file and the other files that are required to connect to the MapR cluster. Run the configure.sh command shown below to configure the MapR client.
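The command below uses the cluster name and node values from this example; replace them with your own cluster details:

```
# -N: cluster name, -c: configure a client, -C: CLDB node and port, -HS: HistoryServer node
/opt/mapr/server/configure.sh -N my.cluster.com -c -C mynode01:7222 -HS mynode02
```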