HDFS

Use an HDFS resource to import metadata from CSV, XML, and JSON files.

Objects Extracted

The HDFS resource extracts metadata from files in an HDFS data source.

Permissions to Configure the Resource

Configure read permission on the HDFS data source for the user account that you use to access the data source.
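For example, you can verify and grant read access with the HDFS command-line client. The path /data/source and the permission mode are placeholders for your environment:

  # Confirm that the scanner user can list the source directory
  hdfs dfs -ls /data/source

  # Grant read and execute access recursively, if required
  hdfs dfs -chmod -R 755 /data/source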

Supported File Types

The HDFS resource enables you to extract metadata from structured, unstructured, and extended unstructured files.
The following structured file types are supported: JSON, Avro, Parquet, XML, text, and delimited files.
The following unstructured file types are supported: Microsoft Excel, Microsoft PowerPoint, Microsoft Word, web pages, compressed files, emails, and PDF files.
The following extended unstructured file types are supported: audio files, video files, image files, and ebooks.
Assign read and write permissions to the files from which you want to extract metadata.

Prerequisites

If the domain is SSL-enabled and the cluster is Kerberos-enabled, perform the following steps:
  1. Copy the krb5.conf, infa_truststore.jks, and keytab files from the cluster to the following locations on the Informatica domain:
  2. Specify the krb5.conf file path in the Informatica domain and the Informatica Cluster Service nodes.
  3. Add the Service Principal Name (SPN) and the keytab properties to the Data Integration Service properties.
  4. Run the kinit command on the Informatica domain with the required SPN and keytab file, as shown in the example after this list.
  5. Run the kinit command on the cluster with the required SPN and keytab file.
  6. Restart the Data Integration Service and the Catalog Service.
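For example, a kinit invocation on the Informatica domain host might look like the following. The keytab path and Service Principal Name are placeholders for your environment:

  # Obtain a Kerberos ticket using the HDFS service principal and keytab
  kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@EXAMPLE.COM

  # Verify that a ticket was granted
  klist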

Choose Hadoop Connection for Existing HDFS Resources

If you have an HDFS resource that uses the Hive engine as the run-time environment, edit the resource, choose Blaze, Spark, or Databricks as the Run on option, select a Hadoop connection, and run the resource to view the profile results.

Import the HDFS Resource

Perform the following steps if you use an HDFS resource:
  1. If you want to enable profiling for the HDFS resource, create a cluster configuration based on the type of cluster that you use.
  2. Edit the HDFS resource to assign the new HDFS connection in the Source Connection field.
  3. Select Blaze, Spark, or Databricks as the Run on option.
  4. Select the Hadoop connection name.
  5. If the Hive data source is located in a Kerberos-enabled cluster, perform the following steps:
    a. Use Informatica Administrator to configure the Hadoop Kerberos Service Principal Name and the Hadoop Kerberos Keytab in the properties for the Data Integration Service.
    b. Set the value of Data Integration Service Hadoop Distribution Directory to /data/
    c. Create the cluster configuration if it was not created prior to upgrade. If the cluster configuration was created prior to upgrade, refresh it using Informatica Administrator or the CLI.
    d. Use Informatica Administrator to append the following line to the JVM command line options for the Data Integration Service: -Djava.security.krb5.conf=/data. The JVM options are located in the Processes section of the Data Integration Service.

Basic Information

The General tab includes the following basic information about the resource:
Name
The name of the resource.
Description
The description of the resource.
Resource type
The type of the resource.
Execute On
You can choose to execute on the default catalog server or offline.

Resource Connection Properties

The General tab includes the following properties:
Cluster Configuration Details
Select one of the following options to specify details to configure the resource:
  - Provide Configuration Details
  - Load from Configuration Archive File
Storage Type
Select one of the following options to specify the type of storage from which you want to extract metadata:
  - DFS. Distributed File System.
  - WASB. Windows Azure Blob Storage. Configure the following options if you select WASB:
    - Azure Storage Account URI. The fully qualified URI to access data stored in WASB. See the example URI formats after this list.
    - Azure Storage Account Name. Name of the storage account.
    - Azure Storage Account Key. Key to access the storage account.
  - ABFS. Azure Blob File System. Configure the following options if you select ABFS:
    - Azure Storage Account URI. The fully qualified URI to access data stored in ABFS.
    - Azure Storage Account Name. Name of the storage account.
    - Azure Storage Account Key. Key to access the storage account.
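For reference, the fully qualified Azure Storage Account URI typically follows these formats, where the container, account, and path values are placeholders:

  wasb://<container>@<account>.blob.core.windows.net/<path>
  abfs://<container>@<account>.dfs.core.windows.net/<path>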
Name Node URI 1
URI to the active HDFS NameNode. The active HDFS NameNode manages all the client operations in the cluster.
HA Cluster
Select Yes if the cluster is configured for high availability, and configure the following properties:
  - Name Node URI 2. URI to the secondary HDFS NameNode. The secondary HDFS NameNode stores modifications to HDFS as a log file appended to a native file system file.
  - HDFS Service Name. The service name configured for HDFS.
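For example, the active and secondary NameNode URIs might look like the following. The host names are placeholders, and 8020 is a common default port for the NameNode RPC service:

  hdfs://nn1.example.com:8020
  hdfs://nn2.example.com:8020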
Distribution Type
Select one of the following Hadoop distribution types for the HDFS resource:
  - Cloudera
  - Hortonworks
  - IBM BigInsights
  - Azure HDInsight
  - Amazon EMR
  - MapR FS. Complete the prerequisites before you select MapR FS as a distribution type. See the Prerequisites for MapR FS Distribution Type section for more details.
Note: This property is used if you select the Load from Configuration Archive File option to configure the resource.
MAPR Home
Specify the MapR client installation path.
Note: This property is used if you select MapR FS as the Hadoop distribution type.
User Name/User Principal
User name to connect to HDFS. Specify the Kerberos Principal if the cluster is enabled for Kerberos.
Source Directory
The source location from which metadata must be extracted.
Configuration Archive file
A ZIP file that contains the resource configuration properties in the following XML files:
  - core-site.xml
  - hdfs-site.xml
  - mapred-site.xml
  - yarn-site.xml
Note: This property is used if you select the Load from Configuration Archive File option to configure the resource.
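For example, you can package the archive with the zip utility on a cluster node. The configuration directory shown is a typical default, and the archive name is a placeholder:

  # Package the Hadoop client configuration files into a single archive
  cd /etc/hadoop/conf
  zip /tmp/hdfs-resource-config.zip core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml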
HDFS Transparent Encryption
Select Yes if transparent encryption is enabled for HDFS. Provide the fully qualified URI to the Key Management Server key provider in the Key Management Server Provider URI box.
If you choose Yes, you must also import the SSL certificate into the infa_truststore.jks truststore file located in <INFA_HOME>/services/shared/security on the Informatica domain.
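For example, you can import the certificate with the Java keytool utility. The certificate file name and alias are placeholders:

  # Import the KMS SSL certificate into the Informatica domain truststore
  keytool -importcert -alias kms-ssl -file kms.crt \
    -keystore <INFA_HOME>/services/shared/security/infa_truststore.jks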
Kerberos Cluster
Select Yes if the cluster is enabled for Kerberos. If the cluster is enabled for Kerberos, provide the following details:
  - Hadoop RPC Protection. Select one of the following options based on the Remote Procedure Call (RPC) protection value configured for the cluster. You can check the configured value as shown in the example after this list:
    - authentication
    - integrity
    - privacy
    Default is authentication.
  - HDFS Service Principal. The service principal name of the HDFS service.
  - Keytab File. The path to the Kerberos Principal keytab file. Make sure that the keytab file is present at the specified location on the Informatica domain host and on the cluster hosts of the Catalog Service.
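If you are not sure which RPC protection value the cluster uses, you can check the hadoop.rpc.protection property in core-site.xml on a cluster node. The configuration path shown is a typical default:

  # Print the configured RPC protection value and the line that follows it
  grep -A 1 'hadoop.rpc.protection' /etc/hadoop/conf/core-site.xml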
For more information about configuring Azure storage access for the Azure HDInsight distribution type, see the Data Engineering Integration Guide.
The Metadata Load Settings tab includes the following properties:
Enable Source Metadata
Extracts metadata from the data source.
File Types
Select any or all of the following file types from which you want to extract metadata:
  - All. Use this option to extract metadata from all file types.
  - Select. Use this option to extract metadata from specific file types. Perform the following steps to specify the file types:
    1. Click Select. The Select Specific File Types dialog box appears.
    2. Select the required file types from the following options:
      - Extended unstructured formats. Use this option to extract metadata from file types such as audio files, video files, image files, and ebooks.
      - Structured file types. Use this option to extract metadata from file types such as JSON, Avro, Parquet, XML, text, and delimited files.
      - Unstructured file types. Use this option to extract metadata from file types such as Microsoft Excel, Microsoft PowerPoint, Microsoft Word, web pages, compressed files, emails, and PDF.
    3. Click Select.
    Note: You can select the Specific File Types option in the dialog box to select files under all the categories.
Other File Types
Extracts basic file metadata such as file size, path, and timestamp from file types that are not listed in the File Types property.
Treat Files Without Extension As
Select one of the following options to identify files without an extension:
  - None
  - Avro
  - Parquet
Enter File Delimiter
Specify the file delimiter if the file from which you extract metadata uses a delimiter other than the following list of delimiters:
  - Comma (,)
  - Horizontal tab (\t)
  - Semicolon (;)
  - Colon (:)
  - Pipe symbol (|)
Make sure that you enclose each delimiter in single quotes. For example, '$'. Use a comma to separate multiple delimiters. For example, '$','%','&'.
First Level Directory
Specifies that all the directories must be selected. If you want to select specific directories, use the Select Directory option. This option is disabled if you selected the Include Subdirectories option on the General tab.
Include Subdirectory
Type the required directories in the text box or click Select... to choose the required directories. This option is disabled if you selected the Include Subdirectories option on the General tab or the Select all Directories option listed above.
Non Strict Mode
Detects partitions in Parquet files when compatible schemas are identified in the files.
Case Sensitive
Identifies whether the resource is configured for case sensitivity. Select one of the following values:
  - True. Select this check box to specify that the resource is case sensitive.
  - False. Clear this check box to specify that the resource is case insensitive.
The default value is True.
Memory
The memory required to run the scanner job. Select one of the following values based on the size of the imported data set:
  - Low
  - Medium
  - High
Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Documentation Portal.
Custom Options
JVM parameters that you can set to configure the scanner container. Use the following arguments to configure the parameters:
  - -Dscannerloglevel=<DEBUG/INFO/ERROR>. Changes the log level of the scanner to DEBUG, ERROR, or INFO. Default is INFO.
  - -Dscanner.container.core=<number of cores>. Increases the number of cores for the scanner container. The value must be a number.
  - -Dscanner.yarn.app.environment=<key=value>. Key-value pair that you need to set in the YARN environment. Use a comma to separate multiple key-value pairs.
  - -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0>. Increases the scanner container memory when pmem is enabled. Default is 1.
  - -Djava.security.auth.login.config={MAPR_HOME}/conf/mapr.login.conf. Extracts metadata from a MapR Hadoop distribution that accesses data from an HDFS data source. The MapR client installed on the Informatica domain and the cluster must be version 6.1.0, and the MapR Hadoop distribution version must be MEP 6.x.
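For example, to raise the scanner log level and allocate more cores, you might combine the arguments as follows. The values shown are illustrative placeholders:

  -Dscannerloglevel=DEBUG -Dscanner.container.core=4 -Dscanner.yarn.app.environment=JAVA_HOME=/usr/java/default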
Track Data Source Changes
View metadata source change notifications in Enterprise Data Catalog.
You can enable data discovery for an HDFS resource. For more information about enabling data discovery, see the Enable Data Discovery topic.
You can enable composite data domain discovery for an HDFS resource. For more information about enabling composite data domain discovery, see the Composite Data Domain Discovery topic.

Profile Avro files

You can extract metadata, discover Avro partitions, and run profiles on Avro files with multiple-level hierarchy using an HDFS resource on the Spark engine. When you run profiles on Avro files, the data types of assets appear in the profiling results of the Enterprise Data Catalog tool.
The following asset data types appear in the profiling results:
When you select Non Strict Mode on the Metadata Load Settings tab of the resource to detect partitions in Avro files, partition discovery still happens in strict mode.
If a partition folder contains more than 10 subfolders, and some files or subfolders contain more than 10 files, some folders are not detected as potential partitions. To avoid this issue, use the -DmaxChildPathsToValidate JVM option to override the default value and increase the number of folders to be validated.
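For example, you might add the following to the Custom Options property of the resource. The value 50 is an illustrative placeholder, not a documented default:

  -DmaxChildPathsToValidate=50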
You cannot profile Avro files that contain any of the following data types:
Note: An Avro file that includes any of the above data types also fails during profiling.

Prerequisites for MapR FS Distribution Type

Complete the following prerequisites before you select MapR FS as a distribution type: