HDFS
Use an HDFS resource to import metadata from files stored in HDFS, such as CSV, XML, and JSON files.
Objects Extracted
The HDFS resource extracts metadata from files in an HDFS data source.
Permissions to Configure the Resource
Configure read permission on the HDFS data source for the user account that you use to access the data source.
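For example, you can verify and grant read access from the HDFS command line. The following is a minimal sketch, assuming the source directory is /data/landing and that you run the commands as an HDFS superuser; the path is illustrative:

```
# Confirm that the source directory exists and list its permissions
hdfs dfs -ls /data/landing

# Grant read and traverse access to other users, including the scanner account
hdfs dfs -chmod -R o+rx /data/landing
```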
Supported File Types
The HDFS resource enables you to extract metadata from structured and unstructured files.
The structured files supported are:
- Avro files
- Delimited files
- Text files
- JSON files
- Parquet files
- XML files
The unstructured files supported are:
- Apple files
- Compressed files
- Email files
- Microsoft Excel files
- Microsoft PowerPoint files
- Microsoft Word files
- OpenOffice files
- PDF files
- Web pages
- Extended unstructured formats
Assign read and write permissions to the files to extract metadata.
Resource Connection Properties
The General tab includes the following properties:
- Cluster Configuration Details. Select one of the following options to specify the details to configure the resource:
  - Provide Configuration Details
  - Load from Configuration Archive File
- Storage Type. Select one of the following options to specify the type of storage from which you want to extract metadata:
  - DFS. Distributed File System.
  - WASB. Windows Azure Blob Storage. Configure the following options if you select WASB:
    - Azure Storage Account URI. The fully qualified URI to access data stored in WASB.
    - Azure Storage Account Name. Name of the storage account.
    - Azure Storage Account Key. Key to access the storage account.
  - ABFS. Azure Blob File System. Configure the following options if you select ABFS:
    - Azure Storage Account URI. The fully qualified URI to access data stored in ABFS.
    - Azure Storage Account Name. Name of the storage account.
    - Azure Storage Account Key. Key to access the storage account.
- Name Node URI 1. URI to the active HDFS NameNode. The active HDFS NameNode manages all the client operations in the cluster.
- HA Cluster. Select Yes if the cluster is configured for high availability, and configure the following properties:
  - Name Node URI 2. URI to the secondary HDFS NameNode. The secondary HDFS NameNode stores modifications to HDFS as a log file appended to a native file system file.
  - HDFS Service Name. The service name configured for HDFS.
- Distribution Type. Select one of the following Hadoop distribution types for the HDFS resource:
  - Cloudera
  - Hortonworks
  - IBM BigInsights
  - Azure HDInsight
  - Amazon EMR
  - MapR FS. Complete the prerequisites before you select MapR FS as the distribution type. See the Prerequisites for MapR FS Distribution Type section for more details.
  Note: This property applies if you select the Load from Configuration Archive File option to configure the resource.
- MAPR Home. The MapR client installation path. Note: This property applies if you selected MapR FS as the Hadoop distribution type.
- User Name/User Principal. User name to connect to HDFS. Specify the Kerberos principal if the cluster is enabled for Kerberos.
- Source Directory. The source location from which metadata must be extracted.
- Configuration Archive File. A ZIP file that contains the resource configuration properties in the following XML files:
  - core-site.xml
  - hdfs-site.xml
  - mapred-site.xml
  - yarn-site.xml
  Note: This property applies if you select the Load from Configuration Archive File option to configure the resource.
- HDFS Transparent Encryption. Select Yes if transparent encryption is enabled for HDFS. Provide the fully qualified URI to the Key Management Server key provider in the Key Management Server Provider URI box. If you select Yes, you must import the SSL certificate into the Informatica domain infa_truststore.jks truststore file located in the <INFA_HOME>/services/shared/security directory.
- Kerberos Cluster. Select Yes if the cluster is enabled for Kerberos, and provide the Kerberos configuration details.
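The following sketch shows illustrative values for some of these properties. The host, account, and container names are assumptions, not defaults; substitute the values for your cluster:

```
# Name Node URI 1: the active NameNode (8020 is a common RPC port)
#   hdfs://nn1.example.com:8020
# Azure Storage Account URI for WASB:
#   wasb://mycontainer@myaccount.blob.core.windows.net
# Azure Storage Account URI for ABFS:
#   abfs://mycontainer@myaccount.dfs.core.windows.net

# Configuration Archive File: bundle the client configuration files into a ZIP
zip hdfs-config.zip core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml
```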
For more information about configuring Azure storage access for the Azure HDInsight distribution type, see the Data Engineering Integration Guide.
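If you enable HDFS Transparent Encryption, the truststore import might look like the following sketch. It assumes that the Key Management Server SSL certificate is saved locally as kms.crt; the certificate file name and alias are illustrative:

```
# Import the KMS SSL certificate into the Informatica domain truststore
keytool -importcert -alias kms_ssl -file kms.crt \
  -keystore <INFA_HOME>/services/shared/security/infa_truststore.jks
```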
The Metadata Load Settings tab includes the following properties:
- Enable Source Metadata. Extracts metadata from the data source.
- File Types. Select one of the following options to specify the file types from which you want to extract metadata:
  - All. Extract metadata from all file types.
  - Select. Extract metadata from specific file types. Perform the following steps to specify the file types:
    1. Click Select. The Select Specific File Types dialog box appears.
    2. Select the required file types from the following options:
       - Extended unstructured formats. Extracts metadata from file types such as audio files, video files, image files, and ebooks.
       - Structured file types. Extracts metadata from file types such as JSON, Avro, Parquet, XML, text, and delimited files.
       - Unstructured file types. Extracts metadata from file types such as Microsoft Excel, Microsoft PowerPoint, Microsoft Word, web pages, compressed files, emails, and PDF.
    3. Click Select.
  Note: You can select the Specific File Types option in the dialog box to select files under all the categories.
- Other File Types. Extracts basic file metadata, such as file size, path, and timestamp, from file types that are not listed in the File Types property.
- Treat Files Without Extension As. Select an option to specify how files without an extension must be identified.
- Enter File Delimiter. Specify the file delimiter if the file from which you extract metadata uses a delimiter other than the following delimiters:
  - Comma (,)
  - Horizontal tab (\t)
  - Semicolon (;)
  - Colon (:)
  - Pipe symbol (|)
  Enclose each delimiter in single quotes. For example, '$'. Use a comma to separate multiple delimiters. For example, '$','%','&'.
- First Level Directory. Specifies that all the directories must be selected. If you want to select specific directories, use the Select Directory option. This option is disabled if you selected the Include Subdirectories option on the General tab.
- Include Subdirectory. Type the required directories in the text box, or click Select... to choose the required directories. This option is disabled if you selected the Include Subdirectories option on the General tab or the Select all Directories option listed above.
- Case Sensitive. Specifies whether the resource is configured as case sensitive. Select one of the following values:
  - True. Select this check box to specify that the resource is case sensitive.
  - False. Clear this check box to specify that the resource is case insensitive.
  Default is True.
- Memory. The memory required to run the scanner job. Select a value based on the size of the imported data set. Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Documentation Portal.
- JVM Options. JVM parameters that you can set to configure the scanner container. Use the following arguments to configure the parameters:
  - -Dscannerloglevel=<DEBUG/INFO/ERROR>. Changes the log level of the scanner to DEBUG, ERROR, or INFO. Default is INFO.
  - -Dscanner.container.core=<number of cores>. Increases the number of cores for the scanner container. The value must be a number.
  - -Dscanner.yarn.app.environment=<key=value>. Key-value pairs to set in the YARN environment. Use a comma to separate multiple key-value pairs.
  - -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0>. Increases the scanner container memory when pmem is enabled. Default is 1.
  - -Djava.security.auth.login.config={MAPR_HOME}/conf/mapr.login.conf. Required to extract metadata from a MapR Hadoop distribution that accesses data from an HDFS data source. The MapR client installed on the Informatica domain and the cluster must be version 6.1.0, and the MapR Hadoop distribution version must be MEP 6.x.
- Track Data Source Changes. View metadata source change notifications in Enterprise Data Catalog.
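For example, a JVM Options value that raises the scanner log level and allocates more cores to the scanner container might look like the following line; the core count is illustrative:

```
-Dscannerloglevel=DEBUG -Dscanner.container.core=4
```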
You can enable data discovery for an HDFS resource. For more information about enabling data discovery, see the Enable Data Discovery topic.
You can enable composite data domain discovery for an HDFS resource. For more information about enabling composite data domain discovery, see the Composite Data Domain Discovery topic.
Running an HDFS Resource on a Kerberos-enabled Cluster
If you want to run an HDFS resource scanner on a Kerberos-enabled cluster, perform the following steps:
1. Copy the krb5.conf file to the following location: <Install Directory>/data/ldmbcmev/Informatica/LDM20_309/source/services/shared/security/krb5.conf
2. Copy the krb5.conf file to the /etc directory on all the clusters where the Catalog Service is running.
3. Copy the keytab file to the /opt directory in the following locations:
   - A common location for all clusters where the Catalog Service is running.
   - The domain machine.
   - The Kerberos cluster machine.
4. Add the machine details of the KDC host to the /etc/hosts file on the domain machine and the cluster machine where the Catalog Service is running.
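The following is a minimal command-line sketch of these steps. The file names, host names, IP address, and security directory path are illustrative assumptions; use the locations that apply to your installation:

```
# Steps 1 and 2: copy krb5.conf to the Informatica security directory
# and to /etc on each cluster node that runs the Catalog Service
cp krb5.conf <INFA_HOME>/services/shared/security/krb5.conf
scp krb5.conf catalog-node1:/etc/krb5.conf

# Step 3: copy the keytab file to the /opt directory on the required machines
scp edc.keytab catalog-node1:/opt/

# Step 4: add the KDC host entry to /etc/hosts on the domain and cluster machines
echo "10.0.0.5  kdc.example.com" | sudo tee -a /etc/hosts
```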
Prerequisites for MapR FS Distribution Type
Complete the following prerequisites before you select MapR FS as a distribution type:
- Verify that the MapR client is installed on the Informatica Cluster Service nodes.
- Verify that the MapR client is installed in the same location across the Informatica Cluster Service nodes.
- Configure the MapR client to generate the mapr-clusters.conf file and the other files that are required to connect to the MapR cluster. Run the configure.sh command shown below to configure the MapR client.
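The command below uses the cluster name and node values from this example; replace them with your own cluster details:

```
# -N: cluster name, -c: configure a client, -C: CLDB node and port, -HS: HistoryServer node
/opt/mapr/server/configure.sh -N my.cluster.com -c -C mynode01:7222 -HS mynode02
```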