File System

Use the File System resource to import metadata from files in Windows and Linux file systems into the catalog. Choose the Local File option to import metadata from files located on a local machine. Choose the SFTP (Secure File Transfer Protocol) option to import metadata from files located on a remote Linux machine, or the SMB/CIFS (Server Message Block/Common Internet File System) option to import metadata from files located on a remote Windows machine.

Resource Connection Properties

In the General tab, you can enter a name and brief description for the resource. In the Additional properties section, choose File System as the resource.
Choose the following options in the Connection Properties section to import the metadata from the files located on the local machine:
| Property | Description |
| --- | --- |
| File Protocol | Choose Local File to scan the files on the local Windows or Linux machine. |
| Path | Specify the absolute path from which you want to import the metadata into the catalog. Make sure that the absolute path that you enter in this field is available on the machine where you have installed Enterprise Data Catalog and on all the cluster nodes. |
| Test Connection | Click Test Connection to verify the connection to the specified location. |
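
Because the Path property for the Local File protocol expects an absolute path, it can help to check a candidate value before you enter it. The following is a minimal sketch, not part of Enterprise Data Catalog; the function name `is_absolute_scan_path` is hypothetical. It only checks the path syntax under Linux and Windows conventions and does not verify that the path exists on the catalog machine or on the cluster nodes.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_absolute_scan_path(path: str) -> bool:
    """Return True if `path` is absolute under Linux or Windows conventions.

    Hypothetical helper: this is a syntax check only. It does not touch the
    file system, so it cannot confirm that the path is available on every node.
    """
    return PurePosixPath(path).is_absolute() or PureWindowsPath(path).is_absolute()


print(is_absolute_scan_path("/data/landing"))      # Linux-style absolute path
print(is_absolute_scan_path(r"C:\data\landing"))   # Windows-style absolute path
print(is_absolute_scan_path("data/landing"))       # relative path
```
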
Choose the following options in the Connection Properties section to import the metadata from the files located on a Linux machine:
| Property | Description |
| --- | --- |
| File Protocol | Choose SFTP to scan the files on a remote Linux machine. |
| User Name | Enter the user name to access the Linux machine. |
| Password | Enter the password to access the Linux machine. |
| Host | Specify the host name or IP address of the Linux machine. |
| Port | Specify the port number on which the SFTP file protocol is configured on the machine. The default value is 22. |
| Path | Specify the network or shared path from which you want to import the metadata into the catalog. |
| Test Connection | Click Test Connection to verify the connection to the specified location. |
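
If Test Connection fails for an SFTP resource, one thing worth ruling out is basic network reachability of the Host and Port values. The sketch below uses only the Python standard library and is an illustration, not what Test Connection does internally; the function name `sftp_port_reachable` is hypothetical. It confirms only that a TCP connection can be opened and does not validate the user name, password, or path.

```python
import socket


def sftp_port_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: checks reachability only, not SFTP authentication.
    The default port of 22 matches the default Port value described above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `sftp_port_reachable("linux-host.example.com")` (a placeholder host name) returns False when the port is blocked by a firewall or the host name cannot be resolved.
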
Choose the following options in the Connection Properties section to import the metadata from the files located on a Windows machine:
| Property | Description |
| --- | --- |
| File Protocol | Choose SMB/CIFS to scan the files located on a remote Windows machine. |
| User Name | Enter the user name to access the Windows machine. |
| Password | Enter the password to access the Windows machine. |
| Physical Location | Applies to flat files. Enter the absolute path of the shared directory on the Windows machine. For example, assume that the files are located in the c:\user1\1021\SMB\test_files\test1 and c:\user1\1021\SMB\test_files\test2 folders. You map the c:\user1\1021\SMB location, which results in the \\SMB\ shared directory. In this scenario, enter \\SMB\ in the Physical Location field. For more information about how Enterprise Data Catalog uses the physical location field for lineage, see Appendix F - Lineage between PowerCenter Resources and Flat Files. |
| Host | Specify the host name or IP address of the machine. |
| Path | Specify the network or shared path from which you want to import the metadata into the catalog. For example, enter \\SMB\test_files\ in the Path field. |
| Test Connection | Click Test Connection to verify the connection to the specified location. |
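
The relationship between the local folder, the mapped location, and the values entered in the Physical Location and Path fields can be sketched as follows. This is an illustrative helper only; the function name `to_share_path` and its parameters are hypothetical, and the paths are the ones from the example above.

```python
from pathlib import PureWindowsPath


def to_share_path(local_path: str, shared_root: str, share_name: str) -> str:
    """Rewrite a local Windows path under `shared_root` as a share-relative path.

    Hypothetical helper: raises ValueError if `local_path` is not located
    under `shared_root`.
    """
    relative = PureWindowsPath(local_path).relative_to(PureWindowsPath(shared_root))
    parts = [share_name, *relative.parts]
    # Two leading backslashes, backslash-separated parts, trailing backslash.
    return "\\\\" + "\\".join(parts) + "\\"


shared_root = r"c:\user1\1021\SMB"
# Value for the Physical Location field (the shared directory itself).
print(to_share_path(shared_root, shared_root, "SMB"))
# Value for the Path field (a folder below the shared directory).
print(to_share_path(r"c:\user1\1021\SMB\test_files", shared_root, "SMB"))
```
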
The Metadata Load Settings tab includes the following properties:
Property
Description
Enable Source Metadata
Extracts and ingests metadata from the data source.
File Types
Select any or all of the following file types from which you want to extract metadata:
  • All. Use this option to specify that you want to extract metadata from all file types.
  • Select. Use this option to specify that you want to extract metadata from specific file types. Perform the following steps to specify the file types:
    1. Click Select. The Select Specific File Types dialog box appears.
    2. Select the required file types from the following options:
      • Extended unstructured formats. Use this option to extract metadata from file types such as audio files, video files, image files, and ebooks.
      • Structured file types. Use this option to extract metadata from file types such as JSON, XML, text, and delimited files.
      • Unstructured file types. Use this option to extract metadata from file types such as Microsoft Excel, Microsoft PowerPoint, Microsoft Word, web pages, compressed files, emails, and PDF.
    3. Click Select.
    Note: You can select the Specific File Types option in the dialog box to select files under all the categories.
Treat Files Without Extension As
Select one of the following options to identify files without an extension:
  • None
  • Avro
  • Parquet
Enter File Delimiter
You can specify one of the following delimiters:
  • Single-character delimiter. Specify the file delimiter if the file from which you extract metadata uses a delimiter other than the following delimiters:
    • Comma (,)
    • Horizontal tab (\t)
    • Semicolon (;)
    • Colon (:)
    • Pipe symbol (|)
    Enclose each delimiter in single quotes. For example, '$'. Use a comma to separate multiple delimiters. For example, '$','%','&'.
  • Multi-character delimiter. Specify multiple characters as a delimiter within single quotes for CSV files. For example, '|$#'.
    If you specify a multi-character delimiter, do not specify any other delimiter for the scanner. When you run a file system resource with a multi-character delimiter, the scanner treats the files that contain the delimiter as valid CSV files and treats the remaining files as unstructured files.
The following resources do not support a multi-character delimiter for CSV files in Enterprise Data Catalog:
  • File System resource that uses the SFTP or SMB/CIFS protocol
  • Amazon S3
  • OneDrive
  • SharePoint
  • Microsoft Azure Blob Storage
  • Azure Data Lake Store
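
The quoting rules for the Enter File Delimiter property can be sketched as a small formatter. This is a minimal illustration under the rules stated above, not product code; the function name `format_delimiters` is hypothetical. It encloses each delimiter in single quotes, separates multiple delimiters with commas, and rejects a multi-character delimiter that is combined with any other delimiter.

```python
from typing import Sequence


def format_delimiters(delimiters: Sequence[str]) -> str:
    """Format delimiters as single-quoted, comma-separated values.

    Hypothetical helper that mirrors the rules described above.
    """
    if not delimiters or any(d == "" for d in delimiters):
        raise ValueError("provide at least one non-empty delimiter")
    has_multi_char = any(len(d) > 1 for d in delimiters)
    if has_multi_char and len(delimiters) > 1:
        raise ValueError(
            "a multi-character delimiter cannot be combined with other delimiters"
        )
    return ",".join(f"'{d}'" for d in delimiters)


print(format_delimiters(["$", "%", "&"]))  # prints '$','%','&'
print(format_delimiters(["|$#"]))          # prints '|$#'
```
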
Other File Types
Extracts basic file metadata such as size of the file, path to the file, and time stamp information from other file types.
First Level Directory
Specify a directory or a list of directories under the source directory. If you leave this option blank, Enterprise Data Catalog imports all the files from the specified source directory.
To specify a directory or a list of directories, you can perform the following steps:
  1. Click Select.... The Select First Level Directory dialog box appears.
  2. Select the required directories using one of the following options:
    • Select from list. Select the required directories from a list of directories.
    • Select using regex. Provide an SQL regular expression to select the directories that match the expression.
Note: If you select multiple directories, you must separate the directories with a semicolon (;).
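
The effect of selecting first-level directories with a regular expression, and the semicolon-separated form of a multi-directory value, can be sketched as follows. This example uses Python's `re` module as a stand-in, so its pattern syntax may differ from the expression syntax that Enterprise Data Catalog accepts; the function name `select_directories` and the directory names are hypothetical.

```python
import re
from typing import Iterable


def select_directories(names: Iterable[str], pattern: str) -> str:
    """Return the directory names that fully match `pattern`, joined by ';'.

    Hypothetical helper: uses Python regex syntax, which is an assumption
    and not necessarily the dialect the product expects.
    """
    regex = re.compile(pattern)
    matched = [name for name in names if regex.fullmatch(name)]
    return ";".join(matched)


first_level = ["sales_2023", "sales_2024", "hr", "tmp"]
print(select_directories(first_level, r"sales_\d{4}"))  # prints sales_2023;sales_2024
```
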
Include Subdirectory
Select this option to import all the files in the subdirectories under the source directory.
Case Sensitive
Specifies that the resource is configured for case sensitivity. Select one of the following values:
  • True. Select this check box to specify that the resource is configured as case sensitive.
  • False. Clear this check box to specify that the resource is configured as case insensitive.
The default value is True.
Memory
The memory required to run the scanner job. Select one of the following values based on the data set size imported:
  • Low
  • Medium
  • High
Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Doc Portal.
JVM Options
JVM parameters that you can set to configure the scanner container. Use the following arguments to configure the parameters:
  • -Dscannerloglevel=<DEBUG/INFO/ERROR>. Changes the log level of the scanner to DEBUG, INFO, or ERROR. The default value is INFO.
  • -Dscanner.container.core=<No. of core>. Increases the number of cores for the scanner container. The value must be a number.
  • -Dscanner.yarn.app.environment=<key=value>. Key-value pairs that you need to set in the Yarn environment. Use a comma to separate the key-value pairs.
  • -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0>. Increases the scanner container memory when pmem is enabled. The default value is 1.
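
As an illustration of how these arguments combine, the following sketch assembles a JVM Options value from the four parameters listed above. The function name `build_jvm_options` is hypothetical, and separating the arguments with a single space is an assumption that the source does not confirm.

```python
from typing import Mapping, Optional

ALLOWED_LOG_LEVELS = {"DEBUG", "INFO", "ERROR"}


def build_jvm_options(
    log_level: str = "INFO",
    cores: Optional[int] = None,
    yarn_env: Optional[Mapping[str, str]] = None,
    pmem_ratio: Optional[float] = None,
) -> str:
    """Assemble a JVM Options string from the documented scanner arguments."""
    if log_level not in ALLOWED_LOG_LEVELS:
        raise ValueError(f"log level must be one of {sorted(ALLOWED_LOG_LEVELS)}")
    options = [f"-Dscannerloglevel={log_level}"]
    if cores is not None:
        if cores < 1:
            raise ValueError("cores must be a positive number")
        options.append(f"-Dscanner.container.core={cores}")
    if yarn_env:
        # Key-value pairs are comma-separated, as described above.
        pairs = ",".join(f"{key}={value}" for key, value in yarn_env.items())
        options.append(f"-Dscanner.yarn.app.environment={pairs}")
    if pmem_ratio is not None:
        options.append(
            f"-Dscanner.pmem.enabled.container.memory.jvm.memory.ratio={pmem_ratio}"
        )
    # Assumption: arguments are separated by a single space.
    return " ".join(options)


print(build_jvm_options("DEBUG", cores=2, yarn_env={"TZ": "UTC"}))
```
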
Track Data Source Changes
View metadata source change notifications in Enterprise Data Catalog.
You can enable data discovery for a File System resource. For more information about enabling data discovery, see the Enable Data Discovery topic.
You can enable composite data domain discovery for a File System resource. For more information about enabling composite data domain discovery, see the Composite Data Domain Discovery topic.