  

Microsoft Azure Data Lake Storage Gen2

Create a Microsoft Azure Data Lake Storage Gen2 connection to read from or write to Microsoft Azure Data Lake Storage Gen2.

Feature snapshot

Operation    Support
Read         Yes
Write        Yes

Prerequisites

Before you configure the connection properties, you'll need to get information from your Azure account.
The following video shows you how to get information from your Azure account:
https://infa.media/3ThYFEB
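
If you want to verify the account details before you create the connection, you can test them outside INFACore. The following sketch uses the azure-identity Python package, not the INFACore API, to confirm that the tenant ID, client ID, and client secret of a service principal can acquire an Azure Storage token. All placeholder values are assumptions.

# Sanity check for service principal details collected from Azure.
# Requires the azure-identity package: pip install azure-identity
# All placeholder values are assumptions; substitute your own.
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id="<directory-id>",       # Tenant ID connection property
    client_id="<application-id>",     # Client ID connection property
    client_secret="<client-secret>",  # Client Secret connection property
)

# Request a token for Azure Storage. Success confirms that the values
# are valid and that the service principal exists in the tenant.
token = credential.get_token("https://storage.azure.com/.default")
print("Token acquired, expires at:", token.expires_on)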

Connection properties

The following table describes the Microsoft Azure Data Lake Storage Gen2 connection properties:
Property
Description
Connection Name
Name of the connection.
Each connection name must be unique within the organization. Connection names can contain alphanumeric characters, spaces, and the following special characters: _ . + -
Maximum length is 255 characters.
Account Name
Microsoft Azure Data Lake Storage Gen2 account name or the service name.
Authentication Type
Authentication type to access the Microsoft Azure Data Lake Storage Gen2 account.
Select one of the following options:
  • Service Principal Authentication. Uses the client ID, client secret, and tenant ID to connect to Microsoft Azure Data Lake Storage Gen2.
  • Shared Key Authentication. Uses the account key to connect to Microsoft Azure Data Lake Storage Gen2.
  • Managed Identity Authentication. Uses identities that are assigned to applications in Azure to access Azure resources in Microsoft Azure Data Lake Storage Gen2.
Client ID
Applies to Service Principal Authentication and Managed Identity Authentication.
The client ID of your application.
To use service principal authentication, specify the application ID or client ID of your application registered in Azure Active Directory.
To use managed identity authentication, specify the client ID of the user-assigned managed identity. If the permission is provided by a system-assigned managed identity, leave the field empty. You can also leave the field empty if there is no system-assigned identity and only a single user-assigned managed identity.
Client Secret
Applies to Service Principal Authentication.
The client secret key to complete OAuth authentication in Azure Active Directory.
Tenant ID
Applies to Service Principal Authentication.
The directory ID of the Azure Active Directory.
Account Key
Applies to Shared Key Authentication.
The account key for the Microsoft Azure Data Lake Storage Gen2 account.
File System Name
The name of the file system in the Microsoft Azure Data Lake Storage Gen2 account.
Directory Path
The path of an existing directory without the file system name.
You can use one of the following formats:
  • / for the root directory
  • /dir1
  • dir1/dir2
There is no default directory.
Adls Gen2 End-point
The type of Microsoft Azure endpoint.
Select one of the following endpoints:
  • core.windows.net. Connects to Azure endpoints.
  • core.usgovcloudapi.net. Connects to US government Microsoft Azure Data Lake Storage Gen2 endpoints.
  • core.chinacloudapi.cn. Connects to Microsoft Azure Data Lake Storage Gen2 endpoints in the China region.
Default is core.windows.net.
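
The connection properties correspond closely to the parameters that the Azure SDK for Python expects. The following sketch, which uses the azure-identity and azure-storage-file-datalake packages rather than the INFACore API, shows how the authentication types and the endpoint, file system, and directory path properties map to an Azure client. All placeholder values are assumptions.

# Illustration of how the connection properties map to Azure SDK parameters.
# Requires: pip install azure-identity azure-storage-file-datalake
from azure.identity import ClientSecretCredential, ManagedIdentityCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_NAME = "<account-name>"   # Account Name property
ENDPOINT = "core.windows.net"     # Adls Gen2 End-point property
ACCOUNT_URL = f"https://{ACCOUNT_NAME}.dfs.{ENDPOINT}"

# Service Principal Authentication: client ID, client secret, and tenant ID.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)

# Shared Key Authentication: the account key itself acts as the credential.
# credential = "<account-key>"

# Managed Identity Authentication: pass the client ID of a user-assigned
# identity, or omit client_id to use the system-assigned identity.
# credential = ManagedIdentityCredential(client_id="<client-id>")

service_client = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=credential)

# File System Name and Directory Path properties.
file_system = service_client.get_file_system_client("<file-system-name>")
directory = file_system.get_directory_client("dir1/dir2")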

Read properties

The following table describes the advanced source properties that you can configure in the Python code to read from Microsoft Azure Data Lake Storage Gen2:
Property
Description
Concurrent Threads
Number of concurrent connections used to extract data from Microsoft Azure Data Lake Storage Gen2. When you read a large file or object, you can spawn multiple threads to process data. Configure Block Size to divide a large file into smaller parts.
Default is 4. Maximum is 10.
Filesystem Name Override
Overrides the default file system name.
Source Type
Select the type of source from which you want to read data. You can select one of the following source types:
  • File
  • Directory
Default is File.
Allow Wildcard Characters
Indicates whether you want to use wildcard characters for the directory source type.
Directory Override
Microsoft Azure Data Lake Storage Gen2 directory that you use to read data. Default is the root directory. The directory path specified at run time overrides the path specified while creating a connection.
You can specify an absolute or a relative directory path:
  • Absolute path. The Secure Agent searches this directory path in the specified file system.
    Example: Dir1/Dir2
  • Relative path. The Secure Agent searches this directory path in the native directory path of the object.
    Example: /Dir1/Dir2
    When you use the relative path, the imported object path is added to the file path used during the metadata fetch at run time.
Do not specify the root directory (/) to override the directory.
File Name Override
Source object. Select the file from which you want to read data. The file specified at run time overrides the file specified in Object.
Block Size
Applicable to the flat file format. Divides a large file into parts of the specified block size. When you read a large file, divide the file into smaller parts and configure concurrent connections to spawn the required number of threads to process data in parallel.
Specify an integer value for the block size.
Default is 8388608 bytes.
Timeout Interval
Not applicable.
Recursive Directory Read
Indicates whether you want to read objects stored in subdirectories in mappings.
Incremental File Load
Not applicable.
Compression Format
Reads compressed data from the source.
Select one of the following options:
  • None. Select to read Avro, ORC, and Parquet files that use Snappy compression. The compressed files must have the .snappy extension.
    You cannot read compressed JSON files.
  • Gzip. Select to read flat files and Parquet files that use Gzip compression. The compressed files must have the .gz extension.
You cannot preview data for a compressed flat file.
Interim Directory
Optional. Applicable to flat files and JSON files.
Path to the staging directory on the Secure Agent machine.
Specify the staging directory where you want to stage the files when you read data from Microsoft Azure Data Lake Storage Gen2. Ensure that the directory has sufficient space and you have write permissions to the directory.
Default staging directory is /tmp.
You cannot specify an interim directory when you use the Hosted Agent.
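
For comparison, the following sketch performs a similar read directly with the azure-storage-file-datalake package rather than through INFACore. The max_concurrency argument plays a role comparable to Concurrent Threads, and the file and directory paths, which are assumptions, correspond to Directory Override and File Name Override.

# Direct Azure SDK read, analogous to the source properties above.
# Requires: pip install azure-storage-file-datalake
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(
    account_url="https://<account-name>.dfs.core.windows.net",
    credential="<account-key>",  # shared key authentication for brevity
)
file_system = service_client.get_file_system_client("<file-system-name>")

# Equivalent of Directory Override plus File Name Override.
file_client = file_system.get_file_client("Dir1/Dir2/source.csv")

# max_concurrency is comparable to Concurrent Threads: the download is
# split into ranges that are fetched in parallel.
downloader = file_client.download_file(max_concurrency=4)
data = downloader.readall()
print(f"Read {len(data)} bytes")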

Write properties

The following table describes the advanced target properties that you can configure in the Python code to write to Microsoft Azure Data Lake Storage Gen2:
Advanced Target Property
Description
Concurrent Threads
Number of concurrent connections used to write data to Microsoft Azure Data Lake Storage Gen2. When you write a large file, you can spawn multiple threads to process data. Configure Block Size to divide a large file into smaller parts.
Default is 4. Maximum is 10.
Filesystem Name Override
Overrides the default file system name.
Directory Override
Microsoft Azure Data Lake Storage Gen2 directory that you use to write data. Default is the root directory. The Secure Agent creates the directory if it does not exist. The directory path specified at run time overrides the path specified while creating a connection.
You can specify an absolute or a relative directory path:
  • Absolute path. The Secure Agent searches this directory path in the specified file system.
    Example: Dir1/Dir2
  • Relative path. The Secure Agent searches this directory path in the native directory path of the object.
    Example: /Dir1/Dir2
    When you use the relative path, the imported object path is added to the file path used during the metadata fetch at run time.
Do not specify the root directory (/) to override the directory.
File Name Override
Target object. Select the file to which you want to write data. The file specified at run time overrides the file specified in Object.
Write Strategy
Applicable to flat files in mappings.
When you create a mapping in advanced mode, you can use write strategy for both flat files and complex files.
If the file exists in Microsoft Azure Data Lake Storage Gen2, you can choose to overwrite the existing file or append to it.
The maximum size of data that you can append is 450 MB.
When you append data for mappings in advanced mode, the data is appended as a new part file in the existing target directory.
Block Size
Applicable to flat, Avro, and Parquet file formats. Divides a large file into parts of the specified block size. When you write a large file, divide the file into smaller parts and configure concurrent connections to spawn the required number of threads to process data in parallel.
Specify an integer value for the block size.
Default is 8388608 bytes.
Compression Format
Compresses and writes data to the target based on the format you specify.
Select one of the following options:
  • None. Select to write Avro, ORC, and Parquet files that use Snappy compression.
    You cannot write compressed JSON files.
  • Gzip. Select to write flat files and Parquet files that use Gzip compression.
When the task runs, the .gz or .snappy file extensions do not appear in the target object name.
Timeout Interval
Not applicable.
Interim Directory
Optional. Applicable to flat files and JSON files.
Path to the staging directory on the Secure Agent machine.
Specify the staging directory where you want to stage the files when you write data to Microsoft Azure Data Lake Storage Gen2. Ensure that the directory has sufficient space and you have write permissions to the directory.
Default staging directory is /tmp.
You cannot specify an interim directory for mappings in advanced mode.
You cannot specify an interim directory when you use the Hosted Agent.
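
For comparison, the following sketch performs an overwrite and an append directly with the azure-storage-file-datalake package rather than through INFACore, mirroring the two Write Strategy options. The file system name, paths, and payloads are assumptions.

# Direct Azure SDK write, analogous to the target properties above.
# Requires: pip install azure-storage-file-datalake
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(
    account_url="https://<account-name>.dfs.core.windows.net",
    credential="<account-key>",
)
file_system = service_client.get_file_system_client("<file-system-name>")
file_client = file_system.get_file_client("Dir1/Dir2/target.csv")

# Overwrite strategy: replace the file contents in a single call.
file_client.upload_data(b"id,name\n1,alpha\n", overwrite=True)

# Append strategy: append to the end of the existing file, then flush
# to commit the appended bytes.
size = file_client.get_file_properties().size
chunk = b"2,beta\n"
file_client.append_data(chunk, offset=size, length=len(chunk))
file_client.flush_data(size + len(chunk))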