Microsoft Azure Data Lake Storage Gen2 sources in mappings
In a mapping, you can configure a source transformation to represent a single Microsoft Azure Data Lake Storage Gen2 object.
The following table describes the Microsoft Azure Data Lake Storage Gen2 source properties that you can configure in a source transformation:
Property
Description
Connection
Name of the source connection. Select a source connection or click New Parameter to define a new parameter for the source connection.
If you want to overwrite the parameter at runtime, select the Allow parameter to be overridden at run time option when you create a parameter. When the task runs, the agent uses the parameters from the file that you specify in the task advanced session properties. Ensure that the parameter file is in the correct format.
When you switch between a non-parameterized and a parameterized Microsoft Azure Data Lake Storage Gen2 connection, the advanced property values are retained.
Source Type
Select Single Object or Parameter.
Object
Name of the source object.
Ensure that the headers and the file data do not contain special characters.
Parameter
Select an existing parameter for the source object or click New Parameter to define a new parameter for the source object. The Parameter property appears only if you select Parameter as the source type.
When you parameterize the source object, specify the complete object path including the file system in the default value of the parameter.
If you want to overwrite the parameter at runtime, select the Allow parameter to be overridden at run time option when you create a parameter. When the task runs, the agent uses the parameters from the file that you specify in the task advanced session properties. Ensure that the parameter file is in the correct format.
Format
The file format that the Microsoft Azure Data Lake Storage Gen2 Connector uses to read data from Microsoft Azure Data Lake Storage Gen2.
Note: Ensure that the source file is not empty.
You can select from the following file format types:
- None
- Flat
- Avro
- JSON
- ORC
- Parquet
- Delta
- Discover Structure
- Document
Default is None. If you select None as the format type, Microsoft Azure Data Lake Storage Gen2 Connector reads data from Microsoft Azure Data Lake Storage Gen2 files in binary format.
Document Type
Determines the type of data that you can read when you select the Document file format.
You can read only PDF files with the Document file format.
Intelligent Structure Model1
Applies to Discover Structure format type. Determines the underlying patterns in a sample file and auto-generates a model for files with the same data and structure.
Select one of the following options to associate a model with the transformation:
- Select. Select an existing model.
- New. Create a new model. Select Design New to create the model. Select Auto-generate from sample file for Intelligent Structure Discovery to generate a model based on sample input that you select.
Select one of the following options to validate the XML source object against an XML-based hierarchical schema:
- Source object doesn't require validation.
- Source object requires validation against a hierarchical schema. Select to validate the XML source object against an existing or a new hierarchical schema.
When you create a mapping task, you configure how Data Integration handles a schema mismatch on the Runtime Options tab. You can choose to skip the mismatched files and continue to run the task, or to stop the task when it encounters the first file that does not match.
For more information, see Components.
1Applies only to mappings in advanced mode.
The following table describes the Microsoft Azure Data Lake Storage Gen2 source advanced properties:
Property
Description
Concurrent Threads1
Number of concurrent connections to extract data from Microsoft Azure Data Lake Storage Gen2. When you read a large file or object, you can spawn multiple threads to process the data. Configure Block Size to divide a large file into smaller parts.
Default is 4. Maximum is 10.
Filesystem Name Override
Overrides the default file system name.
Source Type
Select the type of source from which you want to read data. You can select the following source types:
- File
- Directory
Default is File.
Allow Wildcard Characters
Indicates whether you want to use wildcard characters for the directory source type.
Directory Override
Microsoft Azure Data Lake Storage Gen2 directory that you use to read data. Default is the root directory. The directory path specified at run time overrides the path specified when you created the connection.
You can specify an absolute or a relative directory path:
- Absolute path - The Secure Agent searches this directory path in the specified file system.
Example of absolute path: Dir1/Dir2
- Relative path - The Secure Agent searches this directory path in the native directory path of the object.
Example of relative path: /Dir1/Dir2
When you use a relative path, the imported object path is appended to the file path used during the metadata fetch at run time.
Do not specify a root directory (/) to override the directory.
File Name Override
Source object. Select the file from which you want to read data. The file specified at run time overrides the file specified in Object.
Block Size1
Applicable to the flat file format. Divides a large file into parts of the specified block size. When you read a large file, divide the file into smaller parts and configure concurrent connections to spawn the required number of threads to process the data in parallel.
Specify an integer value for the block size.
Default is 8388608 bytes (8 MB). For an illustration of how Block Size and Concurrent Threads interact, see the sketch after this table.
Timeout Interval
Not applicable.
Recursive Directory Read
Indicates whether you want to read objects stored in subdirectories in mappings.
Compression Format
Select one of the following compression formats:
- None. Select to read Avro, ORC, and Parquet files that use Snappy compression. The compressed files must have the .snappy extension.
You cannot read compressed JSON files.
- Gzip. Select to read flat files and Parquet files that use Gzip compression. The compressed files must have the .gz extension.
You cannot preview data for a compressed flat file.
Interim Directory1
Optional. Applicable to flat files and JSON files.
Path to the staging directory in the Secure Agent machine.
Specify the staging directory where you want to stage the files when you read data from Microsoft Azure Data Lake Storage Gen2. Ensure that the directory has sufficient space and you have write permissions to the directory.
Default staging directory is /tmp.
You cannot specify an interim directory when you use the Hosted Agent.
Tracing Level
Sets the amount of detail that appears in the log file. You can choose terse, normal, verbose initialization, or verbose data. Default is normal.
1Doesn't apply to mappings in advanced mode.
2Applies only to mappings in advanced mode.
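To see how the Concurrent Threads and Block Size properties interact, consider the following minimal sketch in Python. The file size is hypothetical and the connector's actual scheduling is internal to the agent; the sketch only illustrates the arithmetic:

import math

# Hypothetical numbers; the connector's actual scheduling is internal to the agent.
file_size = 100 * 1024 * 1024    # a 100 MB source file
block_size = 8388608             # default Block Size: 8388608 bytes (8 MB)
concurrent_threads = 4           # default Concurrent Threads

blocks = math.ceil(file_size / block_size)        # 13 blocks
waves = math.ceil(blocks / concurrent_threads)    # about 4 waves of parallel reads
print(f"{blocks} blocks, read in about {waves} waves of {concurrent_threads} threads")

A larger block size produces fewer, larger parts per file; tune Block Size and Concurrent Threads together when you read large files.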
Directory source in Microsoft Azure Data Lake Storage Gen2 sources
You can select the type of source from which you want to read data.
You can select the following source types from the Source Type option under the advanced source properties:
•File
•Directory
Use the following rules and guidelines to select Directory as the source type:
•All the source files in the directory must contain the same metadata.
•All the files must have data in the same format. For example, delimiters, header fields, and escape characters must be the same.
•All the files under a specified directory are parsed. To parse the files in the subdirectories, use recursive read.
•When you run a mapping that reads data from a directory, the agent creates a single file in the target. When you create a mapping in advanced mode, the agent creates multiple files in the target.
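Because a directory source assumes identical metadata across all files, a quick pre-flight check can catch mismatched headers before the mapping runs. The following Python sketch is illustrative only; it assumes local copies of delimited files with a header row and is not part of the connector:

from pathlib import Path

# Illustrative pre-flight check: verify that every delimited file in a
# directory shares the same header row before using it as a directory source.
def headers_match(directory: str, pattern: str = "*.csv") -> bool:
    headers = set()
    for path in Path(directory).glob(pattern):
        with open(path, encoding="utf-8") as fh:
            headers.add(fh.readline().strip())
    return len(headers) <= 1

if not headers_match("/data/source_dir"):   # hypothetical local staging copy
    raise ValueError("Files in the directory do not share the same header")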
Wildcard characters
When you read data from an Avro, flat, JSON, ORC, Parquet, or PDF file, you can use wildcard characters to specify the source file name.
To use wildcard characters for the source file name, select the source type as Directory and enable the Allow Wildcard Characters option in the advanced source properties.
When you read an Avro, JSON, ORC, Parquet, flat file, or PDF, you can use the ? and * wildcard characters to define one or more characters in a search.
You can use the following wildcard characters:
? (Question mark)
The question mark character (?) matches exactly one occurrence of any character. For example, if you enter the source file name as a?b.txt, the Secure Agent reads data from files with the following names:
- a1b.txt
- a2b.txt
- aab.txt
- acb.txt
* (Asterisk)
The asterisk character (*) matches zero or more occurrences of any character. If you enter the source file name as a*b.txt, the Secure Agent reads data from files with the following names:
- aab.txt
- a1b.txt
- ab.txt
- abc11b.txt
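These semantics match standard shell-style globbing, so you can test a pattern locally before configuring the override. The following Python sketch uses the fnmatch module as an approximation of the connector's matching, assuming the same ? and * semantics described above:

import fnmatch

# ? matches exactly one character; * matches zero or more characters.
files = ["a1b.txt", "aab.txt", "ab.txt", "abc11b.txt", "axyb.txt"]

print(fnmatch.filter(files, "a?b.txt"))   # ['a1b.txt', 'aab.txt']
print(fnmatch.filter(files, "a*b.txt"))   # all five names match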
Rules and guidelines for wildcard characters
Consider the following rules and guidelines when you use wildcard characters:
Mappings
- You cannot use wildcard characters when you read from partition columns.
- When you read a complex file in a mapping, do not use a tilde (~) in the subdirectory name or file name.
- When you use wildcard characters in directory override, the Secure Agent reads data from the folders as well as the files that match the name pattern.
Mappings in advanced mode
- When you read a flat file or complex file and enable wildcard characters, ensure that the path specified in the directory override or file name override matches the file path in the source.
- When you use wildcard characters, ensure that the file name does not start with a special character.
- When you read a flat file, do not use the following special characters in the directory name or sub-directory name in the directory override:
[] {} " ' + ^ % * ? space
Reading files from subdirectories
You can read objects stored in subdirectories in Microsoft Azure Data Lake Storage Gen2 in mappings.
You can use recursive read for flat files and complex files in mappings. You cannot use recursive read for Delta files in mappings. When you create a mapping in advanced mode, you cannot use recursive read for flat files.
To enable recursive read, select the source type as Directory in the advanced source properties. Enable the Recursive Directory Read advanced source property to read objects stored in subdirectories.
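The difference between the two settings is essentially a top-level listing versus a full directory walk. The following Python sketch is a rough local analogy; the agent performs the listing against the Microsoft Azure Data Lake Storage Gen2 file system, not the local disk:

from pathlib import Path

src = Path("/data/source_dir")   # hypothetical local stand-in for the ADLS Gen2 directory

# Recursive Directory Read disabled: only files directly under the directory.
top_level = [p for p in src.iterdir() if p.is_file()]

# Recursive Directory Read enabled: files in the directory and all subdirectories.
recursive = [p for p in src.rglob("*") if p.is_file()]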
Rules and guidelines for reading from subdirectories
Consider the following rules and guidelines when you read objects stored in subdirectories:
Mappings
- When you read from or write to a flat file in Microsoft Azure Data Lake Storage Gen2, ensure that the directory or subdirectory name does not contain the percentage (%) character. Otherwise, the mapping fails.
- You cannot use recursive read when you read from partition columns.
- When you read a complex file in a mapping, do not use a tilde (~) in the subdirectory name or file name.
- When the FileName field for the source and target is mapped, the file is created in the following format:
- When you read a flat file with only headers and no data and map the FileName field, the expected directory structure is not created with the FileName field.
Mappings in advanced mode
- When you read a complex file and enable recursive read, ensure that the path specified in the directory override or file name override matches the file path in the source.
- When you read a flat file, do not use the following special characters in the directory name or sub-directory name in the directory override:
[] {} " ' + ^ % * ? space
Incrementally loading files
You can incrementally load source files in a directory to read and process only the files that have changed since the last time the mapping task ran.
You can incrementally load files only from mappings in advanced mode, including mappings that use the Discover Structure or Document file format. Ensure that all of the source files exist in the same cloud environment.
To incrementally load source files, select Incremental File Load and Directory as the source type in the advanced read options of the Microsoft Azure Data Lake Storage Gen2 data object.
When you incrementally load files from Microsoft Azure Data Lake Storage Gen2, the job loads files that have changed from the last load time to five minutes before the job started running. For example, if you run a job at 2:00 p.m., the job loads files that changed before 1:55 p.m. The five-minute buffer ensures that the job loads only complete files, because uploading objects to Microsoft Azure Data Lake Storage Gen2 can take a few minutes to complete.
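In other words, each run selects files whose modification time falls in a window that opens at the last load time and closes five minutes before the job starts. The following Python sketch illustrates the window arithmetic with hypothetical times; the actual bookkeeping is done by the mapping task:

from datetime import datetime, timedelta

last_load_time = datetime(2024, 6, 1, 13, 0)    # previous run completed at 1:00 p.m.
job_start_time = datetime(2024, 6, 1, 14, 0)    # current job starts at 2:00 p.m.

window_end = job_start_time - timedelta(minutes=5)   # 1:55 p.m. safety buffer

def is_loaded(modified_time: datetime) -> bool:
    # A file is picked up if it changed after the last load time and
    # at least five minutes before the current job started.
    return last_load_time < modified_time <= window_end

print(is_loaded(datetime(2024, 6, 1, 13, 30)))   # True
print(is_loaded(datetime(2024, 6, 1, 13, 58)))   # False: inside the five-minute buffer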
When you configure a mapping task, the Incremental File Load section lists the Source transformations that will incrementally load files and the time that the last job completed loading the files. By default, the next job that runs checks for files modified after the last load time.
You can also override the load time that the mapping uses to look for changed files in the specified source directory. You can reset the incremental file load settings to perform a full load of all the changed files in the directory, or you can configure a time that the mapping uses to look for changed files.
A mapping in advanced mode that incrementally loads a directory that contains a complex file format such as JSON fails if there are no new or changed files in the source since the last run.
You can enable full SQL ELT optimization when you want to load data from Microsoft Azure Data Lake Storage Gen2 sources to your data warehouse in Microsoft Azure Synapse SQL. While loading the data to Microsoft Azure Synapse SQL, you can transform the data as per your data warehouse model and requirements. When you enable full SQL ELT optimization on a mapping task, the mapping logic is pushed to the Azure environment to leverage Azure commands. For more information, see the help for Microsoft Azure Synapse SQL Connector.
If you need to load data to any other supported cloud data warehouse, see the connector help for the applicable cloud data warehouse.