Amazon S3 V2 Connector > Amazon S3 V2 sources and targets > Amazon S3 V2 sources

Amazon S3 V2 sources

You can use an Amazon S3 V2 object as a source in a mapping or amapping task.

Specify the name and description of the Amazon S3 V2 source. Configure the Amazon S3 V2 source and advanced properties for the source object.

Data encryption in Amazon S3 V2 sources

You can decrypt data when you read from an Amazon S3 V2 source.

The following table lists the encryption types that you can use for various file types:

Encryption type	File type
Client-side encryption	Binary¹, Flat
Server-side encryption	Avro, Binary¹, Delta¹, Flat, JSON², ORC, Parquet
Server-side encryption with KMS	Avro, Binary¹, Delta¹, Flat, JSON², ORC, Parquet
Informatica encryption	Binary¹, Flat
¹Doesn't apply to mappings in advanced mode. ²Applies only to mappings in advanced mode.

Client-side encryption for Amazon S3 V2 sources

Client-side encryption is a technique to encrypt data before transmitting the data to the Amazon S3 server.

You can read a client-side encrypted file in an Amazon S3 bucket. To read client-side encrypted files, you must provide a master symmetric key or customer master key in the connection properties. The Secure Agent decrypts the data by using the master symmetric key or customer master key.

When you generate a client-side encrypted file using a third-party tool, metadata for the encrypted file is generated. To read an encrypted file from Amazon S3, you must upload the encrypted file and the metadata for the encrypted file to the Amazon S3 bucket.

You require the following keys in the metadata when you upload the encrypted file:

•Content-Type
•x-amz-meta-x-amz-key
•x-amz-meta-x-amz-unencrypted-content-length
•x-amz-meta-x-amz-matdesc
•x-amz-meta-x-amz-iv

Reading a client-side encrypted file

Perform the following tasks to read a client-side encrypted file:

1Provide the master symmetric key when you create an Amazon S3 V2 connection.

Ensure that you provide a 256-bit AES encryption key in Base64 format.

2Copy the local_policy.jar and US_export_policy.jar files to either of the following directories available within your Secure Agent installation:

- From <Secure Agent installation directory>/jdk/jre/lib/security/policy/unlimited/ to <Secure Agent installation directory>/jdk/jre/lib/security/
- From <Secure Agent installation directory>/jdk8/jre/lib/security/policy/unlimited/ to <Secure Agent installation directory>/jdk8/jre/lib/security/

3Restart the Secure Agent.

Server-side encryption for Amazon S3 V2 sources

Server-side encryption is a technique to encrypt data using Amazon S3-managed encryption keys. Server-side encryption with KMS is a technique to encrypt data using the AWS KMS-managed customer master key.

Server-side encryption: To read a server-side encrypted file, select the encrypted file in the Amazon S3 V2 source.

Server-side encryption with KMS: To read a server-side encrypted file with KMS, specify the AWS KMS-managed customer master key in the Customer Master Key ID connection property and select the encrypted file in the Amazon S3 V2 source.

Note: You do not need to specify the encryption type in the advanced source properties.

Informatica encryption for Amazon S3 V2 sources

You can download a binary or flat source file that is encrypted using the Informatica crypto libraries in the local machine or staging location and decrypt the source files.

Informatica encryption is applicable only when you run mappings on the Secure Agent machine. To read a source file that is encrypted using the Informatica crypto libraries, perform the following tasks:

1Ensure that the organization administrator has permission to Informatica crypto libraries license when you create an Amazon S3 V2 connection.
2Select Informatica Encryption as the encryption type in the advanced source properties.

Note: For Informatica Encryption in the advanced cluster, you must install the Secure Agent on the Amazon EC2 machine.

When you read an Informatica encrypted source file and select the Informatica Encryption as the encryption type, the data preview fails.

To preview the data successfully, select a dummy source file that contains same metadata present in the Informatica encrypted source file that you want to read. Enter the file name of the Informatica encrypted source file in the File Name advanced source property to override the file name of the dummy source file. Then, select Informatica Encryption as the encryption type in the advanced source property.

Note: When you use Informatica encryption in a mapping, you cannot decrypt more than 1000 files.

Source types in Amazon S3 V2 sources

You can select the type of source from which you want to read data.

You can select the following type of sources from the Source Type option under the Amazon S3 V2 advanced source properties:

File: You must enter the bucket name that contains the Amazon S3 file. If applicable, include the folder name that contains the target file in the <bucket_name>/<folder_name> format.; Amazon S3 V2 Connector provides the option to override the value of the Folder Path and File Name properties during run time.

Directory

Reading from multiple files

You can read multiple files, which are of flat format type, from Amazon S3 and write data to a target in a mapping.

You can use the following types of manifest files:

•Custom manifest file
•Amazon Redshift manifest file

Custom manifest file

You can read multiple files, which are of flat format type, from Amazon S3 and write data to a target. To read multiple flat files, all files must be available in the same Amazon S3 bucket.

When you want to read from multiple sources in the Amazon S3 bucket, you must create a .manifest file that contains all the source files with the respective absolute path or directory path. You must specify the .manifest file name in the following format: <file_name>.manifest.

For example, the .manifest file contains source files in the following format:

{
"fileLocations": [
{
"URIs": [
"dir1/dir2/dir3/file_1.csv",
"dir1/dir2/dir3/file_2.csv",
"dir1/file_3.csv"
]
},
{
"URIPrefixes": [
"dir1/dir2/dir3/",
"dir1/dir2/dir4/"
]
},
{
"WildcardURIs": [
"dir1/dir2/dir3/*.csv"
]
}
]
}

The custom manifest file contains the following tags:

•URIs. Specify the path for the files relative to the bucket name.
•URIPrefixes. Specify the path for the directory relative to the bucket name.
•WildcardURIs. Specify an asterisk (*) wildcard in the file name, which are of flat format type, to fetch files from the Amazon S3 bucket. Specify the asterisk (*) wildcard to fetch all the files or only the files that match the name pattern.

You can specify URIs, URIPrefixes, WildcardURIs, or all sections within fileLocations in the .manifest file.

You cannot use the wildcard characters to specify folder names. For example, { "WildcardURIs": [ "multiread_wildcard/dir1*/", "multiread_wildcard/*/" ] }.

The Data Preview tab displays the data of the first file available in the URI specified in the .manifest file. If the URI section is empty, the first file in the folder specified in URIPrefixes is displayed.

Amazon Redshift manifest file

You can use an Amazon Redshift manifest file created by the UNLOAD command to read multiple flat files from Amazon S3. All flat files must have the same metadata and must be available in the same Amazon S3 bucket.

Create a .manifest file and list all the source files with the URL that includes the bucket name and full object path for the file. You must specify the .manifest file name in the following format: <file_name>.manifest.

For example, the Amazon Redshift manifest file contains source files in the following format:

{
"entries": [
{"url": "s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
{"url": "s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
{"url": "s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
{"url": "s3://mybucket-beta/2013-10-05-custdata", "mandatory":true},
]
}

The Redshift manifest file format contains the following tags:

url: The url tag consists of the source file in the following format:
mandatory: Amazon S3 V2 Connector uses the mandatory tag to determine whether to continue reading the files in the .manifest file or not, based on the following scenarios:; By default, the value of mandatory tag is false.

Incrementally loading files

You can incrementally load source files in a directory to read and process only the files that have changed since the last time the mapping task ran.

You can incrementally load files only from mappings in advanced mode. Ensure that all of the source files exist in the same Cloud environment.

To incrementally load source files, select Incremental File Load and Directory as the source type in the advanced read options of the Amazon S3 V2 data object.

When you incrementally load files from Amazon S3, the job loads files that have changed from the last load time to 20 minutes before the job started running. For example, if you run a job at 2:00 p.m, the job loads files changed before 1:45 p.m. The 20-minute buffer ensures that the job loads only complete files, since uploading objects on Amazon S3 can take a few minutes to complete.

When you configure a mapping task, the Incremental File Load section lists the Source transformations that incrementally loads files and the time that the last job completed loading the files. By default, the next job that runs checks for files modified after the last load time.

The image shows the details of incremental file load

You can also override the load time that the mapping uses to look for changed files in the specified source directory. You can reset the incremental file load settings to perform a full load of all the changed files in the directory, or you can configure a time that the mapping uses to look for changed files.

A mapping in advanced mode that incrementally load a directory that contains complex file formats such as Parquet and Avro fails if there are no new or changed files in the source since the last run.

For more information on incremental loading, see Reprocessing incrementally-loaded source files in Tasks.

Wildcard characters

When you run a mapping in advanced mode to read data from an Avro, flat, JSON, ORC, or Parquet file, you can use the ? and * wildcard characters to specify the folder path or file name.

To use wildcard characters for the folder path or file name, select the Allow Wildcard Characters option in the advanced read properties of the Amazon S3 V2 data object.

? (Question mark): The question mark character (?) allows one occurrence of a character. For example, if you enter the source file name as a?b.txt, the Secure Agent reads data from files with the following names:
* (Asterisk): The asterisk mark character (*) allows zero or more than one occurrence of a character. For example, if you enter the source file name as a*b.txt, the Secure Agent reads data from files with the following names:

You can use the asterisk (*) wildcard to fetch all the files or only the files that match the name pattern. Specify the wildcard character in the following format:

•abc*.txt
•abc.*

If you specify abc*.txt, the Secure Agent reads all the file names starting with the term abc and ending with the .txt file extension. If you specify abc.*, the Secure Agent reads all the file names starting with the term abc regardless of the extension.

Rules and guidelines for wildcard characters

Consider the following rules and guidelines when you use wildcard characters to specify the folder path or file name:

•You cannot specify wildcard characters in a bucket name.
•When you specify wildcard characters in a folder path and the Amazon S3 bucket does not contain folders matching the name pattern, the mapping fails.
•When you specify wildcard characters in a file name and the Amazon S3 bucket does not contain files matching the name pattern, the mapping fails.
•When you use wildcard characters in a folder path for a mapping in advanced mode, the Secure Agent reads data from the folders and the files that match the name pattern.

Recursively read files from directories

You can read objects stored in subdirectories in Amazon S3 V2 mappings in advanced mode. You can use recursive read for flat, Avro, JSON, ORC, and Parquet files. The files that you read using recursive read must have the same metadata.

To enable recursive read, select the source type as Directory in the advanced source properties. Enable the Recursive Directory Read advanced source property to read objects stored in subdirectories.

You can also use recursive read when you specify wildcard characters in a folder path or file name. For example, you can use a wildcard character to recursively read files in the following ways:

•Folder path is /abc*/. Returns all files within any folder or subfolder that has a pattern abc at the starting of the folder name.
•Folder path is /abc*/ and file name is myfile*. Returns all files that have a pattern myfile at the starting of the file name within a folder or subfolder and has a pattern abc at the starting of the folder name.

Source partitioning

You can configure fixed partitioning to optimize the mapping performance at run time when you read data from flat, Avro, ORC, or Parquet files. You can configure fixed partitioning only on mappings.

The partition type controls how the agent distributes data among partitions at partition points. With partitioning, the Secure Agent distributes rows of source data based on the number of threads that you define as partition.

Enable partitioning when you configure the Source transformation in the Mapping Designer.

On the Partitions tab for the Source transformation, you select fixed partitioning and enter the number of partitions based on the amount of data that you want to read. By default, the value of the Number of partitions field is one.

The following image shows the configured partitioning:

On the Partitions tab of the Source transformation, the partitioning type is Fixed and the number of partitions is set to 2.

The Secure Agent enables the partition according to the size of the Amazon S3 V2 source file. The file name is appended with a number starting from 0 in the following format: <file name>_<number>

If you enable partitioning and the precision for the source column is less than the maximum data length in that column, you might receive unexpected results. To avoid unexpected results, the precision for the source column must be equal to or greater than the maximum data length in that column for partitioning to work as expected.

Note: If you configure partitioning for an Amazon S3 V2 source in a mapping to read from a manifest file, compressed .gz file, or a read directory file, the Secure Agent ignores the partition. However, the task runs successfully.

Reading source objects path

When you import source objects, the Secure Agent appends a FileName field to the imported source object. The FileName field stores the absolute path of the source file from which the Secure Agent reads the data at run time.

For example, a directory contains a number of files and each file contains multiple records that you want to read. You select the directory as source type in the Amazon S3 V2 source advanced properties. When you run the mapping, the Secure Agent reads each record and stores the absolute path of the respective source file in the FileName field.

The FileName field is applicable to the following file formats:

•Avro
•Binary. Applicable only to mappings.
•ORC
•Parquet

Note: Avoid using FileName as the column name in the source data. FileName is a reserved keyword. The name is case sensitive.

When you use the FileName field in a source object, the Secure Agent reads file names and directory names differently for mappings and mappings in advanced mode.

Feature	Mapping	Mappings in advanced mode
File name	xyz.amazonaws.com/aa.bb.bucket/1024/characterscheckfor1024	s3a://<bucket_name>/customer.avro
Directory name	<absolute path of the file including the file name>	s3a://<bucket_name>/avro/<directory_name>/<file_name>

Note: The FileName field in a source object uses the format with -, by default. For example, s3-us-west-2.amazonaws.com/<bucket_name>/automation/customer.avro.

To change the format for the FileName field to use ., set the JVM option changeS3EndpointForFileNamePort = true. For example, s3.us-west-2.amazonaws.com/<bucket_name>/automation/customer.avro.

SQL ELT optimization

You can enable full SQL ELT optimization when you want to load data from Amazon S3 sources to your data warehouse in Amazon Redshift. While loading the data to Amazon Redshift, you can transform the data as per your data warehouse model and requirements. When you enable full SQL ELT optimization on a mapping task, the mapping logic is pushed to the AWS environment to leverage AWS commands. You cannot configure SQL ELT optimization for a task based on a mapping configured in advanced mode.

For more information on SQL ELT optimization, see the help for Amazon Redshift V2 Connector. If your use case involves loading data to any other supported cloud data warehouse, see the connector help for the applicable cloud data warehouse.