You can ingest streaming data from a supported source to any of the on-premises and cloud targets that Streaming Ingestion and Replication supports.
You can use the following targets in a streaming ingestion and replication task:
•Amazon Kinesis Data Firehose
•Amazon Kinesis Streams
•Amazon S3
•Databricks
•Flat file
•Google BigQuery V2
•Google Cloud Storage
•Google PubSub
•JDBC V2
•Kafka
- Apache Kafka
- Confluent Kafka
- Amazon Managed Streaming (Amazon MSK)
•Microsoft Azure Data Lake Store Gen2
•Microsoft Azure Event Hubs
To determine the connectors to use for these target types, see Connectors and Connections > Streaming Ingestion and Replication connectors.
Amazon Kinesis Data Firehose target
Use a Kinesis target to receive data from a source and write the data to an Amazon Kinesis Data Firehose target. To create a Kinesis target, use the Amazon Kinesis connection type.
Kinesis Firehose is a real-time data stream processing service that Amazon Kinesis offers within the AWS ecosystem. Use Kinesis Firehose to batch, encrypt, and compress data. Kinesis Firehose can automatically scale to meet system needs.
To configure access for Kinesis Firehose as a target, perform the following tasks:
•Create an AWS account with the required IAM permissions for the IAM user to use the AWS Kinesis Data Firehose service.
•Define a Firehose delivery stream. Configure the source as Direct PUT or other sources.
•Grant required permissions to the IAM user credentials based on the target the user is writing to.
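For example, a minimal IAM policy that allows the user to write to a specific delivery stream might look like the following sketch. The region, account ID, and delivery stream name are placeholders, and your environment might require additional permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["firehose:PutRecord", "firehose:PutRecordBatch"],
      "Resource": "arn:aws:firehose:<region>:<account-id>:deliverystream/<delivery-stream-name>"
    }
  ]
}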
Amazon Kinesis Streams target
Use a Kinesis target to receive data from source services and write the data to an Amazon Kinesis Stream. To create a Kinesis target, use the Amazon Kinesis connection type.
Kinesis Streams is a real-time data stream processing service that Amazon Kinesis offers within the AWS ecosystem. Kinesis Streams is a customizable option that you can use to build custom applications to process and analyze streaming data. As Kinesis Streams cannot automatically scale to meet data in-flow demand, you must manually provision enough capacity to meet system needs.
Before you use a Kinesis stream target, perform the following tasks:
1. In the Amazon Kinesis console, create and configure an Amazon Kinesis stream.
2. In the Amazon Web Services (AWS) Identity and Access Management (IAM) service, create a user.
3. Download the access key and secret access key that are generated during the user creation process.
4. Associate the user with a group that has permissions to write to the Kinesis stream.
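For example, a minimal IAM policy for the group might allow the following actions on the stream. The region, account ID, and stream name are placeholders, and your environment might require additional permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["kinesis:DescribeStream", "kinesis:PutRecord", "kinesis:PutRecords"],
      "Resource": "arn:aws:kinesis:<region>:<account-id>:stream/<stream-name>"
    }
  ]
}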
Amazon S3 target
Use an Amazon S3 V2 connector to write the streaming and replication data to an Amazon S3 target.
Amazon Simple Storage Service (Amazon S3) is a storage service in which you can copy data from a streaming source and simultaneously move data to any target. You can use Amazon S3 to transfer the data from a list of configured source connections to an Amazon S3 target. You can accomplish these tasks by using the AWS Management Console web interface.
You can use an Amazon S3 object as a target in a streaming ingestion and replication task. You can configure the Amazon S3 target and advanced properties for a target object.
Data partitioning
The streaming ingestion and replication task can create partitions on the Amazon S3 V2 target and write data to the partitions. To use partitioning, you must select a partitioning interval according to which the task creates the partitions. Based on the selected time interval, the streaming ingestion and replication job saves the incoming message in an /<object name>/<Year>/<month>/<day>/<hour>/<minutes> partition in the Amazon S3 V2 bucket. The streaming ingestion and replication job adds timestamp hierarchy folders in the Amazon S3 bucket.
You can enable partitioning in the following ways when you configure a task:
- Add the ${Timestamp} expression to the object name in the Object Name/Expression field and select a time interval.
For example, enter /streaming/${Timestamp} in the Object Name/Expression field, select a partitioning interval of five minutes, and run the streaming ingestion and replication task at 15:20 on June 8, 2022. The streaming ingestion and replication job saves the incoming message in the /streaming/2022/06/08/15/20 partition in the Amazon S3 V2 bucket. The job saves data that streams during the next time interval under the same hour folder, that is, /streaming/2022/06/08/15/25.
Note: If you add a ${Timestamp} expression to the object name in the Object Name/Expression field and don't select a partitioning interval, the streaming ingestion and replication job saves the objects to the 0/ folder in the Amazon S3 bucket.
- Use a regular expression in the Object Name/Expression field and select a partitioning interval. You don't need to add the ${Timestamp} expression to a regular expression.
For example, run a streaming ingestion and replication task at 12:10 on June 18, 2022 with a regular expression $"SourceTable":"((.*?))"$ and a partitioning interval of five minutes. The incoming data is {"SourceTable":"xyz"}. The streaming ingestion and replication job saves the xyz object in the 2022/06/18/12/10 folder hierarchy in the Amazon S3 bucket. The job saves data that streams during the next time interval under the same hour folder, that is, 2022/06/18/12/15.
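To illustrate, the first example above (with the /streaming/${Timestamp} expression) produces a folder hierarchy in the Amazon S3 V2 bucket similar to the following. The objects within each folder are the messages that arrive during that five-minute interval:
/streaming/2022/06/08/15/20/ (messages that arrive during the first interval)
/streaming/2022/06/08/15/25/ (messages that arrive during the next interval)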
Databricks target
Use a streaming ingestion and replication task to write data to a Databricks target. To create a Databricks target, use the Databricks connection type. The Databricks target requires a Databricks cluster version 6.3 or later.
Databricks stores data in Delta Lake, an open source storage layer that provides ACID transactions and works on top of existing data lakes. Databricks uses proprietary Delta software to manage the stored data and allow fast access to the data.
You can access Delta Lake tables built on top of the following storage types:
•Azure Data Lake Storage (ADLS) Gen2
•Amazon Web Services (AWS) S3
The Databricks target writes data to one or more Delta Lake tables on Databricks. You can use the Databricks target in a streaming ingestion and replication task for the following use cases:
•Ingest bulk data from all streaming sources into Databricks tables
•Merge change data capture (CDC) from all streaming sources and write to Databricks tables
The Databricks connection uses a JDBC URL to connect to the Databricks cluster. When you configure the target, you specify the JDBC URL and credentials to use to connect to the cluster. You also define the connection information that the target uses to connect to the staging location in Amazon S3 or Azure Data Lake Storage Gen2.
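The exact URL depends on the Databricks JDBC driver that the cluster uses, but a typical URL has a shape similar to the following sketch. The workspace host, cluster HTTP path, and personal access token are placeholders:
jdbc:spark://<workspace-host>:443/default;transportMode=http;ssl=1;httpPath=<cluster-http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>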
You specify the tables in Delta Lake to which you want to write the data. The target writes data from record fields to table columns based on matching names.
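For example, assume a Delta Lake table with the hypothetical columns id, name, and amount. The target writes the values of the following record to those columns because the field names match:
{"id": 101, "name": "router", "amount": 49.99}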
Flat file target
Use a streaming ingestion and replication task to write data from various sources to a flat file target. The task writes real-time streaming data from various sources to the file system of the Secure Agent that runs the dataflow.
A streaming ingestion and replication task writes the data to the staging directory with the file name that you provide. When the task adds new content to the file, it precedes the content with a newline character (\n) in the target.
The flat file target performs the file rollover action and you can configure the rollover properties. The file rollover process closes the current file and creates a new file on the basis of the file size, event count, or time. To configure rollover, specify the Rollover Size, Rollover Events Count, or the Rollover Time properties of the target service. The rollover process moves the file from the staging directory to the target and renames the file. The file name format of the renamed file is the original file name with an addition of the time stamp and counter information (yyyy_mm_dd-hh_mm_ss_counter). For example, during rollover, the file streaming.txt is renamed to streaming-2021_08_16-17_17_30_4.txt.
You can implement a combination of the rollover properties. For example, if you set the rollover events count to 1000, the rollover size to 1 GB, and the rollover time to 1 hour, the task rolls the file over when the file reaches a size of 1 GB even if the 1000 events are not accumulated and the 1-hour period has not elapsed.
Google BigQuery V2 target
Use a streaming ingestion and replication task to write data to a Google BigQuery V2 database target. To create a Google BigQuery V2 target, use the Google BigQuery V2 connection type.
You can use a Google BigQuery V2 target to write data to a BigQuery table. The target consumes data in JSON format, either as a simple JSON object or as an array of simple JSON objects. The target ignores records whose fields don't match the table columns of the database.
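For example, for a BigQuery table with the hypothetical columns id, name, and amount, the target accepts either a single record or an array of records:
{"id": 1, "name": "sensor-a", "amount": 20.5}
[{"id": 1, "name": "sensor-a", "amount": 20.5}, {"id": 2, "name": "sensor-b", "amount": 30.0}]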
Before you use Google BigQuery V2 Connector, you must complete the following prerequisite tasks:
•Ensure that you have a Google service account to access Google BigQuery.
•Ensure that you have the client_email, project_id, private_key, and region ID values for the service account. Enter the values in the corresponding Service Account ID, Project ID, Service Account Key, and Region ID connection properties when you create a Google BigQuery V2 connection.
•You must specify the private_key_id and client_id properties in the Provide Optional Properties field. The Google BigQuery V2 connection test fails when you provide these properties. However, you can ignore the failed test connection and run the streaming ingestion and replication job. Use the following format; a combined example appears after this list:
"private_key_id":"<private key ID>" and "client_id":"<client ID>"
• If you want to configure a timeout interval for a Google BigQuery connection, specify the timeout interval property in the Provide Optional Properties field of the connection properties. Use the following format:
"timeout": "<timeout_interval_in_seconds>"
•A table to write the data to must exist before you deploy the streaming ingestion and replication task.
•You must have read and write access to the Google BigQuery datasets that contain the target tables.
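For example, assuming that the Provide Optional Properties field accepts comma-separated key-value pairs, a combined value for the optional properties might look like the following. The IDs and the timeout value are placeholders:
"private_key_id":"<private key ID>","client_id":"<client ID>","timeout":"180"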
Google Cloud Storage V2 target
Use a streaming ingestion and replication task to write data to a Google Cloud Storage target. To create a Google Cloud Storage target, use the Google Cloud Storage V2 connection type.
You can use Google Cloud Storage to stream multimedia, store custom data analytics pipelines, or distribute large data objects to users through direct download. You can write data to Google Cloud Storage for data backup. In the event of a database failure, you can read the data from Google Cloud Storage and restore it to the database.
Google Cloud Storage offers different storage classes based on factors such as data availability, latency, and price. Google Cloud Storage has the following components:
•Projects. In Google Cloud Storage, all resources are stored within a project. A project is a top-level container that stores billing details and user details. You can create multiple projects. A project has a unique project name, project ID, and project number.
•Buckets. Each bucket acts like a container that stores data. You can use buckets to organize and access data. You can create more than one bucket but you cannot nest buckets. You can create multiple folders within a bucket and you can also nest folders. You can define access control lists to manage objects and buckets. An access control list consists of permission and scope entries. Permission defines the access to perform a read or write operation. Scope defines a user or a group who can perform the operation.
•Objects. Objects comprise the data that you upload to Google Cloud Storage. You can create objects in a bucket. Objects consist of object data and object metadata components. The object data is a file that you store in Google Cloud Storage. The object metadata is a collection of name-value pairs that describe object qualities.
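For example, a bucket, an optional folder hierarchy, and an object name combine into a path such as the following; the bucket and folder names are illustrative:
gs://my-streaming-bucket/ingest/2022/06/08/events.json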
Before you use Google Cloud Storage V2 Connector, you must complete the following prerequisite tasks:
1. Ensure that you have a Google service account to access Google Cloud Storage.
2. Ensure that you have the client_email, project_id, and private_key values for the service account. You will need to enter these details when you create a Google Cloud Storage connection in the Administrator.
3. Ensure that you have enabled the Google Cloud Storage JSON API for your service account. Google Cloud Storage V2 Connector uses the Google API to integrate with Google Cloud Storage.
4. Verify that you have write access to the Google Cloud Storage bucket that contains the target file.
5. Ensure that you have enabled a license to use a Cloudera CDH or Hortonworks HDP package in your organization.
When you deploy a streaming ingestion and replication task, the Secure Agent uses the Google Cloud Storage API to perform the specified operation and writes data to Google Cloud Storage files. You can write data into a Google Cloud Storage target. You cannot perform update, upsert, or delete operations on a Google Cloud Storage target.
Google PubSub target
Use a streaming ingestion and replication task to write data to a Google PubSub topic. To create a Google PubSub target, use the Google PubSub connection type.
Google PubSub is an asynchronous messaging service that decouples services that produce events from services that process events. You can use Google PubSub as a messaging-oriented middleware or for event ingestion and delivery for streaming analytics pipelines. Google PubSub offers durable message storage and real-time message delivery with high availability and consistent performance at scale. Google PubSub servers run in all the available Google Cloud regions around the world.
Before you use Google PubSub connector, you must ensure that you meet the following prerequisites:
•Your organization has the Google PubSub Connector license.
•You have a Google service account JSON key to access Google PubSub.
•You have the client_email, client_id, and private_key values for the Google service account. You need these details when you create a Google PubSub connection in Administrator.
In a streaming ingestion and replication task, you can use a Google PubSub target to publish messages to a Google PubSub topic.
JDBC V2 target
Use a streaming ingestion and replication task to write data to a database target. To create a JDBC V2 target, use the JDBC V2 connection type.
You can use a JDBC V2 target to write data to a database table. The target consumes data in JSON format. The target ignores fields that don't map to the table columns of the database.
Consider the following prerequisites before you configure JDBC V2 as a target:
•A table to write the data to must exist before you deploy the streaming ingestion and replication task.
• Copy the database driver files to the following directory:
Note: The JDBC V2 target accepts data only in a simple JSON or an array of a simple JSON format.
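For example, for a target table with the hypothetical columns id and name, the target writes the id and name values from the following record and ignores extra_field because it does not map to a table column:
{"id": 7, "name": "pump", "extra_field": "ignored"}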
Kafka target
Use a streaming ingestion and replication task to write data to a Kafka target. To create a Kafka target, use the Kafka connection type.
Kafka is a publish-subscribe messaging system and an open-source distributed streaming platform. The platform allows systems that generate data to persist their data in real time in a Kafka topic. Any number of systems that need that data in real time can then read the topic. Kafka can serve as an interim staging area for streaming data that different downstream consumer applications consume.
Kafka runs as a cluster that comprises one or more Kafka brokers. Kafka brokers stream data in the form of messages, publish the messages to a topic, subscribe to the messages from a topic, and then write them to the Kafka target.
When you create a Kafka target, you create a Kafka producer to write Kafka messages. You can use each Kafka target in a streaming ingestion and replication job that writes streaming Kafka messages. When you configure a Kafka target, specify the topic to publish the messages to and the IP address and port on which the Kafka broker runs. If a Kafka topic does not exist on the target, you can configure the brokers to create topics automatically when a nonexistent topic is published to, instead of creating the topics manually.
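For example, to let the brokers create topics automatically, set the following property in the broker configuration file (server.properties) of each broker:
auto.create.topics.enable=true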
You can use the same Kafka connection type to create an Amazon Managed Streaming for Apache Kafka (Amazon MSK) or a Confluent Kafka connection. You can then use the Amazon MSK or the Confluent Kafka target in a streaming ingestion and replication task to write messages to an Amazon MSK or a Confluent Kafka target.
Microsoft Azure Data Lake Storage Gen2 target
Use a streaming ingestion and replication task to write data to a Microsoft Azure Data Lake Storage Gen2 target. To create a Microsoft Azure Data Lake Storage Gen2 target, use the Microsoft Azure Data Lake Storage Gen2 connection type.
Microsoft Azure Data Lake Storage Gen2 is a next-generation data lake solution for big data analytics. It stores data in the form of directories and subdirectories, which makes data access and manipulation efficient. You can store data of any size, structure, and format, and process large volumes of data to achieve faster business outcomes. Data scientists and data analysts can use data in the data lake to find specific patterns before you move the analyzed data to a data warehouse. You can use the big data analytics capabilities available on top of Microsoft Azure Blob storage.
The streaming ingestion and replication task writes data to Microsoft Azure Data Lake Storage Gen2 based on the specified conditions.
For more information about Microsoft Azure Data Lake Storage Gen2, see the Microsoft Azure Data Lake Storage Gen2 documentation.
Microsoft Azure Event Hubs target
Use a streaming ingestion and replication task to write data to an Azure Event Hubs target. To create an Azure Event Hubs target, use the Azure Event Hubs connection type.
Azure Event Hubs is a highly scalable data streaming platform and event ingestion service that receives and processes events. Azure Event Hubs can ingest and process large volumes of events with low latency and high reliability. It is a managed service that can handle message streams from a wide range of connected devices and systems.
Any entity that sends data to an event hub is an event publisher. Event publishers can publish events using HTTPS or Kafka 1.0 and later. Event publishers use a Shared Access Signature (SAS) token to identify themselves to an event hub and can have a unique identity or use a common SAS token.
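For example, a SAS-based connection string that a publisher uses typically has the following general form; the namespace, policy name, key, and event hub name are placeholders:
Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy-name>;SharedAccessKey=<key>;EntityPath=<event-hub-name>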
For more information about Event Hubs, see the Microsoft Azure Event Hubs documentation.