
Streaming Ingestion and Replication targets

You can ingest streaming data from a supported source to any of the on-premises and cloud targets that Streaming Ingestion and Replication supports.
You can use the following targets in a streaming ingestion and replication task: Amazon Kinesis Data Firehose, Amazon Kinesis Streams, Amazon S3, Databricks, flat file, Google BigQuery V2, Google Cloud Storage V2, Google PubSub, JDBC V2, Kafka (including Amazon MSK and Confluent Kafka), Microsoft Azure Data Lake Storage Gen2, and Microsoft Azure Event Hubs.
To determine the connectors to use for these target types, see Connectors and Connections > Streaming Ingestion and Replication connectors.

Amazon Kinesis Data Firehose target

Use a Kinesis target to receive data from a source and write the data to an Amazon Kinesis Data Firehose target. To create a Kinesis target, use the Amazon Kinesis connection type.
Kinesis Firehose is a real-time data stream processing service that Amazon Kinesis offers within the AWS ecosystem. Use Kinesis Firehose to batch, encrypt, and compress data. Kinesis Firehose can automatically scale to meet system needs.
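For orientation, the following is a minimal boto3 sketch of the kind of record write that a Firehose delivery stream receives. It is not this product's internal implementation, and the delivery stream name is a hypothetical placeholder.

    import boto3

    # Assumes AWS credentials are available in the environment and that a
    # delivery stream named "si-demo-delivery" (hypothetical) already exists.
    firehose = boto3.client("firehose", region_name="us-east-1")

    response = firehose.put_record(
        DeliveryStreamName="si-demo-delivery",
        Record={"Data": b'{"sensor": "s1", "value": 42}\n'},
    )
    print("record id:", response["RecordId"])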
To configure access for Kinesis Firehose as a target, perform the following tasks:

Amazon Kinesis Streams target

Use a Kinesis target to receive data from source services and write the data to an Amazon Kinesis Stream. To create a Kinesis target, use the Amazon Kinesis connection type.
Kinesis Streams is a real-time data stream processing service that Amazon Kinesis offers within the AWS ecosystem. Kinesis Streams is a customizable option that you can use to build custom applications to process and analyze streaming data. As Kinesis Streams cannot automatically scale to meet data in-flow demand, you must manually provision enough capacity to meet system needs.
Before you use a Kinesis stream target, perform the following tasks, as sketched in the example after this list:
  1. In the Amazon Kinesis console, create and configure an Amazon Kinesis stream.
  2. In the Amazon Web Services (AWS) Identity and Access Management (IAM) service, create a user.
  3. Download the access key and secret access key that are generated during the user creation process.
  4. Associate the user with a group that has permissions to write to the Kinesis stream.
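The console steps above can also be scripted. The following boto3 sketch creates a stream, creates a user, and attaches a minimal write policy. The stream name, user name, policy name, and account ID are hypothetical placeholders, and an inline policy stands in for the group-based permissions; the generated access key pair is what you later supply in the Amazon Kinesis connection.

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    iam = boto3.client("iam")

    # 1. Create and configure the Kinesis stream (hypothetical name).
    kinesis.create_stream(StreamName="si-demo-stream", ShardCount=2)

    # 2-3. Create the IAM user and generate its access key pair.
    iam.create_user(UserName="si-demo-writer")
    keys = iam.create_access_key(UserName="si-demo-writer")["AccessKey"]
    print(keys["AccessKeyId"], keys["SecretAccessKey"])

    # 4. Grant write access to the stream (inline policy used here for brevity
    #    instead of a group).
    write_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["kinesis:PutRecord", "kinesis:PutRecords",
                       "kinesis:DescribeStream"],
            "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/si-demo-stream",
        }],
    }
    iam.put_user_policy(UserName="si-demo-writer",
                        PolicyName="si-demo-write",
                        PolicyDocument=json.dumps(write_policy))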

Amazon S3 target

Use an Amazon S3 V2 connector to write the streaming and replication data to an Amazon S3 target.
Amazon Simple Storage Service (Amazon S3) is a storage service in which you can copy data from a streaming source and simultaneously move the data to any target. You can use Amazon S3 to transfer data from a list of configured source connections to an Amazon S3 target. You can accomplish these tasks by using the AWS Management Console web interface.
You can use an Amazon S3 object as a target in a streaming ingestion and replication task. You can configure the Amazon S3 target and advanced properties for a target object.
Data partitioning
The streaming ingestion and replication task can create partitions on the Amazon S3 V2 target and write data to the partitions. To use partitioning, you must select a partitioning interval according to which the task creates the partitions. Based on the selected time interval, the streaming ingestion and replication job saves the incoming message in an /<object name>/<Year>/<month>/<day>/<hour>/<minutes> partition in the Amazon S3 V2 bucket. The streaming ingestion and replication job adds timestamp hierarchy folders in the Amazon S3 bucket.
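For illustration, the following sketch shows how a timestamp-based partition path of that shape could be built and written with boto3. The bucket, object name, and payload are hypothetical placeholders; in practice, the streaming ingestion and replication job creates the folder hierarchy itself.

    from datetime import datetime, timezone
    import boto3

    s3 = boto3.client("s3")
    now = datetime.now(timezone.utc)

    # <object name>/<Year>/<month>/<day>/<hour>/<minutes>/<file>
    key = "orders/{:%Y/%m/%d/%H/%M}/events.json".format(now)

    s3.put_object(Bucket="si-demo-bucket", Key=key,
                  Body=b'{"order_id": 101, "amount": 25.0}\n')
    print("wrote s3://si-demo-bucket/" + key)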
You can enable partitioning in the following ways when you configure a task:
For more information, see Amazon S3 target properties.

Databricks target

Use a streaming ingestion and replication task to write data to a Databricks target. To create a Databricks target, use the Databricks connection type. The Databricks target requires a Databricks cluster version 6.3 or later.
Databricks works on top of existing data lakes and uses Delta Lake, an open source storage layer that provides ACID transactions, to manage stored data and allow fast access to the data.
You can access Delta Lake tables built on Amazon S3 or Microsoft Azure Data Lake Storage Gen2 storage.
The Databricks target writes data to one or more Delta Lake tables on Databricks. You can use the Databricks target in a streaming ingestion and replication task for the following use cases:
The Databricks connection uses a JDBC URL to connect to the Databricks cluster. When you configure the target, you specify the JDBC URL and credentials to use to connect to the cluster. You also define the connection information that the target uses to connect to the staging location in Amazon S3 or Azure Data Lake Storage Gen2.
You specify the tables in Delta Lake to which you want to write the data. The target writes data from record fields to table columns based on matching names.
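As an illustration of the name-based mapping, the following sketch uses the Databricks SQL Connector for Python to insert a record into a Delta Lake table whose column names match the record fields. The host name, HTTP path, access token, and table name are hypothetical placeholders, not values from this product, and the connection details correspond to what a JDBC URL would encode.

    from databricks import sql  # pip install databricks-sql-connector

    conn = sql.connect(
        server_hostname="adb-1234567890123456.7.azuredatabricks.net",
        http_path="/sql/1.0/warehouses/abc123",
        access_token="dapiXXXXXXXX",
    )

    with conn.cursor() as cursor:
        # Columns are listed explicitly so each value lands in the matching column.
        cursor.execute(
            "INSERT INTO demo.events (event_id, payload, event_time) "
            "VALUES ('e-1', '{\"sensor\": \"s1\"}', current_timestamp())"
        )
    conn.close()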

Flat file target

Use a streaming ingestion and replication task to write data from various sources to a flat file target. The task writes real-time streaming data from various sources to the file system of the Secure Agent that runs the dataflow.
A streaming ingestion and replication task writes the data to the staging directory with the file name that you provide. When the task adds new content to the file, it precedes the content with a newline character (\n) in the target.
The flat file target performs the file rollover action, and you can configure the rollover properties. The file rollover process closes the current file and creates a new file based on file size, event count, or time. To configure rollover, specify the Rollover Size, Rollover Events Count, or Rollover Time properties of the target service. The rollover process moves the file from the staging directory to the target and renames it. The renamed file uses the original file name with the time stamp and counter appended in the format yyyy_mm_dd-hh_mm_ss_counter. For example, during rollover, the file streaming.txt is renamed to streaming-2021_08_16-17_17_30_4.txt.
You can combine the rollover properties; the task rolls the file over when the first configured condition is met. For example, if you set the rollover events count to 1000, the rollover size to 1 GB, and the rollover time to 1 hour, the task rolls the file over as soon as the file reaches 1 GB, even if 1000 events have not accumulated and the 1-hour period has not elapsed.
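A small sketch of the renaming convention described above, assuming the original file name, rollover timestamp, and counter are known:

    import os
    from datetime import datetime

    def rollover_name(file_name: str, rolled_at: datetime, counter: int) -> str:
        """Build the rolled-over name: <name>-yyyy_mm_dd-hh_mm_ss_counter.<ext>."""
        base, ext = os.path.splitext(file_name)
        stamp = rolled_at.strftime("%Y_%m_%d-%H_%M_%S")
        return "{}-{}_{}{}".format(base, stamp, counter, ext)

    # streaming.txt rolled over on 2021-08-16 17:17:30 with counter 4
    print(rollover_name("streaming.txt", datetime(2021, 8, 16, 17, 17, 30), 4))
    # -> streaming-2021_08_16-17_17_30_4.txt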

Google BigQuery V2 target

Use a streaming ingestion and replication task to write data to a Google BigQuery V2 database target. To create a Google BigQuery V2 target, use the Google BigQuery V2 connection type.
You can use a Google BigQuery V2 target to write data to a BigQuery table. The target consumes data in the JSON format. The target ignores the records if the fields don't match the table columns of the database. The Google BigQuery V2 target accepts data in a simple JSON format or in an array of a simple JSON format.
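For illustration, the accepted payload shapes look like the following. The field names, project, dataset, and table are hypothetical, and the google-cloud-bigquery streaming call is shown only as an example of a name-matched insert, not as the task's internal mechanism.

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    # Simple JSON record, or an array of simple JSON records, with field
    # names that match the BigQuery table columns.
    single = {"device_id": "d1", "temperature": 21.5}
    batch = [
        {"device_id": "d1", "temperature": 21.5},
        {"device_id": "d2", "temperature": 19.8},
    ]

    client = bigquery.Client(project="si-demo-project")
    errors = client.insert_rows_json("si-demo-project.telemetry.readings", batch)
    if errors:
        print("rows with mismatched fields were not inserted:", errors)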
Before you use Google BigQuery V2 Connector, you must complete the following prerequisite tasks:

Google Cloud Storage V2 target

Use a streaming ingestion and replication task to write data to a Google Cloud Storage target. To create a Google Cloud Storage target, use the Google Cloud Storage V2 connection type.
You can use Google Cloud Storage to stream multimedia, store data for custom data analytics pipelines, or distribute large data objects to users through direct download. You can write data to Google Cloud Storage for data backup. In the event of a database failure, you can read the data from Google Cloud Storage and restore it to the database.
Google Cloud Storage offers different storage classes based on factors such as data availability, latency, and price. Google Cloud Storage has the following components:
Before you use Google Cloud Storage V2 Connector, you must complete the following prerequisite tasks:
  1. Ensure that you have a Google service account to access Google Cloud Storage.
  2. Ensure that you have the client_email, project_id, and private_key values for the service account. You will need to enter these details when you create a Google Cloud Storage connection in the Administrator.
  3. Ensure that you have enabled the Google Cloud Storage JSON API for your service account. Google Cloud Storage V2 Connector uses the Google API to integrate with Google Cloud Storage.
  4. Verify that you have write access to the Google Cloud Storage bucket that contains the target file.
  5. Ensure that you have enabled a license to use a Cloudera CDH or Hortonworks HDP package in your organization.
When you deploy a streaming ingestion and replication task, the Secure Agent uses the Google Cloud Storage API to perform the specified operation and writes data to Google Cloud Storage files. You can write data into a Google Cloud Storage target. You cannot perform update, upsert, or delete operations on a Google Cloud Storage target.
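A minimal google-cloud-storage sketch of a write-only upload follows. The service account key file, bucket, and object path are hypothetical placeholders; the key file carries the client_email, project_id, and private_key values mentioned in the prerequisites.

    from google.cloud import storage  # pip install google-cloud-storage

    # Authenticate with the service account used by the Google Cloud Storage
    # connection (hypothetical key file).
    client = storage.Client.from_service_account_json("si-demo-sa.json")

    bucket = client.bucket("si-demo-bucket")
    blob = bucket.blob("ingested/2021/08/16/events.json")

    # Write only; the target does not update, upsert, or delete objects.
    blob.upload_from_string('{"order_id": 101, "amount": 25.0}\n',
                            content_type="application/json")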

Google PubSub target

Use a streaming ingestion and replication task to write data to a Google PubSub topic. To create a Google PubSub target, use the Google PubSub connection type.
Google PubSub is an asynchronous messaging service that decouples services that produce events from services that process events. You can use Google PubSub as messaging-oriented middleware or for event ingestion and delivery in streaming analytics pipelines. Google PubSub offers durable message storage and real-time message delivery with high availability and consistent performance at scale. Google PubSub servers run in all the available Google Cloud regions around the world.
Before you use Google PubSub connector, you must ensure that you meet the following prerequisites:
In a streaming ingestion and replication task, you can use a Google PubSub target to publish messages to a Google PubSub topic.
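A minimal publish sketch with the google-cloud-pubsub client library; the project and topic names are hypothetical placeholders.

    from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("si-demo-project", "si-demo-topic")

    # Publish a message and block until the server acknowledges it.
    future = publisher.publish(topic_path, b'{"event": "login", "user": "u1"}')
    print("published message id:", future.result())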

JDBC V2 target

Use a streaming ingestion and replication task to write data to a database target. To create a JDBC V2 target, use the JDBC V2 connection type.
You can use a JDBC V2 target to write data to a database table. The target consumes data in JSON format. The target ignores fields that don't map to the table columns of the database.
Consider the following prerequisites before you configure JDBC V2 as a target:
Note: The JDBC V2 target accepts data only in a simple JSON or an array of a simple JSON format.
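For illustration, and assuming that simple JSON means a flat object without nested structures, the accepted payload shapes look like the following. The field names are hypothetical and must map to columns in the target table, because unmapped fields are ignored.

    import json

    # Simple JSON record.
    single = {"order_id": 101, "status": "NEW", "amount": 25.0}

    # Array of simple JSON records.
    batch = [
        {"order_id": 101, "status": "NEW", "amount": 25.0},
        {"order_id": 102, "status": "SHIPPED", "amount": 49.9},
    ]

    print(json.dumps(single))
    print(json.dumps(batch))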

Kafka target

Use a streaming ingestion and replication task to write data to a Kafka target. To create a Kafka target, use the Kafka connection type.
Kafka is a publish-subscribe messaging system. It is an open-source distributed streaming platform. Systems that generate data can persist their data in real time in a Kafka topic, and any number of systems that need the data in real time can then read the topic. Kafka can serve as an interim staging area for streaming data that different downstream consumer applications consume.
Kafka runs as a cluster of one or more Kafka brokers. Kafka brokers stream data in the form of messages: they publish messages to a topic, subscribe to messages from a topic, and write the messages to the Kafka target.
When you create a Kafka target, you create a Kafka producer to write Kafka messages. You can use each Kafka target in a streaming ingestion and replication job that writes streaming Kafka messages. When you configure a Kafka target, specify the topic to publish the messages to and the IP address and port on which the Kafka broker runs. If a Kafka topic does not exist in the target, you can either create it manually or configure the brokers to auto-create topics when a message is published to a nonexistent topic.
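A minimal kafka-python producer sketch that targets a broker and topic. The broker address, port, and topic name are hypothetical placeholders, and auto.create.topics.enable is a standard Kafka broker setting rather than a property of this product.

    from kafka import KafkaProducer  # pip install kafka-python

    # Broker address and port that the Kafka target points to (hypothetical).
    producer = KafkaProducer(bootstrap_servers="10.0.0.15:9092")

    # Publish to the topic; if auto.create.topics.enable=true is set on the
    # brokers, a nonexistent topic is created on first publish.
    producer.send("si-demo-topic", b'{"sensor": "s1", "value": 42}')
    producer.flush()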
You can use the same Kafka connection to create an Amazon Managed Streaming for Apache Kafka (Amazon MSK) or a Confluent Kafka connection. You can then use the Amazon MSK or the Confluent Kafka target in a streaming ingestion and replication task to write messages to an Apache Kafka or a Confluent Kafka target.

Microsoft Azure Data Lake Storage Gen2 target

Use a streaming ingestion and replication task to write data to a Microsoft Azure Data Lake Storage Gen2 target. To create a Microsoft Azure Data Lake Storage Gen2 target, use the Microsoft Azure Data Lake Storage Gen2 connection type.
Microsoft Azure Data Lake Storage Gen2 is a next-generation data lake solution for big data analytics that is built on Microsoft Azure Blob storage. You can store data in the form of directories and subdirectories, making it efficient for data access and manipulation. You can store data of any size, structure, and format, and process large volumes of data to achieve faster business outcomes. Data scientists and data analysts can use data in the data lake to identify specific patterns before you move the analyzed data to a data warehouse.
The streaming ingestion and replication task writes data to Microsoft Azure Data Lake Storage Gen2 based on the specified conditions.
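A minimal azure-storage-file-datalake sketch that writes one file into a directory hierarchy; the storage account, credential, file system, and path are hypothetical placeholders.

    from azure.storage.filedatalake import DataLakeServiceClient
    # pip install azure-storage-file-datalake

    service = DataLakeServiceClient(
        account_url="https://sidemostorage.dfs.core.windows.net",
        credential="<account-key>",  # hypothetical placeholder
    )

    file_system = service.get_file_system_client("ingested")
    file_client = file_system.get_file_client("orders/2021/08/16/events.json")

    # Creates the file at the given path and writes the payload.
    file_client.upload_data(b'{"order_id": 101, "amount": 25.0}\n', overwrite=True)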
For more information about Microsoft Azure Data Lake Storage Gen2, see the Microsoft Azure Data Lake Storage Gen2 documentation.

Microsoft Azure Event Hubs target

Use a streaming ingestion and replication task to write data to an Azure Event Hubs target. To create an Azure Event Hubs target, use the Azure Event Hubs connection type.
Azure Event Hubs is a highly scalable data streaming platform and event ingestion service that receives and processes events. Azure Event Hubs can ingest and process large volumes of events with low latency and high reliability. It is a managed service that can handle message streams from a wide range of connected devices and systems.
Any entity that sends data to an event hub is an event publisher. Event publishers can publish events using HTTPS or Kafka 1.0 and later. Event publishers use a Shared Access Signature (SAS) token to identify themselves to an event hub and can have a unique identity or use a common SAS token.
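A minimal azure-eventhub publisher sketch that uses a SAS-based connection string; the namespace, policy name, key, and event hub name are hypothetical placeholders.

    from azure.eventhub import EventHubProducerClient, EventData
    # pip install azure-eventhub

    conn_str = (
        "Endpoint=sb://si-demo-ns.servicebus.windows.net/;"
        "SharedAccessKeyName=send-policy;SharedAccessKey=<key>"
    )

    producer = EventHubProducerClient.from_connection_string(
        conn_str, eventhub_name="si-demo-hub"
    )

    with producer:
        batch = producer.create_batch()
        batch.add(EventData(b'{"sensor": "s1", "value": 42}'))
        producer.send_batch(batch)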
For more information about Event Hubs, see the Microsoft Azure Event Hubs documentation.