Kafka Data Objects
A Kafka data object is a physical data object that represents data in a Kafka stream. After you configure a Messaging connection, create a Kafka data object to read from Apache Kafka brokers.
Kafka runs as a cluster of one or more servers, each of which is called a broker. Kafka brokers stream data in the form of messages, which are published to a topic.
Kafka topics are divided into partitions. Spark Streaming can read the partitions of a topic in parallel, which improves throughput and allows you to scale the number of messages processed. Message ordering is guaranteed only within a partition. For optimal performance, use multiple partitions. You can create or import a Kafka data object.
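The per-partition ordering guarantee follows from how messages are assigned to partitions: with the default partitioner, messages that share a key hash to the same partition. A minimal sketch (Kafka's default partitioner uses murmur2; crc32 is used here only for illustration, and the key is hypothetical):

```python
import zlib

def assign_partition(key: bytes, num_partitions: int) -> int:
    # Sketch of key-hash partitioning. Kafka's default partitioner
    # uses murmur2; crc32 stands in for it here.
    return zlib.crc32(key) % num_partitions

# Messages that share a key always land in the same partition,
# so their relative order is preserved within that partition.
partitions = [assign_partition(b"order-42", 4) for _ in range(3)]
print(partitions)
```

Because ordering holds only within a partition, consumers that need ordered processing should key related messages identically.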
When you configure the Kafka data object, specify the name of the topic that you read from or write to. Only when you read from Kafka can you specify either the topic name or a regular expression for the topic name pattern. To subscribe to multiple topics that match a pattern, specify a regular expression. When you run the application on the cluster, the pattern is matched against topics before the application runs. If you add a topic that matches the pattern while the application is already running, the application does not read from the new topic.
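The pattern-matching behavior described above can be sketched with a short example; the topic names and pattern are hypothetical:

```python
import re

# Hypothetical topics that exist when the application starts.
topics = ["sales.us", "sales.eu", "inventory.us"]

pattern = re.compile(r"sales\..*")

# Pattern matching happens once, before the application runs ...
subscribed = [t for t in topics if pattern.fullmatch(t)]
print(subscribed)  # ['sales.us', 'sales.eu']

# ... so a topic created later is not picked up by the running application.
topics.append("sales.apac")
assert "sales.apac" not in subscribed
```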
After you create a Kafka data object, create a read operation. You can use the Kafka data object read operation as a source in streaming mappings. If you want to configure high availability for the mapping, ensure that the Kafka cluster is highly available. You can also read from a Kerberized Kafka cluster.
When you configure the data operation read properties, you can specify the time from which the Kafka source starts reading Kafka messages from a Kafka topic.
When you configure the data operation properties, specify the format in which the Kafka data object reads data. You can specify XML, JSON, Avro, or Flat as the format. When you specify the XML format, you must provide an XSD file. When you specify the Avro format, provide a sample Avro schema in an .avsc file. When you specify the JSON or Flat format, you must provide a sample file.
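An .avsc file is an Avro schema expressed as plain JSON. As an illustration, a minimal schema with hypothetical field names can be loaded and inspected like this:

```python
import json

# A minimal, hypothetical Avro schema as it might appear in an .avsc file.
avsc = """
{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "ts",      "type": "long"}
  ]
}
"""

schema = json.loads(avsc)
print([f["name"] for f in schema["fields"]])  # ['user_id', 'ts']
```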
You can pass any payload format directly from source to target in streaming mappings. Project columns in binary format to pass a payload from source to target in its original form, or to pass a payload format that is not supported.
Streaming mappings can read, process, and write hierarchical data. You can use array, struct, and map complex data types to process the hierarchical data. You assign complex data types to ports in a mapping to flow hierarchical data. Ports that flow hierarchical data are called complex ports.
For more information about processing hierarchical data, see the Informatica Big Data Management User Guide.
For more information about Kafka clusters, Kafka brokers, and partitions, see
http://kafka.apache.org/082/documentation.html.
Kafka Data Object Overview Properties
The Data Integration Service uses overview properties when it reads data from or writes data to a Kafka broker.
Overview properties include general properties that apply to the Kafka data object. They also include object properties that apply to the resources in the Kafka data object. The Developer tool displays overview properties for Kafka messages in the Overview view.
General Properties
The following table describes the general properties that you configure for Kafka data objects:
Property | Description |
---|---
Name | The name of the Kafka data object. |
Description | The description of the Kafka data object. |
Connection | The name of the Kafka connection. |
Objects Properties
The following table describes the objects properties that you configure for Kafka data objects:
Property | Description |
---|---
Name | The name of the topic or topic pattern of the Kafka data object. |
Description | The description of the Kafka data object. |
Native Name | The native name of the Kafka data object. |
Path Information | The type and name of the topic or topic pattern of the Kafka data object. |
Column Properties
The following table describes the column properties that you configure for Kafka data objects:
Property | Description |
---|---
Name | The name of the column. |
Native Name | The native name of the column. |
Type | The native data type of the column. |
Precision | The maximum number of significant digits for numeric data types, or the maximum number of characters for string data types. |
Scale | The scale of the data type. |
Description | The description of the Kafka data object. |
Access Type | The type of access the port or column has. |
Kafka Data Object Read Operation Properties
The Data Integration Service uses read operation properties when it reads data from a Kafka broker.
General Properties
The Developer tool displays general properties for Kafka sources in the Read view.
The following table describes the general properties that you view for Kafka sources:
Property | Description |
---|---
Name | The name of the Kafka broker. This property is read-only. You can edit the name in the Overview view. When you use the Kafka broker as a source in a mapping, you can edit the name in the mapping. |
Description | The description of the Kafka broker. |
Ports Properties
Ports properties for a physical data object include port names and port attributes such as data type and precision.
The following table describes the ports properties that you configure for Kafka broker sources:
Property | Description |
---|---
Name | The name of the resource. |
Type | The native data type of the resource. |
Precision | The maximum number of significant digits for numeric data types, or the maximum number of characters for string data types. |
Scale | The scale of the data type. |
Description | The description of the resource. |
Run-time Properties
The run-time properties display the name of the connection.
The following table describes the run-time property that you configure for Kafka sources:
Property | Description |
---|---
Connection | Name of the Kafka connection. |
Advanced Properties
The Developer tool displays the advanced properties for Kafka sources in the Output transformation in the Read view.
The following table describes the advanced properties that you can configure for Kafka sources:
Property | Description |
---|---
Operation Type | Specifies the type of data object operation. This is a read-only property. |
Guaranteed Processing | Guaranteed processing ensures that the mapping processes messages published by the sources and delivers them to the targets at least once. In the event of a failure, messages might be duplicated, but they are processed successfully. If the external source or target is not available, the mapping execution stops to avoid data loss. Select this option to avoid data loss if the Kafka brokers fail. |
Start Position Offset | The time from which the Kafka source starts reading Kafka messages from a Kafka topic. You can select one of the following options:
- Custom. Read messages from a specific time.
- Earliest. Read the earliest messages available on the Kafka topic.
- Latest. Read messages received by the Kafka topic after the mapping has been deployed.
This property is applicable for Kafka versions 0.10.1.0 and later. |
Custom Start Position Timestamp | The time in GMT from which the Kafka source starts reading Kafka messages from a Kafka topic. Specify a time in the following format: dd-MM-yyyy HH:mm:ss.SSS. The milliseconds are optional. This property is applicable for Kafka versions 0.10.1.0 and later. |
Consumer Configuration Properties | The configuration properties for the consumer. If the Kafka data object reads data from a Kafka cluster that is configured for Kerberos authentication, include the following properties: security.protocol=SASL_PLAINTEXT,sasl.kerberos.service.name=kafka,sasl.mechanism=GSSAPI |
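As a sketch, the custom start position timestamp and the comma-separated consumer configuration string from the table above can be parsed as follows; the helper function names are illustrative, not part of the product:

```python
from datetime import datetime, timezone

def parse_start_position(ts: str) -> int:
    # Format dd-MM-yyyy HH:mm:ss.SSS, milliseconds optional,
    # interpreted as GMT; returns epoch milliseconds.
    fmt = "%d-%m-%Y %H:%M:%S.%f" if "." in ts else "%d-%m-%Y %H:%M:%S"
    dt = datetime.strptime(ts, fmt).replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

def parse_consumer_config(props: str) -> dict:
    # "k1=v1,k2=v2" -> {"k1": "v1", "k2": "v2"}
    return dict(p.split("=", 1) for p in props.split(","))

print(parse_start_position("01-01-2020 00:00:00"))  # 1577836800000
cfg = parse_consumer_config(
    "security.protocol=SASL_PLAINTEXT,"
    "sasl.kerberos.service.name=kafka,sasl.mechanism=GSSAPI")
print(cfg["security.protocol"])  # SASL_PLAINTEXT
```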
Sources Properties
The sources properties list the resources of the Kafka data object.
The following table describes the sources property that you can configure for Kafka sources:
Property | Description |
---|---
Sources | The sources which the Kafka data object reads from. You can add or remove sources. |
Column Projection Properties
The Developer tool displays the column projection properties in the Properties view of the Read operation.
To specify column projection properties, double-click the read operation and select the data object. The following table describes the column projection properties that you configure for Kafka sources:
Property | Description |
---|---
Column Name | The name of the field that contains data. This property is read-only. |
Type | The native data type of the source. This property is read-only. |
Enable Column Projection | Indicates that you use a schema to read the data that the source streams. By default, the data is streamed in binary format. To change the format in which the data is streamed, select this option and specify the schema format. |
Schema Format | The format in which the source streams data. Select one of the following formats: XML, JSON, Avro, or Flat. |
Schema | Specify the XSD schema for the XML format, a sample file for the JSON format, or an .avsc file for the Avro format. For the Flat file format, configure the schema to associate a flat file with the Kafka source. When you provide a sample file, the Data Integration Service uses the UTF-8 code page when reading the data. |
Column Mapping | The mapping of source data to the data object. Click View to see the mapping. |
Project Column as Complex Data Type | Project columns as complex data type for hierarchical data. For more information, see the Informatica Big Data Management User Guide. |
Configuring Schema for Flat Files
Configure schema for flat files when you configure column projection properties.
1. On the Column Projection tab, enable column projection and select the flat schema format.
The column projection properties page appears.
2. On the column projection properties page, configure the following properties:
- Sample Metadata File. Select a sample file.
- Code page. Select the UTF-8 code page.
- Format. Format in which the source processes data. Default value is Delimited. You cannot change it.
3. Click Next.
4. In the delimited format properties page, configure the following properties:
Property | Description |
---|---
Delimiters | Specify the character that separates entries in the file. Default is a comma (,). You can specify only one delimiter at a time. If you select Other, the custom delimiter must be a single character. |
Text Qualifier | Specify the character used to enclose text that should be treated as a single entry. Use a text qualifier to disregard the delimiter character within the text. Default is No quotes. |
Preview Options | Specify the escape character. The escape character must be a single character. The row delimiter is not applicable because only one row is created at a time. |
Maximum rows to preview | Specify the rows of data you want to preview. |
5. Click Next to preview the flat file data object.
If required, you can change the column attributes. The timestampWithTZ data type is not supported.
6. Click Finish.
The data object opens in the editor.
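The delimiter and text qualifier settings from the procedure above behave like standard delimited-file parsing: a delimiter inside a qualified field is not treated as a separator. A sketch with Python's csv module and a hypothetical record:

```python
import csv
import io

# One hypothetical delimited record: comma delimiter, double-quote qualifier.
line = 'widget,"1,299.00",in stock\n'

reader = csv.reader(io.StringIO(line), delimiter=",", quotechar='"')
row = next(reader)
print(row)  # ['widget', '1,299.00', 'in stock']
# The comma inside the qualified field "1,299.00" is kept as data.
```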