Sqoop Mappings in a Hadoop Environment
After you copy the Type 4 JDBC driver .jar files required for Sqoop connectivity to the externaljdbcjars directory, enable Sqoop in a JDBC connection, and import a Sqoop source or Sqoop target, you can create a mapping. You can then run the Sqoop mapping in the Hadoop run-time environment with a Hadoop connection. You can run Sqoop mappings on the Blaze, Spark, and Hive engines.
If you use Cloudera Connector Powered by Teradata or Hortonworks Connector for Teradata, you can run mappings on the Blaze or Spark engines. If you use MapR Connector for Teradata, you can run mappings on the Spark engine.
In the mapping, you can specify additional Sqoop arguments and disable the Sqoop connector.
Note: If you add or delete a Type 4 JDBC driver .jar file required for Sqoop connectivity from the externaljdbcjars directory, the changes take effect after you restart the Data Integration Service. If you run the mapping on the Blaze engine, the changes take effect after you restart the Data Integration Service and Blaze Grid Manager. You do not need to restart the Data Integration Service and Blaze Grid Manager before the first mapping run. The restart is required only for subsequent mapping runs.
Sqoop Mapping-Level Arguments
If a data object uses Sqoop, you can click the corresponding Read transformation or Write transformation in the Sqoop mapping to define the arguments that Sqoop must use to process the data. The Data Integration Service merges the additional Sqoop arguments that you specify in the mapping with the arguments that you specified in the JDBC connection and constructs the Sqoop command.
The Sqoop arguments that you specify in the mapping take precedence over the arguments that you specified in the JDBC connection. However, if you do not enable the Sqoop connector in the JDBC connection but enable the Sqoop connector in the mapping, the Data Integration Service does not run the mapping through Sqoop. The Data Integration Service runs the mapping through JDBC.
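For example, if the JDBC connection specifies -m 4 and you specify -m 8 at the mapping level (both values are hypothetical), the Data Integration Service uses -m 8 when it constructs the Sqoop command.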
You can configure the following Sqoop arguments in a Sqoop mapping:
- m or num-mappers
- split-by
- batch
- infaoptimize
For a complete list of the Sqoop arguments that you can configure, see the Sqoop documentation.
m or num-mappers
The m or num-mappers argument defines the number of map tasks that Sqoop must use to import and export data in parallel.
Use the following syntax:
-m <number of map tasks>
--num-mappers <number of map tasks>
If you configure the m argument or num-mappers argument, you must also configure the split-by argument to specify the column based on which Sqoop must split the work units.
Use the m argument or num-mappers argument to increase the degree of parallelism. You might have to test different values for optimal performance.
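For example, to import data with eight parallel map tasks that split the work units on a hypothetical employee_id column, you might enter the following arguments in the Additional Sqoop Import Arguments text box:
-m 8 --split-by employee_id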
When you configure the m argument or num-mappers argument and run Sqoop mappings on the Spark or Blaze engines, Sqoop dynamically creates partitions based on the file size.
Note: If you configure the num-mappers argument to export data on the Blaze or Spark engine, Sqoop ignores the argument. Sqoop creates map tasks based on the number of intermediate files that the Blaze or Spark engine creates.
split-by
The split-by argument defines the column based on which Sqoop splits work units.
Use the following syntax:
--split-by <column_name>
You can configure the split-by argument to improve performance. If the primary key does not have an even distribution of values between the minimum and maximum range, configure the split-by argument to specify another column that has a balanced distribution of data to split the work units.
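For example, if the primary key values are skewed but a hypothetical customer_id column is evenly distributed, you might enter the following argument:
--split-by customer_id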
If you do not define the split-by column, Sqoop splits work units based on the following criteria:
- If the data object contains a single primary key, Sqoop uses the primary key as the split-by column.
- If the data object contains a composite primary key, Sqoop defaults to the behavior of handling composite primary keys without the split-by argument. See the Sqoop documentation for more information.
- If the data object does not contain a primary key, the value of the m argument and num-mappers argument default to 1.
Rules and Guidelines for the split-by Argument
Consider the following restrictions when you configure the split-by argument:
- If you configure the split-by argument and the split-by column contains NULL values, Sqoop does not import the rows that contain NULL values. However, the mapping runs successfully and no error is written in the YARN log.
- If you configure the split-by argument and the split-by column contains special characters, the Sqoop import process fails.
- The split-by argument is required in the following scenarios:
  - You use Cloudera Connector Powered by Teradata or Hortonworks Connector for Teradata, and the Teradata table does not contain a primary key.
  - You create a custom query to override the default query when you import data from a Sqoop source.
batch
The batch argument indicates that Sqoop must export data in batches.
Use the following syntax:
--batch
You can configure the batch argument to improve performance.
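For example, you might enter the following arguments in the Additional Sqoop Export Arguments text box. In this sketch, --update-key is a standard Sqoop export argument and order_id is a hypothetical key column; use the arguments that apply to your target:
--batch --update-key order_id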
infaoptimize
The infaoptimize argument defines whether you want to disable the performance optimization of Sqoop pass-through mappings on the Spark engine.
When you run a Sqoop pass-through mapping on the Spark engine, the Data Integration Service optimizes mapping performance in the following scenarios:
- You read data from a Sqoop source and write data to a Hive target that uses the Text format.
- You read data from a Sqoop source and write data to an HDFS target that uses the Flat, Avro, or Parquet format.
If you want to disable the performance optimization, set the --infaoptimize argument to false. For example, if you see data type issues after you run an optimized Sqoop mapping, you can disable the performance optimization.
Use the following syntax:
--infaoptimize false
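For example, to disable the optimization for an import that also splits the work units on a hypothetical order_id column, you might enter the following arguments in the Additional Sqoop Import Arguments text box:
--split-by order_id --infaoptimize false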
Configuring Sqoop Properties in the Mapping
You can specify additional Sqoop arguments and disable the Sqoop connector at the mapping level. The Sqoop arguments that you specify at the mapping level take precedence over the arguments that you specified in the JDBC connection.
1. Open the mapping that contains the data object for which you want to configure Sqoop properties.
2. Select the Read or Write transformation that is associated with the data object.
3. Click the Advanced tab.
4. To disable the Sqoop connector for the data object, select the Disable Sqoop Connector check box.
5. Perform one of the following steps:
  - To specify additional Sqoop import arguments for the data object, enter the import arguments in the Additional Sqoop Import Arguments text box.
  - To specify additional Sqoop export arguments for the data object, enter the export arguments in the Additional Sqoop Export Arguments text box.
The Data Integration Service merges the additional Sqoop arguments that you specified in the mapping with the arguments that you specified in the JDBC connection and constructs the Sqoop command. The Data Integration Service then invokes Sqoop on a Hadoop node.
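As a simplified illustration, assume a MySQL source, a hypothetical ORDERS table, and the arguments -m 4 --split-by ORDER_ID specified at the mapping level. The constructed command might resemble the following; the actual command depends on your connection and mapping configuration:
sqoop import --connect jdbc:mysql://dbhost:3306/sales --username sqoop_user --table ORDERS -m 4 --split-by ORDER_ID --target-dir /staging/orders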