Run-time Environment
You can choose native or Hadoop as the run-time environment for a column profile. In the Hadoop run-time environment, you can choose the Blaze engine or the Spark engine. Informatica Analyst sets the run-time environment in the profile definition after you choose it.
Native Environment
When you run a profile in the native run-time environment, the Analyst tool submits the profile jobs to the Profiling Service Module. The Profiling Service Module breaks down the profile jobs into a set of mappings. The Data Integration Service runs these mappings on its own machine and writes the profile results to the profiling warehouse. By default, all profiles run in the native run-time environment.
You can use native sources to create and run profiles in the native environment. A native data source is a non-Hadoop source, such as a flat file, relational source, or mainframe source. You can also run a profile on a mapping specification or a logical data source with a Hive or HDFS data source in the native environment.
Hadoop Environment
You can choose the Blaze engine or the Spark engine to run profiles in the Hadoop run-time environment.
After you choose the Blaze or Spark engine, you can select a Hadoop connection. The Data Integration Service pushes the profile logic to the Blaze or Spark engine on the Hadoop cluster to run the profiles.
When you run a profile in the Hadoop environment, the Analyst tool submits the profile jobs to the Profiling Service Module. The Profiling Service Module breaks down the profile jobs into a set of mappings. The Data Integration Service pushes the mappings to the Hadoop environment through the Hadoop connection. The Blaze engine or Spark engine processes the mappings, and the Data Integration Service writes the profile results to the profiling warehouse.
Column Profiles for Sqoop Data Sources
You can run a column profile on data objects that use Sqoop. After you choose Hadoop as a validation environment, you can select the Blaze engine or Spark engine on the Hadoop connection to run the column profiles.
When you run a column profile on a logical data object or customized data object, you can configure the num-mappers argument to achieve parallelism and optimize performance. You must also configure the split-by argument to specify the column that Sqoop uses to split the work units.
Use the following syntax:
--split-by <column_name>
If the primary key does not have an even distribution of values between the minimum and maximum range, you can configure the split-by argument to specify another column that has a balanced distribution of data to split the work units.
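For example, the following Sqoop arguments use a hypothetical CUSTOMER_ID column as the split-by column and four mappers to process the work units in parallel. Replace the column name and mapper count with values that suit your data:
--split-by CUSTOMER_ID --num-mappers 4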
If you do not define the split-by column, Sqoop uses the primary key column of the data object as the split-by column by default.
When you use Cloudera Connector Powered by Teradata or Hortonworks Connector for Teradata and the Teradata table does not contain a primary key, the split-by argument is required.
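For example, if a hypothetical Teradata table has no primary key, you might specify a column with evenly distributed values, such as an ACCOUNT_ID column, as the split-by column:
--split-by ACCOUNT_ID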