Data Domain Discovery Options in Informatica Analyst
Use the data domain discovery options to choose the columns, data domains, and inference options for data domain discovery. Inference options include choosing whether you want to run data domain discovery based on a rule on column data, column name, or both.
Data Domain Column Selection in Informatica Analyst
You can click Edit in the Specify Settings screen to choose the columns you want to run as a part of data domain discovery. You can view all the columns in the data source in the Select Source screen in the profile wizard. You can choose different columns for column profile and data domain discovery.
The following table describes the Edit dialog box properties for data domain discovery:
Option | Description |
---|---|
Name | Displays the column name. |
Type | Displays the documented data type of the column. |
Precision | Displays the maximum precision for the column. |
Scale | Displays the scale of the column. |
Nullable | Indicates a column that can have null values. |
Key | Indicates whether the column is documented as a primary key or foreign key. |
Data Domain Selection in Informatica Analyst
The Data Domain pane in the Specify Settings screen lists all the data domains from the data domain glossary. You can choose the data domains you want to run as a part of data domain discovery.
The following table describes the Data Domain properties for data domain discovery:
Option | Description |
---|---|
Name | Displays the data domain name. You can choose one or more data domains or data domain groups. |
Description | Displays the description for the data domain. |
DomainGroups | Displays the name of the data domain group to which the data domain belongs. |
Data Domain Inference Options in Informatica Analyst
Inference options determine whether data domain discovery runs on column data, column names, or both. You can also specify the maximum number of source rows that the profile can analyze, choose a conformance criterion, and exclude null values from data domain discovery. You set the data domain inference options in the Specify Settings screen in the profile wizard.
The following table describes the inference options for data domain discovery:
Option | Description |
---|---|
Data | Runs the profile on column data. |
Columns | Runs the profile on column titles. |
Data and Columns | Runs the profile on both column data and column titles. |
Minimum percentage of rows | The minimum conformance percentage of rows in the data set required for a data domain match. |
Minimum number of rows | The minimum number of rows in the data set required for a data domain match. |
Exclude null values for data domain discovery | Excludes the null values from the data set for data domain discovery. |
Edit | Select the columns for data domain discovery. |
All Rows | Runs the profile on all rows from the source. |
Sample first | Choose the maximum number of rows that the profile can run on. The Analyst tool chooses the rows starting from the first row in the source. |
Random sample | Choose a random sample of rows from the data source. |
Random sample (auto) | The Analyst tool chooses a random sample of rows based on the size of the data source. |
Exclude approved data types and data domains from the data type and data domain inference in the subsequent profile runs | Excludes approved data types and data domains from data type and data domain inference in subsequent profile runs. |
Minimum Conformance Percentage
You can choose a minimum percentage of rows in the data set as the conformance criterion for data domain discovery.
The conformance percentage is the number of matching rows divided by the total number of rows, expressed as a percentage.
Note: The Analyst tool considers null values as nonmatching rows. Columns containing a high number of null values might not result in data domain inference unless you specify a low value for minimum conformance percentage.
Example
You have a data source with 10,000 rows where the Comments column has Social Security Numbers in 2,500 rows. You create a column profile with data domain discovery and set the minimum percentage of rows to 30% as the conformance criterion. When you run the profile, the profile results do not display Social Security Number as an inferred data domain. Only 25% of the rows (2,500 of 10,000) match, and the minimum conformance criterion is 30% of the rows, or 3,000 rows in the data source.
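The arithmetic in this example can be sketched in Python. This is only an illustration of the ratio described above, not the Analyst tool's implementation. The function names are hypothetical, and treating the threshold as inclusive (greater than or equal to) is an assumption.

```python
def conformance_percentage(matching_rows: int, total_rows: int) -> float:
    """Return matching rows as a percentage of all rows.

    Null rows are part of total_rows, so they count as nonmatching.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return 100.0 * matching_rows / total_rows


def meets_minimum_percentage(
    matching_rows: int, total_rows: int, minimum_percentage: float
) -> bool:
    """Assumption: a column qualifies when it reaches the threshold exactly."""
    return conformance_percentage(matching_rows, total_rows) >= minimum_percentage


# Values from the example: 2,500 matching rows out of 10,000, threshold 30%.
percentage = conformance_percentage(2_500, 10_000)
inferred = meets_minimum_percentage(2_500, 10_000, 30.0)
print(f"{percentage}% conformance, data domain inferred: {inferred}")
```

With the example values, the column reaches 25% conformance, which is below the 30% threshold, so the data domain is not inferred.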
Minimum Conforming Rows
You can choose a minimum number of rows in the data set as the conformance criterion for data domain discovery.
Example
You have a data source with 10,000 rows where the Comments column has email addresses in three rows. You create a column profile with data domain discovery and set the minimum number of rows to 1 as the conformance criterion. When you run the profile, the profile results display email address as an inferred data domain with three conforming rows, along with the other inferred data domains.
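A row-count criterion is useful when a data domain appears in only a few rows. The following Python sketch contrasts it with a percentage criterion by using the numbers from this example. The function name and the 1% comparison threshold are hypothetical and serve only as illustration.

```python
def meets_minimum_rows(matching_rows: int, minimum_rows: int) -> bool:
    """Assumption: a column qualifies when it has at least minimum_rows matches."""
    return matching_rows >= minimum_rows


total_rows = 10_000
matching_rows = 3  # rows in the Comments column that contain an email address

# Row-count criterion from the example: minimum number of rows set to 1.
inferred_by_rows = meets_minimum_rows(matching_rows, minimum_rows=1)

# For contrast, a hypothetical 1% percentage criterion would miss the match,
# because 3 of 10,000 rows is far below 1%.
percentage = 100.0 * matching_rows / total_rows
inferred_by_percentage = percentage >= 1.0

print(inferred_by_rows, inferred_by_percentage)
```

The three matching rows satisfy the row-count criterion even though they make up a very small share of the data source.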
Exclude Null Values in Data Domain Discovery
You can exclude null values when you perform data domain discovery on a data source. When you select the minimum percentage of rows together with the exclude null values option, the conformance percentage is the number of matching rows divided by the number of non-null rows in the column, that is, the total number of rows minus the rows with null values.
The results of data domain discovery depend on how you combine the Exclude null values from data domain discovery option with the sampling options and filters.
The following scenarios explain the data domain discovery results when you choose the exclude null values option along with a sampling option and filters:
- With All rows as the sampling option and no filters. Data domain discovery ignores all the null values in the column.
- With a sampling option and no filters. Data domain discovery ignores all the null values in the sampled data and runs on the rest of the sampled data.
- With All rows as the sampling option and with filters. Data domain discovery ignores all the null values in the filtered data and runs on the rest of the filtered data.
- With a sampling option and filters. Data domain discovery ignores the null values in the filtered data in the sample and runs on the rest of the filtered data.
Example
You have a data source with 10,000 rows where 3,000 rows have Social Security Numbers in the Comments column. You create a column profile with data domain discovery and choose the following options:
- Select the Exclude null values from data domain discovery option.
- Select All rows as the sampling option.
- Select the Minimum percentage of rows option and configure the option to 12%.
When you run the profile, the profile runs on the data set and ignores the null values during data domain discovery.
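The effect of excluding null values on the conformance percentage can be sketched in Python. The example above does not state how many rows are null, so the figure of 5,000 null rows below is a hypothetical assumption. The regular expression is likewise only a stand-in for a data domain rule, and none of the names come from the Analyst tool.

```python
import re
from typing import Callable, Optional, Sequence

# Hypothetical stand-in for a Social Security Number data domain rule.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def looks_like_ssn(value: str) -> bool:
    return SSN_PATTERN.search(value) is not None


def conformance_percentage(
    values: Sequence[Optional[str]],
    matches: Callable[[str], bool],
    exclude_nulls: bool,
) -> float:
    """Percentage of considered rows that match; None represents a null value."""
    if exclude_nulls:
        considered = [v for v in values if v is not None]
    else:
        considered = list(values)  # nulls stay in and count as nonmatching
    if not considered:
        return 0.0
    matching = sum(1 for v in considered if v is not None and matches(v))
    return 100.0 * matching / len(considered)


# 10,000 rows: 3,000 with an SSN, 5,000 nulls (assumed), 2,000 other comments.
comments = (
    ["SSN 123-45-6789"] * 3_000 + [None] * 5_000 + ["call back later"] * 2_000
)

with_nulls = conformance_percentage(comments, looks_like_ssn, exclude_nulls=False)
without_nulls = conformance_percentage(comments, looks_like_ssn, exclude_nulls=True)
print(with_nulls, without_nulls)
```

Under these assumed numbers, excluding null values shrinks the denominator from 10,000 rows to 5,000 rows, so the conformance percentage rises from 30% to 60%. Both values are above the 12% threshold in the example.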