Enterprise Data Lake
This section describes new Enterprise Data Lake features in version 10.2.2.
Apply Active Rules
Effective in version 10.2.2, you can use active rules in projects.
Active rules are mapplets developed using the Developer tool. You can use active rules to apply complex transformations such as aggregator and Data Quality transformations to worksheets for matching and consolidation.
An active rule uses all rows within a data set as input. You can select multiple worksheets to use as inputs to the rule. The application adds a worksheet containing the rule output to the project.
For more information, see the "Prepare Data" chapter in the Informatica 10.2.2 Enterprise Data Lake User Guide.
Delete Duplicate Rows
Effective in version 10.2.2, you can delete rows containing duplicate values from a worksheet.
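The effect of removing duplicate rows can be sketched in a few lines of Python. This is only an illustration of the idea, not the application's implementation: the function name `delete_duplicate_rows`, the sample columns, and the choice to keep the first occurrence of each duplicate are all assumptions.

```python
def delete_duplicate_rows(rows, key_columns):
    """Keep the first row for each distinct combination of key-column values.

    Keeping the first occurrence is an assumption made for this sketch.
    """
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical worksheet rows.
worksheet = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]
deduplicated = delete_duplicate_rows(worksheet, ["email"])
```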
For more information, see the "Prepare Data" chapter in the Informatica 10.2.2 Enterprise Data Lake User Guide.
Cluster and Categorize Column Data
Effective in version 10.2.2, you can cluster similar values in a column, and then categorize the values based on recommendations from Enterprise Data Lake. The application uses a phonetic algorithm to cluster similar values, and then suggests that you replace the less frequently occurring values with the most frequently occurring value.
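The following Python sketch illustrates the general technique of phonetic clustering followed by a most-frequent-value suggestion. The documentation does not say which phonetic algorithm the application uses, so American Soundex is assumed here purely for illustration, and the names `soundex` and `suggest_replacements` are hypothetical.

```python
from collections import Counter, defaultdict

# Soundex digit for each coded consonant; vowels, H, W, and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(value):
    """Return the four-character American Soundex code of a string."""
    letters = [c for c in value.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        # H and W do not separate two letters that share a code.
        if ch not in "HW":
            previous = code
    return (encoded + "000")[:4]


def suggest_replacements(values):
    """Map each less frequent value to the most frequent value in its cluster."""
    clusters = defaultdict(Counter)
    for value in values:
        clusters[soundex(value)][value] += 1
    suggestions = {}
    for counts in clusters.values():
        target, _ = counts.most_common(1)[0]
        for value in counts:
            if value != target:
                suggestions[value] = target
    return suggestions


# Hypothetical column values.
column = ["Smith", "Smith", "Smyth", "Smith", "Jonson", "Johnson", "Johnson"]
suggestions = suggest_replacements(column)
```

In this sketch, a suggestion is only a proposal: the user still decides whether to accept each replacement.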
For more information, see the "Prepare Data" chapter in the Informatica 10.2.2 Enterprise Data Lake User Guide.
CLAIRE-based Recommendations
Effective in version 10.2.2, the application uses the embedded CLAIRE machine learning discovery engine to provide recommendations when you prepare data.
When you view the Project page, the application displays alternate and additional recommendations derived from upstream data sources based on data lineage, as well as documented primary-foreign key relationships.
When you select a column in a worksheet during data preparation, the application displays suggestions to improve the data based on the column data type in the Column Overview panel.
When you join two worksheets, the application uses primary-foreign key relationships to flag incompatible sampling when the selected key pairs have low overlap.
For more information, see the "Prepare Data" chapter in the Informatica 10.2.2 Enterprise Data Lake User Guide.
Conditional Aggregation
Effective in version 10.2.2, you can use AND and OR logic to apply multiple conditions to IF calculations that you use when you create an aggregate worksheet in a project.
- Use AND with all operators to include more than one column in a condition.
- Use OR with the IS, IS NOT, and IS BETWEEN operators to include more than one value within a column in a condition.
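As a rough illustration of how AND and OR combine in a conditional aggregate, the Python sketch below computes a SUMIF-style total. It does not reproduce the application's expression syntax; the helper `sum_if`, the column names, and the sample data are assumptions.

```python
def sum_if(rows, value_column, condition):
    """Sum value_column over the rows for which condition(row) is true."""
    return sum(row[value_column] for row in rows if condition(row))


# Hypothetical worksheet rows.
orders = [
    {"region": "EMEA", "status": "shipped", "amount": 100},
    {"region": "APAC", "status": "shipped", "amount": 250},
    {"region": "EMEA", "status": "open", "amount": 40},
    {"region": "AMER", "status": "shipped", "amount": 75},
]

# OR within one column: region IS "EMEA" OR region IS "APAC".
# AND across columns: the region test AND status IS "shipped".
total = sum_if(
    orders,
    "amount",
    lambda row: row["region"] in ("EMEA", "APAC") and row["status"] == "shipped",
)
```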
For more information, see the "Prepare Data" chapter in the Informatica 10.2.2 Enterprise Data Lake User Guide.
Data Masking
Effective in version 10.2.2, Enterprise Data Lake integrates with Informatica Dynamic Data Masking, a data security product, to enable masking of sensitive data in data assets.
To enable data masking in Enterprise Data Lake, you configure the Dynamic Data Masking Server to apply masking rules to data assets in the data lake. You also configure the Informatica domain to enable Enterprise Data Lake to connect to the Dynamic Data Masking Server.
Dynamic Data Masking intercepts requests sent to the data lake from Enterprise Data Lake, and applies the masking rules to columns in the requested asset. When Enterprise Data Lake users view or perform operations on columns containing masked data, the actual data is fully or partially obfuscated based on the masking rules applied.
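The difference between full and partial obfuscation can be sketched as follows. This is a conceptual illustration only and does not represent Dynamic Data Masking rule syntax or behavior; the function names, the rule table, and the choice to leave the last four characters visible are assumptions.

```python
def mask_full(value, mask_char="*"):
    """Replace every character of the value with the mask character."""
    return mask_char * len(value)


def mask_partial(value, visible_suffix=4, mask_char="*"):
    """Mask all but the last visible_suffix characters of the value."""
    hidden = max(len(value) - max(visible_suffix, 0), 0)
    return mask_char * hidden + value[hidden:]


# Hypothetical per-column masking rules.
MASKING_RULES = {"ssn": mask_partial, "salary": mask_full}


def apply_masking(row, rules):
    """Return a copy of the row with the rules applied to matching columns."""
    return {
        column: rules[column](str(value)) if column in rules else value
        for column, value in row.items()
    }


row = {"name": "Ada", "ssn": "123-45-6789", "salary": 98000}
masked = apply_masking(row, MASKING_RULES)
```

Columns without a rule, such as `name` in this example, pass through unchanged.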
For more information, see the "Masking Sensitive Data" chapter in the Informatica 10.2.2 Enterprise Data Lake Administrator Guide.
Localization
Effective in version 10.2.2, the user interface supports Japanese. You can also use non-Latin characters in project names and descriptions.
Partitioned Sources and Targets
Effective in version 10.2.2, Enterprise Data Lake can read data from partitioned sources during import, publish, or copy operations. The application can also append data to partitioned targets in the data lake during import, publish, copy, or upload operations.
Add Comments to Recipe Steps
Effective in version 10.2.2, you can add a comment to a recipe step. Use comments to improve collaboration and provide details to meet auditing requirements.
For more information, see the "Prepare Data" chapter in the Informatica 10.2.2 Enterprise Data Lake User Guide.
Save a Recipe as a Mapping
Effective in version 10.2.2, you can save a recipe as a mapping, instead of publishing the recipe and creating a new output table.
You can save the mapping to the Model repository associated with the Enterprise Data Lake Service, or you can save the mapping to an .xml file. Developers can use the Developer tool to review and modify the mapping, and then execute the mapping when appropriate based on system resource availability.
For more information, see the "Prepare Data" chapter in the Informatica 10.2.2 Enterprise Data Lake User Guide.
Amazon S3, ADLS, WASB, and MapR-FS as Data Sources
Effective in version 10.2.2, you can prepare data in files stored in the following data sources:
- Amazon S3
- MapR-FS
- Microsoft Azure Data Lake Storage
- Windows Azure Storage Blob
You must create a resource in Enterprise Data Catalog for each data source containing data that you want to prepare. A resource is a repository object that represents an external data source or metadata repository. Scanners attached to a resource extract metadata from the resource and store the metadata in Enterprise Data Catalog.
For more information about creating resources in Enterprise Data Catalog, see the "Managing Resources" chapter in the Informatica 10.2.2 Catalog Administrator Guide.
Statistical Functions
Effective in version 10.2.2, you can apply the following statistical functions to columns in a worksheet when you prepare data:
- AVG
- AVGIF
- COUNT
- COUNTIF
- COUNTDISTINCT
- COUNTDISTINCTIF
- MAX
- MAXIF
- MIN
- MINIF
- STDDEV
- STDDEVIF
- SUM
- SUMIF
- VARIANCE
- VARIANCEIF
For more information, see the "Prepare Data" chapter in the Informatica 10.2.2 Enterprise Data Lake User Guide.
Date and Time Functions
Effective in version 10.2.2, you can apply the following date and time functions to columns in a worksheet when you prepare data:
- ADD_TO_DATE
- CURRENT_DATETIME
- DATETIME
- DATE_DIFF
- DATE_TO_UNIXTIME
- EXTRACT_MONTH_NAME
- UNIXTIME_TO_DATE
- Convert Date to Text
- Convert Text to Date
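For readers unfamiliar with these kinds of operations, the Python standard library offers rough analogues. The sketch below shows what functions with names such as ADD_TO_DATE, DATE_DIFF, DATE_TO_UNIXTIME, UNIXTIME_TO_DATE, and EXTRACT_MONTH_NAME typically compute. The actual arguments, units, and time zone handling of the Enterprise Data Lake functions are not described here, so the mapping in the comments is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical column value.
order_date = datetime(2019, 2, 15, 8, 30, tzinfo=timezone.utc)

# Analogue of adding an interval to a date (compare ADD_TO_DATE).
due_date = order_date + timedelta(days=30)

# Analogue of the difference between two dates (compare DATE_DIFF).
diff_days = (due_date - order_date).days

# Analogue of converting to and from Unix time
# (compare DATE_TO_UNIXTIME and UNIXTIME_TO_DATE).
unix_seconds = int(order_date.timestamp())
round_trip = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

# Analogue of extracting the month name (compare EXTRACT_MONTH_NAME).
month_name = order_date.strftime("%B")
```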
For more information, see the "Prepare Data" chapter in the Informatica 10.2.2 Enterprise Data Lake User Guide.
Math Functions
Effective in version 10.2.2, you can apply the following math functions to columns when you prepare data:
- EXP
- LN
- LOG
- PI
- POWER
- SQRT
For more information, see the "Prepare Data" chapter in the Informatica 10.2.2 Enterprise Data Lake User Guide.
Text Functions
Effective in version 10.2.2, you can apply the following text functions to columns when you prepare data:
- ENDSWITH
- ENDSWITH_IGNORE_CASE
- FIND_IGNORE_CASE
- FIND_REGEX
- FIRST_CHARACTER_TO_NUMBER
- NUMBER_TO_CHARACTER
- PROPER_CASE
- REMOVE_NON_ALPHANUMERIC_CHARACTERS
- STARTSWITH
- STARTSWITH_IGNORE_CASE
- SUBSTITUTE_REGEX
- TRIM_ALL
- Convert Date to Text
- Convert Number to Text
- Convert Text to Date
- Convert Text to Number
For more information, see the "Prepare Data" chapter in the Informatica 10.2.2 Enterprise Data Lake User Guide.
Window Functions
Effective in version 10.2.2, you can use window functions to perform operations on groups of rows within a worksheet. The group of rows on which a function operates is called a window, which you define with a partition key, an order by key, and optional offsets. A window function calculates a return value for every input row within the context of the window.
You can use window functions to perform the following tasks:
- Retrieve data from previous or subsequent rows.
- Calculate a cumulative sum or a cumulative average based on a group of rows.
- Assign a sequential row number to each row in a group of rows.
- Replace null values in rows with the preceding non-null value within a group of rows.
- Generate session identifiers that you can use to group rows based on a specific time period, such as web site visits recorded in a log file.
You can apply multiple window functions to a worksheet. For example, you might apply a function to calculate the sum of values for each row following the current row within a window, and then apply another function to calculate the average of the same values.
Enterprise Data Lake adds a column containing the results of each function you apply to the worksheet.
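The window concept described above can be sketched in plain Python. The example partitions rows by one key, orders them by another, and adds one output column per function, covering three of the listed tasks: row numbering, a cumulative sum, and filling nulls with the preceding non-null value. It is a simplified illustration that omits offsets; the helper names, column names, and sample data are assumptions and do not reflect the application's interface.

```python
from itertools import groupby
from operator import itemgetter


def apply_window(rows, partition_key, order_key, func, output_column):
    """Add output_column to each row, computed by func within each window.

    func receives the ordered rows of one window and returns one value per row.
    """
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    result = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        window = list(group)
        for row, value in zip(window, func(window)):
            result.append({**row, output_column: value})
    return result


def row_number(window):
    """Assign 1, 2, 3, ... to the rows of a window."""
    return list(range(1, len(window) + 1))


def cumulative_sum(column):
    """Build a function that computes a running total of a column."""
    def compute(window):
        total, out = 0, []
        for row in window:
            total += row[column]
            out.append(total)
        return out
    return compute


def fill_forward(column):
    """Build a function that replaces None with the preceding non-null value."""
    def compute(window):
        last, out = None, []
        for row in window:
            if row[column] is not None:
                last = row[column]
            out.append(last)
        return out
    return compute


# Hypothetical worksheet rows.
visits = [
    {"user": "u1", "ts": 1, "amount": 10, "plan": "free"},
    {"user": "u2", "ts": 1, "amount": 5, "plan": None},
    {"user": "u1", "ts": 2, "amount": 20, "plan": None},
    {"user": "u1", "ts": 3, "amount": 30, "plan": "pro"},
    {"user": "u2", "ts": 2, "amount": 7, "plan": "free"},
]

# Each function adds its own result column, mirroring the behavior described above.
numbered = apply_window(visits, "user", "ts", row_number, "row_num")
totaled = apply_window(numbered, "user", "ts", cumulative_sum("amount"), "running_total")
filled = apply_window(totaled, "user", "ts", fill_forward("plan"), "plan_filled")
```

A null at the start of a window stays null in this sketch because no preceding non-null value exists within that window.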
For more information, see the "Prepare Data" chapter in the Informatica 10.2.2 Enterprise Data Lake User Guide.
Purge Audit Events
Effective in version 10.2.2, you can run the infacmd edl purgeevents command to delete user activity events from the audit history database. You can optionally run the command to delete project history events from the database.
Spark Execution Engine
Effective in version 10.2.2, Enterprise Data Lake uses the Spark engine for resource-intensive activities such as asset publication, and to run active rule mapplets that use the Python transformation. Using the Spark engine for these activities improves performance and enables an Enterprise Data Lake deployment on Amazon Elastic MapReduce (EMR) to take advantage of autoscaling.