Intelligent Data Lake Concepts
To successfully administer Intelligent Data Lake, you must understand the concepts used in the product.
Catalog
A catalog is an indexed inventory of all the information assets in an enterprise. The assets can come from different types of enterprise systems. Assets can include such items as a database table, report, folder, user account, or business glossary definition.
The catalog provides a single comprehensive view of the data in the enterprise. The catalog contains metadata about each asset, including profile statistics, asset ratings, data domains, and data relationships. Metadata can come from scans of enterprise system repositories or from data enrichment by analysts.
You use Live Data Map to create the catalog. When you manage the catalog, you create resources to represent data sources. Each resource contains the properties required to connect to the data source, extract metadata from the data source, and load the metadata to the catalog.
Intelligent Data Lake requires that you use Live Data Map Administrator to create the following resource types for the catalog:
- Hive resource for the data lake
Create a Hive resource to represent the data lake. Create a schedule for the resource so that Live Data Map regularly extracts metadata from the Hive tables in the data lake; the sketch after this list illustrates the kind of table metadata such a scan collects. Analysts can use the Intelligent Data Lake application to search for, discover, and prepare data stored in the Hive tables.
- Resources for other enterprise systems
Create additional resources to represent other enterprise systems. You can create resources for data integration repositories, ETL and modeling tools and repositories, business glossaries, application databases, and other file storage and databases. Create a schedule for each resource so that Live Data Map regularly extracts metadata from the enterprise system.
Analysts can use the Intelligent Data Lake application to search for and discover data that resides in or outside the data lake. Analysts cannot prepare data stored outside the lake. However, when the catalog contains metadata across disparate enterprise systems, analysts can discover lineage and relationships between the data. They use this information to identify which data to prepare and which data to add to the data lake.
- Domain User resource
Use the DomainUsers resource in Live Data Map to create a list of users who have Informatica user accounts and can log in to the Intelligent Data Lake application. Create a schedule for the resource so that Live Data Map regularly extracts metadata from the Informatica user accounts.
When the catalog contains user metadata, the user information helps analysts discover the data. When analysts discover a data asset, the Intelligent Data Lake application lists other users who have used and prepared that data asset. This information helps analysts decide whether the data asset contains quality or trusted data.
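The technical metadata that a catalog scan collects from a Hive resource (columns, types, storage location, owner) is conceptually similar to what Hive itself reports for a table. The following is a minimal sketch of that idea using the open-source PyHive client; it is an illustration, not Live Data Map's internal mechanism, and the host, user, and table names are hypothetical.

```python
# Conceptual sketch: the kind of technical metadata a catalog scan
# collects from a Hive table. Host, user, and table names are
# hypothetical; this is not Live Data Map's internal API.
from pyhive import hive

conn = hive.Connection(host="lake-master.example.com", port=10000,
                       username="catalog_scanner")
cursor = conn.cursor()

# Column names, types, and comments for one asset.
cursor.execute("DESCRIBE sales.customer_orders")
for name, col_type, comment in cursor.fetchall():
    print(name, col_type, comment)

# Extended details such as HDFS location, owner, and table parameters.
cursor.execute("DESCRIBE FORMATTED sales.customer_orders")
for row in cursor.fetchall():
    print(row)
```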
Data Lake
A data lake is a centralized repository of large volumes of structured and unstructured data. A data lake can contain different types of data, including raw data, refined data, master data, transactional data, log file data, and machine data. In Intelligent Data Lake, the data lake is a Hadoop cluster.
Analysts use the Intelligent Data Lake application to search, discover, and prepare data that resides in the data lake. When analysts prepare the data, they combine, cleanse, and transform the data to create new insights.
Data can be added to the data lake in the following ways:
- Analysts use the Intelligent Data Lake application to upload data.
Analysts can upload delimited text files to the data lake. When analysts upload data, Intelligent Data Lake writes the uploaded data to a Hive table in the data lake; the sketch after this list shows a conceptual equivalent.
- Analysts use the Intelligent Data Lake application to publish prepared data.
When analysts publish prepared data, Intelligent Data Lake writes the transformed input source to a Hive table in the data lake.
- Administrators run Informatica mappings to populate the data lake.
As an administrator, you can run Informatica mappings to read data from enterprise systems and write the data to Hive tables in the data lake. You can develop your own mappings, or you can operationalize the mappings created during the Intelligent Data Lake publication process.
- Administrators and developers run third-party tools to load data into the data lake.
Administrators, developers, or analysts can use data movement tools from other vendors to load data into the data lake.
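When an analyst uploads a delimited file, the write to a Hive table is conceptually similar to defining a delimited-text table and loading the file into it. The following is a minimal sketch of that idea with the open-source PyHive client; it is not Intelligent Data Lake's internal upload mechanism, and the host, schema, and file path are hypothetical.

```python
# Conceptual equivalent of uploading a delimited file into the lake:
# define a delimited Hive table, then load the staged file into it.
# Not the product's internal mechanism; names and paths are hypothetical.
from pyhive import hive

conn = hive.Connection(host="lake-master.example.com", port=10000,
                       username="analyst1")
cursor = conn.cursor()

# Table layout matching the uploaded CSV's columns.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS uploads.customer_contacts (
        customer_id INT,
        name        STRING,
        email       STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
""")

# Move the file, already staged in HDFS, into the table's storage.
cursor.execute(
    "LOAD DATA INPATH '/landing/customer_contacts.csv' "
    "INTO TABLE uploads.customer_contacts"
)
```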
Data Asset
A data asset is data that you work with as a unit. A data asset is one of the assets described in the catalog. Data assets can include items such as a flat file, table, or view. A data asset can contain data stored inside or outside the data lake.
Analysts use the Intelligent Data Lake application to search for and discover any assets included in the catalog. However, analysts can only prepare data assets that are stored in the data lake as Hive tables.
After analysts find the data asset they are interested in, they add the data asset to a project and then prepare the data for analysis. Data preparation includes combining, cleansing, transforming, and structuring data in project worksheets.
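The worksheet operations are interactive, but the kinds of steps preparation records (combine, cleanse, transform) map to familiar data-wrangling operations. As a rough illustration only, here is what comparable steps look like in generic pandas; this is not Intelligent Data Lake's preparation engine, and the file and column names are hypothetical.

```python
# Rough illustration of typical preparation steps: combine two sources,
# cleanse values, and derive a new column. Generic pandas, not the
# product's engine; file and column names are hypothetical.
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical extracts of two
orders = pd.read_csv("orders.csv")        # data assets from the lake

# Combine: join orders to customers.
combined = orders.merge(customers, on="customer_id", how="left")

# Cleanse: normalize case and drop rows that are missing the key field.
combined["email"] = combined["email"].str.strip().str.lower()
combined = combined.dropna(subset=["customer_id"])

# Transform: derive a new column for analysis.
combined["order_year"] = pd.to_datetime(combined["order_date"]).dt.year
```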
Data Publication
Data publication is the process of making prepared data available in the data lake.
When analysts publish prepared data, Intelligent Data Lake writes the transformed input source to a Hive table in the data lake. Other analysts can add the published data to their projects and create new data assets. Alternatively, analysts can use a third-party business intelligence or advanced analytics tool to run reports that further analyze the published data.
During the publication process, Live Data Map scans the published data to immediately add the metadata to the catalog.
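Publication is conceptually similar to materializing the recipe's output as a new Hive table that other analysts and reporting tools can query, for example with a CREATE TABLE ... AS SELECT statement. The following is a hedged sketch of that idea via PyHive, not the product's internal publication path; the schema, table, and column names are hypothetical.

```python
# Conceptual equivalent of publication: materialize the prepared result
# as a new Hive table that other analysts and BI tools can query.
# Schema, table, and column names are hypothetical.
from pyhive import hive

conn = hive.Connection(host="lake-master.example.com", port=10000,
                       username="analyst1")
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE projects.customer_address_details AS
    SELECT c.customer_id,
           UPPER(TRIM(c.name)) AS name,
           a.street,
           a.city,
           a.postal_code
    FROM   raw.customers c
    JOIN   raw.addresses a
      ON   c.customer_id = a.customer_id
    WHERE  a.postal_code IS NOT NULL
""")
```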
Recipes and Mappings
A recipe includes the list of input sources and the steps taken to prepare data in a worksheet. When analysts publish prepared data, Intelligent Data Lake applies the recipe to the data in the input source. Intelligent Data Lake converts the recipe into an Informatica mapping and stores the mapping in the Model repository.
During the publication process, Intelligent Data Lake gives the projects and mappings stored in the Model repository names similar to the corresponding project in the Intelligent Data Lake application. The mappings are accessible from the Developer tool. You can use the Developer tool to view the mappings generated from the recipes, and you can operationalize the mappings to regularly load data with the new structure into the data lake.
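Operationalizing typically means running the generated mapping on a schedule through the Data Integration Service. One common route is Informatica's infacmd command line, which can run a mapping deployed in an application; treat the following as a hedged sketch, since the exact options vary by version and should be verified against the infacmd reference, and the domain, service, application, and mapping names are hypothetical.

```python
# Hedged sketch: invoke a deployed mapping through infacmd from a scheduler
# such as cron. Verify option names against the infacmd reference for your
# Informatica version; all names below are hypothetical.
import subprocess

result = subprocess.run(
    [
        "infacmd.sh", "ms", "RunMapping",
        "-dn", "InfaDomain",             # Informatica domain name
        "-sn", "DataIntegrationSvc",     # Data Integration Service name
        "-un", "admin", "-pd", "****",   # credentials (use secure storage)
        "-a", "customer_address_app",    # deployed application
        "-m", "m_customer_address",      # mapping generated from the recipe
    ],
    capture_output=True,
    text=True,
)
print(result.stdout)
result.check_returncode()  # raise CalledProcessError if the run failed
```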
The following image displays a project named customer_address_details in the Intelligent Data Lake application. The project contains one worksheet with published data:
1. Project in the Intelligent Data Lake application
2. Worksheet with published data
The following image displays the Object Explorer view in the Developer tool. The Object Explorer view displays the project and converted mapping stored in the Model repository during the publication of the prepared data in the customer_address_details project:
1. Project in the Developer tool
2. Converted mapping within the project