Enterprise Data Preparation Concepts
To successfully use Enterprise Data Preparation, you must understand the concepts that are used in the application.
Data Asset
A data asset is source data that you work with as a unit. A data asset can be an item such as a flat file, a table, or a table view. The data can reside inside or outside the data lake, for example in an external database.
You use Enterprise Data Preparation to search for assets described in the Enterprise Data Catalog, a centralized repository that stores the metadata for assets extracted from external sources. Enterprise Data Catalog contains metadata such as profile statistics, asset ratings, data domains, and data relationships for each asset. When you search for assets in Enterprise Data Preparation, you search this metadata rather than the data itself.
After you find the data asset you are interested in, you can add the data asset to a project and begin preparing the data for analysis.
Projects, Worksheets, and Recipes
A project is a container that stores data assets and worksheets.
When you add a data asset to a project, Enterprise Data Preparation creates a corresponding worksheet that contains a sample of the data in a spreadsheet-like presentation. You use the worksheet to preview the data and to prepare it for publication. You do not directly change the data in the input source.
Enterprise Data Preparation captures each preparation step you complete in the worksheet in a recipe for that worksheet. When you publish the prepared data, Enterprise Data Preparation applies the recipe to the data in the input source and creates a new data asset.
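Conceptually, a recipe is an ordered list of transformation steps that is replayed against the full input source at publish time. The following Java sketch illustrates that idea only; the Step and RecipeSketch types are hypothetical and do not represent the product's internal design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration of the recipe concept: each preparation step is
// recorded as a transformation, and publishing replays every step in order
// against the full input data rather than the worksheet sample.
public class RecipeSketch {

    // A step transforms one row, modeled here as a list of column values.
    interface Step extends UnaryOperator<List<String>> {}

    private final List<Step> steps = new ArrayList<>();

    // Preparing data in the worksheet records a step in the recipe.
    void record(Step step) {
        steps.add(step);
    }

    // Publishing applies the recorded steps to every row of the input source.
    List<List<String>> publish(List<List<String>> inputRows) {
        List<List<String>> output = new ArrayList<>();
        for (List<String> row : inputRows) {
            for (Step step : steps) {
                row = step.apply(row);
            }
            output.add(row);
        }
        return output;
    }
}
```

Because steps are recorded rather than applied destructively, the same recipe can transform the complete input source at publish time even though you prepared only a sample in the worksheet.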
Data Publication
Data publication is the process of making prepared data available in the data lake.
When analysts publish prepared data, Enterprise Data Preparation writes the transformed input source to a Hive table or to a file in the data lake. Other analysts can add the published data to their projects to create new data assets, or use a third-party business intelligence or advanced analytics tool to run reports that further analyze the published data.
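For example, once the transformed data lands in a Hive table, any JDBC-capable tool can read it. The following is a minimal Java sketch, assuming a hypothetical HiveServer2 address and a published table named after the CustMasterData worksheet shown later in this section; it requires the Apache Hive JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: read published data from a Hive table over JDBC.
// The HiveServer2 address and table name are placeholders.
public class ReadPublishedData {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT * FROM CustMasterData LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // first column of each row
            }
        }
    }
}
```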
Enterprise Data Preparation also converts the recipe into an Informatica mapping and stores the mapping in the Model repository associated with the Enterprise Data Preparation Service. During publication, Enterprise Data Preparation names the projects and mappings in the Model repository to mirror the corresponding project and worksheet names in Enterprise Data Preparation. You can use the Developer tool to view the generated mappings, and you can operationalize the mappings to regularly load data with the new structure into the data lake.
The following image displays a project named customer_address_details in Enterprise Data Preparation. The project contains one worksheet, CustMasterData, with published data:
The following image displays the Object Explorer view in the Developer tool. The Object Explorer view displays the converted mapping stored in the Model repository after publication of the CustMasterData prepared data in the customer_address_details project:
Data Visualization
Data visualization is the process of assessing and validating published data by running ad hoc queries against the data to generate charts, tables, and other visual formats.
You can run queries and use dynamic forms to view the data as graphs, charts, and other visual formats, and to assess its relevance for your business purposes. You can run visualization only on worksheets that have been prepared and published. Collaborators on the project can access the worksheet's notebook, and you can share the notebook so that all users with the required permissions can view the visualization of the worksheet.
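For instance, the query behind a bar chart is typically a simple aggregation over the published table. The following Java sketch is illustrative only; the HiveServer2 address, table name (CustMasterData), and column name (state) are hypothetical placeholders, and in practice you would run such a query from the notebook rather than from standalone code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: an ad hoc aggregation query of the kind that backs a bar chart.
// The host, table, and column names are hypothetical placeholders.
public class ChartQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT state, COUNT(*) AS customers "
                 + "FROM CustMasterData GROUP BY state ORDER BY customers DESC")) {
            while (rs.next()) {
                // Each (category, count) pair becomes one bar in the chart.
                System.out.printf("%s\t%d%n",
                    rs.getString("state"), rs.getLong("customers"));
            }
        }
    }
}
```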
Data Preparation Process
Enterprise Data Preparation connects to several Hadoop services on a Hadoop cluster to read from and write to Hive tables and locations in the data lake, to write events, and to store sample preparation data.
Enterprise Data Preparation connects to the following services in the Hadoop cluster:
- When an analyst uploads data to the data lake, the Enterprise Data Preparation Service connects to the Hadoop Distributed File System (HDFS) to stage the data in HDFS files, as sketched in the first example after this list.
- When an analyst prepares data, the Interactive Data Preparation Service connects to HDFS to store the sample data being prepared in worksheets in HDFS files.
- When an analyst previews data, the Enterprise Data Preparation Service connects to the Data Integration Service and reads the first 100 rows from the mapping using the JDBC driver, as sketched in the second example after this list.
- When an analyst prepares data, the Interactive Data Preparation Service connects to HDFS, reads sample data from the Hive table or file, and displays the data in a worksheet.
- When an analyst uploads data, the Enterprise Data Preparation Service connects to the Data Integration Service to read the temporary data staged in HDFS. If the analyst uploads data to Hive, the application writes the data to a Hive table.
- When an analyst publishes prepared data, the Enterprise Data Preparation Service connects to the Data Integration Service to run the converted mappings in the Hadoop environment. The Data Integration Service applies the mapping to the data in the input source and writes the transformed data to a Hive table in the data lake.
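The staging step can be pictured with the standard Hadoop FileSystem API. This is a minimal sketch, assuming a hypothetical NameNode address and staging paths; the service manages the actual staging locations internally.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: stage an uploaded file as an HDFS file using the Hadoop
// FileSystem API. The NameNode address and paths are placeholders.
public class StageUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode

        try (FileSystem fs = FileSystem.get(conf)) {
            fs.copyFromLocalFile(
                new Path("/tmp/uploads/customers.csv"),        // local upload
                new Path("/datalake/staging/customers.csv"));  // HDFS staging area
        }
    }
}
```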
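The preview step amounts to fetching a capped number of rows over JDBC. Another hedged sketch, with a hypothetical connection URL and table name; Statement.setMaxRows caps the result set at the first 100 rows.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: fetch only the first 100 rows of a table for preview.
// The connection URL and table name are placeholders.
public class PreviewData {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            stmt.setMaxRows(100); // limit the result set to the first 100 rows
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM customers")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```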