
Big Data Management Tasks

Use Big Data Management when you want to access, analyze, prepare, transform, and stream data faster than you can in traditional data processing environments.
You can use Big Data Management for the tasks described in the following sections.
Note: The Informatica Big Data Management User Guide describes how to run big data mappings in the native environment or the Hadoop environment. For information about specific license and configuration requirements for a task, refer to the related product guides.

Read from and Write to Big Data Sources and Targets

In addition to relational and flat file data, you can access unstructured and semi-structured data, social media data, and data in a Hive or Hadoop Distributed File System (HDFS) environment.
You can access the following types of data:
Transaction data
You can access different types of transaction data, including data from relational database management systems, online transaction processing systems, online analytical processing systems, enterprise resource planning systems, customer relationship management systems, mainframe systems, and the cloud.
Unstructured and semi-structured data
You can use data objects with an intelligent structure model, or Data Processor transformations, to read and transform unstructured and semi-structured data.
You can use data objects with an intelligent structure model to read and transform unstructured and semi-structured data on the Spark engine. The intelligent structure model is quickly auto-generated from a representative file and can be easily updated or customized. For example, you can use a complex file data object with an intelligent structure model in a mapping to parse a Microsoft Excel file and load accounting data into S3 storage buckets. For more information, see Processing Unstructured and Semi-structured Data with Intelligent Structure Model Overview.
Alternatively, you can use the Data Processor transformation in a workflow to parse unstructured and semi-structured data. For example, you can parse a Microsoft Excel file to load customer and order data into relational database tables. Data Processor transformations have broad functionality and format support, but require manual setup. For more information, see the Data Transformation User Guide.
You can use HParser with a Data Transformation service to transform complex data into flattened, usable formats for Hive, Pig, and MapReduce processing. HParser processes complex files, such as messaging formats, HTML pages, and PDF documents. HParser also transforms formats such as ACORD, HIPAA, HL7, EDI-X12, EDIFACT, AFP, and SWIFT. For more information, see the Data Transformation HParser Operator Guide.
Social media data
You can use PowerExchange® adapters for social media to read data from social media web sites like Facebook, Twitter, and LinkedIn. You can also use PowerExchange for DataSift to extract real-time data from different social media web sites and capture sentiment and language analysis data from DataSift. You can use PowerExchange for Web Content-Kapow to extract data from any web site.
Data in Hadoop
You can use PowerExchange adapters to read data from or write data to Hadoop. For example, you can use PowerExchange for Hive to read data from or write data to Hive. You can use PowerExchange for HDFS to extract data from and load data to HDFS. Also, you can use PowerExchange for HBase to extract data from and load data to HBase.
Data in Amazon Web Services
You can use PowerExchange adapters to read data from or write data to Amazon Web Services. For example, you can use PowerExchange for Amazon Redshift to read data from or write data to Amazon Redshift. Also, you can use PowerExchange for Amazon S3 to extract data from and load data to Amazon S3. A conceptual read-and-write sketch appears after this list.
For more information about PowerExchange adapters, see the related PowerExchange adapter guides.
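The PowerExchange adapters are configured through connections and data objects in the Developer tool rather than through code. As a rough, hedged illustration of the equivalent read and write operations, the following PySpark sketch reads a Hive table and writes it to HDFS and to Amazon S3. The database, table, bucket, and path names are hypothetical placeholders.

```python
# Hedged sketch: reading from Hive and writing to HDFS and Amazon S3 with PySpark.
# The database, table, and path names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-hdfs-and-s3-sketch")
    .enableHiveSupport()          # required to read Hive tables
    .getOrCreate()
)

# Read a Hive table (analogous to reading with PowerExchange for Hive).
orders = spark.sql("SELECT order_id, customer_id, amount FROM sales.orders")

# Write to HDFS as Parquet (analogous to writing with PowerExchange for HDFS).
orders.write.mode("overwrite").parquet("hdfs:///data/curated/orders")

# Write to Amazon S3 (analogous to writing with PowerExchange for Amazon S3).
# Assumes the cluster is configured with S3 credentials and the s3a connector.
orders.write.mode("overwrite").parquet("s3a://example-bucket/curated/orders")
```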

Perform Data Discovery

Data discovery is the process of discovering the metadata of source systems that include content, structure, patterns, and data domains. Content refers to data values, frequencies, and data types. Structure includes candidate keys, primary keys, foreign keys, and functional dependencies. The data discovery process offers advanced profiling capabilities.
In the native environment, you can define a profile to analyze data in a single data object or across multiple data objects. In the Hadoop environment, you can push column profiles and the data domain discovery process to the Hadoop cluster.
Run a profile to evaluate the data structure and to verify that data columns contain the types of information you expect. You can drill down on data rows in profiled data. If the profile results reveal problems in the data, you can apply rules to fix the result set. You can create scorecards to track and measure data quality before and after you apply the rules. If the external source metadata of a profile or scorecard changes, you can synchronize the changes with its data object. You can add comments to profiles so that you can track the profiling process effectively.
For more information, see the Informatica Data Discovery Guide.
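Profiles run through the Analyst tool or the Developer tool rather than through code. As a minimal sketch of the kind of column-level statistics that data discovery collects, the following PySpark snippet computes null counts, distinct counts, minimum and maximum values, and a simple candidate-key check. The sales.customers table is a hypothetical placeholder, and this is not the Informatica profiling engine.

```python
# Hedged sketch of basic column profiling with PySpark; not the Informatica
# profiling engine. The table name is a hypothetical placeholder.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("column-profile-sketch")
    .enableHiveSupport()
    .getOrCreate()
)
df = spark.table("sales.customers")

# Per-column statistics: null count, distinct count, minimum, and maximum.
total_rows = df.count()
for column in df.columns:
    stats = df.agg(
        F.count(F.when(F.col(column).isNull(), 1)).alias("nulls"),
        F.countDistinct(column).alias("distinct"),
        F.min(column).alias("min"),
        F.max(column).alias("max"),
    ).first()
    # A column whose distinct count equals the row count is a candidate key.
    is_candidate_key = stats["distinct"] == total_rows
    print(column, stats["nulls"], stats["distinct"],
          stats["min"], stats["max"], is_candidate_key)
```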

Perform Data Lineage on Big Data Sources

Perform data lineage analysis in Enterprise Information Catalog for big data sources and targets.
Use Enterprise Information Catalog to create a Cloudera Navigator resource to extract metadata for big data sources and targets and perform data lineage analysis on the metadata. Cloudera Navigator is a data management tool for the Hadoop platform that enables users to track data access for entities and manage metadata about the entities in a Hadoop cluster.
You can create one Cloudera Navigator resource for each Hadoop cluster that is managed by Cloudera Manager. Enterprise Information Catalog extracts metadata about entities from the cluster based on the entity type.
Enterprise Information Catalog extracts metadata for the following entity types:
Note: Enterprise Information Catalog does not extract metadata for MapReduce job templates or executions.
For more information, see the Informatica Catalog Administrator Guide.

Stream Machine Data

You can stream machine data in real time. To stream machine data, use Informatica Edge Data Streaming.
Edge Data Streaming is a highly available, distributed, real-time application that collects and aggregates machine data. You can collect machine data from different types of sources and write to different types of targets. Edge Data Streaming consists of source services that collect data from sources and target services that aggregate and write data to a target.
For more information, see the Informatica Vibe Data Stream for Machine Data User Guide.

Process Streamed Data in Real Time

You can process streamed data in real time. To process streams of data in real time and uncover insights in time to meet your business needs, use Informatica Big Data Streaming.
Create Streaming mappings to collect the streamed data, build the business logic for the data, and push the logic to a Spark engine for processing. The Spark engine uses Spark Streaming to process the data: it reads the data, divides it into micro batches, and publishes the results.
For more information, see the Informatica Big Data Streaming User Guide.
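Streaming mappings are built in the Developer tool, and the Spark code they run is generated for you. As a hedged sketch of the micro-batch model that Spark Streaming uses, the following Structured Streaming example reads from Kafka, applies simple logic, and publishes results on a ten-second trigger. The broker address, topic name, and availability of the spark-sql-kafka connector are assumptions; this is not the code that Big Data Streaming generates.

```python
# Hedged sketch of Spark micro-batch stream processing; not the code that
# Big Data Streaming generates. Broker and topic names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("micro-batch-sketch").getOrCreate()

# Read a stream from Kafka. Requires the spark-sql-kafka connector package.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Business logic: decode the message key and value and count events per key.
counts = (
    events.select(F.col("key").cast("string"), F.col("value").cast("string"))
    .groupBy("key")
    .count()
)

# Spark divides the stream into micro batches; each trigger processes one
# batch and publishes the updated results to the sink.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .trigger(processingTime="10 seconds")
    .start()
)
query.awaitTermination()
```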

Manage Big Data Relationships

You can manage big data relationships by integrating data from different sources and indexing and linking the data in a Hadoop environment. Use Big Data Management to integrate data from different sources. Then use the MDM Big Data Relationship Manager to index and link the data in a Hadoop environment.
MDM Big Data Relationship Manager indexes and links the data based on indexing and matching rules. You can configure the rules that determine how input records are linked. MDM Big Data Relationship Manager uses the rules to match the input records, groups the matched records, and creates a cluster for each group. You can load the indexed and matched records into a repository.
For more information, see the MDM Big Data Relationship Management User Guide.
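MDM Big Data Relationship Manager applies its indexing and matching rules across a Hadoop cluster. As a much-simplified sketch of the match, group, and link idea, the following Python example applies a hypothetical matching rule and links matched records into clusters with a union-find structure. The records, fields, and rule are illustrative only and do not reflect the product's rule syntax.

```python
# Hedged sketch of rule-based matching and linking; a simplification of what
# MDM Big Data Relationship Manager does at scale. Records and the matching
# rule are hypothetical.
records = [
    {"id": 1, "name": "ACME Corp",  "zip": "10001"},
    {"id": 2, "name": "Acme Corp.", "zip": "10001"},
    {"id": 3, "name": "Globex",     "zip": "60601"},
]

def matches(a, b):
    """Matching rule: same postal code and the same normalized name prefix."""
    same_zip = a["zip"] == b["zip"]
    same_name = a["name"].lower().rstrip(".")[:4] == b["name"].lower().rstrip(".")[:4]
    return same_zip and same_name

# Union-find: link every pair of records that the rule matches.
parent = {r["id"]: r["id"] for r in records}

def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

for i, a in enumerate(records):
    for b in records[i + 1:]:
        if matches(a, b):
            parent[find(b["id"])] = find(a["id"])

# Each group of linked records forms a cluster.
clusters = {}
for r in records:
    clusters.setdefault(find(r["id"]), []).append(r["id"])
print(clusters)   # e.g. {1: [1, 2], 3: [3]}
```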

Use a Cluster Workflow to Create Clusters on a Cloud Platform

You can create a workflow in the Developer tool that creates a cluster on a cloud platform and runs Mapping tasks and other tasks.
A cluster workflow contains a Create Cluster task that has configuration properties for a cluster and a reference to cloud provisioning and Hadoop connections.
When you deploy and run a cluster workflow, it creates a cluster on a cloud platform and runs Mapping tasks and other tasks on the cluster.
You can optionally include a Delete Cluster task that terminates the cluster when workflow tasks are complete. A cluster that is created and then terminated is called an ephemeral cluster.
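The cluster workflow itself is assembled in the Developer tool. As a hedged illustration of the ephemeral-cluster pattern it implements, the following sketch uses the AWS SDK for Python (boto3) to start an Amazon EMR cluster that runs one step and terminates when the step completes. Amazon EMR is assumed as the cloud platform, and the cluster name, IAM roles, and step script are hypothetical placeholders.

```python
# Hedged sketch of the ephemeral-cluster pattern using Amazon EMR via boto3;
# this is not the Developer tool cluster workflow. Names, IAM roles, and the
# step script are hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="ephemeral-mapping-cluster",
    ReleaseLabel="emr-5.29.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # False means the cluster terminates when the last step finishes,
        # which is what makes it ephemeral.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "run-mapping",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/mapping_job.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```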