Big Data Management Component Architecture
Big Data Management includes client tools, application services, repositories, and third-party tools that it uses for a big data project. The specific components involved depend on the task that you perform.
The following image shows the components of Big Data Management:
Clients and Tools
Based on your product license, you can use multiple Informatica tools and clients to manage big data projects.
Use the following tools to manage big data projects:
- Informatica Administrator
Monitor the status of profile, mapping, and MDM Big Data Relationship Management jobs on the Monitoring tab of the Administrator tool. The Monitoring tab of the Administrator tool is called the Monitoring tool. You can also design a Vibe Data Stream workflow in the Administrator tool.
- Informatica Analyst
Create and run profiles on big data sources, and create mapping specifications to collaborate on projects and define business logic that populates a big data target with data.
- Informatica Developer
Create and run profiles against big data sources, and run mappings and workflows on the Hadoop cluster from the Developer tool.
Application Services
Big Data Management uses application services in the Informatica domain to process data.
Big Data Management uses the following application services:
- Analyst Service
The Analyst Service runs the Analyst tool in the Informatica domain. The Analyst Service manages the connections between service components and the users that have access to the Analyst tool.
- Data Integration Service
The Data Integration Service can process mappings in the native environment or push the mapping for processing to the Hadoop cluster in the Hadoop environment. The Data Integration Service also retrieves metadata from the Model repository when you run a Developer tool mapping or workflow. The Analyst tool and Developer tool connect to the Data Integration Service to run profile jobs and store profile results in the profiling warehouse.
- Mass Ingestion Service
The Mass Ingestion Service manages and validates mass ingestion specifications that you create in the Mass Ingestion tool. The Mass Ingestion Service deploys specifications to the Data Integration Service. When a specification runs, the Mass Ingestion Service generates ingestion statistics.
- Metadata Access Service
The Metadata Access Service is a user-managed service that allows the Developer tool to access Hadoop connection information to import and preview metadata. If the Hadoop cluster uses Kerberos authentication, the Metadata Access Service stores the Service Principal Name (SPN) and keytab information. You can create one or more Metadata Access Services on a node. Based on your license, the Metadata Access Service can be highly available. Informatica recommends that you create a separate Metadata Access Service instance for each Hadoop distribution. If you use a common Metadata Access Service instance for different Hadoop distributions, you might encounter exceptions.
HBase, HDFS, Hive, and MapR-DB connections use the Metadata Access Service when you import an object from a Hadoop cluster. Create and configure a Metadata Access Service before you create HBase, HDFS, Hive, and MapR-DB connections.
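Where the cluster uses Kerberos, the SPN and keytab pair that the Metadata Access Service stores corresponds to a standard Hadoop keytab login. The following Java sketch illustrates only that underlying mechanism, not the Informatica service configuration; the principal, keytab path, and class name are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Minimal sketch of keytab-based Kerberos login with the Hadoop client API.
public class KerberosLoginSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the Hadoop client that the cluster uses Kerberos authentication.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Authenticate with a service principal name (SPN) and its keytab file.
        UserGroupInformation.loginUserFromKeytab(
                "svc_metadata/host.example.com@EXAMPLE.COM",   // hypothetical SPN
                "/etc/security/keytabs/svc_metadata.keytab");  // hypothetical keytab path

        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}
```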
- Model Repository Service
The Model Repository Service manages the Model repository. The Model Repository Service connects to the Model repository when you run a mapping, mapping specification, profile, or workflow.
Repositories
Big Data Management uses repositories and other databases to store data related to connections, source metadata, data domains, data profiling, data masking, and data lineage. Big Data Management uses application services in the Informatica domain to access data in repositories.
Big Data Management uses the following databases:
- Model repository
The Model repository stores profiles, data domains, mapping, and workflows that you manage in the Developer tool. The Model repository also stores profiles, data domains, and mapping specifications that you manage in the Analyst tool.
- Profiling warehouse
The Data Integration Service runs profiles and stores profile results in the profiling warehouse.
Hadoop Environment
Big Data Management can connect to clusters that run different Hadoop distributions. Hadoop is an open-source software framework that enables distributed processing of large data sets across clusters of machines. You might also need to use third-party software clients to set up and manage your Hadoop cluster.
Big Data Management can connect to supported data sources in the Hadoop environment, such as HDFS, HBase, or Hive, and push job processing to the Hadoop cluster. To enable high-performance access to files across the cluster, you can connect to an HDFS source. You can also connect to a Hive source, which is a data warehouse system built on top of HDFS.
Big Data Management can also connect to NoSQL databases such as HBase, a non-relational database on Hadoop that stores data as key-value pairs and performs operations in real time. The Data Integration Service pushes mapping and profiling jobs to the Blaze, Spark, or Hive engine in the Hadoop environment.
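As a point of reference, the following Java sketch shows what direct client access to these source types looks like with the standard Hadoop, HBase, and Hive client libraries. It is a minimal illustration only; Big Data Management connects through its own connection objects, and the host names, ports, table, and file path here are hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HadoopSourcesSketch {
    public static void main(String[] args) throws Exception {
        // HDFS: check a file on the distributed file system (hypothetical NameNode and path).
        Configuration hdfsConf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), hdfsConf);
        System.out.println("HDFS file exists: " + fs.exists(new Path("/data/orders/part-00000")));

        // HBase: real-time lookup of a single row by key (hypothetical quorum and table).
        Configuration hbaseConf = HBaseConfiguration.create();
        hbaseConf.set("hbase.zookeeper.quorum", "zk1.example.com");
        try (org.apache.hadoop.hbase.client.Connection conn = ConnectionFactory.createConnection(hbaseConf);
             Table table = conn.getTable(TableName.valueOf("orders"))) {
            Result row = table.get(new Get(Bytes.toBytes("order-1001")));
            System.out.println("HBase row empty: " + row.isEmpty());
        }

        // Hive: connect to a warehouse database through HiveServer2 over JDBC (hypothetical host).
        try (java.sql.Connection hive = java.sql.DriverManager.getConnection(
                "jdbc:hive2://hiveserver.example.com:10000/default")) {
            System.out.println("Connected to Hive: " + !hive.isClosed());
        }
    }
}
```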
Big Data Management supports more than one version of some Hadoop distributions. By default, the cluster configuration wizard populates the latest supported version.
Hadoop Utilities
Big Data Management uses third-party Hadoop utilities such as Sqoop to process data efficiently.
Sqoop is a Hadoop command line program that transfers data between relational databases and HDFS through MapReduce programs. You can use Sqoop to import and export data. When you use Sqoop, you do not need to install the relational database client and software on any node in the Hadoop cluster.
To use Sqoop, you must configure Sqoop properties in a JDBC connection and run the mapping in the Hadoop environment. You can configure Sqoop connectivity for relational data objects, customized data objects, and logical data objects that are based on a JDBC-compliant database. For example, you can configure Sqoop connectivity for the following databases:
- Aurora
- Greenplum
- IBM DB2
- IBM DB2 for z/OS
- Microsoft SQL Server
- Netezza
- Oracle
- Teradata
The Model Repository Service uses JDBC to import metadata. The Data Integration Service runs the mapping in the Hadoop run-time environment and pushes the job processing to Sqoop. Sqoop then creates MapReduce jobs in the Hadoop cluster, which perform the import and export jobs in parallel.
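For illustration, the following Java sketch shows the kind of JDBC URL and metadata read that such a connection relies on, using a hypothetical Oracle source. The URL, schema, and credentials are placeholders; in practice you define them in the JDBC connection properties rather than in code.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

// Minimal sketch: connect over JDBC and list tables in a schema,
// similar in spirit to importing metadata before data movement.
public class JdbcMetadataSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oracle JDBC URL; host, port, and service name will differ.
        String url = "jdbc:oracle:thin:@//dbhost.example.com:1521/ORCLPDB1";

        try (Connection conn = DriverManager.getConnection(url, "dbuser", "dbpassword")) {
            DatabaseMetaData meta = conn.getMetaData();
            // List the tables in a hypothetical SALES schema.
            try (ResultSet tables = meta.getTables(null, "SALES", "%", new String[] {"TABLE"})) {
                while (tables.next()) {
                    System.out.println(tables.getString("TABLE_NAME"));
                }
            }
        }
    }
}
```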
Specialized Sqoop Connectors
When you run mappings through Sqoop, you can use the following specialized connectors:
- OraOop
You can use OraOop with Sqoop to optimize performance when you read data from or write data to Oracle. OraOop is a specialized Sqoop plug-in for Oracle that uses native protocols to connect to the Oracle database.
You can configure OraOop when you run Sqoop mappings on the Spark and Hive engines.
- Teradata Connector for Hadoop (TDCH) Specialized Connectors for Sqoop
You can use the following TDCH specialized connectors for Sqoop to read data from or write data to Teradata:
  - Cloudera Connector Powered by Teradata
  - Hortonworks Connector for Teradata (powered by the Teradata Connector for Hadoop)
  - MapR Connector for Teradata
These connectors are specialized Sqoop plug-ins that Cloudera, Hortonworks, and MapR provide for Teradata. They use native protocols to connect to the Teradata database.
Informatica supports Cloudera Connector Powered by Teradata and Hortonworks Connector for Teradata on the Blaze and Spark engines. When you run Sqoop mappings on the Blaze engine, you must configure these connectors. When you run Sqoop mappings on the Spark engine, the Data Integration Service invokes these connectors by default.
Informatica supports MapR Connector for Teradata on the Spark engine. When you run Sqoop mappings on the Spark engine, the Data Integration Service invokes the connector by default.
Note: For information about running native Teradata mappings with Sqoop, see the Informatica PowerExchange for Teradata Parallel Transporter API User Guide.
Big Data Management Engines
When you run a big data mapping, you can choose to run the mapping in the native environment or a Hadoop environment. If you run the mapping in a Hadoop environment, the mapping will run on one of the following job execution engines:
- Blaze engine
- Spark engine
- Hive engine
For more information about how Big Data Management uses each engine to run mappings, workflows, and other tasks, see the chapter about Big Data Management Engines.
High Availability
High availability refers to the uninterrupted availability of Hadoop cluster components.
You can use high availability for the following services and security systems in the Hadoop environment on Cloudera CDH, Hortonworks HDP, and MapR Hadoop distributions:
- Apache Ranger
- Apache Ranger KMS
- Apache Sentry
- Cloudera Navigator Encrypt
- HBase
- Hive Metastore
- HiveServer2
- Name node
- Resource Manager