You can use Metadata Command Center to extract metadata from a source system.
A source system is any system that contains data or metadata. For example, Databricks is a source system from which you can extract metadata through a Databricks catalog source. A catalog source is an object that represents and contains metadata from the source system.
Before you extract metadata from a source system, you first create and register a catalog source that represents the source system. Then you configure capabilities for the catalog source. A capability is a task that Metadata Command Center can perform, such as metadata extraction, data profiling, data classification, or glossary association.
When Metadata Command Center extracts metadata, Data Governance and Catalog displays the extracted metadata and its attributes as technical assets. You can then perform tasks such as analyzing the assets, viewing lineage, and creating links between those assets and their business context.
The following list describes the capabilities that you can configure for the catalog source:
•Advanced Programming Language Parsing. Parses the source system code in addition to extracting objects from the source system.
•Data Profiling and Quality. Includes the following features:
- Data Profiling. Assesses source metadata and analyzes the collected statistics to discover content and structure, such as value distribution, patterns, and data types.
- Data Quality. Measures the reliability of the data and enables data usage.
- Data Observability. Identifies anomalies in the characteristics of the data.
•Data Classification. Identifies and organizes data into relevant categories based on the functional meaning of the data. Classifying data can help your organization manage risks, compliance, and data security.
•Glossary Association. Associates glossary terms with technical assets to provide user-friendly business names. Glossary Association automatically associates glossary terms with technical assets or recommends glossary terms that you can manually associate with technical assets in Data Governance and Catalog.
Extraction and view process
To extract metadata from a source system, configure the catalog source and run the catalog source job in Metadata Command Center. Then view the results in Data Governance and Catalog.
The following image shows the process to extract metadata from a source system:
After you verify prerequisites, perform the following tasks to extract metadata from Databricks:
1. Register a catalog source. Create a catalog source object, select the source system, and select the connection.
2. Configure the catalog source. Specify the runtime environment and configure parameters for metadata extraction. Optionally, add filters to include or exclude source system assets from metadata extraction. You can also configure other capabilities such as data profiling and quality.
3. Optionally, associate stakeholders. Associate users with technical assets, giving the users permission to perform actions determined by their roles.
4. Run or schedule the catalog source job.
5. Optionally, if the catalog source job generates referenced asset objects, assign a connection to the referenced source system assets. You can view lineage with object references without performing connection assignment. After connection assignment, you can view the referenced objects. Run the catalog source again after you assign connections to referenced source system assets.
After you run the catalog source job, you view the results in Data Governance and Catalog.
About the Databricks catalog source
You can use the Databricks catalog source to extract metadata from Databricks Delta Lake and Databricks Notebooks source systems.
Databricks combines data warehouses and data lakes into an AI-driven Databricks Lakehouse platform. Databricks Delta Lake is an open source data format and a transactional data management system on the Databricks platform. Databricks Notebooks is a web-based interface to a document that contains code that you can run, visualizations, and narrative text.
You can run connection-aware scans on Databricks sources.
You can use a SQL warehouse or an all-purpose cluster to extract metadata.
You can extract metadata from Databricks Unity Catalog. Additionally, you can retrieve lineage captured by Databricks Unity Catalog.
Note: Databricks Unity Catalog retains lineage data for 90 days.
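For example, a notebook cell that reads one Unity Catalog Delta table and writes another is the kind of source code the catalog source can extract, and Unity Catalog captures the table-to-table lineage that the catalog source can retrieve. The following is a minimal, hypothetical sketch; the catalog, schema, and table names are placeholders, not part of the product documentation.
    # Hypothetical Databricks notebook cell (Python).
    # Reads a Unity Catalog Delta table, aggregates it, and writes a new Delta table.
    # Unity Catalog records the resulting table-to-table lineage.
    orders = spark.table("main.sales.orders")  # three-level Unity Catalog name (placeholder)

    daily_totals = (
        orders
        .groupBy("order_date")
        .sum("amount")
        .withColumnRenamed("sum(amount)", "total_amount")
    )

    # Writing with saveAsTable creates a managed Delta table in Unity Catalog.
    daily_totals.write.mode("overwrite").saveAsTable("main.sales.daily_order_totals")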
Extracted metadata
You can use the Databricks catalog source to extract metadata from Databricks Delta Lake and Databricks Notebooks source systems.
You can extract the following objects from a Databricks Notebooks source system:
•Calculation
•Command
•Folder
•Job Parameter
•Live Table
•Live View
•Notebook Definition
•Notebook Instance
•Notebook Parameter
•Notebook Task
•Pipeline Definition
•Result
•Run Job Task
•Streaming Table
•Streaming View
•Task Parameter
•Workflow Job Definition
•Workflow Job Instance
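Several of these objects correspond to Delta Live Tables constructs defined in notebook code. As a rough, hypothetical illustration of where objects such as Live Table come from, a Delta Live Tables notebook might contain code like the following sketch. All table names are placeholders, and streaming tables and views are defined in a similar way with streaming reads.
    # Hypothetical Delta Live Tables notebook cell (Python).
    import dlt
    from pyspark.sql.functions import col

    # A live table populated from an existing Delta table (placeholder name).
    @dlt.table(name="raw_events", comment="Events copied from the bronze layer")
    def raw_events():
        return spark.read.table("main.bronze.events")

    # A downstream live table derived from raw_events.
    @dlt.table(name="clean_events", comment="Events with a valid event_type")
    def clean_events():
        return dlt.read("raw_events").where(col("event_type").isNotNull())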
You can extract the following objects from a Databricks Delta Lake source system:
•File
•File System
•Database
•Schema
•Table
•External Table
•External Column
•Column
- Primary Key
- Foreign Key
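Primary key and foreign key objects come from the informational constraints defined on Unity Catalog tables. As a minimal, hypothetical example, table definitions like the following, run from a notebook or SQL editor, would yield Table, Column, Primary Key, and Foreign Key objects. All catalog, schema, table, and constraint names are placeholders.
    # Hypothetical Unity Catalog table definitions with informational constraints.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS main.sales.customers (
            customer_id BIGINT NOT NULL,
            customer_name STRING,
            CONSTRAINT customers_pk PRIMARY KEY (customer_id)
        )
    """)

    spark.sql("""
        CREATE TABLE IF NOT EXISTS main.sales.orders (
            order_id BIGINT NOT NULL,
            customer_id BIGINT,
            amount DECIMAL(18, 2),
            CONSTRAINT orders_pk PRIMARY KEY (order_id),
            CONSTRAINT orders_customers_fk FOREIGN KEY (customer_id)
                REFERENCES main.sales.customers (customer_id)
        )
    """)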
You can extract the following complex data type columns along with their nested fields from Databricks Delta Lake source systems:
•Map
•Struct
•Array
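For instance, a hypothetical Delta table with columns of these complex types, whose nested fields the catalog source can extract, might be defined as follows. All names are placeholders.
    # Hypothetical Delta table with ARRAY, MAP, and STRUCT columns.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS main.sales.customer_profiles (
            customer_id BIGINT,
            phone_numbers ARRAY<STRING>,
            preferences MAP<STRING, STRING>,
            address STRUCT<street: STRING, city: STRING, postal_code: STRING>
        )
        USING DELTA
    """)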
Compatible functionalities
Databricks offers integration with a diverse range of modules and programming languages.
You can use Databricks with the following programming languages:
•Python
•SQL
•Markdown
You can use Databricks with the following Python functionalities:
•Standard language constructions
•Standard built-in functions
•Partially compatible modules:
Note: Data Governance and Catalog processes only a subset of the library functions in partially compatible modules.
- abs
- adal
- argparse
- array
- ast
- azure
- base64
- binascii
- calendar
- codecs
- collections
- concurrent
- contextlib
- contextvars
- copy
- copyreg
- csv
- dataclasses
- datetime
- dbutils
- decimal
- delta
- difflib
- distutils
- email
- enum
- errno
- fnmatch
- fractions
- functools
- gc
- genericpath
- gettext
- glob
- graphframes
- hashlib
- heapq
- hmac
- importlib
- inspect
- io
- itertools
- json
- keyword
- locale
- logging
- math
- matplotlib
- nt
- numbers
- numpy
- operator
- os
- pandas
- pathlib
- pickle
- pkgutil
- posix
- posixpath
- pprint
- py4j
- pyodbc
- pyspark
- pytz
- random
- re
- reprlib
- requests
- seaborn
- secrets
- shutil
- simplejson
- six
- sklearn
- smtplib
- socket
- ssl
- stat
- string
- struct
- subprocess
- sys
- teradatasql
- textwrap
- threading
- time
- traceback
- types
- typing
- urllib
- urllib3
- uuid
- warnings
- weakref
- xml
- yaml
- zipfile
- zlib
•Custom libraries
Note: Custom libraries are libraries created by a user.
Note: If the Databricks catalog source detects an incompatible function or library, it can't process the statement. It skips the statement and continues to process the next one.
You can also use Delta Lake SQL in SQL commands and PySpark calls.
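For example, a notebook cell that the parser can process might combine partially compatible modules such as pyspark and pandas with Delta Lake SQL issued through a PySpark call, along the lines of the following hypothetical sketch. Table names are placeholders, and any statement that uses an incompatible function or library would simply be skipped.
    # Hypothetical notebook cell mixing Delta Lake SQL, pyspark, and pandas.
    import pandas as pd
    from pyspark.sql import functions as F

    # Delta Lake SQL issued through a PySpark call (placeholder table name).
    recent_orders = spark.sql(
        "SELECT order_id, amount, order_date FROM main.sales.orders WHERE order_date >= '2024-01-01'"
    )

    # PySpark transformation on the result.
    totals = recent_orders.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))

    # Conversion to pandas for lightweight local analysis.
    totals_pd = totals.toPandas()
    print(totals_pd.describe())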
Data profiling for Databricks objects
Configure data profiling to run profiles on the metadata extracted from a Databricks Delta Lake source system. You can use all-purpose clusters or SQL warehouse to run profiles. You can also run profiles on Databricks Unity Catalog objects. You can view the profiling statistics in Data Governance and Catalog.
You can run profiles on the following Databricks Delta Lake objects:
•Delta Table
•External Table in Delta format
•View
The data profiling task runs profiles on columns with the following data types:
•Bigint
•Boolean
•Date
•Decimal
•Double
•Float
•Int
•Smallint
•String
•Tinyint
•Timestamp
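As a hypothetical illustration, a Delta table whose columns use only these data types, and on which you could therefore run profiles, might be defined as follows. All names are placeholders.
    # Hypothetical Delta table that uses only profilable data types.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS main.sales.order_facts (
            order_id BIGINT,
            is_returned BOOLEAN,
            order_date DATE,
            list_price DECIMAL(18, 2),
            discount_rate DOUBLE,
            weight_kg FLOAT,
            quantity INT,
            store_id SMALLINT,
            status STRING,
            priority TINYINT,
            created_at TIMESTAMP
        )
        USING DELTA
    """)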
Compatible connectors
Before you configure a Databricks catalog source, you must connect to Databricks Notebooks and Databricks Delta Lake source systems.
Use the Databricks Delta connector to connect to Databricks Notebooks and Databricks Delta Lake source systems.
Note: To enable profiling, perform the following prerequisite steps before you configure a Databricks Delta connector in Administrator:
1. Create a folder named informatica.databricksdelta in the <Secure Agent installation directory>/ext/connectors/thirdparty/ folder.
2. Copy the SparkJDBC42.jar file to the informatica.databricksdelta folder. The minimum supported Spark server version is 3.x.
3. Restart the Secure Agent by running the ./infaagent startup script located in the <Secure Agent installation directory>/apps/agentcore folder.
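The following is a minimal, hypothetical Python sketch that automates these prerequisite steps on the Secure Agent machine. The Secure Agent installation directory and the location of the downloaded SparkJDBC42.jar file are assumptions; adjust the paths for your environment.
    # Hypothetical helper that prepares the Databricks Delta connector for profiling.
    import shutil
    import subprocess
    from pathlib import Path

    SECURE_AGENT_HOME = Path("/opt/infaagent")  # placeholder Secure Agent installation directory
    JDBC_JAR = Path("/tmp/SparkJDBC42.jar")     # placeholder path to the downloaded driver

    # Step 1: create the informatica.databricksdelta folder under ext/connectors/thirdparty.
    target_dir = SECURE_AGENT_HOME / "ext" / "connectors" / "thirdparty" / "informatica.databricksdelta"
    target_dir.mkdir(parents=True, exist_ok=True)

    # Step 2: copy the SparkJDBC42.jar file into the new folder.
    shutil.copy2(JDBC_JAR, target_dir / "SparkJDBC42.jar")

    # Step 3: restart the Secure Agent by running the infaagent startup script.
    agentcore_dir = SECURE_AGENT_HOME / "apps" / "agentcore"
    subprocess.run(["./infaagent", "startup"], cwd=agentcore_dir, check=True)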