You can use the Databricks catalog source to extract metadata from a Databricks source system.
Databricks combines data warehouses and data lakes into an AI-driven Databricks Lakehouse platform.
You can run connection-aware scans on Databricks sources.
You can use a SQL warehouse or all-purpose clusters to extract metadata.
Extracted metadata
You can extract AI Model Core and AI Model Core Version metadata from Databricks Unity Catalog source systems. Additionally, you can retrieve lineage captured by Databricks Unity Catalog.
Note: Databricks Unity Catalog retains lineage data for 90 days.
You can extract metadata from Databricks notebooks if they use the following technologies (see the sketch after this list):
•Markdown
•Python
•SQL
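For example, the catalog source can process a notebook like the following sketch, shown in the exported source format of a Python notebook that contains Markdown, Python, and SQL cells. The demo.sales schema and table names are hypothetical, and the spark session is the one that the Databricks notebook provides.

```python
# Databricks notebook source
# MAGIC %md
# MAGIC ## Aggregate orders by day (Markdown cell)

# COMMAND ----------

# Python cell: read a hypothetical Unity Catalog table and write an aggregate table.
daily = (
    spark.table("demo.sales.orders")
         .groupBy("order_date")
         .count()
)
daily.write.mode("overwrite").saveAsTable("demo.sales.daily_order_counts")

# COMMAND ----------

# MAGIC %sql
# MAGIC -- SQL cell: the same aggregation expressed in SQL.
# MAGIC SELECT order_date, count(*) AS orders
# MAGIC FROM demo.sales.orders
# MAGIC GROUP BY order_date
```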
You can extract metadata of the following objects from a Databricks workspace:
•Calculation
•Command
•Experiment
- Experiment Run
•Folder
•Job Parameter
•Live View
•Materialized View
•Notebook Definition
•Notebook Instance
•Notebook Parameter
•Notebook Task
•Pipeline Definition (SQL based)
- Live View
- Materialized View
- Streaming Table
- Streaming View
•Result
•Run Job Task
•Streaming Table
•Streaming View
•Task Parameter
•Workflow Job Definition
•Workflow Job Instance
You can extract metadata of the following objects from Databricks Unity Catalog:
•Column
- Primary Key
- Foreign Key
•Dashboard
•Database
•External Column
•External Table
•File
•File System
•Function
•Schema
•Table
•View
•Volume
You can extract the following complex data type columns along with their nested fields from Databricks source systems (see the sketch after this list):
•Map
•Struct
•Array
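As an illustration, the following sketch creates a table that contains all three complex types from a Databricks Python notebook. The demo.events catalog and schema names are assumptions, and the spark session is the one that the notebook provides.

```python
# A minimal sketch, assuming the hypothetical demo.events catalog and schema exist.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.events.clicks (
        click_id   BIGINT,
        tags       ARRAY<STRING>,                         -- Array column
        attributes MAP<STRING, STRING>,                   -- Map column
        device     STRUCT<os: STRING, version: STRING>    -- Struct column with nested fields
    )
    USING DELTA
""")
```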
Compatible functionalities
Databricks offers integration with a diverse range of modules and programming languages.
You can use Databricks with the following programming languages:
•Python
•SQL
•Markdown
You can use Databricks with the following Python functionalities:
•Standard language constructions
•Standard built-in functions
•Partially-compatible modules:
Note: Data Governance and Catalog processes only a subset of library functions of partially-compatible modules.
- abs
- adal
- argparse
- array
- ast
- azure
- base64
- binascii
- calendar
- codecs
- collections
- concurrent
- contextlib
- contextvars
- copy
- copyreg
- csv
- dataclasses
- datetime
- dbutils
- decimal
- delta
- difflib
- distutils
- email
- enum
- errno
- fnmatch
- fractions
- functools
- gc
- genericpath
- gettext
- glob
- graphframes
- hashlib
- heapq
- hmac
- importlib
- inspect
- io
- itertools
- json
- keyword
- locale
- logging
- math
- matplotlib
- nt
- numbers
- numpy
- operator
- os
- pandas
- pathlib
- pickle
- pkgutil
- posix
- posixpath
- pprint
- py4j
- pyodbc
- pyspark
- pytz
- random
- re
- reprlib
- requests
- seaborn
- secrets
- shutil
- simplejson
- six
- sklearn
- smtplib
- socket
- ssl
- stat
- string
- struct
- subprocess
- sys
- teradatasql
- textwrap
- threading
- time
- traceback
- types
- typing
- urllib
- urllib3
- uuid
- warnings
- weakref
- xml
- yaml
- zipfile
- zlib
•Custom libraries
Note: Custom libraries are libraries created by a user.
Note: If the Databricks catalog source detects an incompatible function or library, it can't process the statement. It skips the statement and continues to process the next one.
You can also use Delta Lake SQL in SQL commands and PySpark calls, as in the sketch below.
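The following sketch issues a Delta Lake SQL MERGE statement through a PySpark call. The demo.sales table names are hypothetical, and the spark session is the one that the Databricks notebook provides.

```python
# Delta Lake SQL issued from a PySpark call; the demo.sales tables are hypothetical.
spark.sql("""
    MERGE INTO demo.sales.orders AS t
    USING demo.sales.orders_staging AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```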
Data profiling for Databricks objects
Configure data profiling to run profiles on the metadata extracted from a Databricks source system. You can use all-purpose clusters or a SQL warehouse to run profiles. You can also run profiles on Databricks Unity Catalog objects. You can view the profiling statistics in Data Governance and Catalog.
You can run profiles on the following Databricks Unity Catalog objects:
•Delta Table
•External Table in Delta, Parquet, and CSV formats
•View
The data profiling task runs profiles on the following data types:
•Bigint
•Boolean
•Date
•Decimal
•Double
•Float
•Int
•Smallint
•String
•Tinyint
•Timestamp
Compatible connectors
Before you configure a Databricks catalog source, you must connect to the Databricks source system. Use the Databricks connector to establish the connection.
Note: To enable profiling, perform the following prerequisite steps before you configure a Databricks connector in Administrator. A scripted sketch of these steps follows the list.
1. Create a folder named informatica.databricksdelta in the <Secure Agent installation directory>/ext/connectors/thirdparty/ folder.
2. Copy the SparkJDBC42.jar file to the informatica.databricksdelta folder. The minimum supported Spark server version is 3.x.
3. Run the ./infaagent.sh startup script in the <Secure Agent installation directory>/apps/agentcore folder to restart the Secure Agent.
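The following Python sketch automates the same prerequisite steps on the Secure Agent machine. The installation path /opt/infaagent and the location of the downloaded SparkJDBC42.jar file are assumptions; adjust them to your environment.

```python
# A minimal sketch, assuming /opt/infaagent is the Secure Agent installation
# directory and /tmp/SparkJDBC42.jar is the downloaded Spark JDBC driver.
import shutil
import subprocess
from pathlib import Path

agent_home = Path("/opt/infaagent")  # assumed <Secure Agent installation directory>

# Step 1: create the informatica.databricksdelta folder under ext/connectors/thirdparty.
target = agent_home / "ext" / "connectors" / "thirdparty" / "informatica.databricksdelta"
target.mkdir(parents=True, exist_ok=True)

# Step 2: copy the SparkJDBC42.jar driver into the new folder.
shutil.copy("/tmp/SparkJDBC42.jar", target)

# Step 3: restart the Secure Agent with the infaagent.sh startup script.
subprocess.run(
    ["./infaagent.sh", "startup"],
    cwd=agent_home / "apps" / "agentcore",
    check=True,
)
```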