
Step 2. Configure capabilities

When you configure the Microsoft Fabric OneLake catalog source, you define the settings for the metadata extraction capability and other optional capabilities.
The metadata extraction capability extracts source metadata from external source systems. You can also configure other capabilities that the catalog source includes.
You can save the catalog source configuration at any point after you enter the connection information. After you save the catalog source, you can choose to run the catalog source job. To run the job once, click Run. To run metadata extraction and other capabilities on a recurring schedule, configure schedules on the Schedule tab.

Configure metadata extraction

When you configure a Microsoft Fabric OneLake catalog source, you choose a runtime environment, define filters, and enter configuration parameters for metadata extraction.
    1. In the Connection and Runtime area, choose a serverless runtime environment or the Secure Agent group where you want to run catalog source jobs.
    Note:
    Serverless runtime environment options are available only if the catalog source works with a serverless runtime environment.
    2. Use the Metadata Change Option to choose whether to retain, delete, or deprecate objects in the catalog that are deleted from the source system.
    Note:
    You can also change the configured metadata change option when you run a catalog source.
    3. In the Filters area, define one or more filter conditions to apply to metadata extraction:
        a. Select Yes to view filter options.
        b. From the Include/Exclude list, choose whether to include or exclude metadata based on the filter parameters.
        c. From the Object type list, select the type of object that you want to extract metadata from. The available object types are File, Folder, and Path.
        d. Enter the path to the object as the filter value.
        e. To define an additional filter with an OR condition, click the Add icon.
    4. Optional. In the Configuration Parameters area, enter properties to override default context values and job parameters.
    Note:
    Click Show Advanced to view all configuration parameters.
    The following table describes the properties that you enter for Catalog Source Configuration Options:
    Parameter
    Description
    Extract Group Elements from Hierarchical File
    Select Yes to extract group elements from hierarchical files and to view the complete hierarchy of hierarchical fields. You can't run profiles on hierarchical fields if you choose to extract group elements.
    Select No to extract only the leaf elements from hierarchical files.
    Enable Extension-Based File Type Detection
    Choose to use the extension of the file for file type detection:
    - Select Yes to use file extensions to detect file types.
    - Select No to parse the file contents to detect file types.
    Use First Row as Header of Delimited Files
    Select one of the following options to use the first row as the header or detect headers automatically for delimited files:
    - Yes. Detects column headers based on the following rules:
      - Duplicate headers are suffixed with '#' followed by a number, for example, ABC#1, ABC#2. The detection is not case sensitive.
      - Empty column header values appear as <UnknownColumn><position>, for example, UnknownColumn2.
      - The header row in the file is detected even if it has a different number of columns than the data rows.
    - No. Detects headers automatically for delimited files.
    Headers of Delimited Files
    Specify a comma-separated list of header values for delimited files. If any header value from the list is found in the first row of a delimited file, the first row is used as the header.
    Treat Files Without Extension As
    Select one of the following options:
    - Parquet
    - Avro
    - Json
    Enter File Delimiter
    Specify a new delimiter, enclosed in single quotes, if your file uses a delimiter other than the default delimiters: ',', '\t', ';', ':', '|'.
    Note:
    Adding a custom delimiter overrides the default list of delimiters. If you specify a delimiter, characters from the default list are not considered as delimiters.
    Files to be excluded during partition discovery
    Regular expressions that match the files to exclude during partition discovery. Separate multiple regular expressions with commas and enclose each one in double quotes, for example: ".*json","Customer.csv","Parquet.*"
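As an illustration only (not product code), the header-detection rules described in the Use First Row as Header of Delimited Files parameter could be sketched in Python as follows. The function name and exact boundary behavior are assumptions:

```python
from collections import Counter

def normalize_headers(raw_headers):
    """Sketch of the documented header rules; actual product logic may differ.

    - Duplicate headers (compared case-insensitively) are suffixed with '#'
      and a running number, for example ABC, abc -> ABC#1, abc#2.
    - Empty header cells become UnknownColumn<position> (1-based position).
    """
    # Count how often each header appears, ignoring case and empty cells.
    totals = Counter(h.lower() for h in raw_headers if h)
    seen = Counter()
    result = []
    for position, header in enumerate(raw_headers, start=1):
        if not header:
            result.append(f"UnknownColumn{position}")
        elif totals[header.lower()] > 1:
            seen[header.lower()] += 1
            result.append(f"{header}#{seen[header.lower()]}")
        else:
            result.append(header)
    return result
```

For instance, the header row `ABC, abc, <empty>, id` would normalize to `ABC#1, abc#2, UnknownColumn3, id` under these rules.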
    The following table describes the properties that you enter for Partitioned File Configuration:
    Property
    Description
    Custom Partition Detection Configuration File
    Select a configuration file in JSON format to specify the pattern of non-Hive-style custom partitions in your source system.
    Note:
    The system automatically detects partitions when the date format of the partition key matches a supported format. For example, the system interprets dd-MM-yyyy as the partition format and extracts the partition as a hierarchical file instead of a folder.
    The system doesn't automatically detect partitions with other patterns, such as MM-dd-yyyy. If the pattern is different, you need to define it in the JSON configuration file. The same applies to the epoch time format, which you must also define in the JSON file.
    Partitioned Pruning Configuration Options
    Select a configuration file in the JSON format. In the file, specify additional information to identify partitions for determining the relevant schema in your source system.
    Enable Partition Detection
    Select Yes or No.
    Enable Partition Pruning
    Select Yes or No.
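To illustrate what pattern-based partition detection means here, a detector for the dd-MM-yyyy and epoch formats mentioned above might look like the following. The patterns and directory names are hypothetical; the product reads such patterns from the JSON configuration file, not from code:

```python
import re

# Hypothetical partition-key patterns; illustrative only.
PARTITION_PATTERNS = {
    "dd-MM-yyyy": re.compile(r"^\d{2}-\d{2}-\d{4}$"),
    "epoch": re.compile(r"^\d{10}$"),  # e.g. seconds since 1970
}

def partition_format(dirname):
    """Return the name of the first matching partition pattern, or None."""
    for name, pattern in PARTITION_PATTERNS.items():
        if pattern.match(dirname):
            return name
    return None
```

A directory named `12-01-2024` would be recognized as a dd-MM-yyyy partition, while `sales` would be treated as an ordinary folder.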
    The following table describes the property that you can enter for Additional Settings:
    Note:
    The Additional Settings section appears when you click Show Advanced.
    Property
    Description
    Expert parameters
    Enter additional configuration options to pass at runtime. Required if you need to troubleshoot the catalog source job.
    Caution:
    Use expert parameters only when Informatica Global Customer Support recommends them.
    5. To configure additional capabilities for the catalog source, click the corresponding tabs.
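For intuition only, choosing a delimiter from the default list described in the Enter File Delimiter parameter could be sketched as follows. The function and the counting heuristic are illustrative assumptions, not the product's implementation:

```python
# Default delimiters from the documentation; specifying a custom delimiter
# would replace this list entirely.
DEFAULT_DELIMITERS = [",", "\t", ";", ":", "|"]

def guess_delimiter(first_line, candidates=DEFAULT_DELIMITERS):
    """Pick the candidate delimiter that occurs most often in the first line."""
    counts = {d: first_line.count(d) for d in candidates}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

For a first line such as `id;name;city`, the sketch would pick `;`; a line with no candidate characters yields None, which is where extension- or content-based detection would take over.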

Filter guidelines and examples

You can add metadata extraction filters when you configure the catalog source. To create a filter, you can choose from file names, folder names, or paths.
Consider the following guidelines when you enter filter values:
Important:
Use either a path filter alone, or a combination of a folder filter and a file filter.

Examples

You can include or exclude metadata from folders, files, or paths.
Folder filters
Folder filters apply to folders included in the source system.
File filters
File filters apply to the files included in folders that you filter. The file filter is recursive. If you don’t provide any folder filters, the file filters apply to the entire folder hierarchy.
Path filters
Path filters apply to the files and folders in the path that you filter. The path filter is non-recursive. If you provide only the file or folder names, the path filters apply to the first level files or directories.
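The recursive versus non-recursive distinction between file filters and path filters can be sketched as follows; the helper names and example paths are hypothetical:

```python
def matches_file_filter(path, file_name):
    # File filters are recursive: the file name matches at any depth
    # of the folder hierarchy.
    return path.rsplit("/", 1)[-1] == file_name

def matches_path_filter(path, filter_value):
    # Path filters are non-recursive: the value must match the path itself,
    # so a bare name matches only first-level files or directories.
    return path == filter_value
```

For example, a file filter value of `data.csv` would match `dir1/dir2/data.csv`, but a path filter value of `dir2` would match only a first-level object named `dir2`.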

Configure data profiling and quality

Enable the data profiling capability to evaluate the quality of metadata extracted from the Microsoft Fabric OneLake source system.
    1. Click the Data Profiling and Quality tab.
    2. Expand Data Profiling and select Enable Data Profiling.
    Note:
    Ensure that you have permissions on all the staging connections that you use in your data profiling configuration. You can't run the job if you don't have permissions on the connections that you use. Select connections that you have access to, or ask the administrator to grant the necessary permissions on the connections that you want to use.
    3. In the Connection and Runtime area, choose the Secure Agent group where you want to run catalog source jobs.
    4. Optional. In the Filters area, specify filters in addition to the metadata extraction filters:
        a. Select Yes to view filter options.
        b. From the Include or Exclude metadata list, choose whether to include or exclude metadata based on the filter parameters.
        c. From the object type list, select Files or Folders, depending on the object that you want to extract metadata from.
        d. Enter the path to the object as the filter value.
        e. To include or exclude multiple objects, click the Add icon to add filters with the OR condition.
    5. In the Parameters area, configure the following parameters based on your requirements:
    Parameter
    Description
    Modes of Run
    Determines the type of data that you want the data profiling task to collect.
    Choose one of the following options:
    - Keep signatures only. Collects only aggregate information such as data types, average, standard deviation, and patterns.
    - Keep signatures and values. Collects both signatures and data values.
    Profiling Scope
    Determines whether you want to run data profiling only on the changes made to the source system or on the entire source system.
    Choose one of the following options:
    - Incremental. Includes only source metadata that is changed or updated since the last profile run.
    - Full. Includes the entire metadata that is extracted based on the filters applied for extraction.
    Sampling Type
    Determines the sample rows on which you want to run the data profiling task. You can run data profiling on all rows in the metadata.
    Elastic Runtime Environment
    Select an elastic runtime environment for Parquet complex file type.
    Staging Connection
    The staging connection where data profiling results are stored temporarily during the execution.
    Applicable for data profiling on complex file types.
    Maximum Precision of String Fields
    The maximum precision value to be used for profiling fields that include the string data type. Default is 50.
    Text Qualifier
    The character that defines string boundaries. If you select a quote character, profiling ignores delimiters within the quotes. Select a qualifier from the list. Default is Double Quote.
    Code Page for Delimited Files
    The code page for delimited files.
    Escape Character for Delimited Files
    The escape character for delimited files.
    Read Multiple Line JSON Files
    By default, the data profiling job reads each JSON schema as a single line. Select Yes to read input that spans across multiple lines.
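The single-line versus multiline distinction corresponds to JSON Lines input versus a JSON document that spans several lines, as this small Python illustration shows:

```python
import json

# Single-line mode: each line holds one complete JSON record (JSON Lines).
single_line_input = '{"id": 1}\n{"id": 2}'
records = [json.loads(line) for line in single_line_input.splitlines()]

# Multiline mode: one record spans several lines, so the input must be
# parsed as a whole document instead of line by line.
multiline_input = '{\n  "id": 3\n}'
record = json.loads(multiline_input)
```

If your source files contain pretty-printed JSON like the second example, select Yes so the job parses whole documents rather than individual lines.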
    6. Expand Data Quality and select Enable Data Quality.
    Note:
    You can click Use Data Profiling Parameters to use the same parameters as in the Data Profiling section.
    Note:
    Ensure that you have permissions on all the staging and flat file connections that you use in your data quality configuration. You can't run the job if you don't have permissions on the connections that you use. Select connections that you have access to, or ask the administrator to grant the necessary permissions on the connections that you want to use.
    7. In the Connection and Runtime area, choose the Secure Agent group where you want to run catalog source jobs.
    8. In the Parameters area, configure the following parameters based on your requirements:
    Parameter
    Description
    Data Quality Rule Automation
    Enable the option to automatically create or update rule occurrences for data elements in the catalog source.
    Choose one of the following options:
    - Apply on Data Elements linked with Business Dataset. Creates rule occurrences for all data elements that are linked with business data sets in the catalog source.
    - Apply on all Data Elements. Creates rule occurrences for all data elements in the catalog source.
    Cache Result
    Select Agent Cache if you want to generate a cache file in the runtime environment and to preview the cached results faster in subsequent data preview runs. The results are cached for seven days by default after the first run in the runtime environment. Select No Cache if you don't want to cache the preview results and view the live results.
    Run Rule Occurrence Frequency
    Specify whether you want to run data quality rules based on the frequency defined for the rule occurrence in Data Governance and Catalog.
    Sampling Type
    Determines the sample rows on which you want to run the data quality task. You can run data quality tasks on all rows in the metadata.
    Elastic Runtime Environment
    Select an elastic runtime environment for Parquet complex file type.
    Staging Connection
    The staging connection where data quality results are stored temporarily during the execution.
    Applicable for data quality on complex file types.
    Maximum Precision of String Fields
    The maximum precision value to be used for profiling fields that include the string data type. Default is 50.
    Text Qualifier
    The character that defines string boundaries. If you select a quote character, profiling ignores delimiters within the quotes. Select a qualifier from the list. Default is Double Quote.
    Code Page for Delimited Files
    The code page for the delimited files.
    Escape Character for Delimited Files
    The escape character for the delimited files.
    Read Multiple Line JSON Files
    By default, the data quality job reads each JSON schema as a single line. Select Yes to read input that spans across multiple lines.
    9. To enable the data observability capability, expand Data Observability and select Enable Data Observability.

Configure data classification

Enable the data classification capability to identify and organize data into relevant categories based on the functional meaning of the data.
    1. Click the Data Classification tab.
    2. Select Enable Data Classification.
    3. Choose one or both of the following options:

Configure glossary association

Enable the glossary association capability to associate glossary terms with technical assets, or to get recommendations for glossary terms that you can manually associate with technical assets in Data Governance and Catalog.
Metadata Command Center considers all published business terms in the glossary while making recommendations to associate your technical assets.
    1. Click the Glossary Association tab.
    2. Select Enable Glossary Association.
    3. Select Enable auto-acceptance to automatically accept glossary association recommendations.
    4. Specify the Confidence Score Threshold for Auto-Acceptance to set the threshold at which the glossary association capability automatically accepts the recommended glossary terms.
    Note:
    Specify a percentage from 80 to 100. If the score is higher than the specified limit, the glossary association capability automatically assigns a matching glossary term to the data element.
    5. If you enable auto-acceptance, you can select Enable Below-threshold Recommendations to receive glossary association recommendations that fall below the auto-acceptance threshold.
    6. Specify the Confidence Score Threshold for Recommendations to set the threshold at which the glossary association capability makes recommendations.
    If you enable auto-acceptance, specify a percentage from 80 up to the selected auto-acceptance threshold. You can accept or reject the recommended glossary terms that fall within this range in Data Governance and Catalog.
    If you disable auto-acceptance, specify a percentage from 80 to 100, inclusive.
    7. Choose whether to automatically assign business names and descriptions to technical assets. You can then choose to retain existing assignments and assign business names and descriptions only to assets that don't have assignments, or allow overwrite of existing assignments.
    By default, existing assignments are retained.
    8. Optional. Choose to ignore specific parts of data elements when making recommendations. Select Yes and enter prefix and suffix keyword values as needed.
    Click Select to enter a keyword. You can enter multiple unique prefix and suffix keywords. Keyword values are case insensitive.
    9. Optional. Choose specific top-level business glossary assets to associate with technical assets. Selecting a top-level asset also selects its child assets. Select Top-level Glossary Assets and specify the assets on the Select Assets page.
    10. Optional. Choose to use abbreviations and synonym definitions from lookup tables for accurate glossary association. Select Yes to enable, and then click Select to upload a lookup table.
    11. Click Next.
    The Associations page appears.
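The interaction of the two confidence thresholds can be sketched as follows. The threshold values are examples rather than defaults, and the exact boundary behavior (strictly higher versus at least) is an assumption:

```python
AUTO_ACCEPT_THRESHOLD = 90   # allowed range: 80 to 100
RECOMMEND_THRESHOLD = 80     # allowed range: 80 up to AUTO_ACCEPT_THRESHOLD

def classify_recommendation(confidence_score):
    """Decide what happens to a recommended glossary term (illustrative)."""
    if confidence_score > AUTO_ACCEPT_THRESHOLD:
        # Above the auto-acceptance threshold: the term is assigned
        # automatically.
        return "auto-accepted"
    if confidence_score >= RECOMMEND_THRESHOLD:
        # Within the recommendation range: review manually in
        # Data Governance and Catalog.
        return "recommended"
    return "not shown"
```

With these example values, a match scoring 95 is assigned automatically, one scoring 85 appears as a recommendation to accept or reject, and one scoring 50 is not surfaced.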