Profiling Data Overview

A profile is a set of metadata that describes the content and structure of a dataset.

Data profiling is often the first step in a project. You can run a profile to evaluate the structure of data and verify that data columns are populated with the types of information you expect. If a profile reveals problems in the data, you can define steps in your project to fix those problems. For example, if a profile reveals that a column contains values that are longer than expected, you can design data quality processes to remove or fix the problem values.

A profile that analyzes the data quality of selected columns is called a column profile.

Note: You can also use the Developer tool to discover primary key, foreign key, and functional dependency relationships, and to analyze join conditions on data columns.

A column profile provides the following facts about data:

•The number of unique and null values in each column, expressed as a number and a percentage.

•The patterns of data in each column, and the frequencies with which these patterns occur.

•Statistics about the column values, such as the maximum and minimum lengths of values and the first and last values in each column.

•For join analysis profiles, the degree of overlap between two data columns, displayed as a Venn diagram and as a percentage value. Use join analysis profiles to identify possible problems with column join conditions.
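
The column-level facts in the list above can be illustrated outside the Developer tool with a short Python sketch. This is not how the Developer tool computes a profile; it is a minimal illustration under stated assumptions. The function names, the sample ZIP code values, the pattern notation (X for a letter, 9 for a digit), the reading of "unique" as "distinct", and the reading of "first and last values" as first and last in sort order are all assumptions made for this example.

```python
import re
from collections import Counter


def value_pattern(value):
    """Reduce a string to a character pattern: letters become X, digits become 9.

    The X/9 notation is an assumption chosen for this sketch.
    """
    pattern = re.sub(r"[A-Za-z]", "X", value)
    return re.sub(r"[0-9]", "9", pattern)


def profile_column(values):
    """Return basic profile facts for one column. None is treated as a null value."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    null_count = total - len(non_null)
    unique_count = len(set(non_null))  # assumption: "unique" means distinct non-null values
    lengths = [len(v) for v in non_null]
    return {
        "null_count": null_count,
        "null_pct": 100.0 * null_count / total if total else 0.0,
        "unique_count": unique_count,
        "unique_pct": 100.0 * unique_count / total if total else 0.0,
        "patterns": Counter(value_pattern(v) for v in non_null),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        # assumption: first and last values in sort order
        "first_value": min(non_null) if non_null else None,
        "last_value": max(non_null) if non_null else None,
    }


# Hypothetical sample column: one null and one value that is shorter than the rest.
zip_codes = ["02108", "90001", None, "9000", "02108"]
profile = profile_column(zip_codes)
print(profile)
```

In this sample, the pattern counts show one value that does not match the five-digit pattern of the others, which is the kind of problem a column profile is meant to surface.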

You can run a column profile at any stage in a project to measure data quality and to verify that changes to the data meet your project objectives. You can run a column profile on a transformation in a mapping to indicate the effect that the transformation will have on data.
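
The degree of overlap that a join analysis reports can also be sketched in Python. This is a minimal illustration, not the Developer tool's method: the function name, the sample customer ID values, and the choice to express the overlap as shared distinct values divided by all distinct values in either column are assumptions made for this example. The tool may define the percentage differently.

```python
def join_overlap(left_values, right_values):
    """Compare the distinct non-null values of two columns.

    Returns the three regions of a two-set Venn diagram plus an overlap
    percentage. The percentage formula (shared / union) is an assumption.
    """
    left = {v for v in left_values if v is not None}
    right = {v for v in right_values if v is not None}
    shared = left & right
    union = left | right
    return {
        "left_only": len(left - right),
        "shared": len(shared),
        "right_only": len(right - left),
        "overlap_pct": 100.0 * len(shared) / len(union) if union else 0.0,
    }


# Hypothetical customer ID columns from two data sources.
boston_ids = ["C001", "C002", "C003", "C004"]
la_ids = ["C003", "C004", "C005"]
overlap = join_overlap(boston_ids, la_ids)
print(overlap)
```

A low overlap on a column that you plan to use as a join condition suggests that a join on that column will match few rows.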

Story

HypoStores wants to verify that customer data is free from errors, inconsistencies, and duplicate information. Before HypoStores designs the processes to deliver the data quality objectives, it needs to measure the quality of its source data files and confirm that the data is ready to process.

Objectives

In this lesson, you complete the following tasks:

•Perform a join analysis on the Boston_Customers data source and the LA_Customers data source.

•View the results of the join analysis to determine whether you can merge data from the two offices.

•Run a column profile on the All_Customers data source.

•View the column profiling results to observe the values and patterns contained in the data.

Prerequisites

Before you start this lesson, verify the following prerequisite:

•You have completed lessons 1 and 2 in this tutorial.

Time Required

•Set aside 20 minutes to complete this lesson.