The accuracy of a match model depends on the quality of the training data set. The training data set provides the examples that the machine learning (ML) model learns from. It is important that the training data set accurately reflects your business scenarios.
To select a training data set for the ML model, review the following recommendations:
•Start by analyzing the characteristics of your data and your specific match requirements. Understand the source data before you draw a sample from it, and then select a training data set that suits that data.
•For optimal training results, the data set sample must be similar to the production data. You can manually create a subset of production data by filtering on field values. For example, filter on the Country, State, or City code, or on a combination of these.
•Ensure that the training data set contains the attributes that satisfy your matching requirements. For the training to be relevant, the data set sample must contain the same attributes as the production data. For example, if the Phone Number and Email ID attributes are available in the production data, the same attributes must be available in the sample data set that is used for training.
•Ensure that the data sets contain some records that match and some records that do not match so that you have a clear idea of true positives and true negatives when you view the training results.
•For optimal training results, avoid excessive duplicate records in the data set. If more than 25 records share the same name, address, or any other key attribute, the data set contains too many duplicates. Exclude some of those records to reduce the duplicates.
•During the training process, the ML model learns the patterns and characteristics of the data set. Therefore, the training data set must be large enough for the ML model to learn the data patterns that it needs to make match predictions. For example, if you have a complex matching scenario, the ML model might need to use a larger number of match fields to make match predictions effectively. A larger number of match fields, in turn, requires a larger data set.
•A data set must contain a minimum of 10,000 records and fewer than 200,000 records. An optimally sized data set contains 50,000 to 100,000 records.
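Several of the recommendations above can be checked before you submit a sample for training. The following Python sketch is illustrative only and is not part of the product: the function names, the representation of records as dictionaries, and the choice of key attributes are assumptions. It creates a subset by field value, then checks the record count, the presence of the production attributes, and the 25-record duplicate threshold.

```python
from collections import Counter

MIN_RECORDS = 10_000     # documented minimum size
MAX_RECORDS = 200_000    # the data set must contain fewer records than this
MAX_DUPLICATES = 25      # more records than this sharing a key value is too many


def subset_by_field(records, field, allowed_values):
    """Return the records whose `field` value is one of `allowed_values`."""
    allowed = set(allowed_values)
    return [r for r in records if r.get(field) in allowed]


def check_training_sample(records, production_attributes, key_attributes):
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    # Size check against the documented limits.
    n = len(records)
    if n < MIN_RECORDS:
        problems.append(f"too few records: {n} < {MIN_RECORDS}")
    elif n >= MAX_RECORDS:
        problems.append(f"too many records: {n} >= {MAX_RECORDS}")

    # Attribute check: every production attribute must appear in the sample.
    present = set()
    for r in records:
        present.update(r.keys())
    missing = sorted(set(production_attributes) - present)
    if missing:
        problems.append(f"missing attributes: {', '.join(missing)}")

    # Duplicate check, one key attribute at a time (hypothetical interpretation
    # of "the same name, address, or any other key attributes").
    for attr in key_attributes:
        counts = Counter(r[attr] for r in records if r.get(attr) is not None)
        heavy = [value for value, c in counts.items() if c > MAX_DUPLICATES]
        if heavy:
            problems.append(
                f"{attr}: {len(heavy)} value(s) shared by more than "
                f"{MAX_DUPLICATES} records"
            )

    return problems
```

For example, `check_training_sample(subset_by_field(records, "Country", ["US"]), ["Name", "Email ID", "Phone Number"], ["Name"])` returns one message per problem found, which you can review before deciding whether to enlarge the subset or exclude duplicate records. The sketch does not check that the sample contains both matching and non-matching records; that still requires a review of the data.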