Best Practices and Recommendations

Folder structure for model maintenance

All the models you train for natural language processing and deep learning are saved in the "model" folder (as defined in the configuration file). See chapter Configuration and Operation under the section Configuration of CLAIRE recommendation service.

  • If you train a model using the NLP approach, the generated model is a single file named after the structure system you trained on, with the extension ".sav". For example, if you train on the structure "FashionUnlimited", the generated model will be called "FashionUnlimited.sav".

  • If you train a model using the DL approach, the generated model is a folder named after the structure system you trained for. For example, if you train on the structure "FashionUnlimited", the generated model is the folder "FashionUnlimited" within your model directory.

In addition to the generated models, a JSON file is also created for each initial training. This JSON file contains information about the trained model (structure, accuracy, ML approach) and is used to provide such insights to the user via the Desktop UI (see chapter Configuration and Operation under the section Display available models). The JSON file name has the following pattern:

<Structure_System_name> + "_" + <ML_approach> + "_metrics.json" (e.g. FashionUnlimited_dl_metrics.json)
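
As an illustration, the following Python sketch derives the expected artifact names for a trained structure from this naming convention. The "model" directory path used here is only a placeholder; use the folder configured in your configuration file.

    import os

    def model_artifacts(model_dir, structure, approach):
        """Return the expected model artifact and metrics file paths.

        approach is "nlp" or "dl"; it also appears in the metrics file name,
        e.g. "dl" in FashionUnlimited_dl_metrics.json.
        """
        if approach == "nlp":
            model_path = os.path.join(model_dir, structure + ".sav")  # single file
        else:
            model_path = os.path.join(model_dir, structure)           # folder
        metrics_path = os.path.join(model_dir, f"{structure}_{approach}_metrics.json")
        return model_path, metrics_path

    # Example with a placeholder "model" directory:
    print(model_artifacts("/opt/claire/model", "FashionUnlimited", "dl"))
    # ('/opt/claire/model/FashionUnlimited',
    #  '/opt/claire/model/FashionUnlimited_dl_metrics.json')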

If a model has been trained in a different environment, you can copy it into the "model" folder later on. To display it as an available CLAIRE model on the Desktop UI, you must copy the corresponding generated JSON file as well. For example, if you train a model in a separate environment, copy both the model file(s) and the JSON belonging to it into the "model" directory of your destination system.
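
A minimal sketch of such a transfer, assuming a deep learning model trained on the structure "FashionUnlimited" (the source and destination paths are placeholders):

    import shutil

    src_model_dir = "/path/to/training_env/model"   # placeholder source path
    dst_model_dir = "/path/to/claire/model"         # placeholder destination path

    # A deep learning model is a folder; an NLP model would be a single ".sav"
    # file, copied with shutil.copy2 instead of copytree.
    shutil.copytree(f"{src_model_dir}/FashionUnlimited",
                    f"{dst_model_dir}/FashionUnlimited")

    # Copy the metrics JSON as well, otherwise the model will not be listed
    # as an available CLAIRE model on the Desktop UI.
    shutil.copy2(f"{src_model_dir}/FashionUnlimited_dl_metrics.json", dst_model_dir)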

If you want to delete a trained model from the "model" directory, delete its corresponding JSON file as well, so that the Desktop UI dialog "Available CLAIRE models" does not show outdated information about a deleted model.
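
A corresponding clean-up sketch, again with placeholder paths, removes the model together with its metrics JSON:

    import os
    import shutil

    model_dir = "/path/to/claire/model"             # placeholder path
    structure, approach = "FashionUnlimited", "dl"

    # Remove the model itself: a folder for deep learning, a ".sav" file for NLP.
    model_path = os.path.join(model_dir,
                              structure if approach == "dl" else structure + ".sav")
    if os.path.isdir(model_path):
        shutil.rmtree(model_path)
    elif os.path.isfile(model_path):
        os.remove(model_path)

    # Remove the metrics JSON as well, so the "Available CLAIRE models" dialog
    # does not show outdated information.
    metrics_path = os.path.join(model_dir, f"{structure}_{approach}_metrics.json")
    if os.path.isfile(metrics_path):
        os.remove(metrics_path)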

Recommendations for best results

The accelerator offers two approaches to train models for product classification:

  • Natural language processing: a pipeline based on classic algorithms from the NLP domain. It is very fast to train, but in some cases accuracy can be relatively low. These models can also be quite large; the exact amount of space they need depends on the dataset, but in our tests some of them took as much as 30 GB. This may seem excessive, but in return they should be more accurate than models whose size is artificially limited.

  • Deep learning: the preferred approach, based on state-of-the-art research in deep learning. Such models are highly accurate, but they are also time-consuming to train, especially on big datasets.

See chapter Configuration and Operation under the section Natural language processing and deep learning explained for details. Ordinarily, both approaches require the user to be an expert in machine learning, but we have kept all the details behind the scenes to make the process as simple as possible. We propose the following guidelines to make sure you get the maximum out of this accelerator:

  • Keep the number of products in the training dataset to one million or below. Our system and hardware requirements are tailored to datasets of this size. Also, the models we use do not need extremely big datasets to achieve high accuracy, and going beyond one million is not likely to give a big boost.

  • For deep learning, set the training time to around 10 hours. In our tests, this timeframe resulted in models of good accuracy, while training much longer tended to waste time for little additional gain. Keep in mind that some additional time is required for preprocessing the input data and for calculating performance metrics before the final model is available after training; for a dataset of a million records, this can take up to one hour on top of the time set for training. During training you can check the server window to see KPIs on the model and end the process earlier if you wish.

  • Avoid product categories that contain too few products. The algorithms need a decent number of data points in each category in order to be successful. Before training even starts, we only consider categories that contain at least twenty products; the rest are discarded. This is done because smaller categories will probably not be learned anyway, and at the same time they can harm performance on other categories.

  • Don't use inputs that are longer than 1,000 words. All selected fields are merged into a single text input, which is truncated to 1,000 words. This is done to speed up training without losing accuracy.

  • Try to keep the number of product categories in the hundreds. There is no hard limit on the number of product categories, but we find that a dataset of a million records can represent hundreds of categories quite well, but not thousands. Dealing with thousands of classes or more usually requires much more advanced methods that typically need a lot of manual tuning by a data scientist, which is why we have not considered them further in this accelerator. The data preparation sketch after this list illustrates these dataset guidelines.

  • For both natural language processing and deep learning, use small datasets when applying trained models in batch mode. The actual speed depends on the configuration of the server, but internal tests on reasonable server hardware showed an average throughput of about 600 products/minute. At that rate, for example, scoring 100,000 products takes roughly three hours.
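
The following pandas sketch shows how you might check and prepare a training dataset along the dataset guidelines above (at most one million products, no categories with fewer than twenty products, category count in the hundreds, merged text truncated to 1,000 words). The file name and column names ("category" and the text fields) are assumptions and need to be adapted to your own data; this is a pre-check on your side, not the accelerator's own preprocessing.

    import pandas as pd

    df = pd.read_csv("products.csv")        # placeholder input file
    text_fields = ["name", "description"]   # assumed text columns

    # Guideline: one million products or fewer.
    if len(df) > 1_000_000:
        df = df.sample(n=1_000_000, random_state=0)

    # Guideline: categories with fewer than twenty products are discarded
    # before training anyway, so they can be dropped up front.
    counts = df["category"].value_counts()
    df = df[df["category"].isin(counts[counts >= 20].index)]

    # Guideline: aim for hundreds of categories, not thousands.
    print("Number of categories:", df["category"].nunique())

    # Guideline: selected fields are merged into one text input that is
    # truncated to 1,000 words.
    merged = df[text_fields].fillna("").astype(str).agg(" ".join, axis=1)
    df["text"] = merged.str.split().str[:1000].str.join(" ")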

Model accuracy measurement

Before the training process starts, we split the input data into a training and a validation set (80/20 proportion). The model accuracy we report is measured only on the latter, to emulate how well the resulting model will perform on unseen data.
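
In scikit-learn terms, the split corresponds roughly to the following sketch (the data is illustrative, and this is not the accelerator's actual code):

    from sklearn.model_selection import train_test_split

    # Illustrative data: product texts and their categories.
    texts = ["leather strap watch", "running shoes", "silk scarf", "diver watch"]
    categories = ["watches", "shoes", "accessories", "watches"]

    # 80/20 split: the model is trained on the first part, and the accuracy we
    # report is measured on the held-back validation part only.
    X_train, X_val, y_train, y_val = train_test_split(
        texts, categories, test_size=0.2, random_state=0)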

A good model performs well on two aspects:

  1. If an item belongs to a certain category, the model assigns it to this category (also called recall).

  2. If the model assigns an item to a certain category, the item does indeed belong to it (also called precision).

Both are very important. For instance, if a model simply classifies everything as a watch, it will have high recall on watches, because all watches are correctly identified. However, its precision will be low, as many items from other categories will be incorrectly identified as watches.

This is why we use a measure called F1, which combines both recall and precision (a well-established metric used in many applications of machine learning). We calculate it for every product class in the validation set separately, and then report the average of these scores (the technical term for this is macro-F1). It can take values between zero (bad) and one (good); bigger values represent higher precision and recall across all product categories.
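
As an illustration, macro-F1 can be computed with scikit-learn's standard implementation; the categories and predictions below are made up for the example, and this is not necessarily the accelerator's internal code:

    from sklearn.metrics import f1_score

    # True categories of validation items vs. what a model predicted.
    y_true = ["watch", "watch", "shoe", "bag", "shoe"]
    y_pred = ["watch", "watch", "watch", "bag", "shoe"]

    # F1 is computed per class (here: bag, shoe, watch) and the per-class
    # scores are averaged with equal weight, giving macro-F1.
    print(f1_score(y_true, y_pred, average="macro"))   # approx. 0.82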

A word of caution: it is not classification accuracy and can't be compared with it directly!

Here are some guidelines on how to interpret F1 scores:

  • 0.80-1.00: These values are quite rare and can usually be achieved only with a small number of classes in the model (tens, not hundreds).

  • 0.60-0.80: In our tests, these models gave good accuracy across product categories, but occasionally performed somewhat less well on a small number of categories.

  • 0.40-0.60: These models perform well on the majority of product categories.

  • 0.00-0.40: We would not recommend using such models, as their performance can be unreliable. If you are seeing such scores, try reducing the number of classes and/or supplying more training data.