Best Practices and Recommendations

Folder Structure for Model Maintenance

In each of the "data" and "model" folders configured under the "[ATTRIBUTE EXTRACTION]" section of the config.ini file (located in the server.ai folder), a folder named "attribute_extraction" is created to store related files.
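As a point of reference, the relevant configuration section might look like the following. The key names shown here are illustrative assumptions; consult your installation's config.ini for the exact keys.

```ini
[ATTRIBUTE EXTRACTION]
; Key names below are illustrative -- check your own config.ini
data = /opt/server.ai/data
model = /opt/server.ai/model
```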

When training is initiated, the following sub-folders are created:

  • data/attribute_extraction/<model_extraction>/training/<language>

    • Example: data/attribute_extraction/S1/training/en

  • model/attribute_extraction/<model_extraction>/training/<language>

    • Example: model/attribute_extraction/S1/training/en

where <model_extraction> is either S1 or S2 (specified in the configuration), and <language> is the parameter specifying the language code sent to the extraction service. After the training is completed, a JSON file containing information about the trained model (accuracy, algorithm, and language) is created inside the "model/attribute_extraction" folder. The name of the JSON file follows this pattern:

<model_extraction> + "_" + <language> + "_metrics.json" (e.g., S1_en_metrics.json)
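The naming pattern and metrics content can be sketched as follows. The helper function and the example metrics dictionary are illustrative; the exact JSON keys may differ between releases, though the file is documented to contain the accuracy, algorithm, and language.

```python
def metrics_filename(model_extraction: str, language: str) -> str:
    """Build the metrics file name per the documented pattern,
    e.g. metrics_filename("S1", "en") -> "S1_en_metrics.json"."""
    return f"{model_extraction}_{language}_metrics.json"

# Illustrative metrics content -- field names assumed from the
# description above (accuracy, algorithm, language).
example_metrics = {"accuracy": 0.91, "algorithm": "S1", "language": "en"}

print(metrics_filename("S1", "en"))  # S1_en_metrics.json
```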

After the training has been completed, the trained model is saved in the folder "model/attribute_extraction/<model_extraction>/<language>". For each extraction model (e.g., S1) and language (e.g., en) combination, there is only one JSON file. If you train a new model on a new dataset for the same combination, the newly trained model and its JSON file are saved only if the new model's accuracy is higher than that of the existing trained model.

Recommendations for Best Results

For brand extraction, we have developed two solutions named S1 and S2. Model S1 (the default) is faster to train, while model S2 is slower to train but often results in slightly higher accuracy. We have tested our solutions on various datasets, and based on our experimental results, we recommend the following:

  • It is often better to use a larger training set, especially when the data is very heterogeneous (e.g., brands can appear anywhere in the texts, or the lengths of the texts vary widely). Therefore, we recommend using as much data as possible for training.

  • If some product titles or descriptions do not carry brand names, include them in the training data as well. This teaches the models that the texts do not always contain brands.

  • The accuracy tends to be higher when the text from which the brand is extracted is shorter. Our solution handles texts of at most 1,000,000 characters; longer texts may cause memory allocation issues.

  • We recommend using the S1 model for training as it is faster to train than S2. If the accuracy of the trained model is low (i.e., below 80%), consider switching to S2 for better accuracy. Please note that the training time for S2 can be significantly longer than that of S1 (days vs. hours). In one of our experiments, we trained an S1 model and an S2 model on a standard laptop CPU (MacBook Pro, 2.3 GHz 8-Core Intel Core i9) using a dataset of 10K records with an average text length of roughly 200 words. The training time for model S1 was about 3 hours, while that for model S2 was almost 3 days. When we evaluated the trained models on a test dataset of 7K records, S2 gave about a 4% improvement in accuracy over S1. The time it takes to train a model depends heavily on the size of the training data, the length of the texts from which the brands are extracted, the consistency of the data (e.g., whether brands appear mostly at the beginning of the texts or at varying positions), and the computing resources available.
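The 1,000,000-character limit mentioned above can be guarded against on the client side before sending texts to the extraction service. The helper below is a hypothetical sketch; truncation is one possible policy, and raising an error instead would be equally valid.

```python
# Documented upper bound on input text length for the extraction service.
MAX_CHARS = 1_000_000

def check_text_length(text: str) -> str:
    """Truncate texts that exceed the supported maximum so they do not
    trigger memory allocation issues in the service. Truncation is an
    illustrative choice; raising an exception is another option."""
    if len(text) > MAX_CHARS:
        return text[:MAX_CHARS]
    return text
```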

We have extensively tested our solutions on English, German, Swedish, and Finnish datasets; however, we also support other languages, such as Dutch, French, Italian, Norwegian, Portuguese, and Spanish.