Computation of import profile matching rates

If one or more files are added in the New Import Dialog, each import profile gets a so-called matching rate, which is shown in the second column of the import profile view.

How the matching rate is computed depends on the type of import files (CSV, Excel or XML).

Table-based file (CSV or Excel)

mapped = Number of columns which are mapped and which occur in the input file as well as in the import profile

additionallyInProfile = Number of columns which are mapped and which occur only in the import profile

matchingRate = mapped / ( mapped + additionallyInProfile )

Example:

Input file contains columns A, B, C and D

Import profile refers to columns A, B and E; all three columns are mapped

matchingRate = 2 / ( 2 + 1 ) = 2 / 3 ≈ 0.67
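The formula for table-based files can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name table_matching_rate is hypothetical, column names are compared as plain strings, and returning 0.0 for a profile without mapped columns is an assumption, not documented behavior.

```python
def table_matching_rate(file_columns, mapped_profile_columns):
    """Matching rate for one table-based file (CSV or Excel)."""
    file_cols = set(file_columns)
    profile_cols = set(mapped_profile_columns)

    # Mapped columns that occur in the input file and in the import profile.
    mapped = len(profile_cols & file_cols)
    # Mapped columns that occur only in the import profile.
    additionally_in_profile = len(profile_cols - file_cols)

    total = mapped + additionally_in_profile
    # Assumption: a profile without mapped columns gets a rate of 0.
    return mapped / total if total else 0.0


# Values from the example: file has A, B, C, D; profile maps A, B, E.
rate = table_matching_rate(["A", "B", "C", "D"], ["A", "B", "E"])
print(f"{rate:.4f}")  # 0.6667
```

Columns that occur only in the input file (C and D here) do not influence the result.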

XML file

The structure of an XML file can only be determined by parsing the whole file. Because a full parse of a large file might take too long to give the user feedback within an acceptable time, only the beginning of a large file is parsed: the parsing time is limited to at most 500 ms. When tags are compared, not only the tag itself is considered but also its parent tags.
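A time-limited parse of this kind could look roughly like the following Python sketch. It is not the product's implementation: the function name collect_tags_within is hypothetical, the sketch uses the standard library's xml.etree.ElementTree.iterparse, and it records each tag together with its direct parent only (for example "A/B").

```python
import io
import time
import xml.etree.ElementTree as ET


def collect_tags_within(stream, budget_seconds=0.5):
    """Collect tag keys from the start of an XML stream until the time budget is used up."""
    deadline = time.monotonic() + budget_seconds
    keys = set()
    path = []  # tags of the currently open elements, outermost first

    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            # The root is stored as "A", every other tag as "parent/tag".
            keys.add(f"{path[-1]}/{elem.tag}" if path else elem.tag)
            path.append(elem.tag)
        else:
            path.pop()
            elem.clear()  # release content that is no longer needed
        if time.monotonic() > deadline:
            break  # budget exhausted: keep what was seen so far
    return keys


sample = io.BytesIO(b"<A><B>Content1</B><C>Content2</C><D><B>Content3</B></D></A>")
print(sorted(collect_tags_within(sample)))
```

For a small file the whole structure is collected; for a large file the result only covers the part that was read before the deadline.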

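Because the full structure might not be available, a different algorithm is used for XML files so that the matching rate remains meaningful. Unlike for table-based files, tags that occur only in the import profile are not counted; tags that occur only in the file are.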

matching = Number of tags which occur in the file and in the import profile.

additionallyInFile = Number of tags which occur only in the input XML file

matchingRate = matching / ( matching + additionallyInFile )

Example:

XML file looks like

    <A>
       <B>Content1</B>
       <C>Content2</C>
       <D>
         <B>Content3</B> 
       </D>
    </A>

Import profile refers to tags A, A/B and A/C (A/B means tag B with parent A)

matchingRate = 3 / ( 3 + 2 ) = 3 / 5 = 0.6 (matching = { A, A/B, A/C }, additionallyInFile = { A/D, D/B })
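The XML formula can be illustrated with a short Python sketch. The names tag_keys, xml_matching_rate, SAMPLE_XML and PROFILE_TAGS are hypothetical, the profile set used here (A, A/B, A/C) is an assumption chosen for illustration, and each tag is identified by its direct parent only; the sketch parses the whole sample rather than a time-limited prefix.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<A>
   <B>Content1</B>
   <C>Content2</C>
   <D>
     <B>Content3</B>
   </D>
</A>
"""

# Assumption for this sketch: the profile refers to these three tags.
PROFILE_TAGS = {"A", "A/B", "A/C"}


def tag_keys(xml_text):
    """Return every tag as 'parent/tag'; the root tag stands alone."""
    root = ET.fromstring(xml_text)
    keys = {root.tag}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent:
            keys.add(f"{parent.tag}/{child.tag}")
            stack.append(child)
    return keys


def xml_matching_rate(file_tags, profile_tags):
    matching = len(file_tags & profile_tags)              # in file and profile
    additionally_in_file = len(file_tags - profile_tags)  # only in the file
    total = matching + additionally_in_file
    # Assumption: a file without any tags gets a rate of 0.
    return matching / total if total else 0.0


file_tags = tag_keys(SAMPLE_XML)
print(sorted(file_tags))
print(xml_matching_rate(file_tags, PROFILE_TAGS))
```

Tags that occur only in the profile do not lower the rate in this sketch, in line with the formula above.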

Multiple table-based files (CSV or Excel)

The matching rate is computed in the same way as for a single file. The number of columns is the sum of the numbers of columns of the individual files.

Performance of matcher algorithm / Large number of mappings

At server start, a fingerprint is created for each available mapping. The best-matches algorithm later compares these fingerprints, which increases its performance enormously. However, creating the fingerprints makes the server start take longer. The table below shows how long the creation of the fingerprints took on our test system (Intel(R) Core(TM) i7-2620M CPU, 2 × 2.70 GHz, 8.00 GB RAM, Windows Professional 64-bit). For a more realistic scenario, the different mapping types of the benchmark (simple, normal, complex) are created randomly.

Number of Mappings	500	1000	1500	2000	2500	3000	10000
Duration in seconds	30.02	61.62	84.62	112.11	141.55	160.65	532.41

The results show that the test system creates roughly 16 to 19 fingerprints per second, about 18 on average.
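The throughput per run can be recomputed from the table with a few lines of Python. The variable and function names are chosen for this sketch only; the numbers are copied from the table above.

```python
# Measurements copied from the table above.
mappings = [500, 1000, 1500, 2000, 2500, 3000, 10000]
seconds = [30.02, 61.62, 84.62, 112.11, 141.55, 160.65, 532.41]


def fingerprints_per_second(counts, durations):
    """Throughput of each benchmark run: mappings divided by duration."""
    return [n / t for n, t in zip(counts, durations)]


rates = fingerprints_per_second(mappings, seconds)
for n, r in zip(mappings, rates):
    print(f"{n:>6} mappings: {r:5.2f} fingerprints/s")

# Overall throughput across all runs.
print(f"overall: {sum(mappings) / sum(seconds):.2f} fingerprints/s")
```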

UI Impacts

A large number of mappings also affects the UI. On our test system, working with 500 mappings went smoothly. With more mappings, we noticed a performance lag when changing all mappings at once (e.g. changing the category). However, we do not expect any customer to have this many mappings, and it is quite unrealistic that all of them would be changed at once.