Guided Dataset Selection

ASCMO-DYNAMIC allows the user to assign a category to each dataset, i.e., to mark it as a training dataset, validation dataset, or test dataset. With many datasets, making this assignment by hand is difficult. The guided dataset selection feature helps the user choose the dataset categories so that model training produces good models, i.e., models that do not overfit and that generalize well. This is done by ordering the datasets according to their Hausdorff distance, a measure of how far one dataset lies from the others.

For each data point of a dataset, the Euclidean distance to the nearest data point of all other datasets is determined. The Hausdorff distance of the dataset is the maximum of these distances.
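The following minimal Python sketch illustrates this kind of distance. It assumes the standard directed Hausdorff distance and is not the implementation used by ASCMO-DYNAMIC; the function name `directed_hausdorff` and the sample points are hypothetical.

```python
import math

def directed_hausdorff(dataset, others):
    """Largest distance from any point in `dataset` to its nearest point in `others`.

    Illustrative only: both arguments are non-empty lists of equally long tuples.
    """
    return max(
        min(math.dist(p, q) for q in others)  # distance to the nearest other point
        for p in dataset
    )

# Hypothetical data: `b` contains every point of `a` plus one point far away.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0)]

print(directed_hausdorff(a, b))  # 0.0: every point of `a` also occurs in `b`
print(directed_hausdorff(b, a))  # 5.0: (4, 4) lies 5 units from its nearest neighbour (1, 0)
```

In this example, `a` has distance 0 because `b` already covers all of its points, whereas `b` has a large distance because it contains a point that `a` does not cover.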

Guided dataset selection is started with the Data → Guided Dataset Selection menu option in the main window, or with the Guided Dataset Selection button in the "Manage Datasets" window (Data → Manage Datasets).

The procedure first asks for the lookback length. Consecutive data points, up to the lookback length, are considered together when the distances are calculated, so that the gradients of the signals are taken into account.
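One way to picture the effect of the lookback length is to combine each run of consecutive data points into a single longer vector before measuring distances. The sketch below assumes exactly that; the function name `lookback_windows` and the interpretation of the lookback length as a window size are assumptions, not the documented behavior of ASCMO-DYNAMIC.

```python
def lookback_windows(points, lookback):
    """Concatenate each run of `lookback` consecutive points into one vector.

    Hypothetical helper: `points` is a time-ordered list of tuples.
    """
    if lookback < 1:
        raise ValueError("lookback must be at least 1")
    return [
        tuple(value for point in points[i:i + lookback] for value in point)
        for i in range(len(points) - lookback + 1)
    ]

# Two hypothetical signals that visit the same values in opposite order.
ramp_up = [(0.0,), (1.0,), (2.0,)]
ramp_down = [(2.0,), (1.0,), (0.0,)]

print(lookback_windows(ramp_up, 2))    # [(0.0, 1.0), (1.0, 2.0)]
print(lookback_windows(ramp_down, 2))  # [(2.0, 1.0), (1.0, 0.0)]
```

With a lookback of 1, both signals consist of the same three points and would be indistinguishable by distance. With a lookback of 2, the windows differ because they encode whether the signal is rising or falling.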

The datasets are then sorted according to their importance (normalized Hausdorff distance). In the first sorting step, the user decides how many datasets are used as training datasets. A dataset with distance 0 would add no new information to the model training. Some datasets that do contain new information should nevertheless be kept as validation datasets to prevent overfitting. In the second sorting step, the remaining datasets are sorted again, and the same visualization is used to select the validation datasets. All datasets that remain after this step become test datasets.
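The sorting by importance can be sketched as follows. This is a simplified illustration under stated assumptions: each dataset is scored by its directed Hausdorff distance to the union of all other datasets, and the scores are normalized by the largest one. The names `rank_datasets`, `run_a`, `run_b`, and `run_c` are hypothetical, and the actual ranking in ASCMO-DYNAMIC may be computed differently.

```python
import math

def directed_hausdorff(dataset, others):
    """Largest distance from any point in `dataset` to its nearest point in `others`."""
    return max(min(math.dist(p, q) for q in others) for p in dataset)

def rank_datasets(datasets):
    """Return (name, normalized distance) pairs, largest distance first.

    Assumes at least two non-empty datasets.
    """
    raw = {}
    for name, points in datasets.items():
        # Pool the points of every other dataset.
        others = [q for other, pts in datasets.items() if other != name for q in pts]
        raw[name] = directed_hausdorff(points, others)
    largest = max(raw.values()) or 1.0  # avoid dividing by zero if all distances are 0
    scored = [(name, dist / largest) for name, dist in raw.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical datasets with two-dimensional data points.
datasets = {
    "run_a": [(0.0, 0.0), (1.0, 0.0)],
    "run_b": [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0)],
    "run_c": [(1.0, 0.0), (1.0, 2.0)],
}

ranking = rank_datasets(datasets)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Here `run_b` ranks first because it contains the point farthest from all other data, and `run_a` ranks last with a distance of 0 because the other datasets already cover all of its points.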

When the sorting is finished, the new dataset categories are set in the project, and the model should be trained again.