Using data functionalities#

Running Trainer.load_data or DataModule.load_data will process the dataset in the following order:

  1. Data splitting (training/validation/testing sets): See “Data splitters”

  2. Data imputation: See “Data imputers”

  3. Data augmentation (for features): See “Data derivers”

  4. Data processing (orderless except for data scaling): See “Data processors”

    • Data augmentation (for data points)

    • Data filtering

    • Feature selection

    • Categorical encoding

    • Data scaling

    • etc.

  5. Data augmentation (for features, especially multi-modal features and physics-informed features): See “Data derivers”.

In this part, we introduce the usage of “data splitters”, “data imputers”, “data processors”, and “data derivers”. Implementing new functionalities is covered in “Advanced Usage”.

[1]:
import torch
from tabensemb.trainer import Trainer
from tabensemb.model import *
import tabensemb
import os

prefix = "../../../../"
tabensemb.setting["default_output_path"] = prefix + "output"
tabensemb.setting["default_config_path"] = prefix + "configs"
tabensemb.setting["default_data_path"] = prefix + "data"

trainer = Trainer(device="cpu")

Data splitters#

Data splitters are used to split the whole dataset into training, validation, and testing sets. They inherit tabensemb.data.AbstractSplitter, and implement _split (the main method) and _next_cv (to generate the next fold for a k-fold CV process).

Remark: If AbstractSplitter.support_cv=False, the data splitter does not support k-fold CV.

There are several ways to specify which data splitter to use. The same approaches also apply to other configurations.

  1. Modify the configuration file, configs/sample.py for example:

cfg = {
    "data_splitter": "RandomSplitter",
    # Some other configurations...
}
  2. Use the manual_config argument of Trainer.load_config.

trainer.load_config("sample", manual_config={"data_splitter": "RandomSplitter"})
  3. After Trainer.load_config has been called, use DataModule.set_data_splitter to change the splitter without loading the configuration again.

[2]:
trainer.load_config("sample")
trainer.datamodule.set_data_splitter("RandomSplitter", ratio=[7, 1.5, 1.5])
trainer.load_data()
The project will be saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample
Dataset size: 178 39 39
Data saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample (data.csv and tabular_data.csv).

The ratio can also be given in the configuration file, in manual_config, or in set_data_splitter through the splitter's train_val_test argument:

cfg = {
    # This will overwrite the `split_ratio` configuration.
    "data_splitter": ["RandomSplitter", {"train_val_test": [7, 1.5, 1.5]}],
    # Some other configurations...
}
trainer.load_config("sample", manual_config={"data_splitter": ["RandomSplitter", {"train_val_test": [7, 1.5, 1.5]}]})
trainer.datamodule.set_data_splitter(["RandomSplitter", {"train_val_test": [7, 1.5, 1.5]}])
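
The sizes 178/39/39 reported above for ratio=[7, 1.5, 1.5] on 256 data points are not a plain rounding of 70%/15%/15% (which would give roughly 179/38/38). As a hedged illustration, the sketch below shows one rounding scheme that reproduces them: hold out the testing set first, then hold out the validation set from the remainder, rounding each held-out part up. The helper split_sizes is hypothetical, and the scheme is an assumption for illustration, not a statement of how RandomSplitter is implemented.

```python
import math


def split_sizes(n_samples, ratio):
    """Hypothetical helper: turn a train/val/test ratio into set sizes.

    Assumption: the testing set is held out first, the validation set is
    then held out from the remainder, and each held-out part is rounded up.
    """
    train, val, test = ratio
    n_test = math.ceil(n_samples * test / (train + val + test))
    n_rest = n_samples - n_test
    n_val = math.ceil(n_rest * val / (train + val))
    n_train = n_rest - n_val
    return n_train, n_val, n_test


sizes = split_sizes(256, [7, 1.5, 1.5])
print(sizes)
```

Because only the proportions matter in this sketch, [7, 1.5, 1.5] and [0.7, 0.15, 0.15] would give the same sizes.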

Available data splitters can be seen using:

[3]:
from tabensemb.data.datasplitter import splitter_mapping
splitter_mapping
[3]:
{'AbstractSplitter': tabensemb.data.base.AbstractSplitter,
 'RandomSplitter': tabensemb.data.datasplitter.RandomSplitter}

Data imputers#

Imputation is necessary when NaNs exist in the dataset. tabensemb provides several imputers built on other packages such as miceforest and scikit-learn. The configuration for an imputer contains two parts: the name of the imputer and its arguments. Data imputers can be set in the same ways as data splitters:

  1. Modify the configuration file, configs/sample.py for example:

cfg = {
    "data_imputer": ["MiceImputer", {"max_iter": 10}],
    # "data_imputer": "MiceImputer",  (if no kwargs are given)
    # Some other configurations...
}
  2. Use the manual_config argument of Trainer.load_config.

trainer.load_config("sample", manual_config={"data_imputer": ["MiceImputer", {"max_iter": 10}]})
trainer.load_config("sample", manual_config={"data_imputer": "MiceImputer"})
  3. Use DataModule.set_data_imputer.

[4]:
trainer.load_config("sample")
trainer.datamodule.set_data_imputer(["MiceImputer", {"max_iter": 10}])
trainer.load_data()
The project will be saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I1
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I1 (data.csv and tabular_data.csv).
[5]:
trainer.df.isna().any()
[5]:
cont_0                False
cont_1                False
cont_2                False
cont_3                False
cont_4                False
cont_5                False
cont_6                False
cont_7                False
cont_8                False
cont_9                False
cat_0                 False
cat_1                 False
cat_2                 False
cat_3                 False
cat_4                 False
cat_5                 False
cat_6                 False
cat_7                 False
cat_8                 False
cat_9                 False
target                False
target_binary         False
target_multi_class    False
dtype: bool
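
The imputer configurations above take two forms: a bare name, or a name paired with a dictionary of arguments. The following sketch shows one way such entries could be normalized into a (name, kwargs) pair. The helper normalize_config is hypothetical and is not part of tabensemb; it only illustrates the two accepted shapes.

```python
def normalize_config(entry):
    """Hypothetical helper: return (name, kwargs) for a configuration entry.

    Accepts either a bare name such as "MiceImputer" or a
    [name, kwargs] pair such as ["MiceImputer", {"max_iter": 10}].
    """
    if isinstance(entry, str):
        return entry, {}
    name, kwargs = entry
    return name, dict(kwargs)


print(normalize_config("MiceImputer"))
print(normalize_config(["MiceImputer", {"max_iter": 10}]))
```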

Available data imputers can be seen using:

[6]:
from tabensemb.data.dataimputer import imputer_mapping, get_data_imputer
imputer_mapping
[6]:
{'AbstractImputer': tabensemb.data.base.AbstractImputer,
 'AbstractSklearnImputer': tabensemb.data.base.AbstractSklearnImputer,
 'GainImputer': tabensemb.data.dataimputer.GainImputer,
 'MeanImputer': tabensemb.data.dataimputer.MeanImputer,
 'MedianImputer': tabensemb.data.dataimputer.MedianImputer,
 'MiceImputer': tabensemb.data.dataimputer.MiceImputer,
 'MiceLightgbmImputer': tabensemb.data.dataimputer.MiceLightgbmImputer,
 'MissForestImputer': tabensemb.data.dataimputer.MissForestImputer,
 'ModeImputer': tabensemb.data.dataimputer.ModeImputer}

Arguments can be found in the API docs or in the docstrings:

[7]:
print(get_data_imputer("MeanImputer").__doc__)

    Imputation with average values implemented using sklearn's SimpleImputer.

    Parameters
    ----------
    **kwargs
        Arguments for ``sklearn.impute.SimpleImputer`` (except for ``strategy``)

Data processors#

As listed in Step 4 above, data processing includes filtering, augmentation, feature selection, and much more. tabensemb provides a unified framework for implementing various data processing steps. The imputation and processing procedure resembles the Pipeline structure in sklearn, but it is also integrated with the other two modules introduced in this part (data splitters and data derivers). Together, the four modules automatically carry out all data preparation before training.

Configuration for a processor also contains two parts: the name of the processor and arguments. Here we provide several examples:

  • CategoricalOrdinalEncoder: similar to the OrdinalEncoder in sklearn

  • NaNFeatureRemover: remove features that are all NaNs

  • VarianceFeatureSelector: same as the VarianceThreshold from sklearn

  • FeatureValueSelector: select data points that have a certain value of a feature

  • CorrFeatureSelector: remove highly correlated features

  • IQRRemover: remove outliers found by the 1.5*IQR criterion

  • StdRemover: remove outliers found by the 3*std criterion

  • SampleDataAugmenter: just an example to show the data augmentation capability (it copies the last two data points in the validation set)

  • StandardScaler: same as the StandardScaler from sklearn
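
To make the two outlier criteria in the list concrete, here is a minimal one-dimensional sketch using only the standard library. It is an illustration under stated assumptions, not the code of IQRRemover or StdRemover: the function names are hypothetical, quartiles use linear interpolation, and the standard deviation is the population one. The library may use different conventions and works on whole data frames.

```python
import statistics


def iqr_keep(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]


def std_keep(values, k=3.0):
    """Keep values within k standard deviations of the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) <= k * std]


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(iqr_keep(data))  # 100 falls outside the IQR fences
print(std_keep(data))  # 100 stays: it inflates the standard deviation itself
```

The example also shows why the two criteria can disagree: a single extreme value enlarges the standard deviation enough to pass the 3*std test, while the quartile-based fences are barely affected by it.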

Remark: Data scalers such as StandardScaler must be the last data processor.

Remark: Data points derived from data augmenters (inheriting AbstractAugmenter) will be added to the training set.

[8]:
processor_configs = [
    ["CategoricalOrdinalEncoder", {}],
    ["NaNFeatureRemover", {}],
    ["VarianceFeatureSelector", {"thres": 0.1}],
    ["FeatureValueSelector", {"feature": "cat_1", "value": 0}],
    ["CorrFeatureSelector", {"thres": 0.1}],
    ["IQRRemover", {}],
    ["StdRemover", {}],
    ["SampleDataAugmenter", {}],
    ["StandardScaler", {}],
]
  1. Modify the configuration file:

cfg = {
    "data_processors": processor_configs,
    # Some other configurations...
}
  2. Use the manual_config argument of Trainer.load_config.

trainer.load_config("sample", manual_config={"data_processors": processor_configs})
  3. Use DataModule.set_data_processors.

[9]:
import warnings
import numba
trainer.load_config("sample")
trainer.datamodule.set_data_processors(processor_configs)
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=numba.NumbaDeprecationWarning)
    trainer.load_data()
The project will be saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I2
Correlated features (Ranked by SHAP):
{
        'cont_2': 13.650428051938668,
        'cont_1': 8.98106859262871
}
1 features removed: ['cont_1']. 7 features retained: ['cont_0', 'cont_3', 'cont_4', 'cont_2', 'cat_0', 'cat_1', 'cat_2'].
Removing outliers by IQR. Original size: 36, Final size: 36.
Removing outliers by std. Original size: 36, Final size: 36.
Dataset size: 27 11 12
Data saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I2 (data.csv and tabular_data.csv).
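
The requirement that a scaler comes last can be checked on a processor list before loading data. The sketch below is a hypothetical pre-check, not a tabensemb function, and it assumes that StandardScaler is the only scaler name in use; extend scaler_names if other scalers are configured.

```python
def scaler_is_last(processor_configs, scaler_names=("StandardScaler",)):
    """Hypothetical check: True if no scaler appears before the final entry."""
    names = [name for name, _kwargs in processor_configs]
    return all(name not in scaler_names for name in names[:-1])


good = [["IQRRemover", {}], ["StandardScaler", {}]]
bad = [["StandardScaler", {}], ["IQRRemover", {}]]
print(scaler_is_last(good), scaler_is_last(bad))
```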

Let’s check the effect of these processors. Categorical features are encoded by CategoricalOrdinalEncoder:

[10]:
trainer.datamodule.categorical_data.head()
[10]:
   cat_0  cat_1  cat_2
0      3      0      2
1      3      0      1
2      3      0      4
3      0      0      0
4      4      0      2

The original categorical features can be recovered using:

[11]:
trainer.datamodule.categories_inverse_transform(trainer.datamodule.categorical_data).head()
[11]:
        cat_0  cat_1  cat_2
0  category_3      0      2
1  category_3      0      1
2  category_3      0      4
3  category_0      0      0
4  category_4      0      2

One feature is removed by CorrFeatureSelector. It removes the feature with the lowest feature importance (ranked using shap in the example) in the correlation chain.

[12]:
trainer.cont_feature_names
[12]:
['cont_0', 'cont_2', 'cont_3', 'cont_4']
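
The selection rule can be illustrated with the importance values printed in the log above. This is only a sketch of the final step (choosing which feature to drop from a group of correlated features). How the group is built and how the importances are computed is left to the library, and the function name is hypothetical.

```python
def drop_least_important(correlated, importance):
    """Return the feature to drop from a group of correlated features."""
    return min(correlated, key=lambda name: importance[name])


# Importance values copied from the log output of the data loading step.
importance = {"cont_2": 13.650428051938668, "cont_1": 8.98106859262871}
dropped = drop_least_important(["cont_2", "cont_1"], importance)
print(dropped)
```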

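FeatureValueSelector keeps only the data points with the requested cat_1 value. IQRRemover and StdRemover drop outliers, although the log above shows that neither of them removed any data point in this run (size 36 before and after). The original indices of all removed data points can be seen using: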

[13]:
trainer.datamodule.dropped_indices
[13]:
array([  0,   1,   2,   3,   4,   5,   8,   9,  10,  11,  12,  13,  14,
        15,  18,  19,  20,  22,  24,  25,  26,  27,  28,  29,  30,  31,
        32,  35,  37,  38,  39,  40,  41,  42,  43,  45,  46,  48,  49,
        50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,
        63,  64,  65,  66,  67,  68,  69,  70,  72,  74,  75,  76,  77,
        80,  81,  82,  83,  85,  86,  89,  90,  93,  94,  96,  97,  98,
        99, 100, 102, 103, 104, 105, 106, 107, 109, 110, 111, 112, 113,
       114, 115, 116, 118, 119, 120, 121, 123, 124, 125, 127, 130, 131,
       132, 133, 135, 136, 137, 139, 140, 141, 142, 143, 144, 145, 146,
       147, 148, 150, 151, 152, 153, 157, 158, 159, 161, 163, 165, 166,
       167, 168, 170, 172, 173, 175, 176, 177, 178, 179, 180, 182, 183,
       184, 185, 186, 187, 188, 189, 190, 192, 193, 194, 195, 196, 197,
       199, 200, 201, 202, 203, 207, 208, 209, 210, 211, 212, 213, 214,
       217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229,
       230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242,
       243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255])

The SampleDataAugmenter copies the last two data points in the validation set as a showcase. DataModule.augmented_indices holds the indices of these data points before the data points in DataModule.dropped_indices are dropped. The augmented data points can be seen using:

[14]:
trainer.df.loc[trainer.datamodule.augmented_indices-len(trainer.datamodule.dropped_indices), :]
[14]:
cont_0 cont_1 cont_2 cont_3 cont_4 cont_5 cont_6 cont_7 cont_8 cont_9 ... cat_3 cat_4 cat_5 cat_6 cat_7 cat_8 cat_9 target target_binary target_multi_class
48 -0.505358 -0.104343 -0.507518 -0.988002 -0.815792 -1.284552 -1.051880 0.564009 2.49720 -2.245322 ... 4 4 category_4 3 1 1 2 -246.101543 1 3
49 -2.115056 0.138315 1.618054 0.541008 1.405365 -1.449118 -0.824409 -0.813794 0.42258 0.547481 ... 0 0 category_2 4 3 3 1 -156.813059 0 3

2 rows × 23 columns

[15]:
trainer.df.loc[trainer.datamodule.val_indices[-2:], :]
[15]:
cont_0 cont_1 cont_2 cont_3 cont_4 cont_5 cont_6 cont_7 cont_8 cont_9 ... cat_3 cat_4 cat_5 cat_6 cat_7 cat_8 cat_9 target target_binary target_multi_class
16 -0.505358 -0.104343 -0.507518 -0.988002 -0.815792 -1.284552 -1.051880 0.564009 2.49720 -2.245322 ... 4 4 category_4 3 1 1 2 -246.101543 1 3
0 -2.115056 0.138315 1.618054 0.541008 1.405365 -1.449118 -0.824409 -0.813794 0.42258 0.547481 ... 0 0 category_2 4 3 3 1 -156.813059 0 3

2 rows × 23 columns

Finally, StandardScaler scales the dataset. DataModule.df is the unscaled data frame, and DataModule.scaled_df is the scaled one.

[16]:
trainer.datamodule.df[trainer.cont_feature_names].describe()
[16]:
          cont_0     cont_2     cont_3     cont_4
count  50.000000  50.000000  50.000000  50.000000
mean   -0.153242  -0.086822   0.083808   0.194227
std     0.961960   0.947694   0.762640   1.026297
min    -2.115056  -1.945703  -1.665956  -1.884586
25%    -0.666141  -0.809715  -0.437357  -0.697263
50%    -0.087749   0.082401  -0.059459   0.246986
75%     0.329382   0.765257   0.541008   1.110425
max     2.929096   1.618054   1.576299   2.285601
[17]:
trainer.datamodule.scaled_df[trainer.cont_feature_names].describe()
[17]:
          cont_0     cont_2     cont_3     cont_4
count  50.000000  50.000000  50.000000  50.000000
mean    0.069544   0.049781   0.002835   0.034550
std     1.072721   0.975998   0.971265   1.002871
min    -2.118156  -1.864617  -2.225588  -1.996813
25%    -0.502411  -0.694703  -0.660898  -0.836591
50%     0.142578   0.224057  -0.179623   0.086104
75%     0.607738   0.927307   0.585106   0.929836
max     3.506786   1.805574   1.903607   2.078188

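Remark: All modules are fitted on the training and validation sets and are then applied to the testing set. This is why the scaled features have zero mean on the training and validation sets below, but not on the whole dataset above.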

[18]:
import numpy as np
trainer.datamodule.scaled_df.loc[np.append(trainer.train_indices, trainer.val_indices), trainer.cont_feature_names].describe()
[18]:
             cont_0        cont_2        cont_3        cont_4
count  3.800000e+01  3.800000e+01  3.800000e+01  3.800000e+01
mean   4.674623e-17  5.843279e-18 -4.090295e-17 -2.921640e-17
std    1.013423e+00  1.013423e+00  1.013423e+00  1.013423e+00
min   -2.118156e+00 -1.864617e+00 -2.225588e+00 -1.996813e+00
25%   -5.160307e-01 -6.772407e-01 -7.744541e-01 -8.633614e-01
50%    1.425780e-01  2.926057e-02 -5.491499e-02  6.457453e-02
75%    6.018802e-01  8.600076e-01  5.851057e-01  9.298360e-01
max    1.814086e+00  1.805574e+00  1.903607e+00  2.005587e+00
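
The std row above reads 1.013423 rather than exactly 1. This is expected: sklearn's StandardScaler divides by the population standard deviation (denominator n), while pandas' describe reports the sample standard deviation (denominator n - 1), so the reported value is sqrt(n / (n - 1)) with n = 38. The sketch below reproduces this with the standard library on synthetic data; the random values are placeholders, not the sample dataset.

```python
import math
import random
import statistics

random.seed(0)
n = 38  # size of the training + validation partition in the output above
raw = [random.gauss(5.0, 2.0) for _ in range(n)]

# Standardize with the population standard deviation (denominator n).
mean = statistics.fmean(raw)
pop_std = statistics.pstdev(raw)
scaled = [(x - mean) / pop_std for x in raw]

# Measure the result with the sample standard deviation (denominator n - 1).
sample_std = statistics.stdev(scaled)
expected = math.sqrt(n / (n - 1))
print(round(sample_std, 6), round(expected, 6))
```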

Data derivers#

Existing features in the dataset may not be sufficient to represent the underlying relations between features and the target. Deriving additional features from existing ones, which may correlate strongly with the target, can help. Data derivers can be used to add continuous features (stacked in the tabular dataset, Step 3 above) or multi-modal features (unstacked, Step 5 above).

Derivers are configured in the same way as data processors. Necessary arguments shared by all derivers are:

  • stacked: Should the derived feature stack in the processed DataFrame?

  • intermediate: Is the derived stacked feature excluded from continuous features?

  • derived_name: What is the name of the feature?

Here we give three examples:

  • RelativeDeriver calculates the result of dividing absolute_col by relative2_col;

  • SampleWeightDeriver calculates the degree to which a data point is an outlier (it is just an example and there isn’t detailed research on it);

  • UnscaledDataDeriver records all continuous features before scaling (standard scaling by default).

[19]:
deriver_configs = [
    ("RelativeDeriver", {
        "stacked": True,
        "absolute_col": "cont_0",
        "relative2_col": "cont_1",
        "intermediate": False,
        "derived_name": "derived_cont",
    }),
    ("SampleWeightDeriver", {
        "stacked": True,
        "intermediate": True,
        "derived_name": "sample_weight",
    }),
    ("UnscaledDataDeriver", {"derived_name": "unscaled", "stacked": False}),
]
  1. Modify the configuration file:

cfg = {
    "data_derivers": deriver_configs,
    # Some other configurations...
}
  2. Use the manual_config argument of Trainer.load_config.

trainer.load_config("sample", manual_config={"data_derivers": deriver_configs})
  3. Use DataModule.set_data_derivers.

[20]:
trainer.load_config("sample")
trainer.datamodule.set_data_derivers(deriver_configs)
trainer.load_data()
The project will be saved to ../../../../output/sample/2023-09-23-20-36-53-0_sample
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-09-23-20-36-53-0_sample (data.csv and tabular_data.csv).

Two stacked features can be found in Trainer.df or Trainer.datamodule.df. derived_cont is a continuous feature because intermediate=False, but sample_weight is not.

[21]:
trainer.df[["derived_cont", "sample_weight"]]
[21]:
     derived_cont  sample_weight
0        5.884222       1.045746
1       -9.058123       1.063506
2       -3.650394       0.959582
3        1.678893       0.974096
4       -4.099185       1.000761
..            ...            ...
251     -1.355422       0.958380
252      1.088160       0.978138
253      0.374183       0.969419
254      1.199032       0.967882
255     -1.521539       0.979158

256 rows × 2 columns

[22]:
"derived_cont" in trainer.cont_feature_names, "sample_weight" in trainer.cont_feature_names
[22]:
(True, False)

The unstacked feature unscaled can be found in Trainer.derived_data:

[23]:
trainer.derived_data["unscaled"]
[23]:
array([[-1.30652685, -0.22203901, -0.11816405, -0.15957344,  1.65813068,
         5.88422203],
       [ 2.01125668, -0.22203901,  0.1950697 ,  0.52700421, -0.04459543,
        -9.05812296],
       [-1.21607658,  0.33313567, -0.74367217,  0.73018353,  0.1406719 ,
        -3.65039438],
       ...,
       [-0.0698565 , -0.18669092, -1.02191332, -1.1436414 ,  0.25011391,
         0.37418262],
       [-1.03148246, -0.86026245, -0.06163805,  0.3283013 , -1.42999126,
         1.19903229],
       [-1.46173269,  0.9606934 ,  0.3675449 ,  1.32906285, -0.68343977,
        -1.52153922]])
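
The printed array offers a quick sanity check of RelativeDeriver. Assuming that the first two columns are cont_0 and cont_1 and that the last column is derived_cont (an assumption about column order, consistent with the derived_cont values shown earlier), the last entry of each row should equal the first divided by the second. The sketch below checks this on the first three rows, with values copied from the output above.

```python
# (cont_0, cont_1, derived_cont) copied from the first three printed rows.
rows = [
    (-1.30652685, -0.22203901, 5.88422203),
    (2.01125668, -0.22203901, -9.05812296),
    (-1.21607658, 0.33313567, -3.65039438),
]

# RelativeDeriver divides absolute_col (cont_0) by relative2_col (cont_1).
ratios = [absolute / relative_to for absolute, relative_to, _ in rows]

for ratio, (_, _, derived) in zip(ratios, rows):
    # The printed values are rounded, so compare with a small tolerance.
    print(round(ratio, 5), round(derived, 5), abs(ratio - derived) < 1e-5)
```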

Available data derivers can be seen using:

[24]:
from tabensemb.data.dataderiver import deriver_mapping, get_data_deriver
deriver_mapping
[24]:
{'AbstractDeriver': tabensemb.data.base.AbstractDeriver,
 'RelativeDeriver': tabensemb.data.dataderiver.RelativeDeriver,
 'SampleWeightDeriver': tabensemb.data.dataderiver.SampleWeightDeriver,
 'UnscaledDataDeriver': tabensemb.data.dataderiver.UnscaledDataDeriver}

Arguments can be found in API docs or in the docstring.

[25]:
print(get_data_deriver("RelativeDeriver").__doc__)

    Dividing a feature by another to derive a new feature. Required arguments are:

    absolute_col: str
        The feature that needs to be divided.
    relative2_col: str
        The feature that acts as the denominator.

Access the processed dataset#

All these data can be found in the DataModule instance in the trainer, along with many modified data structures for further usage:

  • Continuous features

    • DataModule.feature_data: scaled

    • DataModule.unscaled_feature_data: not scaled

    • DataModule.X_train/X_val/X_test[trainer.cont_feature_names]: scaled and divided into three partitions

    • DataModule.tensors[0]: scaled and transformed into torch.Tensor.

  • Categorical features

    • DataModule.categorical_data: ordinal-encoded

    • DataModule.X_train/X_val/X_test[trainer.cat_feature_names]: ordinal-encoded and divided into three partitions

    • DataModule.derived_data["categorical"]: ordinal-encoded

    • trainer.datamodule.tensors[list(trainer.datamodule.derived_data.keys()).index("categorical")+1]: ordinal-encoded and transformed into torch.Tensor.

  • Derived unstacked features

    • DataModule.derived_data: includes unstacked features, categorical features, and a flag for each data point indicating whether it is an augmented one.

    • DataModule.tensors[1:-1]: same as DataModule.derived_data, but are torch.Tensors.
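
The index expression used above for the categorical tensor follows from the layout of DataModule.tensors: continuous features first, then one tensor per entry of derived_data in dictionary order. The sketch below mimics that layout with plain Python stand-ins. The strings are placeholders for arrays and tensors, the key "augmented" is a hypothetical name, and treating the last element as the target is an assumption suggested by the [1:-1] slice.

```python
# Stand-ins for DataModule.derived_data; real values would be arrays.
derived_data = {
    "unscaled": "unscaled_array",
    "categorical": "categorical_array",
    "augmented": "augmented_flags",  # hypothetical key name
}

# Assumed layout: continuous features, derived data in dict order, target.
tensors = ["continuous_features", *derived_data.values(), "target"]

# Same index arithmetic as in the bullet list above: +1 skips tensors[0].
categorical_index = list(derived_data.keys()).index("categorical") + 1
print(categorical_index, tensors[categorical_index])
print(tensors[1:-1] == list(derived_data.values()))
```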

Remark: Currently, derived unstacked features are not used by the supported external model bases. However, they can be accessed easily using the approaches above, and even more easily from a customized PyTorch-based model base built on TorchModel, which is introduced in the “Advanced Usage” sections.

Remark: Stacked (continuous) derived features are derived after imputation but before data processing. These features will also be imputed. Unstacked derived features are derived after all other steps are finished.