Using data functionalities#
Running Trainer.load_data or DataModule.load_data will process the dataset in the following order:
1. Data splitting (training/validation/testing sets): See “Data splitters”
2. Data imputation: See “Data imputers”
3. Data augmentation (for features): See “Data derivers”
4. Data processing (orderless except for data scaling): See “Data processors”
   - Data augmentation (for data points)
   - Data filtering
   - Feature selection
   - Categorical encoding
   - Data scaling
   - etc.
5. Data augmentation (for features, especially multi-modal features and physics-informed features): See “Data derivers”.
In this part, we will introduce the usage of “data splitters”, “data imputers”, “data processors”, and “data derivers”. Implementing new functionalities is left as a section in “Advanced Usage”.
[1]:
import torch
from tabensemb.trainer import Trainer
from tabensemb.model import *
import tabensemb
import os
prefix = "../../../../"
tabensemb.setting["default_output_path"] = prefix + "output"
tabensemb.setting["default_config_path"] = prefix + "configs"
tabensemb.setting["default_data_path"] = prefix + "data"
trainer = Trainer(device="cpu")
Data splitters#
Data splitters are used to split the whole dataset into training, validation, and testing sets. They inherit tabensemb.data.AbstractSplitter, and implement _split (the main method) and _next_cv (to generate the next fold for a k-fold CV process).
Remark: If AbstractSplitter.support_cv=False, the data splitter does not support k-fold CV.
There are several ways to specify which data splitter is used. The same approaches also apply to other configurations.
Modify the configuration file, configs/sample.py for example:
cfg = {
"data_splitter": "RandomSplitter",
# Some other configurations...
}
Use the manual_config argument of Trainer.load_config:
trainer.load_config("sample", manual_config={"data_splitter": "RandomSplitter"})
After Trainer.load_config has been called, and if one does not want to call it again, use DataModule.set_data_splitter:
[2]:
trainer.load_config("sample")
trainer.datamodule.set_data_splitter("RandomSplitter", ratio=[7, 1.5, 1.5])
trainer.load_data()
The project will be saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample
Dataset size: 178 39 39
Data saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample (data.csv and tabular_data.csv).
The ratio can also be given through the train_val_test argument of the splitter, in the configuration file, manual_config, or set_data_splitter:
cfg = {
# This will overwrite the `split_ratio` configuration.
"data_splitter": ["RandomSplitter", {"train_val_test": [7, 1.5, 1.5]}],
# Some other configurations...
}
trainer.load_config("sample", manual_config={"data_splitter": ["RandomSplitter", {"train_val_test": [7, 1.5, 1.5]}]})
trainer.datamodule.set_data_splitter(["RandomSplitter", {"train_val_test": [7, 1.5, 1.5]}])
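As a rough illustration of what a ratio such as [7, 1.5, 1.5] means, the sketch below normalizes the ratio to fractions and shuffles indices into three disjoint sets. It uses only the standard library; `random_split` is a hypothetical helper, not part of tabensemb, and RandomSplitter may round the partition sizes differently (the run above reported 178/39/39).

```python
import random


def random_split(n_samples, ratio, seed=0):
    """Hypothetical helper: shuffle indices and cut them by a train/val/test ratio."""
    total = sum(ratio)
    fractions = [r / total for r in ratio]  # [7, 1.5, 1.5] -> [0.7, 0.15, 0.15]
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * fractions[0])
    n_val = round(n_samples * fractions[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the testing set
    return train, val, test


train, val, test = random_split(256, [7, 1.5, 1.5])
print(len(train), len(val), len(test))
```

Assigning the remainder to the testing set guarantees that every index lands in exactly one partition, whatever the rounding.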
Available data splitters can be seen using:
[3]:
from tabensemb.data.datasplitter import splitter_mapping
splitter_mapping
[3]:
{'AbstractSplitter': tabensemb.data.base.AbstractSplitter,
'RandomSplitter': tabensemb.data.datasplitter.RandomSplitter}
Data imputers#
Imputation is necessary when NaNs exist in the dataset. tabensemb provides several methods incorporating other packages like miceforest and scikit-learn. The configuration for an imputer contains two parts: the name of the imputer and its arguments. Data imputers can be set similarly to data splitters in the following ways:
Modify the configuration file, configs/sample.py for example:
cfg = {
"data_imputer": ["MiceImputer", {"max_iter": 10}],
# "data_imputer": "MiceImputer", (If no kwargs is given)
# Some other configurations...
}
Use the manual_config argument of Trainer.load_config:
trainer.load_config("sample", manual_config={"data_imputer": ["MiceImputer", {"max_iter": 10}]})
trainer.load_config("sample", manual_config={"data_imputer": "MiceImputer"})
Use DataModule.set_data_imputer:
[4]:
trainer.load_config("sample")
trainer.datamodule.set_data_imputer(["MiceImputer", {"max_iter": 10}])
trainer.load_data()
The project will be saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I1
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I1 (data.csv and tabular_data.csv).
[5]:
trainer.df.isna().any()
[5]:
cont_0 False
cont_1 False
cont_2 False
cont_3 False
cont_4 False
cont_5 False
cont_6 False
cont_7 False
cont_8 False
cont_9 False
cat_0 False
cat_1 False
cat_2 False
cat_3 False
cat_4 False
cat_5 False
cat_6 False
cat_7 False
cat_8 False
cat_9 False
target False
target_binary False
target_multi_class False
dtype: bool
Available data imputers can be seen using:
[6]:
from tabensemb.data.dataimputer import imputer_mapping, get_data_imputer
imputer_mapping
[6]:
{'AbstractImputer': tabensemb.data.base.AbstractImputer,
'AbstractSklearnImputer': tabensemb.data.base.AbstractSklearnImputer,
'GainImputer': tabensemb.data.dataimputer.GainImputer,
'MeanImputer': tabensemb.data.dataimputer.MeanImputer,
'MedianImputer': tabensemb.data.dataimputer.MedianImputer,
'MiceImputer': tabensemb.data.dataimputer.MiceImputer,
'MiceLightgbmImputer': tabensemb.data.dataimputer.MiceLightgbmImputer,
'MissForestImputer': tabensemb.data.dataimputer.MissForestImputer,
'ModeImputer': tabensemb.data.dataimputer.ModeImputer}
Arguments can be seen in the API docs or in the docstrings:
[7]:
print(get_data_imputer("MeanImputer").__doc__)
Imputation with average values implemented using sklearn's SimpleImputer.
Parameters
----------
**kwargs
Arguments for ``sklearn.impute.SimpleImputer`` (except for ``strategy``)
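The idea behind mean imputation can be sketched without any dependencies: each NaN is replaced by the mean of the observed values in its column. This is only an illustration of the technique. `mean_impute` is a hypothetical helper and is not how MeanImputer is implemented (its docstring above says it builds on sklearn's SimpleImputer).

```python
import math


def mean_impute(rows):
    """Hypothetical helper: replace NaNs by the per-column mean of observed values.

    Assumes every column has at least one observed (non-NaN) value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if math.isnan(value) else value for j, value in enumerate(row)]
        for row in rows
    ]


data = [
    [1.0, 10.0],
    [math.nan, 20.0],
    [3.0, math.nan],
]
imputed = mean_impute(data)
print(imputed)
```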
Data processors#
As listed in Step 4 above, data processing includes filtering, augmentation, feature selection, and much more. tabensemb provides a unified framework for implementing various data processing steps. The imputation and processing procedure is similar to the Pipeline structure in sklearn, but it is fully compatible with the other two modules introduced in this part (data splitters and data derivers), and together the four modules automatically carry out all preparation before training.
Configuration for a processor also contains two parts: the name of the processor and arguments. Here we provide several examples:
- CategoricalOrdinalEncoder: similar to the OrdinalEncoder in sklearn
- NaNFeatureRemover: remove features that are all NaNs
- VarianceFeatureSelector: same as the VarianceThreshold from sklearn
- FeatureValueSelector: select data points that have a certain value of a feature
- CorrFeatureSelector: remove highly correlated features
- IQRRemover: remove outliers found by the 1.5*IQR criterion
- StdRemover: remove outliers found by the 3*std criterion
- SampleDataAugmenter: just an example to show the data augmentation capability (it copies the last two data points in the validation set)
- StandardScaler: same as the StandardScaler from sklearn
Remark: Data scalers like a StandardScaler must be the last data processor.
Remark: Data points derived from data augmenters (inheriting AbstractAugmenter) will be added to the training set.
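The two outlier criteria mentioned above can be sketched with the standard library. A value is kept if it lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for the IQR rule, or within mean ± 3*std for the std rule. The helper names are hypothetical, and the quantile method and the use of the population standard deviation are assumptions made here; IQRRemover and StdRemover may differ in those details.

```python
import statistics


def iqr_mask(values, k=1.5):
    """Hypothetical helper: True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [lower <= v <= upper for v in values]


def std_mask(values, k=3.0):
    """Hypothetical helper: True for values within k population std of the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [abs(v - mean) <= k * std for v in values]


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
print(iqr_mask(values))
print(std_mask(values))
```

On this toy sample the IQR rule flags 100.0, while the 3*std rule keeps it, because a single extreme value inflates the standard deviation of such a small sample. This is one reason the two criteria can be useful side by side.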
[8]:
processor_configs = [
["CategoricalOrdinalEncoder", {}],
["NaNFeatureRemover", {}],
["VarianceFeatureSelector", {"thres": 0.1}],
["FeatureValueSelector", {"feature": "cat_1", "value": 0}],
["CorrFeatureSelector", {"thres": 0.1}],
["IQRRemover", {}],
["StdRemover", {}],
["SampleDataAugmenter", {}],
["StandardScaler", {}],
]
Modify the configuration file:
cfg = {
"data_processors": processor_configs,
# Some other configurations...
}
Use the manual_config argument of Trainer.load_config:
trainer.load_config("sample", manual_config={"data_processors": processor_configs})
Use DataModule.set_data_processors:
[9]:
import warnings
import numba
trainer.load_config("sample")
trainer.datamodule.set_data_processors(processor_configs)
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=numba.NumbaDeprecationWarning)
    trainer.load_data()
The project will be saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I2
Correlated features (Ranked by SHAP):
{
'cont_2': 13.650428051938668,
'cont_1': 8.98106859262871
}
1 features removed: ['cont_1']. 7 features retained: ['cont_0', 'cont_3', 'cont_4', 'cont_2', 'cat_0', 'cat_1', 'cat_2'].
Removing outliers by IQR. Original size: 36, Final size: 36.
Removing outliers by std. Original size: 36, Final size: 36.
Dataset size: 27 11 12
Data saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I2 (data.csv and tabular_data.csv).
Let’s check the effectiveness of these processors. Categorical features are encoded by CategoricalOrdinalEncoder:
[10]:
trainer.datamodule.categorical_data.head()
[10]:
| | cat_0 | cat_1 | cat_2 |
|---|---|---|---|
| 0 | 3 | 0 | 2 |
| 1 | 3 | 0 | 1 |
| 2 | 3 | 0 | 4 |
| 3 | 0 | 0 | 0 |
| 4 | 4 | 0 | 2 |
The original categorical features can be recovered using:
[11]:
trainer.datamodule.categories_inverse_transform(trainer.datamodule.categorical_data).head()
[11]:
| | cat_0 | cat_1 | cat_2 |
|---|---|---|---|
| 0 | category_3 | 0 | 2 |
| 1 | category_3 | 0 | 1 |
| 2 | category_3 | 0 | 4 |
| 3 | category_0 | 0 | 0 |
| 4 | category_4 | 0 | 2 |
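Ordinal encoding and its inverse can be pictured as a pair of lookup tables per feature. The following is a minimal sketch with hypothetical helper names. It assigns codes to the sorted observed categories, which is an assumption and not necessarily the ordering CategoricalOrdinalEncoder uses.

```python
def fit_ordinal(values):
    """Hypothetical helper: map each distinct category to an integer code."""
    categories = sorted(set(values))
    return {category: code for code, category in enumerate(categories)}


def encode(values, mapping):
    """Replace each category by its integer code."""
    return [mapping[v] for v in values]


def decode(codes, mapping):
    """Invert the mapping to recover the original categories."""
    inverse = {code: category for category, code in mapping.items()}
    return [inverse[c] for c in codes]


raw = ["category_3", "category_3", "category_0", "category_4"]
mapping = fit_ordinal(raw)
codes = encode(raw, mapping)
print(codes)
print(decode(codes, mapping))
```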
One feature is removed by CorrFeatureSelector. It removes the feature with the lowest feature importance (ranked using shap in the example) in the correlation chain.
[12]:
trainer.cont_feature_names
[12]:
['cont_0', 'cont_2', 'cont_3', 'cont_4']
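One simplified way to picture this selection is a greedy pass over features ranked by importance: a feature is retained only if it is not highly correlated with any feature already retained. The sketch below is an assumption-laden stand-in, not CorrFeatureSelector itself; the helper names, the toy data, the importance values, and the `>=` comparison against the threshold are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def select_uncorrelated(columns, importance, thres):
    """Hypothetical helper: keep the most important feature of each correlated group."""
    ranked = sorted(columns, key=lambda name: importance[name], reverse=True)
    retained = []
    for name in ranked:
        correlated = any(
            abs(pearson(columns[name], columns[kept])) >= thres for kept in retained
        )
        if not correlated:
            retained.append(name)
    return retained


columns = {
    "f_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f_b": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with f_a
    "f_c": [1.0, -1.0, 1.0, -1.0, 1.0],  # uncorrelated with both
}
importance = {"f_a": 9.0, "f_b": 13.0, "f_c": 1.0}  # toy importance scores
selected = select_uncorrelated(columns, importance, thres=0.9)
print(selected)
```

Here f_a is dropped because it is correlated with the more important f_b, while f_c is kept despite its low importance because it is not correlated with anything retained.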
The specified cat_1 value is selected by FeatureValueSelector. IQRRemover and StdRemover then drop outliers from the remaining data points (in this run, the log above reports no change in size for either). Original indices of the removed data points can be seen using:
[13]:
trainer.datamodule.dropped_indices
[13]:
array([ 0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13, 14,
15, 18, 19, 20, 22, 24, 25, 26, 27, 28, 29, 30, 31,
32, 35, 37, 38, 39, 40, 41, 42, 43, 45, 46, 48, 49,
50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
63, 64, 65, 66, 67, 68, 69, 70, 72, 74, 75, 76, 77,
80, 81, 82, 83, 85, 86, 89, 90, 93, 94, 96, 97, 98,
99, 100, 102, 103, 104, 105, 106, 107, 109, 110, 111, 112, 113,
114, 115, 116, 118, 119, 120, 121, 123, 124, 125, 127, 130, 131,
132, 133, 135, 136, 137, 139, 140, 141, 142, 143, 144, 145, 146,
147, 148, 150, 151, 152, 153, 157, 158, 159, 161, 163, 165, 166,
167, 168, 170, 172, 173, 175, 176, 177, 178, 179, 180, 182, 183,
184, 185, 186, 187, 188, 189, 190, 192, 193, 194, 195, 196, 197,
199, 200, 201, 202, 203, 207, 208, 209, 210, 211, 212, 213, 214,
217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229,
230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242,
243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255])
The SampleDataAugmenter copies the last two data points in the validation set as a showcase. DataModule.augmented_indices holds the indices of these copies before the data points in DataModule.dropped_indices are dropped. The augmented data points can be seen using:
[14]:
trainer.df.loc[trainer.datamodule.augmented_indices-len(trainer.datamodule.dropped_indices), :]
[14]:
| | cont_0 | cont_1 | cont_2 | cont_3 | cont_4 | cont_5 | cont_6 | cont_7 | cont_8 | cont_9 | ... | cat_3 | cat_4 | cat_5 | cat_6 | cat_7 | cat_8 | cat_9 | target | target_binary | target_multi_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48 | -0.505358 | -0.104343 | -0.507518 | -0.988002 | -0.815792 | -1.284552 | -1.051880 | 0.564009 | 2.49720 | -2.245322 | ... | 4 | 4 | category_4 | 3 | 1 | 1 | 2 | -246.101543 | 1 | 3 |
| 49 | -2.115056 | 0.138315 | 1.618054 | 0.541008 | 1.405365 | -1.449118 | -0.824409 | -0.813794 | 0.42258 | 0.547481 | ... | 0 | 0 | category_2 | 4 | 3 | 3 | 1 | -156.813059 | 0 | 3 |
2 rows × 23 columns
[15]:
trainer.df.loc[trainer.datamodule.val_indices[-2:], :]
[15]:
| | cont_0 | cont_1 | cont_2 | cont_3 | cont_4 | cont_5 | cont_6 | cont_7 | cont_8 | cont_9 | ... | cat_3 | cat_4 | cat_5 | cat_6 | cat_7 | cat_8 | cat_9 | target | target_binary | target_multi_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | -0.505358 | -0.104343 | -0.507518 | -0.988002 | -0.815792 | -1.284552 | -1.051880 | 0.564009 | 2.49720 | -2.245322 | ... | 4 | 4 | category_4 | 3 | 1 | 1 | 2 | -246.101543 | 1 | 3 |
| 0 | -2.115056 | 0.138315 | 1.618054 | 0.541008 | 1.405365 | -1.449118 | -0.824409 | -0.813794 | 0.42258 | 0.547481 | ... | 0 | 0 | category_2 | 4 | 3 | 3 | 1 | -156.813059 | 0 | 3 |
2 rows × 23 columns
Finally, StandardScaler scales the dataset. DataModule.df is the unscaled data frame, and DataModule.scaled_df is the scaled one.
[16]:
trainer.datamodule.df[trainer.cont_feature_names].describe()
[16]:
| | cont_0 | cont_2 | cont_3 | cont_4 |
|---|---|---|---|---|
| count | 50.000000 | 50.000000 | 50.000000 | 50.000000 |
| mean | -0.153242 | -0.086822 | 0.083808 | 0.194227 |
| std | 0.961960 | 0.947694 | 0.762640 | 1.026297 |
| min | -2.115056 | -1.945703 | -1.665956 | -1.884586 |
| 25% | -0.666141 | -0.809715 | -0.437357 | -0.697263 |
| 50% | -0.087749 | 0.082401 | -0.059459 | 0.246986 |
| 75% | 0.329382 | 0.765257 | 0.541008 | 1.110425 |
| max | 2.929096 | 1.618054 | 1.576299 | 2.285601 |
[17]:
trainer.datamodule.scaled_df[trainer.cont_feature_names].describe()
[17]:
| | cont_0 | cont_2 | cont_3 | cont_4 |
|---|---|---|---|---|
| count | 50.000000 | 50.000000 | 50.000000 | 50.000000 |
| mean | 0.069544 | 0.049781 | 0.002835 | 0.034550 |
| std | 1.072721 | 0.975998 | 0.971265 | 1.002871 |
| min | -2.118156 | -1.864617 | -2.225588 | -1.996813 |
| 25% | -0.502411 | -0.694703 | -0.660898 | -0.836591 |
| 50% | 0.142578 | 0.224057 | -0.179623 | 0.086104 |
| 75% | 0.607738 | 0.927307 | 0.585106 | 0.929836 |
| max | 3.506786 | 1.805574 | 1.903607 | 2.078188 |
Remark: All modules are fitted on the training and validation sets and then applied to the testing set.
[18]:
import numpy as np
trainer.datamodule.scaled_df.loc[np.append(trainer.train_indices, trainer.val_indices), trainer.cont_feature_names].describe()
[18]:
| | cont_0 | cont_2 | cont_3 | cont_4 |
|---|---|---|---|---|
| count | 3.800000e+01 | 3.800000e+01 | 3.800000e+01 | 3.800000e+01 |
| mean | 4.674623e-17 | 5.843279e-18 | -4.090295e-17 | -2.921640e-17 |
| std | 1.013423e+00 | 1.013423e+00 | 1.013423e+00 | 1.013423e+00 |
| min | -2.118156e+00 | -1.864617e+00 | -2.225588e+00 | -1.996813e+00 |
| 25% | -5.160307e-01 | -6.772407e-01 | -7.744541e-01 | -8.633614e-01 |
| 50% | 1.425780e-01 | 2.926057e-02 | -5.491499e-02 | 6.457453e-02 |
| 75% | 6.018802e-01 | 8.600076e-01 | 5.851057e-01 | 9.298360e-01 |
| max | 1.814086e+00 | 1.805574e+00 | 1.903607e+00 | 2.005587e+00 |
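The table above is consistent with a scaler fitted on the training and validation rows only: their scaled mean is numerically zero, and the reported std of about 1.0134 equals sqrt(38/37). That is the value pandas' sample standard deviation (ddof=1) gives for 38 values that were scaled to unit population standard deviation. The sketch below reproduces this effect with the standard library on toy data; the helper names and the data are hypothetical.

```python
import math
import statistics


def fit_scaler(values):
    """Hypothetical helper: mean and population std, as in standard scaling."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Apply previously fitted statistics to any partition."""
    return [(v - mean) / std for v in values]


train_val = [float(i) for i in range(38)]  # stands in for the 38 train + val rows
test = [40.0, 45.0]                        # never used for fitting

mean, std = fit_scaler(train_val)
scaled_train_val = transform(train_val, mean, std)
scaled_test = transform(test, mean, std)

# Sample std (ddof=1) of data scaled with the population std (ddof=0)
print(statistics.stdev(scaled_train_val))
print(math.sqrt(38 / 37))
```

Because the testing rows are transformed with statistics they did not contribute to, their scaled mean and std generally deviate from 0 and 1, as the describe() output for the whole scaled_df shows.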
Data derivers#
Existing features in the dataset may not be sufficient to represent the underlying relations between features and the target. Deriving additional features from existing ones, especially features that correlate strongly with the target, can help. Data derivers can be used to add continuous features (stacked in the tabular dataset, Step 3 above) or multi-modal features (unstacked, Step 5 above).
Configurations are similar to those of data processors: a name and its arguments. Necessary and shared arguments are:
- stacked: Should the derived feature be stacked in the processed DataFrame?
- intermediate: Is the derived stacked feature excluded from continuous features?
- derived_name: What is the name of the feature?
Here we give three examples:
- RelativeDeriver calculates the result of dividing absolute_col by relative2_col;
- SampleWeightDeriver calculates the degree to which a data point is an outlier (it is just an example and there isn’t detailed research on it);
- UnscaledDataDeriver records all continuous features before scaling (standard scaling by default).
[19]:
deriver_configs = [
("RelativeDeriver", {
"stacked": True,
"absolute_col": "cont_0",
"relative2_col": "cont_1",
"intermediate": False,
"derived_name": "derived_cont",
}),
("SampleWeightDeriver", {
"stacked": True,
"intermediate": True,
"derived_name": "sample_weight",
}),
("UnscaledDataDeriver", {"derived_name": "unscaled", "stacked": False}),
]
Modify the configuration file:
cfg = {
"data_derivers": deriver_configs,
# Some other configurations...
}
Use the manual_config argument of Trainer.load_config:
trainer.load_config("sample", manual_config={"data_derivers": deriver_configs})
Use DataModule.set_data_derivers:
[20]:
trainer.load_config("sample")
trainer.datamodule.set_data_derivers(deriver_configs)
trainer.load_data()
The project will be saved to ../../../../output/sample/2023-09-23-20-36-53-0_sample
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-09-23-20-36-53-0_sample (data.csv and tabular_data.csv).
Two stacked features can be found in Trainer.df or Trainer.datamodule.df. derived_cont is a continuous feature because intermediate=False, but sample_weight is not.
[21]:
trainer.df[["derived_cont", "sample_weight"]]
[21]:
| | derived_cont | sample_weight |
|---|---|---|
| 0 | 5.884222 | 1.045746 |
| 1 | -9.058123 | 1.063506 |
| 2 | -3.650394 | 0.959582 |
| 3 | 1.678893 | 0.974096 |
| 4 | -4.099185 | 1.000761 |
| ... | ... | ... |
| 251 | -1.355422 | 0.958380 |
| 252 | 1.088160 | 0.978138 |
| 253 | 0.374183 | 0.969419 |
| 254 | 1.199032 | 0.967882 |
| 255 | -1.521539 | 0.979158 |
256 rows × 2 columns
[22]:
"derived_cont" in trainer.cont_feature_names, "sample_weight" in trainer.cont_feature_names
[22]:
(True, False)
The unstacked feature unscaled can be found in Trainer.derived_data:
[23]:
trainer.derived_data["unscaled"]
[23]:
array([[-1.30652685, -0.22203901, -0.11816405, -0.15957344, 1.65813068,
5.88422203],
[ 2.01125668, -0.22203901, 0.1950697 , 0.52700421, -0.04459543,
-9.05812296],
[-1.21607658, 0.33313567, -0.74367217, 0.73018353, 0.1406719 ,
-3.65039438],
...,
[-0.0698565 , -0.18669092, -1.02191332, -1.1436414 , 0.25011391,
0.37418262],
[-1.03148246, -0.86026245, -0.06163805, 0.3283013 , -1.42999126,
1.19903229],
[-1.46173269, 0.9606934 , 0.3675449 , 1.32906285, -0.68343977,
-1.52153922]])
Available derivers can be seen using:
[24]:
from tabensemb.data.dataderiver import deriver_mapping, get_data_deriver
deriver_mapping
[24]:
{'AbstractDeriver': tabensemb.data.base.AbstractDeriver,
'RelativeDeriver': tabensemb.data.dataderiver.RelativeDeriver,
'SampleWeightDeriver': tabensemb.data.dataderiver.SampleWeightDeriver,
'UnscaledDataDeriver': tabensemb.data.dataderiver.UnscaledDataDeriver}
Arguments can be found in API docs or in the docstring.
[25]:
print(get_data_deriver("RelativeDeriver").__doc__)
Dividing a feature by another to derive a new feature. Required arguments are:
absolute_col: str
The feature that needs to be divided.
relative2_col: str
The feature that acts as the denominator.
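The division described in this docstring can be sketched in a few lines. `relative_derive` is a hypothetical helper, not the RelativeDeriver implementation. The input values are the first two columns of the first two rows of the unscaled array printed above, and the quotients agree with the derived_cont values shown earlier (about 5.884222 and -9.058123).

```python
def relative_derive(rows, absolute_col, relative2_col):
    """Hypothetical helper: divide absolute_col by relative2_col for each row."""
    return [row[absolute_col] / row[relative2_col] for row in rows]


rows = [
    {"cont_0": -1.30652685, "cont_1": -0.22203901},
    {"cont_0": 2.01125668, "cont_1": -0.22203901},
]
derived = relative_derive(rows, "cont_0", "cont_1")
print(derived)
```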
Access the processed dataset#
All these data can be found in the DataModule instance in the trainer, along with many modified data structures for further usage:
Continuous features
- DataModule.feature_data: scaled
- DataModule.unscaled_feature_data: not scaled
- DataModule.X_train/X_val/X_test[trainer.cont_feature_names]: scaled and divided into three partitions
- DataModule.tensors[0]: scaled and transformed into torch.Tensor.
Categorical features
- DataModule.categorical_data: ordinal-encoded
- DataModule.X_train/X_val/X_test[trainer.cat_feature_names]: ordinal-encoded and divided into three partitions
- DataModule.derived_data["categorical"]: ordinal-encoded
- trainer.datamodule.tensors[list(trainer.datamodule.derived_data.keys()).index("categorical")+1]: ordinal-encoded and transformed into torch.Tensor.
Derived unstacked features
- DataModule.derived_data: includes unstacked features, categorical features, and a signal for each data point indicating whether it is an augmented one.
- DataModule.tensors[1:-1]: same as DataModule.derived_data, but as torch.Tensors.
Remark: Currently, derived unstacked features are not used in the supported external model bases. However, they can be easily accessed using the above approaches, and even more easily in TorchModel, the base class for customized PyTorch-based model bases, which will be introduced in the “Advanced Usage” sections.
Remark: Stacked (continuous) derived features are derived after imputation but before data processing. These features will also be imputed. Unstacked derived features are derived after all other steps are finished.