New data processors#
Data processors are the core functionalities in the processing procedure. They can:
Add new data points by inheriting tabensemb.data.AbstractAugmenter;
Remove data points by inheriting tabensemb.data.AbstractProcessor;
Change values of features by inheriting tabensemb.data.AbstractTransformer or tabensemb.data.AbstractScaler;
Reduce the number of features by inheriting tabensemb.data.AbstractFeatureSelector.
The other classes listed above are all subclasses of tabensemb.data.AbstractProcessor. A subclass of AbstractProcessor should implement _fit_transform and _transform. _fit_transform fits the processor and transforms the training set and the validation set. _transform is called to transform the testing set or an upcoming dataset using the fitted processor. For all these classes, you can implement _required_kwargs and _defaults as we did in “New data derivers” because they all inherit tabensemb.data.AbstractDataStep.
The usage of processors has already been introduced in “Using data functionalities”.
The implemented processors should be registered as follows to be recognized by DataModule.set_data_processors automatically.
from tabensemb.data.dataprocessor import processor_mapping
processor_mapping["ADataProcessor"] = ADataProcessor
from tabensemb.data import AbstractAugmenter, AbstractProcessor, AbstractTransformer, AbstractScaler, AbstractFeatureSelector, DataModule
import pandas as pd
import numpy as np
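The registration step above is a name-to-class lookup. The following is a minimal sketch of that pattern, using a plain dict as a stand-in for processor_mapping and a hypothetical build_processors helper. The (name, kwargs) configuration format and the helper itself are assumptions made for illustration, not tabensemb's actual implementation of DataModule.set_data_processors.

```python
# Stand-in for tabensemb's processor_mapping: a plain name -> class registry.
registry = {}


class ADataProcessor:
    """Hypothetical processor; a real one would inherit AbstractProcessor."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


# Registering makes the class discoverable by its string name.
registry["ADataProcessor"] = ADataProcessor


def build_processors(config):
    """Resolve (name, kwargs) pairs to instances (assumed config format)."""
    processors = []
    for name, kwargs in config:
        if name not in registry:
            raise KeyError(f"Processor {name!r} is not registered.")
        processors.append(registry[name](**kwargs))
    return processors


processors = build_processors([("ADataProcessor", {"feature": "x"})])
```

An unregistered name fails at lookup time, which is why a custom processor has to be added to the mapping before it is referenced by name.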
AbstractAugmenter#
We provide an example of data augmentation in the package, which simply copies the last two data points of the input DataFrame that contains the training set and the validation set. The method _get_augmented, which returns a DataFrame containing new data points, is the only method that needs to be implemented.
class SampleDataAugmenter(AbstractAugmenter):
    def _get_augmented(
        self, data: pd.DataFrame, datamodule: DataModule
    ) -> pd.DataFrame:
        augmented = data.loc[data.index[-2:], :].copy()
        return augmented
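To see what the selection inside _get_augmented returns, the sketch below applies the same expression to a toy DataFrame outside of any DataModule. The column names and index labels are made up for illustration; how the augmenter base class handles the returned rows afterwards is not shown here.

```python
import pandas as pd

# Toy training/validation table with non-default index labels (illustrative data).
data = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0], "y": [0, 1, 0, 1]},
    index=[10, 11, 12, 13],
)

# Same selection as in _get_augmented: the last two rows, copied.
augmented = data.loc[data.index[-2:], :].copy()

# Because of .copy(), editing the augmented rows leaves the original untouched.
augmented.loc[13, "x"] = -1.0
```

The selected rows keep their original index labels (12 and 13 here), and the explicit copy means later edits to the new points cannot leak back into the source data.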
AbstractProcessor#
It is the base class for data processors. The other classes mentioned above implement _fit_transform and _transform themselves and expose higher-level methods to simplify subclassing. Currently, only processors that remove some data points are still implemented under AbstractProcessor directly. Take tabensemb.data.dataprocessors.FeatureValueSelector as an example.
FeatureValueSelector is used to select data points that have a specific value (the argument “value”) of a certain feature (the argument “feature”). These two arguments are declared in _required_kwargs.
class FeatureValueSelector(AbstractProcessor):
    def _required_kwargs(self):
        return ["feature", "value"]
It directly removes unwanted data points in the DataFrame.
Remark: DO NOT reset the index of the returned DataFrame, which is used to update the indices of training/validation/testing sets.
    def _fit_transform(self, data: pd.DataFrame, datamodule: DataModule):
        feature = self.kwargs["feature"]
        value = self.kwargs["value"]
        where_value = data.index[np.where(data[feature] == value)[0]]
        data = data.loc[where_value, :]
        self.feature, self.value = feature, value
        return data
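The remark about not resetting the index can be illustrated with plain pandas. In the sketch below, train_indices and test_indices are hypothetical bookkeeping arrays, and the use of np.intersect1d is an assumption about how surviving rows could be matched to each split; it is not taken from the tabensemb source.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {"material": ["A", "B", "A", "B", "A"], "x": [0.1, 0.2, 0.3, 0.4, 0.5]}
)
# Hypothetical split bookkeeping: index labels that belong to each subset.
train_indices = np.array([0, 1, 2])
test_indices = np.array([3, 4])

# Same selection as in _fit_transform.
where_value = data.index[np.where(data["material"] == "A")[0]]
kept = data.loc[where_value, :]

# With the original labels intact, the surviving rows of each split are recoverable.
new_train = np.intersect1d(train_indices, kept.index)
new_test = np.intersect1d(test_indices, kept.index)

# After reset_index, the labels no longer identify the original rows.
reset = kept.reset_index(drop=True)
```

With the labels preserved, rows 0 and 2 can still be attributed to the training set and row 4 to the testing set. After the reset, the labels 0, 1, 2 would all appear to belong to the training set.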
FeatureValueSelector behaves differently when processing the dataset at hand (datamodule.training==True) and an upcoming dataset (datamodule.training==False). In the former case, data points can be removed from the validation or testing sets because we only want the specified value of the feature in the entire dataset, and an exception is raised if that value is absent. In the latter case, data points should not be removed when making inferences, so the data is returned unchanged and only a warning is issued if the value is absent.
import warnings  # module-level import required by warnings.warn below

    def _transform(self, data: pd.DataFrame, datamodule: DataModule):
        if datamodule.training:
            # Dataset at hand: keep only rows with the selected value.
            if self.value not in list(data[self.feature]):
                raise Exception(
                    f"Value {self.value} not available for feature {self.feature}. Select from {data[self.feature].unique()}"
                )
            where_value = data.index[np.where(data[self.feature] == self.value)[0]]
            data = data.loc[where_value, :]
        else:
            # Upcoming dataset: keep every row, but warn if the value is absent.
            if self.value not in list(data[self.feature]):
                warnings.warn(
                    f"Value {self.value} not available for feature {self.feature} selected by "
                    f"{self.__class__.__name__}."
                )
        return data
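The two branches can be tried without a DataModule by using a standalone function. In the sketch below, select_value is a hypothetical mirror of _transform, with the training flag passed explicitly and a ValueError in place of the generic Exception; it is meant only to show the difference in behavior.

```python
import warnings

import pandas as pd


def select_value(data, feature, value, training):
    """Standalone mirror of the two branches of _transform (not the library code)."""
    present = (data[feature] == value).any()
    if training:
        if not present:
            raise ValueError(f"Value {value} not available for feature {feature}.")
        return data.loc[data[feature] == value, :]
    if not present:
        warnings.warn(f"Value {value} not available for feature {feature}.")
    return data  # inference: every row is kept


testing_set = pd.DataFrame({"material": ["A", "B", "A"]}, index=[7, 8, 9])
upcoming = pd.DataFrame({"material": ["B", "C"]})

# Dataset at hand: rows with other values are dropped, labels are preserved.
filtered = select_value(testing_set, "material", "A", training=True)

# Upcoming dataset: nothing is dropped, a warning is recorded instead.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    untouched = select_value(upcoming, "material", "A", training=False)
```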
AbstractFeatureSelector#
AbstractFeatureSelector is used to select tabular features and thus reduce the dimension of the problem. The only necessary method is _get_feature_names_out which returns a list of selected features. Take tabensemb.data.dataprocessors.VarianceFeatureSelector that uses sklearn.feature_selection.VarianceThreshold as an example. A parameter thres can be given. The input DataFrame is the training and validation set.
from sklearn.feature_selection import VarianceThreshold

class FeatureSelector(AbstractFeatureSelector):
    def _defaults(self):
        return dict(thres=0.8)

    def _get_feature_names_out(self, data, datamodule):
        thres = self.kwargs["thres"]
        sel = VarianceThreshold(threshold=(thres * (1 - thres)))
        sel.fit(
            data[datamodule.all_feature_names],
            data[datamodule.label_name].values.flatten()
            if len(datamodule.label_name) == 1
            else data[datamodule.label_name].values,  # Ignored.
        )
        retain_features = list(sel.get_feature_names_out())
        return retain_features
AbstractTransformer#
AbstractTransformer is used to modify the values of features. Its implementation is exactly the same as that of AbstractProcessor. It serves mostly as a classification criterion that tells the user what the processor will do, as does AbstractScaler, described next, which inherits from it. A typical example is tabensemb.data.dataprocessors.CategoricalOrdinalEncoder, which turns categorical features containing meaningful strings into numerical representations.
A method called DataModule.get_var_change can calculate what a specific value of a specific feature will become after going through all AbstractTransformers in use. It can be useful when zero values need to remain unchanged.
AbstractScaler#
It inherits from AbstractTransformer. The last data processor defined in a DataModule must be an AbstractScaler. As shown in “Customized model base”, some representations of the dataset in the DataModule are stored in unscaled form, which means they have gone through all data processors except the last one. Call datamodule.data_transform(df, scaler_only=True) to scale them with the last data processor (the AbstractScaler). The implementation is similar to that of AbstractProcessor. Take tabensemb.data.dataprocessors.StandardScaler, which uses sklearn.preprocessing.StandardScaler, as an example:
from sklearn.preprocessing import StandardScaler as skStandardScaler

class StandardScaler(AbstractScaler):
    def _fit_transform(self, data: pd.DataFrame, datamodule: DataModule):
        scaler = skStandardScaler()
        if len(datamodule.cont_feature_names) > 0:
            data[datamodule.cont_feature_names] = scaler.fit_transform(
                data[datamodule.cont_feature_names]
            ).astype(np.float64)
        self.transformer = scaler
        return data

    def _transform(self, data: pd.DataFrame, datamodule: DataModule):
        if len(datamodule.cont_feature_names) > 0:
            data[datamodule.cont_feature_names] = self.transformer.transform(
                data[datamodule.cont_feature_names]
            ).astype(np.float64)
        return data
Remark: It is highly recommended to use 64-bit float (double) precision to avoid inconsistent results between _fit_transform and _transform.
Remark: A dataset may contain no continuous features, no categorical features, or neither. Please confirm that your AbstractProcessors support an empty datamodule.cont_feature_names and/or datamodule.cat_feature_names.
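The guard for an empty feature list, together with the float64 recommendation, can be sketched without tabensemb. The helper standardize below is hypothetical and computes the z-score with NumPy directly rather than through scikit-learn; it only illustrates the pattern of checking the list length before touching any columns.

```python
import numpy as np
import pandas as pd


def standardize(data, cont_feature_names):
    """Hypothetical helper: z-score the listed columns, or do nothing if the list is empty."""
    data = data.copy()
    if len(cont_feature_names) > 0:
        # Work in double precision, as recommended in the remark above.
        values = data[cont_feature_names].to_numpy(dtype=np.float64)
        # NOTE: a constant column has zero std and is not handled in this sketch.
        data[cont_feature_names] = (values - values.mean(axis=0)) / values.std(axis=0)
    return data


df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "cat": [0, 1, 0]})
scaled = standardize(df, ["x"])
unchanged = standardize(df, [])  # no continuous features: returned as-is
```

Without the length check, selecting an empty list of columns yields a zero-column DataFrame, which downstream estimators may reject.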