New data derivers#

In this package, a very limited number of derivers are currently provided. A deriver can be used to calculate new features (continuous or categorical) based on existing features, or load images, text, etc. as multimodal data. The source code of the integrated tabensemb.data.dataderiver.RelativeDeriver is extended here to demonstrate the implementation procedure.

[1]:
from tabensemb.data.dataderiver import AbstractDeriver

Data derivers inherit tabensemb.data.AbstractDervier and four methods should be implemented:

  • _required_cols: Arguments for columns that must exist in the tabular dataset. The following code means that the arguments absolute_col and relative2_col should be given in the configuration, such as "data_derivers": [("MyRelativeDeriver", {"absolute_col": "cont_0", "relative2_col": "cont_1"})]

class MyRelativeDeriver(AbstractDeriver):
    def _required_cols(self):
        return ["absolute_col", "relative2_col"]
  • _required_kwargs: Parameters that must be specified in the configuration. The following code means that the parameter some_param should be given in the configuration, such as "data_derivers": [("MyRelativeDeriver", {"some_param": 1.5})]

def _required_kwargs(self):
    return ["some_param"]

Remark: “stacked”, “intermediate”, “derived_name”, and “is_continuous” are shared necessary kwargs and do not need to be added to _required_kwargs.

  • _defaults: Default values of those in _required_cols, _required_kwargs, and ["stacked", "intermediate", "derived_name", "is_continuous"]. If default values are given, no error will be raised if the argument is not set in the configuration.

def _defaults(self):
    return dict(stacked=True, intermediate=False, is_continuous=True)
  • _derive: The main derivation step. It receives the tabular data (a DataFrame) and a DataModule and should return an np.ndarray. The returned array can not be 1d. Arguments are checked and recorded in self.kwargs when initializing.

def _derive(self, df, datamodule):
    absolute_col = self.kwargs["absolute_col"]
    relative2_col = self.kwargs["relative2_col"]
    some_param = self.kwargs["some_param"]
    stacked = self.kwargs["stacked"]

    relative = df[absolute_col] / df[relative2_col]
    relative = relative.values.reshape(-1, 1)
    return relative
[2]:
class MyRelativeDeriver(AbstractDeriver):
    def _required_cols(self):
        return ["absolute_col", "relative2_col"]

    def _required_kwargs(self):
        return ["some_param"]

    def _defaults(self):
        return dict(stacked=True, intermediate=False, is_continuous=True)

    def _derive(self, df, datamodule):
        absolute_col = self.kwargs["absolute_col"]
        relative2_col = self.kwargs["relative2_col"]
        some_param = self.kwargs["some_param"]
        stacked = self.kwargs["stacked"]

        relative = df[absolute_col] / df[relative2_col]
        relative = relative.values.reshape(-1, 1)
        return relative

The implemented splitter should be registered as follows to be recognized by DataModule.set_data_derivers automatically.

[3]:
from tabensemb.data.dataderiver import deriver_mapping
deriver_mapping["MyRelativeDeriver"] = MyRelativeDeriver
[4]:
from tabensemb.trainer import Trainer
import tabensemb

prefix = "../../../../"
tabensemb.setting["default_output_path"] = prefix + "output"
tabensemb.setting["default_config_path"] = prefix + "configs"
tabensemb.setting["default_data_path"] = prefix + "data"

trainer = Trainer(device="cpu")

trainer.load_config("sample")
The project will be saved to ../../../../output/sample/2023-09-18-18-15-00-0_sample

If stacked is True:

[5]:
trainer.datamodule.set_data_derivers([("MyRelativeDeriver", {"absolute_col": "cont_0", "relative2_col": "cont_1", "derived_name": "cont_0_relative2_cont_1", "some_param": 1.0, "stacked": True})])
trainer.load_data()
print(f"cont_0_relative2_cont_1 in continuous features?: {'cont_0_relative2_cont_1' in trainer.cont_feature_names}")
trainer.df
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-09-18-18-15-00-0_sample (data.csv and tabular_data.csv).
cont_0_relative2_cont_1 in continuous features?: True
[5]:
cont_0 cont_1 cont_2 cont_3 cont_4 cont_5 cont_6 cont_7 cont_8 cont_9 ... cat_4 cat_5 cat_6 cat_7 cat_8 cat_9 target target_binary target_multi_class cont_0_relative2_cont_1
0 -1.306527 0.065895 -0.118164 -0.159573 1.658131 -1.346718 -0.680178 -1.334258 0.666383 -0.460720 ... 2 category_4 3 4 4 3 -71.084217 0 1 -19.827301
1 2.011257 0.117717 0.195070 0.527004 -0.044595 0.616887 -1.781563 0.354758 -0.729045 0.196557 ... 3 category_3 3 1 3 2 13.415675 1 2 17.085552
2 -1.216077 0.065895 -0.743672 0.730184 0.140672 1.272954 -0.159012 -0.475175 0.240057 0.100159 ... 4 category_3 4 1 0 2 -47.492280 0 2 -18.454666
3 0.559299 0.117717 -0.431096 -0.809627 -1.063696 -0.860153 0.572751 -0.467441 0.677557 1.307184 ... 1 category_3 4 2 0 0 -94.482614 1 2 4.751225
4 0.910179 -0.213096 0.786328 -0.042257 0.317218 0.379152 -0.466419 -0.017020 -0.944446 -0.410050 ... 0 category_2 0 2 3 0 195.819531 1 3 -4.271217
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
251 0.280442 -0.206904 0.841631 0.880179 -0.993124 -1.570623 -0.249459 0.643314 0.049495 0.493837 ... 2 category_2 2 3 0 2 -171.249549 0 0 -1.355422
252 -1.165150 -1.070753 0.465662 1.054452 0.900826 -0.179925 -1.536244 1.178780 1.488252 1.895889 ... 2 category_4 4 2 1 1 23.708442 0 2 1.088160
253 -0.069856 -0.186691 -1.021913 -1.143641 0.250114 1.040239 -1.150438 0.258798 -0.836111 0.642211 ... 3 category_3 2 2 2 2 -33.414215 1 1 0.374183
254 -1.031482 -0.860262 -0.061638 0.328301 -1.429991 -1.048170 -1.432735 0.607112 0.087531 0.938747 ... 0 category_3 4 1 4 4 -359.199191 0 4 1.199032
255 -1.461733 0.960693 0.367545 1.329063 -0.683440 -1.184687 0.190312 -0.521580 -0.851729 1.822724 ... 1 category_3 4 1 1 4 -135.199100 1 2 -1.521539

256 rows × 24 columns

If stacked is True but intermediate is True:

[6]:
trainer.datamodule.set_data_derivers([("MyRelativeDeriver", {"absolute_col": "cont_0", "relative2_col": "cont_1", "derived_name": "cont_0_relative2_cont_1", "some_param": 1.0, "stacked": True, "intermediate": True})])
trainer.load_data()
print(f"cont_0_relative2_cont_1 in continuous features?: {'cont_0_relative2_cont_1' in trainer.cont_feature_names}")
trainer.df
Using previously used data path ../../../../data/sample.csv
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-09-18-18-15-00-0_sample (data.csv and tabular_data.csv).
cont_0_relative2_cont_1 in continuous features?: False
[6]:
cont_0 cont_1 cont_2 cont_3 cont_4 cont_5 cont_6 cont_7 cont_8 cont_9 ... cat_4 cat_5 cat_6 cat_7 cat_8 cat_9 target target_binary target_multi_class cont_0_relative2_cont_1
0 -1.306527 -0.409756 -0.118164 -0.159573 1.658131 -1.346718 -0.680178 -1.334258 0.666383 -0.460720 ... 2 category_4 3 4 4 3 -71.084217 0 1 3.188552
1 2.011257 -0.409756 0.195070 0.527004 -0.044595 0.616887 -1.781563 0.354758 -0.729045 0.196557 ... 3 category_3 3 1 3 2 13.415675 1 2 -4.908431
2 -1.216077 0.104704 -0.743672 0.730184 0.140672 1.272954 -0.159012 -0.475175 0.240057 0.100159 ... 4 category_3 4 1 0 2 -47.492280 0 2 -11.614467
3 0.559299 0.104704 -0.431096 -0.809627 -1.063696 -0.860153 0.572751 -0.467441 0.677557 1.307184 ... 1 category_3 4 2 0 0 -94.482614 1 2 5.341736
4 0.910179 -0.409756 0.786328 -0.042257 0.317218 0.379152 -0.466419 -0.017020 -0.944446 -0.410050 ... 0 category_2 0 2 3 0 195.819531 1 3 -2.221273
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
251 0.280442 -0.206904 0.841631 0.880179 -0.993124 -1.570623 -0.249459 0.643314 0.049495 0.493837 ... 2 category_2 2 3 0 2 -171.249549 0 0 -1.355422
252 -1.165150 -1.070753 0.465662 1.054452 0.900826 -0.179925 -1.536244 1.178780 1.488252 1.895889 ... 2 category_4 4 2 1 1 23.708442 0 2 1.088160
253 -0.069856 -0.186691 -1.021913 -1.143641 0.250114 1.040239 -1.150438 0.258798 -0.836111 0.642211 ... 3 category_3 2 2 2 2 -33.414215 1 1 0.374183
254 -1.031482 -0.860262 -0.061638 0.328301 -1.429991 -1.048170 -1.432735 0.607112 0.087531 0.938747 ... 0 category_3 4 1 4 4 -359.199191 0 4 1.199032
255 -1.461733 0.960693 0.367545 1.329063 -0.683440 -1.184687 0.190312 -0.521580 -0.851729 1.822724 ... 1 category_3 4 1 1 4 -135.199100 1 2 -1.521539

256 rows × 24 columns

If stacked is False:

[7]:
trainer.datamodule.set_data_derivers([("MyRelativeDeriver", {"absolute_col": "cont_0", "relative2_col": "cont_1", "derived_name": "cont_0_relative2_cont_1", "some_param": 1.0, "stacked": False})])
trainer.load_data()
print(f"cont_0_relative2_cont_1 in continuous features?: {'cont_0_relative2_cont_1' in trainer.cont_feature_names}")
trainer.df
Using previously used data path ../../../../data/sample.csv
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-09-18-18-15-00-0_sample (data.csv and tabular_data.csv).
cont_0_relative2_cont_1 in continuous features?: False
[7]:
cont_0 cont_1 cont_2 cont_3 cont_4 cont_5 cont_6 cont_7 cont_8 cont_9 ... cat_3 cat_4 cat_5 cat_6 cat_7 cat_8 cat_9 target target_binary target_multi_class
0 -1.306527 0.138315 -0.118164 -0.159573 1.658131 -1.346718 -0.680178 -1.334258 0.666383 -0.460720 ... 0 2 category_4 3 4 4 3 -71.084217 0 1
1 2.011257 -0.006111 0.195070 0.527004 -0.044595 0.616887 -1.781563 0.354758 -0.729045 0.196557 ... 4 3 category_3 3 1 3 2 13.415675 1 2
2 -1.216077 0.138315 -0.743672 0.730184 0.140672 1.272954 -0.159012 -0.475175 0.240057 0.100159 ... 0 4 category_3 4 1 0 2 -47.492280 0 2
3 0.559299 -0.006111 -0.431096 -0.809627 -1.063696 -0.860153 0.572751 -0.467441 0.677557 1.307184 ... 4 1 category_3 4 2 0 0 -94.482614 1 2
4 0.910179 -0.006111 0.786328 -0.042257 0.317218 0.379152 -0.466419 -0.017020 -0.944446 -0.410050 ... 1 0 category_2 0 2 3 0 195.819531 1 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
251 0.280442 -0.206904 0.841631 0.880179 -0.993124 -1.570623 -0.249459 0.643314 0.049495 0.493837 ... 1 2 category_2 2 3 0 2 -171.249549 0 0
252 -1.165150 -1.070753 0.465662 1.054452 0.900826 -0.179925 -1.536244 1.178780 1.488252 1.895889 ... 4 2 category_4 4 2 1 1 23.708442 0 2
253 -0.069856 -0.186691 -1.021913 -1.143641 0.250114 1.040239 -1.150438 0.258798 -0.836111 0.642211 ... 0 3 category_3 2 2 2 2 -33.414215 1 1
254 -1.031482 -0.860262 -0.061638 0.328301 -1.429991 -1.048170 -1.432735 0.607112 0.087531 0.938747 ... 0 0 category_3 4 1 4 4 -359.199191 0 4
255 -1.461733 0.960693 0.367545 1.329063 -0.683440 -1.184687 0.190312 -0.521580 -0.851729 1.822724 ... 2 1 category_3 4 1 1 4 -135.199100 1 2

256 rows × 23 columns

[8]:
trainer.derived_data.keys()
[8]:
dict_keys(['cont_0_relative2_cont_1', 'categorical'])