{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "# New data processors\n", "\n", "Data processors are the core functionalities in the processing procedure. They can\n", "\n", "1. Add new data points by inheriting `tabensemb.data.AbstractAugmenter`;\n", "2. Remove data points by inheriting `tabensemb.data.AbstractProcessor`;\n", "3. Change values of features by inheriting `tabensemb.data.AbstractTransformer` or `tabensemb.data.AbstractScaler`;\n", "4. Reduce the number of features by inheriting `tabensemb.data.AbstractFeatureSelector`.\n", "\n", "The above-mentioned classes are all subclasses of `tabensemb.data.AbstractProcessor`. A subclass of `AbstractProcessor` should have `_fit_transform` and `_transform` implemented. `_fit_transform` is used to fit the processor and transform the training set and the validation set. `_transform` will be called to transform the testing set or an upcoming dataset using the fitted processor. For all these classes, you can implement `_required_kwargs` and `_defaults` as we did in \"New data derivers\" because they all inherit `tabensemb.data.AbstractDataStep`.\n", "\n", "The usage of processors is already introduced in \"Using data functionalities\".\n", "\n", "The implemented processors should be registered as follows to be recognized by `DataModule.set_data_processors` automatically.\n", "\n", "```python\n", "from tabensemb.data.dataprocessor import processor_mapping\n", "processor_mapping[\"ADataProcessor\"] = ADataProcessor\n", "```" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "```python\n", "from tabensemb.data import AbstractAugmenter, AbstractProcessor, AbstractTransformer, AbstractScaler, AbstractFeatureSelector, DataModule\n", "import pandas as pd\n", "import numpy as np\n", "```" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "## `AbstractAugmenter`\n", "\n", "We provide an example of data augmentation in the package, which simply copies the last two data points of the input `DataFrame` that contains the training set and the validation set. The method `_get_augmented`, which returns a `DataFrame` containing new data points, is the only method that needs to be implemented.\n", "\n", "```python\n", "class SampleDataAugmenter(AbstractAugmenter):\n", " def _get_augmented(\n", " self, data: pd.DataFrame, datamodule: DataModule\n", " ) -> pd.DataFrame:\n", " augmented = data.loc[data.index[-2:], :].copy()\n", " return augmented\n", "```" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "## `AbstractProcessor`\n", "\n", "It is the base class for data processors. Other mentioned classes implement these two methods and provide higher-level methods for simplification. Currently, only processors that remove some data points are still implemented under `AbstractProcessor` directly. Take `tabensemb.data.dataprocessors.FeatureValueSelector` as an example.\n", "\n", "`FeatureValueSelector` is used to select data points that have the specific value (the argument \"value\") of a certain feature (the argument \"feature\"). These two arguments are defined in `_required_kwargs`.\n", "\n", "```python\n", "class FeatureValueSelector(AbstractProcessor):\n", " def _required_kwargs(self):\n", " return [\"feature\", \"value\"]\n", "```\n", "\n", "It directly removes unwanted data points in the `DataFrame`.\n", "\n", "**Remark**: **DO NOT** reset the index of the returned `DataFrame`, which is used to update the indices of training/validation/testing sets.\n", "\n", "```python\n", " def _fit_transform(self, data: pd.DataFrame, datamodule: DataModule):\n", " feature = self.kwargs[\"feature\"]\n", " value = self.kwargs[\"value\"]\n", " where_value = data.index[np.where(data[feature] == value)[0]]\n", " data = data.loc[where_value, :]\n", " self.feature, self.value = feature, value\n", " return data\n", "```\n", "\n", "`FeatureValueSelector` behaves differently when processing the dataset at hand (`datamodule.training==True`) and the upcoming dataset (`datamodule.training==False`) respectively. In the latter case, data points should not be removed when making inferences. However, data points can be removed from the validation or testing sets (`datamodule.training==True`) because we only want the specific value of the feature in the entire dataset.\n", "\n", "```python\n", " def _transform(self, data: pd.DataFrame, datamodule: DataModule):\n", " if datamodule.training:\n", " if self.value not in list(data[self.feature]):\n", " raise Exception(\n", " f\"Value {self.value} not available for feature {self.feature}. Select from {data[self.feature].unique()}\"\n", " )\n", " where_value = data.index[np.where(data[self.feature] == self.value)[0]]\n", " data = data.loc[where_value, :]\n", " else:\n", " if self.value not in list(data[self.feature]):\n", " warnings.warn(\n", " f\"Value {self.value} not available for feature {self.feature} selected by \"\n", " f\"{self.__class__.__name__}.\"\n", " )\n", " return data\n", "```\n", "\n", "## `AbstractFeatureSelector`\n", "\n", "`AbstractFeatureSelector` is used to select tabular features and thus reduce the dimension of the problem. The only necessary method is `_get_feature_names_out` which returns a list of selected features. Take `tabensemb.data.dataprocessors.VarianceFeatureSelector` that uses `sklearn.feature_selection.VarianceThreshold` as an example. A parameter `thres` can be given. The input `DataFrame` is the training and validation set.\n", "\n", "```python\n", "from sklearn.feature_selection import VarianceThreshold\n", "\n", "class FeatureSelector(AbstractFeatureSelector):\n", " def _defaults(self):\n", " return dict(thres=0.8)\n", "\n", " def _get_feature_names_out(self, data, datamodule):\n", " thres = self.kwargs[\"thres\"]\n", " sel = VarianceThreshold(threshold=(thres * (1 - thres)))\n", " sel.fit(\n", " data[datamodule.all_feature_names],\n", " data[datamodule.label_name].values.flatten()\n", " if len(datamodule.label_name) == 1\n", " else data[datamodule.label_name].values, # Ignored.\n", " )\n", " retain_features = list(sel.get_feature_names_out())\n", " return retain_features\n", "```\n", "\n", "## `AbstractTransformer`\n", "\n", "`AbstractTransformer` is used to modify the values of features. Its implementation is exactly the same as that of `AbstractProcessor`. It is mostly a classification criteria to tell the user what it will do, and so does the following `AbstractScaler` that inherits it. A typical example is `tabensemb.data.dataprocessors.CategoricalOrdinalEncoder` which turns categorical features containing meaningful strings into numerical representations.\n", "\n", "A method called `DataModule.get_var_change` can calculate what a specific value of a specific feature will become after going through all `AbstractTransformer`s used. It can be useful when the zero values are needed to be unchanged.\n", "\n", "## `AbstractScaler`\n", "\n", "It inherits `AbstractTransformer`. The last data processor defined in an `DataModule` must be a `AbstractScaler`. As shown in \"Customized model base\", some representations of the dataset in the `DataModule` are stored in the unscaled form, which means they have gone through all data processors except for the last one. Call `datamodule.data_transform(df, scaler_only=True)` to scale them by calling the last data processor (the `AbstractScaler`). The implementation is similar to `AbstractProcessor`. Take `tabensemb.data.dataprocessors.StandardScaler` which uses `sklearn.preprocessing.StandardScaler` as an example:\n", "\n", "```python\n", "from sklearn.preprocessing import StandardScaler as skStandardScaler\n", "\n", "class StandardScaler(AbstractScaler):\n", " def _fit_transform(self, data: pd.DataFrame, datamodule: DataModule):\n", " scaler = skStandardScaler()\n", " if len(datamodule.cont_feature_names) > 0:\n", " data[datamodule.cont_feature_names] = scaler.fit_transform(\n", " data[datamodule.cont_feature_names]\n", " ).astype(np.float64)\n", "\n", " self.transformer = scaler\n", " return data\n", "\n", " def _transform(self, data: pd.DataFrame, datamodule: DataModule):\n", " if len(datamodule.cont_feature_names) > 0:\n", " data[datamodule.cont_feature_names] = self.transformer.transform(\n", " data[datamodule.cont_feature_names]\n", " ).astype(np.float64)\n", " return data\n", "```\n", "\n", "**Remark**: It is highly recommended to use the 64-bit float (double) precision to avoid inconsistent result between `_fit_transform` and `_transform`.\n", "\n", "**Remark**: There can be no continuous and/or categorical features. Please confirm that your `AbstractProcessor`s support empty `datamodule.cont_feature_names` and/or `datamodule.cat_feature_names`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }