{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# New data processors\n",
    "\n",
    "Data processors are the core functionalities in the processing procedure. They can\n",
    "\n",
    "1. Add new data points by inheriting `tabensemb.data.AbstractAugmenter`;\n",
    "2. Remove data points by inheriting `tabensemb.data.AbstractProcessor`;\n",
    "3. Change values of features by inheriting `tabensemb.data.AbstractTransformer` or `tabensemb.data.AbstractScaler`;\n",
    "4. Reduce the number of features by inheriting `tabensemb.data.AbstractFeatureSelector`.\n",
    "\n",
    "The above-mentioned classes are all subclasses of `tabensemb.data.AbstractProcessor`. A subclass of `AbstractProcessor` should have `_fit_transform` and `_transform` implemented. `_fit_transform` is used to fit the processor and transform the training set and the validation set. `_transform` will be called to transform the testing set or an upcoming dataset using the fitted processor. For all these classes, you can implement `_required_kwargs` and `_defaults` as we did in \"New data derivers\" because they all inherit `tabensemb.data.AbstractDataStep`.\n",
    "\n",
    "The usage of processors is already introduced in \"Using data functionalities\".\n",
    "\n",
    "The implemented processors should be registered as follows to be recognized by `DataModule.set_data_processors` automatically.\n",
    "\n",
    "```python\n",
    "from tabensemb.data.dataprocessor import processor_mapping\n",
    "processor_mapping[\"ADataProcessor\"] = ADataProcessor\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "```python\n",
    "from tabensemb.data import AbstractAugmenter, AbstractProcessor, AbstractTransformer, AbstractScaler, AbstractFeatureSelector, DataModule\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## `AbstractAugmenter`\n",
    "\n",
    "We provide an example of data augmentation in the package, which simply copies the last two data points of the input `DataFrame` that contains the training set and the validation set. The method `_get_augmented`, which returns a `DataFrame` containing new data points, is the only method that needs to be implemented.\n",
    "\n",
    "```python\n",
    "class SampleDataAugmenter(AbstractAugmenter):\n",
    "    def _get_augmented(\n",
    "        self, data: pd.DataFrame, datamodule: DataModule\n",
    "    ) -> pd.DataFrame:\n",
    "        augmented = data.loc[data.index[-2:], :].copy()\n",
    "        return augmented\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## `AbstractProcessor`\n",
    "\n",
    "It is the base class for data processors. Other mentioned classes implement these two methods and provide higher-level methods for simplification. Currently, only processors that remove some data points are still implemented under `AbstractProcessor` directly. Take `tabensemb.data.dataprocessors.FeatureValueSelector` as an example.\n",
    "\n",
    "`FeatureValueSelector` is used to select data points that have the specific value (the argument \"value\") of a certain feature (the argument \"feature\"). These two arguments are defined in `_required_kwargs`.\n",
    "\n",
    "```python\n",
    "class FeatureValueSelector(AbstractProcessor):\n",
    "    def _required_kwargs(self):\n",
    "        return [\"feature\", \"value\"]\n",
    "```\n",
    "\n",
    "It directly removes unwanted data points in the `DataFrame`.\n",
    "\n",
    "**Remark**: **DO NOT** reset the index of the returned `DataFrame`, which is used to update the indices of training/validation/testing sets.\n",
    "\n",
    "```python\n",
    "    def _fit_transform(self, data: pd.DataFrame, datamodule: DataModule):\n",
    "        feature = self.kwargs[\"feature\"]\n",
    "        value = self.kwargs[\"value\"]\n",
    "        where_value = data.index[np.where(data[feature] == value)[0]]\n",
    "        data = data.loc[where_value, :]\n",
    "        self.feature, self.value = feature, value\n",
    "        return data\n",
    "```\n",
    "\n",
    "`FeatureValueSelector` behaves differently when processing the dataset at hand (`datamodule.training==True`) and the upcoming dataset (`datamodule.training==False`) respectively. In the latter case, data points should not be removed when making inferences. However, data points can be removed from the validation or testing sets (`datamodule.training==True`) because we only want the specific value of the feature in the entire dataset.\n",
    "\n",
    "```python\n",
    "    def _transform(self, data: pd.DataFrame, datamodule: DataModule):\n",
    "        if datamodule.training:\n",
    "            if self.value not in list(data[self.feature]):\n",
    "                raise Exception(\n",
    "                    f\"Value {self.value} not available for feature {self.feature}. Select from {data[self.feature].unique()}\"\n",
    "                )\n",
    "            where_value = data.index[np.where(data[self.feature] == self.value)[0]]\n",
    "            data = data.loc[where_value, :]\n",
    "        else:\n",
    "            if self.value not in list(data[self.feature]):\n",
    "                warnings.warn(\n",
    "                    f\"Value {self.value} not available for feature {self.feature} selected by \"\n",
    "                    f\"{self.__class__.__name__}.\"\n",
    "                )\n",
    "        return data\n",
    "```\n",
    "\n",
    "## `AbstractFeatureSelector`\n",
    "\n",
    "`AbstractFeatureSelector` is used to select tabular features and thus reduce the dimension of the problem. The only necessary method is `_get_feature_names_out` which returns a list of selected features. Take `tabensemb.data.dataprocessors.VarianceFeatureSelector` that uses `sklearn.feature_selection.VarianceThreshold` as an example. A parameter `thres` can be given. The input `DataFrame` is the training and validation set.\n",
    "\n",
    "```python\n",
    "from sklearn.feature_selection import VarianceThreshold\n",
    "\n",
    "class FeatureSelector(AbstractFeatureSelector):\n",
    "    def _defaults(self):\n",
    "        return dict(thres=0.8)\n",
    "\n",
    "    def _get_feature_names_out(self, data, datamodule):\n",
    "        thres = self.kwargs[\"thres\"]\n",
    "        sel = VarianceThreshold(threshold=(thres * (1 - thres)))\n",
    "        sel.fit(\n",
    "            data[datamodule.all_feature_names],\n",
    "            data[datamodule.label_name].values.flatten()\n",
    "            if len(datamodule.label_name) == 1\n",
    "            else data[datamodule.label_name].values,  # Ignored.\n",
    "        )\n",
    "        retain_features = list(sel.get_feature_names_out())\n",
    "        return retain_features\n",
    "```\n",
    "\n",
    "## `AbstractTransformer`\n",
    "\n",
    "`AbstractTransformer` is used to modify the values of features. Its implementation is exactly the same as that of `AbstractProcessor`. It is mostly a classification criteria to tell the user what it will do, and so does the following `AbstractScaler` that inherits it. A typical example is `tabensemb.data.dataprocessors.CategoricalOrdinalEncoder` which turns categorical features containing meaningful strings into numerical representations.\n",
    "\n",
    "A method called `DataModule.get_var_change` can calculate what a specific value of a specific feature will become after going through all `AbstractTransformer`s used. It can be useful when the zero values are needed to be unchanged.\n",
    "\n",
    "## `AbstractScaler`\n",
    "\n",
    "It inherits `AbstractTransformer`. The last data processor defined in an `DataModule` must be a `AbstractScaler`. As shown in \"Customized model base\", some representations of the dataset in the `DataModule` are stored in the unscaled form, which means they have gone through all data processors except for the last one. Call `datamodule.data_transform(df, scaler_only=True)` to scale them by calling the last data processor (the `AbstractScaler`). The implementation is similar to `AbstractProcessor`. Take `tabensemb.data.dataprocessors.StandardScaler` which uses `sklearn.preprocessing.StandardScaler` as an example:\n",
    "\n",
    "```python\n",
    "from sklearn.preprocessing import StandardScaler as skStandardScaler\n",
    "\n",
    "class StandardScaler(AbstractScaler):\n",
    "    def _fit_transform(self, data: pd.DataFrame, datamodule: DataModule):\n",
    "        scaler = skStandardScaler()\n",
    "        if len(datamodule.cont_feature_names) > 0:\n",
    "            data[datamodule.cont_feature_names] = scaler.fit_transform(\n",
    "                data[datamodule.cont_feature_names]\n",
    "            ).astype(np.float64)\n",
    "\n",
    "        self.transformer = scaler\n",
    "        return data\n",
    "\n",
    "    def _transform(self, data: pd.DataFrame, datamodule: DataModule):\n",
    "        if len(datamodule.cont_feature_names) > 0:\n",
    "            data[datamodule.cont_feature_names] = self.transformer.transform(\n",
    "                data[datamodule.cont_feature_names]\n",
    "            ).astype(np.float64)\n",
    "        return data\n",
    "```\n",
    "\n",
    "**Remark**: It is highly recommended to use the 64-bit float (double) precision to avoid inconsistent result between `_fit_transform` and `_transform`.\n",
    "\n",
    "**Remark**: There can be no continuous and/or categorical features. Please confirm that your `AbstractProcessor`s support empty `datamodule.cont_feature_names` and/or `datamodule.cat_feature_names`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}