{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Using data functionalities\n",
    "\n",
    "Running `Trainer.load_data` or `DataModule.load_data` will process the dataset in the following order:\n",
    "\n",
    "1. Data splitting (training/validation/testing sets): See \"Data splitters\"\n",
    "2. Data imputation: See \"Data imputers\"\n",
    "3. Data augmentation (for features): See \"Data derivers\"\n",
    "4. Data processing **(orderless except for data scaling)**: See \"Data processors\"\n",
    "    * Data augmentation (for data points)\n",
    "    * Data filtering\n",
    "    * Feature selection\n",
    "    * Categorical encoding\n",
    "    * Data scaling\n",
    "    * etc.\n",
    "5. Data augmentation (for features, especially multi-modal features and physics-informed features): See \"Data derivers\".\n",
    "\n",
    "In this part, we will introduce the usage of \"data splitters\", \"data imputers\", \"data processors\", and \"data derivers\". Implementing new functionalities is left as a section in \"Advanced Usage\"."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "outputs": [],
   "source": [
    "import torch\n",
    "from tabensemb.trainer import Trainer\n",
    "from tabensemb.model import *\n",
    "import tabensemb\n",
    "import os\n",
    "\n",
    "prefix = \"../../../../\"\n",
    "tabensemb.setting[\"default_output_path\"] = prefix + \"output\"\n",
    "tabensemb.setting[\"default_config_path\"] = prefix + \"configs\"\n",
    "tabensemb.setting[\"default_data_path\"] = prefix + \"data\"\n",
    "\n",
    "trainer = Trainer(device=\"cpu\")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Data splitters\n",
    "\n",
    "Data splitters are used to split the whole dataset into training, validation, and testing sets. They inherit `tabensemb.data.AbstractSplitter`, and implement `_split` (the main method) and `_next_cv` (to generate the next fold for a k-fold CV process).\n",
    "\n",
    "**Remark**: If `AbstractSplitter.support_cv=False`, the data splitter does not support k-fold CV.\n",
    "\n",
    "There are several ways to specify the used data splitter. These ways can also be used to specify other configurations.\n",
    "\n",
    "\n",
    "1. Modify the configuration file, `configs/sample.py` for example:\n",
    "\n",
    "```python\n",
    "cfg = {\n",
    "    \"data_splitter\": \"RandomSplitter\",\n",
    "    # Some other configurations...\n",
    "}\n",
    "```\n",
    "\n",
    "2. Use the `manual_config` argument of `Trainer.load_config`.\n",
    "\n",
    "```python\n",
    "trainer.load_config(\"sample\", manual_config={\"data_splitter\": \"RandomSplitter\"})\n",
    "```\n",
    "\n",
    "3. After `Trainer.load_config` is called and if one does not want to call it again, use `DataModule.set_data_splitter`."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The project will be saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample\n",
      "Dataset size: 178 39 39\n",
      "Data saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample (data.csv and tabular_data.csv).\n"
     ]
    }
   ],
   "source": [
    "trainer.load_config(\"sample\")\n",
    "trainer.datamodule.set_data_splitter(\"RandomSplitter\", ratio=[7, 1.5, 1.5])\n",
    "trainer.load_data()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
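  {
   "cell_type": "markdown",
   "source": [
    "To make the `ratio` argument concrete, the following is a minimal sketch of how a ratio such as `[7, 1.5, 1.5]` can be turned into three disjoint index sets. It is not the implementation of `RandomSplitter`: the helper `split_by_ratio`, the normalization by the sum of the ratio, and the rounding rule are assumptions made for illustration, so the resulting set sizes may differ slightly from those printed above.\n",
    "\n",
    "```python\n",
    "import random\n",
    "\n",
    "\n",
    "def split_by_ratio(n_samples, ratio, seed=0):\n",
    "    # Hypothetical helper, not part of tabensemb.\n",
    "    # Assumption: the ratio is normalized by its sum, so [7, 1.5, 1.5]\n",
    "    # and [0.7, 0.15, 0.15] describe the same split.\n",
    "    total = sum(ratio)\n",
    "    n_train = round(n_samples * ratio[0] / total)\n",
    "    n_val = round(n_samples * ratio[1] / total)\n",
    "    # Shuffle the indices reproducibly, then cut them into three parts.\n",
    "    indices = list(range(n_samples))\n",
    "    random.Random(seed).shuffle(indices)\n",
    "    train = indices[:n_train]\n",
    "    val = indices[n_train:n_train + n_val]\n",
    "    test = indices[n_train + n_val:]  # whatever is left goes to testing\n",
    "    return train, val, test\n",
    "\n",
    "\n",
    "train, val, test = split_by_ratio(256, [7, 1.5, 1.5])\n",
    "print(len(train), len(val), len(test))\n",
    "```\n",
    "\n",
    "Fixing the seed keeps the partition reproducible between runs."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },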
  {
   "cell_type": "markdown",
   "source": [
    "The `ratio` argument can also be given in the configuration file, `manual_config`, or `set_data_splitter` as:\n",
    "\n",
    "```python\n",
    "cfg = {\n",
    "    # This will overwrite the `split_ratio` configuration.\n",
    "    \"data_splitter\": [\"RandomSplitter\", {\"train_val_test\": [7, 1.5, 1.5]}],\n",
    "    # Some other configurations...\n",
    "}\n",
    "```\n",
    "\n",
    "```python\n",
    "trainer.load_config(\"sample\", manual_config={\"data_splitter\": [\"RandomSplitter\", {\"train_val_test\": [7, 1.5, 1.5]}]})\n",
    "```\n",
    "\n",
    "```python\n",
    "trainer.datamodule.set_data_splitter([\"RandomSplitter\", {\"train_val_test\": [7, 1.5, 1.5]}])\n",
    "```\n",
    "\n",
    "Available data splitters can be seen using:"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "outputs": [
    {
     "data": {
      "text/plain": "{'AbstractSplitter': tabensemb.data.base.AbstractSplitter,\n 'RandomSplitter': tabensemb.data.datasplitter.RandomSplitter}"
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from tabensemb.data.datasplitter import splitter_mapping\n",
    "splitter_mapping"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Data imputers\n",
    "\n",
    "Imputation is necessary when NaNs exist in the dataset. `tabensemb` provides several methods incorporating other packages like `miceforest` and `scikit-learn`. The configuration for an imputer contains two parts: the name of the imputer and its arguments. Data imputers can be set similarly to data splitters in the following ways:\n",
    "\n",
    "1. Modify the configuration file, `configs/sample.py` for example:\n",
    "\n",
    "```python\n",
    "cfg = {\n",
    "    \"data_imputer\": [\"MiceImputer\", {\"max_iter\": 10}],\n",
    "    # \"data_imputer\": \"MiceImputer\", (If no kwargs is given)\n",
    "    # Some other configurations...\n",
    "}\n",
    "```\n",
    "\n",
    "2. Use the `manual_config` argument of `Trainer.load_config`.\n",
    "\n",
    "```python\n",
    "trainer.load_config(\"sample\", manual_config={\"data_imputer\": [\"MiceImputer\", {\"max_iter\": 10}]})\n",
    "trainer.load_config(\"sample\", manual_config={\"data_imputer\": \"MiceImputer\"})\n",
    "```\n",
    "\n",
    "3. Use `DataModule.set_data_imputer`"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The project will be saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I1\n",
      "Dataset size: 153 51 52\n",
      "Data saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I1 (data.csv and tabular_data.csv).\n"
     ]
    }
   ],
   "source": [
    "trainer.load_config(\"sample\")\n",
    "trainer.datamodule.set_data_imputer([\"MiceImputer\", {\"max_iter\": 10}])\n",
    "trainer.load_data()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "outputs": [
    {
     "data": {
      "text/plain": "cont_0                False\ncont_1                False\ncont_2                False\ncont_3                False\ncont_4                False\ncont_5                False\ncont_6                False\ncont_7                False\ncont_8                False\ncont_9                False\ncat_0                 False\ncat_1                 False\ncat_2                 False\ncat_3                 False\ncat_4                 False\ncat_5                 False\ncat_6                 False\ncat_7                 False\ncat_8                 False\ncat_9                 False\ntarget                False\ntarget_binary         False\ntarget_multi_class    False\ndtype: bool"
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.df.isna().any()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Available data splitters can be seen using:"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "outputs": [
    {
     "data": {
      "text/plain": "{'AbstractImputer': tabensemb.data.base.AbstractImputer,\n 'AbstractSklearnImputer': tabensemb.data.base.AbstractSklearnImputer,\n 'GainImputer': tabensemb.data.dataimputer.GainImputer,\n 'MeanImputer': tabensemb.data.dataimputer.MeanImputer,\n 'MedianImputer': tabensemb.data.dataimputer.MedianImputer,\n 'MiceImputer': tabensemb.data.dataimputer.MiceImputer,\n 'MiceLightgbmImputer': tabensemb.data.dataimputer.MiceLightgbmImputer,\n 'MissForestImputer': tabensemb.data.dataimputer.MissForestImputer,\n 'ModeImputer': tabensemb.data.dataimputer.ModeImputer}"
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from tabensemb.data.dataimputer import imputer_mapping, get_data_imputer\n",
    "imputer_mapping"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Arguments can be seen in API docs or in docstrings"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "    Imputation with average values implemented using sklearn's SimpleImputer.\n",
      "\n",
      "    Parameters\n",
      "    ----------\n",
      "    **kwargs\n",
      "        Arguments for ``sklearn.impute.SimpleImputer`` (except for ``strategy``)\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "print(get_data_imputer(\"MeanImputer\").__doc__)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Data processors\n",
    "\n",
    "As listed in Step 4 above, data processing includes filtering, augmentation, feature selection, and much more. `tabensemb` provides a unified framework for implementing various data processing steps. The data imputation and processing procedure is quite similar to the `Pipeline` structure in `sklearn`, but is fully compatible with the other two modules introduced in this part and all four modules automatically do all preparations before training for the user.\n",
    "\n",
    "Configuration for a processor also contains two parts: the name of the processor and arguments. Here we provide several examples:\n",
    "\n",
    "* `CategoricalOrdinalEncoder`: similar to the `OrdinalEncoder` in `sklearn`\n",
    "* `NaNFeatureRemover`: remove features that are all NaNs\n",
    "* `VarianceFeatureSelector`: same as the `VarianceThreshold` from `sklearn`\n",
    "* `FeatureValueSelector`: select data points that have a certain value of a feature\n",
    "* `CorrFeatureSelector`: remove highly correlated features\n",
    "* `IQRRemover`: remove outliers found by the 1.5*IQR criteria\n",
    "* `StdRemover`: remove outliers found by 3*std criteria\n",
    "* `SampleDataAugmentor`: just an example to show the data augmentation capability (it copies the last two data points in the validation set)\n",
    "* `StandardScaler`: same as the `StandardScaler` from `sklearn`\n",
    "\n",
    "**Remark**: Data scalers like a `StandardScaler` must be the last data processor.\n",
    "\n",
    "**Remark**: Data points derived from data augmenters (inheriting `AbstractAugmenter`) will be added to the training set."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
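  {
   "cell_type": "markdown",
   "source": [
    "The 1.5*IQR and 3*std criteria mentioned above can be sketched for a single feature as follows. This is a minimal illustration using only the standard library, not the code of `IQRRemover` or `StdRemover`: the helper names `iqr_mask` and `std_mask`, the quartile method, and the use of the population standard deviation are assumptions.\n",
    "\n",
    "```python\n",
    "import statistics\n",
    "\n",
    "\n",
    "def iqr_mask(values, k=1.5):\n",
    "    # Hypothetical helper: True marks points inside [Q1 - k*IQR, Q3 + k*IQR].\n",
    "    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')\n",
    "    iqr = q3 - q1\n",
    "    return [q1 - k * iqr <= v <= q3 + k * iqr for v in values]\n",
    "\n",
    "\n",
    "def std_mask(values, k=3.0):\n",
    "    # Hypothetical helper: True marks points within k standard deviations of the mean.\n",
    "    mean = statistics.fmean(values)\n",
    "    std = statistics.pstdev(values)  # assumption: population standard deviation\n",
    "    return [abs(v - mean) <= k * std for v in values]\n",
    "\n",
    "\n",
    "# Twenty values of one feature; the last one is an obvious outlier.\n",
    "values = [0.8, 0.9, 1.0, 1.1, 1.2] * 4\n",
    "values[-1] = 25.0\n",
    "\n",
    "kept_by_iqr = [v for v, keep in zip(values, iqr_mask(values)) if keep]\n",
    "kept_by_std = [v for v, keep in zip(values, std_mask(values)) if keep]\n",
    "print(len(values), len(kept_by_iqr), len(kept_by_std))\n",
    "```\n",
    "\n",
    "The std criterion is sensitive to sample size: a single extreme value also inflates the standard deviation, so in very small samples it may not exceed the 3*std bound at all."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },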
  {
   "cell_type": "code",
   "execution_count": 8,
   "outputs": [],
   "source": [
    "processor_configs = [\n",
    "    [\"CategoricalOrdinalEncoder\", {}],\n",
    "    [\"NaNFeatureRemover\", {}],\n",
    "    [\"VarianceFeatureSelector\", {\"thres\": 0.1}],\n",
    "    [\"FeatureValueSelector\", {\"feature\": \"cat_1\", \"value\": 0}],\n",
    "    [\"CorrFeatureSelector\", {\"thres\": 0.1}],\n",
    "    [\"IQRRemover\", {}],\n",
    "    [\"StdRemover\", {}],\n",
    "    [\"SampleDataAugmenter\", {}],\n",
    "    [\"StandardScaler\", {}],\n",
    "]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "\n",
    "1. Modify the configuration file:\n",
    "\n",
    "```python\n",
    "cfg = {\n",
    "    \"data_processors\": processor_configs,\n",
    "    # Some other configurations...\n",
    "}\n",
    "```\n",
    "\n",
    "2. Use the `manual_config` argument of `Trainer.load_config`.\n",
    "\n",
    "```python\n",
    "trainer.load_config(\"sample\", manual_config={\"data_processors\": processor_configs})\n",
    "```\n",
    "\n",
    "3. Use `DataModule.set_data_processors`\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The project will be saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I2\n",
      "Correlated features (Ranked by SHAP):\n",
      "{\n",
      "\t'cont_2': 13.650428051938668,\n",
      "\t'cont_1': 8.98106859262871\n",
      "}\n",
      "1 features removed: ['cont_1']. 7 features retained: ['cont_0', 'cont_3', 'cont_4', 'cont_2', 'cat_0', 'cat_1', 'cat_2'].\n",
      "Removing outliers by IQR. Original size: 36, Final size: 36.\n",
      "Removing outliers by std. Original size: 36, Final size: 36.\n",
      "Dataset size: 27 11 12\n",
      "Data saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I2 (data.csv and tabular_data.csv).\n"
     ]
    }
   ],
   "source": [
    "import warnings\n",
    "import numba\n",
    "trainer.load_config(\"sample\")\n",
    "trainer.datamodule.set_data_processors(processor_configs)\n",
    "with warnings.catch_warnings():\n",
    "    warnings.filterwarnings(\"ignore\", category=numba.NumbaDeprecationWarning)\n",
    "    trainer.load_data()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's check the effectiveness of these processors. Categorical features are encoded by `CategoricalOrdinalEncoder`:"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "outputs": [
    {
     "data": {
      "text/plain": "   cat_0  cat_1  cat_2\n0      3      0      2\n1      3      0      1\n2      3      0      4\n3      0      0      0\n4      4      0      2",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>cat_0</th>\n      <th>cat_1</th>\n      <th>cat_2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>3</td>\n      <td>0</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>3</td>\n      <td>0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3</td>\n      <td>0</td>\n      <td>4</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>4</td>\n      <td>0</td>\n      <td>2</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.datamodule.categorical_data.head()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The original categorical features can be accessed using"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "outputs": [
    {
     "data": {
      "text/plain": "        cat_0  cat_1  cat_2\n0  category_3      0      2\n1  category_3      0      1\n2  category_3      0      4\n3  category_0      0      0\n4  category_4      0      2",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>cat_0</th>\n      <th>cat_1</th>\n      <th>cat_2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>category_3</td>\n      <td>0</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>category_3</td>\n      <td>0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>category_3</td>\n      <td>0</td>\n      <td>4</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>category_0</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>category_4</td>\n      <td>0</td>\n      <td>2</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.datamodule.categories_inverse_transform(trainer.datamodule.categorical_data).head()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
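  {
   "cell_type": "markdown",
   "source": [
    "The idea behind ordinal encoding and its inverse transform can be sketched in a few lines. This is an illustration only, not the code of `CategoricalOrdinalEncoder`: the helpers `fit_ordinal`, `transform`, and `inverse_transform` are hypothetical, and assigning codes in sorted order is an assumption. The integer codes depend on the categories seen when fitting, so they need not match the codes shown above.\n",
    "\n",
    "```python\n",
    "def fit_ordinal(column):\n",
    "    # Hypothetical helper: map each distinct category to an integer code.\n",
    "    # Assumption: codes are assigned in sorted order of the categories.\n",
    "    categories = sorted(set(column))\n",
    "    return {cat: code for code, cat in enumerate(categories)}\n",
    "\n",
    "\n",
    "def transform(column, mapping):\n",
    "    # Replace every category by its integer code.\n",
    "    return [mapping[value] for value in column]\n",
    "\n",
    "\n",
    "def inverse_transform(codes, mapping):\n",
    "    # Invert the dictionary to recover the original labels.\n",
    "    inverse = {code: cat for cat, code in mapping.items()}\n",
    "    return [inverse[code] for code in codes]\n",
    "\n",
    "\n",
    "cat_0 = ['category_3', 'category_3', 'category_3', 'category_0', 'category_4']\n",
    "mapping = fit_ordinal(cat_0)\n",
    "codes = transform(cat_0, mapping)\n",
    "restored = inverse_transform(codes, mapping)\n",
    "print(codes)\n",
    "print(restored == cat_0)\n",
    "```\n",
    "\n",
    "Because the mapping is one-to-one, the round trip through `transform` and `inverse_transform` returns the original labels."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },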
  {
   "cell_type": "markdown",
   "source": [
    "One feature is removed by `CorrFeatureSelector`. It removes the feature with the lowest feature importance (ranked using `shap` in the example) in the correlation chain."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "outputs": [
    {
     "data": {
      "text/plain": "['cont_0', 'cont_2', 'cont_3', 'cont_4']"
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.cont_feature_names"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The specific `cat_1` feature value is selected by the `FeatureValueSelector`. Some outliers are removed by the `IQRRemover`. Original indices of the removed data points can be seen using"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "outputs": [
    {
     "data": {
      "text/plain": "array([  0,   1,   2,   3,   4,   5,   8,   9,  10,  11,  12,  13,  14,\n        15,  18,  19,  20,  22,  24,  25,  26,  27,  28,  29,  30,  31,\n        32,  35,  37,  38,  39,  40,  41,  42,  43,  45,  46,  48,  49,\n        50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,\n        63,  64,  65,  66,  67,  68,  69,  70,  72,  74,  75,  76,  77,\n        80,  81,  82,  83,  85,  86,  89,  90,  93,  94,  96,  97,  98,\n        99, 100, 102, 103, 104, 105, 106, 107, 109, 110, 111, 112, 113,\n       114, 115, 116, 118, 119, 120, 121, 123, 124, 125, 127, 130, 131,\n       132, 133, 135, 136, 137, 139, 140, 141, 142, 143, 144, 145, 146,\n       147, 148, 150, 151, 152, 153, 157, 158, 159, 161, 163, 165, 166,\n       167, 168, 170, 172, 173, 175, 176, 177, 178, 179, 180, 182, 183,\n       184, 185, 186, 187, 188, 189, 190, 192, 193, 194, 195, 196, 197,\n       199, 200, 201, 202, 203, 207, 208, 209, 210, 211, 212, 213, 214,\n       217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229,\n       230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242,\n       243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255])"
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.datamodule.dropped_indices"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The `SampleDataAugmentor` copies the last two data points in the validation set as a showcase. `DataModule.augmented_indices` represents the indices of these data points before dropping `DataModule.dropped_indices`. We can see augmented data points using"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "outputs": [
    {
     "data": {
      "text/plain": "      cont_0    cont_1    cont_2    cont_3    cont_4    cont_5    cont_6  \\\n48 -0.505358 -0.104343 -0.507518 -0.988002 -0.815792 -1.284552 -1.051880   \n49 -2.115056  0.138315  1.618054  0.541008  1.405365 -1.449118 -0.824409   \n\n      cont_7   cont_8    cont_9  ...  cat_3  cat_4       cat_5  cat_6  cat_7  \\\n48  0.564009  2.49720 -2.245322  ...      4      4  category_4      3      1   \n49 -0.813794  0.42258  0.547481  ...      0      0  category_2      4      3   \n\n   cat_8  cat_9      target  target_binary  target_multi_class  \n48     1      2 -246.101543              1                   3  \n49     3      1 -156.813059              0                   3  \n\n[2 rows x 23 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>cont_0</th>\n      <th>cont_1</th>\n      <th>cont_2</th>\n      <th>cont_3</th>\n      <th>cont_4</th>\n      <th>cont_5</th>\n      <th>cont_6</th>\n      <th>cont_7</th>\n      <th>cont_8</th>\n      <th>cont_9</th>\n      <th>...</th>\n      <th>cat_3</th>\n      <th>cat_4</th>\n      <th>cat_5</th>\n      <th>cat_6</th>\n      <th>cat_7</th>\n      <th>cat_8</th>\n      <th>cat_9</th>\n      <th>target</th>\n      <th>target_binary</th>\n      <th>target_multi_class</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>48</th>\n      <td>-0.505358</td>\n      <td>-0.104343</td>\n      <td>-0.507518</td>\n      <td>-0.988002</td>\n      <td>-0.815792</td>\n      <td>-1.284552</td>\n      <td>-1.051880</td>\n      <td>0.564009</td>\n      <td>2.49720</td>\n      <td>-2.245322</td>\n      <td>...</td>\n      <td>4</td>\n      <td>4</td>\n      <td>category_4</td>\n      <td>3</td>\n      <td>1</td>\n      <td>1</td>\n      <td>2</td>\n      <td>-246.101543</td>\n      <td>1</td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <th>49</th>\n      <td>-2.115056</td>\n      <td>0.138315</td>\n      <td>1.618054</td>\n      <td>0.541008</td>\n      <td>1.405365</td>\n      <td>-1.449118</td>\n      <td>-0.824409</td>\n      <td>-0.813794</td>\n      <td>0.42258</td>\n      <td>0.547481</td>\n      <td>...</td>\n      <td>0</td>\n      <td>0</td>\n      <td>category_2</td>\n      <td>4</td>\n      <td>3</td>\n      <td>3</td>\n      <td>1</td>\n      <td>-156.813059</td>\n      <td>0</td>\n      <td>3</td>\n    </tr>\n  </tbody>\n</table>\n<p>2 rows × 23 columns</p>\n</div>"
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.df.loc[trainer.datamodule.augmented_indices-len(trainer.datamodule.dropped_indices), :]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "outputs": [
    {
     "data": {
      "text/plain": "      cont_0    cont_1    cont_2    cont_3    cont_4    cont_5    cont_6  \\\n16 -0.505358 -0.104343 -0.507518 -0.988002 -0.815792 -1.284552 -1.051880   \n0  -2.115056  0.138315  1.618054  0.541008  1.405365 -1.449118 -0.824409   \n\n      cont_7   cont_8    cont_9  ...  cat_3  cat_4       cat_5  cat_6  cat_7  \\\n16  0.564009  2.49720 -2.245322  ...      4      4  category_4      3      1   \n0  -0.813794  0.42258  0.547481  ...      0      0  category_2      4      3   \n\n   cat_8  cat_9      target  target_binary  target_multi_class  \n16     1      2 -246.101543              1                   3  \n0      3      1 -156.813059              0                   3  \n\n[2 rows x 23 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>cont_0</th>\n      <th>cont_1</th>\n      <th>cont_2</th>\n      <th>cont_3</th>\n      <th>cont_4</th>\n      <th>cont_5</th>\n      <th>cont_6</th>\n      <th>cont_7</th>\n      <th>cont_8</th>\n      <th>cont_9</th>\n      <th>...</th>\n      <th>cat_3</th>\n      <th>cat_4</th>\n      <th>cat_5</th>\n      <th>cat_6</th>\n      <th>cat_7</th>\n      <th>cat_8</th>\n      <th>cat_9</th>\n      <th>target</th>\n      <th>target_binary</th>\n      <th>target_multi_class</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>16</th>\n      <td>-0.505358</td>\n      <td>-0.104343</td>\n      <td>-0.507518</td>\n      <td>-0.988002</td>\n      <td>-0.815792</td>\n      <td>-1.284552</td>\n      <td>-1.051880</td>\n      <td>0.564009</td>\n      <td>2.49720</td>\n      <td>-2.245322</td>\n      <td>...</td>\n      <td>4</td>\n      <td>4</td>\n      <td>category_4</td>\n      <td>3</td>\n      <td>1</td>\n      <td>1</td>\n      <td>2</td>\n      <td>-246.101543</td>\n      <td>1</td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <th>0</th>\n      <td>-2.115056</td>\n      <td>0.138315</td>\n      <td>1.618054</td>\n      <td>0.541008</td>\n      <td>1.405365</td>\n      <td>-1.449118</td>\n      <td>-0.824409</td>\n      <td>-0.813794</td>\n      <td>0.42258</td>\n      <td>0.547481</td>\n      <td>...</td>\n      <td>0</td>\n      <td>0</td>\n      <td>category_2</td>\n      <td>4</td>\n      <td>3</td>\n      <td>3</td>\n      <td>1</td>\n      <td>-156.813059</td>\n      <td>0</td>\n      <td>3</td>\n    </tr>\n  </tbody>\n</table>\n<p>2 rows × 23 columns</p>\n</div>"
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.df.loc[trainer.datamodule.val_indices[-2:], :]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Finally, `StandardScaler` scales the dataset. `DataModule.df` is the unscaled data frame, and `scaled_df` is the scaled one."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "outputs": [
    {
     "data": {
      "text/plain": "          cont_0     cont_2     cont_3     cont_4\ncount  50.000000  50.000000  50.000000  50.000000\nmean   -0.153242  -0.086822   0.083808   0.194227\nstd     0.961960   0.947694   0.762640   1.026297\nmin    -2.115056  -1.945703  -1.665956  -1.884586\n25%    -0.666141  -0.809715  -0.437357  -0.697263\n50%    -0.087749   0.082401  -0.059459   0.246986\n75%     0.329382   0.765257   0.541008   1.110425\nmax     2.929096   1.618054   1.576299   2.285601",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>cont_0</th>\n      <th>cont_2</th>\n      <th>cont_3</th>\n      <th>cont_4</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>count</th>\n      <td>50.000000</td>\n      <td>50.000000</td>\n      <td>50.000000</td>\n      <td>50.000000</td>\n    </tr>\n    <tr>\n      <th>mean</th>\n      <td>-0.153242</td>\n      <td>-0.086822</td>\n      <td>0.083808</td>\n      <td>0.194227</td>\n    </tr>\n    <tr>\n      <th>std</th>\n      <td>0.961960</td>\n      <td>0.947694</td>\n      <td>0.762640</td>\n      <td>1.026297</td>\n    </tr>\n    <tr>\n      <th>min</th>\n      <td>-2.115056</td>\n      <td>-1.945703</td>\n      <td>-1.665956</td>\n      <td>-1.884586</td>\n    </tr>\n    <tr>\n      <th>25%</th>\n      <td>-0.666141</td>\n      <td>-0.809715</td>\n      <td>-0.437357</td>\n      <td>-0.697263</td>\n    </tr>\n    <tr>\n      <th>50%</th>\n      <td>-0.087749</td>\n      <td>0.082401</td>\n      <td>-0.059459</td>\n      <td>0.246986</td>\n    </tr>\n    <tr>\n      <th>75%</th>\n      <td>0.329382</td>\n      <td>0.765257</td>\n      <td>0.541008</td>\n      <td>1.110425</td>\n    </tr>\n    <tr>\n      <th>max</th>\n      <td>2.929096</td>\n      <td>1.618054</td>\n      <td>1.576299</td>\n      <td>2.285601</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.datamodule.df[trainer.cont_feature_names].describe()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "outputs": [
    {
     "data": {
      "text/plain": "          cont_0     cont_2     cont_3     cont_4\ncount  50.000000  50.000000  50.000000  50.000000\nmean    0.069544   0.049781   0.002835   0.034550\nstd     1.072721   0.975998   0.971265   1.002871\nmin    -2.118156  -1.864617  -2.225588  -1.996813\n25%    -0.502411  -0.694703  -0.660898  -0.836591\n50%     0.142578   0.224057  -0.179623   0.086104\n75%     0.607738   0.927307   0.585106   0.929836\nmax     3.506786   1.805574   1.903607   2.078188",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>cont_0</th>\n      <th>cont_2</th>\n      <th>cont_3</th>\n      <th>cont_4</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>count</th>\n      <td>50.000000</td>\n      <td>50.000000</td>\n      <td>50.000000</td>\n      <td>50.000000</td>\n    </tr>\n    <tr>\n      <th>mean</th>\n      <td>0.069544</td>\n      <td>0.049781</td>\n      <td>0.002835</td>\n      <td>0.034550</td>\n    </tr>\n    <tr>\n      <th>std</th>\n      <td>1.072721</td>\n      <td>0.975998</td>\n      <td>0.971265</td>\n      <td>1.002871</td>\n    </tr>\n    <tr>\n      <th>min</th>\n      <td>-2.118156</td>\n      <td>-1.864617</td>\n      <td>-2.225588</td>\n      <td>-1.996813</td>\n    </tr>\n    <tr>\n      <th>25%</th>\n      <td>-0.502411</td>\n      <td>-0.694703</td>\n      <td>-0.660898</td>\n      <td>-0.836591</td>\n    </tr>\n    <tr>\n      <th>50%</th>\n      <td>0.142578</td>\n      <td>0.224057</td>\n      <td>-0.179623</td>\n      <td>0.086104</td>\n    </tr>\n    <tr>\n      <th>75%</th>\n      <td>0.607738</td>\n      <td>0.927307</td>\n      <td>0.585106</td>\n      <td>0.929836</td>\n    </tr>\n    <tr>\n      <th>max</th>\n      <td>3.506786</td>\n      <td>1.805574</td>\n      <td>1.903607</td>\n      <td>2.078188</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.datamodule.scaled_df[trainer.cont_feature_names].describe()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "**Remark**: All modules are fitted on training and validation sets and transform the testing set."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "outputs": [
    {
     "data": {
      "text/plain": "             cont_0        cont_2        cont_3        cont_4\ncount  3.800000e+01  3.800000e+01  3.800000e+01  3.800000e+01\nmean   4.674623e-17  5.843279e-18 -4.090295e-17 -2.921640e-17\nstd    1.013423e+00  1.013423e+00  1.013423e+00  1.013423e+00\nmin   -2.118156e+00 -1.864617e+00 -2.225588e+00 -1.996813e+00\n25%   -5.160307e-01 -6.772407e-01 -7.744541e-01 -8.633614e-01\n50%    1.425780e-01  2.926057e-02 -5.491499e-02  6.457453e-02\n75%    6.018802e-01  8.600076e-01  5.851057e-01  9.298360e-01\nmax    1.814086e+00  1.805574e+00  1.903607e+00  2.005587e+00",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>cont_0</th>\n      <th>cont_2</th>\n      <th>cont_3</th>\n      <th>cont_4</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>count</th>\n      <td>3.800000e+01</td>\n      <td>3.800000e+01</td>\n      <td>3.800000e+01</td>\n      <td>3.800000e+01</td>\n    </tr>\n    <tr>\n      <th>mean</th>\n      <td>4.674623e-17</td>\n      <td>5.843279e-18</td>\n      <td>-4.090295e-17</td>\n      <td>-2.921640e-17</td>\n    </tr>\n    <tr>\n      <th>std</th>\n      <td>1.013423e+00</td>\n      <td>1.013423e+00</td>\n      <td>1.013423e+00</td>\n      <td>1.013423e+00</td>\n    </tr>\n    <tr>\n      <th>min</th>\n      <td>-2.118156e+00</td>\n      <td>-1.864617e+00</td>\n      <td>-2.225588e+00</td>\n      <td>-1.996813e+00</td>\n    </tr>\n    <tr>\n      <th>25%</th>\n      <td>-5.160307e-01</td>\n      <td>-6.772407e-01</td>\n      <td>-7.744541e-01</td>\n      <td>-8.633614e-01</td>\n    </tr>\n    <tr>\n      <th>50%</th>\n      <td>1.425780e-01</td>\n      <td>2.926057e-02</td>\n      <td>-5.491499e-02</td>\n      <td>6.457453e-02</td>\n    </tr>\n    <tr>\n      <th>75%</th>\n      <td>6.018802e-01</td>\n      <td>8.600076e-01</td>\n      <td>5.851057e-01</td>\n      <td>9.298360e-01</td>\n    </tr>\n    <tr>\n      <th>max</th>\n      <td>1.814086e+00</td>\n      <td>1.805574e+00</td>\n      <td>1.903607e+00</td>\n      <td>2.005587e+00</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "trainer.datamodule.scaled_df.loc[np.append(trainer.train_indices, trainer.val_indices), trainer.cont_feature_names].describe()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Data derivers\n",
    "\n",
    "Existing features in the dataset may not be sufficient to represent the inner relations between features and the target. Extending more features that can be strongly correlated with the target using existing ones can be helpful. Data derivers can be used to extend continuous features (stacked in the tabular dataset, Step 3 above) or **multi-modal** features (unstacked, Step 5 above).\n",
    "\n",
    "Configurations are similar. Necessary and shared arguments are:\n",
    "\n",
    "* `stacked`: Should the derived feature stack in the processed `DataFrame`?\n",
    "* `intermediate`: Is the derived `stacked` feature excluded from continuous features?\n",
    "* `derived_name`: What is the name of the feature?\n",
    "\n",
    "Here we give three examples:\n",
    "\n",
    "* `RelativeDeriver` calculates the result of dividing `absolute_col` by `relative2_col`;\n",
    "* `SampleWeightDeriver` calculates the degree to which a data point is an outlier (it is just an example and there isn't detailed research on it);\n",
    "* `UnscaledDataDeriver` records all continuous features before scaling (standard scaling by default)."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "outputs": [],
   "source": [
    "deriver_configs = [\n",
    "    (\"RelativeDeriver\", {\n",
    "        \"stacked\": True,\n",
    "        \"absolute_col\": \"cont_0\",\n",
    "        \"relative2_col\": \"cont_1\",\n",
    "        \"intermediate\": False,\n",
    "        \"derived_name\": \"derived_cont\",\n",
    "    }),\n",
    "    (\"SampleWeightDeriver\", {\n",
    "        \"stacked\": True,\n",
    "        \"intermediate\": True,\n",
    "        \"derived_name\": \"sample_weight\",\n",
    "    }),\n",
    "    (\"UnscaledDataDeriver\", {\"derived_name\": \"unscaled\", \"stacked\": False}),\n",
    "]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "1. Modify the configuration file:\n",
    "\n",
    "```python\n",
    "cfg = {\n",
    "    \"data_derivers\": deriver_configs,\n",
    "    # Some other configurations...\n",
    "}\n",
    "```\n",
    "\n",
    "2. Use the `manual_config` argument of `Trainer.load_config`.\n",
    "\n",
    "```python\n",
    "trainer.load_config(\"sample\", manual_config={\"data_derivers\": deriver_configs})\n",
    "```\n",
    "\n",
    "3. Use `DataModule.set_data_derivers`"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The project will be saved to ../../../../output/sample/2023-09-23-20-36-53-0_sample\n",
      "Dataset size: 153 51 52\n",
      "Data saved to ../../../../output/sample/2023-09-23-20-36-53-0_sample (data.csv and tabular_data.csv).\n"
     ]
    }
   ],
   "source": [
    "trainer.load_config(\"sample\")\n",
    "trainer.datamodule.set_data_derivers(deriver_configs)\n",
    "trainer.load_data()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Two `stacked` features can be found in `Trainer.df` or `Trainer.datamodule.df`. `derived_cont` is a continuous feature because `intermediate=False`, but `sample_weight` is not."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "outputs": [
    {
     "data": {
      "text/plain": "     derived_cont  sample_weight\n0        5.884222       1.045746\n1       -9.058123       1.063506\n2       -3.650394       0.959582\n3        1.678893       0.974096\n4       -4.099185       1.000761\n..            ...            ...\n251     -1.355422       0.958380\n252      1.088160       0.978138\n253      0.374183       0.969419\n254      1.199032       0.967882\n255     -1.521539       0.979158\n\n[256 rows x 2 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>derived_cont</th>\n      <th>sample_weight</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>5.884222</td>\n      <td>1.045746</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>-9.058123</td>\n      <td>1.063506</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>-3.650394</td>\n      <td>0.959582</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>1.678893</td>\n      <td>0.974096</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>-4.099185</td>\n      <td>1.000761</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>251</th>\n      <td>-1.355422</td>\n      <td>0.958380</td>\n    </tr>\n    <tr>\n      <th>252</th>\n      <td>1.088160</td>\n      <td>0.978138</td>\n    </tr>\n    <tr>\n      <th>253</th>\n      <td>0.374183</td>\n      <td>0.969419</td>\n    </tr>\n    <tr>\n      <th>254</th>\n      <td>1.199032</td>\n      <td>0.967882</td>\n    </tr>\n    <tr>\n      <th>255</th>\n      <td>-1.521539</td>\n      <td>0.979158</td>\n    </tr>\n  </tbody>\n</table>\n<p>256 rows × 2 columns</p>\n</div>"
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.df[[\"derived_cont\", \"sample_weight\"]]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "outputs": [
    {
     "data": {
      "text/plain": "(True, False)"
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"derived_cont\" in trainer.cont_feature_names, \"sample_weight\" in trainer.cont_feature_names"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The unstacked feature `unscaled` can be found in `Trainer.derived_data`"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "outputs": [
    {
     "data": {
      "text/plain": "array([[-1.30652685, -0.22203901, -0.11816405, -0.15957344,  1.65813068,\n         5.88422203],\n       [ 2.01125668, -0.22203901,  0.1950697 ,  0.52700421, -0.04459543,\n        -9.05812296],\n       [-1.21607658,  0.33313567, -0.74367217,  0.73018353,  0.1406719 ,\n        -3.65039438],\n       ...,\n       [-0.0698565 , -0.18669092, -1.02191332, -1.1436414 ,  0.25011391,\n         0.37418262],\n       [-1.03148246, -0.86026245, -0.06163805,  0.3283013 , -1.42999126,\n         1.19903229],\n       [-1.46173269,  0.9606934 ,  0.3675449 ,  1.32906285, -0.68343977,\n        -1.52153922]])"
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.derived_data[\"unscaled\"]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Available derivers can be seen by"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "outputs": [
    {
     "data": {
      "text/plain": "{'AbstractDeriver': tabensemb.data.base.AbstractDeriver,\n 'RelativeDeriver': tabensemb.data.dataderiver.RelativeDeriver,\n 'SampleWeightDeriver': tabensemb.data.dataderiver.SampleWeightDeriver,\n 'UnscaledDataDeriver': tabensemb.data.dataderiver.UnscaledDataDeriver}"
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from tabensemb.data.dataderiver import deriver_mapping, get_data_deriver\n",
    "deriver_mapping"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Arguments can be found in API docs or in the docstring."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "    Dividing a feature by another to derive a new feature. Required arguments are:\n",
      "\n",
      "    absolute_col: str\n",
      "        The feature that needs to be divided.\n",
      "    relative2_col: str\n",
      "        The feature that acts as the denominator.\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "print(get_data_deriver(\"RelativeDeriver\").__doc__)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Access the processed dataset"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "All these data can be found in the `DataModule` instance in the trainer, along with many modified data structures for further usage:\n",
    "\n",
    "* Continuous features\n",
    "    * `DataModule.feature_data`: scaled\n",
    "    * `DataModule.unscaled_feature_data`: not scaled\n",
    "    * `DataModule.X_train/X_val/X_test[trainer.cont_feature_names]`: scaled and divided into three partitions\n",
    "    * `DataModule.tensors[0]`: scaled and transformed into torch.Tensor.\n",
    "* Categorical features\n",
    "    * `DataModule.categorical_data`: ordinal-encoded\n",
    "    * `DataModule.X_train/X_val/X_test[trainer.cat_feature_names]`: ordinal-encoded and divided into three partitions\n",
    "    * `DataModule.derived_data[\"categorical\"]`: ordinal-encoded\n",
    "    * `trainer.datamodule.tensors[list(trainer.datamodule.derived_data.keys()).index(\"categorical\")+1]`: ordinal-encoded and transformed into torch.Tensor.\n",
    "* Derived unstacked features\n",
    "    * `DataModule.derived_data`: include unstacked features, categorical features, and the signal for each data point representing whether it is an augmented one.\n",
    "    * `DataModule.tensors[1:-1]`: same as `DataModule.derived_data`, but are `torch.Tensor`s.\n",
    "\n",
    "**Remark**: Currently, derived unstacked features are not used in the supported external model bases. But it can be easily accessed using the above approaches, even easier for a customized `PyTorch`-based model base class `TorchModel`, which will be introduced in the \"Advanced Usage\" sections.\n",
    "\n",
    "**Remark**: Stacked (continuous) derived features are derived after imputation but before data processing. These features will also be imputed. Unstacked derived features are derived after all other steps are finished.\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}