{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Using data functionalities\n",
"\n",
"Running `Trainer.load_data` or `DataModule.load_data` will process the dataset in the following order:\n",
"\n",
"1. Data splitting (training/validation/testing sets): See \"Data splitters\"\n",
"2. Data imputation: See \"Data imputers\"\n",
"3. Data augmentation (for features): See \"Data derivers\"\n",
"4. Data processing **(orderless except for data scaling)**: See \"Data processors\"\n",
" * Data augmentation (for data points)\n",
" * Data filtering\n",
" * Feature selection\n",
" * Categorical encoding\n",
" * Data scaling\n",
" * etc.\n",
"5. Data augmentation (for features, especially multi-modal features and physics-informed features): See \"Data derivers\".\n",
"\n",
"In this part, we will introduce the usage of \"data splitters\", \"data imputers\", \"data processors\", and \"data derivers\". Implementing new functionalities is left as a section in \"Advanced Usage\"."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 1,
"outputs": [],
"source": [
"import torch\n",
"from tabensemb.trainer import Trainer\n",
"from tabensemb.model import *\n",
"import tabensemb\n",
"import os\n",
"\n",
"prefix = \"../../../../\"\n",
"tabensemb.setting[\"default_output_path\"] = prefix + \"output\"\n",
"tabensemb.setting[\"default_config_path\"] = prefix + \"configs\"\n",
"tabensemb.setting[\"default_data_path\"] = prefix + \"data\"\n",
"\n",
"trainer = Trainer(device=\"cpu\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Data splitters\n",
"\n",
"Data splitters are used to split the whole dataset into training, validation, and testing sets. They inherit `tabensemb.data.AbstractSplitter`, and implement `_split` (the main method) and `_next_cv` (to generate the next fold for a k-fold CV process).\n",
"\n",
"**Remark**: If `AbstractSplitter.support_cv=False`, the data splitter does not support k-fold CV.\n",
"\n",
"There are several ways to specify the used data splitter. These ways can also be used to specify other configurations.\n",
"\n",
"\n",
"1. Modify the configuration file, `configs/sample.py` for example:\n",
"\n",
"```python\n",
"cfg = {\n",
" \"data_splitter\": \"RandomSplitter\",\n",
" # Some other configurations...\n",
"}\n",
"```\n",
"\n",
"2. Use the `manual_config` argument of `Trainer.load_config`.\n",
"\n",
"```python\n",
"trainer.load_config(\"sample\", manual_config={\"data_splitter\": \"RandomSplitter\"})\n",
"```\n",
"\n",
"3. After `Trainer.load_config` is called and if one does not want to call it again, use `DataModule.set_data_splitter`."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 2,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The project will be saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample\n",
"Dataset size: 178 39 39\n",
"Data saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample (data.csv and tabular_data.csv).\n"
]
}
],
"source": [
"trainer.load_config(\"sample\")\n",
"trainer.datamodule.set_data_splitter(\"RandomSplitter\", ratio=[7, 1.5, 1.5])\n",
"trainer.load_data()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"The `ratio` argument can also be given in the configuration file, `manual_config`, or `set_data_splitter` as:\n",
"\n",
"```python\n",
"cfg = {\n",
" # This will overwrite the `split_ratio` configuration.\n",
" \"data_splitter\": [\"RandomSplitter\", {\"train_val_test\": [7, 1.5, 1.5]}],\n",
" # Some other configurations...\n",
"}\n",
"```\n",
"\n",
"```python\n",
"trainer.load_config(\"sample\", manual_config={\"data_splitter\": [\"RandomSplitter\", {\"train_val_test\": [7, 1.5, 1.5]}]})\n",
"```\n",
"\n",
"```python\n",
"trainer.datamodule.set_data_splitter([\"RandomSplitter\", {\"train_val_test\": [7, 1.5, 1.5]}])\n",
"```\n",
"\n",
"Available data splitters can be seen using:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 3,
"outputs": [
{
"data": {
"text/plain": "{'AbstractSplitter': tabensemb.data.base.AbstractSplitter,\n 'RandomSplitter': tabensemb.data.datasplitter.RandomSplitter}"
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tabensemb.data.datasplitter import splitter_mapping\n",
"splitter_mapping"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
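{
"cell_type": "markdown",
"source": [
"The sizes printed above for `ratio=[7, 1.5, 1.5]` (178/39/39 out of 256 points) suggest that the ratios are normalized by their sum, so they do not need to add up to one. The following standard-library sketch illustrates such a ratio-based random split. It is only an illustration under that assumption: `split_by_ratio` is a hypothetical helper, not the implementation of `RandomSplitter`, and its rounding may differ from the library's.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"\n",
"def split_by_ratio(n_points, ratio, seed=0):\n",
"    # Hypothetical helper: shuffle the indices, then cut them by normalized ratios.\n",
"    total = sum(ratio)\n",
"    n_train = round(n_points * ratio[0] / total)\n",
"    n_val = round(n_points * ratio[1] / total)\n",
"    indices = list(range(n_points))\n",
"    random.Random(seed).shuffle(indices)\n",
"    train = indices[:n_train]\n",
"    val = indices[n_train:n_train + n_val]\n",
"    test = indices[n_train + n_val:]  # the remainder goes to the testing set\n",
"    return train, val, test\n",
"\n",
"\n",
"train, val, test = split_by_ratio(256, [7, 1.5, 1.5])\n",
"print(len(train), len(val), len(test))\n",
"```\n",
"\n",
"Each index lands in exactly one of the three partitions, and a fixed `seed` makes the split reproducible."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},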
{
"cell_type": "markdown",
"source": [
"## Data imputers\n",
"\n",
"Imputation is necessary when NaNs exist in the dataset. `tabensemb` provides several methods incorporating other packages like `miceforest` and `scikit-learn`. The configuration for an imputer contains two parts: the name of the imputer and its arguments. Data imputers can be set similarly to data splitters in the following ways:\n",
"\n",
"1. Modify the configuration file, `configs/sample.py` for example:\n",
"\n",
"```python\n",
"cfg = {\n",
" \"data_imputer\": [\"MiceImputer\", {\"max_iter\": 10}],\n",
" # \"data_imputer\": \"MiceImputer\", (If no kwargs is given)\n",
" # Some other configurations...\n",
"}\n",
"```\n",
"\n",
"2. Use the `manual_config` argument of `Trainer.load_config`.\n",
"\n",
"```python\n",
"trainer.load_config(\"sample\", manual_config={\"data_imputer\": [\"MiceImputer\", {\"max_iter\": 10}]})\n",
"trainer.load_config(\"sample\", manual_config={\"data_imputer\": \"MiceImputer\"})\n",
"```\n",
"\n",
"3. Use `DataModule.set_data_imputer`"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 4,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The project will be saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I1\n",
"Dataset size: 153 51 52\n",
"Data saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I1 (data.csv and tabular_data.csv).\n"
]
}
],
"source": [
"trainer.load_config(\"sample\")\n",
"trainer.datamodule.set_data_imputer([\"MiceImputer\", {\"max_iter\": 10}])\n",
"trainer.load_data()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": 5,
"outputs": [
{
"data": {
"text/plain": "cont_0 False\ncont_1 False\ncont_2 False\ncont_3 False\ncont_4 False\ncont_5 False\ncont_6 False\ncont_7 False\ncont_8 False\ncont_9 False\ncat_0 False\ncat_1 False\ncat_2 False\ncat_3 False\ncat_4 False\ncat_5 False\ncat_6 False\ncat_7 False\ncat_8 False\ncat_9 False\ntarget False\ntarget_binary False\ntarget_multi_class False\ndtype: bool"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.df.isna().any()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Available data splitters can be seen using:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 6,
"outputs": [
{
"data": {
"text/plain": "{'AbstractImputer': tabensemb.data.base.AbstractImputer,\n 'AbstractSklearnImputer': tabensemb.data.base.AbstractSklearnImputer,\n 'GainImputer': tabensemb.data.dataimputer.GainImputer,\n 'MeanImputer': tabensemb.data.dataimputer.MeanImputer,\n 'MedianImputer': tabensemb.data.dataimputer.MedianImputer,\n 'MiceImputer': tabensemb.data.dataimputer.MiceImputer,\n 'MiceLightgbmImputer': tabensemb.data.dataimputer.MiceLightgbmImputer,\n 'MissForestImputer': tabensemb.data.dataimputer.MissForestImputer,\n 'ModeImputer': tabensemb.data.dataimputer.ModeImputer}"
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tabensemb.data.dataimputer import imputer_mapping, get_data_imputer\n",
"imputer_mapping"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Arguments can be seen in API docs or in docstrings"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 7,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Imputation with average values implemented using sklearn's SimpleImputer.\n",
"\n",
" Parameters\n",
" ----------\n",
" **kwargs\n",
" Arguments for ``sklearn.impute.SimpleImputer`` (except for ``strategy``)\n",
" \n"
]
}
],
"source": [
"print(get_data_imputer(\"MeanImputer\").__doc__)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Data processors\n",
"\n",
"As listed in Step 4 above, data processing includes filtering, augmentation, feature selection, and much more. `tabensemb` provides a unified framework for implementing various data processing steps. The data imputation and processing procedure is quite similar to the `Pipeline` structure in `sklearn`, but is fully compatible with the other two modules introduced in this part and all four modules automatically do all preparations before training for the user.\n",
"\n",
"Configuration for a processor also contains two parts: the name of the processor and arguments. Here we provide several examples:\n",
"\n",
"* `CategoricalOrdinalEncoder`: similar to the `OrdinalEncoder` in `sklearn`\n",
"* `NaNFeatureRemover`: remove features that are all NaNs\n",
"* `VarianceFeatureSelector`: same as the `VarianceThreshold` from `sklearn`\n",
"* `FeatureValueSelector`: select data points that have a certain value of a feature\n",
"* `CorrFeatureSelector`: remove highly correlated features\n",
"* `IQRRemover`: remove outliers found by the 1.5*IQR criteria\n",
"* `StdRemover`: remove outliers found by 3*std criteria\n",
"* `SampleDataAugmentor`: just an example to show the data augmentation capability (it copies the last two data points in the validation set)\n",
"* `StandardScaler`: same as the `StandardScaler` from `sklearn`\n",
"\n",
"**Remark**: Data scalers like a `StandardScaler` must be the last data processor.\n",
"\n",
"**Remark**: Data points derived from data augmenters (inheriting `AbstractAugmenter`) will be added to the training set."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 8,
"outputs": [],
"source": [
"processor_configs = [\n",
" [\"CategoricalOrdinalEncoder\", {}],\n",
" [\"NaNFeatureRemover\", {}],\n",
" [\"VarianceFeatureSelector\", {\"thres\": 0.1}],\n",
" [\"FeatureValueSelector\", {\"feature\": \"cat_1\", \"value\": 0}],\n",
" [\"CorrFeatureSelector\", {\"thres\": 0.1}],\n",
" [\"IQRRemover\", {}],\n",
" [\"StdRemover\", {}],\n",
" [\"SampleDataAugmenter\", {}],\n",
" [\"StandardScaler\", {}],\n",
"]"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"1. Modify the configuration file:\n",
"\n",
"```python\n",
"cfg = {\n",
" \"data_processors\": processor_configs,\n",
" # Some other configurations...\n",
"}\n",
"```\n",
"\n",
"2. Use the `manual_config` argument of `Trainer.load_config`.\n",
"\n",
"```python\n",
"trainer.load_config(\"sample\", manual_config={\"data_processors\": processor_configs})\n",
"```\n",
"\n",
"3. Use `DataModule.set_data_processors`\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 9,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The project will be saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I2\n",
"Correlated features (Ranked by SHAP):\n",
"{\n",
"\t'cont_2': 13.650428051938668,\n",
"\t'cont_1': 8.98106859262871\n",
"}\n",
"1 features removed: ['cont_1']. 7 features retained: ['cont_0', 'cont_3', 'cont_4', 'cont_2', 'cat_0', 'cat_1', 'cat_2'].\n",
"Removing outliers by IQR. Original size: 36, Final size: 36.\n",
"Removing outliers by std. Original size: 36, Final size: 36.\n",
"Dataset size: 27 11 12\n",
"Data saved to ../../../../output/sample/2023-09-23-20-36-52-0_sample-I2 (data.csv and tabular_data.csv).\n"
]
}
],
"source": [
"import warnings\n",
"import numba\n",
"trainer.load_config(\"sample\")\n",
"trainer.datamodule.set_data_processors(processor_configs)\n",
"with warnings.catch_warnings():\n",
" warnings.filterwarnings(\"ignore\", category=numba.NumbaDeprecationWarning)\n",
" trainer.load_data()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Let's check the effectiveness of these processors. Categorical features are encoded by `CategoricalOrdinalEncoder`:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 10,
"outputs": [
{
"data": {
"text/plain": " cat_0 cat_1 cat_2\n0 3 0 2\n1 3 0 1\n2 3 0 4\n3 0 0 0\n4 4 0 2",
"text/html": "
\n\n
\n \n \n | \n cat_0 | \n cat_1 | \n cat_2 | \n
\n \n \n \n | 0 | \n 3 | \n 0 | \n 2 | \n
\n \n | 1 | \n 3 | \n 0 | \n 1 | \n
\n \n | 2 | \n 3 | \n 0 | \n 4 | \n
\n \n | 3 | \n 0 | \n 0 | \n 0 | \n
\n \n | 4 | \n 4 | \n 0 | \n 2 | \n
\n \n
\n
"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.datamodule.categorical_data.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"The original categorical features can be accessed using"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 11,
"outputs": [
{
"data": {
"text/plain": " cat_0 cat_1 cat_2\n0 category_3 0 2\n1 category_3 0 1\n2 category_3 0 4\n3 category_0 0 0\n4 category_4 0 2",
"text/html": "\n\n
\n \n \n | \n cat_0 | \n cat_1 | \n cat_2 | \n
\n \n \n \n | 0 | \n category_3 | \n 0 | \n 2 | \n
\n \n | 1 | \n category_3 | \n 0 | \n 1 | \n
\n \n | 2 | \n category_3 | \n 0 | \n 4 | \n
\n \n | 3 | \n category_0 | \n 0 | \n 0 | \n
\n \n | 4 | \n category_4 | \n 0 | \n 2 | \n
\n \n
\n
"
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.datamodule.categories_inverse_transform(trainer.datamodule.categorical_data).head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"One feature is removed by `CorrFeatureSelector`. It removes the feature with the lowest feature importance (ranked using `shap` in the example) in the correlation chain."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 12,
"outputs": [
{
"data": {
"text/plain": "['cont_0', 'cont_2', 'cont_3', 'cont_4']"
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.cont_feature_names"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"The specific `cat_1` feature value is selected by the `FeatureValueSelector`. Some outliers are removed by the `IQRRemover`. Original indices of the removed data points can be seen using"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 13,
"outputs": [
{
"data": {
"text/plain": "array([ 0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13, 14,\n 15, 18, 19, 20, 22, 24, 25, 26, 27, 28, 29, 30, 31,\n 32, 35, 37, 38, 39, 40, 41, 42, 43, 45, 46, 48, 49,\n 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,\n 63, 64, 65, 66, 67, 68, 69, 70, 72, 74, 75, 76, 77,\n 80, 81, 82, 83, 85, 86, 89, 90, 93, 94, 96, 97, 98,\n 99, 100, 102, 103, 104, 105, 106, 107, 109, 110, 111, 112, 113,\n 114, 115, 116, 118, 119, 120, 121, 123, 124, 125, 127, 130, 131,\n 132, 133, 135, 136, 137, 139, 140, 141, 142, 143, 144, 145, 146,\n 147, 148, 150, 151, 152, 153, 157, 158, 159, 161, 163, 165, 166,\n 167, 168, 170, 172, 173, 175, 176, 177, 178, 179, 180, 182, 183,\n 184, 185, 186, 187, 188, 189, 190, 192, 193, 194, 195, 196, 197,\n 199, 200, 201, 202, 203, 207, 208, 209, 210, 211, 212, 213, 214,\n 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229,\n 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242,\n 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255])"
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.datamodule.dropped_indices"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
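{
"cell_type": "markdown",
"source": [
"The 1.5*IQR criterion is commonly defined as flagging every value outside the interval [Q1 - 1.5*IQR, Q3 + 1.5*IQR], where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1. The following standard-library sketch shows that rule for a single feature. `iqr_outlier_mask` is a hypothetical helper written for illustration. The quantile interpolation used by `IQRRemover` may differ.\n",
"\n",
"```python\n",
"import statistics\n",
"\n",
"\n",
"def iqr_outlier_mask(values, k=1.5):\n",
"    # True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR].\n",
"    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')\n",
"    iqr = q3 - q1\n",
"    lower, upper = q1 - k * iqr, q3 + k * iqr\n",
"    return [v < lower or v > upper for v in values]\n",
"\n",
"\n",
"values = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]  # made-up numbers\n",
"mask = iqr_outlier_mask(values)\n",
"print(mask)\n",
"```\n",
"\n",
"Only the last value is flagged here. A 3*std rule works the same way, with the interval built from the mean and the standard deviation instead of the quartiles."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},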
{
"cell_type": "markdown",
"source": [
"The `SampleDataAugmentor` copies the last two data points in the validation set as a showcase. `DataModule.augmented_indices` represents the indices of these data points before dropping `DataModule.dropped_indices`. We can see augmented data points using"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 14,
"outputs": [
{
"data": {
"text/plain": " cont_0 cont_1 cont_2 cont_3 cont_4 cont_5 cont_6 \\\n48 -0.505358 -0.104343 -0.507518 -0.988002 -0.815792 -1.284552 -1.051880 \n49 -2.115056 0.138315 1.618054 0.541008 1.405365 -1.449118 -0.824409 \n\n cont_7 cont_8 cont_9 ... cat_3 cat_4 cat_5 cat_6 cat_7 \\\n48 0.564009 2.49720 -2.245322 ... 4 4 category_4 3 1 \n49 -0.813794 0.42258 0.547481 ... 0 0 category_2 4 3 \n\n cat_8 cat_9 target target_binary target_multi_class \n48 1 2 -246.101543 1 3 \n49 3 1 -156.813059 0 3 \n\n[2 rows x 23 columns]",
"text/html": "\n\n
\n \n \n | \n cont_0 | \n cont_1 | \n cont_2 | \n cont_3 | \n cont_4 | \n cont_5 | \n cont_6 | \n cont_7 | \n cont_8 | \n cont_9 | \n ... | \n cat_3 | \n cat_4 | \n cat_5 | \n cat_6 | \n cat_7 | \n cat_8 | \n cat_9 | \n target | \n target_binary | \n target_multi_class | \n
\n \n \n \n | 48 | \n -0.505358 | \n -0.104343 | \n -0.507518 | \n -0.988002 | \n -0.815792 | \n -1.284552 | \n -1.051880 | \n 0.564009 | \n 2.49720 | \n -2.245322 | \n ... | \n 4 | \n 4 | \n category_4 | \n 3 | \n 1 | \n 1 | \n 2 | \n -246.101543 | \n 1 | \n 3 | \n
\n \n | 49 | \n -2.115056 | \n 0.138315 | \n 1.618054 | \n 0.541008 | \n 1.405365 | \n -1.449118 | \n -0.824409 | \n -0.813794 | \n 0.42258 | \n 0.547481 | \n ... | \n 0 | \n 0 | \n category_2 | \n 4 | \n 3 | \n 3 | \n 1 | \n -156.813059 | \n 0 | \n 3 | \n
\n \n
\n
2 rows × 23 columns
\n
"
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.df.loc[trainer.datamodule.augmented_indices-len(trainer.datamodule.dropped_indices), :]"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": 15,
"outputs": [
{
"data": {
"text/plain": " cont_0 cont_1 cont_2 cont_3 cont_4 cont_5 cont_6 \\\n16 -0.505358 -0.104343 -0.507518 -0.988002 -0.815792 -1.284552 -1.051880 \n0 -2.115056 0.138315 1.618054 0.541008 1.405365 -1.449118 -0.824409 \n\n cont_7 cont_8 cont_9 ... cat_3 cat_4 cat_5 cat_6 cat_7 \\\n16 0.564009 2.49720 -2.245322 ... 4 4 category_4 3 1 \n0 -0.813794 0.42258 0.547481 ... 0 0 category_2 4 3 \n\n cat_8 cat_9 target target_binary target_multi_class \n16 1 2 -246.101543 1 3 \n0 3 1 -156.813059 0 3 \n\n[2 rows x 23 columns]",
"text/html": "\n\n
\n \n \n | \n cont_0 | \n cont_1 | \n cont_2 | \n cont_3 | \n cont_4 | \n cont_5 | \n cont_6 | \n cont_7 | \n cont_8 | \n cont_9 | \n ... | \n cat_3 | \n cat_4 | \n cat_5 | \n cat_6 | \n cat_7 | \n cat_8 | \n cat_9 | \n target | \n target_binary | \n target_multi_class | \n
\n \n \n \n | 16 | \n -0.505358 | \n -0.104343 | \n -0.507518 | \n -0.988002 | \n -0.815792 | \n -1.284552 | \n -1.051880 | \n 0.564009 | \n 2.49720 | \n -2.245322 | \n ... | \n 4 | \n 4 | \n category_4 | \n 3 | \n 1 | \n 1 | \n 2 | \n -246.101543 | \n 1 | \n 3 | \n
\n \n | 0 | \n -2.115056 | \n 0.138315 | \n 1.618054 | \n 0.541008 | \n 1.405365 | \n -1.449118 | \n -0.824409 | \n -0.813794 | \n 0.42258 | \n 0.547481 | \n ... | \n 0 | \n 0 | \n category_2 | \n 4 | \n 3 | \n 3 | \n 1 | \n -156.813059 | \n 0 | \n 3 | \n
\n \n
\n
2 rows × 23 columns
\n
"
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.df.loc[trainer.datamodule.val_indices[-2:], :]"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Finally, `StandardScaler` scales the dataset. `DataModule.df` is the unscaled data frame, and `scaled_df` is the scaled one."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 16,
"outputs": [
{
"data": {
"text/plain": " cont_0 cont_2 cont_3 cont_4\ncount 50.000000 50.000000 50.000000 50.000000\nmean -0.153242 -0.086822 0.083808 0.194227\nstd 0.961960 0.947694 0.762640 1.026297\nmin -2.115056 -1.945703 -1.665956 -1.884586\n25% -0.666141 -0.809715 -0.437357 -0.697263\n50% -0.087749 0.082401 -0.059459 0.246986\n75% 0.329382 0.765257 0.541008 1.110425\nmax 2.929096 1.618054 1.576299 2.285601",
"text/html": "\n\n
\n \n \n | \n cont_0 | \n cont_2 | \n cont_3 | \n cont_4 | \n
\n \n \n \n | count | \n 50.000000 | \n 50.000000 | \n 50.000000 | \n 50.000000 | \n
\n \n | mean | \n -0.153242 | \n -0.086822 | \n 0.083808 | \n 0.194227 | \n
\n \n | std | \n 0.961960 | \n 0.947694 | \n 0.762640 | \n 1.026297 | \n
\n \n | min | \n -2.115056 | \n -1.945703 | \n -1.665956 | \n -1.884586 | \n
\n \n | 25% | \n -0.666141 | \n -0.809715 | \n -0.437357 | \n -0.697263 | \n
\n \n | 50% | \n -0.087749 | \n 0.082401 | \n -0.059459 | \n 0.246986 | \n
\n \n | 75% | \n 0.329382 | \n 0.765257 | \n 0.541008 | \n 1.110425 | \n
\n \n | max | \n 2.929096 | \n 1.618054 | \n 1.576299 | \n 2.285601 | \n
\n \n
\n
"
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.datamodule.df[trainer.cont_feature_names].describe()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": 17,
"outputs": [
{
"data": {
"text/plain": " cont_0 cont_2 cont_3 cont_4\ncount 50.000000 50.000000 50.000000 50.000000\nmean 0.069544 0.049781 0.002835 0.034550\nstd 1.072721 0.975998 0.971265 1.002871\nmin -2.118156 -1.864617 -2.225588 -1.996813\n25% -0.502411 -0.694703 -0.660898 -0.836591\n50% 0.142578 0.224057 -0.179623 0.086104\n75% 0.607738 0.927307 0.585106 0.929836\nmax 3.506786 1.805574 1.903607 2.078188",
"text/html": "\n\n
\n \n \n | \n cont_0 | \n cont_2 | \n cont_3 | \n cont_4 | \n
\n \n \n \n | count | \n 50.000000 | \n 50.000000 | \n 50.000000 | \n 50.000000 | \n
\n \n | mean | \n 0.069544 | \n 0.049781 | \n 0.002835 | \n 0.034550 | \n
\n \n | std | \n 1.072721 | \n 0.975998 | \n 0.971265 | \n 1.002871 | \n
\n \n | min | \n -2.118156 | \n -1.864617 | \n -2.225588 | \n -1.996813 | \n
\n \n | 25% | \n -0.502411 | \n -0.694703 | \n -0.660898 | \n -0.836591 | \n
\n \n | 50% | \n 0.142578 | \n 0.224057 | \n -0.179623 | \n 0.086104 | \n
\n \n | 75% | \n 0.607738 | \n 0.927307 | \n 0.585106 | \n 0.929836 | \n
\n \n | max | \n 3.506786 | \n 1.805574 | \n 1.903607 | \n 2.078188 | \n
\n \n
\n
"
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.datamodule.scaled_df[trainer.cont_feature_names].describe()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"**Remark**: All modules are fitted on training and validation sets and transform the testing set."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 18,
"outputs": [
{
"data": {
"text/plain": " cont_0 cont_2 cont_3 cont_4\ncount 3.800000e+01 3.800000e+01 3.800000e+01 3.800000e+01\nmean 4.674623e-17 5.843279e-18 -4.090295e-17 -2.921640e-17\nstd 1.013423e+00 1.013423e+00 1.013423e+00 1.013423e+00\nmin -2.118156e+00 -1.864617e+00 -2.225588e+00 -1.996813e+00\n25% -5.160307e-01 -6.772407e-01 -7.744541e-01 -8.633614e-01\n50% 1.425780e-01 2.926057e-02 -5.491499e-02 6.457453e-02\n75% 6.018802e-01 8.600076e-01 5.851057e-01 9.298360e-01\nmax 1.814086e+00 1.805574e+00 1.903607e+00 2.005587e+00",
"text/html": "\n\n
\n \n \n | \n cont_0 | \n cont_2 | \n cont_3 | \n cont_4 | \n
\n \n \n \n | count | \n 3.800000e+01 | \n 3.800000e+01 | \n 3.800000e+01 | \n 3.800000e+01 | \n
\n \n | mean | \n 4.674623e-17 | \n 5.843279e-18 | \n -4.090295e-17 | \n -2.921640e-17 | \n
\n \n | std | \n 1.013423e+00 | \n 1.013423e+00 | \n 1.013423e+00 | \n 1.013423e+00 | \n
\n \n | min | \n -2.118156e+00 | \n -1.864617e+00 | \n -2.225588e+00 | \n -1.996813e+00 | \n
\n \n | 25% | \n -5.160307e-01 | \n -6.772407e-01 | \n -7.744541e-01 | \n -8.633614e-01 | \n
\n \n | 50% | \n 1.425780e-01 | \n 2.926057e-02 | \n -5.491499e-02 | \n 6.457453e-02 | \n
\n \n | 75% | \n 6.018802e-01 | \n 8.600076e-01 | \n 5.851057e-01 | \n 9.298360e-01 | \n
\n \n | max | \n 1.814086e+00 | \n 1.805574e+00 | \n 1.903607e+00 | \n 2.005587e+00 | \n
\n \n
\n
"
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"trainer.datamodule.scaled_df.loc[np.append(trainer.train_indices, trainer.val_indices), trainer.cont_feature_names].describe()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
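{
"cell_type": "markdown",
"source": [
"The standard deviation reported above is 1.013423 rather than exactly 1. This is expected: `sklearn`'s `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `DataFrame.describe` reports the sample standard deviation (`ddof=1`). With 38 training and validation points the ratio between the two is sqrt(38/37), which is about 1.013423. The following standard-library sketch reproduces the effect with made-up numbers. `fit_standard_scaler` and `transform` are hypothetical helpers, not part of `tabensemb`.\n",
"\n",
"```python\n",
"import math\n",
"import statistics\n",
"\n",
"\n",
"def fit_standard_scaler(values):\n",
"    # Population statistics (ddof=0), as used by sklearn's StandardScaler.\n",
"    return statistics.fmean(values), statistics.pstdev(values)\n",
"\n",
"\n",
"def transform(values, mean, std):\n",
"    return [(v - mean) / std for v in values]\n",
"\n",
"\n",
"train_val = [0.5, -1.2, 3.3, 2.0, -0.7, 1.1]  # made-up numbers\n",
"test = [0.0, 4.0]\n",
"\n",
"mean, std = fit_standard_scaler(train_val)  # fit on training + validation only\n",
"scaled = transform(train_val, mean, std)\n",
"scaled_test = transform(test, mean, std)  # the testing set reuses the fitted statistics\n",
"\n",
"n = len(train_val)\n",
"print(statistics.stdev(scaled), math.sqrt(n / (n - 1)))  # the two values agree\n",
"print(math.sqrt(38 / 37))\n",
"```\n",
"\n",
"Because the testing set is transformed with statistics fitted elsewhere, its scaled mean and standard deviation are generally not 0 and 1, which is why the statistics of the full `scaled_df` differ from those of the training and validation subset."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},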
{
"cell_type": "markdown",
"source": [
"## Data derivers\n",
"\n",
"Existing features in the dataset may not be sufficient to represent the inner relations between features and the target. Extending more features that can be strongly correlated with the target using existing ones can be helpful. Data derivers can be used to extend continuous features (stacked in the tabular dataset, Step 3 above) or **multi-modal** features (unstacked, Step 5 above).\n",
"\n",
"Configurations are similar. Necessary and shared arguments are:\n",
"\n",
"* `stacked`: Should the derived feature stack in the processed `DataFrame`?\n",
"* `intermediate`: Is the derived `stacked` feature excluded from continuous features?\n",
"* `derived_name`: What is the name of the feature?\n",
"\n",
"Here we give three examples:\n",
"\n",
"* `RelativeDeriver` calculates the result of dividing `absolute_col` by `relative2_col`;\n",
"* `SampleWeightDeriver` calculates the degree to which a data point is an outlier (it is just an example and there isn't detailed research on it);\n",
"* `UnscaledDataDeriver` records all continuous features before scaling (standard scaling by default)."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 19,
"outputs": [],
"source": [
"deriver_configs = [\n",
" (\"RelativeDeriver\", {\n",
" \"stacked\": True,\n",
" \"absolute_col\": \"cont_0\",\n",
" \"relative2_col\": \"cont_1\",\n",
" \"intermediate\": False,\n",
" \"derived_name\": \"derived_cont\",\n",
" }),\n",
" (\"SampleWeightDeriver\", {\n",
" \"stacked\": True,\n",
" \"intermediate\": True,\n",
" \"derived_name\": \"sample_weight\",\n",
" }),\n",
" (\"UnscaledDataDeriver\", {\"derived_name\": \"unscaled\", \"stacked\": False}),\n",
"]"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"1. Modify the configuration file:\n",
"\n",
"```python\n",
"cfg = {\n",
" \"data_derivers\": deriver_configs,\n",
" # Some other configurations...\n",
"}\n",
"```\n",
"\n",
"2. Use the `manual_config` argument of `Trainer.load_config`.\n",
"\n",
"```python\n",
"trainer.load_config(\"sample\", manual_config={\"data_derivers\": deriver_configs})\n",
"```\n",
"\n",
"3. Use `DataModule.set_data_derivers`"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 20,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The project will be saved to ../../../../output/sample/2023-09-23-20-36-53-0_sample\n",
"Dataset size: 153 51 52\n",
"Data saved to ../../../../output/sample/2023-09-23-20-36-53-0_sample (data.csv and tabular_data.csv).\n"
]
}
],
"source": [
"trainer.load_config(\"sample\")\n",
"trainer.datamodule.set_data_derivers(deriver_configs)\n",
"trainer.load_data()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Two `stacked` features can be found in `Trainer.df` or `Trainer.datamodule.df`. `derived_cont` is a continuous feature because `intermediate=False`, but `sample_weight` is not."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 21,
"outputs": [
{
"data": {
"text/plain": " derived_cont sample_weight\n0 5.884222 1.045746\n1 -9.058123 1.063506\n2 -3.650394 0.959582\n3 1.678893 0.974096\n4 -4.099185 1.000761\n.. ... ...\n251 -1.355422 0.958380\n252 1.088160 0.978138\n253 0.374183 0.969419\n254 1.199032 0.967882\n255 -1.521539 0.979158\n\n[256 rows x 2 columns]",
"text/html": "\n\n
\n \n \n | \n derived_cont | \n sample_weight | \n
\n \n \n \n | 0 | \n 5.884222 | \n 1.045746 | \n
\n \n | 1 | \n -9.058123 | \n 1.063506 | \n
\n \n | 2 | \n -3.650394 | \n 0.959582 | \n
\n \n | 3 | \n 1.678893 | \n 0.974096 | \n
\n \n | 4 | \n -4.099185 | \n 1.000761 | \n
\n \n | ... | \n ... | \n ... | \n
\n \n | 251 | \n -1.355422 | \n 0.958380 | \n
\n \n | 252 | \n 1.088160 | \n 0.978138 | \n
\n \n | 253 | \n 0.374183 | \n 0.969419 | \n
\n \n | 254 | \n 1.199032 | \n 0.967882 | \n
\n \n | 255 | \n -1.521539 | \n 0.979158 | \n
\n \n
\n
256 rows × 2 columns
\n
"
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.df[[\"derived_cont\", \"sample_weight\"]]"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": 22,
"outputs": [
{
"data": {
"text/plain": "(True, False)"
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"derived_cont\" in trainer.cont_feature_names, \"sample_weight\" in trainer.cont_feature_names"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"The unstacked feature `unscaled` can be found in `Trainer.derived_data`"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 23,
"outputs": [
{
"data": {
"text/plain": "array([[-1.30652685, -0.22203901, -0.11816405, -0.15957344, 1.65813068,\n 5.88422203],\n [ 2.01125668, -0.22203901, 0.1950697 , 0.52700421, -0.04459543,\n -9.05812296],\n [-1.21607658, 0.33313567, -0.74367217, 0.73018353, 0.1406719 ,\n -3.65039438],\n ...,\n [-0.0698565 , -0.18669092, -1.02191332, -1.1436414 , 0.25011391,\n 0.37418262],\n [-1.03148246, -0.86026245, -0.06163805, 0.3283013 , -1.42999126,\n 1.19903229],\n [-1.46173269, 0.9606934 , 0.3675449 , 1.32906285, -0.68343977,\n -1.52153922]])"
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.derived_data[\"unscaled\"]"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Available derivers can be seen by"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 24,
"outputs": [
{
"data": {
"text/plain": "{'AbstractDeriver': tabensemb.data.base.AbstractDeriver,\n 'RelativeDeriver': tabensemb.data.dataderiver.RelativeDeriver,\n 'SampleWeightDeriver': tabensemb.data.dataderiver.SampleWeightDeriver,\n 'UnscaledDataDeriver': tabensemb.data.dataderiver.UnscaledDataDeriver}"
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tabensemb.data.dataderiver import deriver_mapping, get_data_deriver\n",
"deriver_mapping"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Arguments can be found in API docs or in the docstring."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 25,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Dividing a feature by another to derive a new feature. Required arguments are:\n",
"\n",
" absolute_col: str\n",
" The feature that needs to be divided.\n",
" relative2_col: str\n",
" The feature that acts as the denominator.\n",
" \n"
]
}
],
"source": [
"print(get_data_deriver(\"RelativeDeriver\").__doc__)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
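{
"cell_type": "markdown",
"source": [
"The division described in the docstring can be checked by hand against the values printed earlier. The sketch below uses plain dictionaries and assumes that the first two columns of `unscaled` are `cont_0` and `cont_1`. `relative_derive` is a hypothetical helper for illustration, not the implementation of `RelativeDeriver`.\n",
"\n",
"```python\n",
"def relative_derive(rows, absolute_col, relative2_col, derived_name):\n",
"    # Hypothetical helper: add derived_name = absolute_col / relative2_col to each row.\n",
"    return [\n",
"        {**row, derived_name: row[absolute_col] / row[relative2_col]}\n",
"        for row in rows\n",
"    ]\n",
"\n",
"\n",
"# Unscaled values of the first two data points, copied from the output above.\n",
"rows = [\n",
"    {'cont_0': -1.30652685, 'cont_1': -0.22203901},\n",
"    {'cont_0': 2.01125668, 'cont_1': -0.22203901},\n",
"]\n",
"derived = relative_derive(rows, 'cont_0', 'cont_1', 'derived_cont')\n",
"print([round(row['derived_cont'], 4) for row in derived])\n",
"```\n",
"\n",
"The two results agree with the first two `derived_cont` values shown in `Trainer.df` (about 5.8842 and -9.0581)."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},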
{
"cell_type": "markdown",
"source": [
"## Access the processed dataset"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"All these data can be found in the `DataModule` instance in the trainer, along with many modified data structures for further usage:\n",
"\n",
"* Continuous features\n",
" * `DataModule.feature_data`: scaled\n",
" * `DataModule.unscaled_feature_data`: not scaled\n",
" * `DataModule.X_train/X_val/X_test[trainer.cont_feature_names]`: scaled and divided into three partitions\n",
" * `DataModule.tensors[0]`: scaled and transformed into torch.Tensor.\n",
"* Categorical features\n",
" * `DataModule.categorical_data`: ordinal-encoded\n",
" * `DataModule.X_train/X_val/X_test[trainer.cat_feature_names]`: ordinal-encoded and divided into three partitions\n",
" * `DataModule.derived_data[\"categorical\"]`: ordinal-encoded\n",
" * `trainer.datamodule.tensors[list(trainer.datamodule.derived_data.keys()).index(\"categorical\")+1]`: ordinal-encoded and transformed into torch.Tensor.\n",
"* Derived unstacked features\n",
" * `DataModule.derived_data`: include unstacked features, categorical features, and the signal for each data point representing whether it is an augmented one.\n",
" * `DataModule.tensors[1:-1]`: same as `DataModule.derived_data`, but are `torch.Tensor`s.\n",
"\n",
"**Remark**: Currently, derived unstacked features are not used in the supported external model bases. But it can be easily accessed using the above approaches, even easier for a customized `PyTorch`-based model base class `TorchModel`, which will be introduced in the \"Advanced Usage\" sections.\n",
"\n",
"**Remark**: Stacked (continuous) derived features are derived after imputation but before data processing. These features will also be imputed. Unstacked derived features are derived after all other steps are finished.\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}