{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Basics of running benchmarks\n",
    "\n",
    "Tabular Ensemble (`tabensemb`) is a benchmark platform for tabular prediction tasks. We support three well-established model bases as baselines:\n",
    "\n",
    "* `autogluon`: [Link](https://github.com/autogluon/autogluon)\n",
    "\n",
    "* `pytorch_widedeep`: [Link](https://github.com/jrzaurin/pytorch-widedeep)\n",
    "\n",
    "* `pytorch_tabular`: [Link](https://github.com/manujosephv/pytorch_tabular)\n",
    "\n",
    "Users can run benchmarks on customized datasets using customized preprocessing steps and implement customized models in the framework to run and compare their performance with baselines within a consistent procedure.\n",
    "\n",
    "In this part, minimum examples of regression, binary classification, and multiclass classification are performed to show the basic functionality of the package."
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Regression"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Loading packages\n",
    "\n",
    "First, import the necessary modules. Then check the validity of `CUDA` and determine the training device."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using cuda device\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "from tabensemb.trainer import Trainer\n",
    "from tabensemb.model import *\n",
    "from tabensemb.config import UserConfig\n",
    "import tabensemb\n",
    "import os\n",
    "\n",
    "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
    "print(\"Using {} device\".format(device))"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "`tabensemb` uses paths relative to the current directory. For different IDEs (PyCharm, VSCode, etc.), the directory can be different. Set default paths to desired ones.\n",
    "\n",
    "* `tabensemb.setting[\"default_output_path\"]`: It will be used to save results. This path will be created if it does not exist.\n",
    "* `tabensemb.setting[\"default_config_path\"]`: It should be the path to configuration files (See \"Using a configuration file\" for its case).\n",
    "* `tabensemb.setting[\"default_config_path\"]`: It should be the path to data files. It will also be used to save downloaded datasets (See \"Using a configuration file\" for its case).\n",
    "\n",
    "In this notebook, we use a temporary directory for cleanliness. Change `temp_path.name` to your own directory."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "outputs": [],
   "source": [
    "from tempfile import TemporaryDirectory\n",
    "\n",
    "temp_path = TemporaryDirectory()\n",
    "tabensemb.setting[\"default_output_path\"] = os.path.join(temp_path.name, \"output\")\n",
    "tabensemb.setting[\"default_config_path\"] = os.path.join(temp_path.name, \"configs\")\n",
    "tabensemb.setting[\"default_data_path\"] = os.path.join(temp_path.name, \"data\")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Configuring a `Trainer`\n",
    "\n",
    "Create a `Trainer`, which acts as a bridge of data and models and provides some useful utilities."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "trainer = Trainer(device=device)"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "As an example, we use the Auto MPG dataset from [UCI datasets](https://archive.ics.uci.edu/datasets). We can import UCI datasets through the `UserConfig` class."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://archive.ics.uci.edu/static/public/9/auto+mpg.zip to /tmp/tmpqcxgn2l1/data/Auto MPG.zip\n",
      "cylinders is Integer and will be treated as a continuous feature.\n",
      "model_year is Integer and will be treated as a continuous feature.\n",
      "origin is Integer and will be treated as a continuous feature.\n",
      "Unknown values are detected in ['horsepower']. They will be treated as np.nan.\n",
      "The project will be saved to /tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig\n"
     ]
    }
   ],
   "source": [
    "mpg_columns = [\n",
    "    \"mpg\",\n",
    "    \"cylinders\",\n",
    "    \"displacement\",\n",
    "    \"horsepower\",\n",
    "    \"weight\",\n",
    "    \"acceleration\",\n",
    "    \"model_year\",\n",
    "    \"origin\",\n",
    "    \"car_name\",\n",
    "]\n",
    "cfg = UserConfig.from_uci(\"Auto MPG\", column_names=mpg_columns, sep=r\"\\s+\")\n",
    "trainer.load_config(cfg)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "*Optional*: We provide a useful `Logging` class to record all outputs to a file located in the above project root so that users can review the training process. This step is optional but we strongly recommend using it.\n",
    "\n",
    "`Trainer.project_root` is the output directory of the `trainer`, and here we log all `stdout` and `stderr` to `log.txt` in this directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from tabensemb.utils import Logging\n",
    "log = Logging()\n",
    "log.enter(os.path.join(trainer.project_root, \"log.txt\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Viewing configurations\n",
    "\n",
    "We can view the summary of the current environment, including devices/Python version, the loaded configuration, and global settings of `tabensemb`."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Device:\n",
      "{\n",
      "\t'System': 'Linux',\n",
      "\t'Node name': 'xlluo-WS',\n",
      "\t'System release': '5.15.6-custom',\n",
      "\t'System version': '#1 SMP Mon Dec 13 20:27:58 CST 2021',\n",
      "\t'Machine architecture': 'x86_64',\n",
      "\t'Processor architecture': 'x86_64',\n",
      "\t'Processor model': '11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz',\n",
      "\t'Physical cores': 8,\n",
      "\t'Total cores': 16,\n",
      "\t'Max core frequency': '5150.00Mhz',\n",
      "\t'Total memory': '31.20GB',\n",
      "\t'Python version': '3.10.12',\n",
      "\t'Python implementation': 'CPython',\n",
      "\t'Python compiler': 'GCC 11.2.0',\n",
      "\t'Cuda availability': True,\n",
      "\t'GPU devices': [\n",
      "\t\t'NVIDIA GeForce RTX 3090'\n",
      "\t]\n",
      "}\n",
      "Configurations:\n",
      "{\n",
      "\t'database': 'auto-mpg',\n",
      "\t'task': 'regression',\n",
      "\t'loss': None,\n",
      "\t'bayes_opt': False,\n",
      "\t'bayes_calls': 50,\n",
      "\t'bayes_epoch': 30,\n",
      "\t'patience': 100,\n",
      "\t'epoch': 300,\n",
      "\t'lr': 0.001,\n",
      "\t'weight_decay': 1e-09,\n",
      "\t'batch_size': 1024,\n",
      "\t'layers': [\n",
      "\t\t64,\n",
      "\t\t128,\n",
      "\t\t256,\n",
      "\t\t128,\n",
      "\t\t64\n",
      "\t],\n",
      "\t'SPACEs': {\n",
      "\t\t'lr': {\n",
      "\t\t\t'type': 'Real',\n",
      "\t\t\t'low': 0.0001,\n",
      "\t\t\t'high': 0.05,\n",
      "\t\t\t'prior': 'log-uniform'\n",
      "\t\t},\n",
      "\t\t'weight_decay': {\n",
      "\t\t\t'type': 'Real',\n",
      "\t\t\t'low': 1e-09,\n",
      "\t\t\t'high': 0.05,\n",
      "\t\t\t'prior': 'log-uniform'\n",
      "\t\t},\n",
      "\t\t'batch_size': {\n",
      "\t\t\t'type': 'Categorical',\n",
      "\t\t\t'categories': [\n",
      "\t\t\t\t64,\n",
      "\t\t\t\t128,\n",
      "\t\t\t\t256,\n",
      "\t\t\t\t512,\n",
      "\t\t\t\t1024,\n",
      "\t\t\t\t2048\n",
      "\t\t\t]\n",
      "\t\t}\n",
      "\t},\n",
      "\t'data_splitter': 'RandomSplitter',\n",
      "\t'split_ratio': [\n",
      "\t\t0.6,\n",
      "\t\t0.2,\n",
      "\t\t0.2\n",
      "\t],\n",
      "\t'data_imputer': 'MissForestImputer',\n",
      "\t'data_processors': [\n",
      "\t\t(\n",
      "\t\t\t'CategoricalOrdinalEncoder',\n",
      "\t\t\t{\n",
      "\t\t\t}\n",
      "\t\t),\n",
      "\t\t(\n",
      "\t\t\t'NaNFeatureRemover',\n",
      "\t\t\t{\n",
      "\t\t\t}\n",
      "\t\t),\n",
      "\t\t(\n",
      "\t\t\t'VarianceFeatureSelector',\n",
      "\t\t\t{\n",
      "\t\t\t\t'thres': 1\n",
      "\t\t\t}\n",
      "\t\t),\n",
      "\t\t(\n",
      "\t\t\t'StandardScaler',\n",
      "\t\t\t{\n",
      "\t\t\t}\n",
      "\t\t)\n",
      "\t],\n",
      "\t'data_derivers': [\n",
      "\t],\n",
      "\t'categorical_feature_names': [\n",
      "\t],\n",
      "\t'continuous_feature_names': [\n",
      "\t\t'displacement',\n",
      "\t\t'cylinders',\n",
      "\t\t'horsepower',\n",
      "\t\t'weight',\n",
      "\t\t'acceleration',\n",
      "\t\t'model_year',\n",
      "\t\t'origin'\n",
      "\t],\n",
      "\t'feature_types': {\n",
      "\t\t'displacement': 'Continuous',\n",
      "\t\t'cylinders': 'Continuous',\n",
      "\t\t'horsepower': 'Continuous',\n",
      "\t\t'weight': 'Continuous',\n",
      "\t\t'acceleration': 'Continuous',\n",
      "\t\t'model_year': 'Continuous',\n",
      "\t\t'origin': 'Continuous'\n",
      "\t},\n",
      "\t'unique_feature_types': [\n",
      "\t\t'Continuous'\n",
      "\t],\n",
      "\t'label_name': [\n",
      "\t\t'mpg'\n",
      "\t]\n",
      "}\n",
      "Global settings:\n",
      "{\n",
      "\t'random_seed': 42,\n",
      "\t'low_memory': True,\n",
      "\t'verbose_per_epoch': 20,\n",
      "\t'test_with_no_grad': True,\n",
      "\t'debug_mode': False,\n",
      "\t'default_output_path': '/tmp/tmpqcxgn2l1/output',\n",
      "\t'default_config_path': '/tmp/tmpqcxgn2l1/configs',\n",
      "\t'default_data_path': '/tmp/tmpqcxgn2l1/data',\n",
      "\t'warn_nan_metric': True,\n",
      "\t'raise_inconsistent_inferred_task': False,\n",
      "\t'matplotlib_usetex': False\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "trainer.summarize_setting()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Loading data\n",
    "\n",
    "In the configuration summary above, the dataset file is defined by \"database\" under the `Configurations` category. `Trainer.load_data` automatically searches the file in the current directory and `tabensemb.setting[\"default_data_path\"]`. Now, load the Auto MPG dataset into the `Trainer`. It will process the dataset and get ready for training models:\n",
    "\n",
    "1. Data splitting (training/validation/testing sets)\n",
    "2. Data imputation\n",
    "3. Data augmentation (for features)\n",
    "4. Data processing\n",
    "    * Data augmentation (for data points)\n",
    "    * Data filtering\n",
    "    * Feature selection\n",
    "    * Categorical encoding\n",
    "    * Data scaling\n",
    "    * etc.\n",
    "5. Data augmentation (for features, especially multi-modal features)\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset size: 238 80 80\n",
      "Data saved to /tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig (data.csv and tabular_data.csv).\n"
     ]
    }
   ],
   "source": [
    "trainer.load_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Initializing model bases\n",
    "\n",
    "Initialize model bases and add them to the `Trainer`. We only choose a subset of models in each model base for demonstration by passing the `model_subset` argument (without it, all available models will be trained)."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "models = [\n",
    "    PytorchTabular(trainer, model_subset=[\"Category Embedding\"]),\n",
    "    WideDeep(trainer, model_subset=[\"TabMlp\"]),\n",
    "    AutoGluon(trainer, model_subset=[\"Random Forest\"]),\n",
    "]\n",
    "trainer.add_modelbases(models)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Start training\n",
    "\n",
    "Now train the model bases. The argument `stderr_to_stdout` will redirect warnings and loggings to `stdout` and make records in the notebook clean.\n",
    "\n",
    "*Optional*: Using the following line, we can run k-fold cross-validation to get the leaderboard, where k is `cross_validation`.\n",
    "\n",
    "```python\n",
    "trainer.get_leaderboard(cross_validation=10, split_type=\"cv\", stderr_to_stdout=True)\n",
    "```\n",
    "\n",
    "**Remark**: `split_type` can be `random`, which means that the dataset is randomly split according to the given `split_ratio` in the configuration and different random seeds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "-------------Run PytorchTabular-------------\n",
      "\n",
      "Training Category Embedding\n",
      "Global seed set to 42\n",
      "2023-09-23 20:36:01,062 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders\n",
      "2023-09-23 20:36:01,062 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task\n",
      "2023-09-23 20:36:01,070 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel\n",
      "2023-09-23 20:36:01,081 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer\n",
      "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.\n",
      "  rank_zero_deprecation(\n",
      "Auto select gpus: [0]\n",
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "IPU available: False, using: 0 IPUs\n",
      "HPU available: False, using: 0 HPUs\n",
      "2023-09-23 20:36:01,991 - {pytorch_tabular.tabular_model:582} - INFO - Training Started\n",
      "You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n",
      "\n",
      "  | Name             | Type                      | Params\n",
      "---------------------------------------------------------------\n",
      "0 | _backbone        | CategoryEmbeddingBackbone | 11.4 K\n",
      "1 | _embedding_layer | Embedding1dLayer          | 14    \n",
      "2 | head             | LinearHead                | 33    \n",
      "3 | loss             | MSELoss                   | 0     \n",
      "---------------------------------------------------------------\n",
      "11.4 K    Trainable params\n",
      "0         Non-trainable params\n",
      "11.4 K    Total params\n",
      "0.046     Total estimated model params size (MB)\n",
      "Epoch: 1/300, Train loss: 677.8015, Val loss: 582.9557, Min val loss: 582.9557, Epoch time: 0.012s.\n",
      "Epoch: 20/300, Train loss: 353.7851, Val loss: 302.0203, Min val loss: 302.0203, Epoch time: 0.010s.\n",
      "Epoch: 40/300, Train loss: 85.0776, Val loss: 62.1153, Min val loss: 62.1153, Epoch time: 0.009s.\n",
      "Epoch: 60/300, Train loss: 45.2654, Val loss: 34.2778, Min val loss: 34.2691, Epoch time: 0.009s.\n",
      "Epoch: 80/300, Train loss: 33.9537, Val loss: 26.8622, Min val loss: 26.8622, Epoch time: 0.016s.\n",
      "Epoch: 100/300, Train loss: 26.9038, Val loss: 23.2417, Min val loss: 23.2372, Epoch time: 0.011s.\n",
      "Epoch: 120/300, Train loss: 24.9622, Val loss: 20.4360, Min val loss: 20.4360, Epoch time: 0.008s.\n",
      "Epoch: 140/300, Train loss: 24.1636, Val loss: 19.4010, Min val loss: 19.4010, Epoch time: 0.012s.\n",
      "Epoch: 160/300, Train loss: 22.9200, Val loss: 18.0232, Min val loss: 17.9749, Epoch time: 0.010s.\n",
      "Epoch: 180/300, Train loss: 19.7677, Val loss: 16.9469, Min val loss: 16.9469, Epoch time: 0.011s.\n",
      "Epoch: 200/300, Train loss: 17.9390, Val loss: 16.6545, Min val loss: 16.4093, Epoch time: 0.010s.\n",
      "Epoch: 220/300, Train loss: 19.4496, Val loss: 15.4451, Min val loss: 15.1788, Epoch time: 0.011s.\n",
      "Epoch: 240/300, Train loss: 16.0483, Val loss: 14.5508, Min val loss: 14.5508, Epoch time: 0.010s.\n",
      "Epoch: 260/300, Train loss: 16.4672, Val loss: 13.8354, Min val loss: 13.8354, Epoch time: 0.010s.\n",
      "Epoch: 280/300, Train loss: 13.6031, Val loss: 12.9315, Min val loss: 12.9315, Epoch time: 0.018s.\n",
      "Epoch: 300/300, Train loss: 16.5369, Val loss: 12.3673, Min val loss: 12.3673, Epoch time: 0.011s.\n",
      "`Trainer.fit` stopped: `max_epochs=300` reached.\n",
      "2023-09-23 20:36:07,420 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed\n",
      "2023-09-23 20:36:07,420 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model\n",
      "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.\n",
      "  rank_zero_deprecation(\n",
      "Training mse loss: 11.25175\n",
      "Validation mse loss: 12.36725\n",
      "Testing mse loss: 7.83801\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')\n",
      "\n",
      "-------------PytorchTabular End-------------\n",
      "\n",
      "\n",
      "-------------Run WideDeep-------------\n",
      "\n",
      "Training TabMlp\n",
      "Epoch: 1/300, Train loss: 635.5330, Val loss: 555.4755, Min val loss: 555.4755\n",
      "Epoch: 21/300, Train loss: 441.6902, Val loss: 375.7337, Min val loss: 375.7337\n",
      "Epoch: 41/300, Train loss: 145.8623, Val loss: 119.9598, Min val loss: 119.9598\n",
      "Epoch: 61/300, Train loss: 45.9133, Val loss: 34.0160, Min val loss: 34.0160\n",
      "Epoch: 81/300, Train loss: 27.6878, Val loss: 24.1525, Min val loss: 24.1525\n",
      "Epoch: 101/300, Train loss: 23.0877, Val loss: 18.2096, Min val loss: 18.2096\n",
      "Epoch: 121/300, Train loss: 21.4056, Val loss: 17.2203, Min val loss: 17.1303\n",
      "Epoch: 141/300, Train loss: 21.2559, Val loss: 16.0746, Min val loss: 16.0746\n",
      "Epoch: 161/300, Train loss: 19.2337, Val loss: 15.3027, Min val loss: 15.3027\n",
      "Epoch: 181/300, Train loss: 16.1232, Val loss: 14.5777, Min val loss: 14.5777\n",
      "Epoch: 201/300, Train loss: 16.7095, Val loss: 14.2274, Min val loss: 14.2274\n",
      "Epoch: 221/300, Train loss: 15.7366, Val loss: 13.5223, Min val loss: 13.5223\n",
      "Epoch: 241/300, Train loss: 16.9825, Val loss: 12.9892, Min val loss: 12.9892\n",
      "Epoch: 261/300, Train loss: 15.3358, Val loss: 12.4278, Min val loss: 12.4278\n",
      "Epoch: 281/300, Train loss: 13.3989, Val loss: 12.1155, Min val loss: 12.1155\n",
      "Restoring model weights from the end of the best epoch\n",
      "Training mse loss: 10.17037\n",
      "Validation mse loss: 11.66271\n",
      "Testing mse loss: 6.43856\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')\n",
      "\n",
      "-------------WideDeep End-------------\n",
      "\n",
      "\n",
      "-------------Run AutoGluon-------------\n",
      "\n",
      "Training Random Forest\n",
      "Presets specified: ['best_quality']\n",
      "Warning: hyperparameter tuning is currently experimental and may cause the process to hang.\n",
      "Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=1\n",
      "Beginning AutoGluon training ...\n",
      "AutoGluon will save models to \"/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/AutoGluon/Random Forest/\"\n",
      "AutoGluon Version:  0.8.2\n",
      "Python Version:     3.10.12\n",
      "Operating System:   Linux\n",
      "Platform Machine:   x86_64\n",
      "Platform Version:   #1 SMP Mon Dec 13 20:27:58 CST 2021\n",
      "Disk Space Avail:   149.28 GB / 502.47 GB (29.7%)\n",
      "Train Data Rows:    238\n",
      "Train Data Columns: 7\n",
      "Tuning Data Rows:    80\n",
      "Tuning Data Columns: 7\n",
      "Label Column: mpg\n",
      "Preprocessing data ...\n",
      "Using Feature Generators to preprocess the data ...\n",
      "Fitting PipelineFeatureGenerator...\n",
      "\tAvailable Memory:                    12533.16 MB\n",
      "\tTrain Data (Original)  Memory Usage: 0.02 MB (0.0% of available memory)\n",
      "\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n",
      "\tStage 1 Generators:\n",
      "\t\tFitting AsTypeFeatureGenerator...\n",
      "\tStage 2 Generators:\n",
      "\t\tFitting FillNaFeatureGenerator...\n",
      "\tStage 3 Generators:\n",
      "\t\tFitting IdentityFeatureGenerator...\n",
      "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.\n",
      "\tStage 4 Generators:\n",
      "\t\tFitting DropUniqueFeatureGenerator...\n",
      "\tStage 5 Generators:\n",
      "\t\tFitting DropDuplicatesFeatureGenerator...\n",
      "\tTypes of features in original data (raw dtype, special dtypes):\n",
      "\t\t('float', []) : 4 | ['displacement', 'horsepower', 'weight', 'acceleration']\n",
      "\t\t('int', [])   : 3 | ['cylinders', 'model_year', 'origin']\n",
      "\tTypes of features in processed data (raw dtype, special dtypes):\n",
      "\t\t('float', []) : 4 | ['displacement', 'horsepower', 'weight', 'acceleration']\n",
      "\t\t('int', [])   : 3 | ['cylinders', 'model_year', 'origin']\n",
      "\t0.0s = Fit runtime\n",
      "\t7 features in original data used to generate 7 features in processed data.\n",
      "\tTrain Data (Processed) Memory Usage: 0.02 MB (0.0% of available memory)\n",
      "Data preprocessing and feature engineering runtime = 0.03s ...\n",
      "AutoGluon will gauge predictive performance using evaluation metric: 'mean_squared_error'\n",
      "\tThis metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.\n",
      "\tTo change this, specify the eval_metric parameter of Predictor()\n",
      "use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).\n",
      "User-specified model hyperparameters to be fit:\n",
      "{\n",
      "\t'RF': {},\n",
      "}\n",
      "Fitting 1 L1 models ...\n",
      "Hyperparameter tuning model: RandomForest_BAG_L1 ...\n",
      "\tNo hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. Will train one model based on the provided hyperparameters.\n",
      "Fitted model: RandomForest_BAG_L1 ...\n",
      "\t-8.1918\t = Validation score   (-mean_squared_error)\n",
      "\t0.28s\t = Training   runtime\n",
      "\t0.0s\t = Validation runtime\n",
      "Fitting model: WeightedEnsemble_L2 ...\n",
      "\t-11.4141\t = Validation score   (-mean_squared_error)\n",
      "\t0.0s\t = Training   runtime\n",
      "\t0.0s\t = Validation runtime\n",
      "AutoGluon training complete, total runtime = 0.35s ... Best model: \"WeightedEnsemble_L2\"\n",
      "TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/AutoGluon/Random Forest/\")\n",
      "Training mse loss: 1.07741\n",
      "Validation mse loss: 11.41409\n",
      "Testing mse loss: 4.19031\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')\n",
      "\n",
      "-------------AutoGluon End-------------\n",
      "\n"
     ]
    }
   ],
   "source": [
    "trainer.train(stderr_to_stdout=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "After training finishes, check the leaderboard to see their performance.\n",
    "\n",
    "Metrics used in leaderboards can be found in `tabensemb.utils.utils.REGRESSION_METRICS/BINARY_METRICS/MULTICLASS_METRICS`. Most of the metrics are from `sklearn.metrics`."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PytorchTabular metrics\n",
      "Category Embedding 1/1\n",
      "WideDeep metrics\n",
      "TabMlp 1/1\n",
      "AutoGluon metrics\n",
      "Random Forest 1/1\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')\n"
     ]
    },
    {
     "data": {
      "text/plain": "          Program               Model  Training RMSE  Training MSE  \\\n0       AutoGluon       Random Forest       1.037981      1.077405   \n1        WideDeep              TabMlp       3.189102     10.170372   \n2  PytorchTabular  Category Embedding       3.354362     11.251746   \n\n   Training MAE  Training MAPE  Training R2  Training MEDIAN_ABSOLUTE_ERROR  \\\n0      0.741566       0.031074     0.983285                        0.529500   \n1      2.318564       0.096454     0.842218                        1.669983   \n2      2.445915       0.101659     0.825442                        1.775388   \n\n   Training EXPLAINED_VARIANCE_SCORE  Testing RMSE  ...  Testing R2  \\\n0                           0.983293      2.047025  ...    0.922065   \n1                           0.859805      2.537431  ...    0.880250   \n2                           0.854523      2.799644  ...    0.854221   \n\n   Testing MEDIAN_ABSOLUTE_ERROR  Testing EXPLAINED_VARIANCE_SCORE  \\\n0                       1.156333                          0.922591   \n1                       1.767459                          0.900587   \n2                       1.963455                          0.888258   \n\n   Validation RMSE  Validation MSE  Validation MAE  Validation MAPE  \\\n0         3.378475       11.414091        2.269187         0.102995   \n1         3.415071       11.662707        2.539188         0.116035   \n2         3.516710       12.367250        2.731159         0.125136   \n\n   Validation R2  Validation MEDIAN_ABSOLUTE_ERROR  \\\n0       0.796098                          1.641334   \n1       0.791657                          1.904160   \n2       0.779071                          2.375105   \n\n   Validation EXPLAINED_VARIANCE_SCORE  \n0                             0.796506  \n1                             0.806152  \n2                             0.808039  \n\n[3 rows x 23 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Program</th>\n      <th>Model</th>\n      <th>Training RMSE</th>\n      <th>Training MSE</th>\n      <th>Training MAE</th>\n      <th>Training MAPE</th>\n      <th>Training R2</th>\n      <th>Training MEDIAN_ABSOLUTE_ERROR</th>\n      <th>Training EXPLAINED_VARIANCE_SCORE</th>\n      <th>Testing RMSE</th>\n      <th>...</th>\n      <th>Testing R2</th>\n      <th>Testing MEDIAN_ABSOLUTE_ERROR</th>\n      <th>Testing EXPLAINED_VARIANCE_SCORE</th>\n      <th>Validation RMSE</th>\n      <th>Validation MSE</th>\n      <th>Validation MAE</th>\n      <th>Validation MAPE</th>\n      <th>Validation R2</th>\n      <th>Validation MEDIAN_ABSOLUTE_ERROR</th>\n      <th>Validation EXPLAINED_VARIANCE_SCORE</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>AutoGluon</td>\n      <td>Random Forest</td>\n      <td>1.037981</td>\n      <td>1.077405</td>\n      <td>0.741566</td>\n      <td>0.031074</td>\n      <td>0.983285</td>\n      <td>0.529500</td>\n      <td>0.983293</td>\n      <td>2.047025</td>\n      <td>...</td>\n      <td>0.922065</td>\n      <td>1.156333</td>\n      <td>0.922591</td>\n      <td>3.378475</td>\n      <td>11.414091</td>\n      <td>2.269187</td>\n      <td>0.102995</td>\n      <td>0.796098</td>\n      <td>1.641334</td>\n      <td>0.796506</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>WideDeep</td>\n      <td>TabMlp</td>\n      <td>3.189102</td>\n      <td>10.170372</td>\n      <td>2.318564</td>\n      <td>0.096454</td>\n      <td>0.842218</td>\n      <td>1.669983</td>\n      <td>0.859805</td>\n      <td>2.537431</td>\n      <td>...</td>\n      <td>0.880250</td>\n      <td>1.767459</td>\n      <td>0.900587</td>\n      <td>3.415071</td>\n      <td>11.662707</td>\n      <td>2.539188</td>\n      <td>0.116035</td>\n      <td>0.791657</td>\n      <td>1.904160</td>\n      <td>0.806152</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>PytorchTabular</td>\n      <td>Category Embedding</td>\n      <td>3.354362</td>\n      <td>11.251746</td>\n      <td>2.445915</td>\n      <td>0.101659</td>\n      <td>0.825442</td>\n      <td>1.775388</td>\n      <td>0.854523</td>\n      <td>2.799644</td>\n      <td>...</td>\n      <td>0.854221</td>\n      <td>1.963455</td>\n      <td>0.888258</td>\n      <td>3.516710</td>\n      <td>12.367250</td>\n      <td>2.731159</td>\n      <td>0.125136</td>\n      <td>0.779071</td>\n      <td>2.375105</td>\n      <td>0.808039</td>\n    </tr>\n  </tbody>\n</table>\n<p>3 rows × 23 columns</p>\n</div>"
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.get_leaderboard()"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Binary classification\n",
    "\n",
    "As a showcase for binary classification, we use the Adult dataset from UCI datasets. Note that the Adult dataset has an individual testing set, which will be discussed in the \"Inference on an upcoming dataset\" part."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://archive.ics.uci.edu/static/public/2/adult.zip to /tmp/tmpqcxgn2l1/data/Adult.zip\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/xlluo/hdd/tabular_ensemble/tabensemb/config/user_config.py:292: UserWarning: There exists .test file(s) ['adult.test'] which should be used for final metrics. The .zip file is left for the user to process.\n",
      "  warnings.warn(\n",
      "/home/xlluo/hdd/tabular_ensemble/tabensemb/utils/utils.py:464: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n",
      "  df = pd.read_csv(StringIO(s), names=names, sep=sep)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "age is Integer and will be treated as a continuous feature.\n",
      "fnlwgt is Integer and will be treated as a continuous feature.\n",
      "education-num is Integer and will be treated as a continuous feature.\n",
      "capital-gain is Integer and will be treated as a continuous feature.\n",
      "capital-loss is Integer and will be treated as a continuous feature.\n",
      "hours-per-week is Integer and will be treated as a continuous feature.\n",
      "The project will be saved to /tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig\n",
      "Dataset size: 19536 6512 6513\n",
      "Data saved to /tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig (data.csv and tabular_data.csv).\n",
      "\n",
      "-------------Run PytorchTabular-------------\n",
      "\n",
      "Training Category Embedding\n",
      "Global seed set to 42\n",
      "2023-09-23 20:36:17,315 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders\n",
      "2023-09-23 20:36:17,317 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task\n",
      "2023-09-23 20:36:17,382 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel\n",
      "2023-09-23 20:36:17,412 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer\n",
      "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.\n",
      "  rank_zero_deprecation(\n",
      "Auto select gpus: [0]\n",
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "IPU available: False, using: 0 IPUs\n",
      "HPU available: False, using: 0 HPUs\n",
      "2023-09-23 20:36:17,437 - {pytorch_tabular.tabular_model:582} - INFO - Training Started\n",
      "You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n",
      "\n",
      "  | Name             | Type                      | Params\n",
      "---------------------------------------------------------------\n",
      "0 | _backbone        | CategoryEmbeddingBackbone | 18.5 K\n",
      "1 | _embedding_layer | Embedding1dLayer          | 1.4 K \n",
      "2 | head             | LinearHead                | 66    \n",
      "3 | loss             | CrossEntropyLoss          | 0     \n",
      "---------------------------------------------------------------\n",
      "20.0 K    Trainable params\n",
      "0         Non-trainable params\n",
      "20.0 K    Total params\n",
      "0.080     Total estimated model params size (MB)\n",
      "Epoch: 1/300, Train loss: 0.4666, Val loss: 0.3794, Min val loss: 0.3794, Epoch time: 0.465s.\n",
      "Epoch: 20/300, Train loss: 0.3112, Val loss: 0.3184, Min val loss: 0.3177, Epoch time: 0.472s.\n",
      "Epoch: 40/300, Train loss: 0.2972, Val loss: 0.3213, Min val loss: 0.3162, Epoch time: 0.435s.\n",
      "Epoch: 60/300, Train loss: 0.2864, Val loss: 0.3248, Min val loss: 0.3162, Epoch time: 0.476s.\n",
      "Epoch: 80/300, Train loss: 0.2766, Val loss: 0.3323, Min val loss: 0.3162, Epoch time: 0.530s.\n",
      "Epoch: 100/300, Train loss: 0.2677, Val loss: 0.3416, Min val loss: 0.3162, Epoch time: 0.385s.\n",
      "Epoch: 120/300, Train loss: 0.2602, Val loss: 0.3484, Min val loss: 0.3162, Epoch time: 0.527s.\n",
      "2023-09-23 20:37:10,637 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed\n",
      "2023-09-23 20:37:10,638 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model\n",
      "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.\n",
      "  rank_zero_deprecation(\n",
      "Training log_loss loss: 0.28862\n",
      "Validation log_loss loss: 0.31619\n",
      "Testing log_loss loss: 0.31114\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/trainer.pkl')\n",
      "\n",
      "-------------PytorchTabular End-------------\n",
      "\n",
      "\n",
      "-------------Run WideDeep-------------\n",
      "\n",
      "Training TabMlp\n",
      "Epoch: 1/300, Train loss: 0.5206, Val loss: 0.4274, Min val loss: 0.4274\n",
      "Epoch: 21/300, Train loss: 0.3145, Val loss: 0.3161, Min val loss: 0.3161\n",
      "Epoch: 41/300, Train loss: 0.3063, Val loss: 0.3192, Min val loss: 0.3161\n",
      "Epoch: 61/300, Train loss: 0.2984, Val loss: 0.3209, Min val loss: 0.3161\n",
      "Epoch: 81/300, Train loss: 0.2905, Val loss: 0.3227, Min val loss: 0.3161\n",
      "Epoch: 101/300, Train loss: 0.2819, Val loss: 0.3266, Min val loss: 0.3161\n",
      "Epoch: 121/300, Train loss: 0.2754, Val loss: 0.3293, Min val loss: 0.3161\n",
      "Epoch 00121: early stopping\n",
      "Restoring model weights from the end of the best epoch\n",
      "Training log_loss loss: 0.30084\n",
      "Validation log_loss loss: 0.31729\n",
      "Testing log_loss loss: 0.31133\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/trainer.pkl')\n",
      "\n",
      "-------------WideDeep End-------------\n",
      "\n",
      "\n",
      "-------------Run AutoGluon-------------\n",
      "\n",
      "Training Random Forest\n",
      "Presets specified: ['best_quality']\n",
      "Warning: hyperparameter tuning is currently experimental and may cause the process to hang.\n",
      "Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1\n",
      "Beginning AutoGluon training ...\n",
      "AutoGluon will save models to \"/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/AutoGluon/Random Forest/\"\n",
      "AutoGluon Version:  0.8.2\n",
      "Python Version:     3.10.12\n",
      "Operating System:   Linux\n",
      "Platform Machine:   x86_64\n",
      "Platform Version:   #1 SMP Mon Dec 13 20:27:58 CST 2021\n",
      "Disk Space Avail:   148.97 GB / 502.47 GB (29.6%)\n",
      "Train Data Rows:    19536\n",
      "Train Data Columns: 14\n",
      "Tuning Data Rows:    6512\n",
      "Tuning Data Columns: 14\n",
      "Label Column: income\n",
      "Preprocessing data ...\n",
      "Selected class <--> label mapping:  class 1 = 1, class 0 = 0\n",
      "Using Feature Generators to preprocess the data ...\n",
      "Fitting PipelineFeatureGenerator...\n",
      "\tAvailable Memory:                    6146.04 MB\n",
      "\tTrain Data (Original)  Memory Usage: 15.07 MB (0.2% of available memory)\n",
      "\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n",
      "\tStage 1 Generators:\n",
      "\t\tFitting AsTypeFeatureGenerator...\n",
      "\t\t\tNote: Converting 1 features to boolean dtype as they only contain 2 unique values.\n",
      "\tStage 2 Generators:\n",
      "\t\tFitting FillNaFeatureGenerator...\n",
      "\tStage 3 Generators:\n",
      "\t\tFitting IdentityFeatureGenerator...\n",
      "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.\n",
      "\t\tFitting CategoryFeatureGenerator...\n",
      "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.\n",
      "\t\t\tFitting CategoryMemoryMinimizeFeatureGenerator...\n",
      "\tStage 4 Generators:\n",
      "\t\tFitting DropUniqueFeatureGenerator...\n",
      "\tStage 5 Generators:\n",
      "\t\tFitting DropDuplicatesFeatureGenerator...\n",
      "\tTypes of features in original data (raw dtype, special dtypes):\n",
      "\t\t('int', [])    : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n",
      "\t\t('object', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]\n",
      "\tTypes of features in processed data (raw dtype, special dtypes):\n",
      "\t\t('category', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]\n",
      "\t\t('int', [])      : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n",
      "\t0.2s = Fit runtime\n",
      "\t14 features in original data used to generate 14 features in processed data.\n",
      "\tTrain Data (Processed) Memory Usage: 1.46 MB (0.0% of available memory)\n",
      "Data preprocessing and feature engineering runtime = 0.18s ...\n",
      "AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n",
      "\tTo change this, specify the eval_metric parameter of Predictor()\n",
      "use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).\n",
      "User-specified model hyperparameters to be fit:\n",
      "{\n",
      "\t'RF': {},\n",
      "}\n",
      "Fitting 1 L1 models ...\n",
      "Hyperparameter tuning model: RandomForest_BAG_L1 ...\n",
      "\tNo hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. Will train one model based on the provided hyperparameters.\n",
      "Fitted model: RandomForest_BAG_L1 ...\n",
      "\t0.8543\t = Validation score   (accuracy)\n",
      "\t1.92s\t = Training   runtime\n",
      "\t0.0s\t = Validation runtime\n",
      "Fitting model: WeightedEnsemble_L2 ...\n",
      "\t0.8538\t = Validation score   (accuracy)\n",
      "\t0.0s\t = Training   runtime\n",
      "\t0.01s\t = Validation runtime\n",
      "AutoGluon training complete, total runtime = 2.4s ... Best model: \"WeightedEnsemble_L2\"\n",
      "TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/AutoGluon/Random Forest/\")\n",
      "Training log_loss loss: 0.08021\n",
      "Validation log_loss loss: 0.31802\n",
      "Testing log_loss loss: 0.31261\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/trainer.pkl')\n",
      "\n",
      "-------------AutoGluon End-------------\n",
      "\n",
      "PytorchTabular metrics\n",
      "Category Embedding 1/1\n",
      "WideDeep metrics\n",
      "TabMlp 1/1\n",
      "AutoGluon metrics\n",
      "Random Forest 1/1\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/trainer.pkl')\n"
     ]
    },
    {
     "data": {
      "text/plain": "          Program               Model  Training F1_SCORE  \\\n0        WideDeep              TabMlp           0.694200   \n1       AutoGluon       Random Forest           1.000000   \n2  PytorchTabular  Category Embedding           0.709806   \n\n   Training PRECISION_SCORE  Training RECALL_SCORE  Training JACCARD_SCORE  \\\n0                  0.728505               0.662981                0.531628   \n1                  1.000000               1.000000                1.000000   \n2                  0.738341               0.683394                0.550154   \n\n   Training ACCURACY_SCORE  Training BALANCED_ACCURACY_SCORE  \\\n0                 0.859388                          0.792321   \n1                 1.000000                          1.000000   \n2                 0.865479                          0.803303   \n\n   Training COHEN_KAPPA_SCORE  Training HAMMING_LOSS  ...  \\\n0                    0.603167               0.140612  ...   \n1                    1.000000               0.000000  ...   \n2                    0.622423               0.134521  ...   \n\n   Validation ACCURACY_SCORE  Validation BALANCED_ACCURACY_SCORE  \\\n0                   0.852426                            0.784474   \n1                   0.853808                            0.776665   \n2                   0.850430                            0.784467   \n\n   Validation COHEN_KAPPA_SCORE  Validation HAMMING_LOSS  \\\n0                      0.584884                 0.147574   \n1                      0.580404                 0.146192   \n2                      0.581612                 0.149570   \n\n   Validation MATTHEWS_CORRCOEF  Validation ZERO_ONE_LOSS  \\\n0                      0.585738                  0.147574   \n1                      0.583003                  0.146192   \n2                      0.582150                  0.149570   \n\n   Validation ROC_AUC_SCORE  Validation LOG_LOSS  Validation BRIER_SCORE_LOSS  \\\n0                  0.908951             0.317288                     0.101612   \n1                  0.907010             0.318016                     0.100486   \n2                  0.909318             0.316194                     0.101722   \n\n   Validation AVERAGE_PRECISION_SCORE  \n0                            0.868420  \n1                            0.875084  \n2                            0.868410  \n\n[3 rows x 44 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Program</th>\n      <th>Model</th>\n      <th>Training F1_SCORE</th>\n      <th>Training PRECISION_SCORE</th>\n      <th>Training RECALL_SCORE</th>\n      <th>Training JACCARD_SCORE</th>\n      <th>Training ACCURACY_SCORE</th>\n      <th>Training BALANCED_ACCURACY_SCORE</th>\n      <th>Training COHEN_KAPPA_SCORE</th>\n      <th>Training HAMMING_LOSS</th>\n      <th>...</th>\n      <th>Validation ACCURACY_SCORE</th>\n      <th>Validation BALANCED_ACCURACY_SCORE</th>\n      <th>Validation COHEN_KAPPA_SCORE</th>\n      <th>Validation HAMMING_LOSS</th>\n      <th>Validation MATTHEWS_CORRCOEF</th>\n      <th>Validation ZERO_ONE_LOSS</th>\n      <th>Validation ROC_AUC_SCORE</th>\n      <th>Validation LOG_LOSS</th>\n      <th>Validation BRIER_SCORE_LOSS</th>\n      <th>Validation AVERAGE_PRECISION_SCORE</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>WideDeep</td>\n      <td>TabMlp</td>\n      <td>0.694200</td>\n      <td>0.728505</td>\n      <td>0.662981</td>\n      <td>0.531628</td>\n      <td>0.859388</td>\n      <td>0.792321</td>\n      <td>0.603167</td>\n      <td>0.140612</td>\n      <td>...</td>\n      <td>0.852426</td>\n      <td>0.784474</td>\n      <td>0.584884</td>\n      <td>0.147574</td>\n      <td>0.585738</td>\n      <td>0.147574</td>\n      <td>0.908951</td>\n      <td>0.317288</td>\n      <td>0.101612</td>\n      <td>0.868420</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>AutoGluon</td>\n      <td>Random Forest</td>\n      <td>1.000000</td>\n      <td>1.000000</td>\n      <td>1.000000</td>\n      <td>1.000000</td>\n      <td>1.000000</td>\n      <td>1.000000</td>\n      <td>1.000000</td>\n      <td>0.000000</td>\n      <td>...</td>\n      <td>0.853808</td>\n      <td>0.776665</td>\n      <td>0.580404</td>\n      <td>0.146192</td>\n      <td>0.583003</td>\n      <td>0.146192</td>\n      <td>0.907010</td>\n      <td>0.318016</td>\n      <td>0.100486</td>\n      <td>0.875084</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>PytorchTabular</td>\n      <td>Category Embedding</td>\n      <td>0.709806</td>\n      <td>0.738341</td>\n      <td>0.683394</td>\n      <td>0.550154</td>\n      <td>0.865479</td>\n      <td>0.803303</td>\n      <td>0.622423</td>\n      <td>0.134521</td>\n      <td>...</td>\n      <td>0.850430</td>\n      <td>0.784467</td>\n      <td>0.581612</td>\n      <td>0.149570</td>\n      <td>0.582150</td>\n      <td>0.149570</td>\n      <td>0.909318</td>\n      <td>0.316194</td>\n      <td>0.101722</td>\n      <td>0.868410</td>\n    </tr>\n  </tbody>\n</table>\n<p>3 rows × 44 columns</p>\n</div>"
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer = Trainer(device=device)\n",
    "adult_columns = [\n",
    "    \"age\",\n",
    "    \"workclass\",\n",
    "    \"fnlwgt\",\n",
    "    \"education\",\n",
    "    \"education-num\",\n",
    "    \"marital-status\",\n",
    "    \"occupation\",\n",
    "    \"relationship\",\n",
    "    \"race\",\n",
    "    \"sex\",\n",
    "    \"capital-gain\",\n",
    "    \"capital-loss\",\n",
    "    \"hours-per-week\",\n",
    "    \"native-country\",\n",
    "    \"income\",\n",
    "]\n",
    "cfg = UserConfig.from_uci(\"Adult\", column_names=adult_columns, sep=\", \")\n",
    "trainer.load_config(cfg)\n",
    "trainer.load_data()\n",
    "models = [\n",
    "    PytorchTabular(trainer, model_subset=[\"Category Embedding\"]),\n",
    "    WideDeep(trainer, model_subset=[\"TabMlp\"]),\n",
    "    AutoGluon(trainer, model_subset=[\"Random Forest\"]),\n",
    "]\n",
    "trainer.add_modelbases(models)\n",
    "trainer.train(stderr_to_stdout=True)\n",
    "trainer.get_leaderboard()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Multiclass classification\n",
    "\n",
    "Iris is a famous multiclass classification task. It is also loaded from UCI datasets. We gave the argument `column_names` to `from_uci` in the above examples. If we do not know the column labels, column names from the UCI website are used (whose order might be wrong, such as those for the Auto MPG dataset) and the downloaded archive will not be removed after `from_uci`. There should be a file named `xxx.name` in the archive with column names in it."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://archive.ics.uci.edu/static/public/53/iris.zip to /tmp/tmpqcxgn2l1/data/Iris.zip\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/xlluo/hdd/tabular_ensemble/tabensemb/config/user_config.py:323: UserWarning: `column_names` is not given. The order of columns will be loaded from the website. It is highly recommended to manually set column names. The downloaded .zip is saved. Please check its .name file for the correct order.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The project will be saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig\n",
      "Dataset size: 90 30 30\n",
      "Data saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig (data.csv and tabular_data.csv).\n",
      "\n",
      "-------------Run PytorchTabular-------------\n",
      "\n",
      "Training Category Embedding\n",
      "Global seed set to 42\n",
      "2023-09-23 20:37:51,096 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders\n",
      "2023-09-23 20:37:51,096 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task\n",
      "2023-09-23 20:37:51,106 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel\n",
      "2023-09-23 20:37:51,121 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer\n",
      "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.\n",
      "  rank_zero_deprecation(\n",
      "Auto select gpus: [0]\n",
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "IPU available: False, using: 0 IPUs\n",
      "HPU available: False, using: 0 HPUs\n",
      "2023-09-23 20:37:51,137 - {pytorch_tabular.tabular_model:582} - INFO - Training Started\n",
      "You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n",
      "\n",
      "  | Name             | Type                      | Params\n",
      "---------------------------------------------------------------\n",
      "0 | _backbone        | CategoryEmbeddingBackbone | 11.0 K\n",
      "1 | _embedding_layer | Embedding1dLayer          | 8     \n",
      "2 | head             | LinearHead                | 99    \n",
      "3 | loss             | CrossEntropyLoss          | 0     \n",
      "---------------------------------------------------------------\n",
      "11.1 K    Trainable params\n",
      "0         Non-trainable params\n",
      "11.1 K    Total params\n",
      "0.044     Total estimated model params size (MB)\n",
      "Epoch: 1/300, Train loss: 1.7917, Val loss: 1.4287, Min val loss: 1.4287, Epoch time: 0.025s.\n",
      "Epoch: 20/300, Train loss: 0.3055, Val loss: 0.6197, Min val loss: 0.6197, Epoch time: 0.019s.\n",
      "Epoch: 40/300, Train loss: 0.2105, Val loss: 0.5668, Min val loss: 0.5668, Epoch time: 0.016s.\n",
      "Epoch: 60/300, Train loss: 0.1510, Val loss: 0.5234, Min val loss: 0.5234, Epoch time: 0.013s.\n",
      "Epoch: 80/300, Train loss: 0.1741, Val loss: 0.5314, Min val loss: 0.5216, Epoch time: 0.010s.\n",
      "Epoch: 100/300, Train loss: 0.0870, Val loss: 0.4985, Min val loss: 0.4853, Epoch time: 0.013s.\n",
      "Epoch: 120/300, Train loss: 0.0437, Val loss: 0.5143, Min val loss: 0.4853, Epoch time: 0.027s.\n",
      "Epoch: 140/300, Train loss: 0.0248, Val loss: 0.4864, Min val loss: 0.4844, Epoch time: 0.017s.\n",
      "Epoch: 160/300, Train loss: 0.0663, Val loss: 0.5182, Min val loss: 0.4506, Epoch time: 0.012s.\n",
      "Epoch: 180/300, Train loss: 0.0457, Val loss: 0.5648, Min val loss: 0.4506, Epoch time: 0.018s.\n",
      "Epoch: 200/300, Train loss: 0.0188, Val loss: 0.4554, Min val loss: 0.4319, Epoch time: 0.012s.\n",
      "Epoch: 220/300, Train loss: 0.0421, Val loss: 0.4980, Min val loss: 0.4269, Epoch time: 0.010s.\n",
      "Epoch: 240/300, Train loss: 0.0202, Val loss: 0.4948, Min val loss: 0.4187, Epoch time: 0.011s.\n",
      "Epoch: 260/300, Train loss: 0.0269, Val loss: 0.4776, Min val loss: 0.4187, Epoch time: 0.035s.\n",
      "Epoch: 280/300, Train loss: 0.0621, Val loss: 0.4086, Min val loss: 0.3670, Epoch time: 0.013s.\n",
      "Epoch: 300/300, Train loss: 0.0098, Val loss: 0.4748, Min val loss: 0.3670, Epoch time: 0.011s.\n",
      "`Trainer.fit` stopped: `max_epochs=300` reached.\n",
      "2023-09-23 20:37:56,026 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed\n",
      "2023-09-23 20:37:56,027 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model\n",
      "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.\n",
      "  rank_zero_deprecation(\n",
      "Training log_loss loss: 0.01391\n",
      "Validation log_loss loss: 0.36698\n",
      "Testing log_loss loss: 0.14373\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/trainer.pkl')\n",
      "\n",
      "-------------PytorchTabular End-------------\n",
      "\n",
      "\n",
      "-------------Run WideDeep-------------\n",
      "\n",
      "Training TabMlp\n",
      "Epoch: 1/300, Train loss: 1.0850, Val loss: 1.0164, Min val loss: 1.0164\n",
      "Epoch: 21/300, Train loss: 0.3373, Val loss: 0.5153, Min val loss: 0.5153\n",
      "Epoch: 41/300, Train loss: 0.1735, Val loss: 0.3780, Min val loss: 0.3780\n",
      "Epoch: 61/300, Train loss: 0.1253, Val loss: 0.3279, Min val loss: 0.3165\n",
      "Epoch: 81/300, Train loss: 0.0960, Val loss: 0.3603, Min val loss: 0.2951\n",
      "Epoch: 101/300, Train loss: 0.1037, Val loss: 0.3682, Min val loss: 0.2951\n",
      "Epoch: 121/300, Train loss: 0.1171, Val loss: 0.3525, Min val loss: 0.2951\n",
      "Epoch: 141/300, Train loss: 0.0503, Val loss: 0.3754, Min val loss: 0.2951\n",
      "Epoch: 161/300, Train loss: 0.0421, Val loss: 0.3265, Min val loss: 0.2951\n",
      "Epoch 00167: early stopping\n",
      "Restoring model weights from the end of the best epoch\n",
      "Training log_loss loss: 0.06156\n",
      "Validation log_loss loss: 0.29513\n",
      "Testing log_loss loss: 0.11595\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/trainer.pkl')\n",
      "\n",
      "-------------WideDeep End-------------\n",
      "\n",
      "\n",
      "-------------Run AutoGluon-------------\n",
      "\n",
      "Training Random Forest\n",
      "Presets specified: ['best_quality']\n",
      "Warning: hyperparameter tuning is currently experimental and may cause the process to hang.\n",
      "Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=1\n",
      "Beginning AutoGluon training ...\n",
      "AutoGluon will save models to \"/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/AutoGluon/Random Forest/\"\n",
      "AutoGluon Version:  0.8.2\n",
      "Python Version:     3.10.12\n",
      "Operating System:   Linux\n",
      "Platform Machine:   x86_64\n",
      "Platform Version:   #1 SMP Mon Dec 13 20:27:58 CST 2021\n",
      "Disk Space Avail:   148.84 GB / 502.47 GB (29.6%)\n",
      "Train Data Rows:    90\n",
      "Train Data Columns: 4\n",
      "Tuning Data Rows:    30\n",
      "Tuning Data Columns: 4\n",
      "Label Column: class\n",
      "Preprocessing data ...\n",
      "Train Data Class Count: 3\n",
      "Using Feature Generators to preprocess the data ...\n",
      "Fitting PipelineFeatureGenerator...\n",
      "\tAvailable Memory:                    4021.68 MB\n",
      "\tTrain Data (Original)  Memory Usage: 0.0 MB (0.0% of available memory)\n",
      "\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n",
      "\tStage 1 Generators:\n",
      "\t\tFitting AsTypeFeatureGenerator...\n",
      "\tStage 2 Generators:\n",
      "\t\tFitting FillNaFeatureGenerator...\n",
      "\tStage 3 Generators:\n",
      "\t\tFitting IdentityFeatureGenerator...\n",
      "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.\n",
      "\tStage 4 Generators:\n",
      "\t\tFitting DropUniqueFeatureGenerator...\n",
      "\tStage 5 Generators:\n",
      "\t\tFitting DropDuplicatesFeatureGenerator...\n",
      "\tTypes of features in original data (raw dtype, special dtypes):\n",
      "\t\t('float', []) : 4 | ['sepal length', 'sepal width', 'petal length', 'petal width']\n",
      "\tTypes of features in processed data (raw dtype, special dtypes):\n",
      "\t\t('float', []) : 4 | ['sepal length', 'sepal width', 'petal length', 'petal width']\n",
      "\t0.0s = Fit runtime\n",
      "\t4 features in original data used to generate 4 features in processed data.\n",
      "\tTrain Data (Processed) Memory Usage: 0.0 MB (0.0% of available memory)\n",
      "Data preprocessing and feature engineering runtime = 0.06s ...\n",
      "AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n",
      "\tTo change this, specify the eval_metric parameter of Predictor()\n",
      "use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).\n",
      "User-specified model hyperparameters to be fit:\n",
      "{\n",
      "\t'RF': {},\n",
      "}\n",
      "Fitting 1 L1 models ...\n",
      "Hyperparameter tuning model: RandomForest_BAG_L1 ...\n",
      "\tNo hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. Will train one model based on the provided hyperparameters.\n",
      "Fitted model: RandomForest_BAG_L1 ...\n",
      "\t1.0\t = Validation score   (accuracy)\n",
      "\t0.57s\t = Training   runtime\n",
      "\t0.0s\t = Validation runtime\n",
      "Fitting model: WeightedEnsemble_L2 ...\n",
      "\t0.8\t = Validation score   (accuracy)\n",
      "\t0.0s\t = Training   runtime\n",
      "\t0.0s\t = Validation runtime\n",
      "AutoGluon training complete, total runtime = 0.69s ... Best model: \"WeightedEnsemble_L2\"\n",
      "TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/AutoGluon/Random Forest/\")\n",
      "Training log_loss loss: 0.01002\n",
      "Validation log_loss loss: 0.78155\n",
      "Testing log_loss loss: 0.04824\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/trainer.pkl')\n",
      "\n",
      "-------------AutoGluon End-------------\n",
      "\n",
      "PytorchTabular metrics\n",
      "Category Embedding 1/1\n",
      "WideDeep metrics\n",
      "TabMlp 1/1\n",
      "AutoGluon metrics\n",
      "Random Forest 1/1\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/trainer.pkl')\n"
     ]
    },
    {
     "data": {
      "text/plain": "          Program               Model  Training ACCURACY_SCORE  \\\n0  PytorchTabular  Category Embedding                      1.0   \n1        WideDeep              TabMlp                      1.0   \n2       AutoGluon       Random Forest                      1.0   \n\n   Training BALANCED_ACCURACY_SCORE  Training COHEN_KAPPA_SCORE  \\\n0                               1.0                         1.0   \n1                               1.0                         1.0   \n2                               1.0                         1.0   \n\n   Training HAMMING_LOSS  Training MATTHEWS_CORRCOEF  Training ZERO_ONE_LOSS  \\\n0                    0.0                         1.0                     0.0   \n1                    0.0                         1.0                     0.0   \n2                    0.0                         1.0                     0.0   \n\n   Training PRECISION_SCORE_MACRO  Training PRECISION_SCORE_MICRO  ...  \\\n0                             1.0                             1.0  ...   \n1                             1.0                             1.0  ...   \n2                             1.0                             1.0  ...   \n\n   Validation F1_SCORE_MICRO  Validation F1_SCORE_WEIGHTED  \\\n0                   0.833333                      0.837232   \n1                   0.833333                      0.837232   \n2                   0.800000                      0.804615   \n\n   Validation JACCARD_SCORE_MACRO  Validation JACCARD_SCORE_MICRO  \\\n0                        0.756944                        0.714286   \n1                        0.756944                        0.714286   \n2                        0.721154                        0.666667   \n\n   Validation JACCARD_SCORE_WEIGHTED  Validation TOP_K_ACCURACY_SCORE  \\\n0                           0.732639                              1.0   \n1                           0.732639                              1.0   \n2                           0.689423                              1.0   \n\n   Validation LOG_LOSS  Validation ROC_AUC_SCORE_OVR_MACRO  \\\n0             0.366983                            0.974891   \n1             0.295129                            0.979747   \n2             0.781551                            0.950812   \n\n   Validation ROC_AUC_SCORE_OVR_WEIGHTED  Validation ROC_AUC_SCORE_OVO  \n0                               0.971616                      0.976042  \n1                               0.977576                      0.980833  \n2                               0.941465                      0.951042  \n\n[3 rows x 71 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Program</th>\n      <th>Model</th>\n      <th>Training ACCURACY_SCORE</th>\n      <th>Training BALANCED_ACCURACY_SCORE</th>\n      <th>Training COHEN_KAPPA_SCORE</th>\n      <th>Training HAMMING_LOSS</th>\n      <th>Training MATTHEWS_CORRCOEF</th>\n      <th>Training ZERO_ONE_LOSS</th>\n      <th>Training PRECISION_SCORE_MACRO</th>\n      <th>Training PRECISION_SCORE_MICRO</th>\n      <th>...</th>\n      <th>Validation F1_SCORE_MICRO</th>\n      <th>Validation F1_SCORE_WEIGHTED</th>\n      <th>Validation JACCARD_SCORE_MACRO</th>\n      <th>Validation JACCARD_SCORE_MICRO</th>\n      <th>Validation JACCARD_SCORE_WEIGHTED</th>\n      <th>Validation TOP_K_ACCURACY_SCORE</th>\n      <th>Validation LOG_LOSS</th>\n      <th>Validation ROC_AUC_SCORE_OVR_MACRO</th>\n      <th>Validation ROC_AUC_SCORE_OVR_WEIGHTED</th>\n      <th>Validation ROC_AUC_SCORE_OVO</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>PytorchTabular</td>\n      <td>Category Embedding</td>\n      <td>1.0</td>\n      <td>1.0</td>\n      <td>1.0</td>\n      <td>0.0</td>\n      <td>1.0</td>\n      <td>0.0</td>\n      <td>1.0</td>\n      <td>1.0</td>\n      <td>...</td>\n      <td>0.833333</td>\n      <td>0.837232</td>\n      <td>0.756944</td>\n      <td>0.714286</td>\n      <td>0.732639</td>\n      <td>1.0</td>\n      <td>0.366983</td>\n      <td>0.974891</td>\n      <td>0.971616</td>\n      <td>0.976042</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>WideDeep</td>\n      <td>TabMlp</td>\n      <td>1.0</td>\n      <td>1.0</td>\n      <td>1.0</td>\n      <td>0.0</td>\n      <td>1.0</td>\n      <td>0.0</td>\n      <td>1.0</td>\n      <td>1.0</td>\n      <td>...</td>\n      <td>0.833333</td>\n      <td>0.837232</td>\n      <td>0.756944</td>\n      <td>0.714286</td>\n      <td>0.732639</td>\n      <td>1.0</td>\n      <td>0.295129</td>\n      <td>0.979747</td>\n      <td>0.977576</td>\n      <td>0.980833</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>AutoGluon</td>\n      <td>Random Forest</td>\n      <td>1.0</td>\n      <td>1.0</td>\n      <td>1.0</td>\n      <td>0.0</td>\n      <td>1.0</td>\n      <td>0.0</td>\n      <td>1.0</td>\n      <td>1.0</td>\n      <td>...</td>\n      <td>0.800000</td>\n      <td>0.804615</td>\n      <td>0.721154</td>\n      <td>0.666667</td>\n      <td>0.689423</td>\n      <td>1.0</td>\n      <td>0.781551</td>\n      <td>0.950812</td>\n      <td>0.941465</td>\n      <td>0.951042</td>\n    </tr>\n  </tbody>\n</table>\n<p>3 rows × 71 columns</p>\n</div>"
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer = Trainer(device=device)\n",
    "cfg = UserConfig.from_uci(\"Iris\", datafile_name=\"iris\")\n",
    "trainer.load_config(cfg)\n",
    "trainer.load_data()\n",
    "models = [\n",
    "    PytorchTabular(trainer, model_subset=[\"Category Embedding\"]),\n",
    "    WideDeep(trainer, model_subset=[\"TabMlp\"]),\n",
    "    AutoGluon(trainer, model_subset=[\"Random Forest\"]),\n",
    "]\n",
    "trainer.add_modelbases(models)\n",
    "trainer.train(stderr_to_stdout=True)\n",
    "trainer.get_leaderboard()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Using a configuration file\n",
    "\n",
    "In the above introduction, we use UCI datasets whose configuration is automatically generated. The configuration can also be loaded from a local `.py` or `.json` file. To run a minimum example, we provide a randomly generated sample dataset (`data/sample.csv`) and its configuration file (`configs/sample.py`) in the repository. See \"Dataset and configuration\" for the detailed introduction of configuration files.\n",
    "\n",
    "`tabensemb` uses paths relative to the current directory. For different IDEs (PyCharm, VSCode, etc.), the directory might be different. Set default paths to desired ones after checking the current working directory using magic commands in notebooks like `!pwd` or scripts like `import os; os.getcwd()`."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "path = \"../../../../\"\n",
    "tabensemb.setting[\"default_config_path\"] = path + \"configs\"\n",
    "tabensemb.setting[\"default_data_path\"] = path + \"data\""
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "Load the configuration file `sample.py` using `Trainer.load_config`, which automatically searches the file in the current directory and `tabensemb.setting[\"default_config_path\"]`."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The project will be saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample\n",
      "Dataset size: 153 51 52\n",
      "Data saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample (data.csv and tabular_data.csv).\n"
     ]
    }
   ],
   "source": [
    "trainer.load_config(\"sample\")\n",
    "trainer.load_data()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Then initialize models:"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "outputs": [],
   "source": [
    "trainer.clear_modelbase()\n",
    "models = [\n",
    "    PytorchTabular(trainer, model_subset=[\"Category Embedding\"])\n",
    "]\n",
    "trainer.add_modelbases(models)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "*Optional*: For a quick development test, changing the following global setting significantly reduces training time."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "outputs": [],
   "source": [
    "tabensemb.setting[\"debug_mode\"] = True"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "-------------Run PytorchTabular-------------\n",
      "\n",
      "Training Category Embedding\n",
      "Global seed set to 42\n",
      "2023-09-23 20:37:59,305 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders\n",
      "2023-09-23 20:37:59,306 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task\n",
      "2023-09-23 20:37:59,326 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel\n",
      "2023-09-23 20:37:59,350 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer\n",
      "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.\n",
      "  rank_zero_deprecation(\n",
      "Auto select gpus: [0]\n",
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "IPU available: False, using: 0 IPUs\n",
      "HPU available: False, using: 0 HPUs\n",
      "2023-09-23 20:37:59,372 - {pytorch_tabular.tabular_model:582} - INFO - Training Started\n",
      "You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n",
      "\n",
      "  | Name             | Type                      | Params\n",
      "---------------------------------------------------------------\n",
      "0 | _backbone        | CategoryEmbeddingBackbone | 12.3 K\n",
      "1 | _embedding_layer | Embedding1dLayer          | 64    \n",
      "2 | head             | LinearHead                | 33    \n",
      "3 | loss             | MSELoss                   | 0     \n",
      "---------------------------------------------------------------\n",
      "12.4 K    Trainable params\n",
      "0         Non-trainable params\n",
      "12.4 K    Total params\n",
      "0.049     Total estimated model params size (MB)\n",
      "Epoch: 1/2, Train loss: 33183.6562, Val loss: 22223.0391, Min val loss: 22223.0391, Epoch time: 0.014s.\n",
      "`Trainer.fit` stopped: `max_epochs=2` reached.\n",
      "2023-09-23 20:37:59,437 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed\n",
      "2023-09-23 20:37:59,438 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model\n",
      "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.\n",
      "  rank_zero_deprecation(\n",
      "Training mse loss: 33085.08333\n",
      "Validation mse loss: 22182.62019\n",
      "Testing mse loss: 29810.51082\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample/trainer.pkl')\n",
      "\n",
      "-------------PytorchTabular End-------------\n",
      "\n",
      "PytorchTabular metrics\n",
      "Category Embedding 1/1\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample/trainer.pkl')\n"
     ]
    },
    {
     "data": {
      "text/plain": "          Program               Model  Training RMSE  Training MSE  \\\n0  PytorchTabular  Category Embedding     181.893055  33085.083331   \n\n   Training MAE  Training MAPE  Training R2  Training MEDIAN_ABSOLUTE_ERROR  \\\n0    145.409738       1.055588    -0.004359                      121.139843   \n\n   Training EXPLAINED_VARIANCE_SCORE  Testing RMSE  ...  Testing R2  \\\n0                           0.001236    172.657206  ...   -0.005851   \n\n   Testing MEDIAN_ABSOLUTE_ERROR  Testing EXPLAINED_VARIANCE_SCORE  \\\n0                     118.665751                         -0.001657   \n\n   Validation RMSE  Validation MSE  Validation MAE  Validation MAPE  \\\n0        148.93831    22182.620185      121.146176          1.00906   \n\n   Validation R2  Validation MEDIAN_ABSOLUTE_ERROR  \\\n0      -0.001817                         92.916794   \n\n   Validation EXPLAINED_VARIANCE_SCORE  \n0                             0.001214  \n\n[1 rows x 23 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Program</th>\n      <th>Model</th>\n      <th>Training RMSE</th>\n      <th>Training MSE</th>\n      <th>Training MAE</th>\n      <th>Training MAPE</th>\n      <th>Training R2</th>\n      <th>Training MEDIAN_ABSOLUTE_ERROR</th>\n      <th>Training EXPLAINED_VARIANCE_SCORE</th>\n      <th>Testing RMSE</th>\n      <th>...</th>\n      <th>Testing R2</th>\n      <th>Testing MEDIAN_ABSOLUTE_ERROR</th>\n      <th>Testing EXPLAINED_VARIANCE_SCORE</th>\n      <th>Validation RMSE</th>\n      <th>Validation MSE</th>\n      <th>Validation MAE</th>\n      <th>Validation MAPE</th>\n      <th>Validation R2</th>\n      <th>Validation MEDIAN_ABSOLUTE_ERROR</th>\n      <th>Validation EXPLAINED_VARIANCE_SCORE</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>PytorchTabular</td>\n      <td>Category Embedding</td>\n      <td>181.893055</td>\n      <td>33085.083331</td>\n      <td>145.409738</td>\n      <td>1.055588</td>\n      <td>-0.004359</td>\n      <td>121.139843</td>\n      <td>0.001236</td>\n      <td>172.657206</td>\n      <td>...</td>\n      <td>-0.005851</td>\n      <td>118.665751</td>\n      <td>-0.001657</td>\n      <td>148.93831</td>\n      <td>22182.620185</td>\n      <td>121.146176</td>\n      <td>1.00906</td>\n      <td>-0.001817</td>\n      <td>92.916794</td>\n      <td>0.001214</td>\n    </tr>\n  </tbody>\n</table>\n<p>1 rows × 23 columns</p>\n</div>"
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.train(stderr_to_stdout=True)\n",
    "trainer.get_leaderboard()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Clean the temporary directory of the notebook."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "outputs": [],
   "source": [
    "temp_path.cleanup()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}