{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Inference on an upcoming dataset\n",
    "\n",
    "In this part, we will simulate the real deployment of the package and make inferences on an upcoming dataset. We use the Adult dataset from UCI datasets which has an individual testing set.\n",
    "\n",
    "## Training models\n",
    "\n",
    "Similar to the first example, we initialize a `Trainer` and model bases, then train all models."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using cuda device\n",
      "Downloading https://archive.ics.uci.edu/static/public/2/adult.zip to /tmp/tmpkc_l71f1/data/Adult.zip\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/xlluo/hdd/tabular_ensemble/tabensemb/config/user_config.py:292: UserWarning: There exists .test file(s) ['adult.test'] which should be used for final metrics. The .zip file is left for the user to process.\n",
      "  warnings.warn(\n",
      "/home/xlluo/hdd/tabular_ensemble/tabensemb/utils/utils.py:464: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n",
      "  df = pd.read_csv(StringIO(s), names=names, sep=sep)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "age is Integer and will be treated as a continuous feature.\n",
      "fnlwgt is Integer and will be treated as a continuous feature.\n",
      "education-num is Integer and will be treated as a continuous feature.\n",
      "capital-gain is Integer and will be treated as a continuous feature.\n",
      "capital-loss is Integer and will be treated as a continuous feature.\n",
      "hours-per-week is Integer and will be treated as a continuous feature.\n",
      "The project will be saved to /tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig\n",
      "Dataset size: 19536 6512 6513\n",
      "Data saved to /tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig (data.csv and tabular_data.csv).\n",
      "\n",
      "-------------Run PytorchTabular-------------\n",
      "\n",
      "Training Category Embedding\n",
      "Global seed set to 42\n",
      "2023-09-23 20:35:00,222 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders\n",
      "2023-09-23 20:35:00,223 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task\n",
      "2023-09-23 20:35:00,288 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel\n",
      "2023-09-23 20:35:00,307 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer\n",
      "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.\n",
      "  rank_zero_deprecation(\n",
      "Auto select gpus: [0]\n",
      "GPU available: True (cuda), used: True\n",
      "TPU available: False, using: 0 TPU cores\n",
      "IPU available: False, using: 0 IPUs\n",
      "HPU available: False, using: 0 HPUs\n",
      "2023-09-23 20:35:01,129 - {pytorch_tabular.tabular_model:582} - INFO - Training Started\n",
      "You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n",
      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n",
      "\n",
      "  | Name             | Type                      | Params\n",
      "---------------------------------------------------------------\n",
      "0 | _backbone        | CategoryEmbeddingBackbone | 18.5 K\n",
      "1 | _embedding_layer | Embedding1dLayer          | 1.4 K \n",
      "2 | head             | LinearHead                | 66    \n",
      "3 | loss             | CrossEntropyLoss          | 0     \n",
      "---------------------------------------------------------------\n",
      "20.0 K    Trainable params\n",
      "0         Non-trainable params\n",
      "20.0 K    Total params\n",
      "0.080     Total estimated model params size (MB)\n",
      "Epoch: 1/300, Train loss: 0.4666, Val loss: 0.3794, Min val loss: 0.3794, Epoch time: 0.376s.\n",
      "Epoch: 20/300, Train loss: 0.3112, Val loss: 0.3184, Min val loss: 0.3177, Epoch time: 0.406s.\n",
      "Epoch: 40/300, Train loss: 0.2972, Val loss: 0.3213, Min val loss: 0.3162, Epoch time: 0.326s.\n",
      "Epoch: 60/300, Train loss: 0.2864, Val loss: 0.3248, Min val loss: 0.3162, Epoch time: 0.330s.\n",
      "Epoch: 80/300, Train loss: 0.2766, Val loss: 0.3323, Min val loss: 0.3162, Epoch time: 0.386s.\n",
      "Epoch: 100/300, Train loss: 0.2677, Val loss: 0.3416, Min val loss: 0.3162, Epoch time: 0.540s.\n",
      "Epoch: 120/300, Train loss: 0.2602, Val loss: 0.3484, Min val loss: 0.3162, Epoch time: 0.357s.\n",
      "2023-09-23 20:35:50,914 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed\n",
      "2023-09-23 20:35:50,915 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model\n",
      "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.\n",
      "  rank_zero_deprecation(\n",
      "Training log_loss loss: 0.28862\n",
      "Validation log_loss loss: 0.31619\n",
      "Testing log_loss loss: 0.31114\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')\n",
      "\n",
      "-------------PytorchTabular End-------------\n",
      "\n",
      "\n",
      "-------------Run WideDeep-------------\n",
      "\n",
      "Training TabMlp\n",
      "Epoch: 1/300, Train loss: 0.5206, Val loss: 0.4274, Min val loss: 0.4274\n",
      "Epoch: 21/300, Train loss: 0.3145, Val loss: 0.3161, Min val loss: 0.3161\n",
      "Epoch: 41/300, Train loss: 0.3063, Val loss: 0.3192, Min val loss: 0.3161\n",
      "Epoch: 61/300, Train loss: 0.2984, Val loss: 0.3209, Min val loss: 0.3161\n",
      "Epoch: 81/300, Train loss: 0.2905, Val loss: 0.3227, Min val loss: 0.3161\n",
      "Epoch: 101/300, Train loss: 0.2819, Val loss: 0.3266, Min val loss: 0.3161\n",
      "Epoch: 121/300, Train loss: 0.2754, Val loss: 0.3293, Min val loss: 0.3161\n",
      "Epoch 00121: early stopping\n",
      "Restoring model weights from the end of the best epoch\n",
      "Training log_loss loss: 0.30084\n",
      "Validation log_loss loss: 0.31729\n",
      "Testing log_loss loss: 0.31133\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')\n",
      "\n",
      "-------------WideDeep End-------------\n",
      "\n",
      "\n",
      "-------------Run AutoGluon-------------\n",
      "\n",
      "Training Random Forest\n",
      "Presets specified: ['best_quality']\n",
      "Warning: hyperparameter tuning is currently experimental and may cause the process to hang.\n",
      "Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1\n",
      "Beginning AutoGluon training ...\n",
      "AutoGluon will save models to \"/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/AutoGluon/Random Forest/\"\n",
      "AutoGluon Version:  0.8.2\n",
      "Python Version:     3.10.12\n",
      "Operating System:   Linux\n",
      "Platform Machine:   x86_64\n",
      "Platform Version:   #1 SMP Mon Dec 13 20:27:58 CST 2021\n",
      "Disk Space Avail:   149.26 GB / 502.47 GB (29.7%)\n",
      "Train Data Rows:    19536\n",
      "Train Data Columns: 14\n",
      "Tuning Data Rows:    6512\n",
      "Tuning Data Columns: 14\n",
      "Label Column: income\n",
      "Preprocessing data ...\n",
      "Selected class <--> label mapping:  class 1 = 1, class 0 = 0\n",
      "Using Feature Generators to preprocess the data ...\n",
      "Fitting PipelineFeatureGenerator...\n",
      "\tAvailable Memory:                    12442.89 MB\n",
      "\tTrain Data (Original)  Memory Usage: 15.07 MB (0.1% of available memory)\n",
      "\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n",
      "\tStage 1 Generators:\n",
      "\t\tFitting AsTypeFeatureGenerator...\n",
      "\t\t\tNote: Converting 1 features to boolean dtype as they only contain 2 unique values.\n",
      "\tStage 2 Generators:\n",
      "\t\tFitting FillNaFeatureGenerator...\n",
      "\tStage 3 Generators:\n",
      "\t\tFitting IdentityFeatureGenerator...\n",
      "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.\n",
      "\t\tFitting CategoryFeatureGenerator...\n",
      "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.\n",
      "\t\t\tFitting CategoryMemoryMinimizeFeatureGenerator...\n",
      "\tStage 4 Generators:\n",
      "\t\tFitting DropUniqueFeatureGenerator...\n",
      "\tStage 5 Generators:\n",
      "\t\tFitting DropDuplicatesFeatureGenerator...\n",
      "\tTypes of features in original data (raw dtype, special dtypes):\n",
      "\t\t('int', [])    : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n",
      "\t\t('object', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]\n",
      "\tTypes of features in processed data (raw dtype, special dtypes):\n",
      "\t\t('category', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]\n",
      "\t\t('int', [])      : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n",
      "\t0.1s = Fit runtime\n",
      "\t14 features in original data used to generate 14 features in processed data.\n",
      "\tTrain Data (Processed) Memory Usage: 1.46 MB (0.0% of available memory)\n",
      "Data preprocessing and feature engineering runtime = 0.15s ...\n",
      "AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n",
      "\tTo change this, specify the eval_metric parameter of Predictor()\n",
      "use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).\n",
      "User-specified model hyperparameters to be fit:\n",
      "{\n",
      "\t'RF': {},\n",
      "}\n",
      "Fitting 1 L1 models ...\n",
      "Hyperparameter tuning model: RandomForest_BAG_L1 ...\n",
      "\tNo hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. Will train one model based on the provided hyperparameters.\n",
      "Fitted model: RandomForest_BAG_L1 ...\n",
      "\t0.8543\t = Validation score   (accuracy)\n",
      "\t1.19s\t = Training   runtime\n",
      "\t0.0s\t = Validation runtime\n",
      "Fitting model: WeightedEnsemble_L2 ...\n",
      "\t0.8538\t = Validation score   (accuracy)\n",
      "\t0.0s\t = Training   runtime\n",
      "\t0.01s\t = Validation runtime\n",
      "AutoGluon training complete, total runtime = 1.54s ... Best model: \"WeightedEnsemble_L2\"\n",
      "TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/AutoGluon/Random Forest/\")\n",
      "Training log_loss loss: 0.08021\n",
      "Validation log_loss loss: 0.31802\n",
      "Testing log_loss loss: 0.31261\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')\n",
      "\n",
      "-------------AutoGluon End-------------\n",
      "\n",
      "PytorchTabular metrics\n",
      "Category Embedding 1/1\n",
      "WideDeep metrics\n",
      "TabMlp 1/1\n",
      "AutoGluon metrics\n",
      "Random Forest 1/1\n",
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')\n"
     ]
    },
    {
     "data": {
      "text/plain": "          Program               Model  Training F1_SCORE  \\\n0        WideDeep              TabMlp           0.694200   \n1       AutoGluon       Random Forest           1.000000   \n2  PytorchTabular  Category Embedding           0.709806   \n\n   Training PRECISION_SCORE  Training RECALL_SCORE  Training JACCARD_SCORE  \\\n0                  0.728505               0.662981                0.531628   \n1                  1.000000               1.000000                1.000000   \n2                  0.738341               0.683394                0.550154   \n\n   Training ACCURACY_SCORE  Training BALANCED_ACCURACY_SCORE  \\\n0                 0.859388                          0.792321   \n1                 1.000000                          1.000000   \n2                 0.865479                          0.803303   \n\n   Training COHEN_KAPPA_SCORE  Training HAMMING_LOSS  ...  \\\n0                    0.603167               0.140612  ...   \n1                    1.000000               0.000000  ...   \n2                    0.622423               0.134521  ...   \n\n   Validation ACCURACY_SCORE  Validation BALANCED_ACCURACY_SCORE  \\\n0                   0.852426                            0.784474   \n1                   0.853808                            0.776665   \n2                   0.850430                            0.784467   \n\n   Validation COHEN_KAPPA_SCORE  Validation HAMMING_LOSS  \\\n0                      0.584884                 0.147574   \n1                      0.580404                 0.146192   \n2                      0.581612                 0.149570   \n\n   Validation MATTHEWS_CORRCOEF  Validation ZERO_ONE_LOSS  \\\n0                      0.585738                  0.147574   \n1                      0.583003                  0.146192   \n2                      0.582150                  0.149570   \n\n   Validation ROC_AUC_SCORE  Validation LOG_LOSS  Validation BRIER_SCORE_LOSS  \\\n0                  0.908951             0.317288                     0.101612   \n1                  0.907010             0.318016                     0.100486   \n2                  0.909318             0.316194                     0.101722   \n\n   Validation AVERAGE_PRECISION_SCORE  \n0                            0.868420  \n1                            0.875084  \n2                            0.868410  \n\n[3 rows x 44 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Program</th>\n      <th>Model</th>\n      <th>Training F1_SCORE</th>\n      <th>Training PRECISION_SCORE</th>\n      <th>Training RECALL_SCORE</th>\n      <th>Training JACCARD_SCORE</th>\n      <th>Training ACCURACY_SCORE</th>\n      <th>Training BALANCED_ACCURACY_SCORE</th>\n      <th>Training COHEN_KAPPA_SCORE</th>\n      <th>Training HAMMING_LOSS</th>\n      <th>...</th>\n      <th>Validation ACCURACY_SCORE</th>\n      <th>Validation BALANCED_ACCURACY_SCORE</th>\n      <th>Validation COHEN_KAPPA_SCORE</th>\n      <th>Validation HAMMING_LOSS</th>\n      <th>Validation MATTHEWS_CORRCOEF</th>\n      <th>Validation ZERO_ONE_LOSS</th>\n      <th>Validation ROC_AUC_SCORE</th>\n      <th>Validation LOG_LOSS</th>\n      <th>Validation BRIER_SCORE_LOSS</th>\n      <th>Validation AVERAGE_PRECISION_SCORE</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>WideDeep</td>\n      <td>TabMlp</td>\n      <td>0.694200</td>\n      <td>0.728505</td>\n      <td>0.662981</td>\n      <td>0.531628</td>\n      <td>0.859388</td>\n      <td>0.792321</td>\n      <td>0.603167</td>\n      <td>0.140612</td>\n      <td>...</td>\n      <td>0.852426</td>\n      <td>0.784474</td>\n      <td>0.584884</td>\n      <td>0.147574</td>\n      <td>0.585738</td>\n      <td>0.147574</td>\n      <td>0.908951</td>\n      <td>0.317288</td>\n      <td>0.101612</td>\n      <td>0.868420</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>AutoGluon</td>\n      <td>Random Forest</td>\n      <td>1.000000</td>\n      <td>1.000000</td>\n      <td>1.000000</td>\n      <td>1.000000</td>\n      <td>1.000000</td>\n      <td>1.000000</td>\n      <td>1.000000</td>\n      <td>0.000000</td>\n      <td>...</td>\n      <td>0.853808</td>\n      <td>0.776665</td>\n      <td>0.580404</td>\n      <td>0.146192</td>\n      <td>0.583003</td>\n      <td>0.146192</td>\n      <td>0.907010</td>\n      <td>0.318016</td>\n      <td>0.100486</td>\n      <td>0.875084</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>PytorchTabular</td>\n      <td>Category Embedding</td>\n      <td>0.709806</td>\n      <td>0.738341</td>\n      <td>0.683394</td>\n      <td>0.550154</td>\n      <td>0.865479</td>\n      <td>0.803303</td>\n      <td>0.622423</td>\n      <td>0.134521</td>\n      <td>...</td>\n      <td>0.850430</td>\n      <td>0.784467</td>\n      <td>0.581612</td>\n      <td>0.149570</td>\n      <td>0.582150</td>\n      <td>0.149570</td>\n      <td>0.909318</td>\n      <td>0.316194</td>\n      <td>0.101722</td>\n      <td>0.868410</td>\n    </tr>\n  </tbody>\n</table>\n<p>3 rows × 44 columns</p>\n</div>"
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch\n",
    "from tabensemb.trainer import Trainer\n",
    "from tabensemb.model import *\n",
    "import tabensemb\n",
    "from tabensemb.config import UserConfig\n",
    "import os\n",
    "from tempfile import TemporaryDirectory\n",
    "\n",
    "temp_path = TemporaryDirectory()\n",
    "tabensemb.setting[\"default_output_path\"] = os.path.join(temp_path.name, \"output\")\n",
    "tabensemb.setting[\"default_config_path\"] = os.path.join(temp_path.name, \"configs\")\n",
    "tabensemb.setting[\"default_data_path\"] = os.path.join(temp_path.name, \"data\")\n",
    "\n",
    "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
    "print(\"Using {} device\".format(device))\n",
    "\n",
    "trainer = Trainer(device=device)\n",
    "adult_columns = [\n",
    "    \"age\",\n",
    "    \"workclass\",\n",
    "    \"fnlwgt\",\n",
    "    \"education\",\n",
    "    \"education-num\",\n",
    "    \"marital-status\",\n",
    "    \"occupation\",\n",
    "    \"relationship\",\n",
    "    \"race\",\n",
    "    \"sex\",\n",
    "    \"capital-gain\",\n",
    "    \"capital-loss\",\n",
    "    \"hours-per-week\",\n",
    "    \"native-country\",\n",
    "    \"income\",\n",
    "]\n",
    "cfg = UserConfig.from_uci(\"Adult\", column_names=adult_columns, sep=\", \")\n",
    "trainer.load_config(cfg)\n",
    "trainer.load_data()\n",
    "models = [\n",
    "    PytorchTabular(trainer, model_subset=[\"Category Embedding\"]),\n",
    "    WideDeep(trainer, model_subset=[\"TabMlp\"]),\n",
    "    AutoGluon(trainer, model_subset=[\"Random Forest\"]),\n",
    "]\n",
    "trainer.add_modelbases(models)\n",
    "trainer.train(stderr_to_stdout=True)\n",
    "trainer.get_leaderboard()"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Selecting and storing a model\n",
    "\n",
    "From the leaderboard, we can check the performance of each model and select one of the models for deployment. Say we want to choose `Random Forest` from `AutoGluon`, we detach the model from the heavy `trainer`. It is stored locally in a separate directory."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig-I1/trainer.pkl')\n"
     ]
    }
   ],
   "source": [
    "trainer_of_one_model = trainer.detach_model(program=\"AutoGluon\", model_name=\"Random Forest\")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The detached trainer now has only one model base."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "outputs": [
    {
     "data": {
      "text/plain": "[<tabensemb.model.autogluon.AutoGluon at 0x7f581fc3bd00>]"
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Model bases of the detached trainer\n",
    "trainer_of_one_model.modelbases"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "outputs": [
    {
     "data": {
      "text/plain": "<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f5644bdbfa0>"
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The model in the model base\n",
    "trainer_of_one_model.get_modelbase(\"AutoGluon_Random Forest\").model[\"Random Forest\"]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Loading the model\n",
    "\n",
    "Now the `Trainer` containing a single model is stored in a separate directory. Assume that we want to load the local trainer in a separate script for inference. In the following line, the argument `path` of `load_trainer` is the path to `trainer.pkl`, which is already printed when detaching the model or training the model bases. Here we just use the directory of the detached trainer `trainer_of_one_model`.\n",
    "\n",
    "**Remark**: You can move the directory to any other place (or other devices if the version of the package and the environment are all consistent) and rename the folder. `tabensemb` automatically configures the path."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "outputs": [],
   "source": [
    "from tabensemb.trainer import load_trainer\n",
    "\n",
    "trainer = load_trainer(path=os.path.join(trainer_of_one_model.project_root, \"trainer.pkl\"))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "outputs": [
    {
     "data": {
      "text/plain": "<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f5644bd8130>"
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.get_modelbase(\"AutoGluon_Random Forest\").model[\"Random Forest\"]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Inference\n",
    "\n",
    "Assume that we have a new `DataFrame` representing an upcoming dataset. For demonstration, we use the testing set here. The classification target is ordinal encoded by `trainer.datamodule.label_ordinal_encoder`."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "outputs": [
    {
     "data": {
      "text/plain": "array([0, 0, 1, ..., 1, 0, 0])"
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = trainer.df.loc[trainer.test_indices, :]\n",
    "truth = trainer.df.loc[trainer.test_indices, trainer.label_name].values.flatten()\n",
    "truth"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Use the functionality of the model base to do inference. The returned result should be"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "outputs": [
    {
     "data": {
      "text/plain": "array([['<=50K'],\n       ['<=50K'],\n       ['>50K'],\n       ...,\n       ['>50K'],\n       ['<=50K'],\n       ['<=50K']], dtype=object)"
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "result = trainer.get_modelbase(\"AutoGluon_Random Forest\").predict(df, model_name=\"Random Forest\")\n",
    "result"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "You can see the F1 score on the \"new\" (testing) dataset is the same as that in the above leaderboard. The result should be ordinal-encoded first to calculate metrics. We provide `DataModule.label_categories_transform` to achieve this (and `DataModule.label_categories_inverse_transform` to do the inverse transform).\n",
    "\n",
    "The `auto_metric_sklearn` automatically calculates different kinds of `sklearn.metrics`, which is extremely useful for classification tasks"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "outputs": [
    {
     "data": {
      "text/plain": "(0.6936416184971098, array([0.69364162]))"
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from tabensemb.utils import auto_metric_sklearn\n",
    "\n",
    "encoded_result = trainer.datamodule.label_categories_transform(pd.DataFrame(result, columns=trainer.label_name)).values\n",
    "# You can also use trainer.datamodule.label_ordinal_encoder.transform(result) to get the same result.\n",
    "auto_metric_sklearn(truth, encoded_result, \"f1_score\", \"binary\"), trainer.leaderboard.loc[trainer.leaderboard[\"Model\"]==\"Random Forest\", \"Testing F1_SCORE\"].values"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Inference on the individual testing set\n",
    "\n",
    "When loading from UCI datasets, `UserConfig.from_uci` finds that an individual testing dataset exists, so the downloaded .zip file is not removed. We can load the archive using `zipfile`."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "outputs": [
    {
     "data": {
      "text/plain": "['Index', 'adult.data', 'adult.names', 'adult.test', 'old.adult.names']"
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import zipfile\n",
    "\n",
    "zipf = zipfile.ZipFile(os.path.join(tabensemb.setting[\"default_data_path\"], \"Adult.zip\"))\n",
    "zipf.namelist()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now check the content of `adult.test`. It is a .csv-like file, just like the `adult.data` file but has an additional row at the front, and an additional \".\" at the end of each line."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "|1x3 Cross validator\n",
      "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n",
      "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.\n",
      "28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.\n",
      "44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0\n"
     ]
    }
   ],
   "source": [
    "file = zipf.read(\"adult.test\").decode()\n",
    "print(file[:500])"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We provide the `str_to_dataframe` function that translates the string object to a `DataFrame` and replaces illegal values with `np.nan`."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/xlluo/hdd/tabular_ensemble/tabensemb/utils/utils.py:464: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n",
      "  df = pd.read_csv(StringIO(s), names=names, sep=sep)\n"
     ]
    },
    {
     "data": {
      "text/plain": "       age     workclass  fnlwgt     education  education-num  \\\n0       25       Private  226802          11th              7   \n1       38       Private   89814       HS-grad              9   \n2       28     Local-gov  336951    Assoc-acdm             12   \n3       44       Private  160323  Some-college             10   \n4       18             ?  103497  Some-college             10   \n...    ...           ...     ...           ...            ...   \n16276   39       Private  215419     Bachelors             13   \n16277   64             ?  321403       HS-grad              9   \n16278   38       Private  374983     Bachelors             13   \n16279   44       Private   83891     Bachelors             13   \n16280   35  Self-emp-inc  182148     Bachelors             13   \n\n           marital-status         occupation    relationship  \\\n0           Never-married  Machine-op-inspct       Own-child   \n1      Married-civ-spouse    Farming-fishing         Husband   \n2      Married-civ-spouse    Protective-serv         Husband   \n3      Married-civ-spouse  Machine-op-inspct         Husband   \n4           Never-married                  ?       Own-child   \n...                   ...                ...             ...   \n16276            Divorced     Prof-specialty   Not-in-family   \n16277             Widowed                  ?  Other-relative   \n16278  Married-civ-spouse     Prof-specialty         Husband   \n16279            Divorced       Adm-clerical       Own-child   \n16280  Married-civ-spouse    Exec-managerial         Husband   \n\n                     race     sex  capital-gain  capital-loss  hours-per-week  \\\n0                   Black    Male             0             0              40   \n1                   White    Male             0             0              50   \n2                   White    Male             0             0              40   \n3                   Black    Male          7688             0              40   \n4                   White  Female             0             0              30   \n...                   ...     ...           ...           ...             ...   \n16276               White  Female             0             0              36   \n16277               Black    Male             0             0              40   \n16278               White    Male             0             0              50   \n16279  Asian-Pac-Islander    Male          5455             0              40   \n16280               White    Male             0             0              60   \n\n      native-country income  \n0      United-States  <=50K  \n1      United-States  <=50K  \n2      United-States   >50K  \n3      United-States   >50K  \n4      United-States  <=50K  \n...              ...    ...  \n16276  United-States  <=50K  \n16277  United-States  <=50K  \n16278  United-States  <=50K  \n16279  United-States  <=50K  \n16280  United-States   >50K  \n\n[16281 rows x 15 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>age</th>\n      <th>workclass</th>\n      <th>fnlwgt</th>\n      <th>education</th>\n      <th>education-num</th>\n      <th>marital-status</th>\n      <th>occupation</th>\n      <th>relationship</th>\n      <th>race</th>\n      <th>sex</th>\n      <th>capital-gain</th>\n      <th>capital-loss</th>\n      <th>hours-per-week</th>\n      <th>native-country</th>\n      <th>income</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>25</td>\n      <td>Private</td>\n      <td>226802</td>\n      <td>11th</td>\n      <td>7</td>\n      <td>Never-married</td>\n      <td>Machine-op-inspct</td>\n      <td>Own-child</td>\n      <td>Black</td>\n      <td>Male</td>\n      <td>0</td>\n      <td>0</td>\n      <td>40</td>\n      <td>United-States</td>\n      <td>&lt;=50K</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>38</td>\n      <td>Private</td>\n      <td>89814</td>\n      <td>HS-grad</td>\n      <td>9</td>\n      <td>Married-civ-spouse</td>\n      <td>Farming-fishing</td>\n      <td>Husband</td>\n      <td>White</td>\n      <td>Male</td>\n      <td>0</td>\n      <td>0</td>\n      <td>50</td>\n      <td>United-States</td>\n      <td>&lt;=50K</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>28</td>\n      <td>Local-gov</td>\n      <td>336951</td>\n      <td>Assoc-acdm</td>\n      <td>12</td>\n      <td>Married-civ-spouse</td>\n      <td>Protective-serv</td>\n      <td>Husband</td>\n      <td>White</td>\n      <td>Male</td>\n      <td>0</td>\n      <td>0</td>\n      <td>40</td>\n      <td>United-States</td>\n      <td>&gt;50K</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>44</td>\n      <td>Private</td>\n      <td>160323</td>\n      <td>Some-college</td>\n      <td>10</td>\n      <td>Married-civ-spouse</td>\n      <td>Machine-op-inspct</td>\n      <td>Husband</td>\n      <td>Black</td>\n      <td>Male</td>\n      <td>7688</td>\n      <td>0</td>\n      <td>40</td>\n      <td>United-States</td>\n      <td>&gt;50K</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>18</td>\n      <td>?</td>\n      <td>103497</td>\n      <td>Some-college</td>\n      <td>10</td>\n      <td>Never-married</td>\n      <td>?</td>\n      <td>Own-child</td>\n      <td>White</td>\n      <td>Female</td>\n      <td>0</td>\n      <td>0</td>\n      <td>30</td>\n      <td>United-States</td>\n      <td>&lt;=50K</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>16276</th>\n      <td>39</td>\n      <td>Private</td>\n      <td>215419</td>\n      <td>Bachelors</td>\n      <td>13</td>\n      <td>Divorced</td>\n      <td>Prof-specialty</td>\n      <td>Not-in-family</td>\n      <td>White</td>\n      <td>Female</td>\n      <td>0</td>\n      <td>0</td>\n      <td>36</td>\n      <td>United-States</td>\n      <td>&lt;=50K</td>\n    </tr>\n    <tr>\n      <th>16277</th>\n      <td>64</td>\n      <td>?</td>\n      <td>321403</td>\n      <td>HS-grad</td>\n      <td>9</td>\n      <td>Widowed</td>\n      <td>?</td>\n      <td>Other-relative</td>\n      <td>Black</td>\n      <td>Male</td>\n      <td>0</td>\n      <td>0</td>\n      <td>40</td>\n      <td>United-States</td>\n      <td>&lt;=50K</td>\n    </tr>\n    <tr>\n      <th>16278</th>\n      <td>38</td>\n      <td>Private</td>\n      <td>374983</td>\n      <td>Bachelors</td>\n      <td>13</td>\n      <td>Married-civ-spouse</td>\n      <td>Prof-specialty</td>\n      <td>Husband</td>\n      <td>White</td>\n      <td>Male</td>\n      <td>0</td>\n      <td>0</td>\n      <td>50</td>\n      <td>United-States</td>\n      <td>&lt;=50K</td>\n    </tr>\n    <tr>\n      <th>16279</th>\n      <td>44</td>\n      <td>Private</td>\n      <td>83891</td>\n      <td>Bachelors</td>\n      <td>13</td>\n      <td>Divorced</td>\n      <td>Adm-clerical</td>\n      <td>Own-child</td>\n      <td>Asian-Pac-Islander</td>\n      <td>Male</td>\n      <td>5455</td>\n      <td>0</td>\n      <td>40</td>\n      <td>United-States</td>\n      <td>&lt;=50K</td>\n    </tr>\n    <tr>\n      <th>16280</th>\n      <td>35</td>\n      <td>Self-emp-inc</td>\n      <td>182148</td>\n      <td>Bachelors</td>\n      <td>13</td>\n      <td>Married-civ-spouse</td>\n      <td>Exec-managerial</td>\n      <td>Husband</td>\n      <td>White</td>\n      <td>Male</td>\n      <td>0</td>\n      <td>0</td>\n      <td>60</td>\n      <td>United-States</td>\n      <td>&gt;50K</td>\n    </tr>\n  </tbody>\n</table>\n<p>16281 rows × 15 columns</p>\n</div>"
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from tabensemb.utils import str_to_dataframe\n",
    "\n",
    "file = file.replace(\"|1x3 Cross validator\\n\",\"\").replace(\".\\n\", \"\\n\")\n",
    "testing_df = str_to_dataframe(file, sep=\", \", names=trainer.df.columns, check_nan_on=trainer.cont_feature_names)\n",
    "testing_df"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The inference is similar."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "outputs": [
    {
     "data": {
      "text/plain": "array([['<=50K'],\n       ['<=50K'],\n       ['<=50K'],\n       ...,\n       ['>50K'],\n       ['<=50K'],\n       ['>50K']], dtype=object)"
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = trainer.get_modelbase(\"AutoGluon_Random Forest\").predict(testing_df, model_name=\"Random Forest\")\n",
    "result"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Both the truth and the result require to be ordinal-encoded to calculate metrics."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "outputs": [
    {
     "data": {
      "text/plain": "(0.6690190543401553, 0.7731706276694975)"
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encoded_truth = trainer.datamodule.label_categories_transform(testing_df[trainer.label_name]).values\n",
    "encoded_result = trainer.datamodule.label_categories_transform(pd.DataFrame(result, columns=trainer.label_name)).values\n",
    "auto_metric_sklearn(encoded_truth, encoded_result, \"f1_score\", \"binary\"), auto_metric_sklearn(encoded_truth, encoded_result, \"roc_auc_score\", \"binary\")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}