{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "# Basics of running benchmarks\n", "\n", "Tabular Ensemble (`tabensemb`) is a benchmark platform for tabular prediction tasks. We support three well-established model bases as baselines:\n", "\n", "* `autogluon`: [Link](https://github.com/autogluon/autogluon)\n", "\n", "* `pytorch_widedeep`: [Link](https://github.com/jrzaurin/pytorch-widedeep)\n", "\n", "* `pytorch_tabular`: [Link](https://github.com/manujosephv/pytorch_tabular)\n", "\n", "Users can run benchmarks on custom datasets with custom preprocessing steps, and implement custom models in the framework to compare their performance with the baselines under a consistent procedure.\n", "\n", "In this part, minimal examples of regression, binary classification, and multiclass classification show the basic functionality of the package." ] }, { "cell_type": "markdown", "source": [ "## Regression" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "### Loading packages\n", "\n", "First, import the necessary modules. Then check whether `CUDA` is available and determine the training device." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using cuda device\n" ] } ], "source": [ "import torch\n", "from tabensemb.trainer import Trainer\n", "from tabensemb.model import *\n", "from tabensemb.config import UserConfig\n", "import tabensemb\n", "import os\n", "\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "print(\"Using {} device\".format(device))" ] }, { "cell_type": "markdown", "source": [ "`tabensemb` uses paths relative to the current directory. 
The current directory can differ between IDEs (PyCharm, VSCode, etc.), so set the default paths explicitly.\n", "\n", "* `tabensemb.setting[\"default_output_path\"]`: The path used to save results. It will be created if it does not exist.\n", "* `tabensemb.setting[\"default_config_path\"]`: The path to configuration files (see \"Using a configuration file\" for its usage).\n", "* `tabensemb.setting[\"default_data_path\"]`: The path to data files. It is also used to save downloaded datasets (see \"Using a configuration file\" for its usage).\n", "\n", "In this notebook, we use a temporary directory to keep things clean. Replace `temp_path.name` with your own directory." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 2, "outputs": [], "source": [ "from tempfile import TemporaryDirectory\n", "\n", "temp_path = TemporaryDirectory()\n", "tabensemb.setting[\"default_output_path\"] = os.path.join(temp_path.name, \"output\")\n", "tabensemb.setting[\"default_config_path\"] = os.path.join(temp_path.name, \"configs\")\n", "tabensemb.setting[\"default_data_path\"] = os.path.join(temp_path.name, \"data\")" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "### Configuring a `Trainer`\n", "\n", "Create a `Trainer`, which acts as a bridge between data and models and provides some useful utilities." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "trainer = Trainer(device=device)" ] }, { "cell_type": "markdown", "source": [ "As an example, we use the Auto MPG dataset from [UCI datasets](https://archive.ics.uci.edu/datasets). We can import UCI datasets through the `UserConfig` class." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 4, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading https://archive.ics.uci.edu/static/public/9/auto+mpg.zip to /tmp/tmpqcxgn2l1/data/Auto MPG.zip\n", "cylinders is Integer and will be treated as a continuous feature.\n", "model_year is Integer and will be treated as a continuous feature.\n", "origin is Integer and will be treated as a continuous feature.\n", "Unknown values are detected in ['horsepower']. They will be treated as np.nan.\n", "The project will be saved to /tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig\n" ] } ], "source": [ "mpg_columns = [\n", " \"mpg\",\n", " \"cylinders\",\n", " \"displacement\",\n", " \"horsepower\",\n", " \"weight\",\n", " \"acceleration\",\n", " \"model_year\",\n", " \"origin\",\n", " \"car_name\",\n", "]\n", "cfg = UserConfig.from_uci(\"Auto MPG\", column_names=mpg_columns, sep=r\"\\s+\")\n", "trainer.load_config(cfg)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "*Optional*: We provide a useful `Logging` class to record all outputs to a file located in the above project root so that users can review the training process. This step is optional but we strongly recommend using it.\n", "\n", "`Trainer.project_root` is the output directory of the `trainer`, and here we log all `stdout` and `stderr` to `log.txt` in this directory." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from tabensemb.utils import Logging\n", "log = Logging()\n", "log.enter(os.path.join(trainer.project_root, \"log.txt\"))" ] }, { "cell_type": "markdown", "source": [ "### Viewing configurations\n", "\n", "We can view the summary of the current environment, including devices/Python version, the loaded configuration, and global settings of `tabensemb`." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 6, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Device:\n", "{\n", "\t'System': 'Linux',\n", "\t'Node name': 'xlluo-WS',\n", "\t'System release': '5.15.6-custom',\n", "\t'System version': '#1 SMP Mon Dec 13 20:27:58 CST 2021',\n", "\t'Machine architecture': 'x86_64',\n", "\t'Processor architecture': 'x86_64',\n", "\t'Processor model': '11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz',\n", "\t'Physical cores': 8,\n", "\t'Total cores': 16,\n", "\t'Max core frequency': '5150.00Mhz',\n", "\t'Total memory': '31.20GB',\n", "\t'Python version': '3.10.12',\n", "\t'Python implementation': 'CPython',\n", "\t'Python compiler': 'GCC 11.2.0',\n", "\t'Cuda availability': True,\n", "\t'GPU devices': [\n", "\t\t'NVIDIA GeForce RTX 3090'\n", "\t]\n", "}\n", "Configurations:\n", "{\n", "\t'database': 'auto-mpg',\n", "\t'task': 'regression',\n", "\t'loss': None,\n", "\t'bayes_opt': False,\n", "\t'bayes_calls': 50,\n", "\t'bayes_epoch': 30,\n", "\t'patience': 100,\n", "\t'epoch': 300,\n", "\t'lr': 0.001,\n", "\t'weight_decay': 1e-09,\n", "\t'batch_size': 1024,\n", "\t'layers': [\n", "\t\t64,\n", "\t\t128,\n", "\t\t256,\n", "\t\t128,\n", "\t\t64\n", "\t],\n", "\t'SPACEs': {\n", "\t\t'lr': {\n", "\t\t\t'type': 'Real',\n", "\t\t\t'low': 0.0001,\n", "\t\t\t'high': 0.05,\n", "\t\t\t'prior': 'log-uniform'\n", "\t\t},\n", "\t\t'weight_decay': {\n", "\t\t\t'type': 
'Real',\n", "\t\t\t'low': 1e-09,\n", "\t\t\t'high': 0.05,\n", "\t\t\t'prior': 'log-uniform'\n", "\t\t},\n", "\t\t'batch_size': {\n", "\t\t\t'type': 'Categorical',\n", "\t\t\t'categories': [\n", "\t\t\t\t64,\n", "\t\t\t\t128,\n", "\t\t\t\t256,\n", "\t\t\t\t512,\n", "\t\t\t\t1024,\n", "\t\t\t\t2048\n", "\t\t\t]\n", "\t\t}\n", "\t},\n", "\t'data_splitter': 'RandomSplitter',\n", "\t'split_ratio': [\n", "\t\t0.6,\n", "\t\t0.2,\n", "\t\t0.2\n", "\t],\n", "\t'data_imputer': 'MissForestImputer',\n", "\t'data_processors': [\n", "\t\t(\n", "\t\t\t'CategoricalOrdinalEncoder',\n", "\t\t\t{\n", "\t\t\t}\n", "\t\t),\n", "\t\t(\n", "\t\t\t'NaNFeatureRemover',\n", "\t\t\t{\n", "\t\t\t}\n", "\t\t),\n", "\t\t(\n", "\t\t\t'VarianceFeatureSelector',\n", "\t\t\t{\n", "\t\t\t\t'thres': 1\n", "\t\t\t}\n", "\t\t),\n", "\t\t(\n", "\t\t\t'StandardScaler',\n", "\t\t\t{\n", "\t\t\t}\n", "\t\t)\n", "\t],\n", "\t'data_derivers': [\n", "\t],\n", "\t'categorical_feature_names': [\n", "\t],\n", "\t'continuous_feature_names': [\n", "\t\t'displacement',\n", "\t\t'cylinders',\n", "\t\t'horsepower',\n", "\t\t'weight',\n", "\t\t'acceleration',\n", "\t\t'model_year',\n", "\t\t'origin'\n", "\t],\n", "\t'feature_types': {\n", "\t\t'displacement': 'Continuous',\n", "\t\t'cylinders': 'Continuous',\n", "\t\t'horsepower': 'Continuous',\n", "\t\t'weight': 'Continuous',\n", "\t\t'acceleration': 'Continuous',\n", "\t\t'model_year': 'Continuous',\n", "\t\t'origin': 'Continuous'\n", "\t},\n", "\t'unique_feature_types': [\n", "\t\t'Continuous'\n", "\t],\n", "\t'label_name': [\n", "\t\t'mpg'\n", "\t]\n", "}\n", "Global settings:\n", "{\n", "\t'random_seed': 42,\n", "\t'low_memory': True,\n", "\t'verbose_per_epoch': 20,\n", "\t'test_with_no_grad': True,\n", "\t'debug_mode': False,\n", "\t'default_output_path': '/tmp/tmpqcxgn2l1/output',\n", "\t'default_config_path': '/tmp/tmpqcxgn2l1/configs',\n", "\t'default_data_path': '/tmp/tmpqcxgn2l1/data',\n", "\t'warn_nan_metric': True,\n", 
"\t'raise_inconsistent_inferred_task': False,\n", "\t'matplotlib_usetex': False\n", "}\n" ] } ], "source": [ "trainer.summarize_setting()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "### Loading data\n", "\n", "In the configuration summary above, the dataset file is defined by \"database\" under the `Configurations` category. `Trainer.load_data` automatically searches the file in the current directory and `tabensemb.setting[\"default_data_path\"]`. Now, load the Auto MPG dataset into the `Trainer`. It will process the dataset and get ready for training models:\n", "\n", "1. Data splitting (training/validation/testing sets)\n", "2. Data imputation\n", "3. Data augmentation (for features)\n", "4. Data processing\n", " * Data augmentation (for data points)\n", " * Data filtering\n", " * Feature selection\n", " * Categorical encoding\n", " * Data scaling\n", " * etc.\n", "5. Data augmentation (for features, especially multi-modal features)\n" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset size: 238 80 80\n", "Data saved to /tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig (data.csv and tabular_data.csv).\n" ] } ], "source": [ "trainer.load_data()" ] }, { "cell_type": "markdown", "source": [ "### Initializing model bases\n", "\n", "Initialize model bases and add them to the `Trainer`. We only choose a subset of models in each model base for demonstration by passing the `model_subset` argument (without it, all available models will be trained)." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "models = [\n", "    PytorchTabular(trainer, model_subset=[\"Category Embedding\"]),\n", "    WideDeep(trainer, model_subset=[\"TabMlp\"]),\n", "    AutoGluon(trainer, model_subset=[\"Random Forest\"]),\n", "]\n", "trainer.add_modelbases(models)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "### Start training\n", "\n", "Now train the model bases. The argument `stderr_to_stdout` redirects warnings and log messages to `stdout`, which keeps the notebook output clean.\n", "\n", "*Optional*: The following line runs k-fold cross-validation to get the leaderboard, where k is given by `cross_validation`.\n", "\n", "```python\n", "trainer.get_leaderboard(cross_validation=10, split_type=\"cv\", stderr_to_stdout=True)\n", "```\n", "\n", "**Remark**: `split_type` can also be `random`, in which case the dataset is split randomly according to the `split_ratio` given in the configuration, using different random seeds." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "-------------Run PytorchTabular-------------\n", "\n", "Training Category Embedding\n", "Global seed set to 42\n", "2023-09-23 20:36:01,062 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders\n", "2023-09-23 20:36:01,062 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task\n", "2023-09-23 20:36:01,070 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel\n", "2023-09-23 20:36:01,081 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.\n", " rank_zero_deprecation(\n", "Auto select gpus: [0]\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "IPU available: False, using: 0 IPUs\n", "HPU available: False, using: 0 HPUs\n", "2023-09-23 20:36:01,991 - {pytorch_tabular.tabular_model:582} - INFO - Training Started\n", "You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. 
For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "\n", " | Name | Type | Params\n", "---------------------------------------------------------------\n", "0 | _backbone | CategoryEmbeddingBackbone | 11.4 K\n", "1 | _embedding_layer | Embedding1dLayer | 14 \n", "2 | head | LinearHead | 33 \n", "3 | loss | MSELoss | 0 \n", "---------------------------------------------------------------\n", "11.4 K Trainable params\n", "0 Non-trainable params\n", "11.4 K Total params\n", "0.046 Total estimated model params size (MB)\n", "Epoch: 1/300, Train loss: 677.8015, Val loss: 582.9557, Min val loss: 582.9557, Epoch time: 0.012s.\n", "Epoch: 20/300, Train loss: 353.7851, Val loss: 302.0203, Min val loss: 302.0203, Epoch time: 0.010s.\n", "Epoch: 40/300, Train loss: 85.0776, Val loss: 62.1153, Min val loss: 62.1153, Epoch time: 0.009s.\n", "Epoch: 60/300, Train loss: 45.2654, Val loss: 34.2778, Min val loss: 34.2691, Epoch time: 0.009s.\n", "Epoch: 80/300, Train loss: 33.9537, Val loss: 26.8622, Min val loss: 26.8622, Epoch time: 0.016s.\n", "Epoch: 100/300, Train loss: 26.9038, Val loss: 23.2417, Min val loss: 23.2372, Epoch time: 0.011s.\n", "Epoch: 120/300, Train loss: 24.9622, Val loss: 20.4360, Min val loss: 20.4360, Epoch time: 0.008s.\n", "Epoch: 140/300, Train loss: 24.1636, Val loss: 19.4010, Min val loss: 19.4010, Epoch time: 0.012s.\n", "Epoch: 160/300, Train loss: 22.9200, Val loss: 18.0232, Min val loss: 17.9749, Epoch time: 0.010s.\n", "Epoch: 180/300, Train loss: 19.7677, Val loss: 16.9469, Min val loss: 16.9469, Epoch time: 0.011s.\n", "Epoch: 200/300, Train loss: 17.9390, Val loss: 16.6545, Min val loss: 16.4093, Epoch time: 0.010s.\n", "Epoch: 220/300, Train loss: 19.4496, Val loss: 15.4451, Min val loss: 15.1788, Epoch time: 0.011s.\n", "Epoch: 240/300, Train loss: 16.0483, Val loss: 14.5508, Min val loss: 14.5508, Epoch 
time: 0.010s.\n", "Epoch: 260/300, Train loss: 16.4672, Val loss: 13.8354, Min val loss: 13.8354, Epoch time: 0.010s.\n", "Epoch: 280/300, Train loss: 13.6031, Val loss: 12.9315, Min val loss: 12.9315, Epoch time: 0.018s.\n", "Epoch: 300/300, Train loss: 16.5369, Val loss: 12.3673, Min val loss: 12.3673, Epoch time: 0.011s.\n", "`Trainer.fit` stopped: `max_epochs=300` reached.\n", "2023-09-23 20:36:07,420 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed\n", "2023-09-23 20:36:07,420 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.\n", " rank_zero_deprecation(\n", "Training mse loss: 11.25175\n", "Validation mse loss: 12.36725\n", "Testing mse loss: 7.83801\n", "Trainer saved. 
To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------PytorchTabular End-------------\n", "\n", "\n", "-------------Run WideDeep-------------\n", "\n", "Training TabMlp\n", "Epoch: 1/300, Train loss: 635.5330, Val loss: 555.4755, Min val loss: 555.4755\n", "Epoch: 21/300, Train loss: 441.6902, Val loss: 375.7337, Min val loss: 375.7337\n", "Epoch: 41/300, Train loss: 145.8623, Val loss: 119.9598, Min val loss: 119.9598\n", "Epoch: 61/300, Train loss: 45.9133, Val loss: 34.0160, Min val loss: 34.0160\n", "Epoch: 81/300, Train loss: 27.6878, Val loss: 24.1525, Min val loss: 24.1525\n", "Epoch: 101/300, Train loss: 23.0877, Val loss: 18.2096, Min val loss: 18.2096\n", "Epoch: 121/300, Train loss: 21.4056, Val loss: 17.2203, Min val loss: 17.1303\n", "Epoch: 141/300, Train loss: 21.2559, Val loss: 16.0746, Min val loss: 16.0746\n", "Epoch: 161/300, Train loss: 19.2337, Val loss: 15.3027, Min val loss: 15.3027\n", "Epoch: 181/300, Train loss: 16.1232, Val loss: 14.5777, Min val loss: 14.5777\n", "Epoch: 201/300, Train loss: 16.7095, Val loss: 14.2274, Min val loss: 14.2274\n", "Epoch: 221/300, Train loss: 15.7366, Val loss: 13.5223, Min val loss: 13.5223\n", "Epoch: 241/300, Train loss: 16.9825, Val loss: 12.9892, Min val loss: 12.9892\n", "Epoch: 261/300, Train loss: 15.3358, Val loss: 12.4278, Min val loss: 12.4278\n", "Epoch: 281/300, Train loss: 13.3989, Val loss: 12.1155, Min val loss: 12.1155\n", "Restoring model weights from the end of the best epoch\n", "Training mse loss: 10.17037\n", "Validation mse loss: 11.66271\n", "Testing mse loss: 6.43856\n", "Trainer saved. 
To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------WideDeep End-------------\n", "\n", "\n", "-------------Run AutoGluon-------------\n", "\n", "Training Random Forest\n", "Presets specified: ['best_quality']\n", "Warning: hyperparameter tuning is currently experimental and may cause the process to hang.\n", "Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=1\n", "Beginning AutoGluon training ...\n", "AutoGluon will save models to \"/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/AutoGluon/Random Forest/\"\n", "AutoGluon Version: 0.8.2\n", "Python Version: 3.10.12\n", "Operating System: Linux\n", "Platform Machine: x86_64\n", "Platform Version: #1 SMP Mon Dec 13 20:27:58 CST 2021\n", "Disk Space Avail: 149.28 GB / 502.47 GB (29.7%)\n", "Train Data Rows: 238\n", "Train Data Columns: 7\n", "Tuning Data Rows: 80\n", "Tuning Data Columns: 7\n", "Label Column: mpg\n", "Preprocessing data ...\n", "Using Feature Generators to preprocess the data ...\n", "Fitting PipelineFeatureGenerator...\n", "\tAvailable Memory: 12533.16 MB\n", "\tTrain Data (Original) Memory Usage: 0.02 MB (0.0% of available memory)\n", "\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n", "\tStage 1 Generators:\n", "\t\tFitting AsTypeFeatureGenerator...\n", "\tStage 2 Generators:\n", "\t\tFitting FillNaFeatureGenerator...\n", "\tStage 3 Generators:\n", "\t\tFitting IdentityFeatureGenerator...\n", "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. 
Ignoring feature_metadata_in.\n", "\tStage 4 Generators:\n", "\t\tFitting DropUniqueFeatureGenerator...\n", "\tStage 5 Generators:\n", "\t\tFitting DropDuplicatesFeatureGenerator...\n", "\tTypes of features in original data (raw dtype, special dtypes):\n", "\t\t('float', []) : 4 | ['displacement', 'horsepower', 'weight', 'acceleration']\n", "\t\t('int', []) : 3 | ['cylinders', 'model_year', 'origin']\n", "\tTypes of features in processed data (raw dtype, special dtypes):\n", "\t\t('float', []) : 4 | ['displacement', 'horsepower', 'weight', 'acceleration']\n", "\t\t('int', []) : 3 | ['cylinders', 'model_year', 'origin']\n", "\t0.0s = Fit runtime\n", "\t7 features in original data used to generate 7 features in processed data.\n", "\tTrain Data (Processed) Memory Usage: 0.02 MB (0.0% of available memory)\n", "Data preprocessing and feature engineering runtime = 0.03s ...\n", "AutoGluon will gauge predictive performance using evaluation metric: 'mean_squared_error'\n", "\tThis metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.\n", "\tTo change this, specify the eval_metric parameter of Predictor()\n", "use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).\n", "User-specified model hyperparameters to be fit:\n", "{\n", "\t'RF': {},\n", "}\n", "Fitting 1 L1 models ...\n", "Hyperparameter tuning model: RandomForest_BAG_L1 ...\n", "\tNo hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. 
Will train one model based on the provided hyperparameters.\n", "Fitted model: RandomForest_BAG_L1 ...\n", "\t-8.1918\t = Validation score (-mean_squared_error)\n", "\t0.28s\t = Training runtime\n", "\t0.0s\t = Validation runtime\n", "Fitting model: WeightedEnsemble_L2 ...\n", "\t-11.4141\t = Validation score (-mean_squared_error)\n", "\t0.0s\t = Training runtime\n", "\t0.0s\t = Validation runtime\n", "AutoGluon training complete, total runtime = 0.35s ... Best model: \"WeightedEnsemble_L2\"\n", "TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/AutoGluon/Random Forest/\")\n", "Training mse loss: 1.07741\n", "Validation mse loss: 11.41409\n", "Testing mse loss: 4.19031\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------AutoGluon End-------------\n", "\n" ] } ], "source": [ "trainer.train(stderr_to_stdout=True)" ] }, { "cell_type": "markdown", "source": [ "After training finishes, check the leaderboard to compare the performance of the trained models.\n", "\n", "The metrics used in leaderboards are listed in `REGRESSION_METRICS`, `BINARY_METRICS`, and `MULTICLASS_METRICS` in `tabensemb.utils.utils`. Most of them come from `sklearn.metrics`." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PytorchTabular metrics\n", "Category Embedding 1/1\n", "WideDeep metrics\n", "TabMlp 1/1\n", "AutoGluon metrics\n", "Random Forest 1/1\n", "Trainer saved. 
To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')\n" ] }, { "data": { "text/plain": " Program Model Training RMSE Training MSE \\\n0 AutoGluon Random Forest 1.037981 1.077405 \n1 WideDeep TabMlp 3.189102 10.170372 \n2 PytorchTabular Category Embedding 3.354362 11.251746 \n\n Training MAE Training MAPE Training R2 Training MEDIAN_ABSOLUTE_ERROR \\\n0 0.741566 0.031074 0.983285 0.529500 \n1 2.318564 0.096454 0.842218 1.669983 \n2 2.445915 0.101659 0.825442 1.775388 \n\n Training EXPLAINED_VARIANCE_SCORE Testing RMSE ... Testing R2 \\\n0 0.983293 2.047025 ... 0.922065 \n1 0.859805 2.537431 ... 0.880250 \n2 0.854523 2.799644 ... 0.854221 \n\n Testing MEDIAN_ABSOLUTE_ERROR Testing EXPLAINED_VARIANCE_SCORE \\\n0 1.156333 0.922591 \n1 1.767459 0.900587 \n2 1.963455 0.888258 \n\n Validation RMSE Validation MSE Validation MAE Validation MAPE \\\n0 3.378475 11.414091 2.269187 0.102995 \n1 3.415071 11.662707 2.539188 0.116035 \n2 3.516710 12.367250 2.731159 0.125136 \n\n Validation R2 Validation MEDIAN_ABSOLUTE_ERROR \\\n0 0.796098 1.641334 \n1 0.791657 1.904160 \n2 0.779071 2.375105 \n\n Validation EXPLAINED_VARIANCE_SCORE \n0 0.796506 \n1 0.806152 \n2 0.808039 \n\n[3 rows x 23 columns]", "text/html": "
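<div>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr><th></th><th>Program</th><th>Model</th><th>Training RMSE</th><th>Training MSE</th><th>Training MAE</th><th>Training MAPE</th><th>Training R2</th><th>Training MEDIAN_ABSOLUTE_ERROR</th><th>Training EXPLAINED_VARIANCE_SCORE</th><th>Testing RMSE</th><th>...</th><th>Testing R2</th><th>Testing MEDIAN_ABSOLUTE_ERROR</th><th>Testing EXPLAINED_VARIANCE_SCORE</th><th>Validation RMSE</th><th>Validation MSE</th><th>Validation MAE</th><th>Validation MAPE</th><th>Validation R2</th><th>Validation MEDIAN_ABSOLUTE_ERROR</th><th>Validation EXPLAINED_VARIANCE_SCORE</th></tr>\n  </thead>\n  <tbody>\n    <tr><th>0</th><td>AutoGluon</td><td>Random Forest</td><td>1.037981</td><td>1.077405</td><td>0.741566</td><td>0.031074</td><td>0.983285</td><td>0.529500</td><td>0.983293</td><td>2.047025</td><td>...</td><td>0.922065</td><td>1.156333</td><td>0.922591</td><td>3.378475</td><td>11.414091</td><td>2.269187</td><td>0.102995</td><td>0.796098</td><td>1.641334</td><td>0.796506</td></tr>\n    <tr><th>1</th><td>WideDeep</td><td>TabMlp</td><td>3.189102</td><td>10.170372</td><td>2.318564</td><td>0.096454</td><td>0.842218</td><td>1.669983</td><td>0.859805</td><td>2.537431</td><td>...</td><td>0.880250</td><td>1.767459</td><td>0.900587</td><td>3.415071</td><td>11.662707</td><td>2.539188</td><td>0.116035</td><td>0.791657</td><td>1.904160</td><td>0.806152</td></tr>\n    <tr><th>2</th><td>PytorchTabular</td><td>Category Embedding</td><td>3.354362</td><td>11.251746</td><td>2.445915</td><td>0.101659</td><td>0.825442</td><td>1.775388</td><td>0.854523</td><td>2.799644</td><td>...</td><td>0.854221</td><td>1.963455</td><td>0.888258</td><td>3.516710</td><td>12.367250</td><td>2.731159</td><td>0.125136</td><td>0.779071</td><td>2.375105</td><td>0.808039</td></tr>\n  </tbody>\n</table>\n<p>3 rows × 23 columns</p>\n</div>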
" }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainer.get_leaderboard()" ] }, { "cell_type": "markdown", "source": [ "## Binary classification\n", "\n", "As a showcase for binary classification, we use the Adult dataset from UCI datasets. Note that the Adult dataset has an individual testing set, which will be discussed in the \"Inference on an upcoming dataset\" part." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 11, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading https://archive.ics.uci.edu/static/public/2/adult.zip to /tmp/tmpqcxgn2l1/data/Adult.zip\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/xlluo/hdd/tabular_ensemble/tabensemb/config/user_config.py:292: UserWarning: There exists .test file(s) ['adult.test'] which should be used for final metrics. The .zip file is left for the user to process.\n", " warnings.warn(\n", "/home/xlluo/hdd/tabular_ensemble/tabensemb/utils/utils.py:464: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n", " df = pd.read_csv(StringIO(s), names=names, sep=sep)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "age is Integer and will be treated as a continuous feature.\n", "fnlwgt is Integer and will be treated as a continuous feature.\n", "education-num is Integer and will be treated as a continuous feature.\n", "capital-gain is Integer and will be treated as a continuous feature.\n", "capital-loss is Integer and will be treated as a continuous feature.\n", "hours-per-week is Integer and will be treated as a continuous feature.\n", "The project will be saved to /tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig\n", "Dataset size: 19536 6512 
6513\n", "Data saved to /tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig (data.csv and tabular_data.csv).\n", "\n", "-------------Run PytorchTabular-------------\n", "\n", "Training Category Embedding\n", "Global seed set to 42\n", "2023-09-23 20:36:17,315 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders\n", "2023-09-23 20:36:17,317 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task\n", "2023-09-23 20:36:17,382 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel\n", "2023-09-23 20:36:17,412 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.\n", " rank_zero_deprecation(\n", "Auto select gpus: [0]\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "IPU available: False, using: 0 IPUs\n", "HPU available: False, using: 0 HPUs\n", "2023-09-23 20:36:17,437 - {pytorch_tabular.tabular_model:582} - INFO - Training Started\n", "You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. 
For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "\n", " | Name | Type | Params\n", "---------------------------------------------------------------\n", "0 | _backbone | CategoryEmbeddingBackbone | 18.5 K\n", "1 | _embedding_layer | Embedding1dLayer | 1.4 K \n", "2 | head | LinearHead | 66 \n", "3 | loss | CrossEntropyLoss | 0 \n", "---------------------------------------------------------------\n", "20.0 K Trainable params\n", "0 Non-trainable params\n", "20.0 K Total params\n", "0.080 Total estimated model params size (MB)\n", "Epoch: 1/300, Train loss: 0.4666, Val loss: 0.3794, Min val loss: 0.3794, Epoch time: 0.465s.\n", "Epoch: 20/300, Train loss: 0.3112, Val loss: 0.3184, Min val loss: 0.3177, Epoch time: 0.472s.\n", "Epoch: 40/300, Train loss: 0.2972, Val loss: 0.3213, Min val loss: 0.3162, Epoch time: 0.435s.\n", "Epoch: 60/300, Train loss: 0.2864, Val loss: 0.3248, Min val loss: 0.3162, Epoch time: 0.476s.\n", "Epoch: 80/300, Train loss: 0.2766, Val loss: 0.3323, Min val loss: 0.3162, Epoch time: 0.530s.\n", "Epoch: 100/300, Train loss: 0.2677, Val loss: 0.3416, Min val loss: 0.3162, Epoch time: 0.385s.\n", "Epoch: 120/300, Train loss: 0.2602, Val loss: 0.3484, Min val loss: 0.3162, Epoch time: 0.527s.\n", "2023-09-23 20:37:10,637 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed\n", "2023-09-23 20:37:10,638 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. 
Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.\n", " rank_zero_deprecation(\n", "Training log_loss loss: 0.28862\n", "Validation log_loss loss: 0.31619\n", "Testing log_loss loss: 0.31114\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------PytorchTabular End-------------\n", "\n", "\n", "-------------Run WideDeep-------------\n", "\n", "Training TabMlp\n", "Epoch: 1/300, Train loss: 0.5206, Val loss: 0.4274, Min val loss: 0.4274\n", "Epoch: 21/300, Train loss: 0.3145, Val loss: 0.3161, Min val loss: 0.3161\n", "Epoch: 41/300, Train loss: 0.3063, Val loss: 0.3192, Min val loss: 0.3161\n", "Epoch: 61/300, Train loss: 0.2984, Val loss: 0.3209, Min val loss: 0.3161\n", "Epoch: 81/300, Train loss: 0.2905, Val loss: 0.3227, Min val loss: 0.3161\n", "Epoch: 101/300, Train loss: 0.2819, Val loss: 0.3266, Min val loss: 0.3161\n", "Epoch: 121/300, Train loss: 0.2754, Val loss: 0.3293, Min val loss: 0.3161\n", "Epoch 00121: early stopping\n", "Restoring model weights from the end of the best epoch\n", "Training log_loss loss: 0.30084\n", "Validation log_loss loss: 0.31729\n", "Testing log_loss loss: 0.31133\n", "Trainer saved. 
To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------WideDeep End-------------\n", "\n", "\n", "-------------Run AutoGluon-------------\n", "\n", "Training Random Forest\n", "Presets specified: ['best_quality']\n", "Warning: hyperparameter tuning is currently experimental and may cause the process to hang.\n", "Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1\n", "Beginning AutoGluon training ...\n", "AutoGluon will save models to \"/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/AutoGluon/Random Forest/\"\n", "AutoGluon Version: 0.8.2\n", "Python Version: 3.10.12\n", "Operating System: Linux\n", "Platform Machine: x86_64\n", "Platform Version: #1 SMP Mon Dec 13 20:27:58 CST 2021\n", "Disk Space Avail: 148.97 GB / 502.47 GB (29.6%)\n", "Train Data Rows: 19536\n", "Train Data Columns: 14\n", "Tuning Data Rows: 6512\n", "Tuning Data Columns: 14\n", "Label Column: income\n", "Preprocessing data ...\n", "Selected class <--> label mapping: class 1 = 1, class 0 = 0\n", "Using Feature Generators to preprocess the data ...\n", "Fitting PipelineFeatureGenerator...\n", "\tAvailable Memory: 6146.04 MB\n", "\tTrain Data (Original) Memory Usage: 15.07 MB (0.2% of available memory)\n", "\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n", "\tStage 1 Generators:\n", "\t\tFitting AsTypeFeatureGenerator...\n", "\t\t\tNote: Converting 1 features to boolean dtype as they only contain 2 unique values.\n", "\tStage 2 Generators:\n", "\t\tFitting FillNaFeatureGenerator...\n", "\tStage 3 Generators:\n", "\t\tFitting IdentityFeatureGenerator...\n", "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. 
Ignoring feature_metadata_in.\n", "\t\tFitting CategoryFeatureGenerator...\n", "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.\n", "\t\t\tFitting CategoryMemoryMinimizeFeatureGenerator...\n", "\tStage 4 Generators:\n", "\t\tFitting DropUniqueFeatureGenerator...\n", "\tStage 5 Generators:\n", "\t\tFitting DropDuplicatesFeatureGenerator...\n", "\tTypes of features in original data (raw dtype, special dtypes):\n", "\t\t('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n", "\t\t('object', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]\n", "\tTypes of features in processed data (raw dtype, special dtypes):\n", "\t\t('category', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]\n", "\t\t('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n", "\t0.2s = Fit runtime\n", "\t14 features in original data used to generate 14 features in processed data.\n", "\tTrain Data (Processed) Memory Usage: 1.46 MB (0.0% of available memory)\n", "Data preprocessing and feature engineering runtime = 0.18s ...\n", "AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n", "\tTo change this, specify the eval_metric parameter of Predictor()\n", "use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).\n", "User-specified model hyperparameters to be fit:\n", "{\n", "\t'RF': {},\n", "}\n", "Fitting 1 L1 models ...\n", "Hyperparameter tuning model: RandomForest_BAG_L1 ...\n", "\tNo hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. 
Will train one model based on the provided hyperparameters.\n", "Fitted model: RandomForest_BAG_L1 ...\n", "\t0.8543\t = Validation score (accuracy)\n", "\t1.92s\t = Training runtime\n", "\t0.0s\t = Validation runtime\n", "Fitting model: WeightedEnsemble_L2 ...\n", "\t0.8538\t = Validation score (accuracy)\n", "\t0.0s\t = Training runtime\n", "\t0.01s\t = Validation runtime\n", "AutoGluon training complete, total runtime = 2.4s ... Best model: \"WeightedEnsemble_L2\"\n", "TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/AutoGluon/Random Forest/\")\n", "Training log_loss loss: 0.08021\n", "Validation log_loss loss: 0.31802\n", "Testing log_loss loss: 0.31261\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------AutoGluon End-------------\n", "\n", "PytorchTabular metrics\n", "Category Embedding 1/1\n", "WideDeep metrics\n", "TabMlp 1/1\n", "AutoGluon metrics\n", "Random Forest 1/1\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/trainer.pkl')\n" ] }, { "data": { "text/plain": " Program Model Training F1_SCORE \\\n0 WideDeep TabMlp 0.694200 \n1 AutoGluon Random Forest 1.000000 \n2 PytorchTabular Category Embedding 0.709806 \n\n Training PRECISION_SCORE Training RECALL_SCORE Training JACCARD_SCORE \\\n0 0.728505 0.662981 0.531628 \n1 1.000000 1.000000 1.000000 \n2 0.738341 0.683394 0.550154 \n\n Training ACCURACY_SCORE Training BALANCED_ACCURACY_SCORE \\\n0 0.859388 0.792321 \n1 1.000000 1.000000 \n2 0.865479 0.803303 \n\n Training COHEN_KAPPA_SCORE Training HAMMING_LOSS ... \\\n0 0.603167 0.140612 ... \n1 1.000000 0.000000 ... \n2 0.622423 0.134521 ... 
\n\n Validation ACCURACY_SCORE Validation BALANCED_ACCURACY_SCORE \\\n0 0.852426 0.784474 \n1 0.853808 0.776665 \n2 0.850430 0.784467 \n\n Validation COHEN_KAPPA_SCORE Validation HAMMING_LOSS \\\n0 0.584884 0.147574 \n1 0.580404 0.146192 \n2 0.581612 0.149570 \n\n Validation MATTHEWS_CORRCOEF Validation ZERO_ONE_LOSS \\\n0 0.585738 0.147574 \n1 0.583003 0.146192 \n2 0.582150 0.149570 \n\n Validation ROC_AUC_SCORE Validation LOG_LOSS Validation BRIER_SCORE_LOSS \\\n0 0.908951 0.317288 0.101612 \n1 0.907010 0.318016 0.100486 \n2 0.909318 0.316194 0.101722 \n\n Validation AVERAGE_PRECISION_SCORE \n0 0.868420 \n1 0.875084 \n2 0.868410 \n\n[3 rows x 44 columns]", "text/html": "
<div>
<table>
  <thead>
    <tr><th></th><th>Program</th><th>Model</th><th>Training F1_SCORE</th><th>Training PRECISION_SCORE</th><th>Training RECALL_SCORE</th><th>Training JACCARD_SCORE</th><th>Training ACCURACY_SCORE</th><th>Training BALANCED_ACCURACY_SCORE</th><th>Training COHEN_KAPPA_SCORE</th><th>Training HAMMING_LOSS</th><th>...</th><th>Validation ACCURACY_SCORE</th><th>Validation BALANCED_ACCURACY_SCORE</th><th>Validation COHEN_KAPPA_SCORE</th><th>Validation HAMMING_LOSS</th><th>Validation MATTHEWS_CORRCOEF</th><th>Validation ZERO_ONE_LOSS</th><th>Validation ROC_AUC_SCORE</th><th>Validation LOG_LOSS</th><th>Validation BRIER_SCORE_LOSS</th><th>Validation AVERAGE_PRECISION_SCORE</th></tr>
  </thead>
  <tbody>
    <tr><th>0</th><td>WideDeep</td><td>TabMlp</td><td>0.694200</td><td>0.728505</td><td>0.662981</td><td>0.531628</td><td>0.859388</td><td>0.792321</td><td>0.603167</td><td>0.140612</td><td>...</td><td>0.852426</td><td>0.784474</td><td>0.584884</td><td>0.147574</td><td>0.585738</td><td>0.147574</td><td>0.908951</td><td>0.317288</td><td>0.101612</td><td>0.868420</td></tr>
    <tr><th>1</th><td>AutoGluon</td><td>Random Forest</td><td>1.000000</td><td>1.000000</td><td>1.000000</td><td>1.000000</td><td>1.000000</td><td>1.000000</td><td>1.000000</td><td>0.000000</td><td>...</td><td>0.853808</td><td>0.776665</td><td>0.580404</td><td>0.146192</td><td>0.583003</td><td>0.146192</td><td>0.907010</td><td>0.318016</td><td>0.100486</td><td>0.875084</td></tr>
    <tr><th>2</th><td>PytorchTabular</td><td>Category Embedding</td><td>0.709806</td><td>0.738341</td><td>0.683394</td><td>0.550154</td><td>0.865479</td><td>0.803303</td><td>0.622423</td><td>0.134521</td><td>...</td><td>0.850430</td><td>0.784467</td><td>0.581612</td><td>0.149570</td><td>0.582150</td><td>0.149570</td><td>0.909318</td><td>0.316194</td><td>0.101722</td><td>0.868410</td></tr>
  </tbody>
</table>
<p>3 rows × 44 columns</p>
</div>
" }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainer = Trainer(device=device)\n", "adult_columns = [\n", " \"age\",\n", " \"workclass\",\n", " \"fnlwgt\",\n", " \"education\",\n", " \"education-num\",\n", " \"marital-status\",\n", " \"occupation\",\n", " \"relationship\",\n", " \"race\",\n", " \"sex\",\n", " \"capital-gain\",\n", " \"capital-loss\",\n", " \"hours-per-week\",\n", " \"native-country\",\n", " \"income\",\n", "]\n", "cfg = UserConfig.from_uci(\"Adult\", column_names=adult_columns, sep=\", \")\n", "trainer.load_config(cfg)\n", "trainer.load_data()\n", "models = [\n", " PytorchTabular(trainer, model_subset=[\"Category Embedding\"]),\n", " WideDeep(trainer, model_subset=[\"TabMlp\"]),\n", " AutoGluon(trainer, model_subset=[\"Random Forest\"]),\n", "]\n", "trainer.add_modelbases(models)\n", "trainer.train(stderr_to_stdout=True)\n", "trainer.get_leaderboard()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Multiclass classification\n", "\n", "Iris is a famous multiclass classification task. It is also loaded from UCI datasets. We gave the argument `column_names` to `from_uci` in the above examples. If we do not know the column labels, column names from the UCI website are used (whose order might be wrong, such as those for the Auto MPG dataset) and the downloaded archive will not be removed after `from_uci`. There should be a file named `xxx.name` in the archive with column names in it." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 12, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading https://archive.ics.uci.edu/static/public/53/iris.zip to /tmp/tmpqcxgn2l1/data/Iris.zip\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/xlluo/hdd/tabular_ensemble/tabensemb/config/user_config.py:323: UserWarning: `column_names` is not given. The order of columns will be loaded from the website. It is highly recommended to manually set column names. The downloaded .zip is saved. Please check its .name file for the correct order.\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "The project will be saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig\n", "Dataset size: 90 30 30\n", "Data saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig (data.csv and tabular_data.csv).\n", "\n", "-------------Run PytorchTabular-------------\n", "\n", "Training Category Embedding\n", "Global seed set to 42\n", "2023-09-23 20:37:51,096 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders\n", "2023-09-23 20:37:51,096 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task\n", "2023-09-23 20:37:51,106 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel\n", "2023-09-23 20:37:51,121 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. 
Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.\n", " rank_zero_deprecation(\n", "Auto select gpus: [0]\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "IPU available: False, using: 0 IPUs\n", "HPU available: False, using: 0 HPUs\n", "2023-09-23 20:37:51,137 - {pytorch_tabular.tabular_model:582} - INFO - Training Started\n", "You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "\n", " | Name | Type | Params\n", "---------------------------------------------------------------\n", "0 | _backbone | CategoryEmbeddingBackbone | 11.0 K\n", "1 | _embedding_layer | Embedding1dLayer | 8 \n", "2 | head | LinearHead | 99 \n", "3 | loss | CrossEntropyLoss | 0 \n", "---------------------------------------------------------------\n", "11.1 K Trainable params\n", "0 Non-trainable params\n", "11.1 K Total params\n", "0.044 Total estimated model params size (MB)\n", "Epoch: 1/300, Train loss: 1.7917, Val loss: 1.4287, Min val loss: 1.4287, Epoch time: 0.025s.\n", "Epoch: 20/300, Train loss: 0.3055, Val loss: 0.6197, Min val loss: 0.6197, Epoch time: 0.019s.\n", "Epoch: 40/300, Train loss: 0.2105, Val loss: 0.5668, Min val loss: 0.5668, Epoch time: 0.016s.\n", "Epoch: 60/300, Train loss: 0.1510, Val loss: 0.5234, Min val loss: 0.5234, Epoch time: 0.013s.\n", "Epoch: 80/300, Train loss: 0.1741, Val loss: 0.5314, Min val loss: 0.5216, Epoch time: 0.010s.\n", "Epoch: 100/300, Train loss: 0.0870, Val loss: 0.4985, Min val loss: 0.4853, Epoch time: 0.013s.\n", "Epoch: 120/300, Train loss: 0.0437, Val loss: 0.5143, Min val loss: 0.4853, Epoch 
time: 0.027s.\n", "Epoch: 140/300, Train loss: 0.0248, Val loss: 0.4864, Min val loss: 0.4844, Epoch time: 0.017s.\n", "Epoch: 160/300, Train loss: 0.0663, Val loss: 0.5182, Min val loss: 0.4506, Epoch time: 0.012s.\n", "Epoch: 180/300, Train loss: 0.0457, Val loss: 0.5648, Min val loss: 0.4506, Epoch time: 0.018s.\n", "Epoch: 200/300, Train loss: 0.0188, Val loss: 0.4554, Min val loss: 0.4319, Epoch time: 0.012s.\n", "Epoch: 220/300, Train loss: 0.0421, Val loss: 0.4980, Min val loss: 0.4269, Epoch time: 0.010s.\n", "Epoch: 240/300, Train loss: 0.0202, Val loss: 0.4948, Min val loss: 0.4187, Epoch time: 0.011s.\n", "Epoch: 260/300, Train loss: 0.0269, Val loss: 0.4776, Min val loss: 0.4187, Epoch time: 0.035s.\n", "Epoch: 280/300, Train loss: 0.0621, Val loss: 0.4086, Min val loss: 0.3670, Epoch time: 0.013s.\n", "Epoch: 300/300, Train loss: 0.0098, Val loss: 0.4748, Min val loss: 0.3670, Epoch time: 0.011s.\n", "`Trainer.fit` stopped: `max_epochs=300` reached.\n", "2023-09-23 20:37:56,026 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed\n", "2023-09-23 20:37:56,027 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.\n", " rank_zero_deprecation(\n", "Training log_loss loss: 0.01391\n", "Validation log_loss loss: 0.36698\n", "Testing log_loss loss: 0.14373\n", "Trainer saved. 
To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------PytorchTabular End-------------\n", "\n", "\n", "-------------Run WideDeep-------------\n", "\n", "Training TabMlp\n", "Epoch: 1/300, Train loss: 1.0850, Val loss: 1.0164, Min val loss: 1.0164\n", "Epoch: 21/300, Train loss: 0.3373, Val loss: 0.5153, Min val loss: 0.5153\n", "Epoch: 41/300, Train loss: 0.1735, Val loss: 0.3780, Min val loss: 0.3780\n", "Epoch: 61/300, Train loss: 0.1253, Val loss: 0.3279, Min val loss: 0.3165\n", "Epoch: 81/300, Train loss: 0.0960, Val loss: 0.3603, Min val loss: 0.2951\n", "Epoch: 101/300, Train loss: 0.1037, Val loss: 0.3682, Min val loss: 0.2951\n", "Epoch: 121/300, Train loss: 0.1171, Val loss: 0.3525, Min val loss: 0.2951\n", "Epoch: 141/300, Train loss: 0.0503, Val loss: 0.3754, Min val loss: 0.2951\n", "Epoch: 161/300, Train loss: 0.0421, Val loss: 0.3265, Min val loss: 0.2951\n", "Epoch 00167: early stopping\n", "Restoring model weights from the end of the best epoch\n", "Training log_loss loss: 0.06156\n", "Validation log_loss loss: 0.29513\n", "Testing log_loss loss: 0.11595\n", "Trainer saved. 
To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------WideDeep End-------------\n", "\n", "\n", "-------------Run AutoGluon-------------\n", "\n", "Training Random Forest\n", "Presets specified: ['best_quality']\n", "Warning: hyperparameter tuning is currently experimental and may cause the process to hang.\n", "Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=1\n", "Beginning AutoGluon training ...\n", "AutoGluon will save models to \"/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/AutoGluon/Random Forest/\"\n", "AutoGluon Version: 0.8.2\n", "Python Version: 3.10.12\n", "Operating System: Linux\n", "Platform Machine: x86_64\n", "Platform Version: #1 SMP Mon Dec 13 20:27:58 CST 2021\n", "Disk Space Avail: 148.84 GB / 502.47 GB (29.6%)\n", "Train Data Rows: 90\n", "Train Data Columns: 4\n", "Tuning Data Rows: 30\n", "Tuning Data Columns: 4\n", "Label Column: class\n", "Preprocessing data ...\n", "Train Data Class Count: 3\n", "Using Feature Generators to preprocess the data ...\n", "Fitting PipelineFeatureGenerator...\n", "\tAvailable Memory: 4021.68 MB\n", "\tTrain Data (Original) Memory Usage: 0.0 MB (0.0% of available memory)\n", "\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n", "\tStage 1 Generators:\n", "\t\tFitting AsTypeFeatureGenerator...\n", "\tStage 2 Generators:\n", "\t\tFitting FillNaFeatureGenerator...\n", "\tStage 3 Generators:\n", "\t\tFitting IdentityFeatureGenerator...\n", "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. 
Ignoring feature_metadata_in.\n", "\tStage 4 Generators:\n", "\t\tFitting DropUniqueFeatureGenerator...\n", "\tStage 5 Generators:\n", "\t\tFitting DropDuplicatesFeatureGenerator...\n", "\tTypes of features in original data (raw dtype, special dtypes):\n", "\t\t('float', []) : 4 | ['sepal length', 'sepal width', 'petal length', 'petal width']\n", "\tTypes of features in processed data (raw dtype, special dtypes):\n", "\t\t('float', []) : 4 | ['sepal length', 'sepal width', 'petal length', 'petal width']\n", "\t0.0s = Fit runtime\n", "\t4 features in original data used to generate 4 features in processed data.\n", "\tTrain Data (Processed) Memory Usage: 0.0 MB (0.0% of available memory)\n", "Data preprocessing and feature engineering runtime = 0.06s ...\n", "AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n", "\tTo change this, specify the eval_metric parameter of Predictor()\n", "use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).\n", "User-specified model hyperparameters to be fit:\n", "{\n", "\t'RF': {},\n", "}\n", "Fitting 1 L1 models ...\n", "Hyperparameter tuning model: RandomForest_BAG_L1 ...\n", "\tNo hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. Will train one model based on the provided hyperparameters.\n", "Fitted model: RandomForest_BAG_L1 ...\n", "\t1.0\t = Validation score (accuracy)\n", "\t0.57s\t = Training runtime\n", "\t0.0s\t = Validation runtime\n", "Fitting model: WeightedEnsemble_L2 ...\n", "\t0.8\t = Validation score (accuracy)\n", "\t0.0s\t = Training runtime\n", "\t0.0s\t = Validation runtime\n", "AutoGluon training complete, total runtime = 0.69s ... Best model: \"WeightedEnsemble_L2\"\n", "TabularPredictor saved. 
To load, use: predictor = TabularPredictor.load(\"/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/AutoGluon/Random Forest/\")\n", "Training log_loss loss: 0.01002\n", "Validation log_loss loss: 0.78155\n", "Testing log_loss loss: 0.04824\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------AutoGluon End-------------\n", "\n", "PytorchTabular metrics\n", "Category Embedding 1/1\n", "WideDeep metrics\n", "TabMlp 1/1\n", "AutoGluon metrics\n", "Random Forest 1/1\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/trainer.pkl')\n" ] }, { "data": { "text/plain": " Program Model Training ACCURACY_SCORE \\\n0 PytorchTabular Category Embedding 1.0 \n1 WideDeep TabMlp 1.0 \n2 AutoGluon Random Forest 1.0 \n\n Training BALANCED_ACCURACY_SCORE Training COHEN_KAPPA_SCORE \\\n0 1.0 1.0 \n1 1.0 1.0 \n2 1.0 1.0 \n\n Training HAMMING_LOSS Training MATTHEWS_CORRCOEF Training ZERO_ONE_LOSS \\\n0 0.0 1.0 0.0 \n1 0.0 1.0 0.0 \n2 0.0 1.0 0.0 \n\n Training PRECISION_SCORE_MACRO Training PRECISION_SCORE_MICRO ... \\\n0 1.0 1.0 ... \n1 1.0 1.0 ... \n2 1.0 1.0 ... \n\n Validation F1_SCORE_MICRO Validation F1_SCORE_WEIGHTED \\\n0 0.833333 0.837232 \n1 0.833333 0.837232 \n2 0.800000 0.804615 \n\n Validation JACCARD_SCORE_MACRO Validation JACCARD_SCORE_MICRO \\\n0 0.756944 0.714286 \n1 0.756944 0.714286 \n2 0.721154 0.666667 \n\n Validation JACCARD_SCORE_WEIGHTED Validation TOP_K_ACCURACY_SCORE \\\n0 0.732639 1.0 \n1 0.732639 1.0 \n2 0.689423 1.0 \n\n Validation LOG_LOSS Validation ROC_AUC_SCORE_OVR_MACRO \\\n0 0.366983 0.974891 \n1 0.295129 0.979747 \n2 0.781551 0.950812 \n\n Validation ROC_AUC_SCORE_OVR_WEIGHTED Validation ROC_AUC_SCORE_OVO \n0 0.971616 0.976042 \n1 0.977576 0.980833 \n2 0.941465 0.951042 \n\n[3 rows x 71 columns]", "text/html": "
<div>
<table>
  <thead>
    <tr><th></th><th>Program</th><th>Model</th><th>Training ACCURACY_SCORE</th><th>Training BALANCED_ACCURACY_SCORE</th><th>Training COHEN_KAPPA_SCORE</th><th>Training HAMMING_LOSS</th><th>Training MATTHEWS_CORRCOEF</th><th>Training ZERO_ONE_LOSS</th><th>Training PRECISION_SCORE_MACRO</th><th>Training PRECISION_SCORE_MICRO</th><th>...</th><th>Validation F1_SCORE_MICRO</th><th>Validation F1_SCORE_WEIGHTED</th><th>Validation JACCARD_SCORE_MACRO</th><th>Validation JACCARD_SCORE_MICRO</th><th>Validation JACCARD_SCORE_WEIGHTED</th><th>Validation TOP_K_ACCURACY_SCORE</th><th>Validation LOG_LOSS</th><th>Validation ROC_AUC_SCORE_OVR_MACRO</th><th>Validation ROC_AUC_SCORE_OVR_WEIGHTED</th><th>Validation ROC_AUC_SCORE_OVO</th></tr>
  </thead>
  <tbody>
    <tr><th>0</th><td>PytorchTabular</td><td>Category Embedding</td><td>1.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>...</td><td>0.833333</td><td>0.837232</td><td>0.756944</td><td>0.714286</td><td>0.732639</td><td>1.0</td><td>0.366983</td><td>0.974891</td><td>0.971616</td><td>0.976042</td></tr>
    <tr><th>1</th><td>WideDeep</td><td>TabMlp</td><td>1.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>...</td><td>0.833333</td><td>0.837232</td><td>0.756944</td><td>0.714286</td><td>0.732639</td><td>1.0</td><td>0.295129</td><td>0.979747</td><td>0.977576</td><td>0.980833</td></tr>
    <tr><th>2</th><td>AutoGluon</td><td>Random Forest</td><td>1.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>...</td><td>0.800000</td><td>0.804615</td><td>0.721154</td><td>0.666667</td><td>0.689423</td><td>1.0</td><td>0.781551</td><td>0.950812</td><td>0.941465</td><td>0.951042</td></tr>
  </tbody>
</table>
<p>3 rows × 71 columns</p>
</div>
" }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainer = Trainer(device=device)\n", "cfg = UserConfig.from_uci(\"Iris\", datafile_name=\"iris\")\n", "trainer.load_config(cfg)\n", "trainer.load_data()\n", "models = [\n", " PytorchTabular(trainer, model_subset=[\"Category Embedding\"]),\n", " WideDeep(trainer, model_subset=[\"TabMlp\"]),\n", " AutoGluon(trainer, model_subset=[\"Random Forest\"]),\n", "]\n", "trainer.add_modelbases(models)\n", "trainer.train(stderr_to_stdout=True)\n", "trainer.get_leaderboard()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Using a configuration file\n", "\n", "In the above introduction, we use UCI datasets whose configuration is automatically generated. The configuration can also be loaded from a local `.py` or `.json` file. To run a minimum example, we provide a randomly generated sample dataset (`data/sample.csv`) and its configuration file (`configs/sample.py`) in the repository. See \"Dataset and configuration\" for the detailed introduction of configuration files.\n", "\n", "`tabensemb` uses paths relative to the current directory. For different IDEs (PyCharm, VSCode, etc.), the directory might be different. Set default paths to desired ones after checking the current working directory using magic commands in notebooks like `!pwd` or scripts like `import os; os.getcwd()`." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 13, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "path = \"../../../../\"\n", "tabensemb.setting[\"default_config_path\"] = path + \"configs\"\n", "tabensemb.setting[\"default_data_path\"] = path + \"data\"" ] }, { "cell_type": "markdown", "source": [ "Load the configuration file `sample.py` using `Trainer.load_config`, which automatically searches for the file in the current directory and in `tabensemb.setting[\"default_config_path\"]`." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 14, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The project will be saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample\n", "Dataset size: 153 51 52\n", "Data saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample (data.csv and tabular_data.csv).\n" ] } ], "source": [ "trainer.load_config(\"sample\")\n", "trainer.load_data()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Then initialize the models:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 15, "outputs": [], "source": [ "trainer.clear_modelbase()\n", "models = [\n", "    PytorchTabular(trainer, model_subset=[\"Category Embedding\"])\n", "]\n", "trainer.add_modelbases(models)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "*Optional*: For a quick development test, enable debug mode with the following global setting. It significantly reduces training time; the run below stops after 2 epochs."
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 16, "outputs": [], "source": [ "tabensemb.setting[\"debug_mode\"] = True" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 17, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "-------------Run PytorchTabular-------------\n", "\n", "Training Category Embedding\n", "Global seed set to 42\n", "2023-09-23 20:37:59,305 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders\n", "2023-09-23 20:37:59,306 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task\n", "2023-09-23 20:37:59,326 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel\n", "2023-09-23 20:37:59,350 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.\n", " rank_zero_deprecation(\n", "Auto select gpus: [0]\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "IPU available: False, using: 0 IPUs\n", "HPU available: False, using: 0 HPUs\n", "2023-09-23 20:37:59,372 - {pytorch_tabular.tabular_model:582} - INFO - Training Started\n", "You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. 
For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "\n", " | Name | Type | Params\n", "---------------------------------------------------------------\n", "0 | _backbone | CategoryEmbeddingBackbone | 12.3 K\n", "1 | _embedding_layer | Embedding1dLayer | 64 \n", "2 | head | LinearHead | 33 \n", "3 | loss | MSELoss | 0 \n", "---------------------------------------------------------------\n", "12.4 K Trainable params\n", "0 Non-trainable params\n", "12.4 K Total params\n", "0.049 Total estimated model params size (MB)\n", "Epoch: 1/2, Train loss: 33183.6562, Val loss: 22223.0391, Min val loss: 22223.0391, Epoch time: 0.014s.\n", "`Trainer.fit` stopped: `max_epochs=2` reached.\n", "2023-09-23 20:37:59,437 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed\n", "2023-09-23 20:37:59,438 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.\n", " rank_zero_deprecation(\n", "Training mse loss: 33085.08333\n", "Validation mse loss: 22182.62019\n", "Testing mse loss: 29810.51082\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample/trainer.pkl')\n", "\n", "-------------PytorchTabular End-------------\n", "\n", "PytorchTabular metrics\n", "Category Embedding 1/1\n", "Trainer saved. 
To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample/trainer.pkl')\n" ] }, { "data": { "text/plain": " Program Model Training RMSE Training MSE \\\n0 PytorchTabular Category Embedding 181.893055 33085.083331 \n\n Training MAE Training MAPE Training R2 Training MEDIAN_ABSOLUTE_ERROR \\\n0 145.409738 1.055588 -0.004359 121.139843 \n\n Training EXPLAINED_VARIANCE_SCORE Testing RMSE ... Testing R2 \\\n0 0.001236 172.657206 ... -0.005851 \n\n Testing MEDIAN_ABSOLUTE_ERROR Testing EXPLAINED_VARIANCE_SCORE \\\n0 118.665751 -0.001657 \n\n Validation RMSE Validation MSE Validation MAE Validation MAPE \\\n0 148.93831 22182.620185 121.146176 1.00906 \n\n Validation R2 Validation MEDIAN_ABSOLUTE_ERROR \\\n0 -0.001817 92.916794 \n\n Validation EXPLAINED_VARIANCE_SCORE \n0 0.001214 \n\n[1 rows x 23 columns]", "text/html": "
<div>
<table>
  <thead>
    <tr><th></th><th>Program</th><th>Model</th><th>Training RMSE</th><th>Training MSE</th><th>Training MAE</th><th>Training MAPE</th><th>Training R2</th><th>Training MEDIAN_ABSOLUTE_ERROR</th><th>Training EXPLAINED_VARIANCE_SCORE</th><th>Testing RMSE</th><th>...</th><th>Testing R2</th><th>Testing MEDIAN_ABSOLUTE_ERROR</th><th>Testing EXPLAINED_VARIANCE_SCORE</th><th>Validation RMSE</th><th>Validation MSE</th><th>Validation MAE</th><th>Validation MAPE</th><th>Validation R2</th><th>Validation MEDIAN_ABSOLUTE_ERROR</th><th>Validation EXPLAINED_VARIANCE_SCORE</th></tr>
  </thead>
  <tbody>
    <tr><th>0</th><td>PytorchTabular</td><td>Category Embedding</td><td>181.893055</td><td>33085.083331</td><td>145.409738</td><td>1.055588</td><td>-0.004359</td><td>121.139843</td><td>0.001236</td><td>172.657206</td><td>...</td><td>-0.005851</td><td>118.665751</td><td>-0.001657</td><td>148.93831</td><td>22182.620185</td><td>121.146176</td><td>1.00906</td><td>-0.001817</td><td>92.916794</td><td>0.001214</td></tr>
  </tbody>
</table>
<p>1 rows × 23 columns</p>
</div>
" }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainer.train(stderr_to_stdout=True)\n", "trainer.get_leaderboard()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Clean the temporary directory of the notebook." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 18, "outputs": [], "source": [ "temp_path.cleanup()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.16" } }, "nbformat": 4, "nbformat_minor": 0 }