{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "# Basics of running benchmarks\n", "\n", "Tabular Ensemble (`tabensemb`) is a benchmark platform for tabular prediction tasks. We support three well-established model bases as baselines:\n", "\n", "* `autogluon`: [Link](https://github.com/autogluon/autogluon)\n", "\n", "* `pytorch_widedeep`: [Link](https://github.com/jrzaurin/pytorch-widedeep)\n", "\n", "* `pytorch_tabular`: [Link](https://github.com/manujosephv/pytorch_tabular)\n", "\n", "Users can run benchmarks on custom datasets with custom preprocessing steps, and implement custom models in the framework so that they are trained and compared with the baselines under a consistent procedure.\n", "\n", "In this part, minimal examples of regression, binary classification, and multiclass classification show the basic functionality of the package." ] }, { "cell_type": "markdown", "source": [ "## Regression" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "### Loading packages\n", "\n", "First, import the necessary modules. Then check whether `CUDA` is available and choose the training device accordingly." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using cuda device\n" ] } ], "source": [ "import torch\n", "from tabensemb.trainer import Trainer\n", "from tabensemb.model import *\n", "from tabensemb.config import UserConfig\n", "import tabensemb\n", "import os\n", "\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "print(\"Using {} device\".format(device))" ] }, { "cell_type": "markdown", "source": [ "`tabensemb` uses paths relative to the current directory. 
The current directory can differ between IDEs (PyCharm, VSCode, etc.), so set the default paths explicitly.\n", "\n", "* `tabensemb.setting[\"default_output_path\"]`: The path where results are saved. It is created if it does not exist.\n", "* `tabensemb.setting[\"default_config_path\"]`: The path to configuration files (see \"Using a configuration file\" for its usage).\n", "* `tabensemb.setting[\"default_data_path\"]`: The path to data files. Downloaded datasets are also saved here (see \"Using a configuration file\" for its usage).\n", "\n", "In this notebook, we use a temporary directory for cleanliness. Change `temp_path.name` to your own directory." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 2, "outputs": [], "source": [ "from tempfile import TemporaryDirectory\n", "\n", "temp_path = TemporaryDirectory()\n", "tabensemb.setting[\"default_output_path\"] = os.path.join(temp_path.name, \"output\")\n", "tabensemb.setting[\"default_config_path\"] = os.path.join(temp_path.name, \"configs\")\n", "tabensemb.setting[\"default_data_path\"] = os.path.join(temp_path.name, \"data\")" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "### Configuring a `Trainer`\n", "\n", "Create a `Trainer`, which acts as a bridge between data and models and provides some useful utilities." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "trainer = Trainer(device=device)" ] }, { "cell_type": "markdown", "source": [ "As an example, we use the Auto MPG dataset from [UCI datasets](https://archive.ics.uci.edu/datasets). We can import UCI datasets through the `UserConfig` class." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 4, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading https://archive.ics.uci.edu/static/public/9/auto+mpg.zip to /tmp/tmpqcxgn2l1/data/Auto MPG.zip\n", "cylinders is Integer and will be treated as a continuous feature.\n", "model_year is Integer and will be treated as a continuous feature.\n", "origin is Integer and will be treated as a continuous feature.\n", "Unknown values are detected in ['horsepower']. They will be treated as np.nan.\n", "The project will be saved to /tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig\n" ] } ], "source": [ "mpg_columns = [\n", " \"mpg\",\n", " \"cylinders\",\n", " \"displacement\",\n", " \"horsepower\",\n", " \"weight\",\n", " \"acceleration\",\n", " \"model_year\",\n", " \"origin\",\n", " \"car_name\",\n", "]\n", "cfg = UserConfig.from_uci(\"Auto MPG\", column_names=mpg_columns, sep=r\"\\s+\")\n", "trainer.load_config(cfg)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "*Optional*: We provide a useful `Logging` class to record all outputs to a file located in the above project root so that users can review the training process. This step is optional but we strongly recommend using it.\n", "\n", "`Trainer.project_root` is the output directory of the `trainer`, and here we log all `stdout` and `stderr` to `log.txt` in this directory." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from tabensemb.utils import Logging\n", "log = Logging()\n", "log.enter(os.path.join(trainer.project_root, \"log.txt\"))" ] }, { "cell_type": "markdown", "source": [ "### Viewing configurations\n", "\n", "We can view the summary of the current environment, including devices/Python version, the loaded configuration, and global settings of `tabensemb`." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 6, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Device:\n", "{\n", "\t'System': 'Linux',\n", "\t'Node name': 'xlluo-WS',\n", "\t'System release': '5.15.6-custom',\n", "\t'System version': '#1 SMP Mon Dec 13 20:27:58 CST 2021',\n", "\t'Machine architecture': 'x86_64',\n", "\t'Processor architecture': 'x86_64',\n", "\t'Processor model': '11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz',\n", "\t'Physical cores': 8,\n", "\t'Total cores': 16,\n", "\t'Max core frequency': '5150.00Mhz',\n", "\t'Total memory': '31.20GB',\n", "\t'Python version': '3.10.12',\n", "\t'Python implementation': 'CPython',\n", "\t'Python compiler': 'GCC 11.2.0',\n", "\t'Cuda availability': True,\n", "\t'GPU devices': [\n", "\t\t'NVIDIA GeForce RTX 3090'\n", "\t]\n", "}\n", "Configurations:\n", "{\n", "\t'database': 'auto-mpg',\n", "\t'task': 'regression',\n", "\t'loss': None,\n", "\t'bayes_opt': False,\n", "\t'bayes_calls': 50,\n", "\t'bayes_epoch': 30,\n", "\t'patience': 100,\n", "\t'epoch': 300,\n", "\t'lr': 0.001,\n", "\t'weight_decay': 1e-09,\n", "\t'batch_size': 1024,\n", "\t'layers': [\n", "\t\t64,\n", "\t\t128,\n", "\t\t256,\n", "\t\t128,\n", "\t\t64\n", "\t],\n", "\t'SPACEs': {\n", "\t\t'lr': {\n", "\t\t\t'type': 'Real',\n", "\t\t\t'low': 0.0001,\n", "\t\t\t'high': 0.05,\n", "\t\t\t'prior': 'log-uniform'\n", "\t\t},\n", "\t\t'weight_decay': {\n", "\t\t\t'type': 
'Real',\n", "\t\t\t'low': 1e-09,\n", "\t\t\t'high': 0.05,\n", "\t\t\t'prior': 'log-uniform'\n", "\t\t},\n", "\t\t'batch_size': {\n", "\t\t\t'type': 'Categorical',\n", "\t\t\t'categories': [\n", "\t\t\t\t64,\n", "\t\t\t\t128,\n", "\t\t\t\t256,\n", "\t\t\t\t512,\n", "\t\t\t\t1024,\n", "\t\t\t\t2048\n", "\t\t\t]\n", "\t\t}\n", "\t},\n", "\t'data_splitter': 'RandomSplitter',\n", "\t'split_ratio': [\n", "\t\t0.6,\n", "\t\t0.2,\n", "\t\t0.2\n", "\t],\n", "\t'data_imputer': 'MissForestImputer',\n", "\t'data_processors': [\n", "\t\t(\n", "\t\t\t'CategoricalOrdinalEncoder',\n", "\t\t\t{\n", "\t\t\t}\n", "\t\t),\n", "\t\t(\n", "\t\t\t'NaNFeatureRemover',\n", "\t\t\t{\n", "\t\t\t}\n", "\t\t),\n", "\t\t(\n", "\t\t\t'VarianceFeatureSelector',\n", "\t\t\t{\n", "\t\t\t\t'thres': 1\n", "\t\t\t}\n", "\t\t),\n", "\t\t(\n", "\t\t\t'StandardScaler',\n", "\t\t\t{\n", "\t\t\t}\n", "\t\t)\n", "\t],\n", "\t'data_derivers': [\n", "\t],\n", "\t'categorical_feature_names': [\n", "\t],\n", "\t'continuous_feature_names': [\n", "\t\t'displacement',\n", "\t\t'cylinders',\n", "\t\t'horsepower',\n", "\t\t'weight',\n", "\t\t'acceleration',\n", "\t\t'model_year',\n", "\t\t'origin'\n", "\t],\n", "\t'feature_types': {\n", "\t\t'displacement': 'Continuous',\n", "\t\t'cylinders': 'Continuous',\n", "\t\t'horsepower': 'Continuous',\n", "\t\t'weight': 'Continuous',\n", "\t\t'acceleration': 'Continuous',\n", "\t\t'model_year': 'Continuous',\n", "\t\t'origin': 'Continuous'\n", "\t},\n", "\t'unique_feature_types': [\n", "\t\t'Continuous'\n", "\t],\n", "\t'label_name': [\n", "\t\t'mpg'\n", "\t]\n", "}\n", "Global settings:\n", "{\n", "\t'random_seed': 42,\n", "\t'low_memory': True,\n", "\t'verbose_per_epoch': 20,\n", "\t'test_with_no_grad': True,\n", "\t'debug_mode': False,\n", "\t'default_output_path': '/tmp/tmpqcxgn2l1/output',\n", "\t'default_config_path': '/tmp/tmpqcxgn2l1/configs',\n", "\t'default_data_path': '/tmp/tmpqcxgn2l1/data',\n", "\t'warn_nan_metric': True,\n", 
"\t'raise_inconsistent_inferred_task': False,\n", "\t'matplotlib_usetex': False\n", "}\n" ] } ], "source": [ "trainer.summarize_setting()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "### Loading data\n", "\n", "In the configuration summary above, the dataset file is defined by \"database\" under the `Configurations` category. `Trainer.load_data` automatically searches for the file in the current directory and in `tabensemb.setting[\"default_data_path\"]`. Now, load the Auto MPG dataset into the `Trainer`. The `Trainer` processes the dataset in the following steps to prepare it for model training:\n", "\n", "1. Data splitting (training/validation/testing sets)\n", "2. Data imputation\n", "3. Data augmentation (for features)\n", "4. Data processing\n", "    * Data augmentation (for data points)\n", "    * Data filtering\n", "    * Feature selection\n", "    * Categorical encoding\n", "    * Data scaling\n", "    * etc.\n", "5. Data augmentation (for features, especially multi-modal features)\n" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset size: 238 80 80\n", "Data saved to /tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig (data.csv and tabular_data.csv).\n" ] } ], "source": [ "trainer.load_data()" ] }, { "cell_type": "markdown", "source": [ "### Initializing model bases\n", "\n", "Initialize model bases and add them to the `Trainer`. For demonstration, we choose only a subset of the models in each model base by passing the `model_subset` argument (without it, all available models will be trained)." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "models = [\n", " PytorchTabular(trainer, model_subset=[\"Category Embedding\"]),\n", " WideDeep(trainer, model_subset=[\"TabMlp\"]),\n", " AutoGluon(trainer, model_subset=[\"Random Forest\"]),\n", "]\n", "trainer.add_modelbases(models)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } }, "source": [ "### Start training\n", "\n", "Now train the model bases. The argument `stderr_to_stdout` redirects warnings and log messages to `stdout`, which keeps the records in the notebook clean.\n", "\n", "*Optional*: Using the following line, we can run k-fold cross-validation to get the leaderboard, where k is `cross_validation`.\n", "\n", "```python\n", "trainer.get_leaderboard(cross_validation=10, split_type=\"cv\", stderr_to_stdout=True)\n", "```\n", "\n", "**Remark**: `split_type` can also be `random`, in which case the dataset is randomly split according to the `split_ratio` given in the configuration, using different random seeds." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "-------------Run PytorchTabular-------------\n", "\n", "Training Category Embedding\n", "Global seed set to 42\n", "2023-09-23 20:36:01,062 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders\n", "2023-09-23 20:36:01,062 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task\n", "2023-09-23 20:36:01,070 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel\n", "2023-09-23 20:36:01,081 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.\n", " rank_zero_deprecation(\n", "Auto select gpus: [0]\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "IPU available: False, using: 0 IPUs\n", "HPU available: False, using: 0 HPUs\n", "2023-09-23 20:36:01,991 - {pytorch_tabular.tabular_model:582} - INFO - Training Started\n", "You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. 
For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "\n", " | Name | Type | Params\n", "---------------------------------------------------------------\n", "0 | _backbone | CategoryEmbeddingBackbone | 11.4 K\n", "1 | _embedding_layer | Embedding1dLayer | 14 \n", "2 | head | LinearHead | 33 \n", "3 | loss | MSELoss | 0 \n", "---------------------------------------------------------------\n", "11.4 K Trainable params\n", "0 Non-trainable params\n", "11.4 K Total params\n", "0.046 Total estimated model params size (MB)\n", "Epoch: 1/300, Train loss: 677.8015, Val loss: 582.9557, Min val loss: 582.9557, Epoch time: 0.012s.\n", "Epoch: 20/300, Train loss: 353.7851, Val loss: 302.0203, Min val loss: 302.0203, Epoch time: 0.010s.\n", "Epoch: 40/300, Train loss: 85.0776, Val loss: 62.1153, Min val loss: 62.1153, Epoch time: 0.009s.\n", "Epoch: 60/300, Train loss: 45.2654, Val loss: 34.2778, Min val loss: 34.2691, Epoch time: 0.009s.\n", "Epoch: 80/300, Train loss: 33.9537, Val loss: 26.8622, Min val loss: 26.8622, Epoch time: 0.016s.\n", "Epoch: 100/300, Train loss: 26.9038, Val loss: 23.2417, Min val loss: 23.2372, Epoch time: 0.011s.\n", "Epoch: 120/300, Train loss: 24.9622, Val loss: 20.4360, Min val loss: 20.4360, Epoch time: 0.008s.\n", "Epoch: 140/300, Train loss: 24.1636, Val loss: 19.4010, Min val loss: 19.4010, Epoch time: 0.012s.\n", "Epoch: 160/300, Train loss: 22.9200, Val loss: 18.0232, Min val loss: 17.9749, Epoch time: 0.010s.\n", "Epoch: 180/300, Train loss: 19.7677, Val loss: 16.9469, Min val loss: 16.9469, Epoch time: 0.011s.\n", "Epoch: 200/300, Train loss: 17.9390, Val loss: 16.6545, Min val loss: 16.4093, Epoch time: 0.010s.\n", "Epoch: 220/300, Train loss: 19.4496, Val loss: 15.4451, Min val loss: 15.1788, Epoch time: 0.011s.\n", "Epoch: 240/300, Train loss: 16.0483, Val loss: 14.5508, Min val loss: 14.5508, Epoch 
time: 0.010s.\n", "Epoch: 260/300, Train loss: 16.4672, Val loss: 13.8354, Min val loss: 13.8354, Epoch time: 0.010s.\n", "Epoch: 280/300, Train loss: 13.6031, Val loss: 12.9315, Min val loss: 12.9315, Epoch time: 0.018s.\n", "Epoch: 300/300, Train loss: 16.5369, Val loss: 12.3673, Min val loss: 12.3673, Epoch time: 0.011s.\n", "`Trainer.fit` stopped: `max_epochs=300` reached.\n", "2023-09-23 20:36:07,420 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed\n", "2023-09-23 20:36:07,420 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.\n", " rank_zero_deprecation(\n", "Training mse loss: 11.25175\n", "Validation mse loss: 12.36725\n", "Testing mse loss: 7.83801\n", "Trainer saved. 
To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------PytorchTabular End-------------\n", "\n", "\n", "-------------Run WideDeep-------------\n", "\n", "Training TabMlp\n", "Epoch: 1/300, Train loss: 635.5330, Val loss: 555.4755, Min val loss: 555.4755\n", "Epoch: 21/300, Train loss: 441.6902, Val loss: 375.7337, Min val loss: 375.7337\n", "Epoch: 41/300, Train loss: 145.8623, Val loss: 119.9598, Min val loss: 119.9598\n", "Epoch: 61/300, Train loss: 45.9133, Val loss: 34.0160, Min val loss: 34.0160\n", "Epoch: 81/300, Train loss: 27.6878, Val loss: 24.1525, Min val loss: 24.1525\n", "Epoch: 101/300, Train loss: 23.0877, Val loss: 18.2096, Min val loss: 18.2096\n", "Epoch: 121/300, Train loss: 21.4056, Val loss: 17.2203, Min val loss: 17.1303\n", "Epoch: 141/300, Train loss: 21.2559, Val loss: 16.0746, Min val loss: 16.0746\n", "Epoch: 161/300, Train loss: 19.2337, Val loss: 15.3027, Min val loss: 15.3027\n", "Epoch: 181/300, Train loss: 16.1232, Val loss: 14.5777, Min val loss: 14.5777\n", "Epoch: 201/300, Train loss: 16.7095, Val loss: 14.2274, Min val loss: 14.2274\n", "Epoch: 221/300, Train loss: 15.7366, Val loss: 13.5223, Min val loss: 13.5223\n", "Epoch: 241/300, Train loss: 16.9825, Val loss: 12.9892, Min val loss: 12.9892\n", "Epoch: 261/300, Train loss: 15.3358, Val loss: 12.4278, Min val loss: 12.4278\n", "Epoch: 281/300, Train loss: 13.3989, Val loss: 12.1155, Min val loss: 12.1155\n", "Restoring model weights from the end of the best epoch\n", "Training mse loss: 10.17037\n", "Validation mse loss: 11.66271\n", "Testing mse loss: 6.43856\n", "Trainer saved. 
To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------WideDeep End-------------\n", "\n", "\n", "-------------Run AutoGluon-------------\n", "\n", "Training Random Forest\n", "Presets specified: ['best_quality']\n", "Warning: hyperparameter tuning is currently experimental and may cause the process to hang.\n", "Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=1\n", "Beginning AutoGluon training ...\n", "AutoGluon will save models to \"/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/AutoGluon/Random Forest/\"\n", "AutoGluon Version: 0.8.2\n", "Python Version: 3.10.12\n", "Operating System: Linux\n", "Platform Machine: x86_64\n", "Platform Version: #1 SMP Mon Dec 13 20:27:58 CST 2021\n", "Disk Space Avail: 149.28 GB / 502.47 GB (29.7%)\n", "Train Data Rows: 238\n", "Train Data Columns: 7\n", "Tuning Data Rows: 80\n", "Tuning Data Columns: 7\n", "Label Column: mpg\n", "Preprocessing data ...\n", "Using Feature Generators to preprocess the data ...\n", "Fitting PipelineFeatureGenerator...\n", "\tAvailable Memory: 12533.16 MB\n", "\tTrain Data (Original) Memory Usage: 0.02 MB (0.0% of available memory)\n", "\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n", "\tStage 1 Generators:\n", "\t\tFitting AsTypeFeatureGenerator...\n", "\tStage 2 Generators:\n", "\t\tFitting FillNaFeatureGenerator...\n", "\tStage 3 Generators:\n", "\t\tFitting IdentityFeatureGenerator...\n", "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. 
Ignoring feature_metadata_in.\n", "\tStage 4 Generators:\n", "\t\tFitting DropUniqueFeatureGenerator...\n", "\tStage 5 Generators:\n", "\t\tFitting DropDuplicatesFeatureGenerator...\n", "\tTypes of features in original data (raw dtype, special dtypes):\n", "\t\t('float', []) : 4 | ['displacement', 'horsepower', 'weight', 'acceleration']\n", "\t\t('int', []) : 3 | ['cylinders', 'model_year', 'origin']\n", "\tTypes of features in processed data (raw dtype, special dtypes):\n", "\t\t('float', []) : 4 | ['displacement', 'horsepower', 'weight', 'acceleration']\n", "\t\t('int', []) : 3 | ['cylinders', 'model_year', 'origin']\n", "\t0.0s = Fit runtime\n", "\t7 features in original data used to generate 7 features in processed data.\n", "\tTrain Data (Processed) Memory Usage: 0.02 MB (0.0% of available memory)\n", "Data preprocessing and feature engineering runtime = 0.03s ...\n", "AutoGluon will gauge predictive performance using evaluation metric: 'mean_squared_error'\n", "\tThis metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.\n", "\tTo change this, specify the eval_metric parameter of Predictor()\n", "use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).\n", "User-specified model hyperparameters to be fit:\n", "{\n", "\t'RF': {},\n", "}\n", "Fitting 1 L1 models ...\n", "Hyperparameter tuning model: RandomForest_BAG_L1 ...\n", "\tNo hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. 
Will train one model based on the provided hyperparameters.\n", "Fitted model: RandomForest_BAG_L1 ...\n", "\t-8.1918\t = Validation score (-mean_squared_error)\n", "\t0.28s\t = Training runtime\n", "\t0.0s\t = Validation runtime\n", "Fitting model: WeightedEnsemble_L2 ...\n", "\t-11.4141\t = Validation score (-mean_squared_error)\n", "\t0.0s\t = Training runtime\n", "\t0.0s\t = Validation runtime\n", "AutoGluon training complete, total runtime = 0.35s ... Best model: \"WeightedEnsemble_L2\"\n", "TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/AutoGluon/Random Forest/\")\n", "Training mse loss: 1.07741\n", "Validation mse loss: 11.41409\n", "Testing mse loss: 4.19031\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------AutoGluon End-------------\n", "\n" ] } ], "source": [ "trainer.train(stderr_to_stdout=True)" ] }, { "cell_type": "markdown", "source": [ "After training finishes, check the leaderboard to compare the performance of the trained models.\n", "\n", "Metrics used in leaderboards can be found in `tabensemb.utils.utils.REGRESSION_METRICS/BINARY_METRICS/MULTICLASS_METRICS`. Most of the metrics are from `sklearn.metrics`." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PytorchTabular metrics\n", "Category Embedding 1/1\n", "WideDeep metrics\n", "TabMlp 1/1\n", "AutoGluon metrics\n", "Random Forest 1/1\n", "Trainer saved. 
To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')\n" ] }, { "data": { "text/plain": " Program Model Training RMSE Training MSE \\\n0 AutoGluon Random Forest 1.037981 1.077405 \n1 WideDeep TabMlp 3.189102 10.170372 \n2 PytorchTabular Category Embedding 3.354362 11.251746 \n\n Training MAE Training MAPE Training R2 Training MEDIAN_ABSOLUTE_ERROR \\\n0 0.741566 0.031074 0.983285 0.529500 \n1 2.318564 0.096454 0.842218 1.669983 \n2 2.445915 0.101659 0.825442 1.775388 \n\n Training EXPLAINED_VARIANCE_SCORE Testing RMSE ... Testing R2 \\\n0 0.983293 2.047025 ... 0.922065 \n1 0.859805 2.537431 ... 0.880250 \n2 0.854523 2.799644 ... 0.854221 \n\n Testing MEDIAN_ABSOLUTE_ERROR Testing EXPLAINED_VARIANCE_SCORE \\\n0 1.156333 0.922591 \n1 1.767459 0.900587 \n2 1.963455 0.888258 \n\n Validation RMSE Validation MSE Validation MAE Validation MAPE \\\n0 3.378475 11.414091 2.269187 0.102995 \n1 3.415071 11.662707 2.539188 0.116035 \n2 3.516710 12.367250 2.731159 0.125136 \n\n Validation R2 Validation MEDIAN_ABSOLUTE_ERROR \\\n0 0.796098 1.641334 \n1 0.791657 1.904160 \n2 0.779071 2.375105 \n\n Validation EXPLAINED_VARIANCE_SCORE \n0 0.796506 \n1 0.806152 \n2 0.808039 \n\n[3 rows x 23 columns]", "text/html": "
|  | Program | Model | Training RMSE | Training MSE | Training MAE | Training MAPE | Training R2 | Training MEDIAN_ABSOLUTE_ERROR | Training EXPLAINED_VARIANCE_SCORE | Testing RMSE | ... | Testing R2 | Testing MEDIAN_ABSOLUTE_ERROR | Testing EXPLAINED_VARIANCE_SCORE | Validation RMSE | Validation MSE | Validation MAE | Validation MAPE | Validation R2 | Validation MEDIAN_ABSOLUTE_ERROR | Validation EXPLAINED_VARIANCE_SCORE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AutoGluon | Random Forest | 1.037981 | 1.077405 | 0.741566 | 0.031074 | 0.983285 | 0.529500 | 0.983293 | 2.047025 | ... | 0.922065 | 1.156333 | 0.922591 | 3.378475 | 11.414091 | 2.269187 | 0.102995 | 0.796098 | 1.641334 | 0.796506 |
| 1 | WideDeep | TabMlp | 3.189102 | 10.170372 | 2.318564 | 0.096454 | 0.842218 | 1.669983 | 0.859805 | 2.537431 | ... | 0.880250 | 1.767459 | 0.900587 | 3.415071 | 11.662707 | 2.539188 | 0.116035 | 0.791657 | 1.904160 | 0.806152 |
| 2 | PytorchTabular | Category Embedding | 3.354362 | 11.251746 | 2.445915 | 0.101659 | 0.825442 | 1.775388 | 0.854523 | 2.799644 | ... | 0.854221 | 1.963455 | 0.888258 | 3.516710 | 12.367250 | 2.731159 | 0.125136 | 0.779071 | 2.375105 | 0.808039 |
3 rows × 23 columns
|  | Program | Model | Training F1_SCORE | Training PRECISION_SCORE | Training RECALL_SCORE | Training JACCARD_SCORE | Training ACCURACY_SCORE | Training BALANCED_ACCURACY_SCORE | Training COHEN_KAPPA_SCORE | Training HAMMING_LOSS | ... | Validation ACCURACY_SCORE | Validation BALANCED_ACCURACY_SCORE | Validation COHEN_KAPPA_SCORE | Validation HAMMING_LOSS | Validation MATTHEWS_CORRCOEF | Validation ZERO_ONE_LOSS | Validation ROC_AUC_SCORE | Validation LOG_LOSS | Validation BRIER_SCORE_LOSS | Validation AVERAGE_PRECISION_SCORE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WideDeep | TabMlp | 0.694200 | 0.728505 | 0.662981 | 0.531628 | 0.859388 | 0.792321 | 0.603167 | 0.140612 | ... | 0.852426 | 0.784474 | 0.584884 | 0.147574 | 0.585738 | 0.147574 | 0.908951 | 0.317288 | 0.101612 | 0.868420 |
| 1 | AutoGluon | Random Forest | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | ... | 0.853808 | 0.776665 | 0.580404 | 0.146192 | 0.583003 | 0.146192 | 0.907010 | 0.318016 | 0.100486 | 0.875084 |
| 2 | PytorchTabular | Category Embedding | 0.709806 | 0.738341 | 0.683394 | 0.550154 | 0.865479 | 0.803303 | 0.622423 | 0.134521 | ... | 0.850430 | 0.784467 | 0.581612 | 0.149570 | 0.582150 | 0.149570 | 0.909318 | 0.316194 | 0.101722 | 0.868410 |
3 rows × 44 columns
|  | Program | Model | Training ACCURACY_SCORE | Training BALANCED_ACCURACY_SCORE | Training COHEN_KAPPA_SCORE | Training HAMMING_LOSS | Training MATTHEWS_CORRCOEF | Training ZERO_ONE_LOSS | Training PRECISION_SCORE_MACRO | Training PRECISION_SCORE_MICRO | ... | Validation F1_SCORE_MICRO | Validation F1_SCORE_WEIGHTED | Validation JACCARD_SCORE_MACRO | Validation JACCARD_SCORE_MICRO | Validation JACCARD_SCORE_WEIGHTED | Validation TOP_K_ACCURACY_SCORE | Validation LOG_LOSS | Validation ROC_AUC_SCORE_OVR_MACRO | Validation ROC_AUC_SCORE_OVR_WEIGHTED | Validation ROC_AUC_SCORE_OVO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PytorchTabular | Category Embedding | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | 0.833333 | 0.837232 | 0.756944 | 0.714286 | 0.732639 | 1.0 | 0.366983 | 0.974891 | 0.971616 | 0.976042 |
| 1 | WideDeep | TabMlp | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | 0.833333 | 0.837232 | 0.756944 | 0.714286 | 0.732639 | 1.0 | 0.295129 | 0.979747 | 0.977576 | 0.980833 |
| 2 | AutoGluon | Random Forest | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | 0.800000 | 0.804615 | 0.721154 | 0.666667 | 0.689423 | 1.0 | 0.781551 | 0.950812 | 0.941465 | 0.951042 |
3 rows × 71 columns
|  | Program | Model | Training RMSE | Training MSE | Training MAE | Training MAPE | Training R2 | Training MEDIAN_ABSOLUTE_ERROR | Training EXPLAINED_VARIANCE_SCORE | Testing RMSE | ... | Testing R2 | Testing MEDIAN_ABSOLUTE_ERROR | Testing EXPLAINED_VARIANCE_SCORE | Validation RMSE | Validation MSE | Validation MAE | Validation MAPE | Validation R2 | Validation MEDIAN_ABSOLUTE_ERROR | Validation EXPLAINED_VARIANCE_SCORE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PytorchTabular | Category Embedding | 181.893055 | 33085.083331 | 145.409738 | 1.055588 | -0.004359 | 121.139843 | 0.001236 | 172.657206 | ... | -0.005851 | 118.665751 | -0.001657 | 148.93831 | 22182.620185 | 121.146176 | 1.00906 | -0.001817 | 92.916794 | 0.001214 |
1 rows × 23 columns