Basics of running benchmarks#

Tabular Ensemble (tabensemb) is a benchmark platform for tabular prediction tasks. We support three well-established model bases as baselines:

autogluon: Link
pytorch_widedeep: Link
pytorch_tabular: Link

Users can run benchmarks on custom datasets with custom preprocessing steps, and implement custom models within the framework to compare their performance against the baselines under a consistent procedure.

This part walks through minimal examples of regression, binary classification, and multiclass classification to show the basic functionality of the package.

Regression#

Loading packages#

First, import the necessary modules. Then check whether CUDA is available and choose the training device accordingly.

[1]:

import torch
from tabensemb.trainer import Trainer
from tabensemb.model import *
from tabensemb.config import UserConfig
import tabensemb
import os

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

Using cuda device

tabensemb resolves paths relative to the current working directory, which can differ between IDEs (PyCharm, VSCode, etc.). Set the default paths to the locations you want:

tabensemb.setting["default_output_path"]: The path used to save results. It will be created if it does not exist.
tabensemb.setting["default_config_path"]: The path to configuration files (see “Using a configuration file” for its usage).
tabensemb.setting["default_data_path"]: The path to data files. Downloaded datasets are also saved here (see “Using a configuration file” for its usage).

In this notebook, we use a temporary directory to keep things clean. To keep the results, replace temp_path.name with your own directory.

[2]:

from tempfile import TemporaryDirectory

temp_path = TemporaryDirectory()
tabensemb.setting["default_output_path"] = os.path.join(temp_path.name, "output")
tabensemb.setting["default_config_path"] = os.path.join(temp_path.name, "configs")
tabensemb.setting["default_data_path"] = os.path.join(temp_path.name, "data")
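For a persistent workspace, the same three settings can point into a directory of your choice. The following is a minimal sketch using only the standard library: the helper `make_tabensemb_paths` and the workspace location are hypothetical names introduced for illustration, and the subdirectory names simply mirror the cell above.

```python
import os
import tempfile


def make_tabensemb_paths(base_dir):
    """Return the three default paths under base_dir, creating each directory."""
    paths = {
        "default_output_path": os.path.join(base_dir, "output"),
        "default_config_path": os.path.join(base_dir, "configs"),
        "default_data_path": os.path.join(base_dir, "data"),
    }
    for path in paths.values():
        # exist_ok=True makes the call safe to repeat across sessions.
        os.makedirs(path, exist_ok=True)
    return paths


# Hypothetical workspace location; replace it with a directory of your choice.
workspace = os.path.join(tempfile.gettempdir(), "tabensemb_workspace")
paths = make_tabensemb_paths(workspace)

# With tabensemb imported, each entry would be assigned as in the cell above:
#     for key, value in paths.items():
#         tabensemb.setting[key] = value
for key, value in paths.items():
    print(key, "->", value)
```

Using absolute paths here avoids depending on whichever working directory the IDE happens to start in.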

Configuring a `Trainer`#

Create a Trainer, which acts as a bridge between data and models and provides some useful utilities.

[3]:

trainer = Trainer(device=device)

As an example, we use the Auto MPG dataset from UCI datasets. We can import UCI datasets through the UserConfig class.

[4]:

mpg_columns = [
    "mpg",
    "cylinders",
    "displacement",
    "horsepower",
    "weight",
    "acceleration",
    "model_year",
    "origin",
    "car_name",
]
cfg = UserConfig.from_uci("Auto MPG", column_names=mpg_columns, sep=r"\s+")
trainer.load_config(cfg)

Downloading https://archive.ics.uci.edu/static/public/9/auto+mpg.zip to /tmp/tmpqcxgn2l1/data/Auto MPG.zip
cylinders is Integer and will be treated as a continuous feature.
model_year is Integer and will be treated as a continuous feature.
origin is Integer and will be treated as a continuous feature.
Unknown values are detected in ['horsepower']. They will be treated as np.nan.
The project will be saved to /tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig

Optional: We provide a useful Logging class to record all outputs to a file located in the above project root so that users can review the training process. This step is optional but we strongly recommend using it.

Trainer.project_root is the output directory of the trainer, and here we log all stdout and stderr to log.txt in this directory.

[5]:

from tabensemb.utils import Logging
log = Logging()
log.enter(os.path.join(trainer.project_root, "log.txt"))
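To illustrate the general idea of recording console output to a file while still showing it on screen, here is a small standard-library sketch. It is not tabensemb's `Logging` implementation; the `Tee` class is a hypothetical helper, and only stdout is duplicated in this example.

```python
import contextlib
import os
import sys
import tempfile


class Tee:
    """File-like object that forwards every write to several streams."""

    def __init__(self, *streams):
        self.streams = streams

    def write(self, text):
        for stream in self.streams:
            stream.write(text)
        return len(text)

    def flush(self):
        for stream in self.streams:
            stream.flush()


# Hypothetical log location inside a fresh temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "log.txt")

with open(log_path, "w") as log_file:
    # Capture the original stdout before redirecting, so output is still shown.
    tee = Tee(sys.stdout, log_file)
    with contextlib.redirect_stdout(tee):
        print("Dataset size: 238 80 80")

# Read the file back to confirm that the printed line was recorded.
with open(log_path) as log_file:
    logged = log_file.read()
```

The same pattern extends to stderr with `contextlib.redirect_stderr`.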

Viewing configurations#

We can view the summary of the current environment, including devices/Python version, the loaded configuration, and global settings of tabensemb.

[6]:

trainer.summarize_setting()

Device:
{
        'System': 'Linux',
        'Node name': 'xlluo-WS',
        'System release': '5.15.6-custom',
        'System version': '#1 SMP Mon Dec 13 20:27:58 CST 2021',
        'Machine architecture': 'x86_64',
        'Processor architecture': 'x86_64',
        'Processor model': '11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz',
        'Physical cores': 8,
        'Total cores': 16,
        'Max core frequency': '5150.00Mhz',
        'Total memory': '31.20GB',
        'Python version': '3.10.12',
        'Python implementation': 'CPython',
        'Python compiler': 'GCC 11.2.0',
        'Cuda availability': True,
        'GPU devices': [
                'NVIDIA GeForce RTX 3090'
        ]
}
Configurations:
{
        'database': 'auto-mpg',
        'task': 'regression',
        'loss': None,
        'bayes_opt': False,
        'bayes_calls': 50,
        'bayes_epoch': 30,
        'patience': 100,
        'epoch': 300,
        'lr': 0.001,
        'weight_decay': 1e-09,
        'batch_size': 1024,
        'layers': [
                64,
                128,
                256,
                128,
                64
        ],
        'SPACEs': {
                'lr': {
                        'type': 'Real',
                        'low': 0.0001,
                        'high': 0.05,
                        'prior': 'log-uniform'
                },
                'weight_decay': {
                        'type': 'Real',
                        'low': 1e-09,
                        'high': 0.05,
                        'prior': 'log-uniform'
                },
                'batch_size': {
                        'type': 'Categorical',
                        'categories': [
                                64,
                                128,
                                256,
                                512,
                                1024,
                                2048
                        ]
                }
        },
        'data_splitter': 'RandomSplitter',
        'split_ratio': [
                0.6,
                0.2,
                0.2
        ],
        'data_imputer': 'MissForestImputer',
        'data_processors': [
                (
                        'CategoricalOrdinalEncoder',
                        {
                        }
                ),
                (
                        'NaNFeatureRemover',
                        {
                        }
                ),
                (
                        'VarianceFeatureSelector',
                        {
                                'thres': 1
                        }
                ),
                (
                        'StandardScaler',
                        {
                        }
                )
        ],
        'data_derivers': [
        ],
        'categorical_feature_names': [
        ],
        'continuous_feature_names': [
                'displacement',
                'cylinders',
                'horsepower',
                'weight',
                'acceleration',
                'model_year',
                'origin'
        ],
        'feature_types': {
                'displacement': 'Continuous',
                'cylinders': 'Continuous',
                'horsepower': 'Continuous',
                'weight': 'Continuous',
                'acceleration': 'Continuous',
                'model_year': 'Continuous',
                'origin': 'Continuous'
        },
        'unique_feature_types': [
                'Continuous'
        ],
        'label_name': [
                'mpg'
        ]
}
Global settings:
{
        'random_seed': 42,
        'low_memory': True,
        'verbose_per_epoch': 20,
        'test_with_no_grad': True,
        'debug_mode': False,
        'default_output_path': '/tmp/tmpqcxgn2l1/output',
        'default_config_path': '/tmp/tmpqcxgn2l1/configs',
        'default_data_path': '/tmp/tmpqcxgn2l1/data',
        'warn_nan_metric': True,
        'raise_inconsistent_inferred_task': False,
        'matplotlib_usetex': False
}

Loading data#

In the configuration summary above, the dataset file is defined by “database” under the Configurations category. Trainer.load_data automatically searches for the file in the current directory and in tabensemb.setting["default_data_path"]. Now, load the Auto MPG dataset into the Trainer. The Trainer processes the dataset and prepares it for model training through the following steps:

1. Data splitting (training/validation/testing sets)
2. Data imputation
3. Data augmentation (for features)
4. Data processing
   - Data augmentation (for data points)
   - Data filtering
   - Feature selection
   - Categorical encoding
   - Data scaling
   - etc.
5. Data augmentation (for features, especially multi-modal features)

[7]:

trainer.load_data()

Dataset size: 238 80 80
Data saved to /tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig (data.csv and tabular_data.csv).
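The reported sizes add up to 398 rows (238 + 80 + 80), which is consistent with the configured split_ratio of [0.6, 0.2, 0.2]. The sketch below shows one rounding scheme that reproduces these numbers: hold out the validation and testing rows first (rounding up), then divide the held-out rows between the two sets. Whether RandomSplitter rounds in exactly this way is an assumption; `split_sizes` is a hypothetical helper for illustration.

```python
import math


def split_sizes(n_rows, split_ratio=(0.6, 0.2, 0.2)):
    """Return (train, validation, test) sizes using a two-stage split."""
    _, val_ratio, test_ratio = split_ratio
    holdout_ratio = val_ratio + test_ratio
    # Stage 1: separate the held-out rows, rounding up.
    n_holdout = math.ceil(n_rows * holdout_ratio)
    n_train = n_rows - n_holdout
    # Stage 2: divide the held-out rows between validation and testing.
    n_test = math.ceil(n_holdout * (test_ratio / holdout_ratio))
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


print(split_sizes(398))
```

Because each stage rounds to whole rows, the realized proportions are only approximately 60/20/20.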

Initializing model bases#

Initialize model bases and add them to the Trainer. We only choose a subset of models in each model base for demonstration by passing the model_subset argument (without it, all available models will be trained).

[8]:

models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Random Forest"]),
]
trainer.add_modelbases(models)

Start training#

Now train the model bases. The argument stderr_to_stdout redirects warnings and log messages to stdout, which keeps the notebook output clean.

Optional: The following line runs k-fold cross-validation to build the leaderboard, where k is given by the cross_validation argument.

trainer.get_leaderboard(cross_validation=10, split_type="cv", stderr_to_stdout=True)

Remark: split_type can also be "random", in which case the dataset is split randomly for each run according to the split_ratio in the configuration, using a different random seed each time.
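As a reminder of what k-fold cross-validation does, the following standard-library sketch partitions row indices into k folds and uses each fold once as the held-out set. It is a generic illustration rather than tabensemb's splitter: `kfold_indices` is a hypothetical helper, and how tabensemb derives a validation set within each fold is not shown here.

```python
import random


def kfold_indices(n_samples, k, seed=42):
    """Return k (train_indices, held_out_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, held_out))
        start += size
    return folds


folds = kfold_indices(n_samples=398, k=10)
for fold_id, (train, held_out) in enumerate(folds):
    print(fold_id, len(train), len(held_out))
```

Every row is held out exactly once, so metrics averaged over folds use the whole dataset for evaluation.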

[9]:

trainer.train(stderr_to_stdout=True)


-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-23 20:36:01,062 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-23 20:36:01,062 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-09-23 20:36:01,070 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-23 20:36:01,081 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-09-23 20:36:01,991 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type                      | Params
---------------------------------------------------------------
0 | _backbone        | CategoryEmbeddingBackbone | 11.4 K
1 | _embedding_layer | Embedding1dLayer          | 14
2 | head             | LinearHead                | 33
3 | loss             | MSELoss                   | 0
---------------------------------------------------------------
11.4 K    Trainable params
0         Non-trainable params
11.4 K    Total params
0.046     Total estimated model params size (MB)
Epoch: 1/300, Train loss: 677.8015, Val loss: 582.9557, Min val loss: 582.9557, Epoch time: 0.012s.
Epoch: 20/300, Train loss: 353.7851, Val loss: 302.0203, Min val loss: 302.0203, Epoch time: 0.010s.
Epoch: 40/300, Train loss: 85.0776, Val loss: 62.1153, Min val loss: 62.1153, Epoch time: 0.009s.
Epoch: 60/300, Train loss: 45.2654, Val loss: 34.2778, Min val loss: 34.2691, Epoch time: 0.009s.
Epoch: 80/300, Train loss: 33.9537, Val loss: 26.8622, Min val loss: 26.8622, Epoch time: 0.016s.
Epoch: 100/300, Train loss: 26.9038, Val loss: 23.2417, Min val loss: 23.2372, Epoch time: 0.011s.
Epoch: 120/300, Train loss: 24.9622, Val loss: 20.4360, Min val loss: 20.4360, Epoch time: 0.008s.
Epoch: 140/300, Train loss: 24.1636, Val loss: 19.4010, Min val loss: 19.4010, Epoch time: 0.012s.
Epoch: 160/300, Train loss: 22.9200, Val loss: 18.0232, Min val loss: 17.9749, Epoch time: 0.010s.
Epoch: 180/300, Train loss: 19.7677, Val loss: 16.9469, Min val loss: 16.9469, Epoch time: 0.011s.
Epoch: 200/300, Train loss: 17.9390, Val loss: 16.6545, Min val loss: 16.4093, Epoch time: 0.010s.
Epoch: 220/300, Train loss: 19.4496, Val loss: 15.4451, Min val loss: 15.1788, Epoch time: 0.011s.
Epoch: 240/300, Train loss: 16.0483, Val loss: 14.5508, Min val loss: 14.5508, Epoch time: 0.010s.
Epoch: 260/300, Train loss: 16.4672, Val loss: 13.8354, Min val loss: 13.8354, Epoch time: 0.010s.
Epoch: 280/300, Train loss: 13.6031, Val loss: 12.9315, Min val loss: 12.9315, Epoch time: 0.018s.
Epoch: 300/300, Train loss: 16.5369, Val loss: 12.3673, Min val loss: 12.3673, Epoch time: 0.011s.
`Trainer.fit` stopped: `max_epochs=300` reached.
2023-09-23 20:36:07,420 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-09-23 20:36:07,420 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.
  rank_zero_deprecation(
Training mse loss: 11.25175
Validation mse loss: 12.36725
Testing mse loss: 7.83801
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')

-------------PytorchTabular End-------------


-------------Run WideDeep-------------

Training TabMlp
Epoch: 1/300, Train loss: 635.5330, Val loss: 555.4755, Min val loss: 555.4755
Epoch: 21/300, Train loss: 441.6902, Val loss: 375.7337, Min val loss: 375.7337
Epoch: 41/300, Train loss: 145.8623, Val loss: 119.9598, Min val loss: 119.9598
Epoch: 61/300, Train loss: 45.9133, Val loss: 34.0160, Min val loss: 34.0160
Epoch: 81/300, Train loss: 27.6878, Val loss: 24.1525, Min val loss: 24.1525
Epoch: 101/300, Train loss: 23.0877, Val loss: 18.2096, Min val loss: 18.2096
Epoch: 121/300, Train loss: 21.4056, Val loss: 17.2203, Min val loss: 17.1303
Epoch: 141/300, Train loss: 21.2559, Val loss: 16.0746, Min val loss: 16.0746
Epoch: 161/300, Train loss: 19.2337, Val loss: 15.3027, Min val loss: 15.3027
Epoch: 181/300, Train loss: 16.1232, Val loss: 14.5777, Min val loss: 14.5777
Epoch: 201/300, Train loss: 16.7095, Val loss: 14.2274, Min val loss: 14.2274
Epoch: 221/300, Train loss: 15.7366, Val loss: 13.5223, Min val loss: 13.5223
Epoch: 241/300, Train loss: 16.9825, Val loss: 12.9892, Min val loss: 12.9892
Epoch: 261/300, Train loss: 15.3358, Val loss: 12.4278, Min val loss: 12.4278
Epoch: 281/300, Train loss: 13.3989, Val loss: 12.1155, Min val loss: 12.1155
Restoring model weights from the end of the best epoch
Training mse loss: 10.17037
Validation mse loss: 11.66271
Testing mse loss: 6.43856
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')

-------------WideDeep End-------------


-------------Run AutoGluon-------------

Training Random Forest
Presets specified: ['best_quality']
Warning: hyperparameter tuning is currently experimental and may cause the process to hang.
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=1
Beginning AutoGluon training ...
AutoGluon will save models to "/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/AutoGluon/Random Forest/"
AutoGluon Version:  0.8.2
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Mon Dec 13 20:27:58 CST 2021
Disk Space Avail:   149.28 GB / 502.47 GB (29.7%)
Train Data Rows:    238
Train Data Columns: 7
Tuning Data Rows:    80
Tuning Data Columns: 7
Label Column: mpg
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting PipelineFeatureGenerator...
        Available Memory:                    12533.16 MB
        Train Data (Original)  Memory Usage: 0.02 MB (0.0% of available memory)
        Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
        Stage 1 Generators:
                Fitting AsTypeFeatureGenerator...
        Stage 2 Generators:
                Fitting FillNaFeatureGenerator...
        Stage 3 Generators:
                Fitting IdentityFeatureGenerator...
                        Warning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.
        Stage 4 Generators:
                Fitting DropUniqueFeatureGenerator...
        Stage 5 Generators:
                Fitting DropDuplicatesFeatureGenerator...
        Types of features in original data (raw dtype, special dtypes):
                ('float', []) : 4 | ['displacement', 'horsepower', 'weight', 'acceleration']
                ('int', [])   : 3 | ['cylinders', 'model_year', 'origin']
        Types of features in processed data (raw dtype, special dtypes):
                ('float', []) : 4 | ['displacement', 'horsepower', 'weight', 'acceleration']
                ('int', [])   : 3 | ['cylinders', 'model_year', 'origin']
        0.0s = Fit runtime
        7 features in original data used to generate 7 features in processed data.
        Train Data (Processed) Memory Usage: 0.02 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.03s ...
AutoGluon will gauge predictive performance using evaluation metric: 'mean_squared_error'
        This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
        To change this, specify the eval_metric parameter of Predictor()
use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).
User-specified model hyperparameters to be fit:
{
        'RF': {},
}
Fitting 1 L1 models ...
Hyperparameter tuning model: RandomForest_BAG_L1 ...
        No hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. Will train one model based on the provided hyperparameters.
Fitted model: RandomForest_BAG_L1 ...
        -8.1918  = Validation score   (-mean_squared_error)
        0.28s    = Training   runtime
        0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
        -11.4141         = Validation score   (-mean_squared_error)
        0.0s     = Training   runtime
        0.0s     = Validation runtime
AutoGluon training complete, total runtime = 0.35s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/AutoGluon/Random Forest/")
Training mse loss: 1.07741
Validation mse loss: 11.41409
Testing mse loss: 4.19031
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')

-------------AutoGluon End-------------

After training finishes, check the leaderboard to compare the performance of the models.

Metrics used in leaderboards can be found in tabensemb.utils.utils.REGRESSION_METRICS/BINARY_METRICS/MULTICLASS_METRICS. Most of the metrics are from sklearn.metrics.

[10]:

trainer.get_leaderboard()

PytorchTabular metrics
Category Embedding 1/1
WideDeep metrics
TabMlp 1/1
AutoGluon metrics
Random Forest 1/1
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')

[10]:

	Program	Model	Training RMSE	Training MSE	Training MAE	Training MAPE	Training R2	Training MEDIAN_ABSOLUTE_ERROR	Training EXPLAINED_VARIANCE_SCORE	Testing RMSE	...	Testing R2	Testing MEDIAN_ABSOLUTE_ERROR	Testing EXPLAINED_VARIANCE_SCORE	Validation RMSE	Validation MSE	Validation MAE	Validation MAPE	Validation R2	Validation MEDIAN_ABSOLUTE_ERROR	Validation EXPLAINED_VARIANCE_SCORE
0	AutoGluon	Random Forest	1.037981	1.077405	0.741566	0.031074	0.983285	0.529500	0.983293	2.047025	...	0.922065	1.156333	0.922591	3.378475	11.414091	2.269187	0.102995	0.796098	1.641334	0.796506
1	WideDeep	TabMlp	3.189102	10.170372	2.318564	0.096454	0.842218	1.669983	0.859805	2.537431	...	0.880250	1.767459	0.900587	3.415071	11.662707	2.539188	0.116035	0.791657	1.904160	0.806152
2	PytorchTabular	Category Embedding	3.354362	11.251746	2.445915	0.101659	0.825442	1.775388	0.854523	2.799644	...	0.854221	1.963455	0.888258	3.516710	12.367250	2.731159	0.125136	0.779071	2.375105	0.808039

3 rows × 23 columns
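Several leaderboard columns are directly related. For example, RMSE is the square root of MSE, which can be checked against the AutoGluon row (1.037981 squared is approximately 1.077405). The sketch below computes a few of these metrics in plain Python using their standard definitions; it is not the code in tabensemb.utils.utils, and the sample values are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R2 from two equally long sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    # R2 compares the residual sum of squares with the total sum of squares.
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}


# Made-up values for illustration only; they are not taken from Auto MPG.
y_true = [18.0, 15.0, 24.0, 30.0]
y_pred = [17.0, 16.5, 22.0, 31.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```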

Binary classification#

To demonstrate binary classification, we use the Adult dataset from UCI datasets. Note that the Adult dataset ships with a separate testing set, which is discussed in the “Inference on an upcoming dataset” part.

[11]:

trainer = Trainer(device=device)
adult_columns = [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income",
]
cfg = UserConfig.from_uci("Adult", column_names=adult_columns, sep=", ")
trainer.load_config(cfg)
trainer.load_data()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Random Forest"]),
]
trainer.add_modelbases(models)
trainer.train(stderr_to_stdout=True)
trainer.get_leaderboard()

Downloading https://archive.ics.uci.edu/static/public/2/adult.zip to /tmp/tmpqcxgn2l1/data/Adult.zip

/home/xlluo/hdd/tabular_ensemble/tabensemb/config/user_config.py:292: UserWarning: There exists .test file(s) ['adult.test'] which should be used for final metrics. The .zip file is left for the user to process.
  warnings.warn(
/home/xlluo/hdd/tabular_ensemble/tabensemb/utils/utils.py:464: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  df = pd.read_csv(StringIO(s), names=names, sep=sep)

age is Integer and will be treated as a continuous feature.
fnlwgt is Integer and will be treated as a continuous feature.
education-num is Integer and will be treated as a continuous feature.
capital-gain is Integer and will be treated as a continuous feature.
capital-loss is Integer and will be treated as a continuous feature.
hours-per-week is Integer and will be treated as a continuous feature.
The project will be saved to /tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig
Dataset size: 19536 6512 6513
Data saved to /tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-23 20:36:17,315 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-23 20:36:17,317 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task
2023-09-23 20:36:17,382 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-23 20:36:17,412 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-09-23 20:36:17,437 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type                      | Params
---------------------------------------------------------------
0 | _backbone        | CategoryEmbeddingBackbone | 18.5 K
1 | _embedding_layer | Embedding1dLayer          | 1.4 K
2 | head             | LinearHead                | 66
3 | loss             | CrossEntropyLoss          | 0
---------------------------------------------------------------
20.0 K    Trainable params
0         Non-trainable params
20.0 K    Total params
0.080     Total estimated model params size (MB)
Epoch: 1/300, Train loss: 0.4666, Val loss: 0.3794, Min val loss: 0.3794, Epoch time: 0.465s.
Epoch: 20/300, Train loss: 0.3112, Val loss: 0.3184, Min val loss: 0.3177, Epoch time: 0.472s.
Epoch: 40/300, Train loss: 0.2972, Val loss: 0.3213, Min val loss: 0.3162, Epoch time: 0.435s.
Epoch: 60/300, Train loss: 0.2864, Val loss: 0.3248, Min val loss: 0.3162, Epoch time: 0.476s.
Epoch: 80/300, Train loss: 0.2766, Val loss: 0.3323, Min val loss: 0.3162, Epoch time: 0.530s.
Epoch: 100/300, Train loss: 0.2677, Val loss: 0.3416, Min val loss: 0.3162, Epoch time: 0.385s.
Epoch: 120/300, Train loss: 0.2602, Val loss: 0.3484, Min val loss: 0.3162, Epoch time: 0.527s.
2023-09-23 20:37:10,637 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-09-23 20:37:10,638 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.
  rank_zero_deprecation(
Training log_loss loss: 0.28862
Validation log_loss loss: 0.31619
Testing log_loss loss: 0.31114
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/trainer.pkl')

-------------PytorchTabular End-------------


-------------Run WideDeep-------------

Training TabMlp
Epoch: 1/300, Train loss: 0.5206, Val loss: 0.4274, Min val loss: 0.4274
Epoch: 21/300, Train loss: 0.3145, Val loss: 0.3161, Min val loss: 0.3161
Epoch: 41/300, Train loss: 0.3063, Val loss: 0.3192, Min val loss: 0.3161
Epoch: 61/300, Train loss: 0.2984, Val loss: 0.3209, Min val loss: 0.3161
Epoch: 81/300, Train loss: 0.2905, Val loss: 0.3227, Min val loss: 0.3161
Epoch: 101/300, Train loss: 0.2819, Val loss: 0.3266, Min val loss: 0.3161
Epoch: 121/300, Train loss: 0.2754, Val loss: 0.3293, Min val loss: 0.3161
Epoch 00121: early stopping
Restoring model weights from the end of the best epoch
Training log_loss loss: 0.30084
Validation log_loss loss: 0.31729
Testing log_loss loss: 0.31133
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/trainer.pkl')

-------------WideDeep End-------------


-------------Run AutoGluon-------------

Training Random Forest
Presets specified: ['best_quality']
Warning: hyperparameter tuning is currently experimental and may cause the process to hang.
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1
Beginning AutoGluon training ...
AutoGluon will save models to "/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/AutoGluon/Random Forest/"
AutoGluon Version:  0.8.2
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Mon Dec 13 20:27:58 CST 2021
Disk Space Avail:   148.97 GB / 502.47 GB (29.6%)
Train Data Rows:    19536
Train Data Columns: 14
Tuning Data Rows:    6512
Tuning Data Columns: 14
Label Column: income
Preprocessing data ...
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting PipelineFeatureGenerator...
        Available Memory:                    6146.04 MB
        Train Data (Original)  Memory Usage: 15.07 MB (0.2% of available memory)
        Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
        Stage 1 Generators:
                Fitting AsTypeFeatureGenerator...
                        Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
        Stage 2 Generators:
                Fitting FillNaFeatureGenerator...
        Stage 3 Generators:
                Fitting IdentityFeatureGenerator...
                        Warning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.
                Fitting CategoryFeatureGenerator...
                        Warning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.
                        Fitting CategoryMemoryMinimizeFeatureGenerator...
        Stage 4 Generators:
                Fitting DropUniqueFeatureGenerator...
        Stage 5 Generators:
                Fitting DropDuplicatesFeatureGenerator...
        Types of features in original data (raw dtype, special dtypes):
                ('int', [])    : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
                ('object', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]
        Types of features in processed data (raw dtype, special dtypes):
                ('category', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]
                ('int', [])      : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
        0.2s = Fit runtime
        14 features in original data used to generate 14 features in processed data.
        Train Data (Processed) Memory Usage: 1.46 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.18s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
        To change this, specify the eval_metric parameter of Predictor()
use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).
User-specified model hyperparameters to be fit:
{
        'RF': {},
}
Fitting 1 L1 models ...
Hyperparameter tuning model: RandomForest_BAG_L1 ...
        No hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. Will train one model based on the provided hyperparameters.
Fitted model: RandomForest_BAG_L1 ...
        0.8543   = Validation score   (accuracy)
        1.92s    = Training   runtime
        0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
        0.8538   = Validation score   (accuracy)
        0.0s     = Training   runtime
        0.01s    = Validation runtime
AutoGluon training complete, total runtime = 2.4s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/AutoGluon/Random Forest/")
Training log_loss loss: 0.08021
Validation log_loss loss: 0.31802
Testing log_loss loss: 0.31261
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/trainer.pkl')

-------------AutoGluon End-------------

PytorchTabular metrics
Category Embedding 1/1
WideDeep metrics
TabMlp 1/1
AutoGluon metrics
Random Forest 1/1
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig/trainer.pkl')

[11]:

	Program	Model	Training F1_SCORE	Training PRECISION_SCORE	Training RECALL_SCORE	Training JACCARD_SCORE	Training ACCURACY_SCORE	Training BALANCED_ACCURACY_SCORE	Training COHEN_KAPPA_SCORE	Training HAMMING_LOSS	...	Validation ACCURACY_SCORE	Validation BALANCED_ACCURACY_SCORE	Validation COHEN_KAPPA_SCORE	Validation HAMMING_LOSS	Validation MATTHEWS_CORRCOEF	Validation ZERO_ONE_LOSS	Validation ROC_AUC_SCORE	Validation LOG_LOSS	Validation BRIER_SCORE_LOSS	Validation AVERAGE_PRECISION_SCORE
0	WideDeep	TabMlp	0.694200	0.728505	0.662981	0.531628	0.859388	0.792321	0.603167	0.140612	...	0.852426	0.784474	0.584884	0.147574	0.585738	0.147574	0.908951	0.317288	0.101612	0.868420
1	AutoGluon	Random Forest	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	0.000000	...	0.853808	0.776665	0.580404	0.146192	0.583003	0.146192	0.907010	0.318016	0.100486	0.875084
2	PytorchTabular	Category Embedding	0.709806	0.738341	0.683394	0.550154	0.865479	0.803303	0.622423	0.134521	...	0.850430	0.784467	0.581612	0.149570	0.582150	0.149570	0.909318	0.316194	0.101722	0.868410

3 rows × 44 columns

Multiclass classification#

Iris is a well-known multiclass classification dataset, also loaded from the UCI repository. In the examples above, we passed the argument column_names to from_uci. If column_names is not given, the column names are taken from the UCI website (their order might be wrong, as it is for the Auto MPG dataset), and the downloaded archive is kept after from_uci returns. The archive should contain a file named xxx.name that lists the column names.

[12]:

trainer = Trainer(device=device)
cfg = UserConfig.from_uci("Iris", datafile_name="iris")
trainer.load_config(cfg)
trainer.load_data()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Random Forest"]),
]
trainer.add_modelbases(models)
trainer.train(stderr_to_stdout=True)
trainer.get_leaderboard()

Downloading https://archive.ics.uci.edu/static/public/53/iris.zip to /tmp/tmpqcxgn2l1/data/Iris.zip

/home/xlluo/hdd/tabular_ensemble/tabensemb/config/user_config.py:323: UserWarning: `column_names` is not given. The order of columns will be loaded from the website. It is highly recommended to manually set column names. The downloaded .zip is saved. Please check its .name file for the correct order.
  warnings.warn(

The project will be saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig
Dataset size: 90 30 30
Data saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-23 20:37:51,096 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-23 20:37:51,096 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task
2023-09-23 20:37:51,106 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-23 20:37:51,121 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-09-23 20:37:51,137 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type                      | Params
---------------------------------------------------------------
0 | _backbone        | CategoryEmbeddingBackbone | 11.0 K
1 | _embedding_layer | Embedding1dLayer          | 8
2 | head             | LinearHead                | 99
3 | loss             | CrossEntropyLoss          | 0
---------------------------------------------------------------
11.1 K    Trainable params
0         Non-trainable params
11.1 K    Total params
0.044     Total estimated model params size (MB)
Epoch: 1/300, Train loss: 1.7917, Val loss: 1.4287, Min val loss: 1.4287, Epoch time: 0.025s.
Epoch: 20/300, Train loss: 0.3055, Val loss: 0.6197, Min val loss: 0.6197, Epoch time: 0.019s.
Epoch: 40/300, Train loss: 0.2105, Val loss: 0.5668, Min val loss: 0.5668, Epoch time: 0.016s.
Epoch: 60/300, Train loss: 0.1510, Val loss: 0.5234, Min val loss: 0.5234, Epoch time: 0.013s.
Epoch: 80/300, Train loss: 0.1741, Val loss: 0.5314, Min val loss: 0.5216, Epoch time: 0.010s.
Epoch: 100/300, Train loss: 0.0870, Val loss: 0.4985, Min val loss: 0.4853, Epoch time: 0.013s.
Epoch: 120/300, Train loss: 0.0437, Val loss: 0.5143, Min val loss: 0.4853, Epoch time: 0.027s.
Epoch: 140/300, Train loss: 0.0248, Val loss: 0.4864, Min val loss: 0.4844, Epoch time: 0.017s.
Epoch: 160/300, Train loss: 0.0663, Val loss: 0.5182, Min val loss: 0.4506, Epoch time: 0.012s.
Epoch: 180/300, Train loss: 0.0457, Val loss: 0.5648, Min val loss: 0.4506, Epoch time: 0.018s.
Epoch: 200/300, Train loss: 0.0188, Val loss: 0.4554, Min val loss: 0.4319, Epoch time: 0.012s.
Epoch: 220/300, Train loss: 0.0421, Val loss: 0.4980, Min val loss: 0.4269, Epoch time: 0.010s.
Epoch: 240/300, Train loss: 0.0202, Val loss: 0.4948, Min val loss: 0.4187, Epoch time: 0.011s.
Epoch: 260/300, Train loss: 0.0269, Val loss: 0.4776, Min val loss: 0.4187, Epoch time: 0.035s.
Epoch: 280/300, Train loss: 0.0621, Val loss: 0.4086, Min val loss: 0.3670, Epoch time: 0.013s.
Epoch: 300/300, Train loss: 0.0098, Val loss: 0.4748, Min val loss: 0.3670, Epoch time: 0.011s.
`Trainer.fit` stopped: `max_epochs=300` reached.
2023-09-23 20:37:56,026 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-09-23 20:37:56,027 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.
  rank_zero_deprecation(
Training log_loss loss: 0.01391
Validation log_loss loss: 0.36698
Testing log_loss loss: 0.14373
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/trainer.pkl')

-------------PytorchTabular End-------------


-------------Run WideDeep-------------

Training TabMlp
Epoch: 1/300, Train loss: 1.0850, Val loss: 1.0164, Min val loss: 1.0164
Epoch: 21/300, Train loss: 0.3373, Val loss: 0.5153, Min val loss: 0.5153
Epoch: 41/300, Train loss: 0.1735, Val loss: 0.3780, Min val loss: 0.3780
Epoch: 61/300, Train loss: 0.1253, Val loss: 0.3279, Min val loss: 0.3165
Epoch: 81/300, Train loss: 0.0960, Val loss: 0.3603, Min val loss: 0.2951
Epoch: 101/300, Train loss: 0.1037, Val loss: 0.3682, Min val loss: 0.2951
Epoch: 121/300, Train loss: 0.1171, Val loss: 0.3525, Min val loss: 0.2951
Epoch: 141/300, Train loss: 0.0503, Val loss: 0.3754, Min val loss: 0.2951
Epoch: 161/300, Train loss: 0.0421, Val loss: 0.3265, Min val loss: 0.2951
Epoch 00167: early stopping
Restoring model weights from the end of the best epoch
Training log_loss loss: 0.06156
Validation log_loss loss: 0.29513
Testing log_loss loss: 0.11595
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/trainer.pkl')

-------------WideDeep End-------------


-------------Run AutoGluon-------------

Training Random Forest
Presets specified: ['best_quality']
Warning: hyperparameter tuning is currently experimental and may cause the process to hang.
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=1
Beginning AutoGluon training ...
AutoGluon will save models to "/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/AutoGluon/Random Forest/"
AutoGluon Version:  0.8.2
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Mon Dec 13 20:27:58 CST 2021
Disk Space Avail:   148.84 GB / 502.47 GB (29.6%)
Train Data Rows:    90
Train Data Columns: 4
Tuning Data Rows:    30
Tuning Data Columns: 4
Label Column: class
Preprocessing data ...
Train Data Class Count: 3
Using Feature Generators to preprocess the data ...
Fitting PipelineFeatureGenerator...
        Available Memory:                    4021.68 MB
        Train Data (Original)  Memory Usage: 0.0 MB (0.0% of available memory)
        Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
        Stage 1 Generators:
                Fitting AsTypeFeatureGenerator...
        Stage 2 Generators:
                Fitting FillNaFeatureGenerator...
        Stage 3 Generators:
                Fitting IdentityFeatureGenerator...
                        Warning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.
        Stage 4 Generators:
                Fitting DropUniqueFeatureGenerator...
        Stage 5 Generators:
                Fitting DropDuplicatesFeatureGenerator...
        Types of features in original data (raw dtype, special dtypes):
                ('float', []) : 4 | ['sepal length', 'sepal width', 'petal length', 'petal width']
        Types of features in processed data (raw dtype, special dtypes):
                ('float', []) : 4 | ['sepal length', 'sepal width', 'petal length', 'petal width']
        0.0s = Fit runtime
        4 features in original data used to generate 4 features in processed data.
        Train Data (Processed) Memory Usage: 0.0 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.06s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
        To change this, specify the eval_metric parameter of Predictor()
use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).
User-specified model hyperparameters to be fit:
{
        'RF': {},
}
Fitting 1 L1 models ...
Hyperparameter tuning model: RandomForest_BAG_L1 ...
        No hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. Will train one model based on the provided hyperparameters.
Fitted model: RandomForest_BAG_L1 ...
        1.0      = Validation score   (accuracy)
        0.57s    = Training   runtime
        0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
        0.8      = Validation score   (accuracy)
        0.0s     = Training   runtime
        0.0s     = Validation runtime
AutoGluon training complete, total runtime = 0.69s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/AutoGluon/Random Forest/")
Training log_loss loss: 0.01002
Validation log_loss loss: 0.78155
Testing log_loss loss: 0.04824
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/trainer.pkl')

-------------AutoGluon End-------------

PytorchTabular metrics
Category Embedding 1/1
WideDeep metrics
TabMlp 1/1
AutoGluon metrics
Random Forest 1/1
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig/trainer.pkl')

[12]:

	Program	Model	Training ACCURACY_SCORE	Training BALANCED_ACCURACY_SCORE	Training COHEN_KAPPA_SCORE	Training MATTHEWS_CORRCOEF	Training PRECISION_SCORE_MACRO	Training PRECISION_SCORE_MICRO	...	Validation F1_SCORE_MICRO	Validation F1_SCORE_WEIGHTED	Validation JACCARD_SCORE_MACRO	Validation JACCARD_SCORE_MICRO	Validation JACCARD_SCORE_WEIGHTED	Validation TOP_K_ACCURACY_SCORE	Validation LOG_LOSS	Validation ROC_AUC_SCORE_OVR_MACRO	Validation ROC_AUC_SCORE_OVR_WEIGHTED	Validation ROC_AUC_SCORE_OVO
0	PytorchTabular	Category Embedding	1.0	1.0	1.0	1.0	1.0	1.0	...	0.833333	0.837232	0.756944	0.714286	0.732639	1.0	0.366983	0.974891	0.971616	0.976042
1	WideDeep	TabMlp	1.0	1.0	1.0	1.0	1.0	1.0	...	0.833333	0.837232	0.756944	0.714286	0.732639	1.0	0.295129	0.979747	0.977576	0.980833
2	AutoGluon	Random Forest	1.0	1.0	1.0	1.0	1.0	1.0	...	0.800000	0.804615	0.721154	0.666667	0.689423	1.0	0.781551	0.950812	0.941465	0.951042

3 rows × 71 columns
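
The warning about column order matters because a data file without a header row gets its names purely by position. The following is a minimal sketch of that effect using only the standard library; it does not call from_uci. The two sample rows follow the comma-separated layout of the Iris data file, the helper label_rows is hypothetical, and the shuffled order is an invented example of a wrong listing.

```python
import csv
import io

# Two illustrative rows in the headerless, comma-separated layout of the Iris data file.
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# Feature names as printed in the training log above, followed by the label column.
correct = ["sepal length", "sepal width", "petal length", "petal width", "class"]
# A hypothetical wrong order, e.g. from a listing that puts the label first.
shuffled = ["class", "sepal length", "sepal width", "petal length", "petal width"]


def label_rows(text, names):
    """Attach names to headerless rows purely by position (hypothetical helper)."""
    return [dict(zip(names, row)) for row in csv.reader(io.StringIO(text))]


good = label_rows(raw, correct)
bad = label_rows(raw, shuffled)

print(good[0]["class"])  # the species string
print(bad[0]["class"])   # a measurement mislabeled as the class
```

With the wrong order, nothing fails at load time; the label column silently contains a feature value, which is why setting column_names manually is recommended.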

Using a configuration file#

In the examples above, we used UCI datasets whose configurations are generated automatically. A configuration can also be loaded from a local .py or .json file. To run a minimal example, we provide a randomly generated sample dataset (data/sample.csv) and its configuration file (configs/sample.py) in the repository. See “Dataset and configuration” for a detailed introduction to configuration files.

tabensemb uses paths relative to the current working directory, which may differ between IDEs (PyCharm, VSCode, etc.). First check the current working directory, using a magic command such as !pwd in a notebook or import os; os.getcwd() in a script, and then set the default paths accordingly.

[13]:

path = "../../../../"
tabensemb.setting["default_config_path"] = path + "configs"
tabensemb.setting["default_data_path"] = path + "data"
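
Because the relative prefix depends on where the notebook or script is started, it can help to print the working directory and resolve the prefix to absolute paths before assigning them. The sketch below uses only the standard library; resolve_default_paths is a hypothetical helper, not part of tabensemb, and the assignment to tabensemb.setting is shown only as a comment.

```python
import os


def resolve_default_paths(root):
    """Return absolute config and data paths under root (hypothetical helper)."""
    root = os.path.abspath(root)
    return {
        "default_config_path": os.path.join(root, "configs"),
        "default_data_path": os.path.join(root, "data"),
    }


print("Current working directory:", os.getcwd())

paths = resolve_default_paths("../../../../")
for key, value in paths.items():
    print(key, "->", value)
    # With tabensemb imported, one could then assign:
    # tabensemb.setting[key] = value
```

Printing the resolved paths makes it easy to spot a wrong prefix before any configuration or data file is searched for.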

Load the configuration file sample.py using Trainer.load_config, which automatically searches for the file in the current directory and in tabensemb.setting["default_config_path"].

[14]:

trainer.load_config("sample")
trainer.load_data()

The project will be saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample
Dataset size: 153 51 52
Data saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample (data.csv and tabular_data.csv).
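
The lookup by name described above can be pictured as a search over candidate directories and the two supported extensions. This is an illustrative sketch only, not the actual implementation of Trainer.load_config: the helper find_config, the search order, and the placeholder file content are all assumptions.

```python
import os
import tempfile


def find_config(name, search_dirs, extensions=(".py", ".json")):
    """Return the first existing name + extension in search_dirs, else None.

    Hypothetical helper; the real search order in tabensemb may differ.
    """
    for directory in search_dirs:
        for ext in extensions:
            candidate = os.path.join(directory, name + ext)
            if os.path.isfile(candidate):
                return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the current directory (tmp) and the default config path.
    config_dir = os.path.join(tmp, "configs")
    os.makedirs(config_dir)
    with open(os.path.join(config_dir, "sample.py"), "w") as f:
        f.write("# placeholder content, not a real tabensemb configuration\n")

    found = find_config("sample", [tmp, config_dir])
    missing = find_config("no_such_config", [tmp, config_dir])
    found_name = os.path.basename(found)

print(found_name)
print(missing)
```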

Then initialize models:

[15]:

trainer.clear_modelbase()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"])
]
trainer.add_modelbases(models)

Optional: for a quick development test, enable the following global setting, which significantly reduces training time.

[16]:

tabensemb.setting["debug_mode"] = True
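
A global flag like this stays set until it is changed back, so later benchmark runs in the same session would also be shortened. One way to limit its scope is a small context manager that restores the previous value on exit. The sketch below is an assumption-laden illustration: temporary_setting is a hypothetical helper, and a plain dict stands in for tabensemb.setting, which is assumed here to behave like a mutable mapping.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(settings, key, value):
    """Set settings[key] = value inside the block, then restore the old state."""
    missing = object()  # sentinel for "key was not present before"
    old = settings.get(key, missing)
    settings[key] = value
    try:
        yield settings
    finally:
        if old is missing:
            settings.pop(key, None)
        else:
            settings[key] = old


# A plain dict stands in for tabensemb.setting; the initial value is an assumption.
settings = {"debug_mode": False}

with temporary_setting(settings, "debug_mode", True):
    inside = settings["debug_mode"]  # quick test runs would go here

after = settings["debug_mode"]
print(inside, after)
```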

[17]:

trainer.train(stderr_to_stdout=True)
trainer.get_leaderboard()


-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-23 20:37:59,305 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-23 20:37:59,306 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-09-23 20:37:59,326 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-23 20:37:59,350 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-09-23 20:37:59,372 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type                      | Params
---------------------------------------------------------------
0 | _backbone        | CategoryEmbeddingBackbone | 12.3 K
1 | _embedding_layer | Embedding1dLayer          | 64
2 | head             | LinearHead                | 33
3 | loss             | MSELoss                   | 0
---------------------------------------------------------------
12.4 K    Trainable params
0         Non-trainable params
12.4 K    Total params
0.049     Total estimated model params size (MB)
Epoch: 1/2, Train loss: 33183.6562, Val loss: 22223.0391, Min val loss: 22223.0391, Epoch time: 0.014s.
`Trainer.fit` stopped: `max_epochs=2` reached.
2023-09-23 20:37:59,437 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-09-23 20:37:59,438 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.
  rank_zero_deprecation(
Training mse loss: 33085.08333
Validation mse loss: 22182.62019
Testing mse loss: 29810.51082
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample/trainer.pkl')

-------------PytorchTabular End-------------

PytorchTabular metrics
Category Embedding 1/1
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample/trainer.pkl')

[17]:

	Program	Model	Training RMSE	Training MSE	Training MAE	Training MAPE	Training R2	Training MEDIAN_ABSOLUTE_ERROR	Training EXPLAINED_VARIANCE_SCORE	Testing RMSE	...	Testing R2	Testing MEDIAN_ABSOLUTE_ERROR	Testing EXPLAINED_VARIANCE_SCORE	Validation RMSE	Validation MSE	Validation MAE	Validation MAPE	Validation R2	Validation MEDIAN_ABSOLUTE_ERROR	Validation EXPLAINED_VARIANCE_SCORE
0	PytorchTabular	Category Embedding	181.893055	33085.083331	145.409738	1.055588	-0.004359	121.139843	0.001236	172.657206	...	-0.005851	118.665751	-0.001657	148.93831	22182.620185	121.146176	1.00906	-0.001817	92.916794	0.001214

1 rows × 23 columns

Clean up the temporary directory used by this notebook.

[18]:

temp_path.cleanup()
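
In a script, as opposed to a notebook with separate cells, the explicit cleanup call can be replaced by using TemporaryDirectory as a context manager, which removes the directory when the block exits. A minimal sketch with the standard library only; the output subdirectory mirrors the layout used at the start of this notebook.

```python
import os
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp_dir:
    # Inside the block the directory exists and can hold outputs.
    output_path = os.path.join(tmp_dir, "output")
    os.makedirs(output_path)
    existed_inside = os.path.isdir(output_path)

# After the block the directory and its contents are removed.
removed_after = not os.path.exists(tmp_dir)
print(existed_inside, removed_after)
```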