Inference on an upcoming dataset

In this part, we simulate a real deployment of the package and run inference on an upcoming dataset. We use the Adult dataset from the UCI datasets, which comes with a separate testing set.

Training models

As in the first example, we initialize a Trainer and the model bases, then train all models.

[1]:

import torch
from tabensemb.trainer import Trainer
from tabensemb.model import *
import tabensemb
from tabensemb.config import UserConfig
import os
from tempfile import TemporaryDirectory

temp_path = TemporaryDirectory()
tabensemb.setting["default_output_path"] = os.path.join(temp_path.name, "output")
tabensemb.setting["default_config_path"] = os.path.join(temp_path.name, "configs")
tabensemb.setting["default_data_path"] = os.path.join(temp_path.name, "data")

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

trainer = Trainer(device=device)
adult_columns = [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income",
]
cfg = UserConfig.from_uci("Adult", column_names=adult_columns, sep=", ")
trainer.load_config(cfg)
trainer.load_data()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Random Forest"]),
]
trainer.add_modelbases(models)
trainer.train(stderr_to_stdout=True)
trainer.get_leaderboard()

Using cuda device
Downloading https://archive.ics.uci.edu/static/public/2/adult.zip to /tmp/tmpkc_l71f1/data/Adult.zip

/home/xlluo/hdd/tabular_ensemble/tabensemb/config/user_config.py:292: UserWarning: There exists .test file(s) ['adult.test'] which should be used for final metrics. The .zip file is left for the user to process.
  warnings.warn(
/home/xlluo/hdd/tabular_ensemble/tabensemb/utils/utils.py:464: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  df = pd.read_csv(StringIO(s), names=names, sep=sep)

age is Integer and will be treated as a continuous feature.
fnlwgt is Integer and will be treated as a continuous feature.
education-num is Integer and will be treated as a continuous feature.
capital-gain is Integer and will be treated as a continuous feature.
capital-loss is Integer and will be treated as a continuous feature.
hours-per-week is Integer and will be treated as a continuous feature.
The project will be saved to /tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig
Dataset size: 19536 6512 6513
Data saved to /tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-23 20:35:00,222 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-23 20:35:00,223 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task
2023-09-23 20:35:00,288 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-23 20:35:00,307 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-09-23 20:35:01,129 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type                      | Params
---------------------------------------------------------------
0 | _backbone        | CategoryEmbeddingBackbone | 18.5 K
1 | _embedding_layer | Embedding1dLayer          | 1.4 K
2 | head             | LinearHead                | 66
3 | loss             | CrossEntropyLoss          | 0
---------------------------------------------------------------
20.0 K    Trainable params
0         Non-trainable params
20.0 K    Total params
0.080     Total estimated model params size (MB)
Epoch: 1/300, Train loss: 0.4666, Val loss: 0.3794, Min val loss: 0.3794, Epoch time: 0.376s.
Epoch: 20/300, Train loss: 0.3112, Val loss: 0.3184, Min val loss: 0.3177, Epoch time: 0.406s.
Epoch: 40/300, Train loss: 0.2972, Val loss: 0.3213, Min val loss: 0.3162, Epoch time: 0.326s.
Epoch: 60/300, Train loss: 0.2864, Val loss: 0.3248, Min val loss: 0.3162, Epoch time: 0.330s.
Epoch: 80/300, Train loss: 0.2766, Val loss: 0.3323, Min val loss: 0.3162, Epoch time: 0.386s.
Epoch: 100/300, Train loss: 0.2677, Val loss: 0.3416, Min val loss: 0.3162, Epoch time: 0.540s.
Epoch: 120/300, Train loss: 0.2602, Val loss: 0.3484, Min val loss: 0.3162, Epoch time: 0.357s.
2023-09-23 20:35:50,914 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-09-23 20:35:50,915 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.
  rank_zero_deprecation(
Training log_loss loss: 0.28862
Validation log_loss loss: 0.31619
Testing log_loss loss: 0.31114
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')

-------------PytorchTabular End-------------


-------------Run WideDeep-------------

Training TabMlp
Epoch: 1/300, Train loss: 0.5206, Val loss: 0.4274, Min val loss: 0.4274
Epoch: 21/300, Train loss: 0.3145, Val loss: 0.3161, Min val loss: 0.3161
Epoch: 41/300, Train loss: 0.3063, Val loss: 0.3192, Min val loss: 0.3161
Epoch: 61/300, Train loss: 0.2984, Val loss: 0.3209, Min val loss: 0.3161
Epoch: 81/300, Train loss: 0.2905, Val loss: 0.3227, Min val loss: 0.3161
Epoch: 101/300, Train loss: 0.2819, Val loss: 0.3266, Min val loss: 0.3161
Epoch: 121/300, Train loss: 0.2754, Val loss: 0.3293, Min val loss: 0.3161
Epoch 00121: early stopping
Restoring model weights from the end of the best epoch
Training log_loss loss: 0.30084
Validation log_loss loss: 0.31729
Testing log_loss loss: 0.31133
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')

-------------WideDeep End-------------


-------------Run AutoGluon-------------

Training Random Forest
Presets specified: ['best_quality']
Warning: hyperparameter tuning is currently experimental and may cause the process to hang.
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1
Beginning AutoGluon training ...
AutoGluon will save models to "/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/AutoGluon/Random Forest/"
AutoGluon Version:  0.8.2
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Mon Dec 13 20:27:58 CST 2021
Disk Space Avail:   149.26 GB / 502.47 GB (29.7%)
Train Data Rows:    19536
Train Data Columns: 14
Tuning Data Rows:    6512
Tuning Data Columns: 14
Label Column: income
Preprocessing data ...
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting PipelineFeatureGenerator...
        Available Memory:                    12442.89 MB
        Train Data (Original)  Memory Usage: 15.07 MB (0.1% of available memory)
        Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
        Stage 1 Generators:
                Fitting AsTypeFeatureGenerator...
                        Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
        Stage 2 Generators:
                Fitting FillNaFeatureGenerator...
        Stage 3 Generators:
                Fitting IdentityFeatureGenerator...
                        Warning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.
                Fitting CategoryFeatureGenerator...
                        Warning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.
                        Fitting CategoryMemoryMinimizeFeatureGenerator...
        Stage 4 Generators:
                Fitting DropUniqueFeatureGenerator...
        Stage 5 Generators:
                Fitting DropDuplicatesFeatureGenerator...
        Types of features in original data (raw dtype, special dtypes):
                ('int', [])    : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
                ('object', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]
        Types of features in processed data (raw dtype, special dtypes):
                ('category', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]
                ('int', [])      : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
        0.1s = Fit runtime
        14 features in original data used to generate 14 features in processed data.
        Train Data (Processed) Memory Usage: 1.46 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.15s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
        To change this, specify the eval_metric parameter of Predictor()
use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).
User-specified model hyperparameters to be fit:
{
        'RF': {},
}
Fitting 1 L1 models ...
Hyperparameter tuning model: RandomForest_BAG_L1 ...
        No hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. Will train one model based on the provided hyperparameters.
Fitted model: RandomForest_BAG_L1 ...
        0.8543   = Validation score   (accuracy)
        1.19s    = Training   runtime
        0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
        0.8538   = Validation score   (accuracy)
        0.0s     = Training   runtime
        0.01s    = Validation runtime
AutoGluon training complete, total runtime = 1.54s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/AutoGluon/Random Forest/")
Training log_loss loss: 0.08021
Validation log_loss loss: 0.31802
Testing log_loss loss: 0.31261
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')

-------------AutoGluon End-------------

PytorchTabular metrics
Category Embedding 1/1
WideDeep metrics
TabMlp 1/1
AutoGluon metrics
Random Forest 1/1
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')

[1]:

	Program	Model	Training F1_SCORE	Training PRECISION_SCORE	Training RECALL_SCORE	Training JACCARD_SCORE	Training ACCURACY_SCORE	Training BALANCED_ACCURACY_SCORE	Training COHEN_KAPPA_SCORE	Training HAMMING_LOSS	...	Validation ACCURACY_SCORE	Validation BALANCED_ACCURACY_SCORE	Validation COHEN_KAPPA_SCORE	Validation HAMMING_LOSS	Validation MATTHEWS_CORRCOEF	Validation ZERO_ONE_LOSS	Validation ROC_AUC_SCORE	Validation LOG_LOSS	Validation BRIER_SCORE_LOSS	Validation AVERAGE_PRECISION_SCORE
0	WideDeep	TabMlp	0.694200	0.728505	0.662981	0.531628	0.859388	0.792321	0.603167	0.140612	...	0.852426	0.784474	0.584884	0.147574	0.585738	0.147574	0.908951	0.317288	0.101612	0.868420
1	AutoGluon	Random Forest	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	0.000000	...	0.853808	0.776665	0.580404	0.146192	0.583003	0.146192	0.907010	0.318016	0.100486	0.875084
2	PytorchTabular	Category Embedding	0.709806	0.738341	0.683394	0.550154	0.865479	0.803303	0.622423	0.134521	...	0.850430	0.784467	0.581612	0.149570	0.582150	0.149570	0.909318	0.316194	0.101722	0.868410

3 rows × 44 columns
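
As a minimal sketch of how the leaderboard can inform model selection, the snippet below ranks the three models by a single validation column. The numbers are copied from the table above; the `rank` helper and the plain list-of-dicts layout are illustrative assumptions, not part of tabensemb.

```python
# Validation metrics copied from the leaderboard above (two columns only).
leaderboard = [
    {"Program": "WideDeep", "Model": "TabMlp",
     "Validation ACCURACY_SCORE": 0.852426, "Validation LOG_LOSS": 0.317288},
    {"Program": "AutoGluon", "Model": "Random Forest",
     "Validation ACCURACY_SCORE": 0.853808, "Validation LOG_LOSS": 0.318016},
    {"Program": "PytorchTabular", "Model": "Category Embedding",
     "Validation ACCURACY_SCORE": 0.850430, "Validation LOG_LOSS": 0.316194},
]


def rank(rows, column, higher_is_better):
    """Hypothetical helper: sort leaderboard rows by one metric column."""
    return sorted(rows, key=lambda row: row[column], reverse=higher_is_better)


best_by_accuracy = rank(leaderboard, "Validation ACCURACY_SCORE", True)[0]
best_by_log_loss = rank(leaderboard, "Validation LOG_LOSS", False)[0]

print(best_by_accuracy["Model"])  # Random Forest
print(best_by_log_loss["Model"])  # Category Embedding
```

The two criteria pick different models here, so the choice depends on which metric matters for the deployment.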

Selecting and storing a model

From the leaderboard, we can compare the models and select one for deployment. Suppose we choose Random Forest from AutoGluon. We detach that model from the heavy trainer, and the detached trainer is stored locally in a separate directory.

[2]:

trainer_of_one_model = trainer.detach_model(program="AutoGluon", model_name="Random Forest")

Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig-I1/trainer.pkl')

The detached trainer now has only one model base.

[3]:

# Model bases of the detached trainer
trainer_of_one_model.modelbases

[3]:

[<tabensemb.model.autogluon.AutoGluon at 0x7f581fc3bd00>]

[4]:

# The model in the model base
trainer_of_one_model.get_modelbase("AutoGluon_Random Forest").model["Random Forest"]

[4]:

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f5644bdbfa0>

Loading the model

The Trainer containing a single model is now stored in a separate directory. Suppose we want to load it in a separate script for inference. In the following cell, the path argument of load_trainer is the path to trainer.pkl, which is printed when detaching the model or training the model bases. Here we build it from the project directory of the detached trainer, trainer_of_one_model.project_root.

Remark: You can move the directory anywhere (including to another device, provided the package version and the environment are consistent) and rename the folder. tabensemb automatically reconfigures the paths.

[5]:

from tabensemb.trainer import load_trainer

trainer = load_trainer(path=os.path.join(trainer_of_one_model.project_root, "trainer.pkl"))

[6]:

trainer.get_modelbase("AutoGluon_Random Forest").model["Random Forest"]

[6]:

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f5644bd8130>

Inference

Assume that we have a new DataFrame representing an upcoming dataset. For demonstration, we use the testing set here. The classification target in trainer.df is already ordinal-encoded by trainer.datamodule.label_ordinal_encoder.

[7]:

df = trainer.df.loc[trainer.test_indices, :]
truth = trainer.df.loc[trainer.test_indices, trainer.label_name].values.flatten()
truth

[7]:

array([0, 0, 1, ..., 1, 0, 0])
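
To illustrate what ordinal encoding of the target means, here is a minimal standard-library sketch. It assumes the categories are numbered in sorted order, which is the default behaviour of scikit-learn's OrdinalEncoder; whether tabensemb's encoder is configured the same way is an assumption, and `fit_ordinal_mapping` is a hypothetical helper.

```python
def fit_ordinal_mapping(labels):
    """Map each distinct label to an integer code, in sorted order."""
    categories = sorted(set(labels))
    return {category: code for code, category in enumerate(categories)}


labels = ["<=50K", "<=50K", ">50K", ">50K", "<=50K"]
mapping = fit_ordinal_mapping(labels)
inverse = {code: category for category, code in mapping.items()}

encoded = [mapping[label] for label in labels]   # string labels -> integer codes
decoded = [inverse[code] for code in encoded]    # integer codes -> string labels

print(mapping)  # {'<=50K': 0, '>50K': 1}
print(encoded)  # [0, 0, 1, 1, 0]
```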

Use the predict method of the model base to run inference. The returned result is an array of the original (decoded) class labels.

[8]:

import pandas as pd

result = trainer.get_modelbase("AutoGluon_Random Forest").predict(df, model_name="Random Forest")
result

[8]:

array([['<=50K'],
       ['<=50K'],
       ['>50K'],
       ...,
       ['>50K'],
       ['<=50K'],
       ['<=50K']], dtype=object)

The F1 score on the “new” (testing) dataset matches the one in the leaderboard above. The result must be ordinal-encoded before metrics can be calculated. We provide DataModule.label_categories_transform for this (and DataModule.label_categories_inverse_transform for the inverse transform).

auto_metric_sklearn automatically calculates different kinds of sklearn.metrics, which is especially useful for classification tasks.

[9]:

from tabensemb.utils import auto_metric_sklearn

encoded_result = trainer.datamodule.label_categories_transform(pd.DataFrame(result, columns=trainer.label_name)).values
# You can also use trainer.datamodule.label_ordinal_encoder.transform(result) to get the same result.
auto_metric_sklearn(truth, encoded_result, "f1_score", "binary"), trainer.leaderboard.loc[trainer.leaderboard["Model"]=="Random Forest", "Testing F1_SCORE"].values

[9]:

(0.6936416184971098, array([0.69364162]))
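
For reference, the binary F1 score is the harmonic mean of precision and recall for the positive class. The sketch below implements that standard definition with the standard library only; it is not the implementation behind auto_metric_sklearn, and the six labels are made-up illustration data.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 score of the positive class from two sequences of integer codes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical encoded labels (0 = "<=50K", 1 = ">50K").
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
score = binary_f1(y_true, y_pred)  # precision = recall = 2/3
```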

Inference on the individual testing set

When loading from UCI datasets, UserConfig.from_uci detects that a separate testing file exists (see the UserWarning in the training output above), so the downloaded .zip file is not removed. We can open the archive with zipfile.

[10]:

import zipfile

zipf = zipfile.ZipFile(os.path.join(tabensemb.setting["default_data_path"], "Adult.zip"))
zipf.namelist()

[10]:

['Index', 'adult.data', 'adult.names', 'adult.test', 'old.adult.names']
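
The same zipfile calls can be tried without downloading anything. The sketch below builds a small in-memory archive whose member contents are shortened, hypothetical stand-ins for the real files, then lists and reads it inside a `with` block so the handle is closed afterwards.

```python
import io
import zipfile

# Build a tiny archive in memory; the rows are shortened placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("adult.data", "39, State-gov, <=50K\n")
    archive.writestr("adult.test", "|1x3 Cross validator\n25, Private, <=50K.\n")

# Reopen it for reading, list the members, and decode one of them.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    test_text = archive.read("adult.test").decode()

print(names)  # ['adult.data', 'adult.test']
```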

Now check the content of adult.test. It is a .csv-like file with the same layout as adult.data, except that it has an additional line at the top and an additional “.” at the end of each line.

[11]:

file = zipf.read("adult.test").decode()
print(file[:500])

|1x3 Cross validator
25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.
38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.
28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.
44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0
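
The two differences noted above can also be removed line by line. This is a minimal sketch using only the standard library; `clean_adult_test` is a hypothetical helper, and it assumes that every line starting with "|" is a comment and that a trailing "." is never part of a value. Unlike a plain string replacement on ".\n", it does not depend on the file ending with a newline.

```python
def clean_adult_test(text):
    """Drop '|' comment lines and the trailing '.' of each record."""
    cleaned = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("|"):
            continue  # skip blank lines and the leading comment line
        if line.endswith("."):
            line = line[:-1]
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


sample = (
    "|1x3 Cross validator\n"
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, "
    "Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n"
    "28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, "
    "Husband, White, Male, 0, 0, 40, United-States, >50K."
)
cleaned_sample = clean_adult_test(sample)
```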

We provide the str_to_dataframe function, which converts the string to a DataFrame and replaces illegal values with np.nan.

[12]:

from tabensemb.utils import str_to_dataframe

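# Drop the extra first line and the trailing "." of each record so the text matches the adult.data layout.
file = file.replace("|1x3 Cross validator\n","").replace(".\n", "\n")
# Parse the ", "-separated text; NaN checking is applied to the continuous features.
testing_df = str_to_dataframe(file, sep=", ", names=trainer.df.columns, check_nan_on=trainer.cont_feature_names)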
testing_df

/home/xlluo/hdd/tabular_ensemble/tabensemb/utils/utils.py:464: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  df = pd.read_csv(StringIO(s), names=names, sep=sep)

[12]:

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	income
0	25	Private	226802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States	<=50K
1	38	Private	89814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States	<=50K
2	28	Local-gov	336951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States	>50K
3	44	Private	160323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7688	0	40	United-States	>50K
4	18	?	103497	Some-college	10	Never-married	?	Own-child	White	Female	0	0	30	United-States	<=50K
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
16276	39	Private	215419	Bachelors	13	Divorced	Prof-specialty	Not-in-family	White	Female	0	0	36	United-States	<=50K
16277	64	?	321403	HS-grad	9	Widowed	?	Other-relative	Black	Male	0	0	40	United-States	<=50K
16278	38	Private	374983	Bachelors	13	Married-civ-spouse	Prof-specialty	Husband	White	Male	0	0	50	United-States	<=50K
16279	44	Private	83891	Bachelors	13	Divorced	Adm-clerical	Own-child	Asian-Pac-Islander	Male	5455	0	40	United-States	<=50K
16280	35	Self-emp-inc	182148	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	60	United-States	>50K

16281 rows × 15 columns
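
The general idea of parsing ", "-separated text and marking unusable numeric values can be sketched with the standard library. This is not how str_to_dataframe is implemented; `parse_rows` is a hypothetical helper, the column subset is made up, and "illegal" is assumed to mean a value in a continuous column that cannot be parsed as a number. Categorical values such as "?" are left untouched, as in the table above.

```python
import csv
import io
import math


def parse_rows(text, names, continuous):
    """Parse comma-separated text into dicts, using NaN for bad numeric values."""
    # skipinitialspace=True handles the space after each comma without a regex.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for fields in reader:
        if not fields:
            continue  # ignore empty lines
        row = dict(zip(names, fields))
        for name in continuous:
            try:
                row[name] = float(row[name])
            except ValueError:
                row[name] = math.nan
        rows.append(row)
    return rows


# Hypothetical four-column subset with one unparseable numeric value.
names = ["age", "workclass", "hours-per-week", "income"]
text = "25, Private, 40, <=50K\n18, ?, abc, >50K\n"
rows = parse_rows(text, names, continuous=["age", "hours-per-week"])
```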

Inference works the same way as before.

[13]:

result = trainer.get_modelbase("AutoGluon_Random Forest").predict(testing_df, model_name="Random Forest")
result

[13]:

array([['<=50K'],
       ['<=50K'],
       ['<=50K'],
       ...,
       ['>50K'],
       ['<=50K'],
       ['>50K']], dtype=object)

Both the truth and the result need to be ordinal-encoded to calculate metrics.

[14]:

encoded_truth = trainer.datamodule.label_categories_transform(testing_df[trainer.label_name]).values
encoded_result = trainer.datamodule.label_categories_transform(pd.DataFrame(result, columns=trainer.label_name)).values
auto_metric_sklearn(encoded_truth, encoded_result, "f1_score", "binary"), auto_metric_sklearn(encoded_truth, encoded_result, "roc_auc_score", "binary")

[14]:

(0.6690190543401553, 0.7731706276694975)
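
One point worth keeping in mind when reading the second number: encoded_result holds hard class labels rather than predicted probabilities. For 0/1 predictions, the ROC AUC reduces to the balanced accuracy, (TPR + TNR) / 2, so it is generally lower than an AUC computed from probability scores. This presumably explains why it differs from the ROC AUC values on the leaderboard, although how the leaderboard computes them is an assumption here. The sketch below checks the identity with a pairwise (Mann-Whitney) AUC; all labels and scores are made-up illustration data.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    pairs = list(zip(y_true, y_pred))
    tpr = sum(1 for t, p in pairs if t == 1 and p == 1) / sum(1 for t in y_true if t == 1)
    tnr = sum(1 for t, p in pairs if t == 0 and p == 0) / sum(1 for t in y_true if t == 0)
    return (tpr + tnr) / 2


y_true = [0, 0, 0, 0, 1, 1, 1, 1]
hard_labels = [0, 0, 0, 1, 1, 1, 0, 0]                      # thresholded predictions
scores = [0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.4, 0.35]          # hypothetical probabilities

auc_from_labels = roc_auc(y_true, hard_labels)
auc_from_scores = roc_auc(y_true, scores)
print(auc_from_labels, balanced_accuracy(y_true, hard_labels))  # 0.625 0.625
print(auc_from_scores)  # 0.875
```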