Inference on an upcoming dataset#
In this part, we simulate a real deployment of the package and run inference on an upcoming dataset. We use the Adult dataset from the UCI repository, which ships with a separate testing set.
Training models#
Similar to the first example, we initialize a Trainer and model bases, then train all models.
[1]:
import torch
from tabensemb.trainer import Trainer
from tabensemb.model import *
import tabensemb
from tabensemb.config import UserConfig
import os
from tempfile import TemporaryDirectory
temp_path = TemporaryDirectory()
tabensemb.setting["default_output_path"] = os.path.join(temp_path.name, "output")
tabensemb.setting["default_config_path"] = os.path.join(temp_path.name, "configs")
tabensemb.setting["default_data_path"] = os.path.join(temp_path.name, "data")
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))
trainer = Trainer(device=device)
adult_columns = [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income",
]
cfg = UserConfig.from_uci("Adult", column_names=adult_columns, sep=", ")
trainer.load_config(cfg)
trainer.load_data()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Random Forest"]),
]
trainer.add_modelbases(models)
trainer.train(stderr_to_stdout=True)
trainer.get_leaderboard()
Using cuda device
Downloading https://archive.ics.uci.edu/static/public/2/adult.zip to /tmp/tmpkc_l71f1/data/Adult.zip
/home/xlluo/hdd/tabular_ensemble/tabensemb/config/user_config.py:292: UserWarning: There exists .test file(s) ['adult.test'] which should be used for final metrics. The .zip file is left for the user to process.
warnings.warn(
/home/xlluo/hdd/tabular_ensemble/tabensemb/utils/utils.py:464: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
df = pd.read_csv(StringIO(s), names=names, sep=sep)
age is Integer and will be treated as a continuous feature.
fnlwgt is Integer and will be treated as a continuous feature.
education-num is Integer and will be treated as a continuous feature.
capital-gain is Integer and will be treated as a continuous feature.
capital-loss is Integer and will be treated as a continuous feature.
hours-per-week is Integer and will be treated as a continuous feature.
The project will be saved to /tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig
Dataset size: 19536 6512 6513
Data saved to /tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig (data.csv and tabular_data.csv).
-------------Run PytorchTabular-------------
Training Category Embedding
Global seed set to 42
2023-09-23 20:35:00,222 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-23 20:35:00,223 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task
2023-09-23 20:35:00,288 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-23 20:35:00,307 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.
rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-09-23 20:35:01,129 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
---------------------------------------------------------------
0 | _backbone | CategoryEmbeddingBackbone | 18.5 K
1 | _embedding_layer | Embedding1dLayer | 1.4 K
2 | head | LinearHead | 66
3 | loss | CrossEntropyLoss | 0
---------------------------------------------------------------
20.0 K Trainable params
0 Non-trainable params
20.0 K Total params
0.080 Total estimated model params size (MB)
Epoch: 1/300, Train loss: 0.4666, Val loss: 0.3794, Min val loss: 0.3794, Epoch time: 0.376s.
Epoch: 20/300, Train loss: 0.3112, Val loss: 0.3184, Min val loss: 0.3177, Epoch time: 0.406s.
Epoch: 40/300, Train loss: 0.2972, Val loss: 0.3213, Min val loss: 0.3162, Epoch time: 0.326s.
Epoch: 60/300, Train loss: 0.2864, Val loss: 0.3248, Min val loss: 0.3162, Epoch time: 0.330s.
Epoch: 80/300, Train loss: 0.2766, Val loss: 0.3323, Min val loss: 0.3162, Epoch time: 0.386s.
Epoch: 100/300, Train loss: 0.2677, Val loss: 0.3416, Min val loss: 0.3162, Epoch time: 0.540s.
Epoch: 120/300, Train loss: 0.2602, Val loss: 0.3484, Min val loss: 0.3162, Epoch time: 0.357s.
2023-09-23 20:35:50,914 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-09-23 20:35:50,915 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.
rank_zero_deprecation(
Training log_loss loss: 0.28862
Validation log_loss loss: 0.31619
Testing log_loss loss: 0.31114
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')
-------------PytorchTabular End-------------
-------------Run WideDeep-------------
Training TabMlp
Epoch: 1/300, Train loss: 0.5206, Val loss: 0.4274, Min val loss: 0.4274
Epoch: 21/300, Train loss: 0.3145, Val loss: 0.3161, Min val loss: 0.3161
Epoch: 41/300, Train loss: 0.3063, Val loss: 0.3192, Min val loss: 0.3161
Epoch: 61/300, Train loss: 0.2984, Val loss: 0.3209, Min val loss: 0.3161
Epoch: 81/300, Train loss: 0.2905, Val loss: 0.3227, Min val loss: 0.3161
Epoch: 101/300, Train loss: 0.2819, Val loss: 0.3266, Min val loss: 0.3161
Epoch: 121/300, Train loss: 0.2754, Val loss: 0.3293, Min val loss: 0.3161
Epoch 00121: early stopping
Restoring model weights from the end of the best epoch
Training log_loss loss: 0.30084
Validation log_loss loss: 0.31729
Testing log_loss loss: 0.31133
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')
-------------WideDeep End-------------
-------------Run AutoGluon-------------
Training Random Forest
Presets specified: ['best_quality']
Warning: hyperparameter tuning is currently experimental and may cause the process to hang.
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1
Beginning AutoGluon training ...
AutoGluon will save models to "/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/AutoGluon/Random Forest/"
AutoGluon Version: 0.8.2
Python Version: 3.10.12
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Mon Dec 13 20:27:58 CST 2021
Disk Space Avail: 149.26 GB / 502.47 GB (29.7%)
Train Data Rows: 19536
Train Data Columns: 14
Tuning Data Rows: 6512
Tuning Data Columns: 14
Label Column: income
Preprocessing data ...
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting PipelineFeatureGenerator...
Available Memory: 12442.89 MB
Train Data (Original) Memory Usage: 15.07 MB (0.1% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Warning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.
Fitting CategoryFeatureGenerator...
Warning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.
Fitting CategoryMemoryMinimizeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
0.1s = Fit runtime
14 features in original data used to generate 14 features in processed data.
Train Data (Processed) Memory Usage: 1.46 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.15s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).
User-specified model hyperparameters to be fit:
{
'RF': {},
}
Fitting 1 L1 models ...
Hyperparameter tuning model: RandomForest_BAG_L1 ...
No hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. Will train one model based on the provided hyperparameters.
Fitted model: RandomForest_BAG_L1 ...
0.8543 = Validation score (accuracy)
1.19s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
0.8538 = Validation score (accuracy)
0.0s = Training runtime
0.01s = Validation runtime
AutoGluon training complete, total runtime = 1.54s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/AutoGluon/Random Forest/")
Training log_loss loss: 0.08021
Validation log_loss loss: 0.31802
Testing log_loss loss: 0.31261
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')
-------------AutoGluon End-------------
PytorchTabular metrics
Category Embedding 1/1
WideDeep metrics
TabMlp 1/1
AutoGluon metrics
Random Forest 1/1
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')
[1]:
| | Program | Model | Training F1_SCORE | Training PRECISION_SCORE | Training RECALL_SCORE | Training JACCARD_SCORE | Training ACCURACY_SCORE | Training BALANCED_ACCURACY_SCORE | Training COHEN_KAPPA_SCORE | Training HAMMING_LOSS | ... | Validation ACCURACY_SCORE | Validation BALANCED_ACCURACY_SCORE | Validation COHEN_KAPPA_SCORE | Validation HAMMING_LOSS | Validation MATTHEWS_CORRCOEF | Validation ZERO_ONE_LOSS | Validation ROC_AUC_SCORE | Validation LOG_LOSS | Validation BRIER_SCORE_LOSS | Validation AVERAGE_PRECISION_SCORE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WideDeep | TabMlp | 0.694200 | 0.728505 | 0.662981 | 0.531628 | 0.859388 | 0.792321 | 0.603167 | 0.140612 | ... | 0.852426 | 0.784474 | 0.584884 | 0.147574 | 0.585738 | 0.147574 | 0.908951 | 0.317288 | 0.101612 | 0.868420 |
| 1 | AutoGluon | Random Forest | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | ... | 0.853808 | 0.776665 | 0.580404 | 0.146192 | 0.583003 | 0.146192 | 0.907010 | 0.318016 | 0.100486 | 0.875084 |
| 2 | PytorchTabular | Category Embedding | 0.709806 | 0.738341 | 0.683394 | 0.550154 | 0.865479 | 0.803303 | 0.622423 | 0.134521 | ... | 0.850430 | 0.784467 | 0.581612 | 0.149570 | 0.582150 | 0.149570 | 0.909318 | 0.316194 | 0.101722 | 0.868410 |
3 rows × 44 columns
Selecting and storing a model#
From the leaderboard, we can check the performance of each model and select one for deployment. Suppose we choose Random Forest from AutoGluon. We detach this model from the heavy trainer, and the resulting single-model trainer is stored locally in a separate directory.
[2]:
trainer_of_one_model = trainer.detach_model(program="AutoGluon", model_name="Random Forest")
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig-I1/trainer.pkl')
The detached trainer now has only one model base.
[3]:
# Model bases of the detached trainer
trainer_of_one_model.modelbases
[3]:
[<tabensemb.model.autogluon.AutoGluon at 0x7f581fc3bd00>]
[4]:
# The model in the model base
trainer_of_one_model.get_modelbase("AutoGluon_Random Forest").model["Random Forest"]
[4]:
<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f5644bdbfa0>
Loading the model#
Now the Trainer containing a single model is stored in a separate directory. Assume that we want to load it in a separate script for inference. The path argument of load_trainer is the path to trainer.pkl, which is printed whenever a model is detached or the model bases are trained. Here we simply use the project directory of the detached trainer, trainer_of_one_model.
Remark: You can move the directory anywhere else (or to another machine, provided the package version and the environment are consistent) and rename the folder. tabensemb automatically reconfigures the path.
[5]:
from tabensemb.trainer import load_trainer
trainer = load_trainer(path=os.path.join(trainer_of_one_model.project_root, "trainer.pkl"))
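The remark above about relocating the project directory can be sketched with the standard library. This is only an illustration: `relocate_project` is a hypothetical helper, not part of tabensemb, and the demo uses a dummy directory with a placeholder `trainer.pkl` so that it runs without any trained model.

```python
import os
import shutil
from tempfile import TemporaryDirectory


def relocate_project(src_dir: str, dst_dir: str) -> str:
    """Copy a detached project directory and return the new trainer.pkl path.

    Hypothetical helper for illustration; it is not part of tabensemb.
    """
    shutil.copytree(src_dir, dst_dir)  # dst_dir must not exist yet
    pkl_path = os.path.join(dst_dir, "trainer.pkl")
    if not os.path.isfile(pkl_path):
        raise FileNotFoundError(f"No trainer.pkl found in {dst_dir}")
    return pkl_path


# Stand-in demo with a dummy directory so the sketch runs without tabensemb.
with TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "detached_project")
    os.makedirs(src)
    with open(os.path.join(src, "trainer.pkl"), "wb") as f:
        f.write(b"placeholder")
    new_pkl = relocate_project(src, os.path.join(tmp, "renamed_project"))
    relocated_ok = os.path.isfile(new_pkl)

# In a real session the returned path would be passed on, e.g.
# trainer = load_trainer(path=new_pkl)
```

In a real session, `src_dir` would be the project directory of the detached trainer and the returned path would go to load_trainer.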
[6]:
trainer.get_modelbase("AutoGluon_Random Forest").model["Random Forest"]
[6]:
<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f5644bd8130>
Inference#
Assume that we have a new DataFrame representing an upcoming dataset. For demonstration, we use the testing set here. In trainer.df, the classification target is already ordinal-encoded by trainer.datamodule.label_ordinal_encoder.
[7]:
df = trainer.df.loc[trainer.test_indices, :]
truth = trainer.df.loc[trainer.test_indices, trainer.label_name].values.flatten()
truth
[7]:
array([0, 0, 1, ..., 1, 0, 0])
Use the predict method of the model base to run inference. The returned result contains the original string labels rather than the encoded integers.
[8]:
import pandas as pd
result = trainer.get_modelbase("AutoGluon_Random Forest").predict(df, model_name="Random Forest")
result
[8]:
array([['<=50K'],
['<=50K'],
['>50K'],
...,
['>50K'],
['<=50K'],
['<=50K']], dtype=object)
The predictions must be ordinal-encoded before metrics can be calculated. We provide DataModule.label_categories_transform for this (and DataModule.label_categories_inverse_transform for the inverse transform). As shown below, the F1 score on the “new” (testing) dataset is the same as that in the leaderboard above.
auto_metric_sklearn calculates a variety of sklearn.metrics by name, which is especially useful for classification tasks.
[9]:
from tabensemb.utils import auto_metric_sklearn
encoded_result = trainer.datamodule.label_categories_transform(pd.DataFrame(result, columns=trainer.label_name)).values
# You can also use trainer.datamodule.label_ordinal_encoder.transform(result) to get the same result.
auto_metric_sklearn(truth, encoded_result, "f1_score", "binary"), trainer.leaderboard.loc[trainer.leaderboard["Model"]=="Random Forest", "Testing F1_SCORE"].values
[9]:
(0.6936416184971098, array([0.69364162]))
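To make the encoding step concrete, the following sketch mimics it with plain Python on toy data. The mapping of `<=50K` to 0 and `>50K` to 1 is an assumption inferred from the outputs above, not read from tabensemb, and `binary_f1` is a hypothetical helper rather than a library function.

```python
# Assumed label mapping, inferred from the outputs above (not read from tabensemb).
label_to_code = {"<=50K": 0, ">50K": 1}


def binary_f1(y_true, y_pred, positive=1):
    """F1 score for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: string predictions must be encoded before they can be
# compared with the integer-encoded truth.
toy_truth = [0, 0, 1, 1, 0, 1]
toy_pred_labels = ["<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K"]
toy_pred = [label_to_code[label] for label in toy_pred_labels]

toy_f1 = binary_f1(toy_truth, toy_pred)
```

Comparing the raw strings with the integer truth directly would count every prediction as wrong, which is why the encoding step comes first.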
Inference on the individual testing set#
When loading from UCI datasets, UserConfig.from_uci detects that a separate testing file exists (see the warning printed during loading), so the downloaded .zip file is not removed. We can open the archive with zipfile.
[10]:
import zipfile
zipf = zipfile.ZipFile(os.path.join(tabensemb.setting["default_data_path"], "Adult.zip"))
zipf.namelist()
[10]:
['Index', 'adult.data', 'adult.names', 'adult.test', 'old.adult.names']
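The cell above keeps the archive handle open so that the next cell can read from it. In a standalone script, a context manager would close the handle automatically. The sketch below builds a tiny in-memory archive as a stand-in for Adult.zip, so the member contents are made up for illustration.

```python
import io
import zipfile

# Build a small in-memory archive; the member contents are placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zf:
    zf.writestr("adult.data", "39, State-gov, <=50K\n")
    zf.writestr("adult.test", "|1x3 Cross validator\n25, Private, <=50K.\n")

# Re-open for reading; the handle is closed when the block exits.
with zipfile.ZipFile(buffer) as zf:
    member_names = zf.namelist()
    test_text = zf.read("adult.test").decode()
```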
Now check the content of adult.test. It is a .csv-like file with the same layout as adult.data, except that it has an extra line at the top and an extra “.” at the end of each row.
[11]:
file = zipf.read("adult.test").decode()
print(file[:500])
|1x3 Cross validator
25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.
38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.
28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.
44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0
We first strip the extra header line and the trailing periods. We then use the provided str_to_dataframe function, which converts the string into a DataFrame and replaces illegal values with np.nan.
[12]:
from tabensemb.utils import str_to_dataframe
file = file.replace("|1x3 Cross validator\n","").replace(".\n", "\n")
testing_df = str_to_dataframe(file, sep=", ", names=trainer.df.columns, check_nan_on=trainer.cont_feature_names)
testing_df
/home/xlluo/hdd/tabular_ensemble/tabensemb/utils/utils.py:464: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
df = pd.read_csv(StringIO(s), names=names, sep=sep)
[12]:
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16276 | 39 | Private | 215419 | Bachelors | 13 | Divorced | Prof-specialty | Not-in-family | White | Female | 0 | 0 | 36 | United-States | <=50K |
| 16277 | 64 | ? | 321403 | HS-grad | 9 | Widowed | ? | Other-relative | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 16278 | 38 | Private | 374983 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 16279 | 44 | Private | 83891 | Bachelors | 13 | Divorced | Adm-clerical | Own-child | Asian-Pac-Islander | Male | 5455 | 0 | 40 | United-States | <=50K |
| 16280 | 35 | Self-emp-inc | 182148 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |
16281 rows × 15 columns
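The chained replace calls above depend on exact string matches. A line-based variant of the same cleaning can be sketched with the standard library. `clean_adult_test` is a hypothetical helper, and the sample string is a shortened stand-in for the real file with only four columns.

```python
import csv
import io


def clean_adult_test(text: str) -> str:
    """Drop lines starting with '|' and strip one trailing '.' per row."""
    cleaned = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("|"):
            continue
        cleaned.append(line[:-1] if line.endswith(".") else line)
    return "\n".join(cleaned)


sample = (
    "|1x3 Cross validator\n"
    "25, Private, 226802, <=50K.\n"
    "28, Local-gov, 336951, >50K.\n"
)
cleaned_text = clean_adult_test(sample)

# skipinitialspace=True absorbs the blank after each comma (the ", " separator).
rows = list(csv.reader(io.StringIO(cleaned_text), skipinitialspace=True))
```

Unlike replacing ".\n", this variant also handles a final row that lacks a trailing newline. The cleaned string could then be passed to str_to_dataframe as in the cell above.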
Inference works the same way as before.
[13]:
result = trainer.get_modelbase("AutoGluon_Random Forest").predict(testing_df, model_name="Random Forest")
result
[13]:
array([['<=50K'],
['<=50K'],
['<=50K'],
...,
['>50K'],
['<=50K'],
['>50K']], dtype=object)
Both the truth and the result must be ordinal-encoded before metrics are calculated.
[14]:
encoded_truth = trainer.datamodule.label_categories_transform(testing_df[trainer.label_name]).values
encoded_result = trainer.datamodule.label_categories_transform(pd.DataFrame(result, columns=trainer.label_name)).values
auto_metric_sklearn(encoded_truth, encoded_result, "f1_score", "binary"), auto_metric_sklearn(encoded_truth, encoded_result, "roc_auc_score", "binary")
[14]:
(0.6690190543401553, 0.7731706276694975)
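Note that `encoded_result` holds hard class labels rather than probabilities. A ROC curve built from hard binary labels has a single operating point, and the resulting AUC equals the balanced accuracy, (TPR + TNR) / 2. This is worth keeping in mind when comparing the value with a probability-based ROC AUC; how the leaderboard computes its ROC_AUC_SCORE is not shown here. The sketch below checks the identity on toy data with plain Python. Both helpers are hypothetical and not part of tabensemb or scikit-learn.

```python
def auc_from_scores(y_true, scores):
    """ROC AUC via pairwise comparison (Mann-Whitney U); ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    n_pos = sum(t == 1 for t in y_true)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy data: hard 0/1 predictions used in place of scores.
toy_truth = [0, 0, 0, 0, 1, 1, 1, 0]
toy_hard_pred = [0, 0, 1, 0, 1, 0, 1, 0]

auc_hard = auc_from_scores(toy_truth, toy_hard_pred)
bal_acc = balanced_accuracy(toy_truth, toy_hard_pred)
```

On this toy data both quantities come out to 11/15. If predicted probabilities are available, passing them instead of hard labels would give a threshold-independent AUC.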