{ "cells": [ { "cell_type": "markdown", "source": [ "# Inference on an upcoming dataset\n", "\n", "In this part, we simulate a real deployment of the package and run inference on an upcoming dataset. We use the Adult dataset from the UCI Machine Learning Repository, which ships with a separate test set.\n", "\n", "## Training models\n", "\n", "As in the first example, we initialize a `Trainer` and the model bases, then train all models." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using cuda device\n", "Downloading https://archive.ics.uci.edu/static/public/2/adult.zip to /tmp/tmpkc_l71f1/data/Adult.zip\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/xlluo/hdd/tabular_ensemble/tabensemb/config/user_config.py:292: UserWarning: There exists .test file(s) ['adult.test'] which should be used for final metrics. 
The .zip file is left for the user to process.\n", " warnings.warn(\n", "/home/xlluo/hdd/tabular_ensemble/tabensemb/utils/utils.py:464: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n", " df = pd.read_csv(StringIO(s), names=names, sep=sep)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "age is Integer and will be treated as a continuous feature.\n", "fnlwgt is Integer and will be treated as a continuous feature.\n", "education-num is Integer and will be treated as a continuous feature.\n", "capital-gain is Integer and will be treated as a continuous feature.\n", "capital-loss is Integer and will be treated as a continuous feature.\n", "hours-per-week is Integer and will be treated as a continuous feature.\n", "The project will be saved to /tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig\n", "Dataset size: 19536 6512 6513\n", "Data saved to /tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig (data.csv and tabular_data.csv).\n", "\n", "-------------Run PytorchTabular-------------\n", "\n", "Training Category Embedding\n", "Global seed set to 42\n", "2023-09-23 20:35:00,222 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders\n", "2023-09-23 20:35:00,223 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task\n", "2023-09-23 20:35:00,288 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel\n", "2023-09-23 20:35:00,307 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 
and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.\n", " rank_zero_deprecation(\n", "Auto select gpus: [0]\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "IPU available: False, using: 0 IPUs\n", "HPU available: False, using: 0 HPUs\n", "2023-09-23 20:35:01,129 - {pytorch_tabular.tabular_model:582} - INFO - Training Started\n", "You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "\n", " | Name | Type | Params\n", "---------------------------------------------------------------\n", "0 | _backbone | CategoryEmbeddingBackbone | 18.5 K\n", "1 | _embedding_layer | Embedding1dLayer | 1.4 K \n", "2 | head | LinearHead | 66 \n", "3 | loss | CrossEntropyLoss | 0 \n", "---------------------------------------------------------------\n", "20.0 K Trainable params\n", "0 Non-trainable params\n", "20.0 K Total params\n", "0.080 Total estimated model params size (MB)\n", "Epoch: 1/300, Train loss: 0.4666, Val loss: 0.3794, Min val loss: 0.3794, Epoch time: 0.376s.\n", "Epoch: 20/300, Train loss: 0.3112, Val loss: 0.3184, Min val loss: 0.3177, Epoch time: 0.406s.\n", "Epoch: 40/300, Train loss: 0.2972, Val loss: 0.3213, Min val loss: 0.3162, Epoch time: 0.326s.\n", "Epoch: 60/300, Train loss: 0.2864, Val loss: 0.3248, Min val loss: 0.3162, Epoch time: 0.330s.\n", "Epoch: 80/300, Train loss: 0.2766, Val loss: 0.3323, Min val loss: 0.3162, Epoch time: 0.386s.\n", "Epoch: 100/300, Train loss: 0.2677, Val loss: 0.3416, Min val loss: 0.3162, Epoch time: 0.540s.\n", "Epoch: 120/300, Train loss: 0.2602, Val loss: 
0.3484, Min val loss: 0.3162, Epoch time: 0.357s.\n", "2023-09-23 20:35:50,914 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed\n", "2023-09-23 20:35:50,915 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.\n", " rank_zero_deprecation(\n", "Training log_loss loss: 0.28862\n", "Validation log_loss loss: 0.31619\n", "Testing log_loss loss: 0.31114\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------PytorchTabular End-------------\n", "\n", "\n", "-------------Run WideDeep-------------\n", "\n", "Training TabMlp\n", "Epoch: 1/300, Train loss: 0.5206, Val loss: 0.4274, Min val loss: 0.4274\n", "Epoch: 21/300, Train loss: 0.3145, Val loss: 0.3161, Min val loss: 0.3161\n", "Epoch: 41/300, Train loss: 0.3063, Val loss: 0.3192, Min val loss: 0.3161\n", "Epoch: 61/300, Train loss: 0.2984, Val loss: 0.3209, Min val loss: 0.3161\n", "Epoch: 81/300, Train loss: 0.2905, Val loss: 0.3227, Min val loss: 0.3161\n", "Epoch: 101/300, Train loss: 0.2819, Val loss: 0.3266, Min val loss: 0.3161\n", "Epoch: 121/300, Train loss: 0.2754, Val loss: 0.3293, Min val loss: 0.3161\n", "Epoch 00121: early stopping\n", "Restoring model weights from the end of the best epoch\n", "Training log_loss loss: 0.30084\n", "Validation log_loss loss: 0.31729\n", "Testing log_loss loss: 0.31133\n", "Trainer saved. 
To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------WideDeep End-------------\n", "\n", "\n", "-------------Run AutoGluon-------------\n", "\n", "Training Random Forest\n", "Presets specified: ['best_quality']\n", "Warning: hyperparameter tuning is currently experimental and may cause the process to hang.\n", "Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1\n", "Beginning AutoGluon training ...\n", "AutoGluon will save models to \"/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/AutoGluon/Random Forest/\"\n", "AutoGluon Version: 0.8.2\n", "Python Version: 3.10.12\n", "Operating System: Linux\n", "Platform Machine: x86_64\n", "Platform Version: #1 SMP Mon Dec 13 20:27:58 CST 2021\n", "Disk Space Avail: 149.26 GB / 502.47 GB (29.7%)\n", "Train Data Rows: 19536\n", "Train Data Columns: 14\n", "Tuning Data Rows: 6512\n", "Tuning Data Columns: 14\n", "Label Column: income\n", "Preprocessing data ...\n", "Selected class <--> label mapping: class 1 = 1, class 0 = 0\n", "Using Feature Generators to preprocess the data ...\n", "Fitting PipelineFeatureGenerator...\n", "\tAvailable Memory: 12442.89 MB\n", "\tTrain Data (Original) Memory Usage: 15.07 MB (0.1% of available memory)\n", "\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n", "\tStage 1 Generators:\n", "\t\tFitting AsTypeFeatureGenerator...\n", "\t\t\tNote: Converting 1 features to boolean dtype as they only contain 2 unique values.\n", "\tStage 2 Generators:\n", "\t\tFitting FillNaFeatureGenerator...\n", "\tStage 3 Generators:\n", "\t\tFitting IdentityFeatureGenerator...\n", "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. 
Ignoring feature_metadata_in.\n", "\t\tFitting CategoryFeatureGenerator...\n", "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.\n", "\t\t\tFitting CategoryMemoryMinimizeFeatureGenerator...\n", "\tStage 4 Generators:\n", "\t\tFitting DropUniqueFeatureGenerator...\n", "\tStage 5 Generators:\n", "\t\tFitting DropDuplicatesFeatureGenerator...\n", "\tTypes of features in original data (raw dtype, special dtypes):\n", "\t\t('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n", "\t\t('object', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]\n", "\tTypes of features in processed data (raw dtype, special dtypes):\n", "\t\t('category', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]\n", "\t\t('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n", "\t0.1s = Fit runtime\n", "\t14 features in original data used to generate 14 features in processed data.\n", "\tTrain Data (Processed) Memory Usage: 1.46 MB (0.0% of available memory)\n", "Data preprocessing and feature engineering runtime = 0.15s ...\n", "AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n", "\tTo change this, specify the eval_metric parameter of Predictor()\n", "use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).\n", "User-specified model hyperparameters to be fit:\n", "{\n", "\t'RF': {},\n", "}\n", "Fitting 1 L1 models ...\n", "Hyperparameter tuning model: RandomForest_BAG_L1 ...\n", "\tNo hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. 
Will train one model based on the provided hyperparameters.\n", "Fitted model: RandomForest_BAG_L1 ...\n", "\t0.8543\t = Validation score (accuracy)\n", "\t1.19s\t = Training runtime\n", "\t0.0s\t = Validation runtime\n", "Fitting model: WeightedEnsemble_L2 ...\n", "\t0.8538\t = Validation score (accuracy)\n", "\t0.0s\t = Training runtime\n", "\t0.01s\t = Validation runtime\n", "AutoGluon training complete, total runtime = 1.54s ... Best model: \"WeightedEnsemble_L2\"\n", "TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/AutoGluon/Random Forest/\")\n", "Training log_loss loss: 0.08021\n", "Validation log_loss loss: 0.31802\n", "Testing log_loss loss: 0.31261\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------AutoGluon End-------------\n", "\n", "PytorchTabular metrics\n", "Category Embedding 1/1\n", "WideDeep metrics\n", "TabMlp 1/1\n", "AutoGluon metrics\n", "Random Forest 1/1\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')\n" ] }, { "data": { "text/plain": " Program Model Training F1_SCORE \\\n0 WideDeep TabMlp 0.694200 \n1 AutoGluon Random Forest 1.000000 \n2 PytorchTabular Category Embedding 0.709806 \n\n Training PRECISION_SCORE Training RECALL_SCORE Training JACCARD_SCORE \\\n0 0.728505 0.662981 0.531628 \n1 1.000000 1.000000 1.000000 \n2 0.738341 0.683394 0.550154 \n\n Training ACCURACY_SCORE Training BALANCED_ACCURACY_SCORE \\\n0 0.859388 0.792321 \n1 1.000000 1.000000 \n2 0.865479 0.803303 \n\n Training COHEN_KAPPA_SCORE Training HAMMING_LOSS ... \\\n0 0.603167 0.140612 ... \n1 1.000000 0.000000 ... \n2 0.622423 0.134521 ... 
\n\n Validation ACCURACY_SCORE Validation BALANCED_ACCURACY_SCORE \\\n0 0.852426 0.784474 \n1 0.853808 0.776665 \n2 0.850430 0.784467 \n\n Validation COHEN_KAPPA_SCORE Validation HAMMING_LOSS \\\n0 0.584884 0.147574 \n1 0.580404 0.146192 \n2 0.581612 0.149570 \n\n Validation MATTHEWS_CORRCOEF Validation ZERO_ONE_LOSS \\\n0 0.585738 0.147574 \n1 0.583003 0.146192 \n2 0.582150 0.149570 \n\n Validation ROC_AUC_SCORE Validation LOG_LOSS Validation BRIER_SCORE_LOSS \\\n0 0.908951 0.317288 0.101612 \n1 0.907010 0.318016 0.100486 \n2 0.909318 0.316194 0.101722 \n\n Validation AVERAGE_PRECISION_SCORE \n0 0.868420 \n1 0.875084 \n2 0.868410 \n\n[3 rows x 44 columns]", "text/html": "
| | Program | Model | Training F1_SCORE | Training PRECISION_SCORE | Training RECALL_SCORE | Training JACCARD_SCORE | Training ACCURACY_SCORE | Training BALANCED_ACCURACY_SCORE | Training COHEN_KAPPA_SCORE | Training HAMMING_LOSS | ... | Validation ACCURACY_SCORE | Validation BALANCED_ACCURACY_SCORE | Validation COHEN_KAPPA_SCORE | Validation HAMMING_LOSS | Validation MATTHEWS_CORRCOEF | Validation ZERO_ONE_LOSS | Validation ROC_AUC_SCORE | Validation LOG_LOSS | Validation BRIER_SCORE_LOSS | Validation AVERAGE_PRECISION_SCORE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WideDeep | TabMlp | 0.694200 | 0.728505 | 0.662981 | 0.531628 | 0.859388 | 0.792321 | 0.603167 | 0.140612 | ... | 0.852426 | 0.784474 | 0.584884 | 0.147574 | 0.585738 | 0.147574 | 0.908951 | 0.317288 | 0.101612 | 0.868420 |
| 1 | AutoGluon | Random Forest | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | ... | 0.853808 | 0.776665 | 0.580404 | 0.146192 | 0.583003 | 0.146192 | 0.907010 | 0.318016 | 0.100486 | 0.875084 |
| 2 | PytorchTabular | Category Embedding | 0.709806 | 0.738341 | 0.683394 | 0.550154 | 0.865479 | 0.803303 | 0.622423 | 0.134521 | ... | 0.850430 | 0.784467 | 0.581612 | 0.149570 | 0.582150 | 0.149570 | 0.909318 | 0.316194 | 0.101722 | 0.868410 |
3 rows × 44 columns

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16276 | 39 | Private | 215419 | Bachelors | 13 | Divorced | Prof-specialty | Not-in-family | White | Female | 0 | 0 | 36 | United-States | <=50K |
| 16277 | 64 | ? | 321403 | HS-grad | 9 | Widowed | ? | Other-relative | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 16278 | 38 | Private | 374983 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 16279 | 44 | Private | 83891 | Bachelors | 13 | Divorced | Adm-clerical | Own-child | Asian-Pac-Islander | Male | 5455 | 0 | 40 | United-States | <=50K |
| 16280 | 35 | Self-emp-inc | 182148 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |
16281 rows × 15 columns