{ "cells": [ { "cell_type": "markdown", "source": [ "# Inference on an upcoming dataset\n", "\n", "In this part, we simulate a real deployment of the package and run inference on an upcoming dataset. We use the Adult dataset from the UCI Machine Learning Repository, which ships with a separate testing set.\n", "\n", "## Training models\n", "\n", "As in the first example, we initialize a `Trainer` and the model bases, then train all models." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using cuda device\n", "Downloading https://archive.ics.uci.edu/static/public/2/adult.zip to /tmp/tmpkc_l71f1/data/Adult.zip\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/xlluo/hdd/tabular_ensemble/tabensemb/config/user_config.py:292: UserWarning: There exists .test file(s) ['adult.test'] which should be used for final metrics. 
The .zip file is left for the user to process.\n", " warnings.warn(\n", "/home/xlluo/hdd/tabular_ensemble/tabensemb/utils/utils.py:464: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n", " df = pd.read_csv(StringIO(s), names=names, sep=sep)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "age is Integer and will be treated as a continuous feature.\n", "fnlwgt is Integer and will be treated as a continuous feature.\n", "education-num is Integer and will be treated as a continuous feature.\n", "capital-gain is Integer and will be treated as a continuous feature.\n", "capital-loss is Integer and will be treated as a continuous feature.\n", "hours-per-week is Integer and will be treated as a continuous feature.\n", "The project will be saved to /tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig\n", "Dataset size: 19536 6512 6513\n", "Data saved to /tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig (data.csv and tabular_data.csv).\n", "\n", "-------------Run PytorchTabular-------------\n", "\n", "Training Category Embedding\n", "Global seed set to 42\n", "2023-09-23 20:35:00,222 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders\n", "2023-09-23 20:35:00,223 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task\n", "2023-09-23 20:35:00,288 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel\n", "2023-09-23 20:35:00,307 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 
and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.\n", " rank_zero_deprecation(\n", "Auto select gpus: [0]\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "IPU available: False, using: 0 IPUs\n", "HPU available: False, using: 0 HPUs\n", "2023-09-23 20:35:01,129 - {pytorch_tabular.tabular_model:582} - INFO - Training Started\n", "You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "\n", " | Name | Type | Params\n", "---------------------------------------------------------------\n", "0 | _backbone | CategoryEmbeddingBackbone | 18.5 K\n", "1 | _embedding_layer | Embedding1dLayer | 1.4 K \n", "2 | head | LinearHead | 66 \n", "3 | loss | CrossEntropyLoss | 0 \n", "---------------------------------------------------------------\n", "20.0 K Trainable params\n", "0 Non-trainable params\n", "20.0 K Total params\n", "0.080 Total estimated model params size (MB)\n", "Epoch: 1/300, Train loss: 0.4666, Val loss: 0.3794, Min val loss: 0.3794, Epoch time: 0.376s.\n", "Epoch: 20/300, Train loss: 0.3112, Val loss: 0.3184, Min val loss: 0.3177, Epoch time: 0.406s.\n", "Epoch: 40/300, Train loss: 0.2972, Val loss: 0.3213, Min val loss: 0.3162, Epoch time: 0.326s.\n", "Epoch: 60/300, Train loss: 0.2864, Val loss: 0.3248, Min val loss: 0.3162, Epoch time: 0.330s.\n", "Epoch: 80/300, Train loss: 0.2766, Val loss: 0.3323, Min val loss: 0.3162, Epoch time: 0.386s.\n", "Epoch: 100/300, Train loss: 0.2677, Val loss: 0.3416, Min val loss: 0.3162, Epoch time: 0.540s.\n", "Epoch: 120/300, Train loss: 0.2602, Val loss: 
0.3484, Min val loss: 0.3162, Epoch time: 0.357s.\n", "2023-09-23 20:35:50,914 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed\n", "2023-09-23 20:35:50,915 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model\n", "/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.\n", " rank_zero_deprecation(\n", "Training log_loss loss: 0.28862\n", "Validation log_loss loss: 0.31619\n", "Testing log_loss loss: 0.31114\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------PytorchTabular End-------------\n", "\n", "\n", "-------------Run WideDeep-------------\n", "\n", "Training TabMlp\n", "Epoch: 1/300, Train loss: 0.5206, Val loss: 0.4274, Min val loss: 0.4274\n", "Epoch: 21/300, Train loss: 0.3145, Val loss: 0.3161, Min val loss: 0.3161\n", "Epoch: 41/300, Train loss: 0.3063, Val loss: 0.3192, Min val loss: 0.3161\n", "Epoch: 61/300, Train loss: 0.2984, Val loss: 0.3209, Min val loss: 0.3161\n", "Epoch: 81/300, Train loss: 0.2905, Val loss: 0.3227, Min val loss: 0.3161\n", "Epoch: 101/300, Train loss: 0.2819, Val loss: 0.3266, Min val loss: 0.3161\n", "Epoch: 121/300, Train loss: 0.2754, Val loss: 0.3293, Min val loss: 0.3161\n", "Epoch 00121: early stopping\n", "Restoring model weights from the end of the best epoch\n", "Training log_loss loss: 0.30084\n", "Validation log_loss loss: 0.31729\n", "Testing log_loss loss: 0.31133\n", "Trainer saved. 
To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------WideDeep End-------------\n", "\n", "\n", "-------------Run AutoGluon-------------\n", "\n", "Training Random Forest\n", "Presets specified: ['best_quality']\n", "Warning: hyperparameter tuning is currently experimental and may cause the process to hang.\n", "Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1\n", "Beginning AutoGluon training ...\n", "AutoGluon will save models to \"/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/AutoGluon/Random Forest/\"\n", "AutoGluon Version: 0.8.2\n", "Python Version: 3.10.12\n", "Operating System: Linux\n", "Platform Machine: x86_64\n", "Platform Version: #1 SMP Mon Dec 13 20:27:58 CST 2021\n", "Disk Space Avail: 149.26 GB / 502.47 GB (29.7%)\n", "Train Data Rows: 19536\n", "Train Data Columns: 14\n", "Tuning Data Rows: 6512\n", "Tuning Data Columns: 14\n", "Label Column: income\n", "Preprocessing data ...\n", "Selected class <--> label mapping: class 1 = 1, class 0 = 0\n", "Using Feature Generators to preprocess the data ...\n", "Fitting PipelineFeatureGenerator...\n", "\tAvailable Memory: 12442.89 MB\n", "\tTrain Data (Original) Memory Usage: 15.07 MB (0.1% of available memory)\n", "\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n", "\tStage 1 Generators:\n", "\t\tFitting AsTypeFeatureGenerator...\n", "\t\t\tNote: Converting 1 features to boolean dtype as they only contain 2 unique values.\n", "\tStage 2 Generators:\n", "\t\tFitting FillNaFeatureGenerator...\n", "\tStage 3 Generators:\n", "\t\tFitting IdentityFeatureGenerator...\n", "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. 
Ignoring feature_metadata_in.\n", "\t\tFitting CategoryFeatureGenerator...\n", "\t\t\tWarning: feature_metadata_in passed as input to fit_transform, but self.feature_metadata_in was already set. Ignoring feature_metadata_in.\n", "\t\t\tFitting CategoryMemoryMinimizeFeatureGenerator...\n", "\tStage 4 Generators:\n", "\t\tFitting DropUniqueFeatureGenerator...\n", "\tStage 5 Generators:\n", "\t\tFitting DropDuplicatesFeatureGenerator...\n", "\tTypes of features in original data (raw dtype, special dtypes):\n", "\t\t('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n", "\t\t('object', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]\n", "\tTypes of features in processed data (raw dtype, special dtypes):\n", "\t\t('category', []) : 8 | ['education', 'marital-status', 'native-country', 'occupation', 'race', ...]\n", "\t\t('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n", "\t0.1s = Fit runtime\n", "\t14 features in original data used to generate 14 features in processed data.\n", "\tTrain Data (Processed) Memory Usage: 1.46 MB (0.0% of available memory)\n", "Data preprocessing and feature engineering runtime = 0.15s ...\n", "AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n", "\tTo change this, specify the eval_metric parameter of Predictor()\n", "use_bag_holdout=True, will use tuning_data as holdout (will not be used for early stopping).\n", "User-specified model hyperparameters to be fit:\n", "{\n", "\t'RF': {},\n", "}\n", "Fitting 1 L1 models ...\n", "Hyperparameter tuning model: RandomForest_BAG_L1 ...\n", "\tNo hyperparameter search space specified for RandomForest_BAG_L1. Skipping HPO. 
Will train one model based on the provided hyperparameters.\n", "Fitted model: RandomForest_BAG_L1 ...\n", "\t0.8543\t = Validation score (accuracy)\n", "\t1.19s\t = Training runtime\n", "\t0.0s\t = Validation runtime\n", "Fitting model: WeightedEnsemble_L2 ...\n", "\t0.8538\t = Validation score (accuracy)\n", "\t0.0s\t = Training runtime\n", "\t0.01s\t = Validation runtime\n", "AutoGluon training complete, total runtime = 1.54s ... Best model: \"WeightedEnsemble_L2\"\n", "TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/AutoGluon/Random Forest/\")\n", "Training log_loss loss: 0.08021\n", "Validation log_loss loss: 0.31802\n", "Testing log_loss loss: 0.31261\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')\n", "\n", "-------------AutoGluon End-------------\n", "\n", "PytorchTabular metrics\n", "Category Embedding 1/1\n", "WideDeep metrics\n", "TabMlp 1/1\n", "AutoGluon metrics\n", "Random Forest 1/1\n", "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig/trainer.pkl')\n" ] }, { "data": { "text/plain": " Program Model Training F1_SCORE \\\n0 WideDeep TabMlp 0.694200 \n1 AutoGluon Random Forest 1.000000 \n2 PytorchTabular Category Embedding 0.709806 \n\n Training PRECISION_SCORE Training RECALL_SCORE Training JACCARD_SCORE \\\n0 0.728505 0.662981 0.531628 \n1 1.000000 1.000000 1.000000 \n2 0.738341 0.683394 0.550154 \n\n Training ACCURACY_SCORE Training BALANCED_ACCURACY_SCORE \\\n0 0.859388 0.792321 \n1 1.000000 1.000000 \n2 0.865479 0.803303 \n\n Training COHEN_KAPPA_SCORE Training HAMMING_LOSS ... \\\n0 0.603167 0.140612 ... \n1 1.000000 0.000000 ... \n2 0.622423 0.134521 ... 
\n\n Validation ACCURACY_SCORE Validation BALANCED_ACCURACY_SCORE \\\n0 0.852426 0.784474 \n1 0.853808 0.776665 \n2 0.850430 0.784467 \n\n Validation COHEN_KAPPA_SCORE Validation HAMMING_LOSS \\\n0 0.584884 0.147574 \n1 0.580404 0.146192 \n2 0.581612 0.149570 \n\n Validation MATTHEWS_CORRCOEF Validation ZERO_ONE_LOSS \\\n0 0.585738 0.147574 \n1 0.583003 0.146192 \n2 0.582150 0.149570 \n\n Validation ROC_AUC_SCORE Validation LOG_LOSS Validation BRIER_SCORE_LOSS \\\n0 0.908951 0.317288 0.101612 \n1 0.907010 0.318016 0.100486 \n2 0.909318 0.316194 0.101722 \n\n Validation AVERAGE_PRECISION_SCORE \n0 0.868420 \n1 0.875084 \n2 0.868410 \n\n[3 rows x 44 columns]", "text/html": "
" }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch\n", "from tabensemb.trainer import Trainer\n", "from tabensemb.model import PytorchTabular, WideDeep, AutoGluon\n", "import tabensemb\n", "from tabensemb.config import UserConfig\n", "import os\n", "from tempfile import TemporaryDirectory\n", "\n", "# Keep outputs, configs, and data in a temporary directory for this demo\n", "temp_path = TemporaryDirectory()\n", "tabensemb.setting[\"default_output_path\"] = os.path.join(temp_path.name, \"output\")\n", "tabensemb.setting[\"default_config_path\"] = os.path.join(temp_path.name, \"configs\")\n", "tabensemb.setting[\"default_data_path\"] = os.path.join(temp_path.name, \"data\")\n", "\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "print(\"Using {} device\".format(device))\n", "\n", "trainer = Trainer(device=device)\n", "# Column names of the Adult dataset, passed explicitly to `from_uci`\n", "adult_columns = [\n", "    \"age\",\n", "    \"workclass\",\n", "    \"fnlwgt\",\n", "    \"education\",\n", "    \"education-num\",\n", "    \"marital-status\",\n", "    \"occupation\",\n", "    \"relationship\",\n", "    \"race\",\n", "    \"sex\",\n", "    \"capital-gain\",\n", "    \"capital-loss\",\n", "    \"hours-per-week\",\n", "    \"native-country\",\n", "    \"income\",\n", "]\n", "cfg = UserConfig.from_uci(\"Adult\", column_names=adult_columns, sep=\", \")\n", "trainer.load_config(cfg)\n", "trainer.load_data()\n", "# Train one model from each model base\n", "models = [\n", "    PytorchTabular(trainer, model_subset=[\"Category Embedding\"]),\n", "    WideDeep(trainer, model_subset=[\"TabMlp\"]),\n", "    AutoGluon(trainer, model_subset=[\"Random Forest\"]),\n", "]\n", "trainer.add_modelbases(models)\n", "trainer.train(stderr_to_stdout=True)\n", "trainer.get_leaderboard()" ] }, { "cell_type": "markdown", "source": [ "## Selecting and storing a model\n", "\n", "From the leaderboard, we can compare the performance of each model and select one for deployment. Suppose we choose `Random Forest` from `AutoGluon`. We detach the model from the heavy `trainer`, and the detached copy is stored locally in a separate directory." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 2, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpkc_l71f1/output/adult/2023-09-23-20-34-58-0_UserInputConfig-I1/trainer.pkl')\n" ] } ], "source": [ "trainer_of_one_model = trainer.detach_model(program=\"AutoGluon\", model_name=\"Random Forest\")" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "The detached trainer now has only one model base." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 3, "outputs": [ { "data": { "text/plain": "[]" }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Model bases of the detached trainer\n", "trainer_of_one_model.modelbases" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 4, "outputs": [ { "data": { "text/plain": "" }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The model in the model base\n", "trainer_of_one_model.get_modelbase(\"AutoGluon_Random Forest\").model[\"Random Forest\"]" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Loading the model\n", "\n", "Now the `Trainer` containing a single model is stored in a separate directory. Assume that we want to load the local trainer in a separate script for inference. In the following line, the argument `path` of `load_trainer` is the path to `trainer.pkl`, which is already printed when detaching the model or training the model bases. 
Here we simply use the directory of the detached trainer `trainer_of_one_model`.\n", "\n", "**Remark**: You can move the directory anywhere (including to another device, provided that the package version and the environment are consistent) and rename the folder. `tabensemb` reconfigures the paths automatically." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 5, "outputs": [], "source": [ "from tabensemb.trainer import load_trainer\n", "\n", "trainer = load_trainer(path=os.path.join(trainer_of_one_model.project_root, \"trainer.pkl\"))" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 6, "outputs": [ { "data": { "text/plain": "" }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainer.get_modelbase(\"AutoGluon_Random Forest\").model[\"Random Forest\"]" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Inference\n", "\n", "Assume that we have a new `DataFrame` representing an upcoming dataset. For demonstration, we use the testing split here. In `trainer.df`, the classification target is already ordinal-encoded by `trainer.datamodule.label_ordinal_encoder`." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 7, "outputs": [ { "data": { "text/plain": "array([0, 0, 1, ..., 1, 0, 0])" }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = trainer.df.loc[trainer.test_indices, :]\n", "truth = trainer.df.loc[trainer.test_indices, trainer.label_name].values.flatten()\n", "truth" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Use the `predict` method of the model base to run inference. 
The returned result contains the original (decoded) class labels:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 8, "outputs": [ { "data": { "text/plain": "array([['<=50K'],\n ['<=50K'],\n ['>50K'],\n ...,\n ['>50K'],\n ['<=50K'],\n ['<=50K']], dtype=object)" }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "result = trainer.get_modelbase(\"AutoGluon_Random Forest\").predict(df, model_name=\"Random Forest\")\n", "result" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "The result must be ordinal-encoded before metrics can be calculated. We provide `DataModule.label_categories_transform` for this (and `DataModule.label_categories_inverse_transform` for the inverse transform). The F1 score on the \"new\" (testing) dataset is then the same as the one in the leaderboard above.\n", "\n", "`auto_metric_sklearn` automatically calculates various `sklearn.metrics`, which is especially useful for classification tasks." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 9, "outputs": [ { "data": { "text/plain": "(0.6936416184971098, array([0.69364162]))" }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from tabensemb.utils import auto_metric_sklearn\n", "\n", "encoded_result = trainer.datamodule.label_categories_transform(pd.DataFrame(result, columns=trainer.label_name)).values\n", "# You can also use trainer.datamodule.label_ordinal_encoder.transform(result) to get the same result.\n", "auto_metric_sklearn(truth, encoded_result, \"f1_score\", \"binary\"), trainer.leaderboard.loc[trainer.leaderboard[\"Model\"] == \"Random Forest\", \"Testing F1_SCORE\"].values" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## 
Inference on the separate testing set\n", "\n", "When loading from UCI datasets, `UserConfig.from_uci` detects that a separate testing file exists, so it keeps the downloaded .zip file instead of removing it. We can open the archive with `zipfile`." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 10, "outputs": [ { "data": { "text/plain": "['Index', 'adult.data', 'adult.names', 'adult.test', 'old.adult.names']" }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import zipfile\n", "\n", "zipf = zipfile.ZipFile(os.path.join(tabensemb.setting[\"default_data_path\"], \"Adult.zip\"))\n", "zipf.namelist()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Now check the content of `adult.test`. It is a .csv-like file in the same format as `adult.data`, except that it has an additional row at the top and an additional \".\" at the end of each line." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 11, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "|1x3 Cross validator\n", "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n", "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.\n", "28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.\n", "44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0\n" ] } ], "source": [ "file = zipf.read(\"adult.test\").decode()\n", "print(file[:500])" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "We provide the `str_to_dataframe` function that translates the string object to a `DataFrame` and replaces illegal values with `np.nan`." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 12, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/xlluo/hdd/tabular_ensemble/tabensemb/utils/utils.py:464: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n", " df = pd.read_csv(StringIO(s), names=names, sep=sep)\n" ] }, { "data": { "text/plain": " age workclass fnlwgt education education-num \\\n0 25 Private 226802 11th 7 \n1 38 Private 89814 HS-grad 9 \n2 28 Local-gov 336951 Assoc-acdm 12 \n3 44 Private 160323 Some-college 10 \n4 18 ? 103497 Some-college 10 \n... ... ... ... ... ... \n16276 39 Private 215419 Bachelors 13 \n16277 64 ? 
321403 HS-grad 9 \n16278 38 Private 374983 Bachelors 13 \n16279 44 Private 83891 Bachelors 13 \n16280 35 Self-emp-inc 182148 Bachelors 13 \n\n marital-status occupation relationship \\\n0 Never-married Machine-op-inspct Own-child \n1 Married-civ-spouse Farming-fishing Husband \n2 Married-civ-spouse Protective-serv Husband \n3 Married-civ-spouse Machine-op-inspct Husband \n4 Never-married ? Own-child \n... ... ... ... \n16276 Divorced Prof-specialty Not-in-family \n16277 Widowed ? Other-relative \n16278 Married-civ-spouse Prof-specialty Husband \n16279 Divorced Adm-clerical Own-child \n16280 Married-civ-spouse Exec-managerial Husband \n\n race sex capital-gain capital-loss hours-per-week \\\n0 Black Male 0 0 40 \n1 White Male 0 0 50 \n2 White Male 0 0 40 \n3 Black Male 7688 0 40 \n4 White Female 0 0 30 \n... ... ... ... ... ... \n16276 White Female 0 0 36 \n16277 Black Male 0 0 40 \n16278 White Male 0 0 50 \n16279 Asian-Pac-Islander Male 5455 0 40 \n16280 White Male 0 0 60 \n\n native-country income \n0 United-States <=50K \n1 United-States <=50K \n2 United-States >50K \n3 United-States >50K \n4 United-States <=50K \n... ... ... \n16276 United-States <=50K \n16277 United-States <=50K \n16278 United-States <=50K \n16279 United-States <=50K \n16280 United-States >50K \n\n[16281 rows x 15 columns]", "text/html": "
" }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from tabensemb.utils import str_to_dataframe\n", "\n", "# Drop the leading comment row and the trailing \".\" of each line\n", "file = file.replace(\"|1x3 Cross validator\\n\", \"\").replace(\".\\n\", \"\\n\")\n", "testing_df = str_to_dataframe(file, sep=\", \", names=trainer.df.columns, check_nan_on=trainer.cont_feature_names)\n", "testing_df" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Inference works the same way as before." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 13, "outputs": [ { "data": { "text/plain": "array([['<=50K'],\n ['<=50K'],\n ['<=50K'],\n ...,\n ['>50K'],\n ['<=50K'],\n ['>50K']], dtype=object)" }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = trainer.get_modelbase(\"AutoGluon_Random Forest\").predict(testing_df, model_name=\"Random Forest\")\n", "result" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Both the truth and the result need to be ordinal-encoded before metrics can be calculated." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 14, "outputs": [ { "data": { "text/plain": "(0.6690190543401553, 0.7731706276694975)" }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoded_truth = trainer.datamodule.label_categories_transform(testing_df[trainer.label_name]).values\n", "encoded_result = trainer.datamodule.label_categories_transform(pd.DataFrame(result, columns=trainer.label_name)).values\n", "# Note: `result` holds hard class labels, so the ROC AUC below is computed from labels rather than predicted probabilities.\n", "auto_metric_sklearn(encoded_truth, encoded_result, \"f1_score\", \"binary\"), auto_metric_sklearn(encoded_truth, encoded_result, \"roc_auc_score\", \"binary\")" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }