Plotting

tabensemb.trainer.Trainer provides several plotting methods for analysing the dataset and the results.

[1]:

import torch
from tabensemb.trainer import Trainer
from tabensemb.model import *
from tabensemb.config import UserConfig
import tabensemb
import os
from tempfile import TemporaryDirectory

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

temp_path = TemporaryDirectory()
tabensemb.setting["default_output_path"] = os.path.join(temp_path.name, "output")
tabensemb.setting["default_config_path"] = os.path.join(temp_path.name, "configs")
tabensemb.setting["default_data_path"] = os.path.join(temp_path.name, "data")

trainer = Trainer(device=device)
mpg_columns = [
    "mpg",
    "cylinders",
    "displacement",
    "horsepower",
    "weight",
    "acceleration",
    "model_year",
    "origin",
    "car_name",
]
cfg = UserConfig.from_uci("Auto MPG", column_names=mpg_columns, sep=r"\s+")
trainer.load_config(cfg)
trainer.load_data()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    CatEmbed(trainer, model_subset=["Category Embedding"])
]
trainer.add_modelbases(models)
trainer.train(stderr_to_stdout=True)

Using cuda device
Downloading https://archive.ics.uci.edu/static/public/9/auto+mpg.zip to /tmp/tmp7ery7st7/data/Auto MPG.zip
cylinders is Integer and will be treated as a continuous feature.
model_year is Integer and will be treated as a continuous feature.
origin is Integer and will be treated as a continuous feature.
Unknown values are detected in ['horsepower']. They will be treated as np.nan.
The project will be saved to /tmp/tmp7ery7st7/output/auto-mpg/2023-10-08-13-00-16-0_UserInputConfig
Dataset size: 238 80 80
Data saved to /tmp/tmp7ery7st7/output/auto-mpg/2023-10-08-13-00-16-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-10-08 13:00:17,416 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-08 13:00:17,417 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-08 13:00:17,424 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-08 13:00:17,434 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:589: LightningDeprecationWarning: The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v2.0.0. Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead.
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-10-08 13:00:18,152 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type                      | Params
---------------------------------------------------------------
0 | _backbone        | CategoryEmbeddingBackbone | 11.4 K
1 | _embedding_layer | Embedding1dLayer          | 14
2 | head             | LinearHead                | 33
3 | loss             | MSELoss                   | 0
---------------------------------------------------------------
11.4 K    Trainable params
0         Non-trainable params
11.4 K    Total params
0.046     Total estimated model params size (MB)
Epoch: 1/300, Train loss: 677.8015, Val loss: 582.9557, Min val loss: 582.9557, Epoch time: 0.010s.
Epoch: 20/300, Train loss: 353.7851, Val loss: 302.0203, Min val loss: 302.0203, Epoch time: 0.007s.
Epoch: 40/300, Train loss: 85.0776, Val loss: 62.1153, Min val loss: 62.1153, Epoch time: 0.007s.
Epoch: 60/300, Train loss: 45.2654, Val loss: 34.2778, Min val loss: 34.2691, Epoch time: 0.007s.
Epoch: 80/300, Train loss: 33.9537, Val loss: 26.8622, Min val loss: 26.8622, Epoch time: 0.007s.
Epoch: 100/300, Train loss: 26.9038, Val loss: 23.2417, Min val loss: 23.2372, Epoch time: 0.007s.
Epoch: 120/300, Train loss: 24.9622, Val loss: 20.4360, Min val loss: 20.4360, Epoch time: 0.007s.
Epoch: 140/300, Train loss: 24.1636, Val loss: 19.4010, Min val loss: 19.4010, Epoch time: 0.007s.
Epoch: 160/300, Train loss: 22.9200, Val loss: 18.0232, Min val loss: 17.9749, Epoch time: 0.007s.
Epoch: 180/300, Train loss: 19.7677, Val loss: 16.9469, Min val loss: 16.9469, Epoch time: 0.007s.
Epoch: 200/300, Train loss: 17.9390, Val loss: 16.6545, Min val loss: 16.4093, Epoch time: 0.007s.
Epoch: 220/300, Train loss: 19.4496, Val loss: 15.4451, Min val loss: 15.1788, Epoch time: 0.007s.
Epoch: 240/300, Train loss: 16.0483, Val loss: 14.5508, Min val loss: 14.5508, Epoch time: 0.008s.
Epoch: 260/300, Train loss: 16.4672, Val loss: 13.8354, Min val loss: 13.8354, Epoch time: 0.007s.
Epoch: 280/300, Train loss: 13.6031, Val loss: 12.9315, Min val loss: 12.9315, Epoch time: 0.007s.
Epoch: 300/300, Train loss: 16.5369, Val loss: 12.3673, Min val loss: 12.3673, Epoch time: 0.007s.
`Trainer.fit` stopped: `max_epochs=300` reached.
2023-10-08 13:00:21,684 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-08 13:00:21,684 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model
/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/pytorch_lightning/utilities/cloud_io.py:33: LightningDeprecationWarning: `pytorch_lightning.utilities.cloud_io.get_filesystem` has been deprecated in v1.8.0 and will be removed in v2.0.0. Please use `lightning_fabric.utilities.cloud_io.get_filesystem` instead.
  rank_zero_deprecation(
Training mse loss: 11.25175
Validation mse loss: 12.36725
Testing mse loss: 7.83801
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmp7ery7st7/output/auto-mpg/2023-10-08-13-00-16-0_UserInputConfig/trainer.pkl')

-------------PytorchTabular End-------------


-------------Run CatEmbed-------------

Training Category Embedding
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                | Type        | Params
----------------------------------------------------
0 | default_loss_fn     | MSELoss     | 0
1 | default_output_norm | Identity    | 0
2 | linear              | Sequential  | 11.4 K
3 | embed               | Embedding1d | 14
4 | head                | Linear      | 33
----------------------------------------------------
11.4 K    Trainable params
9         Non-trainable params
11.4 K    Total params
0.046     Total estimated model params size (MB)
Epoch: 1/300, Train loss: 677.8015, Val loss: 583.1656, Min val loss: 583.1656, Min ES val loss: 583.1656, Epoch time: 0.006s.
Epoch: 20/300, Train loss: 353.7850, Val loss: 301.6606, Min val loss: 301.6606, Min ES val loss: 301.6606, Epoch time: 0.005s.
Epoch: 40/300, Train loss: 85.0776, Val loss: 62.0647, Min val loss: 62.0647, Min ES val loss: 62.0647, Epoch time: 0.005s.
Epoch: 60/300, Train loss: 45.2654, Val loss: 34.2774, Min val loss: 34.2689, Min ES val loss: 34.2689, Epoch time: 0.005s.
Epoch: 80/300, Train loss: 33.9537, Val loss: 26.8621, Min val loss: 26.8621, Min ES val loss: 26.8621, Epoch time: 0.005s.
Epoch: 100/300, Train loss: 26.9038, Val loss: 23.2417, Min val loss: 23.2372, Min ES val loss: 23.2372, Epoch time: 0.005s.
Epoch: 120/300, Train loss: 24.9622, Val loss: 20.4360, Min val loss: 20.4360, Min ES val loss: 20.4360, Epoch time: 0.005s.
Epoch: 140/300, Train loss: 24.1636, Val loss: 19.4010, Min val loss: 19.4010, Min ES val loss: 19.4010, Epoch time: 0.005s.
Epoch: 160/300, Train loss: 22.9200, Val loss: 18.0232, Min val loss: 17.9749, Min ES val loss: 17.9749, Epoch time: 0.005s.
Epoch: 180/300, Train loss: 19.7677, Val loss: 16.9469, Min val loss: 16.9469, Min ES val loss: 16.9469, Epoch time: 0.005s.
Epoch: 200/300, Train loss: 17.9390, Val loss: 16.6545, Min val loss: 16.4093, Min ES val loss: 16.4093, Epoch time: 0.005s.
Epoch: 220/300, Train loss: 19.4496, Val loss: 15.4451, Min val loss: 15.1788, Min ES val loss: 15.1788, Epoch time: 0.005s.
Epoch: 240/300, Train loss: 16.0483, Val loss: 14.5508, Min val loss: 14.5508, Min ES val loss: 14.5508, Epoch time: 0.005s.
Epoch: 260/300, Train loss: 16.4672, Val loss: 13.8354, Min val loss: 13.8354, Min ES val loss: 13.8354, Epoch time: 0.005s.
Epoch: 280/300, Train loss: 13.6031, Val loss: 12.9315, Min val loss: 12.9315, Min ES val loss: 12.9315, Epoch time: 0.005s.
Epoch: 300/300, Train loss: 16.5369, Val loss: 12.3673, Min val loss: 12.3673, Min ES val loss: 12.3673, Epoch time: 0.005s.
`Trainer.fit` stopped: `max_epochs=300` reached.
Training mse loss: 11.25176
Validation mse loss: 12.36726
Testing mse loss: 7.83801
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmp7ery7st7/output/auto-mpg/2023-10-08-13-00-16-0_UserInputConfig/trainer.pkl')

-------------CatEmbed End-------------

If LaTeX is detected, matplotlib.rc("text", usetex=True) is called so that text is rendered with LaTeX for a better appearance. However, LaTeX throws errors if feature names contain "_". Here we reset matplotlib.rcParams to its defaults to disable LaTeX.

[2]:

import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams.update(matplotlib.rcParamsDefault)
plt.rcParams["figure.autolayout"] = True

Trainer.plot_fill_rating shows a histogram of the filling ratio of data points. In this dataset, several data points have missing entries.

Remark: All methods introduced in this part pass arguments, such as figure_kwargs, bar_kwargs, pairplot_kwargs, etc., to corresponding functions for limited but sufficient customization. See the API docs for their meanings.

Remark: Colors can be controlled globally by replacing tabensemb.utils.global_palette. For some methods, clr can be given to control colors. One can also pass kwargs through arguments like bar_kwargs to override default behaviors, including colors.

Remark: The argument category is used to distinguish different data points in the plot. The argument is available for many of the methods introduced in this part.

[3]:

trainer.plot_fill_rating(figure_kwargs=dict(figsize=(4,3), dpi=150), category="origin")

../../_images/examples_get_started_plotting_5_0.png

[3]:

<Axes: xlabel='Fill rating', ylabel='Density'>

Trainer.plot_presence_ratio will show the filling ratio of features. There are missing entries only in the horsepower feature.

[4]:

trainer.plot_presence_ratio()

../../_images/examples_get_started_plotting_7_0.png

[4]:

<Axes: xlabel='Data presence ratio'>
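
The two quantities plotted above can be illustrated with a small sketch. It assumes that the fill rating of a data point is the fraction of its features that are present, and that the presence ratio of a feature is the fraction of data points in which it is present. The toy table and the helper names fill_rating and presence_ratio are hypothetical and are not part of tabensemb.

```python
# Toy table: None marks a missing entry (an assumed stand-in for np.nan).
rows = [
    {"horsepower": 130.0, "weight": 3504.0, "acceleration": 12.0},
    {"horsepower": None, "weight": 2046.0, "acceleration": 19.0},
    {"horsepower": 165.0, "weight": 3693.0, "acceleration": 11.5},
    {"horsepower": None, "weight": 2875.0, "acceleration": 17.0},
]
features = ["horsepower", "weight", "acceleration"]


def fill_rating(row):
    """Fraction of features that are present in one data point."""
    return sum(row[f] is not None for f in features) / len(features)


def presence_ratio(feature):
    """Fraction of data points in which one feature is present."""
    return sum(r[feature] is not None for r in rows) / len(rows)


fill_ratings = [fill_rating(r) for r in rows]       # one value per row
presence = {f: presence_ratio(f) for f in features}  # one value per column
print(fill_ratings)
print(presence)
```

In this toy table only horsepower has missing entries, so it is the only feature with a presence ratio below 1.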

Trainer.plot_feature_box will show the box plot of features.

[5]:

trainer.plot_feature_box(imputed=True, figure_kwargs=dict(figsize=(4, 4)))

../../_images/examples_get_started_plotting_9_0.png

[5]:

<Axes: xlabel='Values (Scaled)'>

Trainer.plot_pairplot shows the pairwise relationships between features. It uses seaborn.pairplot, so the plot can be customized through the pairplot_kwargs argument. See the seaborn.pairplot documentation for possible arguments.

[6]:

trainer.plot_pairplot(pairplot_kwargs=dict(height=1))

/home/xlluo/anaconda3/envs/tabular_ensemble/lib/python3.10/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
/home/xlluo/hdd/tabular_ensemble/tabensemb/trainer/trainer.py:3848: UserWarning: The figure layout has changed to tight
  plt.tight_layout()

../../_images/examples_get_started_plotting_11_1.png

[6]:

<seaborn.axisgrid.PairGrid at 0x7f61ac713820>

Trainer.plot_corr shows the correlation coefficient between each pair of features. It uses the Pearson correlation by default.

[7]:

trainer.plot_corr(imputed=True, figure_kwargs=dict(figsize=(5, 5)), imshow_kwargs=dict(cmap="RdBu"))

../../_images/examples_get_started_plotting_13_0.png

[7]:

<Axes: >
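
For reference, the Pearson correlation coefficient can be computed by hand as in the following minimal sketch. The function pearson and the numbers are illustrative assumptions; they are not taken from the Auto MPG dataset or from tabensemb.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only.
weight = [2000.0, 2500.0, 3000.0, 3500.0, 4000.0]
mpg = [34.0, 29.0, 24.0, 19.0, 14.0]  # decreases linearly with weight
displacement = [100.0, 160.0, 210.0, 300.0, 340.0]

r_weight_mpg = pearson(weight, mpg)            # perfectly anti-correlated
r_weight_disp = pearson(weight, displacement)  # strongly correlated
print(r_weight_mpg, r_weight_disp)
```

A coefficient close to -1 or 1 indicates a strong linear relationship, while a value close to 0 indicates no linear relationship.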

Trainer.plot_corr_with_label focuses on the correlation between the target and each feature.

[8]:

trainer.plot_corr_with_label(imputed=True, figure_kwargs={"figsize": (6, 5)}, order="descending")

../../_images/examples_get_started_plotting_15_0.png

[8]:

<Axes: xlabel='Correlation with mpg'>

Trainer.plot_pca_2d_visual will show a scatter plot where the x and y axes are the first and second principal components from PCA, respectively.

[9]:

trainer.plot_pca_2d_visual(features=trainer.cont_feature_names)

../../_images/examples_get_started_plotting_17_0.png

[9]:

<Axes: xlabel='1st principal component', ylabel='2nd principal component'>

Trainer.plot_scatter plots one feature against another. The following example plots horsepower against weight for the data points whose model_year is in the range \([70, 74]\).

Remark: select_by_value_kwargs is the argument of DataModule.select_by_value which returns indices of datapoints that have specified values in the specified columns.

[10]:

trainer.plot_scatter(x_col="weight", y_col="horsepower", scatter_kwargs={"color": "#635380", "s": 15, "marker": "s"}, select_by_value_kwargs={"selection": {"model_year": (70, 74)}}, kde_color=True)

../../_images/examples_get_started_plotting_19_0.png

[10]:

<Axes: xlabel='weight', ylabel='horsepower'>
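
The kind of selection described in the remark above can be sketched as follows. This is not the implementation of DataModule.select_by_value: the helper select_indices_in_range is hypothetical, and treating the tuple (70, 74) as the closed interval \([70, 74]\) is an assumption based on the text above.

```python
def select_indices_in_range(column, lower, upper):
    """Return indices whose value lies in the closed interval [lower, upper]."""
    return [
        i for i, v in enumerate(column)
        if v is not None and lower <= v <= upper
    ]


# Illustrative model_year values.
model_year = [70, 72, 74, 75, 78, 80, 82]

early = select_indices_in_range(model_year, 70, 74)
late = select_indices_in_range(model_year, 75, 80)
print(early, late)
```

The returned indices can then be used to pick the corresponding rows of any other column, such as weight and horsepower.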

Trainer.plot_on_one_axes can plot multiple items (like scatters) on one matplotlib.axes.Axes instance. The following example calls Trainer.plot_scatter twice: once for the data points whose model_year is in the range \([70, 74]\), and once for those whose model_year is in the range \([75, 80]\).

Remark: Most methods in this part have the argument ax which should be a matplotlib.axes.Axes instance. If it is given, no new figure will be created and the item will be plotted on the ax directly. This is how we implement Trainer.plot_on_one_axes.

[11]:

trainer.plot_on_one_axes(
    meth_name="plot_scatter",
    meth_kwargs_ls=[
        dict(scatter_kwargs={"color": "#635380", "s": 15, "marker": "s", "label": "[70, 74]"}, select_by_value_kwargs={"selection": {"model_year": (70, 74)}}),
        dict(scatter_kwargs={"color": "#ACE4AA", "s": 15, "marker": "v", "label": "[75, 80]"}, select_by_value_kwargs={"selection": {"model_year": (75, 80)}})
    ],
    meth_fix_kwargs=dict(
        x_col="weight",
        y_col="horsepower",
        imputed=False,
    ),
    xlabel="weight",
    ylabel="horsepower",
    legend=True,
)

../../_images/examples_get_started_plotting_21_0.png

[11]:

<Axes: xlabel='weight', ylabel='horsepower'>

Trainer.plot_hist_all shows the histogram for each feature. If the argument category is given, histograms of each unique value of the category column will be plotted separately and stacked together.

Remark: All methods with the postfix _all introduced in this part call the corresponding method without the postfix. For example, Trainer.plot_hist_all calls Trainer.plot_hist multiple times.

[12]:

_ = trainer.plot_hist_all(imputed=False, kde=True, hist_kwargs={"bins": 5}, category="origin", legend_kwargs={"loc": "upper right"})

../../_images/examples_get_started_plotting_23_0.png

Trainer.plot_kde_all shows 1d KDE results for each feature.

[13]:

_ = trainer.plot_kde_all(imputed=False)

../../_images/examples_get_started_plotting_25_0.png
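
As a reminder of what a 1d KDE computes, here is a minimal sketch of a Gaussian kernel density estimate. The fixed bandwidth of 1.5 and the sample values are assumptions for illustration; plotting libraries usually choose the bandwidth automatically, and this sketch does not reproduce how Trainer.plot_kde_all does it.

```python
import math


def gaussian_kde_1d(samples, bandwidth):
    """Return a density function: an average of Gaussian bumps on the samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Illustrative samples of a single feature.
samples = [8.0, 9.5, 11.0, 12.0, 12.5, 14.0, 15.5, 16.0, 19.0]
density = gaussian_kde_1d(samples, bandwidth=1.5)

# Riemann-sum check that the estimated density integrates to about 1.
step = 0.05
grid = [step * i for i in range(601)]  # covers 0 to 30
area = sum(density(x) for x in grid) * step
print(area)
```

A smaller bandwidth follows the samples more closely, while a larger one gives a smoother curve.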

Trainer.plot_kde can also plot a bivariate Gaussian kernel density estimate.

[14]:

trainer.plot_kde(x_col="displacement", y_col="cylinders", kdeplot_kwargs=dict(fill=True, thresh=0, levels=100))

../../_images/examples_get_started_plotting_27_0.png

[14]:

<Axes: xlabel='displacement', ylabel='cylinders'>

Trainer.plot_truth_pred will show the comparison between ground truth values and predictions.

[15]:

trainer.plot_truth_pred(program="PytorchTabular", model_name="Category Embedding", log_trans=False, legend_kwargs=dict(fontsize="x-small"), figure_kwargs=dict(figsize=(4, 4)))

Training MSE Loss: 11.2517, R2: 0.8254
Validation MSE Loss: 12.3672, R2: 0.7791
Testing MSE Loss: 7.8380, R2: 0.8542

../../_images/examples_get_started_plotting_29_1.png

[15]:

<Axes: xlabel='Ground truth', ylabel='Prediction'>
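
The two metrics printed above can be reproduced by hand on toy values. This sketch assumes that R2 is the usual coefficient of determination; the functions mse and r2 and the numbers are illustrative and are not tabensemb's own metric code.

```python
def mse(truth, pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth)


def r2(truth, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - mean_t) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


# Illustrative ground truth and predictions.
truth = [18.0, 24.0, 30.0, 36.0]
pred = [20.0, 23.0, 29.0, 37.0]

print(mse(truth, pred), r2(truth, pred))
```

In the truth-prediction plot, perfect predictions lie on the diagonal, which corresponds to an MSE of 0 and an R2 of 1.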

Trainer.plot_loss shows the curves of training loss and/or validation loss. It is supported for PytorchTabular, WideDeep or any TorchModel-based models.

[16]:

trainer.plot_loss(program="PytorchTabular", model_name="Category Embedding", train_val="both", restored_epoch_mark_if_last=True)

../../_images/examples_get_started_plotting_31_0.png

[16]:

<Axes: xlabel='Epoch', ylabel='MSE loss'>

Trainer.plot_feature_importance calculates and plots feature importance. Two methods are supported: permutation and shap. Permutation feature importance is the degradation of the metric when a feature is permuted (shuffled). SHAP is a game-theoretic approach based on Shapley values. The two methods may give different results.

For PyTorch-based models, we use captum and shap.DeepExplainer for faster calculations.

[17]:

trainer.plot_feature_importance(program="PytorchTabular", model_name="Category Embedding", figure_kwargs=dict(figsize=(5, 4)))

../../_images/examples_get_started_plotting_33_0.png

[17]:

<Axes: xlabel='Permutation feature importance'>

[18]:

import warnings

with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    trainer.plot_feature_importance(program="CatEmbed", model_name="Category Embedding", method="shap", figure_kwargs=dict(figsize=(5, 4)))

Feature importance less than 1e-5: ['Unscaled-0', 'Unscaled-1', 'Unscaled-2', 'Unscaled-3', 'Unscaled-4', 'Unscaled-5', 'Unscaled-6']

../../_images/examples_get_started_plotting_34_1.png
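
The idea behind permutation feature importance can be sketched in plain Python. This is an illustration under stated assumptions, not the captum-based calculation used by tabensemb: the toy model, the data, and the helper permutation_importance are hypothetical, and importance is measured here as the increase in MSE after shuffling one feature.

```python
import random


def model(row):
    """Toy predictor: depends strongly on feature 0, weakly on 1, not on 2."""
    return 3.0 * row[0] + 0.5 * row[1]


def mse(X, y):
    return sum((model(r) - t) ** 2 for r, t in zip(X, y)) / len(y)


def permutation_importance(X, y, n_repeats=10, seed=0):
    """Average increase in MSE when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(X, y)
    importance = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, column)]
            increases.append(mse(X_perm, y) - baseline)
        importance.append(sum(increases) / n_repeats)
    return importance


rng = random.Random(42)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
y = [model(r) for r in X]

importance = permutation_importance(X, y)
print(importance)
```

Shuffling the feature that the model ignores leaves the error unchanged, so its importance is exactly zero, while the feature with the largest coefficient gets the largest importance.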

Trainer.plot_partial_dependence_all shows how predictions depend on each feature. It resamples the dataset by bootstrap sampling, fits the model on each resampled dataset, and records the predictions obtained when a sequence of values is assigned to the feature.

[19]:

with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    trainer.plot_partial_dependence_all(program="PytorchTabular", model_name="Category Embedding", n_bootstrap=3, grid_size=10, log_trans=False, upper_lim=9, lower_lim=2, CI=0.95)

Calculate PDP:  displacement
Calculate PDP:  cylinders
Calculate PDP:  horsepower
Calculate PDP:  weight
Calculate PDP:  acceleration
Calculate PDP:  model_year
Calculate PDP:  origin

../../_images/examples_get_started_plotting_36_1.png
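
The procedure described above can be sketched with a toy model. Everything in this block is an assumption made for illustration: a 1-nearest-neighbour regressor stands in for the real model, the partial dependence is averaged over the rows of each bootstrap sample, and the confidence band is taken from order statistics of the bootstrap curves. It does not reproduce the internals of Trainer.plot_partial_dependence_all.

```python
import random


def fit_1nn(X, y):
    """'Fit' a 1-nearest-neighbour regressor by memorising the data."""
    def predict(row):
        nearest = min(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], row)),
        )
        return y[nearest]
    return predict


def partial_dependence(X, y, feature, grid, n_bootstrap=20, ci=0.95, seed=0):
    rng = random.Random(seed)
    curves = []
    for _ in range(n_bootstrap):
        # Bootstrap: resample rows with replacement, then refit the model.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        predict = fit_1nn(Xb, yb)
        curve = []
        for g in grid:
            # Assign the grid value to the feature for every row and average.
            preds = [predict(r[:feature] + [g] + r[feature + 1:]) for r in Xb]
            curve.append(sum(preds) / len(preds))
        curves.append(curve)
    lo_idx = int((1.0 - ci) / 2.0 * (n_bootstrap - 1))
    hi_idx = int((1.0 + ci) / 2.0 * (n_bootstrap - 1))
    mean, lower, upper = [], [], []
    for k in range(len(grid)):
        values = sorted(c[k] for c in curves)
        mean.append(sum(values) / n_bootstrap)
        lower.append(values[lo_idx])
        upper.append(values[hi_idx])
    return mean, lower, upper


rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(60)]
y = [10.0 * a + b for a, b in X]  # target rises steeply with feature 0
grid = [0.0, 0.25, 0.5, 0.75, 1.0]

mean, lower, upper = partial_dependence(X, y, feature=0, grid=grid)
print(mean)
```

Because the toy target grows with feature 0, the averaged curve rises along the grid, and the spread between lower and upper reflects the variation across bootstrap samples.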

Trainer.plot_err_hist shows the distribution of errors. If the argument category is given, one can try to analyse the relationship between categories and errors.

[20]:

trainer.plot_err_hist(program="PytorchTabular", model_name="Category Embedding", metric="mse", category="origin", legend_kwargs={"title": "origin"})

../../_images/examples_get_started_plotting_38_0.png

[20]:

<Axes: xlabel='MSE', ylabel='Density'>
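
To relate errors to categories numerically, one can group per-sample errors by the category column, as in this minimal sketch. The values are illustrative, and using the squared error per sample is an assumption about what metric="mse" measures for a single data point.

```python
from collections import defaultdict

# Illustrative ground truth, predictions, and category labels.
truth = [18.0, 24.0, 30.0, 36.0, 22.0, 28.0]
pred = [20.0, 23.0, 29.0, 37.0, 26.0, 28.0]
origin = [1, 1, 2, 2, 3, 3]

# Collect the squared error of each sample under its category.
errors_by_category = defaultdict(list)
for t, p, c in zip(truth, pred, origin):
    errors_by_category[c].append((t - p) ** 2)

mean_error = {c: sum(e) / len(e) for c, e in errors_by_category.items()}
print(mean_error)
```

A category whose mean error is clearly larger than the others is a candidate for closer inspection.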

Trainer.plot_partial_err_all shows the distribution of absolute error with respect to feature values. If the density of high error predictions is high in a certain range of a certain feature, data augmentation or additional experiments might be required.

[21]:

_ = trainer.plot_partial_err_all(program="PytorchTabular", model_name="Category Embedding")

../../_images/examples_get_started_plotting_40_0.png