tabensemb.model.AbstractModel

class tabensemb.model.AbstractModel(trainer: Trainer, program: str | None = None, model_subset: List[str] | None = None, exclude_models: List[str] | None = None, store_in_harddisk: bool = True, optimizers: Dict[str, Tuple] | None = None, lr_schedulers: Dict[str, Tuple] | None = None, **kwargs)

Bases: object

The base class for all model bases.

Attributes:
exclude_models

The names of models that should not be trained.

init_params

Arguments passed to __init__(). See save_kwargs().

limit_batch_size

If batch_size // len(training set) < limit_batch_size, the batch_size is forced to be len(training set) to avoid potential numerical issues. This is especially important for TabNet, because a small batch may produce NaNs and then trigger a CUDA device-side assert in the sparsemax function. Set to -1 to turn off this check (NOT RECOMMENDED!!). Note: Setting drop_last=True for torch.utils.data.DataLoader is fine, but I think (i) having access to all data points in one epoch is beneficial for some models, (ii) with a large dataset and a large batch_size, the last batch may be large enough to contain essential information, and (iii) the user should have full control over this. If you want to use drop_last in your code, use the original_batch_size in kwargs passed to AbstractModel methods.

train_losses

The training loss during training of each model.

val_losses

The validation loss during training of each model.

restored_epochs

The best epoch from which the model is restored after training.

model

A dictionary of models.

model_params

Hyperparameters that contain all keys in _initial_values() for each model. In cross-validation runs, the parameters from the previous run are loaded for the current run.

model_subset

The names of models selected to be trained in the model base.

program

The name of the model base.

root

The place where all files of the model base are stored.

store_in_harddisk

Whether to save models on the hard disk.

trainer

A tabensemb.trainer.Trainer instance.

optimizers

A dictionary of optimizer names (choose from those in torch.optim) and their hyperparameters for each model. Remember to change _initial_values() and _space() to optimize its hyperparameters.

lr_schedulers

A dictionary of lr scheduler names (choose from those in torch.optim.lr_scheduler) and their hyperparameters for each model. Remember to change _initial_values() and _space() to optimize its hyperparameters.

device

The device set in the linked Trainer.
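
The optimizers and lr_schedulers attributes are typed as Dict[str, Tuple] in the constructor signature. A minimal sketch of one plausible layout follows, assuming each value is a (class name, hyperparameters) pair keyed by model name. That layout, the model name "MLP", and the helper check_spec are assumptions made for illustration and are not taken from the tabensemb source; "Adam" and "StepLR" are existing classes in torch.optim and torch.optim.lr_scheduler.

```python
from typing import Any, Dict, Tuple

# Assumed layout: {model_name: (class_name, hyperparameter_dict)}.
# "MLP" is a hypothetical model name used only for illustration.
optimizers: Dict[str, Tuple[str, Dict[str, Any]]] = {
    "MLP": ("Adam", {"lr": 1e-3, "weight_decay": 1e-9}),
}
lr_schedulers: Dict[str, Tuple[str, Dict[str, Any]]] = {
    "MLP": ("StepLR", {"step_size": 10, "gamma": 0.5}),
}


def check_spec(spec: Dict[str, Any]) -> bool:
    """Hypothetical helper: verify that every entry is a (str, dict) pair."""
    return all(
        isinstance(value, tuple)
        and len(value) == 2
        and isinstance(value[0], str)
        and isinstance(value[1], dict)
        for value in spec.values()
    )


print(check_spec(optimizers), check_spec(lr_schedulers))
```

If these hyperparameters should also be tuned, the attribute descriptions above note that _initial_values() and _space() need matching entries.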

Methods

__init__(trainer: Trainer, program: str | None = None, model_subset: List[str] | None = None, exclude_models: List[str] | None = None, store_in_harddisk: bool = True, optimizers: Dict[str, Tuple] | None = None, lr_schedulers: Dict[str, Tuple] | None = None, **kwargs)
Parameters:
trainer:

A Trainer instance that contains all information and datasets and will be linked to the model base. The trainer is expected to have already loaded its configs and data.

program:

The name of the model base. If None, the name from _get_program_name() is used.

model_subset:

The names of models selected to be trained in the model base.

exclude_models:

The names of models that should not be trained. Only one of model_subset and exclude_models can be specified.

store_in_harddisk:

Whether to save models on the hard disk. If the global setting tabensemb.setting["low_memory"] is True, True is used regardless of this argument.

optimizers:

A dictionary of optimizer names (choose from those in torch.optim) and their hyperparameters for each model. Remember to change _initial_values() and _space() to optimize its hyperparameters.

lr_schedulers:

A dictionary of lr scheduler names (choose from those in torch.optim.lr_scheduler) and their hyperparameters for each model. Remember to change _initial_values() and _space() to optimize its hyperparameters.

**kwargs:

Ignored.
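
The rule that only one of model_subset and exclude_models can be specified can be sketched as follows. This illustrates the documented behaviour rather than the library's own code: select_model_names, the choice of ValueError, and the model names are hypothetical.

```python
from typing import List, Optional


def select_model_names(
    available: List[str],
    model_subset: Optional[List[str]] = None,
    exclude_models: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper mirroring the documented argument rule."""
    if model_subset is not None and exclude_models is not None:
        # The docs state that only one of the two may be specified.
        raise ValueError("Specify only one of model_subset and exclude_models.")
    if model_subset is not None:
        # Keep the order of `available`, restricted to the requested subset.
        return [name for name in available if name in model_subset]
    if exclude_models is not None:
        # Drop the excluded names and keep everything else.
        return [name for name in available if name not in exclude_models]
    return list(available)


available = ["A", "B", "C"]  # hypothetical model names
print(select_model_names(available, model_subset=["A", "C"]))
print(select_model_names(available, exclude_models=["B"]))
```

Both calls select the same two models; passing both arguments at once raises an error in this sketch.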

cal_feature_importance(model_name, method[, ...])

Calculate feature importance using a specified model.

cal_shap(model_name[, return_importance, ...])

Calculate SHAP values using a specified model.

detach_model(model_name[, program])

Detach the chosen model to a separate model base with the same linked Trainer.

fit(df, cont_feature_names, ...[, ...])

Fit all models using a tabular dataset.

get_model_names()

Get names of available models based on _get_model_names() and the arguments model_subset or exclude_models of __init__().

inspect_attr(model_name, attributes[, df, ...])

Get attributes of the model after evaluating the model on training, validation, and testing sets respectively.

new_model(model_name, verbose, **kwargs)

A wrapper method to generate a new model while keeping the random seed constant.

predict(df, model_name[, model, ...])

Make inferences on a new dataset using the selected model.

predict_proba(*args, **kwargs)

Predict probabilities of each class.

required_models(model_name)

The names of models required by the requested model.

reset()

Reset the model base by calling __init__ with the recorded kwargs from save_kwargs().

save_kwargs([d, ignore])

Save all args and kwargs of the caller except for those in ignore.

set_path(path)

Set the path of the model base (usually a trained one), including the paths of its models.

train(*args[, stderr_to_stdout])

Train the model base using the dataset in the linked Trainer directly.

_bayes_eval(model, X_train, y_train, X_val, ...)

Evaluate the model for Bayesian optimization iterations.

_check_params(model_name, **kwargs)

Check the validity of hyperparameters.

_check_space()

Check if all parameters defined in _initial_values() have corresponding search spaces defined in _space().

_check_train_status()

Raise an exception if _predict() is called before the model base is trained.

_conditional_validity(model_name)

Check the validity of a model.

_custom_training_params(model_name)

Customized training settings to override settings in the configuration.

_data_preprocess(df, derived_data, model_name)

Perform the same preprocessing as in _train_data_preprocess() on a new dataset.

_default_metric_sklearn(y_true, y_pred)

Calculate MSE loss for regression tasks and log loss for classification tasks using sklearn APIs.

_get_model_names()

Get names of all available models implemented in the model base.

_get_params(model_name[, verbose])

Load default parameters or optimized parameters (if Bayesian optimization is performed) of the selected model.

_get_program_name()

Get the default name of the model base.

_get_required_models(model_name)

Extract models specified in required_models().

_initial_values(model_name)

Initial values of hyperparameters to be optimized.

_mkdir()

Create a directory for the model base under the root of the linked Trainer.

_new_model(model_name, verbose, **kwargs)

Generate a new selected model based on kwargs.

_pred_single_model(model, X_test, verbose, ...)

Predict using the model trained in _train_single_model().

_predict(df, model_name[, derived_data, model])

Make prediction based on a tabular dataset using the selected model.

_predict_all([verbose, test_data_only])

Make inferences on training/validation/testing datasets to evaluate the performance of all models.

_predict_model(model_name[, test_data_only])

Get predictions of a model on all partitions.

_predict_model_on_partition(model_name, ...)

Get predictions of a model on the selected partition.

_space(model_name)

A list of scikit-optimize search spaces for the selected model.

_train([model_subset, dump_trainer, ...])

The basic framework for training models, including processing the dataset, training each model (with or without Bayesian hyperparameter optimization), and evaluating them on the dataset.

_train_data_preprocess(model_name[, warm_start])

Process the data from self.trainer.datamodule for training.

_train_single_model(model, model_name, ...)

Train the model (initialized in _new_model()).

_update_optimizer_lr_scheduler_params(...)

Update parameters of the optimizer and the lr_scheduler according to the input hyperparameters when initializing a model.
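
The consistency requirement behind _check_space() (every key in _initial_values() needs a search space in _space()) can be sketched without the library. The Dimension stand-in below only mimics the name attribute that scikit-optimize dimensions carry; it, missing_spaces, and the hyperparameter names are hypothetical, and the real method may differ, for example in how it reports a mismatch.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Dimension:
    """Stand-in for a scikit-optimize dimension; only `name` matters here."""
    name: str
    low: float
    high: float


def missing_spaces(initial_values: Dict[str, Any], space: List[Dimension]) -> List[str]:
    """Return hyperparameter names that have an initial value but no search space."""
    covered = {dim.name for dim in space}
    return [key for key in initial_values if key not in covered]


# Hypothetical hyperparameters: "weight_decay" has no matching dimension.
initial_values = {"lr": 1e-3, "weight_decay": 1e-9, "batch_size": 64}
space = [Dimension("lr", 1e-4, 1e-1), Dimension("batch_size", 32, 256)]
print(missing_spaces(initial_values, space))
```

A subclass that adds a hyperparameter to _initial_values() would extend _space() in the same step so that a check of this kind finds nothing missing.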