tabensemb.model.AbstractModel

class tabensemb.model.AbstractModel(trainer: Trainer, program: str | None = None, model_subset: List[str] | None = None, exclude_models: List[str] | None = None, store_in_harddisk: bool = True, optimizers: Dict[str, Tuple] | None = None, lr_schedulers: Dict[str, Tuple] | None = None, **kwargs)

Bases: object

The base class for all model bases.

Attributes:

exclude_models: The names of models that should not be trained.
init_params: Arguments passed to __init__(). See save_kwargs().
limit_batch_size: If batch_size // len(training set) < limit_batch_size, the batch_size is forced to be len(training set) to avoid potential numerical issues. For Tabnet, this is extremely important because a small batch may cause NaNs and, in turn, a CUDA device-side assert in the sparsemax function. Set to -1 to turn off this check (NOT RECOMMENDED!!). Note: Setting drop_last=True for torch.utils.data.DataLoader is also fine, but the check is kept because (i) having access to all data points in one epoch is beneficial for some models, (ii) with a large dataset and a large batch_size, the last batch may be large enough to contain essential information, and (iii) the user should have full control over this. If you want to use drop_last in your code, use the original_batch_size in kwargs passed to AbstractModel methods.
train_losses: The training loss during training of each model.
val_losses: The validation loss during training of each model.
restored_epochs: The best epoch of each model, from which the model is restored after training.
model: A dictionary of models.
model_params: Hyperparameters that contain all keys in _initial_values() for each model. In cross validation runs, the parameters in the previous run will be loaded for the current run.
model_subset: The names of models selected to be trained in the model base.
program: The name of the model base.
root: The place where all files of the model base are stored.
store_in_harddisk: Whether to save models on the hard disk.
trainer: A tabensemb.trainer.Trainer instance.
optimizers: A dictionary of optimizer names (choose from those in torch.optim) and their hyperparameters for each model. Remember to change _initial_values() and _space() to optimize its hyperparameters.
lr_schedulers: A dictionary of lr scheduler names (choose from those in torch.optim.lr_scheduler) and their hyperparameters for each model. Remember to change _initial_values() and _space() to optimize its hyperparameters.
device: The device set in the linked Trainer.
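
To make the concern behind limit_batch_size concrete, the sketch below computes the size of the last batch a data loader would yield when the training set is not a multiple of batch_size. It shows only the underlying arithmetic and is not the check that tabensemb performs; the helper name last_batch_size is hypothetical.

```python
def last_batch_size(n_train: int, batch_size: int) -> int:
    """Size of the final batch when iterating n_train samples without drop_last."""
    if n_train <= 0 or batch_size <= 0:
        raise ValueError("n_train and batch_size must be positive")
    remainder = n_train % batch_size
    # A zero remainder means the last batch is a full batch.
    return remainder or batch_size


# 1001 samples with batch_size=100 leave a trailing batch of a single sample,
# the kind of tiny batch the attribute description warns about.
print(last_batch_size(1001, 100))
# Using the whole training set as one batch removes the tiny trailing batch.
print(last_batch_size(1001, 1001))
```

Forcing batch_size to len(training set), as described for limit_batch_size, corresponds to the second call: every epoch then sees all data points in one batch.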

Methods

__init__(trainer: Trainer, program: str | None = None, model_subset: List[str] | None = None, exclude_models: List[str] | None = None, store_in_harddisk: bool = True, optimizers: Dict[str, Tuple] | None = None, lr_schedulers: Dict[str, Tuple] | None = None, **kwargs)

Parameters:

trainer: A Trainer instance that contains all information and datasets and will be linked to the model base. The trainer has loaded configs and data.
program: The name of the model base. If None, the name from _get_program_name() is used.
model_subset: The names of models selected to be trained in the model base.
exclude_models: The names of models that should not be trained. Only one of model_subset and exclude_models can be specified.
store_in_harddisk: Whether to save models on the hard disk. If the global setting tabensemb.setting["low_memory"] is True, True is used.
optimizers: A dictionary of optimizer names (choose from those in torch.optim) and their hyperparameters for each model. Remember to change _initial_values() and _space() to optimize its hyperparameters.
lr_schedulers: A dictionary of lr scheduler names (choose from those in torch.optim.lr_scheduler) and their hyperparameters for each model. Remember to change _initial_values() and _space() to optimize its hyperparameters.
**kwargs: Ignored.
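
The rule that only one of model_subset and exclude_models may be given can be illustrated with a small stand-alone function. This is a sketch of the documented behavior, not the library's implementation; the function resolve_model_names and the model names "A", "B", and "C" are hypothetical.

```python
from typing import List, Optional


def resolve_model_names(
    available: List[str],
    model_subset: Optional[List[str]] = None,
    exclude_models: Optional[List[str]] = None,
) -> List[str]:
    """Filter `available` by a subset or by an exclusion list, never both."""
    if model_subset is not None and exclude_models is not None:
        raise ValueError(
            "Only one of model_subset and exclude_models can be specified."
        )
    if model_subset is not None:
        # Keep the order of `available` and reject names that do not exist.
        unknown = [name for name in model_subset if name not in available]
        if unknown:
            raise ValueError(f"Unknown models: {unknown}")
        return [name for name in available if name in model_subset]
    if exclude_models is not None:
        return [name for name in available if name not in exclude_models]
    return list(available)


available = ["A", "B", "C"]  # hypothetical model names
print(resolve_model_names(available, model_subset=["C", "A"]))
print(resolve_model_names(available, exclude_models=["B"]))
```

Both calls select the same two models; passing both arguments at once raises a ValueError in this sketch.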

`cal_feature_importance`(model_name, method[, ...])	Calculate feature importance using a specified model.
`cal_shap`(model_name[, return_importance, ...])	Calculate SHAP values using a specified model.
`detach_model`(model_name[, program])	Detach the chosen model to a separate model base with the same linked `Trainer`.
`fit`(df, cont_feature_names, ...[, ...])	Fit all models using a tabular dataset.
`get_model_names`()	Get names of available models based on `_get_model_names()` and the arguments `model_subset` or `exclude_models` of `__init__()`.
`inspect_attr`(model_name, attributes[, df, ...])	Get attributes of the model after evaluating the model on training, validation, and testing sets respectively.
`new_model`(model_name, verbose, **kwargs)	A wrapper method to generate a new model while keeping the random seed constant.
`predict`(df, model_name[, model, ...])	Make inferences on a new dataset using the selected model.
`predict_proba`(*args, **kwargs)	Predict probabilities of each class.
`required_models`(model_name)	The names of models required by the requested model.
`reset`()	Reset the model base by calling __init__ with the recorded kwargs from `save_kwargs()`.
`save_kwargs`([d, ignore])	Save all args and kwargs of the caller except for those in `ignore`.
`set_path`(path)	Set the path of the model base (usually a trained one), including the paths of its models.
`train`(*args[, stderr_to_stdout])	Train the model base using the dataset in the linked `Trainer` directly.
`_bayes_eval`(model, X_train, y_train, X_val, ...)	Evaluate the model for Bayesian optimization iterations.
`_check_params`(model_name, **kwargs)	Check the validity of hyperparameters.
`_check_space`()	Check if all parameters defined in `_initial_values()` have corresponding search spaces defined in `_space()`.
`_check_train_status`()	Raise an exception if `_predict` is called and the model base is not trained.
`_conditional_validity`(model_name)	Check the validity of a model.
`_custom_training_params`(model_name)	Customized training settings to override settings in the configuration.
`_data_preprocess`(df, derived_data, model_name)	Perform the same preprocessing as in `_train_data_preprocess()` on a new dataset.
`_default_metric_sklearn`(y_true, y_pred)	Calculate MSE loss for regression tasks and log loss for classification tasks using sklearn APIs.
`_get_model_names`()	Get names of all available models implemented in the model base.
`_get_params`(model_name[, verbose])	Load default parameters or optimized parameters (if Bayesian optimization is performed) of the selected model.
`_get_program_name`()	Get the default name of the model base.
`_get_required_models`(model_name)	Extract models specified in `required_models()`.
`_initial_values`(model_name)	Initial values of hyperparameters to be optimized.
`_mkdir`()	Create a directory for the model base under the root of the linked `Trainer`.
`_new_model`(model_name, verbose, **kwargs)	Generate a new selected model based on kwargs.
`_pred_single_model`(model, X_test, verbose, ...)	Predict using the model trained in `_train_single_model()`.
`_predict`(df, model_name[, derived_data, model])	Make prediction based on a tabular dataset using the selected model.
`_predict_all`([verbose, test_data_only])	Make inferences on training/validation/testing datasets to evaluate the performance of all models.
`_predict_model`(model_name[, test_data_only])	Get predictions of a model on all partitions.
`_predict_model_on_partition`(model_name, ...)	Get predictions of a model on the selected partition.
`_space`(model_name)	A list of `scikit-optimize` search spaces for the selected model.
`_train`([model_subset, dump_trainer, ...])	The basic framework of training models, including processing the dataset, training each model (with/without Bayesian hyperparameter optimization), and evaluating them on the dataset.
`_train_data_preprocess`(model_name[, warm_start])	Process the data from `self.trainer.datamodule` for training.
`_train_single_model`(model, model_name, ...)	Train the model (initialized in `_new_model()`).
`_update_optimizer_lr_scheduler_params`(...)	Update parameters of the optimizer and the lr_scheduler according to the input hyperparameters when initializing a model.
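
The consistency requirement described for `_check_space()` (every hyperparameter in `_initial_values()` needs a search space in `_space()`) can be sketched without any dependency. The example assumes that spaces are matched to hyperparameters by name, as scikit-optimize dimensions carry a `name`; the class `Dim`, the function `missing_spaces`, and the hyperparameter values are hypothetical stand-ins, not part of tabensemb.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Dim:
    """Hypothetical stand-in for a named search dimension."""
    name: str
    low: float
    high: float


def missing_spaces(initial_values: Dict[str, Any], space: List[Dim]) -> List[str]:
    """Return hyperparameter names that have an initial value but no dimension."""
    covered = {dim.name for dim in space}
    return [name for name in initial_values if name not in covered]


# Illustrative values only; "batch_size" deliberately lacks a search dimension.
initial_values = {"lr": 1e-3, "weight_decay": 1e-9, "batch_size": 1024}
space = [Dim("lr", 1e-5, 1e-1), Dim("weight_decay", 1e-10, 1e-5)]
print(missing_spaces(initial_values, space))
```

A non-empty result is the situation such a check is meant to catch before Bayesian optimization starts.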