tabensemb.data.datamodule.DataModule

class tabensemb.data.datamodule.DataModule(config: UserConfig | Dict, initialize: bool = True)

Bases: object

The dataset manager that provides loading, processing, and analyzing utilities.

Attributes:

args: A tabensemb.config.UserConfig instance.
augmented_indices: Indices of data points that are augmented by tabensemb.data.AbstractAugmenter. The first index starts from the number of data points in the original dataset.
cat_feature_mapping: Original values of categorical features before ordinal encoding. The index of a value represents the encoded value.
cat_feature_names: Names of categorical features.
cont_feature_names: Names of continuous features.
data_path: The path to the data file.
dataderivers: A list of tabensemb.data.AbstractDeriver.
dataimputer: A tabensemb.data.AbstractImputer.
dataprocessors: A list of tabensemb.data.AbstractProcessor.
datasplitter: A tabensemb.data.AbstractSplitter.
derived_data: The derived unstacked data calculated using dataderivers whose argument “stacked” is set to False.
df: The unscaled processed dataset. It is already ordinal-encoded if a tabensemb.data.dataprocessor.CategoricalOrdinalEncoder is used.
dropped_indices: Indices of data points that are removed from the original dataset.
label_name: The name(s) of the target(s).
label_ordinal_encoder: A sklearn.preprocessing.OrdinalEncoder that encodes the classification targets.
loss: The type of the loss function. See _infer_loss().
n_classes: The number of unique values for each classification target.
retained_indices: Indices of data points that are retained in the original dataset.
scaled_df: The scaled processed dataset. See also df.
task: The type of the task. See _infer_task().
tensors: feature_data(), expanded derived_data, and label_data() in the torch.Tensor form.
test_dataset: The testing set of the entire torch.utils.data.Dataset.
test_indices: Indices of the testing set in the entire dataset (df).
train_dataset: The training set of the entire torch.utils.data.Dataset.
train_indices: Indices of the training set in the entire dataset (df).
training: The training status of the DataModule. See set_status().
unstacked_col_names: Names of columns of each derived unstacked feature.
val_dataset: The validation set of the entire torch.utils.data.Dataset.
val_indices: Indices of the validation set in the entire dataset (df).
D_test: The derived unstacked data of the testing dataset.
D_train: The derived unstacked data of the training dataset.
D_val: The derived unstacked data of the validation dataset.
X_test: The unscaled testing dataset.
X_train: The unscaled training dataset.
X_val: The unscaled validation dataset.
all_feature_names: Get continuous feature names and categorical feature names after load_data().
cat_imputed_mask: A byte mask for categorical data, where 1 means the data is imputed, and 0 means the data originally exists.
cat_num_unique: The number of unique values of each categorical feature.
categorical_data: The categorical data.
cont_imputed_mask: A byte mask for continuous data, where 1 means the data is imputed, and 0 means the data originally exists.
derived_stacked_features: Find derived features in all_feature_names derived by data derivers whose argument “stacked” is set to True, i.e. the stacked data.
feature_data: The scaled feature data.
label_data: The scaled label data.
unscaled_feature_data: The unscaled feature data.
unscaled_label_data: The unscaled label data.
y_test: The target of the testing dataset.
y_train: The target of the training dataset.
y_val: The target of the validation dataset.
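
Several of the attributes above follow indexing conventions that are easier to see with concrete values. The following is a minimal plain-Python sketch of three of them: `cat_feature_mapping` (the index of a value is its encoded value), `augmented_indices` (numbering starts at the size of the original dataset), and the imputed masks (1 means imputed, 0 means originally present). It does not call tabensemb; the feature name `material`, its values, and the helper functions `encode` and `decode` are hypothetical and exist only for illustration.

```python
# Illustrative stand-ins only; none of this is tabensemb code.

# cat_feature_mapping convention: the position of a value in the list
# is its ordinal code. The feature "material" is a hypothetical example.
cat_feature_mapping = {"material": ["aluminium", "steel", "titanium"]}


def encode(feature, value):
    """Return the ordinal code of `value`, i.e. its index in the mapping."""
    return cat_feature_mapping[feature].index(value)


def decode(feature, code):
    """Return the original value stored at position `code`."""
    return cat_feature_mapping[feature][code]


# augmented_indices convention: augmented points are numbered after the
# original ones, so the first augmented index equals the original size.
n_original = 5
n_augmented = 2
augmented_indices = list(range(n_original, n_original + n_augmented))

# Imputed-mask convention: 1 marks an imputed entry, 0 an original one.
raw_column = [1.5, None, 3.0, None, 2.5]
cont_imputed_mask = [1 if value is None else 0 for value in raw_column]

print(encode("material", "steel"))  # 1
print(decode("material", 2))        # titanium
print(augmented_indices)            # [5, 6]
print(cont_imputed_mask)            # [0, 1, 0, 1, 0]
```

In the real DataModule these structures are produced by the configured processors, augmenters, and imputer; the sketch only mirrors the conventions stated in the attribute descriptions.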

Methods

__init__(config: UserConfig | Dict, initialize: bool = True)

`cal_corr`([method, imputed, features_only, ...])	Calculate Pearson correlation coefficients among continuous features.
`categories_inverse_transform`(X)	Inverse transformation of `CategoricalOrdinalEncoder` for categorical features (if there is one in `self.dataprocessors`).
`categories_transform`(X)	Transformation of `CategoricalOrdinalEncoder` for categorical features (if there is one in `self.dataprocessors`).
`data_transform`(input_data, **kwargs)	Transform the input tabular dataset using fitted data processors.
`dataset_dict`()
`derive`(df)	Derive both stacked and unstacked features using the input dataframe.
`derive_stacked`(df)	Derive stacked features using the input dataframe.
`derive_unstacked`(df[, categorical_only])	Derive unstacked features using the input dataframe.
`describe`([imputed, scaled])	Describe the dataset using `pd.DataFrame.describe`, skewness, gini index, mode values, etc.
`divide_from_tabular_dataset`(data)	Get continuous feature data, categorical feature data, and label data respectively from the input DataFrame.
`extract_derived_stacked_feature_names`(...)	Find derived features in the input list derived by data derivers whose argument "stacked" is set to True, i.e. the stacked data.
`extract_original_cat_feature_names`(...)	Get original categorical features specified in the configuration.
`extract_original_cont_feature_names`(...)	Get original continuous features specified in the configuration.
`feature_types_with_derived`()	A dictionary stating the category of each feature, including derived stacked features.
`generate_subset`(dataset)	Split the dataset into training, validation and testing subsets.
`generate_tensors`(scaled_df, derived_data)	Generate PyTorch tensors.
`get_additional_tensors_slice`(indices)	Get slices of tensors of derived unstacked data.
`get_all_derived_stacked_feature_names`()	Get all derived stacked features (not intermediate) from arguments of `dataderivers`.
`get_all_derived_unstacked_feature_names`()	Get all derived unstacked features from `unstacked_col_names`.
`get_base_predictor`([categorical])	Get a sklearn `RandomForestRegressor` for fundamental usages like pre-processing.
`get_categorical_ordinal_encoder`()	Find and return the `OrdinalEncoder` in data processors.
`get_derived_data_sizes`()	Get dimensions of derived unstacked features.
`get_derived_data_slice`(derived_data, indices)	Get slices of the derived unstacked data.
`get_df`(imputed, scaled, cat_transformed)	Get the entire dataframe with certain processing steps.
`get_feature_idx_by_type`(typ[, var_type])	Find features (by their index) of the specified type defined by `feature_types` in the configuration.
`get_feature_names_by_type`(typ)	Find features of the specified type defined by `feature_types` in the configuration.
`get_feature_types`(features[, allow_unknown])	Get the type defined in `feature_types` in the configuration for each feature.
`get_feature_types_idx`(features[, allow_unknown])	For each feature, get the index in `unique_feature_types` of its type defined in `feature_types` in the configuration.
`get_first_tensor_slice`(indices)	Get a slice of the tensor of continuous features.
`get_not_imputed_df`()	Get the tabular data without imputation.
`get_tabular_dataset`([transformed])	Get the tabular dataset loaded in the DataModule.
`get_var_change`(feature_name, value)	See how data processors act on a feature if its value is `value`.
`get_zero_slip`(feature_name)	See how data processors act on a feature if its value is zero.
`label_categories_inverse_transform`(X)	Inverse transformation of categorical ordinal encoding for the target.
`label_categories_transform`(X)	Categorical ordinal encoding for the target.
`load_data`([data_path, save_path])	Load tabular data.
`pca`([feature_names, feature_idx, indices])	Perform `sklearn.decomposition.PCA`.
`prepare_new_data`(df[, derived_data, ...])	Prepare the new tabular dataset for predictions using `_predict()`. Stacked and unstacked features are derived, missing values are imputed, and the `transform` method of `AbstractProcessor` is called.
`save_data`(path)	Save the tabular data processed by `set_data()`.
`select_by_value`([selection, df, partition, ...])	Select data points with the given value(s) in the given column(s).
`set_data`(df, cont_feature_names, ...[, ...])	Set up the datamodule with a DataFrame.
`set_data_derivers`(config)	Set a list of data derivers containing the name and arguments for each data deriver.
`set_data_imputer`(config)	Set the data imputer.
`set_data_processors`(config)	Set a list of data processors containing the name and arguments for each data processor.
`set_data_splitter`(config[, ratio])	Set the data splitter.
`set_feature_names`(all_feature_names)	Set feature names to a subset of current features (i.e. `self.all_feature_names`) and reload the data.
`set_status`(training)	Set the status of the datamodule.
`sort_derived_data`(derived_data[, ignore_absence])	Sort the dict of derived unstacked data according to the order of derivation.
`unique_feature_types_with_derived`()	Unique values in `feature_types_with_derived()`.
`update_dataset`()	Update PyTorch tensors and datasets.
`_data_preprocess`(input_data[, warm_start, ...])	Call data processors to fit and/or transform the input tabular dataset.
`_data_process`([warm_start, skip_selector, ...])	The main procedure to process data after splitting and imputation.
`_get_gini`(tabular)	Get the gini index for each feature in the tabular dataset.
`_get_indices`([partition])	Get training/validation/testing indices.
`_get_kurtosis`(tabular)	Get the kurtosis for each feature in the tabular dataset.
`_get_mode`(tabular)	Get the mode value for each feature in the tabular dataset.
`_infer_loss`(task)	Automatically infer the loss type using the name of the task and the loss type given in the configuration.
`_infer_task`()	Automatically infer the task type using the target values.
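
As background for `cal_corr`, the sketch below computes a Pearson correlation matrix among a few continuous features in plain Python. It is not the library's implementation and does not model the `method`, `imputed`, or `features_only` arguments; the feature names and values, and the helper `pearson`, are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (variances up to the same factor).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical continuous features; "height" rises with "width",
# while "depth" falls as "width" rises.
features = {
    "width": [1.0, 2.0, 3.0, 4.0],
    "height": [2.0, 4.0, 6.0, 8.0],
    "depth": [4.0, 3.0, 2.0, 1.0],
}

names = list(features)
# Square matrix of pairwise coefficients, ordered like `names`.
corr = [[pearson(features[a], features[b]) for b in names] for a in names]

for name, row in zip(names, corr):
    print(name, [round(value, 3) for value in row])
```

Each diagonal entry is 1 because a feature is perfectly correlated with itself, and the matrix is symmetric, so only one triangle carries independent information.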