tabensemb.data.datamodule.DataModule

class tabensemb.data.datamodule.DataModule(config: UserConfig | Dict, initialize: bool = True)

Bases: object

The dataset manager that provides loading, processing, and analyzing utilities.

Attributes:
args

A tabensemb.config.UserConfig instance.

augmented_indices

Indices of data points that are augmented by tabensemb.data.AbstractAugmenter. The first index starts from the number of data points in the original dataset.
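The numbering convention for augmented_indices can be illustrated with a small sketch in plain Python. The dataset sizes below are made up for illustration, and the code does not call tabensemb.

```python
# Assumed sizes, chosen only for illustration.
n_original = 5    # data points in the original dataset (indices 0..4)
n_augmented = 3   # data points appended by an augmenter

# Augmented points are numbered right after the original ones,
# so the first augmented index equals n_original.
augmented_indices = list(range(n_original, n_original + n_augmented))
print(augmented_indices)
```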

cat_feature_mapping

Original values of categorical features before ordinal encoding. The index of a value represents the encoded value.
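The index-equals-code convention of cat_feature_mapping might look like the sketch below. It assumes the mapping is keyed by feature name; the feature `material`, its values, and the helpers `encode` and `decode` are hypothetical and written by hand rather than taken from tabensemb.

```python
# Hypothetical mapping for one categorical feature: the position in the
# list is the encoded integer, the entry is the original value.
cat_feature_mapping = {"material": ["aluminium", "steel", "titanium"]}

def encode(feature, value):
    # Ordinal code = index of the original value in the mapping.
    return cat_feature_mapping[feature].index(value)

def decode(feature, code):
    # Original value = entry stored at the encoded index.
    return cat_feature_mapping[feature][code]

encoded = [encode("material", v) for v in ["steel", "aluminium", "steel"]]
decoded = [decode("material", c) for c in encoded]
print(encoded, decoded)
```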

cat_feature_names

Names of categorical features.

cont_feature_names

Names of continuous features.

data_path

The path to the data file.

dataderivers

A list of tabensemb.data.AbstractDeriver.

dataimputer

A tabensemb.data.AbstractImputer.

dataprocessors

A list of tabensemb.data.AbstractProcessor.

datasplitter

A tabensemb.data.AbstractSplitter.

derived_data

The derived unstacked data calculated using dataderivers whose argument “stacked” is set to False.

df

The unscaled processed dataset. It is already ordinal-encoded if a tabensemb.data.dataprocessor.CategoricalOrdinalEncoder is used.

dropped_indices

Indices of data points that are removed from the original dataset.

label_name

The name(s) of the target(s).

label_ordinal_encoder

A sklearn.preprocessing.OrdinalEncoder that encodes the classification targets.

loss

The type of the loss function. See _infer_loss().

n_classes

The number of unique values for each classification target.

retained_indices

Indices of data points that are retained in the original dataset.

scaled_df

The scaled processed dataset. See also df.

task

The type of the task. See _infer_task().

tensors

feature_data(), expanded derived_data, and label_data() in the torch.Tensor form.

test_dataset

The testing set of the entire torch.utils.data.Dataset.

test_indices

Indices of the testing set in the entire dataset (df).

train_dataset

The training set of the entire torch.utils.data.Dataset.

train_indices

Indices of the training set in the entire dataset (df).

training

The training status of the DataModule. See set_status().

unstacked_col_names

Names of columns of each derived unstacked feature.

val_dataset

The validation set of the entire torch.utils.data.Dataset.

val_indices

Indices of the validation set in the entire dataset (df).
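As a rough picture of how train_indices, val_indices, and test_indices relate to the entire dataset, the sketch below partitions row positions with a seeded shuffle. The 60/20/20 ratio and the shuffle are assumptions for illustration; in tabensemb the actual split is decided by the configured datasplitter.

```python
import random

n_samples = 10
indices = list(range(n_samples))      # row positions in the whole dataset
random.Random(0).shuffle(indices)     # seeded so the split is reproducible

n_train, n_val = 6, 2                 # assumed 60/20/20 split
train_indices = sorted(indices[:n_train])
val_indices = sorted(indices[n_train:n_train + n_val])
test_indices = sorted(indices[n_train + n_val:])

# The three index sets are disjoint and together cover every row.
covered = set(train_indices) | set(val_indices) | set(test_indices)
print(len(train_indices), len(val_indices), len(test_indices))
```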

D_test

The derived unstacked data of the testing dataset.

D_train

The derived unstacked data of the training dataset.

D_val

The derived unstacked data of the validation dataset.

X_test

The unscaled testing dataset.

X_train

The unscaled training dataset.

X_val

The unscaled validation dataset.

all_feature_names

Get continuous feature names and categorical feature names after load_data().

cat_imputed_mask

A byte mask for categorical data, where 1 means the data is imputed, and 0 means the data originally exists.

cat_num_unique

The number of unique values of each categorical feature.

categorical_data

The categorical data.

cont_imputed_mask

A byte mask for continuous data, where 1 means the data is imputed, and 0 means the data originally exists.
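The meaning of cat_imputed_mask and cont_imputed_mask can be sketched on a single column. The example assumes missing entries are represented by None and filled with the column mean; the imputer actually used depends on the configured dataimputer.

```python
from statistics import mean

column = [1.0, None, 3.0, None, 5.0]   # None marks a missing entry (assumed)

# 1 where the value had to be imputed, 0 where it originally existed.
mask = [1 if v is None else 0 for v in column]

# Mean imputation, used here only as a stand-in for a real imputer.
fill = mean([v for v in column if v is not None])
imputed = [fill if v is None else v for v in column]
print(mask, imputed)
```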

derived_stacked_features

Find derived features in all_feature_names derived by data derivers whose argument “stacked” is set to True, i.e. the stacked data.

feature_data

The scaled feature data.

label_data

The scaled label data.

unscaled_feature_data

The unscaled feature data.

unscaled_label_data

The unscaled label data.

y_test

The target of the testing dataset.

y_train

The target of the training dataset.

y_val

The target of the validation dataset.

Methods

__init__(config: UserConfig | Dict, initialize: bool = True)

cal_corr([method, imputed, features_only, ...])

Calculate Pearson correlation coefficients among continuous features.

categories_inverse_transform(X)

Inverse transformation of CategoricalOrdinalEncoder for categorical features (if there is one in self.dataprocessors).

categories_transform(X)

Transformation of CategoricalOrdinalEncoder for categorical features (if there is one in self.dataprocessors).

data_transform(input_data, **kwargs)

Transform the input tabular dataset using fitted data processors.

dataset_dict()

derive(df)

Derive both stacked and unstacked features using the input dataframe.

derive_stacked(df)

Derive stacked features using the input dataframe.

derive_unstacked(df[, categorical_only])

Derive unstacked features using the input dataframe.

describe([imputed, scaled])

Describe the dataset using pd.DataFrame.describe, skewness, Gini index, mode values, etc.

divide_from_tabular_dataset(data)

Get continuous feature data, categorical feature data, and label data respectively from the input DataFrame.

extract_derived_stacked_feature_names(...)

Find derived features in the input list derived by data derivers whose argument "stacked" is set to True, i.e. the stacked data.

extract_original_cat_feature_names(...)

Get original categorical features specified in the configuration.

extract_original_cont_feature_names(...)

Get original continuous features specified in the configuration.

feature_types_with_derived()

A dictionary stating the category of each feature, including derived stacked features.

generate_subset(dataset)

Split the dataset into training, validation and testing subsets.

generate_tensors(scaled_df, derived_data)

Generate PyTorch tensors.

get_additional_tensors_slice(indices)

Get slices of tensors of derived unstacked data.

get_all_derived_stacked_feature_names()

Get all derived stacked features (not intermediate) from arguments of dataderivers.

get_all_derived_unstacked_feature_names()

Get all derived unstacked features from unstacked_col_names.

get_base_predictor([categorical])

Get a sklearn RandomForestRegressor for basic uses such as pre-processing.

get_categorical_ordinal_encoder()

Find and return the OrdinalEncoder in data processors.

get_derived_data_sizes()

Get dimensions of derived unstacked features.

get_derived_data_slice(derived_data, indices)

Get slices of the derived unstacked data.
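Slicing derived unstacked data by row indices, as get_derived_data_slice does, can be pictured as follows. The sketch assumes the derived data is a dict from feature name to a row-wise array, represented here by nested lists; the keys `signal` and `group` and the helper `slice_derived` are hypothetical.

```python
# Hypothetical derived unstacked data: one row per data point for each key.
derived_data = {
    "signal": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
    "group": [[0], [1], [0], [1]],
}

def slice_derived(derived_data, indices):
    # Keep the same keys, but only the rows at the requested positions.
    return {
        name: [rows[i] for i in indices]
        for name, rows in derived_data.items()
    }

subset = slice_derived(derived_data, [0, 2])
print(subset)
```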

get_df(imputed, scaled, cat_transformed)

Get the entire dataframe with certain processing steps.

get_feature_idx_by_type(typ[, var_type])

Find features (by their index) of the specified type defined by feature_types in the configuration.

get_feature_names_by_type(typ)

Find features of the specified type defined by feature_types in the configuration.

get_feature_types(features[, allow_unknown])

Get the type defined in feature_types in the configuration for each feature.

get_feature_types_idx(features[, allow_unknown])

For each feature, get the index in unique_feature_types of its type defined in feature_types in the configuration.
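The lookups behind get_feature_idx_by_type, get_feature_names_by_type, and get_feature_types_idx can be sketched in plain Python. The sketch assumes feature_types maps each feature name to one type and that unique_feature_types lists the types in order of first appearance; the feature names, type names, and helper functions are hypothetical.

```python
# Hypothetical configuration fragment: every feature is assigned one type.
feature_types = {
    "thickness": "Geometry",
    "width": "Geometry",
    "modulus": "Material",
    "load": "Loading",
}
# Assumption: unique types are kept in order of first appearance.
unique_feature_types = list(dict.fromkeys(feature_types.values()))
all_feature_names = list(feature_types)

def names_by_type(typ):
    # Names of all features assigned to the given type.
    return [name for name, t in feature_types.items() if t == typ]

def idx_by_type(typ):
    # Positions of those features within all_feature_names.
    return [i for i, name in enumerate(all_feature_names)
            if feature_types[name] == typ]

def types_idx(features):
    # For each feature, the position of its type in unique_feature_types.
    return [unique_feature_types.index(feature_types[name]) for name in features]

geometry_names = names_by_type("Geometry")
geometry_idx = idx_by_type("Geometry")
type_positions = types_idx(["load", "thickness", "modulus"])
print(geometry_names, geometry_idx, type_positions)
```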

get_first_tensor_slice(indices)

Get a slice of the tensor of continuous features.

get_not_imputed_df()

Get the tabular data without imputation.

get_tabular_dataset([transformed])

Get the tabular dataset loaded in the DataModule.

get_var_change(feature_name, value)

See how data processors act on a feature if its value is value.

get_zero_slip(feature_name)

See how data processors act on a feature if its value is zero.
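To make the idea behind get_var_change and get_zero_slip concrete, the sketch below pushes a single value through a chain of fitted transformations. It assumes the only processor is a standard scaler fitted on made-up training values; `standard_scale`, `processors`, and `var_change` are hypothetical names, not tabensemb functions.

```python
from statistics import mean, pstdev

train_values = [2.0, 4.0, 6.0, 8.0]        # made-up training column
mu, sigma = mean(train_values), pstdev(train_values)

def standard_scale(x):
    # Standardization with statistics fitted on the training column.
    return (x - mu) / sigma

processors = [standard_scale]              # hypothetical processor chain

def var_change(value):
    # Apply every fitted processor in order to one raw value.
    for process in processors:
        value = process(value)
    return value

# Where a raw value of zero ends up after processing.
zero_slip = var_change(0.0)
print(zero_slip)
```

With these numbers the mean is 5 and the population standard deviation is the square root of 5, so a raw zero maps to the negative square root of 5 rather than staying at zero.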

label_categories_inverse_transform(X)

Inverse transformation of categorical ordinal encoding for the target.

label_categories_transform(X)

Categorical ordinal encoding for the target.

load_data([data_path, save_path])

Load tabular data.

pca([feature_names, feature_idx, indices])

Perform sklearn.decomposition.PCA.

prepare_new_data(df[, derived_data, ...])

Prepare the new tabular dataset for predictions using _predict(). Stacked and unstacked features are derived, missing values are imputed, and the transform method of each AbstractProcessor is called.

save_data(path)

Save the tabular data processed by set_data().

select_by_value([selection, df, partition, ...])

Select data points with the given value(s) in the given column(s).

set_data(df, cont_feature_names, ...[, ...])

Set up the datamodule with a DataFrame.

set_data_derivers(config)

Set a list of data derivers containing the name and arguments for each data deriver.

set_data_imputer(config)

Set the data imputer.

set_data_processors(config)

Set a list of data processors containing the name and arguments for each data processor.

set_data_splitter(config[, ratio])

Set the data splitter.

set_feature_names(all_feature_names)

Set feature names to a subset of current features (i.e. self.all_feature_names) and reload the data.

set_status(training)

Set the status of the datamodule.

sort_derived_data(derived_data[, ignore_absence])

Sort the dict of derived unstacked data according to the order of derivation.

unique_feature_types_with_derived()

Unique values in feature_types_with_derived().

update_dataset()

Update PyTorch tensors and datasets.

_data_preprocess(input_data[, warm_start, ...])

Call data processors to fit and/or transform the input tabular dataset.

_data_process([warm_start, skip_selector, ...])

The main procedure to process data after splitting and imputation.

_get_gini(tabular)

Get the Gini index for each feature in the tabular dataset.

_get_indices([partition])

Get training/validation/testing indices.

_get_kurtosis(tabular)

Get the kurtosis for each feature in the tabular dataset.

_get_mode(tabular)

Get the mode value for each feature in the tabular dataset.

_infer_loss(task)

Automatically infer the loss type using the name of the task and the loss type given in the configuration.

_infer_task()

Automatically infer the task type using the target values.