tabensemb.data.datamodule.DataModule
- class tabensemb.data.datamodule.DataModule(config: UserConfig | Dict, initialize: bool = True)
Bases: object
The dataset manager that provides loading, processing, and analyzing utilities.
- Attributes:
- args
A tabensemb.config.UserConfig instance.
- augmented_indices
Indices of data points that are augmented by tabensemb.data.AbstractAugmenter. The first index starts from the number of data points in the original dataset.
- cat_feature_mapping
Original values of categorical features before ordinal encoding. The index of a value represents the encoded value.
- cat_feature_names
Names of categorical features.
- cont_feature_names
Names of continuous features.
- data_path
The path to the data file.
- dataderivers
A list of tabensemb.data.AbstractDeriver.
- dataimputer
- dataprocessors
A list of tabensemb.data.AbstractProcessor.
- datasplitter
- derived_data
The derived unstacked data calculated using dataderivers whose argument “stacked” is set to False.
- df
The unscaled processed dataset. It is already ordinal-encoded if a tabensemb.data.dataprocessor.CategoricalOrdinalEncoder is used.
- dropped_indices
Indices of data points that are removed from the original dataset.
- label_name
The name(s) of the target(s).
- label_ordinal_encoder
A sklearn.preprocessing.OrdinalEncoder that encodes the classification targets.
- loss
The type of the loss function. See _infer_loss().
- n_classes
The number of unique values for each classification target.
- retained_indices
Indices of data points that are retained in the original dataset.
- scaled_df
The scaled processed dataset. See also df.
- task
The type of the task. See _infer_task().
- tensors
feature_data(), expanded derived_data, and label_data() in the torch.Tensor form.
- test_dataset
The testing set of the entire torch.utils.data.Dataset.
- test_indices
Indices of the testing set in the entire dataset (df).
- train_dataset
The training set of the entire torch.utils.data.Dataset.
- train_indices
Indices of the training set in the entire dataset (df).
- training
The training status of the DataModule. See set_status().
- unstacked_col_names
Names of columns of each derived unstacked feature.
- val_dataset
The validation set of the entire torch.utils.data.Dataset.
- val_indices
Indices of the validation set in the entire dataset (df).
- D_test
The derived unstacked data of the testing dataset.
- D_train
The derived unstacked data of the training dataset.
- D_val
The derived unstacked data of the validation dataset.
- X_test
The unscaled testing dataset.
- X_train
The unscaled training dataset.
- X_val
The unscaled validation dataset.
- all_feature_names
Get continuous feature names and categorical feature names after load_data().
- cat_imputed_mask
A byte mask for categorical data, where 1 means the data is imputed, and 0 means the data originally exists.
- cat_num_unique
The number of unique values of each categorical feature.
- categorical_data
The categorical data.
- cont_imputed_mask
A byte mask for continuous data, where 1 means the data is imputed, and 0 means the data originally exists.
- derived_stacked_features
Find derived features in all_feature_names derived by data derivers whose argument “stacked” is set to True.
- feature_data
The scaled feature data.
- label_data
The scaled label data.
- unscaled_feature_data
The unscaled feature data.
- unscaled_label_data
The unscaled label data.
- y_test
The target of the testing dataset.
- y_train
The target of the training dataset.
- y_val
The target of the validation dataset.
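The index conventions stated in the attribute list can be made concrete with a small sketch. This is plain Python, not tabensemb code: the helper names (build_mapping, encode, decode, imputed_mask) and the sample data are hypothetical, and sorting the categories is an assumption made only for the illustration. It shows four statements from the list: a value's index in a mapping like cat_feature_mapping is its encoded value, retained indices are those not dropped, augmented indices start at the number of original data points, and an imputed mask holds 1 where a value was missing.

```python
# Illustrative stand-ins only; none of these helpers are part of tabensemb.

def build_mapping(values):
    """Unique values in sorted order (assumed order); list index = ordinal code."""
    return sorted(set(values))

def encode(values, mapping):
    """Replace each category by its index in the mapping."""
    return [mapping.index(v) for v in values]

def decode(codes, mapping):
    """Inverse of encode: look each code up in the mapping."""
    return [mapping[c] for c in codes]

def imputed_mask(column):
    """1 where the entry is missing (would be imputed), 0 where it exists."""
    return [1 if v is None else 0 for v in column]

# Categorical feature: the index of a value is its encoded value.
materials = ["steel", "aluminum", "steel", "titanium"]
mapping = build_mapping(materials)
codes = encode(materials, mapping)
restored = decode(codes, mapping)

# Index bookkeeping for a hypothetical dataset of 5 original points.
n_original = 5
dropped_indices = [1, 3]
retained_indices = [i for i in range(n_original) if i not in dropped_indices]

# Two augmented points are numbered after the original ones.
n_augmented = 2
augmented_indices = list(range(n_original, n_original + n_augmented))

# Continuous column with one missing entry.
mask = imputed_mask([2.0, None, 3.5])
```

Keeping the mapping as a list makes decoding a plain index lookup, which is why the index of a value can double as its encoded value.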
Methods
- __init__(config: UserConfig | Dict, initialize: bool = True)
cal_corr([method, imputed, features_only, ...]) Calculate Pearson correlation coefficients among continuous features.
Inverse transformation of CategoricalOrdinalEncoder for categorical features (if there is one in self.dataprocessors).
Transformation of CategoricalOrdinalEncoder for categorical features (if there is one in self.dataprocessors).
data_transform(input_data, **kwargs) Transform the input tabular dataset using fitted data processors.
derive(df) Derive both stacked and unstacked features using the input dataframe.
derive_stacked(df) Derive stacked features using the input dataframe.
derive_unstacked(df[, categorical_only]) Derive unstacked features using the input dataframe.
describe([imputed, scaled]) Describe the dataset using pd.DataFrame.describe, skewness, Gini index, mode values, etc.
Get continuous feature data, categorical feature data, and label data respectively from the input DataFrame.
Find derived features in the input list derived by data derivers whose argument "stacked" is set to True, i.e. the stacked data.
Get original categorical features specified in the configuration.
Get original continuous features specified in the configuration.
A dictionary stating the category of each feature, including derived stacked features.
generate_subset(dataset) Split the dataset into training, validation and testing subsets.
generate_tensors(scaled_df, derived_data) Generate PyTorch tensors.
get_additional_tensors_slice(indices) Get slices of tensors of derived unstacked data.
Get all derived stacked features (not intermediate) from arguments of dataderivers.
Get all derived unstacked features from unstacked_col_names.
get_base_predictor([categorical]) Get a sklearn RandomForestRegressor for basic uses such as pre-processing.
Find and return the OrdinalEncoder in data processors.
Get dimensions of derived unstacked features.
get_derived_data_slice(derived_data, indices) Get slices of the derived unstacked data.
get_df(imputed, scaled, cat_transformed) Get the entire dataframe with certain processing steps.
get_feature_idx_by_type(typ[, var_type]) Find features (by their index) of the specified type defined by feature_types in the configuration.
Find features of the specified type defined by feature_types in the configuration.
get_feature_types(features[, allow_unknown]) Get the type defined in feature_types in the configuration for each feature.
get_feature_types_idx(features[, allow_unknown]) For each feature, get the index in unique_feature_types of its type defined in feature_types in the configuration.
get_first_tensor_slice(indices) Get a slice of the tensor of continuous features.
Get the tabular data without imputation.
get_tabular_dataset([transformed]) Get the tabular dataset loaded in the DataModule.
get_var_change(feature_name, value) See how data processors act on a feature if its value is value.
get_zero_slip(feature_name) See how data processors act on a feature if its value is zero.
Inverse transformation of categorical ordinal encoding for the target.
Categorical ordinal encoding for the target.
load_data([data_path, save_path]) Load tabular data.
pca([feature_names, feature_idx, indices]) Perform sklearn.decomposition.PCA.
prepare_new_data(df[, derived_data, ...]) Prepare the new tabular dataset for predictions using _predict(). Stacked and unstacked features are derived; missing values are imputed; the transform method of AbstractProcessor is called.
save_data(path) Save the tabular data processed by set_data().
select_by_value([selection, df, partition, ...]) Select data points with the given value(s) in the given column(s).
set_data(df, cont_feature_names, ...[, ...]) Set up the datamodule with a DataFrame.
set_data_derivers(config) Set a list of data derivers containing the name and arguments for each data deriver.
set_data_imputer(config) Set the data imputer.
set_data_processors(config) Set a list of data processors containing the name and arguments for each data processor.
set_data_splitter(config[, ratio]) Set the data splitter.
set_feature_names(all_feature_names) Set feature names to a subset of current features (i.e. self.all_feature_names) and reload the data.
set_status(training) Set the status of the datamodule.
sort_derived_data(derived_data[, ignore_absence]) Sort the dict of derived unstacked data according to the order of derivation.
Unique values in feature_types_with_derived().
Update PyTorch tensors and datasets.
_data_preprocess(input_data[, warm_start, ...]) Call data processors to fit and/or transform the input tabular dataset.
_data_process([warm_start, skip_selector, ...]) The main procedure to process data after splitting and imputation.
_get_gini(tabular) Get the Gini index for each feature in the tabular dataset.
_get_indices([partition]) Get training/validation/testing indices.
_get_kurtosis(tabular) Get the kurtosis for each feature in the tabular dataset.
_get_mode(tabular) Get the mode value for each feature in the tabular dataset.
_infer_loss(task) Automatically infer the loss type using the name of the task and the loss type given in the configuration.
Automatically infer the task type using the target values.
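
For readers unfamiliar with the quantity that cal_corr is described as computing, the following minimal sketch builds a Pearson correlation matrix among a few continuous columns in pure Python. It is not the library's implementation and does not call tabensemb; the function names pearson and corr_matrix and the sample columns are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)

def corr_matrix(columns):
    """Nested dict: corr[a][b] is the Pearson coefficient of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical continuous features: width rises with length, load falls with it.
features = {
    "length": [1.0, 2.0, 3.0, 4.0],
    "width": [2.0, 4.0, 6.0, 8.0],
    "load": [4.0, 3.0, 2.0, 1.0],
}
corr = corr_matrix(features)
```

In this toy data the length/width coefficient is close to 1 and the length/load coefficient is close to -1, up to floating-point rounding. Constant columns would cause a division by zero in this sketch and would need separate handling.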