Multimodal data: Image classification as an example

We take the MNIST image classification task as an example of loading multimodal data. This tutorial is intended for readers who have gone through all parts of “Get Started” and the advanced parts of “Advanced Usage”, including “New data derivers” and “Customized model base”.

Although the package supports multimodal data, multimodal models are not yet integrated into it, which is why this tutorial sits in “Advanced Usage”. pytorch_widedeep (WideDeep in this package) and autogluon (AutoGluon in this package) support some multimodal models. If you would like to develop multimodal models or add such support to model bases, contributions on GitHub are welcome.

[1]:
import tabensemb
import torch
import os
from tempfile import TemporaryDirectory

temp_path = TemporaryDirectory()
tabensemb.setting["default_output_path"] = os.path.join(temp_path.name, "output")
tabensemb.setting["default_config_path"] = os.path.join(temp_path.name, "configs")
tabensemb.setting["default_data_path"] = os.path.join(temp_path.name, "data")

device = "cuda" if torch.cuda.is_available() else "cpu"

The following code is adapted from an official PyTorch example; it defines the network and the image transformation, and downloads the dataset.

Remark: For compatibility with the framework, Net returns logits rather than the log_softmax-transformed values returned in the official example. This requirement is emphasized in “Customized model base”.

[2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        return x

# Convert PIL images to tensors, then standardize with the MNIST mean and std.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

dataset1 = datasets.MNIST(os.path.join(temp_path.name, "data"), train=True, download=True, transform=transform)
dataset2 = datasets.MNIST(os.path.join(temp_path.name, "data"), train=False, transform=transform)
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /tmp/tmpcdcd59k_/data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting /tmp/tmpcdcd59k_/data/MNIST/raw/train-images-idx3-ubyte.gz to /tmp/tmpcdcd59k_/data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to /tmp/tmpcdcd59k_/data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting /tmp/tmpcdcd59k_/data/MNIST/raw/train-labels-idx1-ubyte.gz to /tmp/tmpcdcd59k_/data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to /tmp/tmpcdcd59k_/data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting /tmp/tmpcdcd59k_/data/MNIST/raw/t10k-images-idx3-ubyte.gz to /tmp/tmpcdcd59k_/data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to /tmp/tmpcdcd59k_/data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting /tmp/tmpcdcd59k_/data/MNIST/raw/t10k-labels-idx1-ubyte.gz to /tmp/tmpcdcd59k_/data/MNIST/raw
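
The input size 9216 of fc1 in Net can be checked by hand from the layer definitions above. The sketch below applies the usual output-size formula for convolution and pooling layers without dilation; the helper conv_out is a hypothetical name introduced only for this illustration and is not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                                   # MNIST images are 28 x 28
size = conv_out(size, kernel=3)             # conv1: 28 -> 26
size = conv_out(size, kernel=3)             # conv2: 26 -> 24
size = conv_out(size, kernel=2, stride=2)   # max_pool2d(x, 2): 24 -> 12
flat_features = 64 * size * size            # conv2 has 64 output channels
print(flat_features)  # 9216
```

If the kernel sizes or the input resolution change, the first argument of fc1 has to change accordingly.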

In this tutorial, all images are loaded into memory.

[3]:
import numpy as np

train_images = []
train_targets = []
test_images = []
test_targets = []
for img, target in dataset1:
    train_images.append(img)
    train_targets.append(target)
for img, target in dataset2:
    test_images.append(img)
    test_targets.append(target)
images_array = torch.concat(train_images + test_images, dim=0).numpy()
targets_array = np.array(train_targets + test_targets)
images_array.shape, targets_array.shape
[3]:
((70000, 28, 28), (70000,))
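
The resulting shape has no channel axis because each image tensor has shape (1, 28, 28) and the concatenation runs along that leading axis. A minimal NumPy sketch with small stand-in arrays (the names train_like and test_like are made up for this illustration) shows the same effect and why test images start right after the training images:

```python
import numpy as np

# Stand-ins for the per-image tensors: each has shape (1, 28, 28).
train_like = [np.zeros((1, 28, 28), dtype=np.float32) for _ in range(6)]
test_like = [np.ones((1, 28, 28), dtype=np.float32) for _ in range(2)]

# Concatenating along axis 0 turns the singleton channel axes into a sample axis.
stacked = np.concatenate(train_like + test_like, axis=0)
print(stacked.shape)  # (8, 28, 28)

# Training images come first, so the test images start at index len(train_like).
first_test = stacked[len(train_like)]
```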

Under this framework, multimodal data is loaded through data derivers. To let a data deriver load the image for each data point, the tabular dataset needs a column (here named image_index) that indicates the location of the image. In our case, the location is the index of the image in images_array. In other cases, it might be a path to the image file on disk.

The MNIST dataset has a separate testing set (index>=60,000 in the images_array and targets_array defined above). We use it after training to evaluate performance.

[4]:
import pandas as pd

train_df = pd.DataFrame({"image_index": list(range(len(train_images))), "target": train_targets})
test_df = pd.DataFrame({"image_index": list(range(len(train_images), len(train_images) + len(test_images))), "target": test_targets})

The data deriver that loads images is very simple. Multimodal data is not part of the tabular data, so stacked=False is set. The tabular data df contains image indices that are used to extract images from images_array above. Here the deriver requires the user to pass an argument image_path naming the column that holds the image locations. This indirection is not strictly necessary: since we already know which column is needed, we could use "image_index" directly instead of self.kwargs["image_path"].

[5]:
from tabensemb.data import AbstractDeriver
from tabensemb.data.dataderiver import deriver_mapping

class MNISTLoader(AbstractDeriver):
    def _required_cols(self):
        return ["image_path"]

    def _defaults(self):
        return dict(stacked=False, derived_name="images", intermediate=False, is_continuous=False)

    def _derive(self, df, datamodule):
        images = images_array[df[self.kwargs["image_path"]]]
        print(f"Loaded images: {images.shape}")
        return images

deriver_mapping["MNISTLoader"] = MNISTLoader
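
The core of _derive is a single integer-array lookup. The following framework-free sketch reproduces only that lookup on toy data; toy_images, toy_table, and derive_images are hypothetical names, and a dict of arrays stands in for the DataFrame, so this is an illustration of the indexing rather than of the tabensemb API.

```python
import numpy as np

# Toy image store: 5 "images" of 2 x 2 pixels, image i filled with the value i.
toy_images = np.stack([np.full((2, 2), i, dtype=np.float32) for i in range(5)])

# Toy table with one column of image locations (a dict stands in for a DataFrame).
toy_table = {"image_index": np.array([3, 0, 4])}


def derive_images(table, column):
    """Hypothetical stand-in for _derive: fetch one image per row by integer index."""
    return toy_images[table[column]]


batch = derive_images(toy_table, "image_index")
print(batch.shape)     # (3, 2, 2)
print(batch[:, 0, 0])  # [3. 0. 4.]
```

The returned array has one image per row of the table, in row order, which is what lets the framework align images with the tabular samples.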

The network from the official example can be migrated to the framework easily. In the forward pass, the images loaded by the data deriver are available in derived_tensors under the key "images", as defined in _defaults above. The tensor has shape (n_samples, width, height), and we reshape it to (n_samples, n_channels, width, height) with n_channels=1 to match the input expected by Net.

[6]:
from tabensemb.model import TorchModel, AbstractNN

class NetNN(AbstractNN):
    def __init__(self, datamodule, **kwargs):
        super(NetNN, self).__init__(datamodule, **kwargs)
        self.net = Net()

    def _forward(self, x, derived_tensors):
        images = derived_tensors["images"].unsqueeze(1)
        return self.net(images)
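
The effect of unsqueeze(1) on the shape can be illustrated without PyTorch. In the sketch below, np.expand_dims is used as the NumPy analogue of unsqueeze; the batch size of 4 is arbitrary.

```python
import numpy as np

images = np.zeros((4, 28, 28), dtype=np.float32)  # (n_samples, width, height)

# np.expand_dims(..., axis=1) plays the role of tensor.unsqueeze(1).
with_channel = np.expand_dims(images, axis=1)
print(with_channel.shape)  # (4, 1, 28, 28)

# The new axis is a view on the same data: no pixel values are copied.
assert np.shares_memory(images, with_channel)
```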

The implementation of the model base is straightforward.

[7]:
class NetModel(TorchModel):
    def _initial_values(self, model_name):
        return self.trainer.chosen_params

    def _space(self, model_name):
        return self.trainer.SPACE

    def _new_model(self, model_name: str, verbose: bool, **kwargs):
        return NetNN(self.trainer.datamodule, **kwargs)

    def _get_program_name(self):
        return "NetModel"

    def _get_model_names(self):
        return ["Net"]

Then we configure the Trainer. Importantly, the MNISTLoader defined above is used to load images, and the argument image_path is given here.

[8]:
from tabensemb.config import UserConfig
from tabensemb.trainer import Trainer

cfg = UserConfig.from_dict({
    "database": "mnist",
    "label_name": ["target"],
    "task": "multiclass",
    "data_derivers": [("MNISTLoader", {"image_path": "image_index"})],
    "epoch": 100,
})
trainer = Trainer(device=device)
trainer.load_config(config=cfg)
The project will be saved to /tmp/tmpcdcd59k_/output/mnist/2023-09-18-10-50-06-0_UserInputConfig

Since MNIST has a separate testing set, only the 60,000 training images are used during the training stage: the first 50,000 for training and the last 10,000 for both validation and testing (which is why the validation and testing losses reported after training are identical). We use the DataModule.set_data API instead of load_data to configure the dataset with these indices, which skips the data splitter.

[9]:
train_indices = np.arange(50000)
val_indices = np.arange(50000, 60000)
test_indices = val_indices
trainer.datamodule.set_data(train_df, cont_feature_names=[], cat_feature_names=[], label_name=["target"], train_indices=train_indices, val_indices=val_indices, test_indices=test_indices)
Loaded images: (60000, 28, 28)
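
When split indices are written by hand, a quick sanity check can catch overlaps or gaps before training starts. The sketch below is plain NumPy and does not call tabensemb; n_rows is an illustrative variable set to the number of rows in train_df.

```python
import numpy as np

n_rows = 60000                       # rows in train_df
train_indices = np.arange(50000)
val_indices = np.arange(50000, 60000)
test_indices = val_indices           # the validation rows double as the test split here

# Training and validation rows must not overlap.
assert np.intersect1d(train_indices, val_indices).size == 0

# Together the splits cover every row of the table exactly once.
covered = np.union1d(train_indices, val_indices)
assert np.array_equal(covered, np.arange(n_rows))
```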

We can see that the images are loaded in DataModule.derived_data:

[10]:
trainer.datamodule.derived_data["images"].shape
[10]:
(60000, 28, 28)

Now train the model. The default loss function is the cross-entropy loss, as shown in the output.

[11]:
trainer.clear_modelbase()
trainer.add_modelbases([NetModel(trainer)])
trainer.train(stderr_to_stdout=True)

-------------Run NetModel-------------

Training Net
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                | Type             | Params
---------------------------------------------------------
0 | default_loss_fn     | CrossEntropyLoss | 0
1 | default_output_norm | Softmax          | 0
2 | net                 | Net              | 1.2 M
---------------------------------------------------------
1.2 M     Trainable params
0         Non-trainable params
1.2 M     Total params
4.800     Total estimated model params size (MB)
Epoch: 1/100, Train loss: 0.6124, Val loss: 0.1363, Min val loss: 0.1363, Min ES val loss: 0.1363, Epoch time: 1.463s.
Epoch: 20/100, Train loss: 0.0194, Val loss: 0.0381, Min val loss: 0.0372, Min ES val loss: 0.0372, Epoch time: 1.424s.
Epoch: 40/100, Train loss: 0.0119, Val loss: 0.0428, Min val loss: 0.0372, Min ES val loss: 0.0372, Epoch time: 1.339s.
Epoch: 60/100, Train loss: 0.0084, Val loss: 0.0490, Min val loss: 0.0372, Min ES val loss: 0.0372, Epoch time: 1.189s.
Epoch: 80/100, Train loss: 0.0073, Val loss: 0.0527, Min val loss: 0.0372, Min ES val loss: 0.0372, Epoch time: 1.075s.
Epoch: 100/100, Train loss: 0.0063, Val loss: 0.0533, Min val loss: 0.0372, Min ES val loss: 0.0372, Epoch time: 1.164s.
`Trainer.fit` stopped: `max_epochs=100` reached.
Training log_loss loss: 0.00392
Validation log_loss loss: 0.03531
Testing log_loss loss: 0.03531
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpcdcd59k_/output/mnist/2023-09-18-10-50-06-0_UserInputConfig/trainer.pkl')

-------------NetModel End-------------
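
The model summary above lists a CrossEntropyLoss and a Softmax output normalization, both of which operate on the logits returned by Net. As a rough illustration of what these two operations compute, here is a NumPy sketch of the standard definitions; the functions softmax and cross_entropy are written for this example and are not the framework's or PyTorch's code.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(logits, targets):
    """Mean negative log-probability of the true class, computed from logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = np.array([0, 2])

probs = softmax(logits)
loss = cross_entropy(logits, targets)
print(probs.sum(axis=1))  # each row sums to 1
print(loss)
```

Softmax preserves the ordering of the logits, so the predicted class is the same whether it is read from the logits or from the probabilities.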

It is easy to make inferences on the testing set for both predicted classes and probabilities. The data deriver again loads images from images_array.

[12]:
predictions = trainer.get_modelbase("NetModel").predict(test_df, model_name="Net")
proba = trainer.get_modelbase("NetModel").predict_proba(test_df, model_name="Net")
Loaded images: (10000, 28, 28)
Loaded images: (10000, 28, 28)

The prediction accuracy reaches around 99% on the testing set.

[13]:
from tabensemb.utils import auto_metric_sklearn

auto_metric_sklearn(targets_array[60000:], proba, "accuracy_score", "multiclass")
[13]:
0.9914
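
For intuition, multiclass accuracy can be computed directly from a probability matrix. The sketch below assumes that class labels are obtained by taking the argmax over the class axis; whether auto_metric_sklearn does exactly this internally is an assumption, and accuracy_from_proba is a hypothetical helper used only for this toy example.

```python
import numpy as np


def accuracy_from_proba(y_true, proba):
    """Fraction of samples whose most probable class equals the true label."""
    predicted = np.argmax(proba, axis=1)
    return float(np.mean(predicted == y_true))


toy_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])
toy_targets = np.array([0, 1, 2, 2])

print(accuracy_from_proba(toy_targets, toy_proba))  # 0.75
```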