Deep Glue API Reference

This section provides an API reference for the main functions in deepglue. In other words, this page directly renders the docstrings from the different modules in deepglue, and provides links to the functions themselves in the code base. It's how you dig into the weeds.

Note: Deep Glue is pre-alpha, a rapidly changing work in progress.

Training utilities

deepglue training_utils.py

Functions that are useful for training deep networks, including validation, testing, and metrics.

`accuracy(outputs, targets, topk=(1,))`

Computes the top-k accuracy for classifier predictions.

Calculates how often the true label is within the top-k predictions, for each value of k specified in topk.

Parameters:

Name	Type	Description	Default
`outputs`	`Tensor`	Model predictions of shape (num_samples, num_classes), where each row contains the logits or probabilities for each class.	required
`targets`	`Tensor`	The ground truth labels, of shape (num_samples,) or (num_samples, num_classes) if one-hot encoded.	required
`topk`	`tuple of int`	A tuple of integers specifying the values of k for which to compute the prediction accuracy. Defaults to (1,).	`(1,)`

Returns:

Type	Description
`list of torch.Tensor`	A list of accuracy values for each specified k in `topk`, expressed as percentages.

Notes

Adapted from torchvision's accuracy() function (release 0.19.1), which is licensed under the BSD-3 License.
Original implementation in pytorch/vision/references/classification/utils.py
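To make the definition concrete, the sketch below recomputes top-k accuracy in plain Python on a four-sample toy problem. It illustrates the metric only and is not deepglue's implementation; `topk_accuracy_reference` and the toy scores are hypothetical names and values introduced here.

```python
def topk_accuracy_reference(scores, targets, topk=(1,)):
    """Percentage of rows whose true label is among the k highest scores."""
    results = []
    for k in topk:
        hits = 0
        for row, target in zip(scores, targets):
            # Class indices sorted by descending score, truncated to the top k
            ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
            hits += target in ranked
        results.append(100.0 * hits / len(targets))
    return results


scores = [
    [0.1, 0.7, 0.2],  # top-1 is class 1, top-2 is {1, 2}
    [0.5, 0.3, 0.2],  # top-1 is class 0, top-2 is {0, 1}
    [0.2, 0.3, 0.5],  # top-1 is class 2, top-2 is {2, 1}
    [0.6, 0.1, 0.3],  # top-1 is class 0, top-2 is {0, 2}
]
targets = [1, 1, 0, 2]

top1, top2 = topk_accuracy_reference(scores, targets, topk=(1, 2))
print(top1, top2)  # 25.0 75.0
```

Only the first sample is correct at k=1, while the second and fourth become correct once the runner-up class is allowed, which is why the top-2 figure is higher.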

Source code in deepglue/training_utils.py

def accuracy(outputs, targets, topk=(1,)):
    """
    Computes the top-k accuracy for classifier predictions.

    Calculates how often the true label is within the top-k predictions,
    for each value of k specified in `topk`. 

    Parameters
    ----------
    outputs : torch.Tensor
        Model predictions of shape (num_samples, num_classes), where each row contains the 
        logits or probabilities for each class. 
    targets : torch.Tensor
        The ground truth labels, of shape (num_samples,) or (num_samples, num_classes) if one-hot encoded.
    topk : tuple of int, optional
        A tuple of integers specifying the values of k for which to compute the prediction accuracy.
        Defaults to (1,).

    Returns
    -------
    list of torch.Tensor
        A list of accuracy values for each specified k in `topk`, expressed as percentages.

    Notes
    -----
    - Adapted from torchvision's accuracy() function (release 0.19.1), which is licensed under the BSD-3 License.
    - Original implementation in pytorch/vision/references/classification/utils.py 
    """
    # logging.debug(f"Calculating topk accuracy with topk input {topk}")

    with torch.inference_mode():
        maxk = max(topk)
        batch_size = targets.size(0)
        # if targets are one-hot encoded
        if targets.ndim == 2:
            targets = targets.max(dim=1)[1]

        _, pred = outputs.topk(maxk, 1, True, True) # get the top k predictions
        pred = pred.t() # convert to maxk x batch_size, which is what the comparator wants
        correct = pred.eq(targets.unsqueeze(0))  # k x batches bool gives position of correct prediction (if any) for batch col

        topk_accuracy = []
        for k in topk:
            correct_k = correct[:k].flatten().sum(dtype=torch.float32) # sum all correct in first k rows
            proportion_correct = correct_k/batch_size
            topk_accuracy.append(100.0*proportion_correct)

        return topk_accuracy

`extract_features(dataloader, feature_extractor, layer, device='cuda')`

Extract features from a network layer using a data loader, feature extractor, and specified layer.

Parameters:

Name	Type	Description	Default
`dataloader`	`DataLoader`	DataLoader for the dataset (often configured without shuffling or dropping samples)	required
`feature_extractor`	`Module`	The feature extractor model, note this is typically created with torchvision's `feature_extraction.create_feature_extractor()` built-in.	required
`layer`	`str`	The name of the layer to extract features from. Must be present in the output of the feature extractor.	required
`device`	`str`	The device to use for feature extraction ('cuda' or 'cpu'). Defaults to 'cuda'.	`'cuda'`

Returns:

Name	Type	Description
`features`	`ndarray`	Extracted features of shape (num_images, num_flattened_features), where `num_flattened_features` depends on the layer output dimensions.
`labels`	`ndarray`	Corresponding ground-truth labels for each image, of shape (num_images,).

Raises:

Type	Description
`KeyError`	If the specified layer is not found in the output of the feature extractor.

Notes

For large datasets, ensure sufficient memory is available for concatenating feature arrays: they can grow extremely large for large network models.
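As a rough way to check the memory requirement before running extraction, the sketch below estimates the size of the returned feature array. The helper `estimate_feature_bytes`, the 4-bytes-per-value (float32) assumption, and the example layer shape are illustrative assumptions, not part of deepglue.

```python
def estimate_feature_bytes(num_images, num_flattened_features, bytes_per_value=4):
    """Bytes needed for a dense (num_images, num_flattened_features) array.

    bytes_per_value=4 assumes float32 features (an assumption, check your dtype).
    """
    return num_images * num_flattened_features * bytes_per_value


# Hypothetical case: 50,000 images, each producing a 2048 x 7 x 7 feature map
flattened = 2048 * 7 * 7  # values per image after flattening
total_bytes = estimate_feature_bytes(50_000, flattened)
total_gib = total_bytes / 2**30

print(f"{flattened} features/image -> {total_gib:.1f} GiB")
```

Peak usage may be higher than this estimate, since the per-batch arrays and the concatenated result exist at the same time during concatenation.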

TODO

Add optimizations for very large arrays (e.g., quantization, out-of-core computation with dask and xarray).

Source code in deepglue/training_utils.py

def extract_features(dataloader, feature_extractor, layer, device='cuda'):
    """
    Extract features from a network layer using a data loader, feature extractor, and specified layer.

    Parameters
    ----------
    dataloader : torch.utils.data.DataLoader
        DataLoader for the dataset (often configured without shuffling or dropping samples)
    feature_extractor : torch.nn.Module
        The feature extractor model, note this is typically created with
        torchvision's `feature_extraction.create_feature_extractor()` built-in. 
    layer : str
        The name of the layer to extract features from. Must be present in the output of the feature extractor.
    device : str, optional
        The device to use for feature extraction ('cuda' or 'cpu'). Defaults to 'cuda'.

    Returns
    -------
    features : numpy.ndarray
        Extracted features of shape (num_images, num_flattened_features), where `num_flattened_features`
        depends on the layer output dimensions.
    labels : numpy.ndarray
        Corresponding ground-truth labels for each image, of shape (num_images,).

    Raises
    ------
    KeyError
        If the specified layer is not found in the output of the feature extractor.

    Notes
    -----
    - For large datasets, ensure sufficient memory is available for concatenating feature arrays: they can 
      grow extremely large for large network models. 

    TODO
    ----
    - Add optimizations for very large arrays (e.g., quantization, out-of-core computation with dask and xarray).
    """

    logging.info(f"Feature extraction starting for layer '{layer}'. Setup can take a minute.")

    feature_extractor.to(device)

    # Initialize vars
    features = []  # To store flattened features for the specified layer
    labels = [] 

    # extract features batch by batch
    for batch_num, (batch_images, batch_labels) in tqdm(enumerate(dataloader),
                                                        desc="Extracting features",
                                                        total=len(dataloader)):
        batch_images = batch_images.to(device)
        with torch.no_grad():
            outputs = feature_extractor(batch_images)

            # Get the output for the specified layer
            if layer not in outputs:
                raise KeyError(f"Layer '{layer}' not found in the feature extractor outputs!")

            # Flatten the features for each image in the batch
            output = outputs[layer]
            flattened_features = output.reshape(output.size(0), -1)
            features.append(flattened_features.cpu().numpy())

            labels.extend(batch_labels.numpy())

    # Concatenate features across all batches
    features = np.concatenate(features, axis=0)  # Shape: [num_images, num_flattened_features]
    labels = np.array(labels)  # Shape: [num_images]

    logging.info(f"Feature extraction complete for layer '{layer}'.")

    return features, labels

`predict_all(model, data_loader, device='cuda')`

Make predictions for all batches of data in data loader.

Use the model to generate predictions for all batches from the provided data loader. It returns the predicted class labels, true labels, and class probabilities for each sample.

Parameters:

Name	Type	Description	Default
`model`	`Module`	Trained PyTorch model (e.g., ResNet50).	required
`data_loader`	`DataLoader`	An iterable that provides batches of input data and their corresponding labels.	required
`device`	`str`	The device ('cpu' or 'cuda') on which the model and data are placed. Defaults to 'cuda'.	`'cuda'`

Returns:

Name	Type	Description
`all_predictions`	`Tensor`	An array of predicted labels for each sample in the dataset, with shape (num_samples,)
`all_labels`	`Tensor`	An array of true labels for each sample in the dataset, with shape (num_samples,)
`all_probabilities`	`Tensor`	A 2D array of shape (num_samples, num_categories) containing the softmax-normalized probabilities for each category: each row represents the predicted probability distribution for a single sample.
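One way the returned values might be summarized is sketched below. Plain Python lists stand in for the returned tensors so the example runs without PyTorch, and all values are made up for illustration.

```python
from collections import Counter

# Stand-ins for the returned values (the function itself returns tensors)
all_predictions = [0, 2, 1, 1, 0]
all_labels = [0, 1, 1, 1, 2]

# Overall accuracy: share of positions where prediction equals label
num_correct = sum(p == t for p, t in zip(all_predictions, all_labels))
overall_accuracy = 100.0 * num_correct / len(all_labels)

# Tally each (true label, predicted label) pair that was wrong
errors = Counter((t, p) for p, t in zip(all_predictions, all_labels) if p != t)

print(overall_accuracy)  # 60.0
print(dict(errors))      # {(1, 2): 1, (2, 0): 1}
```

Because the three outputs are aligned sample by sample, the same index can be used to look up the probability row for any misclassified sample.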

Source code in deepglue/training_utils.py

def predict_all(model, data_loader, device='cuda'):
    """
    Make predictions for all batches of data in data loader.

    Use the model to generate predictions for all batches from the provided data loader. 
    It returns the predicted class labels, true labels, and class probabilities for each sample.

    Parameters
    ----------
    model : torch.nn.Module
        Trained PyTorch model (e.g., ResNet50).
    data_loader : torch.utils.data.DataLoader
        An iterable that provides batches of input data and their corresponding labels.
    device : str, optional
        The device ('cpu' or 'cuda') on which the model and data are placed.
        Defaults to 'cuda'.

    Returns
    -------
    all_predictions : torch.Tensor
        An array of predicted labels for each sample in the dataset, with shape (num_samples,)
    all_labels : torch.Tensor
        An array of true labels for each sample in the dataset, with shape (num_samples,)
    all_probabilities: torch.Tensor
        A 2D array of shape (num_samples, num_categories) containing the softmax-normalized
        probabilities for each category: each row represents the predicted probability 
        distribution for a single sample.
    """
    all_predictions = []
    all_labels = []
    all_probabilities = []
    num_batches = len(data_loader)
    model.to(device)

    model.eval()
    with torch.no_grad():
        for images, labels in tqdm(data_loader, total=num_batches, desc="Predicting Batches"):
            images, labels = images.to(device), labels.to(device)
            logits = model(images) # logits
            _, preds = torch.max(logits, 1)
            all_predictions.append(preds.cpu())
            all_labels.append(labels.cpu())

            # convert logits to probs in batch 
            probabilities = softmax(logits, dim=1)  
            all_probabilities.append(probabilities.cpu())

    all_predictions = torch.cat(all_predictions, dim=0)
    all_labels = torch.cat(all_labels, dim=0)
    all_probabilities = torch.cat(all_probabilities, dim=0)
    return all_predictions, all_labels, all_probabilities

`predict_batch(model, image_batch, device='cuda')`

Predicts the category probabilities for a batch of images.

Parameters:

Name	Type	Description	Default
`model`	`Module`	Trained PyTorch model (e.g., ResNet50).	required
`image_batch`	`Tensor`	A batch of images of shape (batch_size, 3, H, W).	required
`device`	`str`	The device ('cpu' or 'cuda') on which the model and data are placed. Defaults to 'cuda'.	`'cuda'`

Returns:

Name	Type	Description
`probabilities`	`Tensor`	Predicted probabilities for each image in the batch. Shape is (batch_size x num_categories)
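The probabilities are produced by applying softmax to the model's logits along the class dimension. The dependency-free sketch below shows the same transformation for a single row and how a predicted label can be read off the result; `softmax_row` and the example logits are hypothetical, and the real function operates on tensors.

```python
import math


def softmax_row(logits):
    """Numerically stable softmax for one row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
probs = softmax_row(logits)

# The predicted label is the index of the largest probability
predicted_label = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs], predicted_label)
```

Softmax preserves the ordering of the logits, so the arg-max of the probabilities is the same class as the arg-max of the raw logits.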

TODO

Change name to predict_sample because this isn't a batch in the conventional sense coming from a data loader, keep the language consistent across the package.
Have it return predicted 'labels' and actual labels like predict_all does.

Source code in deepglue/training_utils.py

def predict_batch(model, image_batch, device='cuda'):
    """
    Predicts the category probabilities for a batch of images.

    Parameters
    ----------
    model : torch.nn.Module
        Trained PyTorch model (e.g., ResNet50).
    image_batch : torch.Tensor
        A batch of images of shape (batch_size, 3, H, W).
    device : str, optional
        The device ('cpu' or 'cuda') on which the model and data are placed.
        Defaults to 'cuda'.

    Returns
    -------
    probabilities: torch.Tensor
        Predicted probabilities for each image in the batch.
        Shape is (batch_size x num_categories)

    TODO
    ----
    - Change name to predict_sample because this isn't a batch in the conventional sense coming
    from a data loader, keep the language consistent across the package. 
    - Have it return predicted 'labels' and actual labels like predict_all does.
    """
    if device not in ['cuda', 'cpu']:
        raise ValueError(f"Invalid device: {device}. Use 'cuda' or 'cpu'.")

    model = model.to(device)
    image_batch = image_batch.to(device)

    logging.info(f"Generating predictions for {image_batch.shape[0]} samples")

    model.eval()
    with torch.no_grad():
        logits = model(image_batch)
        probabilities = softmax(logits, dim=1)
    return probabilities

`prepare_ordered_data(data_path, transform, num_workers=0, batch_size=4, split_type='valid')`

Prepare ordered data loader and corresponding image path list for feature extraction or other pipelines that require a full dataset in order.

Generate a list of image paths and a DataLoader for a given dataset split. The image path and the DataLoader indices are guaranteed to match because both shuffle and drop_last are set to False, ensuring the data will be accessed in order without dropping any samples.

Parameters:

Name	Type	Description	Default
`data_path`	`str or Path`	Path to the root directory containing the split folders ('train', 'valid', 'test')	required
`transform`	`torchvision transform (callable)`	The transformations to apply to each image.	required
`num_workers`	`int`	Number of workers for parallel data loading. Higher values improve performance during feature extraction but may lead to multiprocessing issues on some platforms. Defaults to 0 (no multiprocessing).	`0`
`batch_size`	`int`	Batch size for the DataLoader. Larger values improve feature extraction speed but require more memory. Defaults to 4.	`4`
`split_type`	`str`	The split folder to sample from ('train', 'valid', 'test'). Defaults to 'valid'.	`'valid'`

Returns:

Name	Type	Description
`image_paths`	`list of str`	A list of file paths to the images in the dataset split, in the same order as the DataLoader batches.
`ordered_loader`	`DataLoader`	A DataLoader for the ordered dataset split, configured to not shuffle data and to include all samples.

Raises:

Type	Description
`FileNotFoundError`	If the specified data paths do not exist.

Notes

Designed for feature extraction workflows where maintaining the correspondence between image file paths and DataLoader batches is critical.
For large datasets, consider increasing num_workers and batch_size for better performance.
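The correspondence between the path list and the loader can be pictured with plain Python: with no shuffling and no dropped samples, sample `i` lands in batch `i // batch_size` at position `i % batch_size`. The sketch below uses made-up paths and a hypothetical helper `batch_position`; it mimics the ordering only and does not use a real DataLoader.

```python
def batch_position(sample_index, batch_size):
    """Map a dataset index to (batch number, position in batch) for an unshuffled loader."""
    return divmod(sample_index, batch_size)


image_paths = [f"valid/cat/img_{i}.jpg" for i in range(10)]  # hypothetical paths
batch_size = 4

# Chunk the paths the way an ordered loader that keeps the last partial batch would
batches = [image_paths[i:i + batch_size] for i in range(0, len(image_paths), batch_size)]

batch_num, pos = batch_position(9, batch_size)
print(batch_num, pos, batches[batch_num][pos])  # 2 1 valid/cat/img_9.jpg
```

The final batch holds only two paths here, which is the behavior that keeping incomplete batches preserves.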

Source code in deepglue/training_utils.py

def prepare_ordered_data(data_path, transform, num_workers=0, batch_size=4, split_type='valid'):
    """
    Prepare ordered data loader and corresponding image path list for feature extraction or 
    other pipelines that require a full dataset in order.

    Generate a list of image paths and a DataLoader for a given dataset split.
    The image path and the DataLoader indices are guaranteed to match because both `shuffle`
    and `drop_last` are set to `False`, ensuring the data will be accessed in order without
    dropping any samples.

    Parameters
    ----------
    data_path : str or Path
        Path to the root directory containing the split folders ('train', 'valid', 'test')
    transform : torchvision transform (callable)
        The transformations to apply to each image.
    num_workers : int, optional
        Number of workers for parallel data loading. Higher values improve performance
        during feature extraction but may lead to multiprocessing issues on some platforms.
        Defaults to 0 (no multiprocessing).
    batch_size : int, optional
        Batch size for the DataLoader. Larger values improve feature extraction speed
        but require more memory. Defaults to 4.
    split_type : str, optional
        The split folder to sample from ('train', 'valid', 'test'). Defaults to 'valid'.

    Returns
    -------
    image_paths : list of str
        A list of file paths to the images in the dataset split, in the same order as
        the DataLoader batches.
    ordered_loader : torch.utils.data.DataLoader
        A DataLoader for the ordered dataset split, configured to not shuffle data and
        to include all samples.

    Raises
    ------
    FileNotFoundError
        If the specified data paths do not exist.

    Notes
    -----
    - Designed for feature extraction workflows where maintaining the correspondence between 
      image file paths and DataLoader batches is critical.
    - For large datasets, consider increasing `num_workers` and `batch_size` for better performance.
    """
    data_path = Path(data_path) 
    split_path = data_path / split_type
    if not split_path.exists():
        raise FileNotFoundError(f"{split_path} does not exist.")

    dataset = datasets.ImageFolder(root=split_path, transform=transform)
    image_paths = [path for path,_ in dataset.samples]
    ordered_loader = DataLoader(dataset,
                                batch_size=batch_size,
                                num_workers=num_workers,
                                shuffle=False,  # do not change
                                drop_last=False)  # do not change

    return image_paths, ordered_loader    

`train_and_validate(model, train_data_loader, valid_data_loader, loss_function, optimizer, device, topk=(1, 5), epochs=25)`

Train and validate a model for a given number of epochs.

Parameters:

Name	Type	Description	Default
`model`	`torch model`	The neural network model to be trained and validated.	required
`train_data_loader`	`DataLoader`	An iterable that provides the training data batches.	required
`valid_data_loader`	`DataLoader`	An iterable that provides the batches for the validation dataset.	required
`loss_function`	`callable`	The loss function to compute the loss (e.g., CrossEntropyLoss).	required
`optimizer`	`Optimizer`	The optimizer used to update model parameters during training (e.g., Adam, SGD).	required
`device`	`str`	The device ('cpu' or 'cuda') on which the model and data are placed.	required
`topk`	`tuple of int`	A tuple specifying which top-k accuracies to calculate. Defaults to (1, 5).	`(1, 5)`
`epochs`	`int`	Number of epochs to run. Defaults to 25.	`25`

Returns:

Name	Type	Description
`model`	`Module`	The trained model after the completion of training.
`history`	`dict`	A dictionary containing training and validation loss and top-k accuracies per epoch.
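The sketch below shows one way the `history` dictionary might be inspected after training. The keys match those built in the source (`train_loss`, `train_topk_accuracy`, `val_loss`, `val_topk_accuracy`); the numbers are invented, and plain lists stand in for the NumPy arrays the function returns.

```python
# Hand-written stand-in for `history` after 3 epochs with topk=(1, 5)
history = {
    "train_loss": [1.90, 1.20, 0.80],
    "train_topk_accuracy": [[35.0, 70.0], [55.0, 85.0], [70.0, 93.0]],
    "val_loss": [1.70, 1.30, 1.40],
    "val_topk_accuracy": [[40.0, 75.0], [52.0, 84.0], [50.0, 83.0]],
}

# Epoch (0-based) with the lowest validation loss
val_loss = history["val_loss"]
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Each accuracy row holds one value per entry in topk
best_top1, best_top5 = history["val_topk_accuracy"][best_epoch]

print(best_epoch, best_top1, best_top5)  # 1 52.0 84.0
```

In this made-up run the training loss keeps falling while the validation loss rises after the second epoch, the usual sign that later epochs are overfitting.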

Source code in deepglue/training_utils.py

def train_and_validate(model, 
                       train_data_loader, 
                       valid_data_loader, 
                       loss_function, 
                       optimizer, 
                       device, 
                       topk=(1,5),
                       epochs=25):
    """
    Train and validate a model for a given number of epochs.

    Parameters
    ----------
    model : torch model
        The neural network model to be trained and validated.
    train_data_loader : torch.utils.data.DataLoader
        An iterable that provides the training data batches.
    valid_data_loader : torch.utils.data.DataLoader
        An iterable that provides the batches for the validation dataset.
    loss_function : callable
        The loss function to compute the loss (e.g., CrossEntropyLoss).
    optimizer : torch.optim.Optimizer
        The optimizer used to update model parameters during training (e.g., Adam, SGD).
    device : str
        The device ('cpu' or 'cuda') on which the model and data are placed.
    topk : tuple of ints, optional
        A tuple specifying which top-k accuracies to calculate. Defaults to (1,5)
    epochs : int, optional
        Number of epochs to run. Defaults to 25.

    Returns
    -------
    model : torch.nn.Module
        The trained model after the completion of training.
    history : dict
        A dictionary containing training and validation loss and top-k accuracies per epoch.
    """
    train_loss, validation_loss = [], []
    train_topk_acc, validation_topk_acc = [], []

    logging.info(f"Training/validation {epochs} epochs")

    for epoch in range(epochs):
        logging.info(f"Epoch {epoch+1}/{epochs}")

        # Training step
        epoch_loss_train, epoch_train_topk = train_one_epoch(model, 
                                                             train_data_loader, 
                                                             loss_function, 
                                                             optimizer, 
                                                             device,
                                                             topk=topk)
        logging.info(f"\tTraining: Loss {epoch_loss_train:0.4f}")

        train_loss.append(epoch_loss_train)
        train_topk_acc.append(epoch_train_topk)

        # Validation step
        epoch_loss_val, epoch_val_topk = validate_one_epoch(model, 
                                                            valid_data_loader, 
                                                            loss_function, 
                                                            device,
                                                            topk=topk)

        validation_loss.append(epoch_loss_val)
        validation_topk_acc.append(epoch_val_topk)

        logging.info(f"\tValidation: Loss {epoch_loss_val:.4f}")

        torch.cuda.empty_cache()  # Clears unused GPU memory

    logging.info("Done!")

    history = {'train_loss': np.array(train_loss), 
               'train_topk_accuracy': np.array(train_topk_acc),
               'val_loss': np.array(validation_loss),
               'val_topk_accuracy': np.array(validation_topk_acc) }               

    return model, history  # Return both the trained model and the history

`train_one_epoch(model, train_data_loader, loss_function, optimizer, device, topk=(1, 5))`

Trains the model for one epoch using the provided training data loader.

Parameters:

Name	Type	Description	Default
`model`	`torch model`	The neural network model to be trained.	required
`train_data_loader`	`DataLoader`	An iterable that provides the training data batches.	required
`loss_function`	`callable`	The loss function to compute the loss (e.g., CrossEntropyLoss).	required
`optimizer`	`Optimizer`	The optimizer used to update model parameters (e.g., Adam, SGD).	required
`device`	`str`	The device ('cpu' or 'cuda') on which the model and data are to be placed.	required
`topk`	`tuple of int`	A tuple specifying which top-k accuracies to calculate. Defaults to (1, 5).	`(1, 5)`

Returns:

Name	Type	Description
`epoch_loss`	`float`	The average loss over all samples in the epoch.
`epoch_topk_acc`	`list of floats`	A list of average top-k accuracies over all samples in the epoch.

Notes

The function logs progress using the logging module. Set your loggers to 'debug' to see progress.
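A minimal sketch of enabling debug output with the standard `logging` module follows, together with the progress interval used in the source (`max(5, int(0.05*num_batches))`). The format string is an arbitrary choice, and `display_period` is written here as a standalone helper for illustration only.

```python
import logging

# Show DEBUG-level messages, which include the per-batch progress lines
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")


def display_period(num_batches):
    """Progress interval from the source: 5% of the batches, but never fewer than 5."""
    return max(5, int(0.05 * num_batches))


for n in (40, 200, 1000):
    logging.debug(f"{n} batches -> progress every {display_period(n)} batches")
```

If another part of your program has already configured logging, `basicConfig` has no effect unless called with `force=True`.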

Source code in deepglue/training_utils.py

def train_one_epoch(model, train_data_loader, loss_function, optimizer, device, topk=(1,5)):
    """
    Trains the model for one epoch using the provided training data loader.

    Parameters
    ----------
    model : torch model
        The neural network model to be trained.
    train_data_loader : torch.utils.data.DataLoader
        An iterable that provides the training data batches.
    loss_function : callable
        The loss function to compute the loss (e.g., CrossEntropyLoss).
    optimizer : torch.optim.Optimizer
        The optimizer used to update model parameters (e.g., Adam, SGD).
    device : str
        The device ('cpu' or 'cuda') on which the model and data are to be placed.
    topk : tuple of ints, optional
        A tuple specifying which top-k accuracies to calculate. Defaults to (1,5)

    Returns
    -------
    epoch_loss : float
        The average loss over all samples in the epoch.
    epoch_topk_acc : list of floats
        A list of average top-k accuracies over all samples in the epoch. 

    Notes
    -----
    The function logs progress using the logging module. Set your loggers to 'debug' to see progress.
    """
    model.to(device)
    model.train()  # Set the model to training mode

    # initialize losses and sample numbers
    running_loss = 0.0
    total_correct_k = [0.0]*len(topk) # to accumulate total number correct at each k level
    total_samples = 0

    num_batches = len(train_data_loader)
    display_period = max(5, int(0.05*num_batches))
    logging.debug(f"Starting training on {num_batches} batches.")
    logging.debug(f"Display period {display_period}")

    # data loader will cycle through all batches in one epoch
    for batch_num, (inputs, labels) in enumerate(train_data_loader):
        if np.mod(batch_num, display_period) == 0:
            logging.debug(f"Starting batch {batch_num}/{num_batches}")

        inputs = inputs.to(device)
        labels = labels.to(device)

        batch_size = inputs.size(0)
        total_samples += batch_size

        optimizer.zero_grad()  # Zero out gradients

        # Forward pass
        outputs = model(inputs)
        loss = loss_function(outputs, labels)

        if np.mod(batch_num, display_period) == 0:
            logging.debug("\tStarting backwards pass")

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        # Track training loss
        running_loss += loss.item() * batch_size 

        # Calculate number correct for each k (just using accuracy may not be good if batch sizes vary)
        batch_k_accuracies = accuracy(outputs, labels, topk=topk)
        for i, acc_k in enumerate(batch_k_accuracies):
            num_correct_k = acc_k * batch_size / 100
            total_correct_k[i] += num_correct_k 

        if np.mod(batch_num, display_period) == 0:
            logging.debug(f"\tLoss = {loss.item():.3f}")

    # Compute average loss and accuracy over the epoch
    epoch_loss = running_loss / total_samples
    epoch_topk_acc = [100*(correctk.item() / total_samples) for correctk in total_correct_k]

    logging.debug("Done training!")
    logging.debug(f"Training epoch loss: {epoch_loss:0.3f}")

    return epoch_loss, np.array(epoch_topk_acc)

`validate_one_epoch(model, valid_data_loader, loss_function, device, topk=(1, 5))`

Validates the model for one epoch using the provided validation data loader.

Parameters:

Name	Type	Description	Default
`model`	`torch model`	The neural network model to be validated.	required
`valid_data_loader`	`DataLoader`	An iterable that provides the batches for the validation dataset.	required
`loss_function`	`callable`	The loss function to compute the loss (e.g., CrossEntropyLoss).	required
`device`	`str`	The device ('cpu' or 'cuda') on which the model and data are placed.	required
`topk`	`tuple of int`	A tuple specifying which top-k accuracies to calculate. Defaults to (1, 5).	`(1, 5)`

Returns:

Name	Type	Description
`epoch_loss`	`float`	The average loss over all samples in the validation epoch.
`epoch_topk_acc`	`list of floats`	A list of average top-k accuracies over all samples in the epoch.

Notes

Runs in evaluation mode (model.eval()) and gradient calculations are disabled.
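The epoch accuracy is accumulated as a count of correct samples rather than as a mean of per-batch percentages, which matters when the last batch is smaller than the rest. The sketch below contrasts the two approaches on two hypothetical batches; the numbers are invented for illustration.

```python
# (batch_size, batch accuracy in percent) for two hypothetical batches
batches = [(4, 100.0), (1, 0.0)]

# Naive: average the per-batch percentages, ignoring batch size
naive = sum(acc for _, acc in batches) / len(batches)

# Sample-weighted: turn each percentage back into a count, then divide by total samples
total_correct = sum(acc * size / 100 for size, acc in batches)
total_samples = sum(size for size, _ in batches)
weighted = 100 * total_correct / total_samples

print(naive, weighted)  # 50.0 80.0
```

Four of the five samples are correct, so 80% is the true figure; the naive mean lets the single-sample batch count as much as the full one.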

Source code in deepglue/training_utils.py

def validate_one_epoch(model, valid_data_loader, loss_function, device, topk=(1,5)):
    """
    Validates the model for one epoch using the provided validation data loader.

    Parameters
    ----------
    model : torch model
        The neural network model to be validated.
    valid_data_loader : torch.utils.data.DataLoader
        An iterable that provides the batches for the validation dataset.
    loss_function : callable
        The loss function to compute the loss (e.g., CrossEntropyLoss).
    device : str
        The device ('cpu' or 'cuda') on which the model and data are placed.
    topk : tuple of ints, optional
        A tuple specifying which top-k accuracies to calculate. Defaults to (1,5)

    Returns
    -------
    epoch_loss : float
        The average loss over all samples in the validation epoch.
    epoch_topk_acc : list of floats
        A list of average top-k accuracies over all samples in the epoch. 

    Notes
    -----
    Runs in evaluation mode (`model.eval()`) and gradient calculations are disabled.
    """

    model.to(device)
    model.eval()  # Set the model to evaluation mode

    # initialize 
    running_loss = 0.0
    total_correct_k = [0.0] * len(topk)  # To accumulate the total number of correct predictions at each k level
    total_samples = 0

    num_batches = len(valid_data_loader)
    display_period = max(5, int(0.05*num_batches))
    logging.debug(f"Starting validation on {num_batches} batches.")
    logging.debug(f"Display period {display_period}")

    with torch.no_grad():  # Disable gradient calculation for validation
        # data loader will cycle through all batches in one epoch
        for batch_num, (inputs, labels) in enumerate(valid_data_loader):
            inputs = inputs.to(device)
            labels = labels.to(device)

            batch_size = inputs.size(0)
            total_samples += batch_size

            # Forward pass
            outputs = model(inputs)
            loss = loss_function(outputs, labels)

            # Loss
            running_loss += loss.item() * batch_size

            # Calculate number correct for each k using the accuracy function
            batch_k_accuracies = accuracy(outputs, labels, topk=topk)
            for i, acc_k in enumerate(batch_k_accuracies):
                num_correct_k = acc_k * batch_size / 100
                total_correct_k[i] += num_correct_k 

            if np.mod(batch_num, display_period) == 0:
                logging.debug(f"Batch {batch_num}/{num_batches} loss = {loss.item():.3f}")

    # Compute average loss and top-k accuracies over the epoch
    epoch_loss = running_loss / total_samples
    epoch_topk_acc = [100 * (correct_k.item() / total_samples) for correct_k in total_correct_k]

    logging.debug("Done validation!")
    logging.debug(f"Validation epoch loss: {epoch_loss:.3f}")

    return epoch_loss, np.array(epoch_topk_acc)

Plot utilities

deepglue plot_utils.py

Functions that are useful for plotting and visualization during different deep learning tasks.

`convert_for_plotting(tensor)`

Convert torch tensor image (typically float CxHxW) to a format suitable for standard plotting libraries (uint8 HxWxC).

Parameters:

Name	Type	Description	Default
`tensor`	`Tensor`	The input tensor image. Expected shape: (C, H, W). Typically a float, often not in [0, 1] range.	required

Returns:

Type	Description
`Tensor`	A uint8 tensor image scaled to [0, 255] for plotting and dims (H,W,C)
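The scaling step can be pictured as a min-max rescale followed by truncation to integers. The plain-Python sketch below works on a flat list of values; `rescale_to_uint8` is a hypothetical helper, and its guard for constant input is an addition made here, not something the deepglue source is known to do.

```python
def rescale_to_uint8(values):
    """Min-max rescale a flat list of floats to integers in [0, 255]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant input: return zeros instead of dividing by zero (guard added here)
        return [0 for _ in values]
    # int() truncates toward zero, so 127.5 becomes 127
    return [int((v - lo) / span * 255) for v in values]


pixels = [-1.0, 0.0, 0.5, 1.0]
scaled = rescale_to_uint8(pixels)
print(scaled)  # [0, 127, 191, 255]
```

Because the rescale uses the minimum and maximum of the image itself, each image is stretched to the full range independently, so brightness is not comparable across images after conversion.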

Source code in deepglue/plot_utils.py

def convert_for_plotting(tensor):
    """
    Convert torch tensor image (typically float CxHxW) to a format suitable for standard plotting libraries (uint8 HxWxC).

    Parameters
    ----------
    tensor : torch.Tensor
        The input tensor image. Expected shape: (C, H, W). Typically a float, often not in [0, 1] range.

    Returns
    -------
    torch.Tensor
        A uint8 tensor image scaled to [0, 255] for plotting and dims (H,W,C)
    """
    # Validate input
    if not isinstance(tensor, torch.Tensor):
        raise ValueError(f"Expected input to be a torch.Tensor, but got {type(tensor)}.")
    if tensor.ndim != 3 or tensor.size(0) not in {1, 3}:
        raise ValueError(f"Expected tensor shape (1, H, W) or (3, H, W), but got {tuple(tensor.shape)}.")

    # Ensure the tensor is on CPU and detached from any computation graph
    tensor = tensor.detach().cpu()

    # Handle grayscale by expanding it to 3 channels for consistent plotting
    if tensor.size(0) == 1:
        tensor = tensor.expand(3, -1, -1)  # Convert (1, H, W) -> (3, H, W)

    # Min-max rescale to [0, 1]; skip the division for a constant image to avoid dividing by zero
    tensor = tensor - tensor.min()
    max_val = tensor.max()
    if max_val > 0:
        tensor = tensor / max_val

    # Scale to [0, 255] and convert to uint8
    tensor = (tensor * 255).byte()

    # Reorder dimensions from (C, H, W) to (H, W, C) for standard plotting libraries
    tensor = tensor.permute(1, 2, 0)  # (3, H, W) -> (H, W, 3)

    return tensor

`create_embeddable_image(image_path, size=(50, 50), quality=50)`

Converts an image to a base64-encoded string for embedding in HTML.

Loads an image from disk, resizes it, and encodes it as a JPEG. The processed image is then base64-encoded and returned as a string that can be embedded in HTML or visualized interactively using tools like Bokeh.

Parameters:

Name	Type	Description	Default
`image_path`	`str or Path`	Path to the input image file.	required
`size`	`tuple of int`	Desired size for the resized image as (width, height). Defaults to (50, 50).	`(50, 50)`
`quality`	`int`	JPEG compression quality. Valid values are between 1 (worst) and 95 (best). Defaults to 50.	`50`

Returns:

Type	Description
`str`	A data URI (`data:image/jpeg;base64,...`) containing the Base64-encoded image, ready for embedding.

Notes

Adapted from umap example at https://umap-learn.readthedocs.io/en/latest/basic_usage.html
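
To make the return format concrete, here is a minimal standard-library sketch of how a data URI is assembled from encoded image bytes. The four bytes used here are only a placeholder (they are not a valid image); a real call would use the JPEG bytes written to an in-memory buffer.

```python
import base64

# Stand-in bytes for an encoded image; a real call would use the JPEG bytes
# written to an in-memory buffer (these four bytes are not a valid image)
fake_jpeg_bytes = b"\xff\xd8\xff\xe0"

# Base64-encode and prepend the data URI prefix so a browser can decode it inline
encoded = base64.b64encode(fake_jpeg_bytes).decode()
data_uri = f"data:image/jpeg;base64,{encoded}"

# The string can be dropped straight into an <img> tag, e.g. in a hover tooltip
html = f"<img src='{data_uri}'/>"
print(html)

# Decoding the payload recovers the original bytes
payload = data_uri.split(",", 1)[1]
assert base64.b64decode(payload) == fake_jpeg_bytes
```

Embedding images this way keeps a plot self-contained, at the cost of roughly a third more bytes than the raw file, which is one reason to keep the thumbnails small.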

Source code in deepglue/plot_utils.py

def create_embeddable_image(image_path, size=(50, 50), quality=50):
    """
    Converts an image to a base64-encoded string for embedding in HTML.

    Loads an image from disk, resizes it, and encodes it as a JPEG. The processed
    image is then base64-encoded and returned as a string that can be embedded in
    HTML or visualized interactively using tools like Bokeh.

    Parameters
    ----------
    image_path : str or Path
        Path to the input image file.
    size : tuple of int, optional
        Desired size for the resized image as (width, height). Defaults to (50, 50).
    quality : int, optional
        JPEG compression quality. Valid values are between 1 (worst) and 95 (best).
        Defaults to 50.

    Returns
    -------
    str
        A Base64-encoded string representing the processed image, ready for embedding.

    Notes
    -----
    - Adapted from umap example at https://umap-learn.readthedocs.io/en/latest/basic_usage.html
    """
    image = Image.open(image_path).convert('RGB').resize(size, Image.Resampling.BICUBIC)

    # Save the image to a memory buffer
    buffer = BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    #buffer.seek(0)  # Ensure the buffer is at the beginning

    # Convert the image to Base64 encoding
    base64_encoded = base64.b64encode(buffer.getvalue()).decode()

    # Return the Base64 string with the data URI prefix
    return f'data:image/jpeg;base64,{base64_encoded}'

`plot_batch(batch_images, batch_targets, category_map, max_to_plot=32)`

Plots a batch of images, and their corresponding target categories, from a DataLoader.

Parameters:

Name	Type	Description	Default
`batch_images`	`Tensor`	A tensor containing a batch of images with shape `(N, C, H, W)`, where `N` is the batch size, `C` is the number of channels, `H` is the height, and `W` is the width of the images.	required
`batch_targets`	`Tensor`	A tensor containing the target labels for the batch, with shape `(N,)`.	required
`category_map`	`dict`	A dictionary mapping category indices (as strings) to their human-readable labels, e.g., `{'0': 'car', '1': 'ant'}`.	required
`max_to_plot`	`int`	The maximum number of images to plot from the batch. Defaults to 32.	`32`

Returns:

Type	Description
`None`	Displays a grid of images with their corresponding labels.

Notes

Each image is rescaled to uint8 with convert_for_plotting(); single-channel images are expanded to three channels.
If batch size is smaller than max_to_plot, all images in batch will be plotted.
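
The layout logic can be sketched in plain Python: the grid is always four columns wide, so the number of rows is the ceiling of the image count divided by four, and the last row may contain unused axes. The helper name `grid_layout` below is hypothetical and only illustrates the arithmetic; the second half shows why `category_map` is keyed by strings.

```python
import math


def grid_layout(batch_size, max_to_plot=32, ncols=4):
    """Return (num_to_plot, nrows, num_unused_axes) for a fixed-width image grid."""
    num_to_plot = min(batch_size, max_to_plot)
    nrows = math.ceil(num_to_plot / ncols)
    unused = nrows * ncols - num_to_plot
    return num_to_plot, nrows, unused


# Targets arrive as integers, while category_map is keyed by strings
category_map = {'0': 'car', '1': 'ant'}
targets = [1, 0, 1]
titles = [category_map[str(t)] for t in targets]

print(grid_layout(batch_size=10))
print(titles)
```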

Source code in deepglue/plot_utils.py

def plot_batch(batch_images, batch_targets, category_map, max_to_plot=32):
    """
    Plots a batch of images, and their corresponding target categories, from a DataLoader.

    Parameters
    ----------
    batch_images : torch.Tensor
        A tensor containing a batch of images with shape `(N, C, H, W)`, where
        `N` is the batch size, `C` is the number of channels, `H` is the height,
        and `W` is the width of the images.
    batch_targets : torch.Tensor
        A tensor containing the target labels for the batch, with shape `(N,)`.
    category_map : dict
        A dictionary mapping category indices (as strings) to their human-readable
        labels, e.g., `{'0': 'car', '1': 'ant'}`.
    max_to_plot : int, optional
        The maximum number of images to plot from the batch. Defaults to 32.

    Returns
    -------
    None
        Displays a grid of images with their corresponding labels.

    Notes
    -----
    - Each image is rescaled to uint8 with `convert_for_plotting()`; single-channel
      images are expanded to three channels.
    - If batch size is smaller than `max_to_plot`, all images in batch will be plotted.
    """
    nbatch = len(batch_targets)
    num_to_plot = min(nbatch, max_to_plot)
    nrows = int(np.ceil(num_to_plot/4))

    fig, axes = plt.subplots(nrows=nrows, ncols=4, figsize=(7, 2*(nrows))) # size is width x height

    for index, ax in enumerate(axes.flat):
        if index >= num_to_plot:  # the grid can hold more axes than images
            ax.axis('off')  # hide the unused axes
            continue
        image = batch_images[index]
        image = convert_for_plotting(image)
        category = str(batch_targets[index].item())
        ax.imshow(image)
        ax.set_title(category_map[category])
        ax.axis('off')

    fig.tight_layout()

`plot_interactive_projection(features_2d, labels, image_paths, category_map, predictions=None, title='Feature Projection', image_size=(50, 50), plot_size=800, legend_location=None, show_in_notebook=True)`

Create an interactive Bokeh plot for any low-dimensional projection of features corresponding to images.

Create an interactive plot of a 2D projection of features extracted from images, such as those obtained using dimensionality reduction techniques like UMAP, PCA, or t-SNE. When you hover over a scatter point, it shows the original image corresponding to that point in the 2D space. If you provide predictions, incorrect predictions are drawn as an X.

Parameters:

Name	Type	Description	Default
`features_2d`	`array-like`	2D array of features obtained after dimensionality reduction (num_samples, 2).	required
`labels`	`list`	List of integer labels for the data points (len num_samples).	required
`image_paths`	`list`	List of file paths to the images corresponding to the features (len num_samples).	required
`category_map`	`dict`	A mapping of category indices (as strings) to their respective labels. Example: {'0': 'cat', '1': 'dog'}.	required
`predictions`	`array-like`	Predicted labels for the data points (len num_samples). Defaults to None.	`None`
`title`	`str`	Title of the plot. Defaults to 'Feature Projection'.	`'Feature Projection'`
`image_size`	`tuple`	Size of the images shown in plot when you hover over points (width, height). Defaults to (50, 50).	`(50, 50)`
`plot_size`	`int`	Size of the plot (width and height in pixels). Defaults to 800.	`800`
`legend_location`	`str`	Location of the legend. Defaults to None, which puts it in the default location. Options include 'top_left', 'top_right', 'bottom_left', 'bottom_right', 'top', 'bottom', 'left', 'right', 'center'.	`None`
`show_in_notebook`	`bool`	If True, display the plot inline in a Jupyter Notebook. If False, open the plot in a new browser tab (projection_plot.html). Defaults to True.	`True`

Returns:

Type	Description
`None`	Displays the interactive plot.
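
The correct/incorrect split that drives the two marker styles can be sketched without pandas or Bokeh. The example below is a simplified plain-Python analogue with made-up labels and predictions; the function itself builds a DataFrame and separate Bokeh data sources.

```python
category_map = {'0': 'cat', '1': 'dog'}
labels = [0, 1, 1, 0]
predictions = [0, 1, 0, 0]

# One record per point: the category comes from the true label,
# and `correct` records whether the prediction matched it
rows = [
    {"index": i, "category": category_map[str(label)], "correct": pred == label}
    for i, (label, pred) in enumerate(zip(labels, predictions))
]

# Correct points would be drawn as circles, incorrect ones as X markers
correct_rows = [r for r in rows if r["correct"]]
incorrect_rows = [r for r in rows if not r["correct"]]

print(len(correct_rows), "correct;", "incorrect:", incorrect_rows)
```

Note that points are colored by their true category, so a misclassified point keeps the color of its actual class and differs only in marker shape.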

Source code in deepglue/plot_utils.py

def plot_interactive_projection(features_2d, labels, image_paths, category_map,
                                predictions=None, title='Feature Projection', 
                                image_size=(50, 50), plot_size=800, legend_location=None,
                                show_in_notebook=True):
    """
    Create an interactive Bokeh plot for any low-dimensional projection of features corresponding to images.

    Create an interactive plot of a 2D projection of features extracted from images, such as those obtained using
    dimensionality reduction techniques like UMAP, PCA, or t-SNE. When you hover over a scatter point, it shows
    the original image corresponding to that point in the 2D space. If you provide predictions, incorrect
    predictions are drawn as an X.

    Parameters
    ----------
    features_2d : array-like
        2D array of features obtained after dimensionality reduction (num_samples, 2).
    labels : list
        List of integer labels for the data points (len num_samples).
    image_paths : list
        List of file paths to the images corresponding to the features (len num_samples).
    category_map : dict
        A mapping of category indices (as strings) to their respective labels.
        Example: {'0': 'cat', '1': 'dog'}.
    predictions : array-like, optional
        Predicted labels for the data points (len num_samples). Defaults to None.
    title : str, optional
        Title of the plot. Defaults to 'Feature Projection'.
    image_size : tuple, optional
        Size of the images shown in plot when you hover over points (width, height). Defaults to (50, 50).
    plot_size : int, optional
        Size of the plot (width and height in pixels). Defaults to 800.
    legend_location : str, optional
        Location of the legend. Defaults to None which puts it in default location. 
        Options include 'top_left', 'top_right', 'bottom_left', 'bottom_right', 'top', 'bottom', 'left', 'right','center'
    show_in_notebook : bool, optional
        If True, display the plot inline in a Jupyter Notebook.
        If False, open the plot in a new browser tab (projection_plot.html). Defaults to True.

    Returns
    -------
    None
        Displays the interactive plot.
    """
    reset_output() # just so you don't update things outside of the current window

    category_names = list(category_map.values())
    num_categories = len(category_names)

    # Prepare the DataFrame
    df = pd.DataFrame(features_2d, columns=('x', 'y'))
    df['category'] = [category_map[str(label)] for label in labels]
    df['image'] = list(map(lambda path: create_embeddable_image(path, size=image_size), image_paths))
    df.insert(0, 'index', df.index) # index column for hover

    """
    Handling the logic for correct/incorrect predictions
    - If predictions were provided
        - add a 'correct' column to the df
        - get rows of correct and incorrect predictions in the df
        - create separate ColumnDataSources for correct and incorrect predictions for plotting
    - If no predictions provided, just use a single ColumnDataSource for all points 
    """
    if predictions is not None:
        # Convert predictions to a Python list if they are a PyTorch tensor
        if isinstance(predictions, torch.Tensor):
            predictions = predictions.tolist()

        df['correct'] = [prediction == label for prediction, label in zip(predictions, labels)]
        df_correct_inds = df[df['correct']]
        df_incorrect_inds = df[~df['correct']]
        datasource_correct = ColumnDataSource(df_correct_inds)
        datasource_incorrect = ColumnDataSource(df_incorrect_inds)
    else:
        datasource_all = ColumnDataSource(df)

    # Set up color mapping 
    cmap = plt.cm.tab10
    colors = [cmap(i / num_categories) for i in range(num_categories)]
    hex_colors = [rgb2hex(c) for c in colors]
    color_mapping = CategoricalColorMapper(factors=category_names,
                                           palette=hex_colors)

    # Define the tooltip HTML used to show images on hover
    tooltips = """
    <div>
        <img src='@image' style='margin: 8px 0 0 0;'/>
        <br>@category (@index)
    </div>
    """

    # Create the Bokeh figure
    plot_figure = figure(title=title,
                         width=plot_size,
                         height=plot_size,
                         tools=('pan, box_zoom, wheel_zoom, reset'),
                         tooltips=tooltips)

    # Add scatter points based on whether predictions are provided
    if predictions is not None:
        # Scatter for correct points (circles)
        plot_figure.scatter('x', 'y',
                            source=datasource_correct,
                            color=dict(field='category', transform=color_mapping),
                            marker='circle',
                            size=6,
                            line_alpha=0.6,
                            fill_alpha=0.6,
                            legend_label="Correct")

        # Scatter for incorrect points (X's)
        plot_figure.scatter('x', 'y',
                            source=datasource_incorrect,
                            color=dict(field='category', transform=color_mapping),
                            marker='x',
                            size=8,
                            line_alpha=0.6,
                            fill_alpha=0.6,
                            legend_label="Incorrect")
    else:
        # Scatter for all points if no predictions
        plot_figure.scatter('x', 'y',
                            source=datasource_all,
                            color=dict(field='category', transform=color_mapping),
                            marker='circle',
                            size=6,
                            line_alpha=0.6,
                            fill_alpha=0.6)


    # change legend location, change theme to dark 
    if legend_location is not None:
        plot_figure.legend.location = legend_location
    curdoc().theme = 'dark_minimal'


    # Set output target based on the `show_in_notebook` parameter
    if show_in_notebook:
        output_notebook()
    else:
        output_file("projection_plot.html", title=title)

    # Display the plot
    show(plot_figure)

`plot_prediction_grid(images, probability_matrix, true_categories, category_map, top_n=5, figsize_per_plot=(2, 3), logscale=True)`

Plots a grid of classifier prediction visualizations.

Each visualization in the grid contains the image on the left, plotted using dg.plot_prediction_image(), and a bar plot of the top_n category probabilities on the right, plotted using dg.plot_prediction_probs().

Parameters:

Name	Type	Description	Default
`images`	`Tensor`	Shape num_predictions x 3 x H x W of images to be classified	required
`probability_matrix`	`Tensor`	Torch tensor with shape num_predictions x num_categories. Each row corresponds to an image and contains the classifier probabilities for each category.	required
`true_categories`	`list of str`	Length num_predictions list of correct labels for each prediction (e.g., ['cat', 'dog', ...]).	required
`category_map`	`dict`	A mapping of category indices (as strings) to their respective labels. Example: {'0': 'cat', '1': 'dog'}.	required
`top_n`	`int`	The top n class probabilities to show in bar plot, default is 5.	`5`
`figsize_per_plot`	`tuple`	Size of each subplot (image or bar plot) in inches as (width, height). Default is (2, 3).	`(2, 3)`
`logscale`	`bool`	If True, the bar plot uses a logarithmic scale. Default is True.	`True`

Returns:

Name	Type	Description
`fig`	`Figure`	The figure object containing the full grid of prediction plots.
`axes`	`np.ndarray of matplotlib.axes.Axes`	Array of axes objects arranged in a grid

Note

Inspired by visualization created by the Nuevo Foundation: https://workshops.nuevofoundation.org/python-tensorflow/plotting_model/
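
The grid places two predictions per row, and each prediction occupies two adjacent columns (image, then bar plot). A short sketch of that index arithmetic, using a hypothetical helper name `grid_position`:

```python
def grid_position(prediction_ind, predictions_per_row=2, subplots_per_prediction=2):
    """Return (row, image_col, bar_col) for the prediction at `prediction_ind`."""
    row, position_in_row = divmod(prediction_ind, predictions_per_row)
    col_start = position_in_row * subplots_per_prediction
    return row, col_start, col_start + 1


# Five predictions need three rows; the last row is only half used
positions = [grid_position(i) for i in range(5)]
for i, pos in enumerate(positions):
    print(i, pos)
```

With an odd number of predictions the final row has two leftover axes, which is why the function switches off any unused axes before returning.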

Source code in deepglue/plot_utils.py

def plot_prediction_grid(images, probability_matrix, true_categories, category_map, 
                          top_n=5, figsize_per_plot=(2, 3), logscale=True):
    """
    Plots a grid of classifier prediction visualizations.

    Each visualization in the grid contains the image on the left, plotted
    using dg.plot_prediction_image(), and a bar plot of the top_n category
    probabilities on the right, plotted using dg.plot_prediction_probs().

    Parameters
    ----------
    images : torch.Tensor
        Shape num_predictions x 3 x H x W of images to be classified
    probability_matrix : torch.Tensor
        Torch tensor with shape num_predictions x num_categories.
        Each row corresponds to an image and contains the classifier probabilities for each category.
    true_categories : list of str
        Length num_predictions list of correct labels for each prediction (e.g., ['cat', 'dog', ...]).
    category_map : dict
        A mapping of category indices (as strings) to their respective labels.
        Example: {'0': 'cat', '1': 'dog'}.
    top_n : int, optional
        The top n class probabilities to show in bar plot, default is 5.
    figsize_per_plot : tuple, optional
        Size of each subplot (image or bar plot) in inches as (width, height). Default is (2, 3).
    logscale : bool, optional
        If True, the bar plot uses a logarithmic scale. Default is True.

    Returns
    -------
    fig : matplotlib.figure.Figure
        The figure object containing the full grid of prediction plots.
    axes : np.ndarray of matplotlib.axes.Axes
        Array of axes objects arranged in a grid

    Note
    ----
    Inspired by visualization created by the Nuevo Foundation:
    https://workshops.nuevofoundation.org/python-tensorflow/plotting_model/
    """
    num_predictions = len(true_categories)

    if images.shape[0] != num_predictions or probability_matrix.shape[0] != num_predictions:
        raise ValueError("images.shape[0], probability_matrix.shape[0], and len(true_categories) must all be equal.")

    predictions_per_row = 2
    subplots_per_prediction = 2  # Image + bar plot 
    ncols = predictions_per_row * subplots_per_prediction
    nrows = int(np.ceil(num_predictions / predictions_per_row))

    # Create figure with specified grid layout
    # squeeze=False keeps `axes` two-dimensional even when there is only one row
    fig, axes = plt.subplots(nrows, ncols, figsize=(figsize_per_plot[0] * ncols, 
                                                    figsize_per_plot[1] * nrows), 
                             layout="constrained", squeeze=False)

    for prediction_ind in range(num_predictions):
        # First work out indexing and assign to image and bar plot
        row, position_in_row = divmod(prediction_ind, predictions_per_row)
        col_start = position_in_row * 2  # Start column for this prediction group
        im_ax = axes[row, col_start]
        prob_ax = axes[row, col_start + 1]

        # Get true label for the current prediction
        true_label = true_categories[prediction_ind]

        # Plot the image with the actual and predicted label
        _, im_ax = plot_prediction_image(images[prediction_ind], 
                                                 probability_matrix[prediction_ind], 
                                                 category_map, 
                                                 true_label=true_label, 
                                                 ax=im_ax)

        # Plot the bar plot and remove title and xlabel
        _, prob_ax = plot_prediction_probs(probability_matrix[prediction_ind], 
                                                   category_map, 
                                                   true_label=true_label, 
                                                   top_n=top_n, 
                                                   logscale=logscale, 
                                                   ax=prob_ax)
        prob_ax.set_title("")
        prob_ax.set_xlabel("")

    # Remove unused axes 
    for unused_ax in axes.flatten()[num_predictions * subplots_per_prediction:]:
        unused_ax.axis("off")

    return fig, axes

`plot_prediction_image(tensor, probabilities, category_map, true_label=None, ax=None, figsize=(2.5, 2.5))`

Plot classifier prediction: displays image with true label on top and estimate on bottom with probability.

Parameters:

Name	Type	Description	Default
`tensor`	`Tensor`	The input image tensor (CxHxW) or 1xCxHxW	required
`probabilities`	`Tensor`	Prediction probabilities for each category (1D tensor).	required
`category_map`	`dict`	A mapping of category indices (as strings) to their respective labels. Example: {'0': 'cat', '1': 'dog'}.	required
`true_label`	`str`	The actual category label of the image, if known (e.g., 'dog'). Default is None.	`None`
`ax`	`Axes`	Axes object for plot. If None, new axes are created. Default is None.	`None`
`figsize`	`tuple`	Size of the figure in inches. Default is (2.5, 2.5).	`(2.5, 2.5)`

Returns:

Name	Type	Description
`fig`	`Figure`	The figure object for further customization or saving.
`axes`	`Axes`	The image axis object
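
The label shown under the image is the top-1 prediction and its probability. A minimal plain-Python sketch of that lookup, using a list in place of a tensor and made-up probabilities:

```python
category_map = {'0': 'cat', '1': 'dog', '2': 'bird'}
probabilities = [0.2, 0.7, 0.1]

# Index of the largest probability (the top-1 prediction)
top_ind = max(range(len(probabilities)), key=probabilities.__getitem__)
top_prob = probabilities[top_ind]

# category_map is keyed by strings, so convert the index before the lookup
predicted_label = category_map[str(top_ind)]
xlabel = f"Est: {predicted_label} ({top_prob:.2f})"
print(xlabel)
```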

Source code in deepglue/plot_utils.py

def plot_prediction_image(tensor, probabilities, category_map, 
                               true_label=None, ax=None, figsize=(2.5, 2.5)):
    """
    Plot classifier prediction: displays image with true label on top and estimate on bottom with probability.

    Parameters
    ----------
    tensor : torch.Tensor
        The input image tensor (CxHxW) or 1xCxHxW
    probabilities : torch.Tensor
        Prediction probabilities for each category (1D tensor).
    category_map : dict
        A mapping of category indices (as strings) to their respective labels.
        Example: {'0': 'cat', '1': 'dog'}.
    true_label : str, optional
        The actual category label of the image, if known (e.g., 'dog'). Default is None.
    ax : matplotlib.axes.Axes, optional
        Axes object for plot. If None, new axes are created. Default is None.
    figsize : tuple, optional
        Size of the figure in inches. Default is (2.5, 2.5).

    Returns
    -------
    fig : matplotlib.figure.Figure
        The figure object for further customization or saving.
    axes : matplotlib.axes.Axes
        The image axis object
    """
    # Handle any stray tensor dimensions in case singleton batch sent in
    if tensor.dim() == 4 and tensor.shape[0] == 1:
        tensor = tensor.squeeze(0)
    if probabilities.dim() == 2 and probabilities.shape[0] == 1:
        probabilities = probabilities.squeeze(0)

    # Get top predictions
    top_prob, top_ind = torch.topk(probabilities, 1)
    predicted_label = category_map[str(top_ind.item())]
    image = convert_for_plotting(tensor)

    if ax is None:
        fig, ax = plt.subplots(figsize=figsize, layout="constrained")
    else:
        fig = ax.get_figure()  # the figure that owns `ax`, which may not be the current figure

    # Plot the image
    ax.imshow(image, cmap="gray")
    ax.set(xticks=[], yticks=[])
    ax.set_xlabel(f"Est: {predicted_label} ({top_prob.item():.2f})")  
    if true_label:
        ax.set_title(f"{true_label}")   

    return fig, ax

`plot_prediction_probs(probabilities, category_map, true_label=None, top_n=5, logscale=True, ax=None, figsize=(3, 2.5), bar_color='skyblue')`

Plot classifier prediction probabilities: bar plot of top N category probabilities.

Parameters:

Name	Type	Description	Default
`probabilities`	`Tensor`	Prediction probabilities for each category (1D tensor).	required
`category_map`	`dict`	A mapping of category indices (as strings) to their respective labels. Example: {'0': 'cat', '1': 'dog'}.	required
`true_label`	`str`	The actual category label, if known (e.g., 'dog'). Default is None.	`None`
`top_n`	`int`	The top n class probabilities to display from the classifier, default is 5.	`5`
`logscale`	`bool`	If True, the bar plot uses a logarithmic scale. Default is True.	`True`
`ax`	`Axes`	Axes object for plot. If None, new axes are created. Default is None.	`None`
`figsize`	`tuple`	Size of the figure in inches. Default is (3, 2.5).	`(3, 2.5)`
`bar_color`	`str`	Color for the bars in the bar plot. Default is 'skyblue'.	`'skyblue'`

Returns:

Name	Type	Description
`fig`	`Figure`	The figure object for further customization or saving.
`axes`	`Axes`	The bar plot axis object
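
The selection step can be sketched in plain Python: `top_n` is capped at the number of categories, and the probabilities are then ranked from highest to lowest. The helper name `top_n_labels` is hypothetical, and lists stand in for tensors.

```python
def top_n_labels(probabilities, category_map, top_n=5):
    """Return the top_n (label, probability) pairs, highest probability first."""
    top_n = min(top_n, len(category_map))  # never ask for more than exists
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return [(category_map[str(i)], probabilities[i]) for i in order[:top_n]]


category_map = {'0': 'cat', '1': 'dog', '2': 'bird'}
top = top_n_labels([0.2, 0.7, 0.1], category_map, top_n=5)
print(top)
```

A logarithmic axis is useful here because the runner-up probabilities are often orders of magnitude smaller than the top one and would be invisible on a linear scale.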

Source code in deepglue/plot_utils.py

def plot_prediction_probs(probabilities, category_map, true_label=None, top_n=5, logscale=True, 
                               ax=None, figsize=(3, 2.5), bar_color='skyblue'):
    """
    Plot classifier prediction probabilities: bar plot of top N category probabilities.

    Parameters
    ----------
    probabilities : torch.Tensor
        Prediction probabilities for each category (1D tensor).
    category_map : dict
        A mapping of category indices (as strings) to their respective labels.
        Example: {'0': 'cat', '1': 'dog'}.
    true_label : str, optional
        The actual category label, if known (e.g., 'dog'). Default is None.
    top_n : int, optional
        The top n class probabilities to display from the classifier, default is 5.
    logscale : bool, optional
        If True, the bar plot uses a logarithmic scale. Default is True.
    ax : matplotlib.axes.Axes, optional
        Axes object for plot. If None, new axes are created. Default is None.
    figsize : tuple, optional
        Size of the figure in inches. Default is (3, 2.5).
    bar_color : str, optional
        Color for the bars in the bar plot. Default is 'skyblue'.

    Returns
    -------
    fig : matplotlib.figure.Figure
        The figure object for further customization or saving.
    axes : matplotlib.axes.Axes
        The bar plot axis object

    """
    # Handle any stray tensor dimensions from singleton batches
    if probabilities.dim() == 2 and probabilities.shape[0] == 1:
        probabilities = probabilities.squeeze(0)

    # Ensure top_n doesn't exceed the number of available categories
    if top_n > len(category_map):
        logging.warning(f"top_n ({top_n}) is greater than the number of categories. "
                        f"Setting top_n to {len(category_map)}.")
        top_n = len(category_map)

    # Get top predictions
    top_probs, top_indices = torch.topk(probabilities, top_n)
    top_labels = [category_map[str(idx)] for idx in top_indices.cpu().numpy()]

    if ax is None:
        fig, ax = plt.subplots(figsize=figsize, layout="constrained")
    else:
        fig = ax.get_figure()  # the figure that owns `ax`, which may not be the current figure

    # Plot the bar chart of top n probabilities
    ax.barh(top_labels, top_probs.cpu().numpy(), color=bar_color, log=logscale)
    ax.set_xlabel("Probability (log scale)" if logscale else "Probability")
    ax.invert_yaxis() # high on top
    ax.set_title(f"Top {top_n} Predictions")

    # Set y-tick labels with bold formatting for the true label
    y_labels = ax.get_yticklabels()
    for label in y_labels:
        if label.get_text() == true_label:
            label.set_fontweight('bold')  # Set correct label to bold

    return fig, ax

`plot_random_category_sample(data_path, category, split_type='train', num_to_plot=16)`

Plots a random selection of images from a specific category within a data split.

Assumes a directory structure where images are stored in category-specific subdirectories under split folders (e.g., 'train', 'valid', 'test').

Parameters:

Name	Type	Description	Default
`data_path`	`str or Path`	The path to the root directory containing the split folders ('train', 'valid', 'test').	required
`category`	`str`	The name of the category from which to plot images (e.g., 'cat')	required
`split_type`	`str`	The split folder to pull images from ('train', 'valid', 'test'). Defaults to 'train'.	`'train'`
`num_to_plot`	`int`	The number of images to plot. Defaults to 16. If it exceeds the available number of images, a warning will be issued and all available images will be plotted.	`16`

Returns:

Name	Type	Description
`fig`	`Figure`	The figure object containing the subplots.
`axes`	`array of matplotlib.axes`	An array of matplotlib Axes objects, one for each image subplot.
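
The sampling behavior described above (pick random files from `split/category`, and fall back to all available files when too many are requested) can be sketched with the standard library. The helper `sample_category_paths` is a hypothetical stand-in; deepglue's own `sample_random_images()` is not shown in this section and may differ.

```python
import random
import tempfile
import warnings
from pathlib import Path


def sample_category_paths(data_path, category, split_type="train", num_images=16):
    """Hypothetical helper: randomly pick file paths from data_path/split_type/category."""
    category_dir = Path(data_path) / split_type / category
    paths = sorted(p for p in category_dir.glob("*") if p.is_file())
    if num_images > len(paths):
        warnings.warn(f"Requested {num_images} images but only {len(paths)} are available.")
        num_images = len(paths)
    return random.sample(paths, num_images)


# Build a throwaway directory tree with three empty placeholder files
with tempfile.TemporaryDirectory() as root:
    cat_dir = Path(root) / "train" / "cat"
    cat_dir.mkdir(parents=True)
    for name in ("a.jpg", "b.jpg", "c.jpg"):
        (cat_dir / name).touch()

    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence the expected "too many" warning
        sampled = sample_category_paths(root, "cat", num_images=16)
    sampled_names = sorted(p.name for p in sampled)

print(sampled_names)
```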

Source code in deepglue/plot_utils.py

def plot_random_category_sample(data_path, category, split_type='train', num_to_plot=16):
    """
    Plots a random selection of images from a specific category within a data split.

    Assumes a directory structure where images are stored in category-specific 
    subdirectories under split folders (e.g., 'train', 'valid', 'test').

    Parameters
    ----------
    data_path : str or Path
        The path to the root directory containing the split folders ('train', 'valid', 'test').
    category : str
        The name of the category from which to plot images (e.g., 'cat')
    split_type : str, optional
        The split folder to pull images from ('train', 'valid', 'test'). Defaults to 'train'.
    num_to_plot : int, optional
        The number of images to plot. Defaults to 16. If it exceeds the available number of images,
        a warning will be issued and all available images will be plotted. 

    Returns
    -------
    fig : matplotlib.figure.Figure
        The figure object containing the subplots.
    axes : array of matplotlib.axes
        An array of matplotlib Axes objects, one for each image subplot.
    """
    data_path = Path(data_path) # in case it's a string

    # make a dummy category map for sample_random_images() to work with
    category_map = {category: category}

    # Use dg.sample_random_images() to select the images from the specified category
    sampled_paths, _ = sample_random_images(data_path=data_path,
                                            category_map=category_map,
                                            num_images=num_to_plot,
                                            split_type=split_type,
                                            category=category)

    if not sampled_paths:
        raise FileNotFoundError(f"No images found in '{split_type}/{category}'.")

    num_to_plot = len(sampled_paths) # in case too many were requested

    ncols = 4
    nrows = int(np.ceil(num_to_plot/ncols))
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(6, 1.5*nrows))

    for ax, img_file in zip(axes.flat, sampled_paths):
        # Load the image
        img = Image.open(img_file)
        ax.imshow(img)
        ax.axis('off')  # Hide axes for cleaner display

    # Hide any unused axes
    for ax in axes.flat[num_to_plot:]:
        ax.axis('off')

    fig.suptitle(f"Category: {category} ({split_type} split)", y=0.97)
    fig.tight_layout()

    return fig, axes

`plot_random_sample(data_path, category_map, split_type='train', num_to_plot=16)`

Plots random image samples from a specified data split.

Assumes a directory structure where images are stored in category-specific subdirectories inside the split folders ('train', 'valid', 'test').

data_path/
    train/
        cat/
        dog/
    valid/
        cat/
        dog/   
    test/
        cat/
        dog/

Parameters:

Name	Type	Description	Default
`data_path`	`str or Path`	The path to the root directory containing the split folders ('train', 'valid', 'test').	required
`category_map`	`dict`	A dictionary mapping category indices (as strings) to their human-readable labels, e.g., `{'0': 'cat', '1': 'dog'}`.	required
`split_type`	`str`	The split folder to pull images from ('train', 'valid', 'test'). Defaults to 'train'.	`'train'`
`num_to_plot`	`int`	Number of images to plot. Defaults to 16.	`16`

Returns:

Name	Type	Description
`fig`	`Figure`	The figure object containing the subplots
`axes`	`array of matplotlib.axes`	An array of matplotlib Axes objects, one for each image subplot.
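
A `category_map` for a directory layout like the one above could be built from the subdirectory names. The sketch below assumes indices are assigned in alphabetical order of the folder names; that ordering and the helper name `build_category_map` are assumptions for illustration, and deepglue may construct its map differently.

```python
import tempfile
from pathlib import Path


def build_category_map(data_path, split_type="train"):
    """Hypothetical helper: map string indices to alphabetically sorted subdirectory names."""
    split_dir = Path(data_path) / split_type
    names = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    return {str(i): name for i, name in enumerate(names)}


# Recreate the example layout in a temporary directory
with tempfile.TemporaryDirectory() as root:
    for split in ("train", "valid", "test"):
        for category in ("dog", "cat"):
            (Path(root) / split / category).mkdir(parents=True)
    category_map = build_category_map(root)

print(category_map)
```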

Source code in deepglue/plot_utils.py

def plot_random_sample(data_path, category_map, split_type='train', num_to_plot=16):
    """
    Plots random image samples from a specified data split.

    Assumes a directory structure where images are stored in category-specific
    subdirectories inside the split folders ('train', 'valid', 'test').

        data_path/
            train/
                cat/
                dog/
            valid/
                cat/
                dog/   
            test/
                cat/
                dog/

    Parameters
    ----------
    data_path : str or Path
        The path to the root directory containing the split folders ('train', 'valid', 'test').
    category_map : dict
        A dictionary mapping category indices (as strings) to their human-readable
        labels, e.g., `{'0': 'cat', '1': 'dog'}`.
    split_type : str, optional
        The split folder to pull images from ('train', 'valid', 'test'). Defaults to 'train'.
    num_to_plot : int, optional
        Number of images to plot. Defaults to 16.

    Returns
    -------
    fig : matplotlib.figure.Figure
        The figure object containing the subplots
    axes : array of matplotlib.axes
        An array of matplotlib Axes objects, one for each image subplot.
    """
    sample_paths, sample_categories = sample_random_images(data_path, 
                                                           category_map, 
                                                           split_type=split_type, 
                                                           num_images=num_to_plot)
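    num_to_plot = len(sample_paths)  # fewer images may be returned than requested
    ncols = 4
    nrows = int(np.ceil(num_to_plot/ncols))

    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(6, 1.5*nrows))

    for ax, sample_path, sample_category in zip(axes.flat, sample_paths, sample_categories):
        # Load the image and label it with its category
        img = Image.open(sample_path)
        ax.imshow(img, cmap="gray")
        ax.set_title(sample_category)
        ax.axis('off')

    # Hide any unused axes
    for ax in axes.flat[num_to_plot:]:
        ax.axis('off')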

    fig.suptitle(f"Random images from {split_type} split", y=0.96)
    fig.tight_layout()

    return fig, axes

`plot_transformed(original_image, transform, cmap=None, num_to_plot=4)`

Plot the original image and pytorch transformations applied to it.

Parameters:

Name	Type	Description	Default
`original_image`	`2d array-like image`	The original image to be transformed. Can be tensor or numpy/PIL or other array.	required
`transform`	`pytorch transform callable`	A transformation function (or series of transformations) to apply to the original image. The function should accept an image and return a transformed tensor.	required
`cmap`	`str`	Colormap to use for displaying greyscale images. Set to None for color images.	`None`
`num_to_plot`	`int`	The number of transformed images to generate and display, in addition to the original image. Defaults to 4.	`4`

Returns:

Name	Type	Description
`fig`	`Figure`	The figure object containing the plots.
`axes`	`array of matplotlib.axes`	The axes array containing the individual image subplots.

Notes

The first image displayed is the original, and subsequent images are transformed versions.
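
One panel is reserved for the original image, so `num_to_plot + 1` panels are spread over five columns. The sketch below reproduces the row count and the panel titles assigned in the source listing; `panel_titles` is a hypothetical name used only for this illustration.

```python
import math


def panel_titles(num_to_plot=4, ncols=5):
    """Return (nrows, titles) for the original image plus its transforms.

    Hypothetical helper for illustration only; not part of deepglue.
    """
    num_panels = num_to_plot + 1  # +1 for the untransformed original
    nrows = math.ceil(num_panels / ncols)
    titles = ["Original"] + [f"Transform {i}" for i in range(1, num_panels)]
    return nrows, titles


nrows, titles = panel_titles(4)
print(nrows, titles)
```

Each transformed panel calls `transform` afresh, so a transform with random components can produce a different result in every panel.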

Source code in deepglue/plot_utils.py

def plot_transformed(original_image, transform, cmap=None, num_to_plot=4):
    """
    Plot the original image and pytorch transformations applied to it.

    Parameters
    ----------
    original_image : 2d array-like image
        The original image to be transformed. Can be tensor or numpy/PIL or other array.
    transform : pytorch transform callable
        A transformation function (or series of transformations) to apply to the original image.
        The function should accept an image and return a transformed tensor.
    cmap : str, optional
        Colormap to use for displaying greyscale images. Set to None for color images.
    num_to_plot : int, optional
        The number of transformed images to generate and display, in addition to the original image. Defaults to 4.

    Returns
    -------
    fig : matplotlib.figure.Figure
        The figure object containing the plots.
    axes : array of matplotlib.axes
        The axes array containing the individual image subplots.

    Notes
    -----
    - The first image displayed is the original, and subsequent images are transformed versions.
    """
    ncols = 5
    nrows = int(np.ceil((num_to_plot+1)/ncols))

    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(8,2*nrows))

    for index, ax in enumerate(axes.flat):      
        if index >= num_to_plot+1:
            ax.axis('off')
            continue
        if index == 0:
            image = original_image
        else:
            image = convert_for_plotting(transform(original_image))  # convert for matplotlib
        ax.imshow(image, cmap=cmap)  # honor the caller's colormap (None for color images)
        ax.axis('off')
        if index == 0:
            ax.set_title('Original')
        else:
            ax.set_title(f'Transform {index}')

    fig.tight_layout()

    return fig, axes

File Utilities

deepglue file_utils.py

Module includes functions that are useful for wrangling directories and files.

`count_by_category(data_path)`

Calculates the total number of images for each category across all splits.

Traverses the train, valid, and test folders and aggregates image counts for each category. This can be useful for identifying category imbalances.

Parameters:

Name	Type	Description	Default
`data_path`	`Path`	The path to the parent directory containing the 'train', 'valid', and 'test' folders. They each contain the same category-specific subdirectories.	required

Returns:

Name	Type	Description
`num_per_category`	`dict`	A dictionary where keys are category names and values are the total number of images in each category.

Raises:

Type	Description
`FileNotFoundError`	If any of the specified split directories ('train', 'valid', 'test') do not exist at the given path.
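
To make the aggregation concrete, here is a self-contained sketch that builds a toy dataset in a temporary directory and sums the entries per category across the three splits. It is a stand-in written with the standard library, not the deepglue implementation; `make_toy_dataset` and `count_per_category` are hypothetical names.

```python
import tempfile
from collections import Counter
from pathlib import Path


def make_toy_dataset(root, layout):
    """Create empty placeholder files, where layout[split][category] is a file count."""
    for split, categories in layout.items():
        for category, n_files in categories.items():
            category_dir = Path(root) / split / category
            category_dir.mkdir(parents=True)
            for i in range(n_files):
                (category_dir / f"img_{i}.jpg").touch()


def count_per_category(root):
    """Sum the entries found in each category folder across all splits."""
    totals = Counter()
    for split in ("train", "valid", "test"):
        for category_dir in (Path(root) / split).iterdir():
            if category_dir.is_dir():
                totals[category_dir.name] += sum(1 for _ in category_dir.glob("*"))
    return dict(totals)


layout = {
    "train": {"cat": 3, "dog": 5},
    "valid": {"cat": 1, "dog": 2},
    "test": {"cat": 1, "dog": 1},
}
with tempfile.TemporaryDirectory() as tmp:
    make_toy_dataset(tmp, layout)
    toy_counts = count_per_category(tmp)

print(toy_counts)
```

As in the source listing, every entry matched by `glob('*')` is counted, so stray non-image files inside a category folder would inflate the totals.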

Source code in deepglue/file_utils.py

def count_by_category(data_path):
    """
    Calculates the total number of images for each category across all splits.

    Traverses the train, valid, and test folders and aggregates image counts
    for each category. This can be useful for identifying category imbalances.

    Parameters
    ----------
    data_path : Path
        The path to the parent directory containing the 'train', 'valid', and 'test' folders. They each
        contain the same category-specific subdirectories.

    Returns
    -------
    num_per_category : dict
        A dictionary where keys are category names and values are the total number of images in each category.

    Raises
    ------
    FileNotFoundError
        If any of the specified split directories ('train', 'valid', 'test') do not exist at the given path.
    """
    logging.info(f"Getting samples per category in {data_path}")

    num_per_category = {}
    split_types = ['train', 'valid', 'test']

    for split_type in split_types:
        split_path = data_path / split_type

        if not split_path.exists():
            raise FileNotFoundError(f"{split_path} does not exist. Please check your directory structure.")

        # Traverse each category directory in the split
        for category_path in split_path.iterdir():
            if category_path.is_dir():
                # Use glob to count files in each category directory
                category_name = category_path.name
                num_images = len(list(category_path.glob('*')))
                # initialize
                if category_name not in num_per_category:
                    num_per_category[category_name] = 0
                num_per_category[category_name] += num_images

    return num_per_category

`count_by_split(data_path)`

Calculates the total number of images in train, test, and validation splits, regardless of categories.

This function directly traverses the 'train', 'valid', and 'test' folders and counts all image files, providing the total number of samples in each split without considering category distinctions.

Parameters:

Name	Type	Description	Default
`data_path`	`Path`	The path to the directory containing the 'train', 'valid', and 'test' folders.	required

Returns:

Name	Type	Description
`num_per_split`	`dict`	A dictionary with keys 'train', 'valid', and 'test', each containing the total number of samples in each split, regardless of category.

Raises:

Type	Description
`FileNotFoundError`	If any of the specified split directories ('train', 'valid', 'test') do not exist at the given path.

Source code in deepglue/file_utils.py

def count_by_split(data_path):
    """
    Calculates the total number of images in train, test, and validation splits, regardless of categories.

    This function directly traverses the 'train', 'valid', and 'test' folders and counts all image files,
    providing the total number of samples in each split without considering category distinctions.

    Parameters
    ----------
    data_path : Path
        The path to the directory containing the 'train', 'valid', and 'test' folders.

    Returns
    -------
    num_per_split : dict
        A dictionary with keys 'train', 'valid', and 'test', each containing the total number of samples 
        in each split, regardless of category.

    Raises
    ------
    FileNotFoundError
        If any of the specified split directories ('train', 'valid', 'test') do not exist at the given path.
    """
    logging.info(f"Getting samples per split in {data_path}")

    num_per_split = {}
    split_types = ['train', 'valid', 'test']

    for split_type in split_types:
        split_path = data_path / split_type

        if not split_path.exists():
            raise FileNotFoundError(f"{split_path} does not exist. Please check your directory structure.")

        split_total = 0
        # Directly iterate over category directories within the split folder
        for category_dir in split_path.iterdir():
            if category_dir.is_dir():
                # Count files in the category directory
                split_total += len(list(category_dir.glob('*')))

        num_per_split[split_type] = split_total

    return num_per_split

`count_category_by_split(data_path)`

Counts the number of images in each category within train, validation, and test splits.

Assumes a directory structure where images are stored in category-specific subdirectories under 'train', 'valid', and 'test' folders.

Parameters:

Name	Type	Description	Default
`data_path`	`Path`	The path to the directory containing the 'train', 'valid', and 'test' folders.	required

Returns:

Name	Type	Description
`num_category_per_split`	`dict`	A nested dictionary with keys 'train', 'valid', and 'test', each containing a sub-dictionary where the keys are category names and the values are the counts of images in each category.

Raises:

Type	Description
`FileNotFoundError`	If any of the 'train', 'valid', or 'test' directories do not exist at the specified path.
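
The nested dictionary carries enough information to recover both per-split and per-category totals. The sketch below assumes a dictionary in the shape described above; the numbers are toy values, and the variable names simply echo the return names used in this module.

```python
from collections import Counter

# Toy counts in the nested {split: {category: count}} shape described above.
num_category_per_split = {
    "train": {"cat": 140, "dog": 200},
    "valid": {"cat": 30, "dog": 40},
    "test": {"cat": 30, "dog": 35},
}

# Per-split totals: sum over the categories within each split.
num_per_split = {
    split: sum(categories.values())
    for split, categories in num_category_per_split.items()
}

# Per-category totals: accumulate each category's count over the splits.
category_totals = Counter()
for categories in num_category_per_split.values():
    category_totals.update(categories)
num_per_category = dict(category_totals)

print(num_per_split)
print(num_per_category)
```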

Source code in deepglue/file_utils.py

def count_category_by_split(data_path):
    """
    Counts the number of images in each category within train, validation, and test splits.

    Assumes a directory structure where images are stored in category-specific 
    subdirectories under 'train', 'valid', and 'test' folders.

    Parameters
    ----------
    data_path : Path
        The path to the directory containing the 'train', 'valid', and 'test' folders.

    Returns
    -------
    num_category_per_split: dict
        A nested dictionary with keys 'train', 'valid', and 'test', each containing a sub-dictionary 
        where the keys are category names and the values are the counts of images in each category.

    Raises
    ------
    FileNotFoundError
        If any of the 'train', 'valid', or 'test' directories do not exist at the specified path.
    """
    logging.info(f"Getting category counts by split in {data_path}")

    split_types = ['train', 'valid', 'test']
    num_category_per_split = {}

    for split_type in split_types:
        dataset_path = data_path / split_type

        if not dataset_path.exists():
            raise FileNotFoundError(f"{dataset_path} does not exist. Please check your directory structure.")

        num_category_per_split[split_type] = {}

        for category in os.listdir(dataset_path):
            category_path = dataset_path / category
            if category_path.is_dir():
                num_images = len(list(category_path.glob('*'))) 
                num_category_per_split[split_type][category] = num_images

    return num_category_per_split

`create_project(projects_dir, project_name)`

Creates a minimal project directory structure within the project parent directory:

projects_dir/
    project_name/
        data/
        models/

Parameters:

Name	Type	Description	Default
`projects_dir`	`str or Path`	Path to the project parent directory.	required
`project_name`	`str`	Name of the project. Must be a valid directory name, so avoid spaces and special characters.	required

Returns:

Name	Type	Description
`project_dir`	`Path`	Path to the project directory that was created in projects_dir
`data_dir`	`Path`	Path to the data directory in project_dir
`models_dir`	`Path`	Path to the models directory in the project_dir
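
The resulting layout can be reproduced with `pathlib` alone. The following sketch is an illustrative stand-in rather than the deepglue implementation; `make_project_layout` is a hypothetical name, and the temporary directory exists only to keep the example self-contained.

```python
import tempfile
from pathlib import Path


def make_project_layout(projects_dir, project_name):
    """Create projects_dir/project_name/{data,models}; safe to call repeatedly.

    Illustrative stand-in only; not the deepglue implementation.
    """
    project_dir = Path(projects_dir) / project_name
    data_dir = project_dir / "data"
    models_dir = project_dir / "models"
    for directory in (data_dir, models_dir):
        directory.mkdir(parents=True, exist_ok=True)  # also creates project_dir
    return project_dir, data_dir, models_dir


with tempfile.TemporaryDirectory() as tmp:
    first = make_project_layout(tmp, "my_project")
    second = make_project_layout(tmp, "my_project")  # existing folders are reused
    layout_ok = all(path.is_dir() for path in first)
    created = sorted(path.name for path in first[0].iterdir())

print(created)
```

Calling the helper a second time returns the same paths without raising, which matches the documented behavior of skipping directories that already exist.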

TODO

consider using pathvalidate to throw error if project_name is invalid

Source code in deepglue/file_utils.py

def create_project(projects_dir, project_name):
    """
    Creates a minimal project directory structure within the project parent directory:

        projects_dir/
            project_name/
                data/
                models/

    Parameters
    ----------
    projects_dir : str or Path
        Path to the project parent directory. 
    project_name : str
        Name of the project. Must be a valid directory name, so avoid spaces and special characters.

    Returns
    -------
    project_dir : Path
        Path to the project directory that was created in projects_dir
    data_dir : Path
        Path to the data directory in project_dir
    models_dir : Path
        Path to the models directory in the project_dir

    TODO
    ----
    consider using pathvalidate to throw error if project_name is invalid
    """
    projects_dir = Path(projects_dir)
    project_dir = projects_dir / project_name

    try:
        project_dir.mkdir(parents=True, exist_ok=False)
        logging.info(f"Created project directory: {project_dir}")
    except FileExistsError:
        logging.info(f"Project directory '{project_dir}' already exists. Skipping.")

    subdirs = ["data", "models"]
    project_subdirs = create_subdirs(project_dir, subdirs)
    data_dir, models_dir = project_subdirs[0], project_subdirs[1]

    return project_dir, data_dir, models_dir

`create_subdirs(parent_dir, subdirs)`

Create subdirectories within a specified parent directory, unless they already exist.

Parameters:

Name	Type	Description	Default
`parent_dir`	`str or Path`	The path to the parent directory where subdirectories will be created.	required
`subdirs`	`list or str`	List of subdirectory names to create within the parent directory. If a single string is provided, it will be converted to a list.	required

Returns:

Name	Type	Description
`new_paths`	`list of Path`	A list of path objects to the newly created subdirectories

Example

>>> create_subdirs(Path("path/to/parent"), ["subdir1", "subdir2"])
[Path('path/to/parent/subdir1'), Path('path/to/parent/subdir2')]

Source code in deepglue/file_utils.py

def create_subdirs(parent_dir, subdirs):
    """
    Create subdirectories within a specified parent directory, unless they already exist.

    Parameters
    ----------
    parent_dir : str or Path
        The path to the parent directory where subdirectories will be created.
    subdirs : list or str
        List of subdirectory names to create within the parent directory.
        If a single string is provided, it will be converted to a list.

    Returns
    -------
    new_paths: list of Path
        A list of path objects to the newly created subdirectories

    Example
    -------
    >>> create_subdirs(Path("path/to/parent"), ["subdir1", "subdir2"])
    [Path('path/to/parent/subdir1'), Path('path/to/parent/subdir2')]
    """
    logging.info(f"Creating subdirectories of {parent_dir}")

    parent_dir = Path(parent_dir)

    if not parent_dir.exists():
        parent_dir.mkdir(parents=True, exist_ok=True)  # Create parent and intermediates if they don't exist

    # make sure subdirs is a list
    if isinstance(subdirs, str):
        subdirs = [subdirs]

    new_paths = []
    for subdir in subdirs:
        subdir_path = parent_dir / subdir
        if not subdir_path.exists():
            subdir_path.mkdir(parents=False, exist_ok=False) # prevent overwriting
        new_paths.append(subdir_path)

    return new_paths

`load_images_for_model(image_paths, transform)`

Given a list of image paths, returns a tensor suitable for model input.

Parameters:

Name	Type	Description	Default
`image_paths`	`list of str or Paths`	List of image file paths.	required
`transform`	`torchvision transform (callable)`	The transformations to apply to each image.	required

Returns:

Type	Description
`Tensor`	A batch of images as a tensor of shape (len(image_paths), 3, H, W).

Source code in deepglue/file_utils.py

def load_images_for_model(image_paths, transform):
    """
    Given a list of image paths, returns a tensor suitable for model input.

    Parameters
    ----------
    image_paths : list of str or Paths
        List of image file paths.
    transform : torchvision transform (callable)
        The transformations to apply to each image.

    Returns
    -------
    torch.Tensor
        A batch of images as a tensor of shape (len(image_paths), 3, H, W).
    """
    images = [transform(Image.open(image_path).convert("RGB")) for image_path in image_paths]
    return torch.stack(images)

`sample_random_images(data_path, category_map, num_images=1, split_type='train', category=None)`

Randomly sample image paths from a dataset with a standard categorical directory structure.

Assumes a directory structure where images are stored in category-specific subdirectories inside the split folders ('train', 'valid', 'test').

data_path/
    train/
        cat/
        dog/
    valid/
        cat/
        dog/   
    test/
        cat/
        dog/

Parameters:

Name	Type	Description	Default
`data_path`	`str or Path`	Path to the root directory containing the split folders ('train', 'valid', 'test')	required
`category_map`	`dict`	Dictionary mapping category index (as string) to category name: {'0': 'dog', '1': 'cat'}	required
`num_images`	`int`	Number of image paths to sample, by default 1.	`1`
`split_type`	`str`	The split folder to sample from ('train', 'valid', 'test'). Defaults to 'train'.	`'train'`
`category`	`str`	If specified, only images from this category will be sampled. When default of None is chosen, will select randomly across all categories.	`None`

Returns:

Name	Type	Description
`sampled_paths`	`list`	Paths to the sampled images. The list has length num_images, or fewer if not enough images are available.
`sampled_categories`	`list`	The category of each sampled image, in the same order as sampled_paths.

Raises:

Type	Description
`FileNotFoundError`	If the specified split or category path does not exist.

Notes

Assumes that each split folder contains only category subdirectories.
If num_images exceeds the total available images, all images will be returned, and a warning will be logged.
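
The capping behavior described in the notes, together with the way paths and categories stay aligned, can be sketched without touching the file system. `sample_aligned` is a hypothetical helper, and its `seed` argument is an added convenience that the deepglue function does not take.

```python
import random


def sample_aligned(image_paths, categories, num_images, seed=None):
    """Sample paths and their categories together, capped at what is available.

    Hypothetical helper for illustration only; `seed` is an added convenience.
    """
    rng = random.Random(seed)
    num_images = min(num_images, len(image_paths))  # cannot sample more than exist
    indices = rng.sample(range(len(image_paths)), num_images)  # without replacement
    return [image_paths[i] for i in indices], [categories[i] for i in indices]


paths = ["cat/a.jpg", "cat/b.jpg", "dog/c.jpg"]
labels = ["cat", "cat", "dog"]
picked_paths, picked_labels = sample_aligned(paths, labels, num_images=5, seed=0)
print(len(picked_paths))  # 3: the request for 5 is capped at the 3 available
```

Sampling indices rather than paths is what keeps the two returned lists in step: position `i` in one list always describes position `i` in the other.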

Source code in deepglue/file_utils.py

def sample_random_images(data_path, category_map, num_images=1, split_type='train', category=None):
    """
    Randomly sample image paths from a dataset with a standard categorical directory structure.

    Assumes a directory structure where images are stored in category-specific
    subdirectories inside the split folders ('train', 'valid', 'test').

        data_path/
            train/
                cat/
                dog/
            valid/
                cat/
                dog/   
            test/
                cat/
                dog/

    Parameters
    ----------
    data_path : str or Path
        Path to the root directory containing the split folders ('train', 'valid', 'test')
    category_map : dict
        Dictionary mapping category index (as string) to category name:  {'0': 'dog', '1': 'cat'}
    num_images : int, optional
        Number of image paths to sample, by default 1.
    split_type : str, optional
        The split folder to sample from ('train', 'valid', 'test'). Defaults to 'train'.
    category : str, optional
        If specified, only images from this category will be sampled. When default of None is
        chosen, will select randomly across all categories. 

    Returns
    -------
    sampled_paths: list
        Paths to the sampled images. The list has length num_images, or fewer if
        not enough images are available.
    sampled_categories: list
        The category of each sampled image, in the same order as sampled_paths.

    Raises
    ------
    FileNotFoundError
        If the specified split or category path does not exist.

    Notes
    -----
    - Assumes that each split folder contains only category subdirectories.
    - If `num_images` exceeds the total available images, all images will be returned, and a warning will be logged.
    """
    data_path = Path(data_path) 
    split_path = data_path / split_type
    logging.info(f"Selecting {num_images} random images from {data_path}")

    if not split_path.exists():
        raise FileNotFoundError(f"Split path {split_path} does not exist.")

    # Set up list of category directories (depends on whether single cat or all)
    if category is None:
        # filter out things that aren't directories
        category_dirs = [category_dir for category_dir in split_path.iterdir() if category_dir.is_dir()]
    else:
        # If a specific category is given, construct its path directly
        category_path = split_path / category
        if not category_path.is_dir():
            raise FileNotFoundError(f"Category directory name {category} does not exist in {split_path}.")
        category_dirs = [category_path]    # even though single category, expects list

    # Collect all image paths and their corresponding categories
    image_paths = []
    categories = []
    for category_dir in category_dirs:
        category_name = category_map[category_dir.name]
        for img_path in category_dir.glob('*'):
            image_paths.append(img_path)
            categories.append(category_name)

    total_images = len(image_paths)

    # Adjust num_images if it exceeds the available number of images
    if num_images > total_images:
        logging.warning(f"Requested {num_images} images, but only {total_images} are available. "
                        f"Returning all available images.")
        num_images = total_images

    # Randomly sample the requested number of images
    sampled_indices = random.sample(range(total_images), num_images)  # sample() works on iterable, so we give it range()
    sampled_paths = [image_paths[i] for i in sampled_indices]
    sampled_categories = [categories[i] for i in sampled_indices]

    return sampled_paths, sampled_categories

`split_dataset(source_dir, target_dir, splits=(0.7, 0.15, 0.15), shuffle=True)`

Splits a dataset organized by category folders into train/valid/test folders.

Copies images from a flat category structure (e.g. source_dir/cat/, source_dir/dog/, etc.) into a canonical deep learning format with separate splits:

target_dir/
    train/
        cat/
        dog/
    valid/
        cat/
        dog/   
    test/
        cat/
        dog/

Parameters:

Name	Type	Description	Default
`source_dir`	`str or Path`	Path to the folder containing category subfolders (e.g. cat/, dog/).	required
`target_dir`	`str or Path`	Path where the split dataset should be created.	required
`splits`	`tuple of 3 floats`	Tuple indicating proportions of (train, valid, test) splits. Values must sum to 1.0. Defaults to (0.7, 0.15, 0.15).	`(0.7, 0.15, 0.15)`
`shuffle`	`bool`	Whether to shuffle images before splitting within a category. Defaults to True.	`True`

Returns:

Name	Type	Description
`counts`	`dict`	A nested dictionary showing the number of images per category in each split Example: { 'train': {'cat': 140, 'dog': 200}, 'valid': {'cat': 30, 'dog': 40}, 'test': {'cat': 30, 'dog': 35} }

Raises:

Type	Description
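`FileNotFoundError`	If the source directory does not exist.
`ValueError`	If `splits` does not contain exactly three values, or if the values do not sum to 1.0.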
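
How the proportions translate into per-category counts is easiest to see in isolation. The sketch below mirrors the arithmetic in the source listing: the train and valid sizes are truncated with `int()`, and test receives whatever remains. `split_sizes` is a hypothetical helper name, not a deepglue function.

```python
def split_sizes(n_total, splits=(0.7, 0.15, 0.15)):
    """Return (n_train, n_valid, n_test) for a single category.

    Hypothetical helper mirroring the arithmetic in the source listing.
    """
    if len(splits) != 3:
        raise ValueError(f"Splits must be a tuple of three floats. Got {splits}.")
    if abs(sum(splits) - 1.0) > 1e-6:
        raise ValueError(f"Split fractions must sum to 1.0. Got {sum(splits)}.")
    n_train = int(splits[0] * n_total)    # truncates toward zero
    n_valid = int(splits[1] * n_total)    # truncates toward zero
    n_test = n_total - n_train - n_valid  # remainder, so no image is dropped
    return n_train, n_valid, n_test


print(split_sizes(200))
print(split_sizes(7))
```

Because the remainder goes to test, small categories can end up with a test share noticeably larger than the requested fraction; with 7 images the split is 4, 1, and 2.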

Source code in deepglue/file_utils.py

def split_dataset(source_dir,
                  target_dir,
                  splits=(0.7, 0.15, 0.15),
                  shuffle=True):
    """
    Splits a dataset organized by category folders into train/valid/test folders.

    Copies images from a flat category structure (e.g. source_dir/cat/, source_dir/dog/, etc.) into a
    canonical deep learning format with separate splits:

        target_dir/
            train/
                cat/
                dog/
            valid/
                cat/
                dog/   
            test/
                cat/
                dog/

    Parameters
    ----------
    source_dir : str or Path
        Path to the folder containing category subfolders (e.g. cat/, dog/).
    target_dir : str or Path
        Path where the split dataset should be created.
    splits : tuple of 3 floats, optional
        Tuple indicating proportions of (train, valid, test) splits.
        Values must sum to 1.0. Defaults to (0.7, 0.15, 0.15).
    shuffle : bool, optional
        Whether to shuffle images before splitting within a category. Defaults to True.

    Returns
    -------
    counts : dict
        A nested dictionary showing the number of images per category in each split
        Example:
            {
                'train': {'cat': 140, 'dog': 200},
                'valid': {'cat': 30, 'dog': 40},
                'test': {'cat': 30, 'dog': 35}
            }

    Raises
    ------
    FileNotFoundError
        If the source directory does not exist.
    ValueError
        If `splits` does not contain exactly three values, or if the values do not sum to 1.0.
    """
    source_dir = Path(source_dir)
    target_dir = Path(target_dir)

    if not source_dir.exists():
        raise FileNotFoundError(f"Source directory {source_dir} does not exist.")

    if len(splits) != 3:
        raise ValueError(f"Splits must be a tuple of three floats. Got {splits}.")

    split_names = ["train", "valid", "test"]
    splits_dict = dict(zip(split_names, splits))
    if abs(sum(splits_dict.values()) - 1.0) > 1e-6:
        raise ValueError(f"Split fractions must sum to 1.0. Got {sum(splits_dict.values())}.")

    logging.info(f"Splitting data from {source_dir} into {target_dir} using splits {splits_dict}")

    # Prepare counts dictionary
    counts = {
        "train": {},
        "valid": {},
        "test": {},
    }

    category_dirs = [category_dir for category_dir in source_dir.iterdir() if category_dir.is_dir()]

    for category_dir in category_dirs:
        category = category_dir.name
        images_category = sorted(category_dir.glob("*")) # get all files in category dir

        if shuffle:
            random.shuffle(images_category)  # shuffle in place

        n_cat_total = len(images_category)
        n_cat_train = int(splits_dict["train"] * n_cat_total)
        n_cat_valid = int(splits_dict["valid"] * n_cat_total)
        n_cat_test = n_cat_total - n_cat_train - n_cat_valid

        logging.info(f"Processing category '{category}': {n_cat_total} images"
                     f"\ntrain: {n_cat_train}, valid: {n_cat_valid}, test: {n_cat_test}")

        # Create mapping of split names to list of image paths
        split_category_map = {"train": images_category[:n_cat_train],
                         "valid": images_category[n_cat_train: n_cat_train + n_cat_valid],
                         "test": images_category[n_cat_train + n_cat_valid:]}

        for split_type, file_list in split_category_map.items():
            split_category_dir = target_dir / split_type / category
            split_category_dir.mkdir(parents=True, exist_ok=True)

            for file in file_list:
                dest = split_category_dir / file.name
                shutil.copy2(file, dest)

            counts[split_type][category] = len(file_list)

    return counts