mdlearn.nn.models.vae.symmetric_conv2d_vae

Classes

SymmetricConv2dVAE(*args, **kwargs)

Convolutional variational autoencoder from the "Deep clustering of protein folding simulations" paper.

SymmetricConv2dVAETrainer(input_shape[, ...])

Trainer class to fit a convolutional variational autoencoder to a set of contact maps.

class mdlearn.nn.models.vae.symmetric_conv2d_vae.SymmetricConv2dVAE(*args: Any, **kwargs: Any)

Convolutional variational autoencoder from the “Deep clustering of protein folding simulations” paper. Inherits from mdlearn.nn.models.vae.VAE.

__init__(input_shape: Tuple[int, int, int], init_weights: Optional[str] = None, filters: List[int] = [64, 64, 64], kernels: List[int] = [3, 3, 3], strides: List[int] = [1, 2, 1], affine_widths: List[int] = [128], affine_dropouts: List[float] = [0.0], latent_dim: int = 3, activation: str = 'ReLU', output_activation: str = 'Sigmoid')
Parameters
  • input_shape (Tuple[int, int, int]) – Dimensions of the input image, given as (1, height, width).

  • init_weights (Optional[str]) – Path to a .pt weights file to initialize the weights with.

  • filters (List[int]) – Convolutional filter dimensions.

  • kernels (List[int]) – Convolutional kernel dimensions (assumes square kernel).

  • strides (List[int]) – Convolutional stride lengths (assumes square strides).

  • affine_widths (List[int]) – Number of neurons in each linear layer.

  • affine_dropouts (List[float]) – Dropout probability for each linear layer. Dropout value of 0.0 will skip adding the dropout layer.

  • latent_dim (int) – Latent dimension for \(\mu\) and \(\mathrm{logstd}\) layers.

  • activation (str) – Activation function to use between convolutional and linear layers.

  • output_activation (str) – Output activation function for last decoder layer.
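To see how kernels and strides interact, the sketch below traces the spatial size of a feature map through the default three convolutional layers using the standard convolution output-size formula. The padding value of 1 and both helper names are assumptions made for illustration; the padding scheme actually used by the model is not stated in this reference.

```python
from typing import List, Tuple


def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard convolution output-size formula along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_feature_maps(
    input_shape: Tuple[int, int, int],
    filters: List[int],
    kernels: List[int],
    strides: List[int],
    padding: int = 1,  # assumption: not taken from the mdlearn source
) -> List[Tuple[int, int, int]]:
    """Return the (channels, height, width) shape after each conv layer."""
    _, height, width = input_shape
    shapes = []
    for n_filters, kernel, stride in zip(filters, kernels, strides):
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
        shapes.append((n_filters, height, width))
    return shapes


# Default filters, kernels, and strides applied to a hypothetical 28 x 28 input.
shapes = trace_feature_maps((1, 28, 28), [64, 64, 64], [3, 3, 3], [1, 2, 1])
print(shapes)  # [(64, 28, 28), (64, 14, 14), (64, 14, 14)]
```

Under these assumptions, only the stride-2 layer reduces the spatial resolution.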

forward(x: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]

Forward pass of variational autoencoder.

Parameters

x (torch.Tensor) – Input x data to encode and reconstruct.

Returns

  • torch.Tensor – \(z\)-latent space batch tensor.

  • torch.Tensor – recon_x reconstruction of x.
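As background for the \(z\) tensor returned above, the following is a minimal sketch of the standard VAE reparameterization step, which draws a latent sample from the outputs of the mu and logstd layers. It is written in plain Python for clarity; whether the library samples in exactly this way is an assumption, and the function name is hypothetical.

```python
import math
import random
from typing import List


def reparameterize(mu: List[float], logstd: List[float], rng: random.Random) -> List[float]:
    """Draw z = mu + exp(logstd) * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, logstd)]


rng = random.Random(42)
mu = [0.5, -1.0, 2.0]           # hypothetical encoder outputs, latent_dim = 3
logstd = [-50.0, -50.0, -50.0]  # near-zero standard deviation
z = reparameterize(mu, logstd, rng)
```

With a near-zero standard deviation, the sample collapses onto mu; larger logstd values spread the samples around it.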

class mdlearn.nn.models.vae.symmetric_conv2d_vae.SymmetricConv2dVAETrainer(input_shape: Tuple[int, int, int], filters: List[int] = [64, 64, 64], kernels: List[int] = [3, 3, 3], strides: List[int] = [1, 2, 1], affine_widths: List[int] = [128], affine_dropouts: List[float] = [0.0], latent_dim: int = 10, activation: str = 'ReLU', output_activation: str = 'Sigmoid', lambda_rec: float = 1.0, seed: int = 42, num_data_workers: int = 0, prefetch_factor: int = 2, split_pct: float = 0.8, split_method: str = 'random', batch_size: int = 128, shuffle: bool = True, device: str = 'cpu', optimizer_name: str = 'RMSprop', optimizer_hparams: Dict[str, Any] = {'lr': 0.001, 'weight_decay': 1e-05}, scheduler_name: Optional[str] = None, scheduler_hparams: Dict[str, Any] = {}, epochs: int = 100, verbose: bool = False, clip_grad_max_norm: float = 10.0, checkpoint_log_every: int = 10, plot_log_every: int = 10, plot_n_samples: int = 10000, plot_method: Optional[str] = None, train_subsample_pct: float = 1.0, valid_subsample_pct: float = 1.0, use_wandb: bool = False)

Trainer class to fit a convolutional variational autoencoder to a set of contact maps.

__init__(input_shape: Tuple[int, int, int], filters: List[int] = [64, 64, 64], kernels: List[int] = [3, 3, 3], strides: List[int] = [1, 2, 1], affine_widths: List[int] = [128], affine_dropouts: List[float] = [0.0], latent_dim: int = 10, activation: str = 'ReLU', output_activation: str = 'Sigmoid', lambda_rec: float = 1.0, seed: int = 42, num_data_workers: int = 0, prefetch_factor: int = 2, split_pct: float = 0.8, split_method: str = 'random', batch_size: int = 128, shuffle: bool = True, device: str = 'cpu', optimizer_name: str = 'RMSprop', optimizer_hparams: Dict[str, Any] = {'lr': 0.001, 'weight_decay': 1e-05}, scheduler_name: Optional[str] = None, scheduler_hparams: Dict[str, Any] = {}, epochs: int = 100, verbose: bool = False, clip_grad_max_norm: float = 10.0, checkpoint_log_every: int = 10, plot_log_every: int = 10, plot_n_samples: int = 10000, plot_method: Optional[str] = None, train_subsample_pct: float = 1.0, valid_subsample_pct: float = 1.0, use_wandb: bool = False)
Parameters
  • input_shape (Tuple[int, int, int]) – Dimensions of the input image, given as (1, height, width).

  • filters (List[int], default=[64, 64, 64]) – Convolutional filter dimensions.

  • kernels (List[int], default=[3, 3, 3]) – Convolutional kernel dimensions (assumes square kernel).

  • strides (List[int], default=[1, 2, 1]) – Convolutional stride lengths (assumes square strides).

  • affine_widths (List[int], default=[128]) – Number of neurons in each linear layer. Defines the shape of the autoencoder (does not include latent dimension). The encoder and decoder are symmetric.

  • affine_dropouts (List[float], default=[0.0]) – Dropout probability for each linear layer. Dropout value of 0.0 will skip adding the dropout layer.

  • latent_dim (int, default=10) – Latent dimension for \(\mu\) and \(\mathrm{logstd}\) layers.

  • activation (str, default="ReLU") – Activation function to use between convolutional and linear layers.

  • output_activation (str, default="Sigmoid") – Output activation function for last decoder layer.

  • lambda_rec (float, default=1.0) – Factor to scale reconstruction loss by during training such that loss = lambda_rec * recon_loss + kld_loss.

  • seed (int, default=42) – Random seed for torch, numpy, and random module.

  • num_data_workers (int, default=0) – How many subprocesses to use for data loading. 0 means that the data will be loaded in the main process.

  • prefetch_factor (int, default=2) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_data_workers batches prefetched across all workers.

  • split_pct (float, default=0.8) – Proportion of data set to use for training. The rest goes to validation.

  • split_method (str, default="random") – Method to split the data. Use "random" for a random split or "partition" for a simple partition.

  • batch_size (int, default=128) – Mini-batch size for training.

  • shuffle (bool, default=True) – Whether to shuffle training data or not.

  • device (str, default="cpu") – Training hardware to use: either cpu, or cuda for GPU devices.

  • optimizer_name (str, default="RMSprop") – Name of the PyTorch optimizer to use. Matches PyTorch optimizer class name.

  • optimizer_hparams (Dict[str, Any], default={“lr”: 0.001, “weight_decay”: 0.00001}) – Dictionary of hyperparameters to pass to the chosen PyTorch optimizer.

  • scheduler_name (Optional[str], default=None) – Name of the PyTorch learning rate scheduler to use. Matches PyTorch learning rate scheduler class name.

  • scheduler_hparams (Dict[str, Any], default={}) – Dictionary of hyperparameters to pass to the chosen PyTorch learning rate scheduler.

  • epochs (int, default=100) – Number of epochs to train for.

  • verbose (bool, default=False) – If True, will print training and validation loss at each epoch.

  • clip_grad_max_norm (float, default=10.0) – Max norm of the gradients for gradient clipping. For more information, see the torch.nn.utils.clip_grad_norm_ documentation.

  • checkpoint_log_every (int, default=10) – Epoch interval to log a checkpoint file containing the model weights, optimizer, and scheduler parameters.

  • plot_log_every (int, default=10) – Epoch interval to log a visualization plot of the latent space.

  • plot_n_samples (int, default=10000) – Number of validation samples to use for plotting.

  • plot_method (Optional[str], default=None) – The method for visualizing the latent space. Set plot_method=None to skip visualization. If using "TSNE", it will attempt to use the RAPIDS.ai GPU implementation and will fall back to the sklearn CPU implementation if RAPIDS.ai is unavailable. A fast alternative is to plot the raw embeddings (or up to the first 3 dimensions if D > 3) using "raw".

  • train_subsample_pct (float, default=1.0) – Fraction of training data (between 0 and 1) to use during hyperparameter sweeps.

  • valid_subsample_pct (float, default=1.0) – Fraction of validation data (between 0 and 1) to use during hyperparameter sweeps.

  • use_wandb (bool, default=False) – If True, will log results to wandb. Metric keys include “train_loss”, “train_recon_loss”, “train_kld_loss”, “valid_loss”, “valid_recon_loss” and “valid_kld_loss”.
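The lambda_rec weighting described above can be sketched in a few lines. The KL term below is the standard closed form for a diagonal Gaussian against a unit Gaussian, expressed in terms of mu and logstd; it is an assumption that the library computes its kld_loss with this exact form and reduction, and both function names are hypothetical.

```python
import math
from typing import List


def kld_loss(mu: List[float], logstd: List[float]) -> float:
    """KL divergence between N(mu, std^2) and N(0, 1), summed over dimensions.

    Per dimension: 0.5 * (mu^2 + std^2 - 1) - log(std), with std = exp(logstd).
    """
    return sum(
        0.5 * (m * m + math.exp(2.0 * ls) - 1.0) - ls
        for m, ls in zip(mu, logstd)
    )


def total_loss(recon_loss: float, kld: float, lambda_rec: float = 1.0) -> float:
    """Combine the two terms as loss = lambda_rec * recon_loss + kld_loss."""
    return lambda_rec * recon_loss + kld


# A posterior equal to the prior has zero KL divergence.
kld_prior = kld_loss([0.0, 0.0], [0.0, 0.0])
# Hypothetical loss values, chosen only to show the weighting.
loss = total_loss(recon_loss=2.0, kld=0.5, lambda_rec=3.0)
```

Raising lambda_rec emphasizes reconstruction quality relative to the KL regularization of the latent space.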

Raises
  • ValueError – split_pct should be between 0 and 1.

  • ValueError – train_subsample_pct should be between 0 and 1.

  • ValueError – valid_subsample_pct should be between 0 and 1.

  • ValueError – Specified device as cuda, but it is unavailable.
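The range check on split_pct and the two split_method options can be illustrated with a small standalone sketch. The function below is hypothetical and is not the library's implementation; in particular, treating the bounds as exclusive and rejecting unknown method names with a ValueError are assumptions.

```python
import random
from typing import List, Tuple


def split_indices(
    n: int, split_pct: float = 0.8, method: str = "random", seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Split range(n) into training and validation index lists."""
    if not 0 < split_pct < 1:  # assumption: bounds are exclusive
        raise ValueError("split_pct should be between 0 and 1.")
    if method not in ("random", "partition"):
        raise ValueError(f"Unknown split method: {method}")
    indices = list(range(n))
    if method == "random":
        # Shuffle with a seeded generator so the split is reproducible.
        random.Random(seed).shuffle(indices)
    n_train = int(n * split_pct)
    # "partition" keeps the original order: first block trains, rest validates.
    return indices[:n_train], indices[n_train:]


train_idx, valid_idx = split_indices(10, split_pct=0.8, method="partition")
```

For time-ordered simulation frames, a "partition" split keeps the validation frames contiguous, while a "random" split mixes frames from the whole trajectory.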

Examples

For an accompanying example, see: https://github.com/ramanathanlab/mdlearn/tree/main/examples/symmetric_conv2d_vae/training.

fit(X: numpy.ndarray, scalars: Dict[str, numpy.ndarray] = {}, output_path: Union[str, pathlib.Path] = './', checkpoint: Optional[Union[str, pathlib.Path]] = None)

Trains the autoencoder on the input data X.

Parameters
  • X (np.ndarray) – Input contact matrices in sparse COO format of shape (N,) where N is the number of data examples, and the empty dimension is ragged. The row and column index vectors should be concatenated; the values are assumed to be 1 and don’t need to be explicitly passed.

  • scalars (Dict[str, np.ndarray], default={}) – Dictionary of scalar arrays. For instance, the root mean squared deviation (RMSD) for each feature vector can be passed via {"rmsd": np.array(...)}. The dimension of each scalar array should match the number of input feature vectors N.

  • output_path (PathLike, default="./") – Path to write training results to. Makes an output_path/checkpoints folder to save model checkpoint files, and output_path/plots folder to store latent space visualizations.

  • checkpoint (Optional[PathLike], default=None) – Path to a specific model checkpoint file to restore training.
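The ragged COO layout expected for X can be built from dense binary contact maps as sketched below. The helper name to_ragged_coo is hypothetical, and the integer dtype produced by np.nonzero is an assumption; the library may expect a specific index dtype.

```python
import numpy as np


def to_ragged_coo(contact_maps: np.ndarray) -> np.ndarray:
    """Convert dense binary maps of shape (N, H, W) to a ragged (N,) object array.

    Each entry holds the row indices followed by the column indices of the
    nonzero contacts; the values (all 1) are not stored.
    """
    X = np.empty(len(contact_maps), dtype=object)
    for i, contact_map in enumerate(contact_maps):
        rows, cols = np.nonzero(contact_map)
        X[i] = np.concatenate([rows, cols])
    return X


# Two hypothetical 2 x 2 contact maps with different numbers of contacts.
dense = np.array([
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
])
X = to_ragged_coo(dense)
```

An array built this way has shape (N,), and each entry has twice as many elements as the corresponding map has contacts.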

Raises
  • TypeError – If scalars is not type dict. A common error is to pass output_path as the second argument.

  • NotImplementedError – If using a learning rate scheduler other than ReduceLROnPlateau, a step function will need to be implemented.
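The TypeError above typically appears when output_path is passed positionally in place of scalars. A minimal sketch of such a check follows; the function is hypothetical, and the additional length check with its ValueError is an assumption based on the requirement that each scalar array match the number of inputs N.

```python
from typing import Any


def check_scalars(scalars: Any, n_examples: int) -> None:
    """Validate that scalars is a dict of arrays with one value per example."""
    if not isinstance(scalars, dict):
        raise TypeError(
            "scalars should be of type dict. A common error is to pass "
            "output_path as the second argument."
        )
    for name, values in scalars.items():
        if len(values) != n_examples:  # assumption: mismatch is an error
            raise ValueError(f"scalars['{name}'] should have length {n_examples}.")


check_scalars({"rmsd": [0.1, 0.2]}, n_examples=2)  # passes silently
```

Passing output_path by keyword, as in fit(X, output_path="./run"), avoids the mistake entirely.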

predict(X: numpy.ndarray, inference_batch_size: int = 128, checkpoint: Optional[Union[str, pathlib.Path]] = None) → Tuple[numpy.ndarray, float, float, float]

Predicts latent embeddings and losses for the input data X using the trained autoencoder.

Parameters
  • X (np.ndarray) – Input contact matrices in sparse COO format of shape (N,) where N is the number of data examples, and the empty dimension is ragged. The row and column index vectors should be concatenated; the values are assumed to be 1 and don’t need to be explicitly passed.

  • inference_batch_size (int, default=128) – The batch size for inference.

  • checkpoint (Optional[PathLike], default=None) – Path to a specific model checkpoint file.

Returns

Tuple[np.ndarray, float, float, float] – The z latent vectors corresponding to the input data X and the average losses [total, reconstruction, KL-divergence].
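To illustrate the role of inference_batch_size, the sketch below splits N examples into consecutive batches, with a smaller final batch when N is not a multiple of the batch size. The generator name is hypothetical, and how the library batches inputs internally is an assumption.

```python
from typing import Iterator


def batch_slices(n: int, batch_size: int = 128) -> Iterator[slice]:
    """Yield consecutive slices that cover range(n) in chunks of batch_size."""
    for start in range(0, n, batch_size):
        yield slice(start, min(start + batch_size, n))


# 300 hypothetical examples with the default batch size of 128.
slices = list(batch_slices(300, batch_size=128))
```

A larger inference_batch_size reduces the number of forward passes at the cost of more memory per pass.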