mdlearn.data.datasets.contact_map

ContactMap Dataset.

Classes

ContactMapDataset(*args, **kwargs)

PyTorch Dataset class which stores sparse contact matrix data in memory.

ContactMapHDF5Dataset(*args, **kwargs)

PyTorch Dataset class to load contact matrix data from HDF5 format.

class mdlearn.data.datasets.contact_map.ContactMapDataset(*args: Any, **kwargs: Any)

PyTorch Dataset class which stores sparse contact matrix data in memory.

__init__(data: numpy.ndarray, shape: Tuple[int, int, int], scalars: Dict[str, numpy.ndarray] = {}, scalar_requires_grad: bool = False)
Parameters
  • data (np.ndarray) – Input contact matrices in sparse COO format, given as an array of shape (N,) where N is the number of data examples and each element is a variable-length (ragged) index vector. The row and column index vectors should be concatenated; the values are assumed to be 1 and don’t need to be explicitly passed.

  • shape (Tuple[int, int, int]) – Shape of the contact map (1, D, D) where D is the number of rows and columns.

  • scalars (Dict[str, np.ndarray], default={}) – Dictionary of scalar arrays. For instance, the root mean squared deviation (RMSD) for each contact matrix can be passed via {"rmsd": np.array(...)}. The length of each scalar array should match the number of input contact matrices N.

  • scalar_requires_grad (bool, default=False) – Sets the requires_grad torch.Tensor parameter for the scalars specified by scalars. Set to True to use the scalars for multi-task learning. If the scalars are only required for plotting, leave it as False.
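
The layout described for data can be illustrated with a minimal NumPy sketch. It assumes a dense binary contact map is available for each example; the helper name encode_contact_maps and the toy values are hypothetical and not part of mdlearn.

```python
import numpy as np


def encode_contact_maps(dense_maps):
    """Hypothetical helper: pack binary (D, D) contact maps into a ragged
    object array of shape (N,), one [rows..., cols...] vector per example."""
    data = np.empty(len(dense_maps), dtype=object)
    for i, cmap in enumerate(dense_maps):
        rows, cols = np.nonzero(cmap)  # coordinates of the contacts
        # Values are implicitly 1, so only the indices are stored.
        data[i] = np.concatenate([rows, cols])
    return data


# Two toy examples for a 4-residue system (D = 4).
frame_a = np.eye(4, dtype=np.uint8)
frame_b = np.eye(4, dtype=np.uint8)
frame_b[0, 3] = frame_b[3, 0] = 1

data = encode_contact_maps([frame_a, frame_b])
shape = (1, 4, 4)                         # (1, D, D), as described for `shape`
scalars = {"rmsd": np.array([1.2, 3.4])}  # one made-up scalar per example

print(data.shape)                   # number of examples, N
print(len(data[0]), len(data[1]))   # ragged: twice the number of contacts each

# Following the signature above, these would then be passed as
# ContactMapDataset(data, shape, scalars=scalars).
```

Storing only the concatenated indices keeps memory proportional to the number of contacts rather than to D * D, which is presumably why the values are left implicit.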

class mdlearn.data.datasets.contact_map.ContactMapHDF5Dataset(*args: Any, **kwargs: Any)

PyTorch Dataset class to load contact matrix data from HDF5 format.

__init__(path: Union[str, pathlib.Path], shape: Tuple[int, ...], dataset_name: str = 'contact_map', scalar_dset_names: List[str] = [], values_dset_name: Optional[str] = None, scalar_requires_grad: bool = False, in_memory: bool = True)
Parameters
  • path (PathLike) – Path to HDF5 file containing contact matrices.

  • shape (Tuple[int, …]) – Shape of the contact matrices required by the model, either (H, W) or (1, H, W).

  • dataset_name (str) – Name of contact map dataset in HDF5 file.

  • scalar_dset_names (List[str]) – List of scalar dataset names inside HDF5 file to be passed to training logs.

  • values_dset_name (str, optional) – Name of the HDF5 dataset field containing optional values for the entries of the distance/contact matrix. By default, all values are assumed to be 1, corresponding to a binary contact map, and are created on the fly.

  • scalar_requires_grad (bool) – Sets the requires_grad torch.Tensor parameter for the scalars specified by scalar_dset_names. Set to True to use the scalars for multi-task learning. If the scalars are only required for plotting, leave it as False.

  • in_memory (bool) – If True, pull data stored in HDF5 from disk to numpy arrays. Otherwise, read each batch from HDF5 on the fly.
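
How shape and the optional values interact can be sketched in NumPy. This is an illustration only: it assumes each stored entry uses the same concatenated row/column index layout described for ContactMapDataset, and the helper name to_dense is hypothetical rather than part of mdlearn.

```python
import numpy as np


def to_dense(indices, shape, values=None):
    """Hypothetical helper: rebuild one dense matrix from concatenated
    [rows..., cols...] indices. With values=None, every stored entry is 1."""
    n = len(indices) // 2
    rows, cols = indices[:n], indices[n:]
    if values is None:
        # Binary contact map: the ones are created on the fly.
        values = np.ones(n, dtype=np.float32)
    dense = np.zeros(shape[-2:], dtype=np.float32)
    dense[rows, cols] = values
    return dense.reshape(shape)  # (H, W) or (1, H, W)


indices = np.array([0, 1, 2, 2, 1, 0])  # entries at (0, 2), (1, 1), (2, 0)

# Default behaviour: no values dataset, so each entry is 1.
binary = to_dense(indices, (3, 3))

# With explicit values, e.g. distances (made-up numbers).
weighted = to_dense(indices, (1, 3, 3), values=np.array([4.5, 0.0, 4.5]))

print(binary.shape, weighted.shape)
```
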

Examples

>>> dataset = ContactMapHDF5Dataset("contact_maps.h5", (28, 28))
>>> dataset[0]
{'X': torch.Tensor(..., dtype=float32), 'index': tensor(0)}
>>> dataset[0]["X"].shape
torch.Size([28, 28])
>>> dataset = ContactMapHDF5Dataset("contact_maps.h5", (28, 28), scalar_dset_names=["rmsd"])
>>> dataset[0]
{'X': torch.Tensor(..., dtype=float32), 'index': tensor(0), 'rmsd': tensor(8.7578, dtype=torch.float16)}