mdlearn.data.datasets.contact_map
ContactMap Dataset.
Classes
ContactMapDataset | PyTorch Dataset class which stores sparse contact matrix data in memory.
ContactMapHDF5Dataset | PyTorch Dataset class to load contact matrix data from HDF5 format.
- class mdlearn.data.datasets.contact_map.ContactMapDataset(*args: Any, **kwargs: Any)
PyTorch Dataset class which stores sparse contact matrix data in memory.
- __init__(data: numpy.ndarray, shape: Tuple[int, int, int], scalars: Dict[str, numpy.ndarray] = {}, scalar_requires_grad: bool = False)
- Parameters
data (np.ndarray) – Input contact matrices in sparse COO format, given as an array of shape (N,) where N is the number of data examples and each entry is a ragged (variable-length) index vector. The row and column index vectors should be concatenated; the values are assumed to be 1 and don’t need to be explicitly passed.
shape (Tuple[int, int, int]) – Shape of the contact map (1, D, D) where D is the number of rows and columns.
scalars (Dict[str, np.ndarray], default={}) – Dictionary of scalar arrays. For instance, the root mean squared deviation (RMSD) for each contact matrix can be passed via {"rmsd": np.array(...)}. The length of each scalar array should match the number of input contact matrices N.
scalar_requires_grad (bool, default=False) – Sets the requires_grad torch.Tensor parameter for the scalars specified by scalars. Set to True to use scalars for multi-task learning. If scalars are only required for plotting, leave it as False.
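The data layout described above can be illustrated with a minimal NumPy-only sketch. It assumes the concatenation order is row indices followed by column indices, which the parameter description does not state explicitly; the helper names dense_to_concat_coo and concat_coo_to_dense are hypothetical and not part of mdlearn.

```python
import numpy as np


def dense_to_concat_coo(contact_maps):
    """Encode binary (N, D, D) contact maps as a ragged (N,) object array."""
    encoded = np.empty(len(contact_maps), dtype=object)
    for i, cmap in enumerate(contact_maps):
        rows, cols = np.nonzero(cmap)
        # Assumed order: row indices first, then column indices.
        # Values are implicitly 1, so only the indices are stored.
        encoded[i] = np.concatenate([rows, cols]).astype(np.int16)
    return encoded


def concat_coo_to_dense(indices, d):
    """Decode one concatenated index vector back into a dense (D, D) map."""
    half = len(indices) // 2
    rows, cols = indices[:half], indices[half:]
    dense = np.zeros((d, d), dtype=np.float32)
    dense[rows, cols] = 1.0
    return dense


rng = np.random.default_rng(0)
# Four random binary 6x6 contact maps standing in for real data.
maps = (rng.random((4, 6, 6)) < 0.3).astype(np.float32)

data = dense_to_concat_coo(maps)  # ragged array of shape (N,)
scalars = {"rmsd": rng.random(4).astype(np.float32)}  # one scalar per example
```

Under these assumptions, data, a shape of (1, 6, 6), and scalars would be the three arguments passed to ContactMapDataset.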
- class mdlearn.data.datasets.contact_map.ContactMapHDF5Dataset(*args: Any, **kwargs: Any)
PyTorch Dataset class to load contact matrix data from HDF5 format.
- __init__(path: Union[str, pathlib.Path], shape: Tuple[int, ...], dataset_name: str = 'contact_map', scalar_dset_names: List[str] = [], values_dset_name: Optional[str] = None, scalar_requires_grad: bool = False, in_memory: bool = True)
- Parameters
path (PathLike) – Path to HDF5 file containing contact matrices.
shape (Tuple[int, …]) – Shape of the contact matrices required by the model, either (H, W) or (1, H, W).
dataset_name (str) – Name of contact map dataset in HDF5 file.
scalar_dset_names (List[str]) – List of scalar dataset names inside HDF5 file to be passed to training logs.
values_dset_name (str, optional) – Name of the HDF5 dataset field containing optional values for the entries of the distance/contact matrix. By default, all values are assumed to be 1, corresponding to a binary contact map, and are created on the fly.
scalar_requires_grad (bool) – Sets the requires_grad torch.Tensor parameter for the scalars specified by scalar_dset_names. Set to True to use scalars for multi-task learning. If scalars are only required for plotting, leave it as False.
in_memory (bool) – If True, pull data stored in HDF5 from disk to numpy arrays. Otherwise, read each batch from HDF5 on the fly.
Examples
>>> dataset = ContactMapHDF5Dataset("contact_maps.h5", (28, 28))
>>> dataset[0]
{'X': torch.Tensor(..., dtype=float32), 'index': tensor(0)}
>>> dataset[0]["X"].shape
(28, 28)

>>> dataset = ContactMapHDF5Dataset("contact_maps.h5", (28, 28), scalar_dset_names=["rmsd"])
>>> dataset[0]
{'X': torch.Tensor(..., dtype=float32), 'index': tensor(0), 'rmsd': tensor(8.7578, dtype=torch.float16)}
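For readers who want a file to experiment with, the following is a hedged sketch of how an HDF5 file with a ragged contact_map dataset and an rmsd scalar dataset might be written and read back with h5py. The on-disk layout (variable-length int16 vectors holding row indices followed by column indices) is an assumption, not something this page confirms. The two reads at the end only illustrate the difference between loading everything at once and reading per item, which is the distinction the in_memory flag describes.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 28
# Random binary contact maps standing in for real simulation data.
maps = rng.random((n, d, d)) < 0.1

path = os.path.join(tempfile.mkdtemp(), "contact_maps.h5")
with h5py.File(path, "w") as f:
    # Variable-length dtype so each example can hold a different
    # number of contacts (the "ragged" dimension).
    vlen_int16 = h5py.vlen_dtype(np.dtype("int16"))
    dset = f.create_dataset("contact_map", (n,), dtype=vlen_int16)
    for i, cmap in enumerate(maps):
        rows, cols = np.nonzero(cmap)
        # Assumed layout: row indices followed by column indices.
        dset[i] = np.concatenate([rows, cols]).astype("int16")
    # One scalar per example, stored under a hypothetical "rmsd" name.
    f.create_dataset("rmsd", data=rng.random(n).astype("float16"))

with h5py.File(path, "r") as f:
    # Pull the whole dataset from disk into NumPy arrays at once.
    all_examples = f["contact_map"][...]
    # Read a single example directly from the file instead.
    first_example = f["contact_map"][0]
    rmsd = f["rmsd"][...]
```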