mdlearn.data.datasets.point_cloud

PointCloud Dataset.

Classes

CenterOfMassTransform(data)

Computes a center of mass transformation.

PointCloudDataset(*args, **kwargs)

PyTorch Dataset class to load point cloud data.

PointCloudDatasetInMemory(*args, **kwargs)

PyTorch Dataset class to load point cloud data held in memory as NumPy arrays.

class mdlearn.data.datasets.point_cloud.CenterOfMassTransform(data: numpy.ndarray)
__init__(data: numpy.ndarray) → None

Computes a center of mass transformation from the dataset.

Parameters

data (np.ndarray) – Dataset of positions with shape (num_examples, 3, num_points).

transform(x: numpy.ndarray) → numpy.ndarray

Normalize an example by the bias and scale factors computed from the dataset.

Parameters

x (np.ndarray) – Data to transform with shape (3, num_points). Modifies x in place.

Returns

np.ndarray – The transformed data.

Raises

ValueError – If a NaN is encountered in the input.
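
Examples

A minimal usage sketch, using the documented constructor and transform method (the shapes and random data here are illustrative, not part of the package):

>>> import numpy as np
>>> from mdlearn.data.datasets.point_cloud import CenterOfMassTransform
>>> data = np.random.rand(100, 3, 28).astype(np.float32)  # (num_examples, 3, num_points)
>>> cms = CenterOfMassTransform(data)  # computes bias and scale from the full dataset
>>> x = cms.transform(data[0].copy())  # copy first: transform modifies its input in place
>>> x.shape
(3, 28)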

class mdlearn.data.datasets.point_cloud.PointCloudDataset(*args: Any, **kwargs: Any)

PyTorch Dataset class to load point cloud data. Optionally uses HDF5 files to read into memory only what is necessary for one batch.

__init__(path: Union[str, pathlib.Path], num_points: int, num_features: int = 0, dataset_name: str = 'point_cloud', scalar_dset_names: List[str] = [], seed: int = 333, cms_transform: bool = False, scalar_requires_grad: bool = False, in_memory: bool = True)
Parameters
  • path (Union[str, Path]) – Path to HDF5 file containing data set.

  • dataset_name (str) – Name of the point cloud data in the HDF5 file.

  • scalar_dset_names (List[str]) – List of scalar dataset names inside HDF5 file to be passed to training logs.

  • num_points (int) – Number of points per sample. Should be less than or equal to the total number of points.

  • num_features (int) – Number of additional per-point features beyond the xyz coordinates.

  • seed (int) – Seed for the RNG used for splitting. Make sure it is the same for all workers reading from the same file.

  • cms_transform (bool) – If True, subtract the center of mass from each batch and shift and scale the batch by the full-dataset statistics.

  • scalar_requires_grad (bool) – Sets the requires_grad parameter of the torch.Tensor for the scalars specified by scalar_dset_names. Set to True to use the scalars for multi-task learning; if they are only needed for plotting, set it to False.

  • in_memory (bool) – If True, load the data stored in the HDF5 file from disk into NumPy arrays in memory. Otherwise, read each batch from the HDF5 file on the fly.

Examples

>>> dataset = PointCloudDataset("point_clouds.h5", 28)
>>> dataset[0]
{'X': torch.Tensor(..., dtype=float32), 'index': tensor(0)}
>>> dataset[0]["X"].shape
torch.Size([3, 28])
>>> dataset = PointCloudDataset("point_clouds.h5", 28, 1)
>>> dataset[0]["X"].shape
torch.Size([4, 28])
>>> dataset = PointCloudDataset("point_clouds.h5", 28, scalar_dset_names=["rmsd"])
>>> dataset[0]
{'X': torch.Tensor(..., dtype=float32), 'index': tensor(0), 'rmsd': tensor(8.7578, dtype=torch.float16)}
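
Since each sample is a dictionary of tensors, PyTorch's default collation stacks samples along a new batch dimension, so the dataset can be passed directly to a torch.utils.data.DataLoader. A minimal sketch (the file name follows the examples above; the batch size is illustrative and assumes the file holds at least 32 examples):

>>> from torch.utils.data import DataLoader
>>> dataset = PointCloudDataset("point_clouds.h5", 28)
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)
>>> batch = next(iter(loader))
>>> batch["X"].shape  # (batch_size, 3, num_points)
torch.Size([32, 3, 28])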
property point_cloud_size: Tuple[int, int]
class mdlearn.data.datasets.point_cloud.PointCloudDatasetInMemory(*args: Any, **kwargs: Any)

PyTorch Dataset class to load point cloud data that already resides in memory as NumPy arrays.

__init__(data: numpy.ndarray, scalars: Dict[str, numpy.ndarray] = {}, cms_transform: bool = False, scalar_requires_grad: bool = False)
Parameters
  • data (np.ndarray) – Dataset of positions with shape (num_examples, 3, num_points)

  • scalars (Dict[str, np.ndarray], default={}) – Dictionary of scalar arrays. For instance, the root mean squared deviation (RMSD) for each example can be passed via {"rmsd": np.array(...)}. The length of each scalar array should match the number of input examples, num_examples.

  • cms_transform (bool) – If True, subtract center of mass from batch and shift and scale batch by the full dataset statistics.

  • scalar_requires_grad (bool) – Sets the requires_grad parameter of the torch.Tensor for the scalars passed via scalars. Set to True to use the scalars for learning; if they are only needed for plotting, set it to False.
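
Examples

A minimal construction sketch, assuming this class yields the same sample dictionaries as PointCloudDataset (the random NumPy arrays stand in for real coordinates):

>>> import numpy as np
>>> from mdlearn.data.datasets.point_cloud import PointCloudDatasetInMemory
>>> data = np.random.rand(100, 3, 28).astype(np.float32)  # (num_examples, 3, num_points)
>>> scalars = {"rmsd": np.random.rand(100).astype(np.float32)}  # one scalar per example
>>> dataset = PointCloudDatasetInMemory(data, scalars)
>>> dataset[0]["X"].shape
torch.Size([3, 28])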