mdlearn.data.datasets.point_cloud

PointCloud Dataset.

Classes

CenterOfMassTransform(data)

Computes a center of mass transformation.

PointCloudDataset(*args, **kwargs)

PyTorch Dataset class to load point cloud data.

PointCloudDatasetInMemory(*args, **kwargs)

PyTorch Dataset class to load point cloud data held in memory as NumPy arrays.

class mdlearn.data.datasets.point_cloud.CenterOfMassTransform(data: numpy.ndarray)
__init__(data: numpy.ndarray) → None

Computes a center of mass transformation from the dataset.

Parameters

data (np.ndarray) – Dataset of positions with shape (num_examples, 3, num_points).

transform(x: numpy.ndarray) → numpy.ndarray

Normalize an example by the bias and scale factors computed from the dataset.

Parameters

x (np.ndarray) – Data to transform with shape (3, num_points). Modifies x in place.

Returns

np.ndarray – The transformed data.

Raises

ValueError – If a NaN is encountered in the input.
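
Examples

A minimal usage sketch, using the documented constructor and transform method (the shapes and random data here are illustrative, not part of the package):

>>> import numpy as np
>>> from mdlearn.data.datasets.point_cloud import CenterOfMassTransform
>>> data = np.random.rand(100, 3, 28).astype(np.float32)  # (num_examples, 3, num_points)
>>> cms = CenterOfMassTransform(data)  # computes bias and scale from the full dataset
>>> x = cms.transform(data[0].copy())  # copy first: transform modifies its input in place
>>> x.shape
(3, 28)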

class mdlearn.data.datasets.point_cloud.PointCloudDataset(*args: Any, **kwargs: Any)

PyTorch Dataset class to load point cloud data. Optionally uses HDF5 files to read into memory only what is necessary for one batch.

__init__(path: Union[str, pathlib.Path], num_points: int, num_features: int = 0, dataset_name: str = 'point_cloud', scalar_dset_names: List[str] = [], seed: int = 333, cms_transform: bool = False, scalar_requires_grad: bool = False, in_memory: bool = True)
Parameters
  • path (Union[str, Path]) – Path to HDF5 file containing data set.

  • dataset_name (str) – Name of the point cloud data in the HDF5 file.

  • scalar_dset_names (List[str]) – List of scalar dataset names inside HDF5 file to be passed to training logs.

  • num_points (int) – Number of points per sample. Should be less than or equal to the total number of points.

  • num_features (int) – Number of additional per-point features beyond the xyz coordinates.

  • seed (int) – Seed for the RNG used for splitting. Make sure it is the same for all workers reading from the same file.

  • cms_transform (bool) – If True, subtract the center of mass from each batch and shift and scale the batch by the full-dataset statistics.

  • scalar_requires_grad (bool) – Sets the requires_grad parameter of the torch.Tensor for the scalars specified by scalar_dset_names. Set to True to use the scalars for multi-task learning; if they are only needed for plotting, set it to False.

  • in_memory (bool) – If True, load the data stored in the HDF5 file from disk into NumPy arrays in memory. Otherwise, read each batch from the HDF5 file on the fly.

Examples

>>> dataset = PointCloudDataset("point_clouds.h5", 28)
>>> dataset[0]
{'X': torch.Tensor(..., dtype=float32), 'index': tensor(0)}
>>> dataset[0]["X"].shape
torch.Size([3, 28])
>>> dataset = PointCloudDataset("point_clouds.h5", 28, 1)
>>> dataset[0]["X"].shape
torch.Size([4, 28])
>>> dataset = PointCloudDataset("point_clouds.h5", 28, scalar_dset_names=["rmsd"])
>>> dataset[0]
{'X': torch.Tensor(..., dtype=float32), 'index': tensor(0), 'rmsd': tensor(8.7578, dtype=torch.float16)}
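
Since each sample is a dictionary of tensors, PyTorch's default collation stacks samples along a new batch dimension, so the dataset can be passed directly to a torch.utils.data.DataLoader. A minimal sketch (the file name follows the examples above; the batch size is illustrative and assumes the file holds at least 32 examples):

>>> from torch.utils.data import DataLoader
>>> dataset = PointCloudDataset("point_clouds.h5", 28)
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)
>>> batch = next(iter(loader))
>>> batch["X"].shape  # (batch_size, 3, num_points)
torch.Size([32, 3, 28])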
property point_cloud_size: Tuple[int, int]
class mdlearn.data.datasets.point_cloud.PointCloudDatasetInMemory(*args: Any, **kwargs: Any)

PyTorch Dataset class to load point cloud data that already resides in memory as NumPy arrays.

__init__(data: numpy.ndarray, scalars: Dict[str, numpy.ndarray] = {}, cms_transform: bool = False, scalar_requires_grad: bool = False)
Parameters
  • data (np.ndarray) – Dataset of positions with shape (num_examples, 3, num_points)

  • scalars (Dict[str, np.ndarray], default={}) – Dictionary of scalar arrays. For instance, the root mean squared deviation (RMSD) for each example can be passed via {"rmsd": np.array(...)}. The length of each scalar array should match the number of input examples, num_examples.

  • cms_transform (bool) – If True, subtract center of mass from batch and shift and scale batch by the full dataset statistics.

  • scalar_requires_grad (bool) – Sets the requires_grad parameter of the torch.Tensor for the scalars passed via scalars. Set to True to use the scalars for learning; if they are only needed for plotting, set it to False.
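
Examples

A minimal construction sketch, assuming this class yields the same sample dictionaries as PointCloudDataset (the random NumPy arrays stand in for real coordinates):

>>> import numpy as np
>>> from mdlearn.data.datasets.point_cloud import PointCloudDatasetInMemory
>>> data = np.random.rand(100, 3, 28).astype(np.float32)  # (num_examples, 3, num_points)
>>> scalars = {"rmsd": np.random.rand(100).astype(np.float32)}  # one scalar per example
>>> dataset = PointCloudDatasetInMemory(data, scalars)
>>> dataset[0]["X"].shape
torch.Size([3, 28])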