mdlearn.data.datasets.point_cloud
PointCloud Dataset.
Classes
CenterOfMassTransform | |
PointCloudDataset | PyTorch Dataset class to load point cloud data. |
PointCloudDatasetInMemory | PyTorch Dataset class to load point cloud data. |
- class mdlearn.data.datasets.point_cloud.CenterOfMassTransform(data: numpy.typing.ArrayLike)
- __init__(data: numpy.typing.ArrayLike) None
Computes center of mass transformation
- Parameters:
data (ArrayLike) – Dataset of positions with shape (num_examples, 3, num_points).
- transform(x: numpy.typing.ArrayLike) numpy.typing.ArrayLike
Normalize example by bias and scale factors
- Parameters:
x (ArrayLike) – Data to transform, with shape (3, num_points). Modifies x in place.
- Returns:
ArrayLike – The transformed data
- Raises:
ValueError – If NaN encountered in input
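The behavior documented above can be sketched in plain NumPy. This is a minimal sketch, not mdlearn's implementation: the class name `SketchCenterOfMassTransform`, the use of the global minimum as the bias, the inverse range as the scale, and the order of the NaN check are all assumptions made for illustration.

```python
import numpy as np


class SketchCenterOfMassTransform:
    """Hypothetical stand-in for CenterOfMassTransform (assumed behavior)."""

    def __init__(self, data: np.ndarray) -> None:
        # data has shape (num_examples, 3, num_points).
        data = np.asarray(data, dtype=np.float64)
        # Center of mass of each example: mean over the points axis.
        centered = data - data.mean(axis=2, keepdims=True)
        # Assumed normalization: map the centered dataset onto [0, 1].
        self.bias = centered.min()
        self.scale = 1.0 / (centered.max() - self.bias)

    def transform(self, x: np.ndarray) -> np.ndarray:
        # x has shape (3, num_points) and is centered in place.
        if np.isnan(x).any():
            raise ValueError("NaN encountered in input")
        x -= x.mean(axis=1, keepdims=True)
        return (x - self.bias) * self.scale


rng = np.random.default_rng(0)
positions = rng.normal(size=(10, 3, 28))
cms = SketchCenterOfMassTransform(positions)
example = positions[0].copy()  # copy, because transform modifies its input
out = cms.transform(example)
```

Passing a copy keeps the original array untouched, which matters because the documented `transform` modifies `x`.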
- class mdlearn.data.datasets.point_cloud.PointCloudDataset(*args: Any, **kwargs: Any)
PyTorch Dataset class to load point cloud data. Optionally, uses HDF5 files to only read into memory what is necessary for one batch.
- __init__(path: str | Path, num_points: int, num_features: int = 0, dataset_name: str = 'point_cloud', scalar_dset_names: list[str] = [], seed: int = numpy.random.default_rng.integers, cms_transform: bool = False, scalar_requires_grad: bool = False, in_memory: bool = True)
- Parameters:
path (Union[str, Path]) – Path to HDF5 file containing data set.
dataset_name (str) – Name of the point cloud data in the HDF5 file.
scalar_dset_names (List[str]) – List of scalar dataset names inside HDF5 file to be passed to training logs.
num_points (int) – Number of points per sample. Should be less than or equal to the total number of points.
num_features (int) – Number of additional per-point features in addition to xyz coords.
seed (int) – Seed for the RNG for the splitting. Make sure it is the same for all workers reading from the same file.
cms_transform (bool) – If True, subtract center of mass from batch and shift and scale batch by the full dataset statistics.
scalar_requires_grad (bool) – Sets requires_grad torch.Tensor parameter for scalars specified by scalar_dset_names. Set to True to use scalars for multi-task learning. If scalars are only required for plotting, then set it to False.
in_memory (bool) – If True, pull data stored in HDF5 from disk to numpy arrays. Otherwise, read each batch from HDF5 on the fly.
Examples
>>> dataset = PointCloudDataset("point_clouds.h5", 28)
>>> dataset[0]
{'X': torch.Tensor(..., dtype=float32), 'index': tensor(0)}
>>> dataset[0]["X"].shape
torch.Size([3, 28])
>>> dataset = PointCloudDataset("point_clouds.h5", 28, 1)
>>> dataset[0]["X"].shape
torch.Size([4, 28])
>>> dataset = PointCloudDataset("point_clouds.h5", 28, scalar_dset_names=["rmsd"])
>>> dataset[0]
{'X': torch.Tensor(..., dtype=float32), 'index': tensor(0), 'rmsd': tensor(8.7578, dtype=torch.float16)}
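The advice to use the same seed for all workers can be illustrated with NumPy's `Generator`. This is a minimal sketch under an assumption the documentation does not confirm, namely that the RNG draws a set of indices. `pick_indices` is a hypothetical helper and is not part of mdlearn.

```python
import numpy as np


def pick_indices(seed: int, total: int, count: int) -> np.ndarray:
    """Hypothetical helper: draw `count` distinct indices out of `total`."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(total, size=count, replace=False))


# Two "workers" that build their own generator from the same seed
# agree on the selection without communicating.
worker_a = pick_indices(seed=333, total=100, count=28)
worker_b = pick_indices(seed=333, total=100, count=28)
same = np.array_equal(worker_a, worker_b)
```

If the workers used different seeds, each could draw a different selection from the same file, which is the situation the `seed` note warns against.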
- property point_cloud_size: tuple[int, int]
- class mdlearn.data.datasets.point_cloud.PointCloudDatasetInMemory(*args: Any, **kwargs: Any)
PyTorch Dataset class to load point cloud data. Optionally, uses HDF5 files to only read into memory what is necessary for one batch.
- __init__(data: numpy.typing.ArrayLike, scalars: dict[str, numpy.typing.ArrayLike] = {}, cms_transform: bool = False, scalar_requires_grad: bool = False)
- Parameters:
data (ArrayLike) – Dataset of positions with shape (num_examples, 3, num_points)
scalars (dict[str, ArrayLike], default={}) – Dictionary of scalar arrays. For instance, the root mean squared deviation (RMSD) for each feature vector can be passed via {"rmsd": np.array(...)}. The length of each scalar array should match the number of input feature vectors N.
cms_transform (bool) – If True, subtract center of mass from batch and shift and scale batch by the full dataset statistics.
scalar_requires_grad (bool) – Sets requires_grad torch.Tensor parameter for the arrays passed in scalars. Set to True to use scalars for learning. If scalars are only required for plotting, then set it to False.
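The shape requirements on `data` and `scalars` can be checked before the dataset is constructed. This is a minimal sketch in NumPy. `check_inputs` is a hypothetical helper, not part of mdlearn, and the random arrays are placeholders for real coordinates and RMSD values.

```python
import numpy as np


def check_inputs(data: np.ndarray, scalars: dict[str, np.ndarray]) -> int:
    """Hypothetical helper: validate shapes and return the number of examples N."""
    if data.ndim != 3 or data.shape[1] != 3:
        raise ValueError(f"expected (num_examples, 3, num_points), got {data.shape}")
    num_examples = data.shape[0]
    for name, values in scalars.items():
        if len(values) != num_examples:
            raise ValueError(
                f"scalar {name!r} has {len(values)} entries, expected {num_examples}"
            )
    return num_examples


rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3, 28)).astype(np.float32)  # placeholder positions
scalars = {"rmsd": rng.uniform(0.0, 10.0, size=100).astype(np.float32)}
n = check_inputs(data, scalars)
# The arrays could then be passed as PointCloudDatasetInMemory(data, scalars=scalars).
```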