mdlearn.data.datasets.point_cloud
PointCloud Dataset.
Classes
- CenterOfMassTransform – Computes center of mass transformation.
- PointCloudDataset – PyTorch Dataset class to load point cloud data.
- PointCloudDatasetInMemory – PyTorch Dataset class to load point cloud data.
- class mdlearn.data.datasets.point_cloud.CenterOfMassTransform(data: numpy.ndarray)
- __init__(data: numpy.ndarray) → None
Computes center of mass transformation.
- Parameters
data (np.ndarray) – Dataset of positions with shape (num_examples, 3, num_points).
- transform(x: numpy.ndarray) → numpy.ndarray
Normalize an example by the bias and scale factors.
- Parameters
x (np.ndarray) – Data to transform, with shape (3, num_points). Modifies x.
- Returns
np.ndarray – The transformed data.
- Raises
ValueError – If a NaN is encountered in the input.
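A minimal NumPy sketch of this kind of transform. The convention here (subtract each example's center of mass, then scale by a global factor computed from the full dataset) is an assumption for illustration, not necessarily mdlearn's exact bias/scale computation:

```python
import numpy as np

# Illustrative center-of-mass transform (assumed convention, not the
# library's exact implementation).
data = np.random.rand(10, 3, 28)  # (num_examples, 3, num_points)

# Per-example center of mass, averaged over points
cms = data.mean(axis=2, keepdims=True)  # (10, 3, 1)
centered = data - cms

# Global scale factor so all coordinates fall within [-1, 1]
scale = np.abs(centered).max()
normalized = centered / scale

x = normalized[0]
print(x.shape)  # (3, 28)
```

A single example can then be transformed with the dataset-wide statistics, which is what allows per-sample transforms at batch time.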
- class mdlearn.data.datasets.point_cloud.PointCloudDataset(*args: Any, **kwargs: Any)
PyTorch Dataset class to load point cloud data. Optionally, uses HDF5 files to only read into memory what is necessary for one batch.
- __init__(path: Union[str, pathlib.Path], num_points: int, num_features: int = 0, dataset_name: str = 'point_cloud', scalar_dset_names: List[str] = [], seed: int = 333, cms_transform: bool = False, scalar_requires_grad: bool = False, in_memory: bool = True)
- Parameters
path (Union[str, Path]) – Path to HDF5 file containing data set.
dataset_name (str) – Name of the point cloud data in the HDF5 file.
scalar_dset_names (List[str]) – List of scalar dataset names inside HDF5 file to be passed to training logs.
num_points (int) – Number of points per sample. Must be less than or equal to the total number of points.
num_features (int) – Number of additional per-point features in addition to xyz coords.
seed (int) – Seed for the RNG used for splitting. It must be the same for all workers reading from the same file.
cms_transform (bool) – If True, subtract the center of mass from each batch, then shift and scale the batch by the full dataset statistics.
scalar_requires_grad (bool) – Sets the requires_grad parameter of the torch.Tensor for the scalars specified by scalar_dset_names. Set to True to use the scalars for multi-task learning; if they are only needed for plotting, set it to False.
in_memory (bool) – If True, pull the data stored in HDF5 from disk into NumPy arrays. Otherwise, read each batch from HDF5 on the fly.
Examples
>>> dataset = PointCloudDataset("point_clouds.h5", 28)
>>> dataset[0]
{'X': torch.Tensor(..., dtype=float32), 'index': tensor(0)}
>>> dataset[0]["X"].shape
torch.Size([3, 28])

>>> dataset = PointCloudDataset("point_clouds.h5", 28, 1)
>>> dataset[0]["X"].shape
torch.Size([4, 28])

>>> dataset = PointCloudDataset("point_clouds.h5", 28, scalar_dset_names=["rmsd"])
>>> dataset[0]
{'X': torch.Tensor(..., dtype=float32), 'index': tensor(0), 'rmsd': tensor(8.7578, dtype=torch.float16)}
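To try the examples above, an HDF5 file with the expected layout can be written with h5py. The layout below (a "point_cloud" array of shape (num_examples, 3, num_points) plus per-example scalar arrays such as "rmsd") is inferred from the parameters documented above, so treat it as a sketch rather than a guaranteed schema:

```python
import os
import tempfile

import h5py
import numpy as np

# Assumed layout: one point-cloud array plus optional scalar datasets,
# matching the dataset_name and scalar_dset_names parameters above.
num_examples, num_points = 100, 28
path = os.path.join(tempfile.mkdtemp(), "point_clouds.h5")

with h5py.File(path, "w") as f:
    f.create_dataset(
        "point_cloud",
        data=np.random.rand(num_examples, 3, num_points).astype(np.float32),
    )
    # One scalar value per example, e.g. an RMSD for training logs
    f.create_dataset("rmsd", data=np.random.rand(num_examples).astype(np.float32))

with h5py.File(path, "r") as f:
    shape = f["point_cloud"].shape
print(shape)  # (100, 3, 28)
```

With in_memory=False, the dataset reads individual examples from this file on the fly, which keeps memory usage low for large trajectories.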
- property point_cloud_size: Tuple[int, int]
- class mdlearn.data.datasets.point_cloud.PointCloudDatasetInMemory(*args: Any, **kwargs: Any)
PyTorch Dataset class to load point cloud data from in-memory NumPy arrays.
- __init__(data: numpy.ndarray, scalars: Dict[str, numpy.ndarray] = {}, cms_transform: bool = False, scalar_requires_grad: bool = False)
- Parameters
data (np.ndarray) – Dataset of positions with shape (num_examples, 3, num_points)
scalars (Dict[str, np.ndarray], default={}) – Dictionary of scalar arrays. For instance, the root mean squared deviation (RMSD) for each feature vector can be passed via {"rmsd": np.array(...)}. The length of each scalar array should match the number of input feature vectors N.
cms_transform (bool) – If True, subtract the center of mass from each batch, then shift and scale the batch by the full dataset statistics.
scalar_requires_grad (bool) – Sets the requires_grad parameter of the torch.Tensor for the scalars specified in scalars. Set to True to use the scalars for learning; if they are only needed for plotting, set it to False.
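A sketch of inputs with the shapes this class expects, in plain NumPy. mdlearn itself is not imported here so the snippet stays self-contained; only the array shapes and the scalars dictionary structure are taken from the parameter descriptions above:

```python
import numpy as np

# Positions: (num_examples, 3, num_points), as documented above
num_examples, num_points = 50, 28
data = np.random.rand(num_examples, 3, num_points).astype(np.float32)

# One scalar value per example; array length must match num_examples
scalars = {"rmsd": np.random.rand(num_examples).astype(np.float32)}

# Shape check mirroring the documented constraint
assert scalars["rmsd"].shape[0] == data.shape[0]

# These arrays could then be passed as:
# PointCloudDatasetInMemory(data, scalars=scalars, cms_transform=True)
```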