mdlearn.data.preprocess.simulation

Module for extracting outputs from molecular dynamics trajectories.

Functions

parallel_preprocess(topic, input_dir, output_dir, ...)

Preprocess simulation data from many trajectory files in parallel.

preprocess(top_file, traj_file, output_dir, ...)

Preprocess simulation data from a trajectory file.

Classes

ContactMapPreprocessor(top_file, traj_file, ...)

Process contact maps from an MD trajectory.

CoordinatePreprocessor(top_file, traj_file, ...)

Process coordinates from an MD trajectory.

RmsdPreprocessor(top_file, traj_file, ref_file, ...)

Process RMSD from an MD trajectory.

SimulationPreprocessor(top_file, traj_file, ...)

Protocol for simulation data preprocessors.

class mdlearn.data.preprocess.simulation.ContactMapPreprocessor(top_file: Path | str, traj_file: Path | str, cutoff: float = 8.0, selection: str = 'protein and name CA')

Process contact maps from an MD trajectory.

__init__(top_file: Path | str, traj_file: Path | str, cutoff: float = 8.0, selection: str = 'protein and name CA') → None

Initialize the contact map preprocessor.

Parameters:
  • top_file (Path | str) – Topology file of the simulation.

  • traj_file (Path | str) – Trajectory file of the simulation.

  • cutoff (float) – Cutoff distance (in Angstroms) for contact map calculation, defaults to 8.0.

  • selection (str) – Atom selection string for the atoms used in the contact map, defaults to ‘protein and name CA’.

get() → numpy.ndarray

Get contact maps of a trajectory file.

Returns:

np.ndarray – The contact maps of the trajectory with shape (n_frames, R) where R is a ragged dimension containing the concatenated row and column indices of the ones in the contact map.
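
The ragged layout described above can be expanded back into a dense matrix. The following is a minimal sketch, not part of mdlearn: it assumes that each frame stores all row indices first and all column indices second, in equal-length halves. The helper name decode_contact_map and the example frame are hypothetical.

```python
import numpy as np


def decode_contact_map(row_col: np.ndarray, n_residues: int) -> np.ndarray:
    """Rebuild a dense (n_residues, n_residues) 0/1 matrix from one ragged frame.

    Assumption: the first half of `row_col` holds row indices and the
    second half holds the matching column indices.
    """
    half = len(row_col) // 2
    rows, cols = row_col[:half], row_col[half:]
    dense = np.zeros((n_residues, n_residues), dtype=np.uint8)
    dense[rows, cols] = 1  # mark every stored (row, col) pair as a contact
    return dense


# Hypothetical frame with 3 residues and contacts at
# (0, 0), (0, 1), (1, 0), (1, 1), (2, 2).
frame = np.array([0, 0, 1, 1, 2, 0, 1, 0, 1, 2])
dense = decode_contact_map(frame, n_residues=3)
```

Because the number of contacts varies per frame, each frame has to be decoded separately; the number of residues must be known from the atom selection.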

class mdlearn.data.preprocess.simulation.CoordinatePreprocessor(top_file: Path | str, traj_file: Path | str, ref_file: Path | str, selection: str = 'protein and name CA')

Process coordinates from an MD trajectory.

__init__(top_file: Path | str, traj_file: Path | str, ref_file: Path | str, selection: str = 'protein and name CA') → None

Initialize the coordinate preprocessor.

Parameters:
  • top_file (Path | str) – Topology file of the simulation.

  • traj_file (Path | str) – Trajectory file of the simulation.

  • ref_file (Path | str) – Reference structure file to align the trajectory.

  • selection (str) – Atom selection string for the reference structure, defaults to ‘protein and name CA’.

get() → numpy.ndarray

Get coordinates of a trajectory file.

Returns:

np.ndarray – Coordinates of the trajectory. The shape of the array is (n_frames, n_atoms, 3), where n_frames is the number of frames in the trajectory, n_atoms is the number of atoms in the selection, and 3 corresponds to x, y, and z.

class mdlearn.data.preprocess.simulation.RmsdPreprocessor(top_file: Path | str, traj_file: Path | str, ref_file: Path | str, selection: str = 'protein and name CA')

Process RMSD from an MD trajectory.

__init__(top_file: Path | str, traj_file: Path | str, ref_file: Path | str, selection: str = 'protein and name CA') → None

Initialize the RMSD preprocessor.

Parameters:
  • top_file (Path | str) – Topology file of the simulation.

  • traj_file (Path | str) – Trajectory file of the simulation.

  • ref_file (Path | str) – Reference structure file to calculate RMSD.

  • selection (str) – Atom selection string for the reference structure, defaults to ‘protein and name CA’.

get() → numpy.ndarray

Get RMSD to reference state of a trajectory file.

Returns:

np.ndarray – RMSD to reference state of the trajectory. The shape of the array is (n_frames,), where n_frames is the number of frames in the trajectory.
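
To make the returned quantity concrete, here is a minimal NumPy sketch of a per-frame RMSD. It is not the library's implementation: it assumes the frames are already superimposed on the reference and performs no alignment. The function name rmsd_to_reference and the toy coordinates are hypothetical.

```python
import numpy as np


def rmsd_to_reference(coords: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Per-frame RMSD without superposition.

    coords: (n_frames, n_atoms, 3) array of selected-atom positions.
    ref:    (n_atoms, 3) array of reference positions.
    """
    diff = coords - ref  # broadcasts the reference over all frames
    # Squared displacement per atom, averaged over atoms, then square root.
    return np.sqrt((diff ** 2).sum(axis=2).mean(axis=1))


# Toy data: frame 0 equals the reference, frame 1 is shifted by (3, 4, 0),
# so every atom is displaced by a distance of 5.
ref = np.zeros((2, 3))
coords = np.stack([ref, ref + np.array([3.0, 4.0, 0.0])])
rmsd = rmsd_to_reference(coords, ref)
```

The result has shape (n_frames,), matching the shape documented above.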

class mdlearn.data.preprocess.simulation.SimulationPreprocessor(top_file: Path | str, traj_file: Path | str, *args: Any, **kwargs: dict[str, Any])

Protocol for simulation data preprocessors.

__init__(top_file: Path | str, traj_file: Path | str, *args: Any, **kwargs: dict[str, Any]) → None

Initialize the simulation preprocessor.

Parameters:
  • top_file (Path | str) – Topology file of the simulation.

  • traj_file (Path | str) – Trajectory file of the simulation.

  • *args (Any) – Positional arguments for the preprocessor.

  • **kwargs (dict[str, Any]) – Keyword arguments for the preprocessor.

get() → numpy.ndarray

Get simulation data from a trajectory file.

Returns:

np.ndarray – Simulation data from the trajectory.
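
A class satisfies a protocol by providing the documented methods; no inheritance is required. The sketch below illustrates this with a local stand-in protocol so that it runs without mdlearn installed. PreprocessorLike and FrameIndexPreprocessor are hypothetical names, and the toy class returns frame indices instead of reading any files.

```python
from pathlib import Path
from typing import Protocol, Union, runtime_checkable

import numpy as np

PathLike = Union[Path, str]


@runtime_checkable
class PreprocessorLike(Protocol):
    """Local stand-in mirroring the documented get() interface (hypothetical)."""

    def get(self) -> np.ndarray: ...


class FrameIndexPreprocessor:
    """Hypothetical preprocessor: stores the paths and returns frame indices."""

    def __init__(self, top_file: PathLike, traj_file: PathLike, n_frames: int = 5) -> None:
        self.top_file = Path(top_file)
        self.traj_file = Path(traj_file)
        self.n_frames = n_frames

    def get(self) -> np.ndarray:
        # A real preprocessor would read the trajectory here.
        return np.arange(self.n_frames)


pre = FrameIndexPreprocessor("system.pdb", "traj.dcd", n_frames=3)
is_preprocessor = isinstance(pre, PreprocessorLike)  # structural check on get()
data = pre.get()
```

Note that a runtime isinstance check on a protocol only verifies that the method exists, not its signature or return type.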

mdlearn.data.preprocess.simulation.parallel_preprocess(topic: str, input_dir: Path | str, output_dir: Path | str, top_ext: str = '.pdb', traj_ext: str = '.dcd', num_workers: int = 10, **kwargs: Any) → None

Preprocess simulation data from many trajectory files in parallel.

Parameters:
  • topic (str) – Topic/name of the preprocessor.

  • input_dir (Path | str) – Input directory containing the trajectory files.

  • output_dir (Path | str) – Output directory to save the preprocessed data.

  • top_ext (str) – Extension of the topology files, defaults to ‘.pdb’.

  • traj_ext (str) – Extension of the trajectory files, defaults to ‘.dcd’.

  • num_workers (int) – Number of workers for parallel processing, defaults to 10.

  • **kwargs (Any) – Keyword arguments for the preprocessor.
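
The general pattern behind these parameters can be sketched as follows. This is an illustration only, not the library's implementation: it assumes that a topology and a trajectory belong together when they share a file stem, it uses a thread pool to stay self-contained, and the worker writes a marker file in place of real preprocessed data. All function names here (pair_files, process_one, run_parallel) are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pair_files(input_dir, top_ext=".pdb", traj_ext=".dcd"):
    """Pair each trajectory with the topology sharing its stem (assumed layout)."""
    pairs = []
    for traj in sorted(Path(input_dir).glob(f"*{traj_ext}")):
        top = traj.with_suffix(top_ext)
        if top.exists():  # skip trajectories without a matching topology
            pairs.append((top, traj))
    return pairs


def process_one(pair, output_dir):
    """Placeholder worker: writes a marker file where real output would go."""
    top, traj = pair
    out = Path(output_dir) / f"{traj.stem}.txt"
    out.write_text(f"{top.name} {traj.name}\n")
    return out


def run_parallel(input_dir, output_dir, num_workers=10):
    """Process every (topology, trajectory) pair with a pool of workers."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    pairs = pair_files(input_dir)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in the same order as the input pairs.
        return list(pool.map(lambda p: process_one(p, output_dir), pairs))


# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("run0.pdb", "run0.dcd", "run1.pdb", "run1.dcd", "orphan.dcd"):
        (root / name).touch()
    outputs = run_parallel(root, root / "out", num_workers=2)
    names = [p.name for p in outputs]
```

For CPU-bound trajectory analysis, a process pool is usually a better fit than threads; that variant needs a module-level worker function instead of a lambda so it can be pickled.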

mdlearn.data.preprocess.simulation.preprocess(top_file: Path | str, traj_file: Path | str, output_dir: Path | str, topic: str, **kwargs: Any) → None

Preprocess simulation data from a trajectory file.

Parameters:
  • top_file (Path | str) – Topology file of the simulation.

  • traj_file (Path | str) – Trajectory file of the simulation.

  • output_dir (Path | str) – Output directory to save the preprocessed data.

  • topic (str) – Topic/name of the preprocessor.

  • **kwargs (Any) – Keyword arguments for the preprocessor.