drugforge.ml.dataset.SplitDockedDataset

class drugforge.ml.dataset.SplitDockedDataset(*args: Any, **kwargs: Any)

Bases: DockedDataset

Same layout as DockedDataset, but each entry is a dict that has entries for “complex”, “protein”, and “ligand”, which store the corresponding representations.
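
The entry layout described above can be pictured with a small sketch. The page does not say how the split is performed, so the helper below is hypothetical: it assumes each pose carries "z" (atomic numbers), "pos" (positions), and a boolean "lig" mask, and it uses plain Python lists as stand-ins for tensors.

```python
# Hypothetical illustration only; this is not the library's implementation.
def split_pose(pose):
    """Return 'complex', 'protein', and 'ligand' views of a single pose dict."""
    lig = pose["lig"]  # assumed: True marks a ligand atom

    def select(keep):
        # Keep only the atoms whose flag in `keep` is True.
        return {
            "z": [z for z, k in zip(pose["z"], keep) if k],
            "pos": [p for p, k in zip(pose["pos"], keep) if k],
        }

    return {
        "complex": {"z": list(pose["z"]), "pos": list(pose["pos"]), "lig": list(lig)},
        "protein": select([not flag for flag in lig]),
        "ligand": select(lig),
    }


pose = {
    "z": [6, 7, 8, 6],
    "pos": [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0], [0.0, 0.0, 1.5]],
    "lig": [False, False, True, True],
}
entry = split_pose(pose)
```

Here entry has exactly the three keys named in the class description, with the protein and ligand views holding disjoint subsets of the atoms in the complex.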

__init__(compounds={}, structures=[], random_iter=False)

Constructor for DockedDataset object.

Parameters:
  • compounds (dict[(str, str), list[int]]) – Dict mapping a compound tuple (xtal_id, compound_id) to a list of indices in structures that are poses for that id pair

  • structures (list[dict]) – List of pose dicts, containing at minimum tensors for atomic number, atomic positions, and a ligand idx. Indices in this list should match the indices in the lists in compounds.

  • random_iter (bool, default=False) – Iterate through the dataset randomly each time
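
As a rough sketch of how the two constructor arguments relate, the snippet below builds a compounds dict and a structures list by hand. The key names "z", "pos", and "lig" are borrowed from the reserved keys listed for from_files; the boolean mask layout for "lig", the example ids, and the poses_for helper are assumptions, and plain lists stand in for tensors.

```python
# Three poses; plain lists stand in for the tensors a real dataset would hold.
structures = [
    {"z": [6, 8], "pos": [[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]], "lig": [False, True]},
    {"z": [6, 8], "pos": [[0.1, 0.0, 0.0], [1.3, 0.0, 0.0]], "lig": [False, True]},
    {"z": [7, 6], "pos": [[0.0, 0.0, 0.0], [0.0, 1.4, 0.0]], "lig": [False, True]},
]

# Each (xtal_id, compound_id) pair maps to the indices of its poses in `structures`.
compounds = {
    ("xtal-A", "cmpd-1"): [0, 1],
    ("xtal-B", "cmpd-2"): [2],
}


def poses_for(compound):
    """Hypothetical helper: look up every pose stored for one id pair."""
    return [structures[i] for i in compounds[compound]]


# Sanity check: every index refers to an existing entry in `structures`.
all_indices = sorted(i for idxs in compounds.values() for i in idxs)
indices_valid = all(0 <= i < len(structures) for i in all_indices)
```

The point of the indirection is that one id pair can own several poses without duplicating the key in each pose dict.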

Methods

  • __init__([compounds, structures, random_iter]) – Constructor for DockedDataset object.

  • from_complexes(complexes[, exp_dict, ...]) – Build from a list of Complex objects.

  • from_files(str_fns, compounds[, ignore_h, ...])

classmethod from_complexes(complexes: list[Complex], exp_dict=None, ignore_h=True, random_iter=False)

Build from a list of Complex objects.

Parameters:
  • complexes (list[Complex]) – List of Complex schema objects to build into a DockedDataset object

  • exp_dict (dict[str, dict[str, int | float]], optional) – Dict mapping compound_id to an experimental results dict. The dict for a compound will be added to the pose representation of each Complex containing a ligand with that compound_id

  • ignore_h (bool, default=True) – Whether to remove hydrogens from the loaded structure

  • random_iter (bool, default=False) – Iterate through the dataset randomly each time

Return type:

DockedDataset
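
To make the role of exp_dict concrete, here is a minimal sketch of attaching experimental results to poses by compound_id. It assumes the results are merged flat into the pose dict and that each pose stores its id pair under "compound"; the page confirms neither detail, and attach_experimental and the "pIC50" label are made up for illustration.

```python
# Experimental results keyed by compound_id (values are illustrative).
exp_dict = {
    "cmpd-1": {"pIC50": 6.3},
    "cmpd-2": {"pIC50": 5.1},
}

# Minimal pose dicts; a real pose would also carry atomic data.
poses = [
    {"compound": ("xtal-A", "cmpd-1")},
    {"compound": ("xtal-B", "cmpd-3")},  # no experimental data for this one
]


def attach_experimental(pose, exp_dict):
    """Hypothetical helper: copy a pose and add its compound's results, if any."""
    compound_id = pose["compound"][1]
    return {**pose, **exp_dict.get(compound_id, {})}


annotated = [attach_experimental(p, exp_dict) for p in poses]
```

Poses whose compound_id is missing from exp_dict are left unchanged in this sketch.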

classmethod from_files(str_fns, compounds, ignore_h=True, extra_dict=None, num_workers=1, random_iter=False)

Parameters:
  • str_fns (list[str]) – List of paths for the PDB files. Should correspond 1:1 with the names in compounds

  • compounds (list[tuple[str]]) – List of (crystal structure id, ligand compound id) tuples, one per file in str_fns

  • ignore_h (bool, default=True) – Whether to remove hydrogens from the loaded structure

  • extra_dict (dict[str, dict], optional) – Extra information to add to each structure. Keys should be compounds, and dicts can be anything as long as they don’t have the keys [“z”, “pos”, “lig”, “compound”]

  • num_workers (int, default=1) – Number of cores to use to load structures

  • random_iter (bool, default=False) – Iterate through the dataset randomly each time
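
The constraints on these arguments (a 1:1 match between str_fns and compounds, and no reserved keys in extra_dict) can be checked before loading. The helper below is a hypothetical pre-flight check, not part of the library; the file paths are placeholders, and it assumes extra_dict is keyed by the same tuples that appear in compounds.

```python
# Keys the parameter description says extra_dict entries must not use.
RESERVED_KEYS = {"z", "pos", "lig", "compound"}


def check_from_files_inputs(str_fns, compounds, extra_dict=None):
    """Hypothetical pre-flight check for arguments destined for from_files."""
    if len(str_fns) != len(compounds):
        raise ValueError(
            f"{len(str_fns)} files but {len(compounds)} compounds; expected 1:1"
        )
    for compound, extra in (extra_dict or {}).items():
        clash = RESERVED_KEYS & set(extra)
        if clash:
            raise ValueError(f"extra_dict[{compound!r}] uses reserved keys {sorted(clash)}")


str_fns = ["poses/xtal-A_cmpd-1.pdb", "poses/xtal-B_cmpd-2.pdb"]  # placeholder paths
compounds = [("xtal-A", "cmpd-1"), ("xtal-B", "cmpd-2")]
extra_dict = {("xtal-A", "cmpd-1"): {"pIC50": 6.3}}

check_from_files_inputs(str_fns, compounds, extra_dict)  # passes silently
```

Running a check like this up front gives a clear error message instead of a failure partway through loading structures across workers.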