{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a9dd55cb-bb04-47e5-9758-f006379db2c3",
   "metadata": {},
   "source": [
    "# Docking and Scoring\n",
    "## Intro"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70b9d2640eb06041",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "This notebook shows you how to dock and score molecules using the drugforge-docking module. You should have already gone through the `interfacing with databases and systems` tutorial.\n",
    "\n",
    "Our docking pipeline focuses primarily on structure-enabled drug discovery programs, in which crystal structures of early molecules are available for *reference-based* docking. This approach has been demonstrated to be effective for prioritising designed compounds.\n",
    "\n",
    "To this end, we have implemented an API that wraps the OpenEye POSIT docking algorithm, which, through its use of the HYBRID and SHAPEFIT algorithms, enables reference-based docking."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28497ab0b4fa94ca",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### The scope of this guide"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "506fdcc62785c33c",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "This guide will show you how to dock and score molecules. For the essential precursor step of loading and prepping data, please see [protein_prep](%protein_prep.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc7d5b6c525f9fee",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Setting up example data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f8614357c006f2",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "We will use files from our test suite, since these molecules have already been prepped for docking."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4002264f3f38448",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from drugforge.data.testing.test_resources import fetch_test_file\n",
    "from drugforge.data.schema.complex import Complex\n",
    "from drugforge.data.schema.ligand import Ligand\n",
    "from drugforge.modeling.schema import PreppedComplex\n",
    "\n",
    "# grab a pre-prepared complex \n",
    "prepped_complex = PreppedComplex.from_oedu_file(\n",
    "        fetch_test_file(\"Mpro-P2660_0A_bound-prepped_receptor.oedu\"),\n",
    "        ligand_kwargs={\"compound_name\": \"test\"},\n",
    "        target_kwargs={\"target_name\": \"test\", \"target_hash\": \"mock_hash\"},\n",
    "    )\n",
    "\n",
    "# make a ligand from an SDF file \n",
    "ligand = Ligand.from_sdf(\n",
    "        fetch_test_file(\"Mpro-P0008_0A_ERI-UCB-ce40166b-17.sdf\"), compound_name=\"test\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e6495a9b40fcf4b",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Docking with the POSITDocker"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83d11ecc36f27b8b",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "There are a *ton* of choices we can make for docking, and we will not enumerate them all here. To get a flavor for the options, we can examine the class attributes of the `POSITDocker`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e8b578e0c5cd0e9",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from drugforge.docking.openeye import POSITDocker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f7b3d0d-89ad-4031-9fa5-a759e48ddfe5",
   "metadata": {},
   "outputs": [],
   "source": [
    "POSITDocker?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c09c386f-93e2-4ad4-9406-589208f35b7e",
   "metadata": {},
   "source": [
    "A quick overview of some important options:\n",
    "\n",
    "* `relax`: Whether to allow relaxation when generating docked structures.\n",
    "* `posit_method`: Which POSIT method to use. See [here](https://docs.eyesopen.com/applications/oedocking/theory/posit_theory.html#) for a complete treatment of POSIT theory. The default (`ALL`) iteratively selects the best possible method.\n",
    "* `use_omega`: Whether to use OpenEye's OMEGA conformer generation to enumerate conformers before docking. This should substantially improve the quality of predicted poses.\n",
    "* `omega_dense`: Whether to use dense OMEGA sampling.\n",
    "* `allow_retries`: Whether to try several configurations of docking parameters in an attempt to obtain a result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bcb198f12778c0ee",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# let's go ahead and make a Docker object\n",
    "docker = POSITDocker()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4367f0f81f56bb65",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# and check out its defaults.\n",
    "docker.dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbf6a77ef91fe4cb",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "`POSITDocker.dock()` takes:\n",
    "1) a list of `DockingInputBase` objects\n",
    "2) an output directory (optional)\n",
    "3) some Dask options (optional)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2f1782a6f9a548f",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Currently, we have two kinds of `DockingInputBase` objects implemented:\n",
    "1) a complex-ligand pair (`DockingInputPair`)\n",
    "2) a one-to-many ligand:complexes object (`DockingInputMultiStructure`)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c50f672710755a10",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Running simple docking "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6af7343f46886e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from drugforge.docking.docking import DockingInputPair\n",
    "input_pair = DockingInputPair(ligand=ligand, complex=prepped_complex)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "835610ba-a32c-466b-ad3c-a46d804a5f64",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = docker.dock(inputs=[input_pair])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0f912b68fb1cde4",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "This returns a list of POSITDockingResults objects!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8f208e377bb2b20",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# let's grab one of the results objects\n",
    "result = results[0] "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cda12ed29150315",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# and dump it to disk\n",
    "result.write_docking_files(\"docking_test\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b9731bc8369ab68",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Scoring"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a190fdf18f2d480d",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "We decouple the *pose prediction* and *scoring* parts of docking, which enables us to score docked poses with different scoring functions. \n",
    "\n",
    "This flexibility allows us to implement our own scoring functions and capture information important to our discovery process. \n",
    "\n",
    "To this end, we have written a few \"scorer\" classes, including:\n",
    "1. A traditional physics-based docking scorer: `ChemGauss4Scorer`\n",
    "2. A scorer that tries to capture information about the potential for the binding site to evolve: `FINTScorer` (see the `working_with_fitness_data` tutorial)\n",
    "3. A 2D graph-attention-based scorer trained on COVID Moonshot data that predicts pIC50s directly: `GATScorer`\n",
    "4. 3D equivariant scorers that use 3D pose information and are trained on COVID Moonshot data to predict pIC50s: `E3NNScorer` and `SchnetScorer`\n",
    "5. And finally, a `MetaScorer`, which can easily run all the other scorers.\n",
    "\n",
    "Currently these live in the `docking` module, but we are planning to move some of the scorers to other subpackages so that `docking` doesn't have to depend on `dataviz`, `spectrum`, `ml`, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8fbc441d8b79f31",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from drugforge.docking.scorer import ChemGauss4Scorer\n",
    "from drugforge.docking.fint_scorer import FINTScorer\n",
    "# The ML scorers are not imported for now as the ml module is not working. \n",
    "# from drugforge.docking.ml_scorer import GATScorer, E3NNScorer, SchnetScorer\n",
    "from drugforge.docking.meta_scorer import MetaScorer"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4c1490dadfe2e97",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Targets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70e3c7a2755aadbf",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Several of our scorers require target-specific information. We can find out the targets that the repo \"knows about\" like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40078b13ed983fae",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from drugforge.data.services.postera.manifold_data_validation import TargetTags"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e35dc4a923af8fbc",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "TargetTags.get_values()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4926aff3a53db7dd",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Since we're working with a known target, we can set it as a variable and use it throughout."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3a7dcf7c1893b66",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "target = TargetTags(\"SARS-CoV-2-Mpro\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dde143845fd4555a",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### ChemGauss4 Scorer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29004e554bddee19",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "chemgauss_scorer = ChemGauss4Scorer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e0336fe703133b0",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "scores = chemgauss_scorer.score(results)\n",
    "scores"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce43acbebbeb6cbd",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "We can see this returns an array of score objects. If we want a dataframe instead, we can pass `return_df=True`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "993311abca9066",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "scores_df = chemgauss_scorer.score(results, return_df=True)\n",
    "scores_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20b0e55036723903",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### FINTScore"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cbaf4ae04b00fce4",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "In antiviral drug discovery, the potential for mutation of target binding sites is very high. Thus, it is important to avoid sidechain-ligand interactions with highly mutable residues when evaluating potential designs.\n",
    "\n",
    "FINTScore attempts to compress information about the mutability of the ligand binding site into a single number between 0 and 1. It rewards interactions with non-mutable sidechains and backbone atoms, while penalising interactions with mutable sidechains. See the implementation for more details. You can also view this information in 3D; see the `visualising ASAP targets` tutorial.\n",
    "\n",
    "For the FINT score, we need fitness data (normally obtained from deep mutational scanning experiments or from phylogenetic data), which means we can only work on a target for which we have vendored fitness data. To check whether a given target has fitness data, we can use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37b5911a6ccd3beb",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from drugforge.spectrum.fitness import target_has_fitness_data\n",
    "\n",
    "# does our target have fitness data?\n",
    "target_has_fitness_data(target) # yes!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f140196cc0f07bbe",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "fint_scorer = FINTScorer(target=target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddccb75f5817b233",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "scores = fint_scorer.score(results, return_df=True)\n",
    "scores"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16686f3be0c2e114",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### ML Scorers\n",
    "\n",
    "**Warning:** this section currently does not work. We hope to fix this in the next release."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31a41fc630992c0c",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Our ML scorers are trained to predict pIC50s from 2D graph or 3D equivariant representations of ligands and target-ligand complexes, respectively. This is enabled by the [MTENN](https://github.com/choderalab/mtenn) framework, which abstracts and modularizes the task of structure-based machine learning.\n",
    "\n",
    "Currently, we have separate ML scorers for some targets, but we are exploring PDBBind-based foundation models for multi-target prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60c72e58ef104c79",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# import our ML model registry\n",
    "from drugforge.ml.models import ASAPMLModelRegistry"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c67d9c1082d67f9",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "ASAPMLModelRegistry.get_implemented_model_types()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65347d0d908bcf26",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from drugforge.docking.scorer import MLModelScorer\n",
    "ml_scorers = [MLModelScorer.from_latest_by_target_and_type(target, model_type) \n",
    "           for model_type in ASAPMLModelRegistry.get_implemented_model_types()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44aa66154bb8ae28",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "gat_scores = ml_scorers[0].score(results, return_df=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f264873d0321aa64",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "gat_scores"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8190384bd127966b",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### MetaScorer"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ba76b0c4774056d",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "We can use the `MetaScorer` to run all the scorers for us and combine everything into a dataframe that we can easily save."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75474c208b20741",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "scorers = [chemgauss_scorer, fint_scorer, *ml_scorers]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef31d8bd27f52a6c",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "metascorer = MetaScorer(scorers=scorers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "708eb495cdff224c",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "scores_df = metascorer.score(results, return_df=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "746e077f0e551f7b",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "scores_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd654b07c6348387",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Under the hood, this uses `drugforge.docking.scorer.Score._combine_and_pivot_scores_df` to return the scores in a dataframe. As of version 0.4, this uses `drugforge.docking.scorer._SCORE_MANIFOLD_ALIAS` to rename the columns to the standard names used within the ASAPDiscovery Consortium. You can examine which column names correspond to which score here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d28e20c4fc8e94f",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from drugforge.docking.scorer import _SCORE_MANIFOLD_ALIAS\n",
    "_SCORE_MANIFOLD_ALIAS"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d9bd774a1550dfa",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Advanced Topics: Selectors"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d12eb4812586bc59",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "The ASAPDiscovery Consortium regularly operates in a regime where we have many experimental structures to choose from as references for docking. To accelerate the process of choosing which structures to use, we have implemented a series of `Selector` objects that take a set of ligands and complexes and choose which ligand-complex pairs to use for docking."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e24b052db7cc3f",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from drugforge.docking.selectors.selector_list import StructureSelector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c33e1790f8efbe1",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "StructureSelector.get_values()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ad49c819b4ad91bb",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# mock a Fragalysis dump from Diamond Light Source\n",
    "\n",
    "from drugforge.data.services.fragalysis.fragalysis_reader import FragalysisFactory\n",
    "all_mpro_fns = [\n",
    "        \"metadata.csv\",\n",
    "        \"aligned/Mpro-x11041_0A/Mpro-x11041_0A_bound.pdb\",\n",
    "        \"aligned/Mpro-x1425_0A/Mpro-x1425_0A_bound.pdb\",\n",
    "        \"aligned/Mpro-x11894_0A/Mpro-x11894_0A_bound.pdb\",\n",
    "        \"aligned/Mpro-x1002_0A/Mpro-x1002_0A_bound.pdb\",\n",
    "        \"aligned/Mpro-x10155_0A/Mpro-x10155_0A_bound.pdb\",\n",
    "        \"aligned/Mpro-x0354_0A/Mpro-x0354_0A_bound.pdb\",\n",
    "        \"aligned/Mpro-x11271_0A/Mpro-x11271_0A_bound.pdb\",\n",
    "        \"aligned/Mpro-x1101_1A/Mpro-x1101_1A_bound.pdb\",\n",
    "        \"aligned/Mpro-x1187_0A/Mpro-x1187_0A_bound.pdb\",\n",
    "        \"aligned/Mpro-x10338_0A/Mpro-x10338_0A_bound.pdb\",\n",
    "    ]\n",
    "all_paths = [fetch_test_file(f\"frag_factory_test/{fn}\") for fn in all_mpro_fns]\n",
    "parent_dir = all_paths[0].parent\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "870a69e6-2c7f-4bf7-8e0c-041e11f357f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "ff = FragalysisFactory.from_dir(parent_dir)\n",
    "complexes = ff.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fead92efc620886e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "ligands = [complex.ligand for complex in complexes]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f396e1ecb5d57749",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### What the selectors do"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2260a9ff-b118-4f17-92aa-7baa16715c6b",
   "metadata": {},
   "source": [
    "Selectors take a list of `Complexes` and `Ligands` and return `Pairs` based on some criteria. \n",
    "\n",
    "For example the `SelfDockingSelector` only selects ligands that match those already present in the complexes. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de4146b4307d1c96",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "#### SelfDockingSelector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e17bec8b31331e7a",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "selector = StructureSelector('SelfDockingSelector').selector_cls()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4040cd792b5b06cf",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pairs = selector.select(ligands, complexes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5806a3e4f5c87241",
   "metadata": {},
   "outputs": [],
   "source": [
    "len(pairs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "282520ebdd7728e3",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# are all the ligands the same as those in the complexes?\n",
    "all(pair.complex.ligand == pair.ligand for pair in pairs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13c7f4acd29e8afc",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "#### PairwiseSelector"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15142b2c-c624-4297-9725-88c97219357e",
   "metadata": {},
   "source": [
    "The `PairwiseSelector` produces the full outer product of complexes and ligands e.g in the example below we have 10 complexes and 10 ligands. The resulting outer product of pairs contains 100 elements. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea5e768515d4a5c0",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "selector = StructureSelector('PairwiseSelector').selector_cls()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f14dfeecf060c11",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pairs = selector.select(ligands, complexes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1518d5e2db86eb47",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "len(pairs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aba14008fc7868ca",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "#### MCSSelector"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1142304-3297-4a79-81e7-760b18435bd4",
   "metadata": {},
   "source": [
    "The `MCSSelector` selects complexes that closely match the structures in the query ligand by Maximum Common Substructure (MCS). This is very useful when docking new designs and you need to determine a chemically similar reference. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a2de81ea7cb5193e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "selector = StructureSelector('MCSSelector').selector_cls()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cf6e9ffdd576126",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# the n_select parameter controls how many matches per ligand\n",
    "pairs = selector.select(ligands, complexes, n_select=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb800d955cf5bb9d",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "len(pairs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "725e4d271720a847",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "#### LeaveSimilarOutSelector"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5a1804e89458fed",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "The `LeaveSimilarOutSelector` filters out pairs where the ligand is a stereoisomer, tautomer, protonation state isomer, etc. of the complex ligand. It can take a while, though, because it has to do a `len(ligands) * len(complexes)` pairwise comparison of all those chemical possibilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e96cfc897fa52cdd",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "selector = StructureSelector('LeaveSimilarOutSelector').selector_cls()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "64cefea78ac239ef",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pairs = selector.select(ligands, complexes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db5e918078166918",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "len(pairs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23cd649995d30efd",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Advanced Topics: Multi-Structure Docking"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cc3cb4d9ba92350",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Some docking protocols (e.g., POSIT) accept multiple receptor structures and choose for themselves which one to dock to. For these protocols, we pass a different kind of input:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "beaea6120bb0b262",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from drugforge.docking.docking import DockingInputMultiStructure\n",
    "from drugforge.docking.selectors.selector_list import StructureSelector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef5e2d307e8fe72f",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cached_dus = {\n",
    "        \"Mpro-x1002\": \"du_cache/Mpro-x1002_0A_bound.oedu\",\n",
    "        \"Mpro-x0354\": \"du_cache/Mpro-x0354_0A_bound.oedu\",\n",
    "    }\n",
    "prepped_complexes = [\n",
    "        PreppedComplex.from_oedu_file(\n",
    "            fetch_test_file(cached_du),\n",
    "            ligand_kwargs={\"compound_name\": \"test\"},\n",
    "            target_kwargs={\"target_name\": name, \"target_hash\": \"mock_hash\"},\n",
    "        )\n",
    "        for name, cached_du in cached_dus.items()\n",
    "    ]\n",
    "ligand = Ligand.from_sdf(\n",
    "        fetch_test_file(\"Mpro-P0008_0A_ERI-UCB-ce40166b-17.sdf\"), compound_name=\"test\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6bb4861bcc951e2",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Let's assume we obtained this subset of ligand-protein pairs from the selector logic above. That would look something like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56ae3d56aefdd676",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "selector = StructureSelector('LeaveSimilarOutSelector').selector_cls()\n",
    "pairs = selector.select([ligand], prepped_complexes)\n",
    "len(pairs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77fa7f097b8c05e7",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "We then collapse these pairs into a single MultiStructure set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc44f9a4d4e44e81",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "inputs = DockingInputMultiStructure.from_pairs(pairs) # Returns a list since multiple sets could be generated"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef0984eb7f6d012d",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "If we already knew exactly what we wanted to do, we could just create the set directly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "efc3ec6333d0e4ff",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "alternate_inputs = DockingInputMultiStructure(ligand=ligand, complexes=prepped_complexes)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8869871f782821a",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "We can see that the two are equivalent in this case:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd357bd6d104d9e8",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "inputs[0] == alternate_inputs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f33cffdbfb70ca9",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Now we run docking as before:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e44ee5641c5cd712",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "docker = POSITDocker() # let's just use defaults for now"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59ea45bad2128216",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "results = docker.dock(inputs) # we won't use dask or write an output, takes ~3 minutes on a MacBook Pro"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d432d326184dbbd",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "result = results[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ecc8cdd2eb073af",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "result.write_docking_files(\"multi_structure_docking_test\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73ec695e8d32f737",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Since we passed in multiple structures, we don't know in advance which one POSIT actually used. We can find out by examining the result:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9408902e9f618b23",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "result.input_pair.complex.target.target_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b684cd6a21a1554",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Advanced Topics: Multi-pose Docking"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2cdcaaabbbb112a",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Note: this functionality was added recently. Please open an issue if you encounter problems :)\n",
    "\n",
    "We'll use the same single-pair inputs as in the simple docking example above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "286da78a989fe721",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from drugforge.docking.docking import DockingInputPair\n",
    "prepped_complex = PreppedComplex.from_oedu_file(\n",
    "        fetch_test_file(\"Mpro-P2660_0A_bound-prepped_receptor.oedu\"),\n",
    "        ligand_kwargs={\"compound_name\": \"test\"},\n",
    "        target_kwargs={\"target_name\": \"test\", \"target_hash\": \"mock_hash\"},\n",
    "    )\n",
    "ligand = Ligand.from_sdf(\n",
    "        fetch_test_file(\"Mpro-P0008_0A_ERI-UCB-ce40166b-17.sdf\"), compound_name=\"test\"\n",
    "    )\n",
    "input_pair = DockingInputPair(ligand=ligand, complex=prepped_complex)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "300f06cb8ddc89a0",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "docker = POSITDocker(num_poses=50) # we set the number of poses when we create the docker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4106e4745d9d0dfa",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "results = docker.dock([input_pair]) # we won't use dask or write an output, takes ~1 min on a Macbook Pro"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "740d37a3ffad9e43",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "len(results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1118bd8aa6e30bc",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print([result.probability for result in results])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1fdead236a1e76f",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## A few side notes: Dask and Target Specific Workflows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cca7da8cadf596f3",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Dask"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6af146d6f9b22551",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "We make heavy use of Dask throughout our code, which helps automate parallel processing and provides a nice dashboard for evaluating the progress of large scale docking efforts. Due to the way in which Dask automates error handling, this has occasionally led to situations where the behaviour of our code is different depending on whether you have enabled Dask. We have tried to stamp out any instances of this, but if you find another, please make an issue!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb96fc4cad7e758a",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Target-specific workflows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a897e110095c4b6",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "We have implemented our library code within the `asapdiscovery-workflows` module, which puts everything together in a command-line interface (cli). Unfortunately, as of version 0.4, these workflows only work if you are using the targets specified for ASAP. We plan on changing this for version 0.5 "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aeb67094b84d84f2",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "To find out which targets can be passed to these workflows, you can use this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "986630eb29710dde",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from drugforge.data.services.postera.manifold_data_validation import TargetTags"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d53657ae265ce4d5",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "TargetTags.get_values()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}