{ "cells": [ { "cell_type": "markdown", "id": "76dea8ef-f2b5-4e9f-a850-60aa8abde5d0", "metadata": {}, "source": [ "# Interfacing with databases and systems" ] }, { "cell_type": "markdown", "id": "203dee44-82ef-459e-864f-983b0e39a2a2", "metadata": {}, "source": [ "ASAP's workflows involve interfacing with many different services, data providers and integrations. Much like with our base level abstractions, we aim to provide a seamless way to work with these databases and integrations with high level abstractions" ] }, { "cell_type": "markdown", "id": "d63f64e6-f443-4b5b-b90a-1057a1446d4d", "metadata": {}, "source": [ "## Reading lots of molecules from files\n", "\n", "One often wants to read a giant file filled with molecule data, e.g. an `SDF` or `mol2` file. We provide a `MolFileFactory` to quickly read these into a list of Ligands" ] }, { "cell_type": "code", "execution_count": 1, "id": "1c32a0c5-788d-4f8f-a869-0fb2e062e59c", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'MAT-POS-fa06b69f-6'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToSmiles when writing 'MAT-POS-fa06b69f-6'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'MAT-POS-fa06b69f-6'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToSTDInChI when writing 'MAT-POS-fa06b69f-6'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'MAT-POS-fa06b69f-6'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToSTDInChIKey when writing 'MAT-POS-fa06b69f-6'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'MAT-POS-fa06b69f-6'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToInChI when writing 'MAT-POS-fa06b69f-6'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'MAT-POS-fa06b69f-6'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToInChIKey when writing 'MAT-POS-fa06b69f-6'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 11 of molecule 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToSmiles when writing 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 11 of molecule 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToSTDInChI when writing 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 11 of molecule 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToSTDInChIKey when writing 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 11 of molecule 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToInChI when writing 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 11 of molecule 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToInChIKey when writing 'AAR-POS-0daf6b7e-2'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 2 of molecule 'TAT-ENA-80bfd3e5-7'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToSmiles when writing 'TAT-ENA-80bfd3e5-7'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 2 of molecule 'TAT-ENA-80bfd3e5-7'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToSTDInChI when writing 'TAT-ENA-80bfd3e5-7'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 2 of molecule 'TAT-ENA-80bfd3e5-7'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToSTDInChIKey when writing 'TAT-ENA-80bfd3e5-7'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 2 of molecule 'TAT-ENA-80bfd3e5-7'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToInChI when writing 'TAT-ENA-80bfd3e5-7'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 2 of molecule 'TAT-ENA-80bfd3e5-7'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToInChIKey when writing 'TAT-ENA-80bfd3e5-7'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'LON-WEI-8f408cad-5'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToSmiles when writing 'LON-WEI-8f408cad-5'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'LON-WEI-8f408cad-5'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToSTDInChI when writing 'LON-WEI-8f408cad-5'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'LON-WEI-8f408cad-5'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToSTDInChIKey when writing 'LON-WEI-8f408cad-5'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'LON-WEI-8f408cad-5'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToInChI when writing 'LON-WEI-8f408cad-5'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'LON-WEI-8f408cad-5'\n", "Warning: OE3DToAtomStereo had a problem during OEMolToInChIKey when writing 'LON-WEI-8f408cad-5'\n" ] } ], "source": [ "from drugforge.data.readers.molfile import MolFileFactory\n", "from drugforge.data.testing.test_resources import fetch_test_file\n", "\n", "big_sdf_file = fetch_test_file(\"Mpro_combined_labeled.sdf\") # SDF file filled with COVID Moonshot compounds\n", "\n", "factory = MolFileFactory(filename=big_sdf_file)\n", "ligands = factory.load()" ] }, { "cell_type": "code", "execution_count": 2, "id": "40e299c9-198a-478d-801f-1810e6c18765", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "576\n" ] } ], "source": [ "print(len(ligands)) # loaded 576 ligands into a list " ] }, { "cell_type": "markdown", "id": "7d229000-2f2f-49ef-99e9-0d8155b4626a", "metadata": {}, "source": [ "## Reading structures from Fragalysis " ] }, { "cell_type": "markdown", "id": "8ea04b21-f07b-4455-bbea-b6bf83c52861", "metadata": {}, "source": [ "Diamond light source uses the [Fragalysis](https://fragalysis.diamond.ac.uk/viewer/react/landing) platform to display their crystallography results. ASAP makes extensive use of Diamond's high throughput crystallography pipeline, and therefore have developed easy ways to download and parse Fragalysis data in our workflows.\n", "\n", "To get a Fragalysis format dump, navigate to the `Download` button on the desired target in the Fragalysis UI. For ease of use here we have vendored a SARS-CoV-2-Mpro fragalysis file in our testing suite. " ] }, { "cell_type": "code", "execution_count": 3, "id": "dd6a0ff8-73c7-4691-8df2-1d71d0da8183", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading file 'mpro_fragalysis-04-01-24_zipped.zip' from 'https://asap-discovery-test-files.s3.amazonaws.com/mpro_fragalysis-04-01-24_zipped.zip' to '/Users/joshua/Library/Caches/asapdiscovery_testing'.\n" ] } ], "source": [ "from drugforge.data.testing.test_resources import fetch_test_file\n", "\n", "mpro_fragalysis_zipped = fetch_test_file(\"mpro_fragalysis-04-01-24_zipped.zip\")\n", "extract_dir = \".\"\n", "# unzip \n", "import shutil\n", "shutil.unpack_archive(mpro_fragalysis_zipped, \".\")" ] }, { "cell_type": "code", "execution_count": 4, "id": "c64a85aa-8393-4aa0-a0b3-d0d2ae5981b6", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'LIG'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 11 of molecule 'LIG'\n", "Warning: OE3DToAtomStereo had a problem during OEWriteMolecule when writing 'LIG'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 2 of molecule 'LIG'\n", "Warning: OE3DToAtomStereo had a problem during OEWriteMolecule when writing 'LIG'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'LIG'\n", "Warning: OE3DToAtomStereo had a problem during OEWriteMolecule when writing 'LIG'\n", "Warning: OE3DToAtomStereo is unable to perceive atom stereo from a flat geometry on atom 8 of molecule 'LIG'\n", "Warning: OE3DToAtomStereo had a problem during OEWriteMolecule when writing 'LIG'\n", "Warning: OECreateInChI: InChI only supports molecules with between 1 and 1023 atoms! (note: large molecule support is experimental)\n", "Warning: OECreateInChI: InChI only supports molecules with between 1 and 1023 atoms! (note: large molecule support is experimental)\n", "Warning: OECreateInChI: InChI only supports molecules with between 1 and 1023 atoms! (note: large molecule support is experimental)\n", "Warning: OECreateInChI: InChI only supports molecules with between 1 and 1023 atoms! (note: large molecule support is experimental)\n", "Warning: OECreateInChI: InChI only supports molecules with between 1 and 1023 atoms! (note: large molecule support is experimental)\n", "Warning: OECreateInChI: InChI only supports molecules with between 1 and 1023 atoms! (note: large molecule support is experimental)\n", "Warning: OECreateInChI: InChI only supports molecules with between 1 and 1023 atoms! (note: large molecule support is experimental)\n", "Warning: OECreateInChI: InChI only supports molecules with between 1 and 1023 atoms! (note: large molecule support is experimental)\n", "Warning: OECreateInChI: InChI only supports molecules with between 1 and 1023 atoms! (note: large molecule support is experimental)\n", "Warning: OECreateInChI: InChI only supports molecules with between 1 and 1023 atoms! (note: large molecule support is experimental)\n", "Warning: OECreateInChI: InChI only supports molecules with between 1 and 1023 atoms! (note: large molecule support is experimental)\n", "Warning: OECreateInChI: InChI only supports molecules with between 1 and 1023 atoms! (note: large molecule support is experimental)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "803\n" ] } ], "source": [ "from drugforge.data.services.fragalysis.fragalysis_reader import FragalysisFactory\n", "\n", "frag_factory = FragalysisFactory.from_dir(\"mpro_fragalysis-04-01-24_zipped\")\n", "complexes = frag_factory.load(use_dask=True) # we can use dask to speed this up a lot\n", "\n", "# we now have a list of 800 complexes from fragalysis to use!\n", "print(len(complexes))\n" ] }, { "cell_type": "markdown", "id": "c37daca3-99f9-4177-a66e-4b633c6594bd", "metadata": {}, "source": [ "## Loading compounds from Postera" ] }, { "cell_type": "markdown", "id": "3037101e-5336-47e2-9b8c-57def72771ae", "metadata": {}, "source": [ "Postera's [Manifold](https://app.postera.ai/) platform is the primary place for a lot of our DMTA cycle \n", "\n", "We need to be able to push and pull data from there with easy. To use this example you will need a valid `POSTERA_API_KEY`, which you can create after making an account" ] }, { "cell_type": "code", "execution_count": 5, "id": "08049921-62c2-4290-a8b3-b880c53d4bb3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: POSTERA_API_KEY=EXAMPLE\n" ] }, { "data": { "text/plain": [ "PosteraSettings(POSTERA_API_KEY='EXAMPLE', POSTERA_API_URL='https://api.asap.postera.ai', POSTERA_API_VERSION='v1')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%env POSTERA_API_KEY=EXAMPLE\n", "\n", "from drugforge.data.services.postera.postera_factory import PosteraFactory, PosteraSettings\n", "\n", "ps = PosteraSettings()\n", "ps" ] }, { "cell_type": "code", "execution_count": 6, "id": "495ea947-ea56-43e8-ae44-3957fa65a68a", "metadata": {}, "outputs": [], "source": [ "pf = PosteraFactory(settings=ps, molecule_set_name=\"MY_MOLSET\")\n", "# pf.pull() will return a list of Ligands" ] }, { "cell_type": "markdown", "id": "a73584b2-7aee-44e5-84f6-39cfe96a582c", "metadata": {}, "source": [ "## Pushing data to Postera" ] }, { "cell_type": "markdown", "id": "d5c846a7-b677-4ab3-869d-4b066dc0e38c", "metadata": {}, "source": [ "Pushing data to postera is similar to loading, however we only allow certain tags followin our design specification to be updated in Manifold. These can be queried with `ManifoldAllowedTags`" ] }, { "cell_type": "code", "execution_count": 7, "id": "a0714038-a29f-4973-ade3-3f79dfb47f63", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['SMILES',\n", " 'biochemical-activity_EV-A71-3Cpro_computed-SchNet-pIC50_msk',\n", " 'biochemical-activity_EV-D68-3Cpro_computed-SchNet-pIC50_msk',\n", " 'biochemical-activity_MERS-CoV-Mpro_computed-SchNet-pIC50_msk',\n", " 'biochemical-activity_SARS-CoV-2-Mpro_computed-SchNet-pIC50_msk',\n", " 'biochemical-activity_ZIKV-NS2B-NS3pro_computed-SchNet-pIC50_msk',\n", " 'in-silico_DENV-NS2B-NS3pro_ligand-conformer-strain-szybki-kcal-mol_msk',\n", " 'in-silico_EV-A71-3Cpro_docking-structure-POSIT_msk',\n", " 'in-silico_EV-A71-Capsid_docking-pose-fitness-POSIT_msk',\n", " 'in-silico_EV-D68-3Cpro_docking-hit_msk',\n", " 'in-silico_EV-D68-3Cpro_md-pose_msk',\n", " 'in-silico_EV-D68-Capsid_ligand-local-strain-szybki-kcal-mol_msk',\n", " 'in-silico_MERS-CoV-Mpro_ligand-conformer-strain-szybki-kcal-mol_msk',\n", " 'in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk',\n", " 'in-silico_SARS-CoV-2-Mpro_docking-pose-fitness-POSIT_msk',\n", " 'in-silico_SARS-CoV-2-N-protein_docking-hit_msk',\n", " 'in-silico_SARS-CoV-2-N-protein_md-pose_msk',\n", " 'in-silico_ZIKV-NS2B-NS3pro_ligand-local-strain-szybki-kcal-mol_msk']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from drugforge.data.services.postera.manifold_data_validation import ManifoldAllowedTags\n", "ManifoldAllowedTags.get_values()[::10]" ] }, { "cell_type": "markdown", "id": "ed8866c9-6216-4115-b8d6-14dd9b2bce15", "metadata": {}, "source": [ "Lets push some mock data to postera! You have to provide a SMILES and also a `ligand_id` which will be propagated to postera backend if it matches a UUID already present in Postera. " ] }, { "cell_type": "code", "execution_count": 8, "id": "bbc99863-3876-4ef5-b254-50e1e293d5d6", "metadata": {}, "outputs": [], "source": [ "from drugforge.workflows.postera.postera_uploader import PosteraUploader\n", "import pandas as pd\n", "data = {\"SMILES\": [\"CCC\", \"CCCC\"], \"ligand_id\":[\"abcderf1244134jasdasda\", \"asidaosidasdnalsd\"], \"in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk\":[\"structure1\", \"structure2\"]}\n", "df = pd.DataFrame(data)\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "0e01229c-5bb3-4a77-a1e6-85380739ccf9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SMILESligand_idin-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk
0CCCabcderf1244134jasdasdastructure1
1CCCCasidaosidasdnalsdstructure2
\n", "
" ], "text/plain": [ " SMILES ligand_id \\\n", "0 CCC abcderf1244134jasdasda \n", "1 CCCC asidaosidasdnalsd \n", "\n", " in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk \n", "0 structure1 \n", "1 structure2 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 10, "id": "66d13bb8-0147-46eb-a593-b94a44c20941", "metadata": {}, "outputs": [], "source": [ "pu = PosteraUploader(settings=ps, molecule_set_name=\"MY_MOLSET\")\n", "# pu.push(df) # will push data to remote " ] }, { "cell_type": "markdown", "id": "0028a4e0-5d48-43db-b49e-e5b9d528f40c", "metadata": {}, "source": [ "## Reading data from CDD\n", "\n", "At ASAP we use the [CDD vault](https://www.collaborativedrug.com/) to store assay information on tested molecules and often need to search and pull data. To use this service you should export your `CDD_API_KEY` and `CDD_VAULT_NUMBER` which will be automatically picked up by our `CDDSettings`:" ] }, { "cell_type": "code", "execution_count": 11, "id": "120418db-6413-448a-8573-d0d74e169960", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: CDD_API_KEY=EXAMPLE, CDD_VAULT_NUMBER=1\n" ] }, { "data": { "text/plain": [ "CDDSettings(CDD_API_KEY='EXAMPLE, CDD_VAULT_NUMBER=1', CDD_VAULT_NUMBER=6890, CDD_API_URL='https://app.collaborativedrug.com', CDD_API_VERSION='v1')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%env CDD_API_KEY=EXAMPLE, CDD_VAULT_NUMBER=1\n", "\n", "from drugforge.data.services.cdd.cdd_api import CDDAPI, CDDSettings\n", "settings = CDDSettings()\n", "settings" ] }, { "cell_type": "markdown", "id": "41aa2ca8-4b99-4db3-8e42-9f8d3d6260b8", "metadata": {}, "source": [ "we can now use the `CDDAPI` interface to query our vault, lets start by searching for molecules, note that they are returned as raw dictionary data from the CDD which can be converted into ligand objects using the `CXSmiles` or `Smiles` data:" ] }, { "cell_type": "code", "execution_count": null, "id": "e2311fa0-773f-4d54-a82e-0894303cde14", "metadata": {}, "outputs": [], "source": [ "cdd_api = CDDAPI.from_settings(settings=settings)\n", "# Search for a specific molecule in the vault, can only do one search at a time using smiles\n", "benzene = cdd_api.get_molecules(smiles=\"c1ccccc1\")\n", "# search for a list of molecules by their name in CDD\n", "molecules_by_name = cdd_api.get_molecules(names=['org-id-1', 'org-id-2'])\n", "# or search for molecules using the CDD compound-id\n", "molecules_by_id = cdd_api.get_molecules(compound_ids=[1, 2, 3, 4])" ] }, { "cell_type": "markdown", "id": "a80b7fff-af09-4309-bd62-f329084ea20c", "metadata": {}, "source": [ "Another common task is to download all `IC50` data for a given protocol to use in ML model development or benchmarking binding affinity calculations, this is trivial using the API:" ] }, { "cell_type": "code", "execution_count": null, "id": "2e3cb8f8-5b38-4029-8405-361923bd2c96", "metadata": {}, "outputs": [], "source": [ "ic50_dataframe = cdd_api.get_ic50_data(protocol_name='assay-1')" ] }, { "cell_type": "markdown", "id": "5a49eadc-42f2-4687-b5dc-28f63ae0927b", "metadata": {}, "source": [ "We also provide a utility function which allows you to quickly download all of the molecules in a given protocol and filter for fully defined stereo and non-covalent ligands only, which returns the results as asap `Lignad` objects:" ] }, { "cell_type": "code", "execution_count": null, "id": "0ad68fa8-41a7-493e-afa5-5e41e279d1b0", "metadata": {}, "outputs": [], "source": [ "from drugforge.alchemy.cli.utils import get_cdd_molecules\n", "\n", "molecules = get_cdd_molecules(protocol_name='assay-1', defined_stereo_only=True, remove_covalent=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "ade03258-8d3c-4543-9e86-1b968434d53f", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 5 }