[
  {
    "path": ".github/ISSUE_TEMPLATE/bug-.md",
    "content": "---\nname: Bug\nabout: Have you used metaGEM and found some strange behavior or unexpected output?\n  Please post your bug reports here!\ntitle: \"[Bug]: _______________ \"\nlabels: bug\nassignees: franciscozorrilla\n\n---\n\nPlease include any and all relevant information here, for example:\n\n* What tool and/or Snakefile rule are causing problems?\n* What steps have you taken so far?\n* Provide any relevant error messages:\n\n```\n\n```\n\nBefore posting, please search for key terms in the issues section as it is likely that your problem may have already been addressed to a certain extent. If not, please continue and post a new issue.\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/question-.md",
    "content": "---\nname: Question?\nabout: Are you wondering if metaGEM is appropriate for your particular case? Or maybe\n  you have analyzed some data using metaGEM and need help understanding the output?\n  Post your questions here!\ntitle: \"[Question]: _______________ ?\"\nlabels: question\nassignees: franciscozorrilla\n\n---\n\nPlease include any and all relevant information here, for example:\n\n* What are you trying to achieve? i.e. What question are you trying to answer?\n* What microbial community are you studying? e.g. Human gut, free living soil, synthetic lab culture?\n* What steps have you taken so far? e.g. Assembled reads with MEGAHIT but cannot get binner X to work.\n* Provide any relevant error messages:\n\n```\n\n```\n\nBefore posting, please search for key terms in the issues section as it is likely that your problem may have already been addressed to a certain extent. If not, please continue and post a new issue.\n"
  },
  {
    "path": ".github/PULL_REQUEST_TEMPLATE/PR.md",
    "content": "# ✏️ Description\n\nPlease include a summary of the changes and the related issue, as well as relevant motivation and context. List any dependencies that are required for this change.\n\nFixes # (issue)\n\n## Type of change\n\nPlease delete options that are not relevant.\n\n- [ ] Bug fix (non-breaking change which fixes an issue)\n- [ ] New feature (non-breaking change which adds functionality)\n- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)\n- [ ] This change requires a documentation update\n\n# 📐 How Has This Been Tested?\n\nPlease describe the tests that you ran to verify your changes. Also provide instructions so we can reproduce, including any relevant details for your test configuration. \nDelete options that are not relevant or add new options as necessary.\n\n- [ ] Test A\n- [ ] Test B\n\n**Test Configuration**:\n* Tool X version:\n* Tool Y version:\n* Snakemake version:\n* OS:\n\n# 🐍 Snakefile chores\n\nPlease explain how Snakefile rules were created, modified, or expanded to provide new functionalities or bugfixes. Delete options that are not relevant or add new options as necessary.\n\n- [ ] create Snakefile rule for ...\n- [ ] modified Snakefile rule for ... \n- [ ] create helper rule ...\n\n# 🔨 Additional chores\n\nPlease ensure that the appropriate config files + wrapper script have been modified to support new functionalies or bugfixes. Delete options that are not relevant or add new options as necessary.\n\n- [ ] metaGEM.sh: add options + support for new tasks\n- [ ] config.yaml: add params used in new tasks\n- [ ] config readme file: update\n- [ ] conda recipe: update\n- [ ] main readme file: text/figure\n\n# 📝 Final checklist\n\nDelete options that are not relevant or add new options as necessary.\n\n- [ ] My code follows the style guidelines of this project\n- [ ] I have performed a self-review of my code\n- [ ] I have commented my code, particularly in hard-to-understand areas\n- [ ] I have made corresponding changes to the documentation\n- [ ] My changes generate no new warnings\n- [ ] I have added tests that prove my fix is effective or that my feature works\n"
  },
  {
    "path": ".snakemake-workflow-catalog.yml",
    "content": "usage:\n  mandatory-flags: # optional definition of additional flags\n    desc: # describe your flags here in a few sentences (they will be inserted below the example commands)\n    flags: # put your flags here\n  software-stack-deployment: # definition of software deployment method (at least one of conda, singularity, or singularity+conda)\n    conda: false # whether pipeline works with --use-conda\n    singularity: false # whether pipeline works with --use-singularity\n    singularity+conda: false # whether pipeline works with --use-singularity --use-conda\n  report: false # add this to confirm that the workflow allows to use 'snakemake --report report.zip' to generate a report containing all results and explanations\n"
  },
  {
    "path": ".travis.yml",
    "content": "language: python\npython:\n # We don't actually use the Travis Python, but this keeps it organized.\n - \"3.8.6\"\n\ninstall:\n # set up conda\n - sudo apt-get update\n - wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;\n - bash miniconda.sh -b -p $HOME/miniconda\n - export PATH=\"$HOME/miniconda/bin:$PATH\"\n - hash -r\n - conda config --set always_yes yes --set changeps1 no\n - conda update -q conda\n # clone metagem repo\n - git clone https://github.com/franciscozorrilla/metaGEM.git && cd metaGEM\n # set up mamba\n - travis_wait 30 conda create --quiet --prefix ./envs/mamba mamba -c conda-forge && source activate envs/mamba\n # set up metagem\n - travis_wait 30 mamba env create --quiet --prefix ./envs/metagem -f envs/metaGEM_env.yml\n - source activate envs/metagem \n # This causes TRAVIS CI specific error, something to do with the setuptools version  - pip install --quiet --user memote carveme smetana\n - source deactivate && source activate envs/mamba\n # set up metawrap\n - travis_wait 30 mamba env create --quiet --prefix ./envs/metawrap -f envs/metaWRAP_env.yml\n # set up prokka-roary\n - travis_wait 30 mamba env create --quiet --prefix ./envs/prokkaroary -f envs/prokkaroary_env.yml\n # set root dir\n - echo -e \"Setting current directory to root in config.yaml file ... \\n\" && root=$(pwd) && sed  -i \"2s~/.*$~$root~\" config.yaml\n # set scratch dir\n - mkdir -p tmp\n - echo -e \"Setting tmp directory in config.yaml file ... \\n\" && scratch=$(pwd|sed 's|$|/tmp|g') && sed  -i \"3s~/.*$~$scratch~\" config.yaml\n\nscript:\n - source activate envs/metagem\n # run createFolders\n - snakemake createFolders -j1\n # run downloadToy\n - snakemake downloadToy -j1\n # run fastp\n - snakemake all -j1\n"
  },
  {
    "path": "CITATION.bib",
    "content": "@article{10.1093/nar/gkab815,\n    author = {Zorrilla, Francisco and Buric, Filip and Patil, Kiran R and Zelezniak, Aleksej},\n    title = \"{metaGEM: reconstruction of genome scale metabolic models directly from metagenomes}\",\n    journal = {Nucleic Acids Research},\n    volume = {49},\n    number = {21},\n    pages = {e126-e126},\n    year = {2021},\n    month = {10},\n    abstract = \"{Metagenomic analyses of microbial communities have revealed a large degree of interspecies and intraspecies genetic diversity through the reconstruction of metagenome assembled genomes (MAGs). Yet, metabolic modeling efforts mainly rely on reference genomes as the starting point for reconstruction and simulation of genome scale metabolic models (GEMs), neglecting the immense intra- and inter-species diversity present in microbial communities. Here, we present metaGEM (https://github.com/franciscozorrilla/metaGEM), an end-to-end pipeline enabling metabolic modeling of multi-species communities directly from metagenomes. The pipeline automates all steps from the extraction of context-specific prokaryotic GEMs from MAGs to community level flux balance analysis (FBA) simulations. To demonstrate the capabilities of metaGEM, we analyzed 483 samples spanning lab culture, human gut, plant-associated, soil, and ocean metagenomes, reconstructing over 14,000 GEMs. We show that GEMs reconstructed from metagenomes have fully represented metabolism comparable to isolated genomes. We demonstrate that metagenomic GEMs capture intraspecies metabolic diversity and identify potential differences in the progression of type 2 diabetes at the level of gut bacterial metabolic exchanges. Overall, metaGEM enables FBA-ready metabolic model reconstruction directly from metagenomes, provides a resource of metabolic models, and showcases community-level modeling of microbiomes associated with disease conditions allowing generation of mechanistic hypotheses.}\",\n    issn = {0305-1048},\n    doi = {10.1093/nar/gkab815},\n    url = {https://doi.org/10.1093/nar/gkab815},\n    eprint = {https://academic.oup.com/nar/article-pdf/49/21/e126/41503923/gkab815.pdf},\n}\n"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "content": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nWe as members, contributors, and leaders pledge to make participation in our\ncommunity a harassment-free experience for everyone, regardless of age, body\nsize, visible or invisible disability, ethnicity, sex characteristics, gender\nidentity and expression, level of experience, education, socio-economic status,\nnationality, personal appearance, race, religion, or sexual identity\nand orientation.\n\nWe pledge to act and interact in ways that contribute to an open, welcoming,\ndiverse, inclusive, and healthy community.\n\n## Our Standards\n\nExamples of behavior that contributes to a positive environment for our\ncommunity include:\n\n* Demonstrating empathy and kindness toward other people\n* Being respectful of differing opinions, viewpoints, and experiences\n* Giving and gracefully accepting constructive feedback\n* Accepting responsibility and apologizing to those affected by our mistakes,\n  and learning from the experience\n* Focusing on what is best not just for us as individuals, but for the\n  overall community\n\nExamples of unacceptable behavior include:\n\n* The use of sexualized language or imagery, and sexual attention or\n  advances of any kind\n* Trolling, insulting or derogatory comments, and personal or political attacks\n* Public or private harassment\n* Publishing others' private information, such as a physical or email\n  address, without their explicit permission\n* Other conduct which could reasonably be considered inappropriate in a\n  professional setting\n\n## Enforcement Responsibilities\n\nCommunity leaders are responsible for clarifying and enforcing our standards of\nacceptable behavior and will take appropriate and fair corrective action in\nresponse to any behavior that they deem inappropriate, threatening, offensive,\nor harmful.\n\nCommunity leaders have the right and responsibility to remove, edit, or reject\ncomments, commits, code, wiki edits, issues, and other contributions that are\nnot aligned to this Code of Conduct, and will communicate reasons for moderation\ndecisions when appropriate.\n\n## Scope\n\nThis Code of Conduct applies within all community spaces, and also applies when\nan individual is officially representing the community in public spaces.\nExamples of representing our community include using an official e-mail address,\nposting via an official social media account, or acting as an appointed\nrepresentative at an online or offline event.\n\n## Enforcement\n\nInstances of abusive, harassing, or otherwise unacceptable behavior may be\nreported to the community leaders responsible for enforcement at\nfz274@cam.ac.uk.\nAll complaints will be reviewed and investigated promptly and fairly.\n\nAll community leaders are obligated to respect the privacy and security of the\nreporter of any incident.\n\n## Enforcement Guidelines\n\nCommunity leaders will follow these Community Impact Guidelines in determining\nthe consequences for any action they deem in violation of this Code of Conduct:\n\n### 1. Correction\n\n**Community Impact**: Use of inappropriate language or other behavior deemed\nunprofessional or unwelcome in the community.\n\n**Consequence**: A private, written warning from community leaders, providing\nclarity around the nature of the violation and an explanation of why the\nbehavior was inappropriate. A public apology may be requested.\n\n### 2. Warning\n\n**Community Impact**: A violation through a single incident or series\nof actions.\n\n**Consequence**: A warning with consequences for continued behavior. No\ninteraction with the people involved, including unsolicited interaction with\nthose enforcing the Code of Conduct, for a specified period of time. This\nincludes avoiding interactions in community spaces as well as external channels\nlike social media. Violating these terms may lead to a temporary or\npermanent ban.\n\n### 3. Temporary Ban\n\n**Community Impact**: A serious violation of community standards, including\nsustained inappropriate behavior.\n\n**Consequence**: A temporary ban from any sort of interaction or public\ncommunication with the community for a specified period of time. No public or\nprivate interaction with the people involved, including unsolicited interaction\nwith those enforcing the Code of Conduct, is allowed during this period.\nViolating these terms may lead to a permanent ban.\n\n### 4. Permanent Ban\n\n**Community Impact**: Demonstrating a pattern of violation of community\nstandards, including sustained inappropriate behavior,  harassment of an\nindividual, or aggression toward or disparagement of classes of individuals.\n\n**Consequence**: A permanent ban from any sort of public interaction within\nthe community.\n\n## Attribution\n\nThis Code of Conduct is adapted from the [Contributor Covenant][homepage],\nversion 2.0, available at\nhttps://www.contributor-covenant.org/version/2/0/code_of_conduct.html.\n\nCommunity Impact Guidelines were inspired by [Mozilla's code of conduct\nenforcement ladder](https://github.com/mozilla/diversity).\n\n[homepage]: https://www.contributor-covenant.org\n\nFor answers to common questions about this code of conduct, see the FAQ at\nhttps://www.contributor-covenant.org/faq. Translations are available at\nhttps://www.contributor-covenant.org/translations.\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2021 Francisco Zorrilla\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# 💎 `metaGEM`\n> **Note** \n> An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data.\n\n[![Nucleic Acids Research](https://img.shields.io/badge/Nucleic%20Acids%20Research-10.1093%2Fnar%2Fgkab815-critical)](https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab815/6382386)\n[![bioRxiv](https://img.shields.io/badge/bioRxiv-10.1101%2F2020.12.31.424982%20-B31B1B)](https://www.biorxiv.org/content/10.1101/2020.12.31.424982v2.full)\n[![Build Status](https://app.travis-ci.com/franciscozorrilla/metaGEM.svg?branch=master)](https://app.travis-ci.com/github/franciscozorrilla/metaGEM)\n[![GitHub license](https://img.shields.io/github/license/franciscozorrilla/metaGEM)](https://github.com/franciscozorrilla/metaGEM/blob/master/LICENSE)\n[![Snakemake](https://img.shields.io/badge/Snakemake->=5.10.0,<5.31.1-green)](https://snakemake.readthedocs.io/en/stable/project_info/history.html#id407)\n[![Anaconda-Server Badge](https://anaconda.org/bioconda/metagem/badges/downloads.svg)](https://anaconda.org/bioconda/metagem)\n[![Gitter chat](https://badges.gitter.im/gitterHQ/gitter.png)](https://gitter.im/metaGEM/community)\n[![DOI](https://img.shields.io/badge/Zenodo-10.5281%2F4707723-blue)](https://zenodo.org/badge/latestdoi/137376259)\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1I1S8AoGuJ9Oc2292vqAGTDmZcbnolbuj#scrollTo=awiAaVwSF5Fz)\n[![Anaconda-Server Badge](https://anaconda.org/bioconda/metagem/badges/version.svg)](https://anaconda.org/bioconda/metagem)\n[![Anaconda-Server Badge](https://anaconda.org/bioconda/metagem/badges/latest_release_date.svg)](https://anaconda.org/bioconda/metagem)\n\n![metawrapfigs_final4 001](https://user-images.githubusercontent.com/35606471/116543667-0d0f8f00-a8e6-11eb-835c-bc1fe935f43e.png)\n\n`metaGEM` is a Snakemake workflow that integrates an array of existing bioinformatics and metabolic modeling tools, for the purpose of predicting metabolic interactions within bacterial communities of microbiomes. From whole metagenome shotgun datasets, metagenome assembled genomes (MAGs) are reconstructed, which are then converted into genome-scale metabolic models (GEMs) for *in silico* simulations. Additional outputs include abundance estimates, taxonomic assignment, growth rate estimation, pangenome analysis, and eukaryotic MAG identification.\n\n## ⚙️ Installation\n\nYou can start using `metaGEM` on your cluster with just one line of code with the [mamba package manager](https://github.com/mamba-org/mamba)\n\n```\nmamba create -n metagem -c bioconda metagem\n```\n\nThis will create an environment called `metagem` and start installing dependencies. Please consult the `config/README.md` page for more detailed setup instructions.\n\n[![installation](https://img.shields.io/badge/metaGEM-config-%2331a354)](https://github.com/franciscozorrilla/metaGEM/tree/master/config)\n\n## 🔧 Usage\n\nClone this repo\n\n```\ngit clone https://github.com/franciscozorrilla/metaGEM.git && cd metaGEM/workflow\n```\n\nRun `metaGEM` without any arguments to see usage instructions:\n\n```\nbash metaGEM.sh\n```\n```\nUsage: bash metaGEM.sh [-t|--task TASK] \n                       [-j|--nJobs NUMBER OF JOBS] \n                       [-c|--cores NUMBER OF CORES] \n                       [-m|--mem GB RAM] \n                       [-h|--hours MAX RUNTIME]\n                       [-l|--local]\n                       \n Options:\n  -t, --task        Specify task to complete:\n\n                        SETUP\n                            createFolders\n                            downloadToy\n                            organizeData\n                            check\n\n                        CORE WORKFLOW\n                            fastp \n                            megahit \n                            crossMapSeries\n                            kallistoIndex\n                            crossMapParallel\n                            kallisto2concoct \n                            concoct \n                            metabat\n                            maxbin \n                            binRefine \n                            binReassemble \n                            extractProteinBins\n                            carveme\n                            memote\n                            organizeGEMs\n                            smetana\n                            extractDnaBins\n                            gtdbtk\n                            abundance\n\n                        BONUS\n                            grid\n                            prokka\n                            roary\n                            eukrep\n                            eukcc\n\n                        VISUALIZATION (in development)\n                            stats\n                            qfilterVis\n                            assemblyVis\n                            binningVis\n                            taxonomyVis\n                            modelVis\n                            interactionVis\n                            growthVis\n\n  -j, --nJobs       Specify number of jobs to run in parallel\n  -c, --nCores      Specify number of cores per job\n  -m, --mem         Specify memory in GB required for job\n  -h, --hours       Specify number of hours to allocated to job runtime\n  -l, --local       Run jobs on local machine for non-cluster usage\n```\n\n## 🧉 Try it now\n\nYou can set up and use `metaGEM` on the cloud by following along the google colab notebook. \n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1I1S8AoGuJ9Oc2292vqAGTDmZcbnolbuj#scrollTo=awiAaVwSF5Fz)\n\nPlease note that google colab does not provide the computational resources necessary to fully run `metaGEM` on a real dataset. This notebook demonstrates how to set up and use `metaGEM` by perfoming the first steps in the workflow on a toy dataset.\n\n## 💩 Tutorials\n\n`metaGEM` can be used to explore your own gut microbiome sequencing data from at-home-test-kit services such as [unseen bio](https://unseenbio.com/). The following tutorial showcases the `metaGEM` workflow on two unseenbio samples.\n\n[![Tutorial](https://img.shields.io/badge/metaGEM-Tutorial-%23d8b365)](https://github.com/franciscozorrilla/unseenbio_metaGEM)\n\nFor an introductory metabolic modeling tutorial, refer to the resources compiled for the [EMBOMicroCom: Metabolite and species dynamics in microbial communities](https://www.embl.org/about/info/course-and-conference-office/events/mcd22-01/) workshop in 2022.\n\n[![Tutorial3](https://img.shields.io/badge/MicroCom-Tutorial-green)](https://github.com/franciscozorrilla/EMBOMicroCom)\n\nFor a more advanced tutorial, check out the resources we put together for the [SymbNET: from metagenomics to metabolic interactions](https://www.ebi.ac.uk/training/events/symbnet-2022/) course in 2022.\n\n[![Tutorial2](https://img.shields.io/badge/SymbNET-Tutorial-red)](https://github.com/franciscozorrilla/SymbNET)\n\n## 🏛️ Wiki\n\nRefer to the wiki for additional usage tips, frequently asked questions, and implementation details.\n\n[![wiki](https://img.shields.io/badge/metaGEM-Wiki-blue)](https://github.com/franciscozorrilla/metaGEM/wiki)\n\n## 📦 Datasets\n\n* You can access the metaGEM-generated results for the publication [here](https://github.com/franciscozorrilla/metaGEM_paper).\n```\n    🧪 Small communities of gut microbes from lab cultures\n    💩 Real gut microbiome samples from Swedish diabetes paper\n    🪴 Plant-associated soil samples from Chinese rhizobiome study\n    🌏 Bulk-soil samples from Australian biodiversity analysis\n    🌊 Ocean water samples from global TARA Oceans expeditions\n```\n* Additionally, you can access metaGEM-generated results from a reanalysis of recently published ancient metagenomes [here](https://zenodo.org/record/7414438#.Y5HSFYLP3bs).\n\n## 🐍 Workflow\n\n### Core\n\n1. Quality filter reads with [fastp](https://github.com/OpenGene/fastp)\n2. Assembly with [megahit](https://github.com/voutcn/megahit)\n3. Draft bin sets with [CONCOCT](https://github.com/BinPro/CONCOCT), [MaxBin2](https://sourceforge.net/projects/maxbin2/), and [MetaBAT2](https://sourceforge.net/projects/maxbin2/)\n4. Refine & reassemble bins with [metaWRAP](https://github.com/bxlab/metaWRAP)\n5. Taxonomic assignment with [GTDB-tk](https://github.com/Ecogenomics/GTDBTk)\n6. Relative abundances with [bwa](https://github.com/lh3/bwa) and [samtools](https://github.com/samtools/samtools)\n7. Reconstruct & evaluate genome-scale metabolic models with [CarveMe](https://github.com/cdanielmachado/carveme) and [memote](https://github.com/opencobra/memote)\n8. Species metabolic coupling analysis with [SMETANA](https://github.com/cdanielmachado/smetana)\n\n### Bonus\n\n9. Growth rate estimation with [GRiD](https://github.com/ohlab/GRiD), [SMEG](https://github.com/ohlab/SMEG) or [CoPTR](https://github.com/tyjo/coptr)\n10. Pangenome analysis with [roary](https://github.com/sanger-pathogens/Roary)\n11. Eukaryotic draft bins with [EukRep](https://github.com/patrickwest/EukRep) and [EukCC](https://github.com/Finn-Lab/EukCC)\n\n## 🏗️ Active Development\n\nIf you want to see any new additional or alternative tools incorporated into the `metaGEM` workflow please raise an issue or create a pull request. Snakemake allows workflows to be very flexible, so adding new rules is as easy as filling out the following template and adding it to the Snakefile:\n\n```\nrule package-name:\n    input:\n        rules.rulename.output\n    output:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"X\"]}/{{IDs}}/output.file'\n    message:\n        \"\"\"\n        Helpful and descriptive message detailing goal of this rule/package.\n        \"\"\"\n    shell:\n        \"\"\"\n        # Well documented command line instructions go here\n        \n        # Load conda environment \n        set +u;source activate {config[envs][package]};set -u;\n\n        # Run tool\n        package-name -i {input} -o {output}\n        \"\"\"\n```\n\n## 🖇️ Publications\n\nThe `metaGEM` workflow has been used in multiple studies, including the following non-exhaustive list:\n\n```\nPlastic-degrading potential across the global microbiome correlates with recent pollution trends\nJ Zrimec, M Kokina, S Jonasson, F Zorrilla, A Zelezniak\nMBio, 2021\n```\n\n```\nCompetition-cooperation in the chemoautotrophic ecosystem of Movile Cave: first metagenomic approach on sediments\nChiciudean, I., Russo, G., Bogdan, D.F. et al. \nEnvironmental Microbiome, 2022\n```\n\n```\nThe National Ecological Observatory Network’s soil metagenomes: assembly and basic analysis\nWerbin ZR, Hackos B, Lopez-Nava J et al. \nF1000Research, 2022\n```\n\n```\nMicrobial interactions shape cheese flavour formation\nMelkonian, C., Zorrilla, F., Kjærbølling, I. et al.\nNature Communications, 2023\n```\n\n## 🍾 Please cite\n\n```\nmetaGEM: reconstruction of genome scale metabolic models directly from metagenomes\nFrancisco Zorrilla, Filip Buric, Kiran R Patil, Aleksej Zelezniak\nNucleic Acids Research, 2021; gkab815, https://doi.org/10.1093/nar/gkab815\n``` \n\n[![Nucleic Acids Research](https://img.shields.io/badge/Nucleic%20Acids%20Research-10.1093%2Fnar%2Fgkab815-critical)](https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab815/6382386)\n\n## ⭐ Star History\n\n<a href=\"https://star-history.com/#franciscozorrilla/metaGEM&Date\">\n  <picture>\n    <source media=\"(prefers-color-scheme: dark)\" srcset=\"https://api.star-history.com/svg?repos=franciscozorrilla/metaGEM&type=Date&theme=dark\" />\n    <source media=\"(prefers-color-scheme: light)\" srcset=\"https://api.star-history.com/svg?repos=franciscozorrilla/metaGEM&type=Date\" />\n    <img alt=\"Star History Chart\" src=\"https://api.star-history.com/svg?repos=franciscozorrilla/metaGEM&type=Date\" />\n  </picture>\n</a>\n\n## 📲 Contact\n\nPlease reach out with any comments, concerns, or discussions regarding `metaGEM`.\n\n[![Gitter chat](https://badges.gitter.im/gitterHQ/gitter.png)](https://gitter.im/metaGEM/community)\n[![Twitter](https://img.shields.io/badge/Twitter-%40metagenomez-lightblue)](https://twitter.com/metagenomez)\n[![LinkedIn](https://img.shields.io/badge/LinkedIn-fzorrilla94-blue)](https://www.linkedin.com/in/fzorrilla94/)\n[![email](https://img.shields.io/badge/email-fz274%40cam.ac.uk-%23a6bddb)](fz274@cam.ac.uk)\n"
  },
  {
    "path": "config/README.md",
    "content": "# 💎 Setup guide\n\n## 🔩 Config files\n\nMake sure to inspect and set up the two config files in this folder.\n\n### Snakemake configuration \n`config.yaml`: handles all the tunable parameters, subfolder names, paths, and more. The `root` path is automatically set by the `metaGEM.sh` parser to be the current working directory. Most importantly, you should make sure that the `scratch` path is properly configured. Most clusters have a location for temporary or high I/O operations such as `$TMPDIR` or `$SCRATCH`, e.g. [see here](https://www.c3se.chalmers.se/documentation/filesystem/#using-node-local-disk-tmpdir). Please refer to the config.yaml [wiki page](https://github.com/franciscozorrilla/metaGEM/wiki/Snakefile-config) for a more in depth look at this config file.\n\n### Cluster configuration\n`cluster_config.json`: handles parameters for submitting jobs to the cluster workload manager. Most importantly, you should make sure that the `account` is properly defined to be able to submit jobs to your cluster. Please refer to the cluster_config.json wiki page for a more in depth look at this config file.\n\n## 🛢️ Environments\n\nConda can take *ages* to solve environment dependencies when installing many tools at once, we can use [mamba](https://github.com/mamba-org/mamba) instead for faster installation.\n\nSet up two environments:\n1. `metagem`: Contains most `metaGEM` core workflow tools, Python 3 & Snakemake>=5.10.0,<5.31.1\n2. `metawrap` Contains `metaWRAP` and its dependencies, Python 2\n\n\n### 1. metaGEM\n\nClone metaGEM repo\n\n```\ngit clone https://github.com/franciscozorrilla/metaGEM.git\n```\n\nMove into metaGEM/workflow folder\n\n```\ncd metaGEM/workflow\n```\n\nClean up unnecessary ~250 Mb of unnecessary git history files\n\n```\nrm -r ../.git\n```\n\nPress `y` and `Enter` when prompted to remove write-protected files, these are not necessary and just eat your precious space.\n\n```\nrm: remove write-protected regular file ‘.git/objects/pack/pack-f4a65f7b63c09419a9b30e64b0e4405c524a5b35.pack’? y\nrm: remove write-protected regular file ‘.git/objects/pack/pack-f4a65f7b63c09419a9b30e64b0e4405c524a5b35.idx’? y\n```\n\nRecommended method: install from bioconda\n```\nconda config --add channels conda-forge && mamba create --prefix envs/metagem -c bioconda metagem \n```\n\nAlternative method: create metaGEM env using recipe .yml file\n\n```\nmamba env create --prefix ./envs/metagem -f envs/metaGEM_env.yml\n```\n\nDeactivate activate metaGEM env\n\n```\nsource activate envs/metagem\n```\n\nInstall pip tools\n\n```\npip install --user memote carveme smetana\n```\n\n### 2. metaWRAP\n\nIt is best to set up `metaWRAP` in its own isolated environment to prevent version conflicts with `metaGEM`. Note that `metaWRAP v1.3.2` has not migrated from python 2 to python 3 yet.\n\n```\nconda create -n metawrap\nsource activate metawrap\nconda install -c ursky metawrap-mg=1.3.2\n```\n\nOr using the conda recipe file:\n\n```\nmamba env create --prefix ./envs/metawrap -f envs/metaWRAP_env.yml\n```\n\n## 🔮 Check installation\n\nTo make sure that the basics have been properly configured, run the `check` task using the `metaGEM.sh` parser:\n\n```\nbash metaGEM.sh -t check\n```\n\nThis will check if conda is installed/available and verify that the environments were properly set up.\nAdditionally, this `check` function will prompt you to create results folders if they are not already present.\nFinally, this task will check if any sequencing files are present in the dataset folder, prompting the user to the either organize already existing files into sample-specific subfolders or to download a small [toy dataset](https://zenodo.org/record/3534949/). \n\n## Tools requiring additional configuration\n\n> **Warning** Please note that you will need to set up the following tools/databases to run the complete core metaGEM workflow:\n\n### 1. CheckM\n\n`CheckM` is used extensively within the `metaWRAP` modules to evaluate the output of various intermediate steps. Although the `CheckM` package is installed in the `metawrap` environment, the user is required to download the `CheckM` database and run `checkm data setRoot <db_dir>` as outlined in the [`CheckM` installation guide](https://github.com/Ecogenomics/CheckM/wiki/Installation#how-to-install-checkm).\n\n### 2. GTDB-Tk\n\n`GTDB-Tk` is used for taxonomic assignment of MAGs, and requires a database to be downloaded and configured. Please refer to the [installation documentation](https://ecogenomics.github.io/GTDBTk/installing/index.html) for detailed instructions.\n\n### 3. CPLEX\n\nUnfortunately `CPLEX` cannot be automatically installed in the `env_setup.sh` script, you must install this dependency manually within the metagem conda environment. GEM reconstruction and GEM community simulations require the `IBM CPLEX solver`, which is [free to download with an academic license](https://www.ibm.com/academic/home). Refer to the [`CarveMe`](https://carveme.readthedocs.io/en/latest/installation.html) and [`SMETANA`](https://smetana.readthedocs.io/en/latest/installation.html) installation instructions for further information or troubleshooting. Note: `CPLEX v.12.8` is recommended.\n"
  },
  {
    "path": "config/cluster_config.json",
    "content": "{\n\"__default__\" : {\n        \"account\" : \"your-account-name\",\n        \"time\" : \"0-06:00:00\",\n        \"n\" : 48,\n        \"tasks\" : 1,\n        \"mem\" : 180G,\n        \"name\"      : \"DL.{rule}\",\n        \"output\"    : \"logs/{wildcards}.%N.{rule}.out.log\",\n},\n}\n"
  },
  {
    "path": "config/config.yaml",
    "content": "path:\n    root: /path/to/project/folder/on/your/cluster\n    scratch: /path/to/temporary/or/scratch/directory/for/intermediate/files\nfolder:\n    data: dataset\n    logs: logs\n    assemblies: assemblies\n    scripts: scripts\n    crossMap: crossMap \n    concoct: concoct\n    maxbin: maxbin\n    metabat: metabat\n    refined: refined_bins\n    reassembled: reassembled_bins\n    classification: GTDBTk\n    abundance: abundance\n    GRiD: GRiD\n    GEMs: GEMs\n    SMETANA: SMETANA\n    memote: memote\n    qfiltered: qfiltered\n    stats: stats\n    proteinBins: protein_bins\n    dnaBins: dna_bins\n    pangenome: pangenome\n    kallisto: kallisto\n    kallistoIndex: kallistoIndex\n    benchmarks: benchmarks\n    prodigal: prodigal\n    blastp: blastp\n    blastp_db: blastp_db\nscripts:\n    kallisto2concoct: kallisto2concoct.py\n    prepRoary: prepareRoaryInput.R\n    binFilter: binFilter.py\n    qfilterVis: qfilterVis.R\n    assemblyVis: assemblyVis.R\n    binningVis: binningVis.R\n    modelVis: modelVis.R\n    compositionVis: compositionVis.R\n    taxonomyVis: taxonomyVis.R\n    carveme: media_db.tsv\n    toy: download_toydata.txt\n    GTDBtkVis: \ncores:\n    fastp: 4\n    megahit: 48\n    crossMap: 48\n    concoct: 48\n    metabat: 48\n    maxbin: 48\n    refine: 48\n    reassemble: 48\n    classify: 2\n    gtdbtk: 48\n    abundance: 16\n    carveme: 4\n    smetana: 12\n    memote: 4\n    grid: 24\n    prokka: 2\n    roary: 12\n    diamond: 12\nparams:\n    cutfasta: 10000\n    assemblyPreset: meta-sensitive\n    assemblyMin: 1000\n    concoct: 800\n    metabatMin: 50000\n    seed: 420\n    minBin: 1500\n    refineMem: 1600\n    refineComp: 50\n    refineCont: 10\n    reassembleMem: 1600\n    reassembleComp: 50\n    reassembleCont: 10\n    carveMedia: M8\n    smetanaMedia: M1,M2,M3,M4,M5,M7,M8,M9,M10,M11,M13,M14,M15A,M15B,M16\n    smetanaSolver: CPLEX\n    roaryI: 90\n    roaryCD: 90\nenvs:\n    metagem: envs/metagem\n    metawrap: envs/metawrap\n    prokkaroary: envs/prokkaroary\n"
  },
  {
    "path": "workflow/Snakefile",
    "content": "configfile: \"../config/config.yaml\"\n\nimport os\nimport glob\n\ndef get_ids_from_path_pattern(path_pattern):\n    ids = sorted([os.path.basename(os.path.splitext(val)[0])\n                  for val in (glob.glob(path_pattern))])\n    return ids\n\n\ngemIDs = get_ids_from_path_pattern('GEMs/*.xml')\nbinIDs = get_ids_from_path_pattern('protein_bins/*.faa')\nIDs = get_ids_from_path_pattern('dataset/*')\nspeciesIDs = get_ids_from_path_pattern('pangenome/speciesBinIDs/*.txt')\nDATA_READS_1 = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"data\"]}/{{IDs}}/{{IDs}}_R1.fastq.gz'\nDATA_READS_2 = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"data\"]}/{{IDs}}/{{IDs}}_R2.fastq.gz'\nfocal = get_ids_from_path_pattern('dataset/*')\n\nrule all:\n    input:\n        expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"qfiltered\"]+\"/{IDs}/{IDs}_R1.fastq.gz\", IDs = IDs)\n    message:\n        \"\"\"\n        WARNING: Be very careful when adding/removing any lines above this message.\n        The metaGEM.sh parser is presently hardcoded to edit line 22 of this Snakefile to expand target rules accordingly,\n        therefore adding/removing any lines before this message will likely result in parser malfunction.\n        \"\"\"\n    shell:\n        \"\"\"\n        echo \"Gathering {input} ... \"\n        \"\"\"\n\nrule createFolders:\n    input:\n        config[\"path\"][\"root\"]\n    message:\n        \"\"\"\n        Very simple rule to check that the metaGEM.sh parser, Snakefile, and config.yaml file are set up correctly. \n        Generates folders from config.yaml config file, not strictly necessary to run this rule.\n        \"\"\"\n    shell:\n        \"\"\"\n        cd {input}\n        echo -e \"Setting up result folders in the following work directory: $(echo {input}) \\n\"\n\n        # Generate folders.txt by extracting folder names from config.yaml file\n        paste config.yaml |cut -d':' -f2|tail -n +4|head -n 25|sed '/^$/d' > folders.txt # NOTE: hardcoded numbers (tail 4, head 25) for folder names, increase number as new folders are introduced.\n        \n        while read line;do \n            echo \"Creating $line folder ... \"\n            mkdir -p $line;\n        done < folders.txt\n        \n        echo -e \"\\nDone creating folders. \\n\"\n\n        rm folders.txt\n        \"\"\"\n\n\nrule downloadToy:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"scripts\"]}/{config[\"scripts\"][\"toy\"]}'\n    message:\n        \"\"\"\n        Downloads toy samples into dataset folder and organizes into sample-specific sub-folders.\n        Download a real dataset by replacing the links in the download_toydata.txt file with links to files from your dataset of intertest.\n        Note: Make sure that the only underscores (e.g. _) that appear in the filenames are between the sample ID and R1/R2 identifier.\n        \"\"\"\n    shell:\n        \"\"\"\n        cd {config[path][root]}/{config[folder][data]}\n\n        # Download each link in download_toydata.txt\n        echo -e \"\\nBegin downloading toy dataset ... \"\n        while read line;do \n            wget $line;\n        done < {input}\n        echo -e \"\\nDone donwloading dataset.\"\n        \n        # Rename downloaded files, this is only necessary for toy dataset (will cause error if used for real dataset)\n        echo -ne \"\\nRenaming downloaded files ... \"\n        for file in *;do \n            mv $file ./$(echo $file|sed 's/?download=1//g'|sed 's/_/_R/g');\n        done\n        echo -e \" done. \\n\"\n\n        # Organize data into sample specific sub-folders\n\n        echo -ne \"Generating list of unique sample IDs ... \"\n        for file in *.gz; do \n            echo $file; \n        done | sed 's/_.*$//g' | sed 's/.fastq.gz//g' | uniq > ID_samples.txt\n        echo -e \" done.\\n $(less ID_samples.txt|wc -l) samples identified.\"\n\n        echo -ne \"\\nOrganizing downloaded files into sample specific sub-folders ... \"\n        while read line; do \n            mkdir -p $line; \n            mv $line*.gz $line; \n        done < ID_samples.txt\n        echo \" done.\"\n        \n        rm ID_samples.txt\n        \"\"\"\n\n\nrule organizeData:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"data\"]}'\n    message:\n        \"\"\"\n        Sorts paired end raw reads into sample specific sub folders within the dataset folder specified in the config.yaml file.\n        Assumes all samples are present in dataset folder.\n        \n        Note: This rule is meant to be run on real datasets. \n        downloadToy rule above sorts the downloaded data already.\n\n        Assumes file names have format: SAMPLEID_R1|R2.fastq.gz, e.g. ERR599026_R2.fastq.gz\n        \"\"\"\n    shell:\n        \"\"\"\n        cd {input}\n    \n        echo -ne \"\\nGenerating list of unique sample IDs ... \"\n\n        # Create list of unique sample IDs\n        for file in *.gz; do \n            echo $file; \n        done | sed 's/_[^_]*$//g' | sed 's/.fastq.gz//g' | uniq > ID_samples.txt\n\n        echo -e \" done.\\n $(less ID_samples.txt|wc -l) samples identified.\\n\"\n\n        # Create folder and move corresponding files for each sample\n\n        echo -ne \"\\nOrganizing dataset into sample specific sub-folders ... \"\n        while read line; do \n            mkdir -p $line; \n            mv $line*.gz $line; \n        done < ID_samples.txt\n        echo -e \" done. \\n\"\n        \n        rm ID_samples.txt\n        \"\"\"\n\nrule qfilter:\n    input:\n        R1 = DATA_READS_1,\n        R2 = DATA_READS_2\n    output:\n        R1 = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"qfiltered\"]}/{{IDs}}/{{IDs}}_R1.fastq.gz', \n        R2 = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"qfiltered\"]}/{{IDs}}/{{IDs}}_R2.fastq.gz' \n    shell:\n        \"\"\"\n        # Activate metagem environment\n        echo -e \"Activating {config[envs][metagem]} conda environment ... \"\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # This is just to make sure that output folder exists\n        mkdir -p $(dirname {output.R1})\n\n        # Make job specific scratch dir\n        idvar=$(echo $(basename $(dirname {output.R1}))|sed 's/_R1.fastq.gz//g')\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][qfiltered]}/${{idvar}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][qfiltered]}/${{idvar}}\n\n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][qfiltered]}/${{idvar}}\n\n        # Copy files\n        echo -e \"Copying {input.R1} and {input.R2} to {config[path][scratch]}/{config[folder][qfiltered]}/${{idvar}} ... \"\n        cp {input.R1} {input.R2} .\n\n        echo -e \"Appending .raw to temporary input files to avoid name conflict ... \"\n        for file in *.gz; do mv -- \"$file\" \"${{file}}.raw.gz\"; done\n\n        # Run fastp\n        echo -n \"Running fastp ... \"\n        fastp --thread {config[cores][fastp]} \\\n            -i *R1*raw.gz \\\n            -I *R2*raw.gz \\\n            -o $(basename {output.R1}) \\\n            -O $(basename {output.R2}) \\\n            -j $(dirname {output.R1})/$(echo $(basename $(dirname {output.R1}))).json \\\n            -h $(dirname {output.R1})/$(echo $(basename $(dirname {output.R1}))).html\n\n        # Move output files to root dir\n        echo -e \"Moving output files $(basename {output.R1}) and $(basename {output.R2}) to $(dirname {output.R1})\"\n        mv $(basename {output.R1}) $(basename {output.R2}) $(dirname {output.R1})\n\n        # Warning\n        echo -e \"Note that you must manually clean up these temporary directories if your scratch directory points to a static location instead of variable with a job specific location ... \"\n\n        # Done message\n        echo -e \"Done quality filtering sample ${{idvar}}\"\n        \"\"\"\n\n\nrule qfilterVis:\n    input: \n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"qfiltered\"]}'\n    output: \n        text = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/qfilter.stats',\n        plot = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/qfilterVis.pdf'\n    shell:\n        \"\"\"\n        # Activate metagem env\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Make sure stats folder exists\n        mkdir -p $(dirname {output.text})\n\n        # Move into qfiltered folder\n        cd {input}\n\n        # Read and summarize files\n        echo -e \"\\nGenerating quality filtering results file qfilter.stats: ... \"\n        for folder in */;do\n            for file in $folder*json;do\n\n                # Define sample\n                ID=$(echo $file|sed 's|/.*$||g')\n\n                # Reads before filtering\n                readsBF=$(head -n 25 $file|grep total_reads|cut -d ':' -f2|sed 's/,//g'|head -n 1)\n\n                # Reads after filtering\n                readsAF=$(head -n 25 $file|grep total_reads|cut -d ':' -f2|sed 's/,//g'|tail -n 1)\n\n                # Bases before filtering\n                basesBF=$(head -n 25 $file|grep total_bases|cut -d ':' -f2|sed 's/,//g'|head -n 1)\n\n                # Bases after filtering\n                basesAF=$(head -n 25 $file|grep total_bases|cut -d ':' -f2|sed 's/,//g'|tail -n 1)\n\n                # Q20 bases before filtering\n                q20BF=$(head -n 25 $file|grep q20_rate|cut -d ':' -f2|sed 's/,//g'|head -n 1)\n\n                # Q20 bases after filtering\n                q20AF=$(head -n 25 $file|grep q20_rate|cut -d ':' -f2|sed 's/,//g'|tail -n 1)\n\n                # Q30 bases before filtering\n                q30BF=$(head -n 25 $file|grep q30_rate|cut -d ':' -f2|sed 's/,//g'|head -n 1)\n\n                # Q30 bases after filtering\n                q30AF=$(head -n 25 $file|grep q30_rate|cut -d ':' -f2|sed 's/,//g'|tail -n 1)\n\n                # Percentage of reads kept after filtering\n                percent=$(awk -v RBF=\"$readsBF\" -v RAF=\"$readsAF\" 'BEGIN{{print RAF/RBF}}' )\n\n                # Write values to qfilter.stats file\n                echo \"$ID $readsBF $readsAF $basesBF $basesAF $percent $q20BF $q20AF $q30BF $q30AF\" >> qfilter.stats\n\n                # Print values\n                echo \"Sample $ID retained $percent * 100 % of reads ... \"\n            done\n        done\n\n        echo \"Done summarizing quality filtering results ... \\nMoving to /stats/ folder and running plotting script ... \"\n        mv qfilter.stats {config[path][root]}/{config[folder][stats]}\n\n        # Move to stats folder\n        cd {config[path][root]}/{config[folder][stats]}\n\n        # Run script for quality filter visualization\n        Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][qfilterVis]}\n        echo \"Done. \"\n\n        # Remove duplicate/extra plot\n        rm Rplots.pdf\n        \"\"\"\n\n\nrule megahit:\n    input:\n        R1 = rules.qfilter.output.R1, \n        R2 = rules.qfilter.output.R2\n    output:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"assemblies\"]}/{{IDs}}/contigs.fasta.gz'\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.megahit.benchmark.txt'\n    shell:\n        \"\"\"\n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Make sure that output folder exists\n        mkdir -p $(dirname {output})\n\n        # Make job specific scratch dir\n        idvar=$(echo $(basename $(dirname {output})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][assemblies]}/${{idvar}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][assemblies]}/${{idvar}}\n\n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][assemblies]}/${{idvar}}\n\n        # Copy files\n        echo -n \"Copying qfiltered reads to {config[path][scratch]}/${{idvar}} ... \"\n        cp {input.R1} {input.R2} .\n        echo \"done. \"\n\n        # Run megahit\n        echo -n \"Running MEGAHIT ... \"\n        megahit -t {config[cores][megahit]} \\\n            --presets {config[params][assemblyPreset]} \\\n            --verbose \\\n            --min-contig-len {config[params][assemblyMin]} \\\n            -1 $(basename {input.R1}) \\\n            -2 $(basename {input.R2}) \\\n            -o tmp;\n        echo \"done. \"\n\n        # Rename assembly\n        echo \"Renaming assembly ... \"\n        mv tmp/final.contigs.fa contigs.fasta\n        \n        # Remove spaces from contig headers and replace with hyphens\n        echo \"Fixing contig header names: replacing spaces with hyphens ... \"\n        sed -i 's/ /-/g' contigs.fasta\n\n        # Zip and move assembly to output folder\n        echo \"Zipping and moving assembly ... \"\n        gzip contigs.fasta\n        mv contigs.fasta.gz $(dirname {output})\n\n        # Done message\n        echo -e \"Done assembling quality filtered reads for sample ${{idvar}}\"\n        \"\"\"\n\nrule assemblyVis:\n    input: \n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"assemblies\"]}'\n    message:\n        \"\"\"\n        Note that this rule is designed to read megahit assemblies with hyphens instead of \n        spaces in contig headers as generated by the megahit rule above.\n        \"\"\"\n    output: \n        text = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/assembly.stats',\n        plot = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/assemblyVis.pdf',\n    shell:\n        \"\"\"\n        # Activate metagem env\n        set +uo pipefail;source activate {config[envs][metagem]};set -u;\n\n        # Make sure stats folder exists\n        mkdir -p $(dirname {output.text})\n\n        # Move into assembly folder\n        cd {input}\n    \n        echo -e \"\\nGenerating assembly results file assembly.stats: ... \"\n        while read assembly;do\n\n            # Define sample ID\n            ID=$(echo $(basename $(dirname $assembly)))\n\n            # Check if assembly file is empty\n            check=$(zcat $assembly | head | wc -l)\n            if [ $check -eq 0 ]\n            then\n                N=0\n                L=0\n            else\n                N=$(zcat $assembly | grep -c \">\");\n                L=$(zcat $assembly | grep \">\"|cut -d '-' -f4|sed 's/len=//'|awk '{{sum+=$1}}END{{print sum}}');\n            fi\n\n            # Write values to stats file\n            echo $ID $N $L >> assembly.stats;\n\n            # Print values to terminal\n            echo -e \"Sample $ID has a total of $L bp across $N contigs ... \"\n        done< <(find {input} -name \"*.gz\")\n\n        echo \"Done summarizing assembly results ... \\nMoving to /stats/ folder and running plotting script ... \"\n        mv assembly.stats {config[path][root]}/{config[folder][stats]}\n\n        # Move to stats folder\n        cd {config[path][root]}/{config[folder][stats]}\n\n        # Running assembly Vis R script\n        Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][assemblyVis]}\n        echo \"Done. \"\n\n        # Remove unnecessary file\n        rm Rplots.pdf\n        \"\"\"\n\nrule crossMapSeries:\n    input:\n        contigs = rules.megahit.output,\n        reads = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"qfiltered\"]}'\n    output:\n        concoct = directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"concoct\"]}/{{IDs}}/cov'),\n        metabat = directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"metabat\"]}/{{IDs}}/cov'),\n        maxbin = directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"maxbin\"]}/{{IDs}}/cov')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.crossMapSeries.benchmark.txt'\n    message:\n        \"\"\"\n        Cross map in seies:\n        Use this approach to provide all 3 binning tools with cross-sample coverage information.\n        Will likely provide superior binning results, but may no be feasible for datasets with \n        many large samples such as the tara oceans dataset. \n        \"\"\"\n    shell:\n        \"\"\"\n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Create output folders\n        mkdir -p {output.concoct}\n        mkdir -p {output.metabat}\n        mkdir -p {output.maxbin}\n\n        # Make job specific scratch dir\n        idvar=$(echo $(basename $(dirname {output.concoct})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][crossMap]}/${{idvar}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][crossMap]}/${{idvar}}\n\n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][crossMap]}/${{idvar}}\n\n        # Copy files\n        cp {input.contigs} .\n\n        # Define the focal sample ID, fsample: \n        # The one sample's assembly that all other samples' read will be mapped against in a for loop\n        fsampleID=$(echo $(basename $(dirname {input.contigs})))\n        echo -e \"\\nFocal sample: $fsampleID ... \"\n\n        echo \"Renaming and unzipping assembly ... \"\n        mv $(basename {input.contigs}) $(echo $fsampleID|sed 's/$/.fa.gz/g')\n        gunzip $(echo $fsampleID|sed 's/$/.fa.gz/g')\n\n        echo -e \"\\nIndexing assembly ... \"\n        bwa index $fsampleID.fa\n        \n        for folder in {input.reads}/*;do \n\n                id=$(basename $folder)\n\n                echo -e \"\\nCopying sample $id to be mapped against the focal sample $fsampleID ...\"\n                cp $folder/*.gz .\n                \n                # Maybe I should be piping the lines below to reduce I/O ?\n\n                echo -e \"\\nMapping sample to assembly ... \"\n                bwa mem -t {config[cores][crossMap]} $fsampleID.fa *.fastq.gz > $id.sam\n                \n                echo -e \"\\nConverting SAM to BAM with samtools view ... \" \n                samtools view -@ {config[cores][crossMap]} -Sb $id.sam > $id.bam\n\n                echo -e \"\\nSorting BAM file with samtools sort ... \" \n                samtools sort -@ {config[cores][crossMap]} -o $id.sort $id.bam\n\n                echo -e \"\\nRunning jgi_summarize_bam_contig_depths script to generate contig abundance/depth file for maxbin2 input ... \"\n                jgi_summarize_bam_contig_depths --outputDepth $id.depth $id.sort\n\n                echo -e \"\\nMoving depth file to sample $fsampleID maxbin2 folder ... \"\n                mv $id.depth {output.maxbin}\n\n                echo -e \"\\nIndexing sorted BAM file with samtools index for CONCOCT input table generation ... \" \n                samtools index $id.sort\n\n                echo -e \"\\nRemoving temporary files ... \"\n                rm *.fastq.gz *.sam *.bam\n\n        done\n        \n        nSamples=$(ls {input.reads}|wc -l)\n        echo -e \"\\nDone mapping focal sample $fsampleID agains $nSamples samples in dataset folder.\"\n\n        echo -e \"\\nRunning jgi_summarize_bam_contig_depths for all sorted bam files to generate metabat2 input ... \"\n        jgi_summarize_bam_contig_depths --outputDepth $id.all.depth *.sort\n\n        echo -e \"\\nMoving input file $id.all.depth to $fsampleID metabat2 folder... \"\n        mv $id.all.depth {output.metabat}\n\n        echo -e \"Done. \\nCutting up contigs to 10kbp chunks (default), not to be used for mapping!\"\n        cut_up_fasta.py -c {config[params][cutfasta]} -o 0 -m $fsampleID.fa -b assembly_c10k.bed > assembly_c10k.fa\n\n        echo -e \"\\nSummarizing sorted and indexed BAM files with concoct_coverage_table.py to generate CONCOCT input table ... \" \n        concoct_coverage_table.py assembly_c10k.bed *.sort > coverage_table.tsv\n\n        echo -e \"\\nMoving CONCOCT input table to $fsampleID concoct folder\"\n        mv coverage_table.tsv {output.concoct}\n\n        echo -e \"\\nRemoving intermediate sorted bam files ... \"\n        rm *.sort\n        \"\"\"\n\nrule kallistoIndex:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"assemblies\"]}/{{focal}}/contigs.fasta.gz'\n    output:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"kallistoIndex\"]}/{{focal}}/index.kaix'\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{focal}}.kallistoIndex.benchmark.txt'\n    message:\n        \"\"\"\n        Needed for the crossMapParallel implementation, which uses kalliso for fast mapping instead of bwa.\n        Saves a lot of computational power/time to only create once and re-use for each job.\n        \"\"\"\n    shell:\n        \"\"\"\n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Create output folder\n        mkdir -p $(dirname {output})\n\n        # Make job specific scratch dir\n        sampleID=$(echo $(basename $(dirname {input})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][kallistoIndex]}/${{sampleID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][kallistoIndex]}/${{sampleID}}\n        \n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][kallistoIndex]}/${{sampleID}}\n\n        # Copy files\n        echo -e \"\\nCopying and unzipping sample $sampleID assembly ... \"\n        cp {input} .\n\n        # Rename files\n        mv $(basename {input}) $(echo $sampleID|sed 's/$/.fa.gz/g')\n        gunzip $(echo $sampleID|sed 's/$/.fa.gz/g')\n\n        echo -e \"\\nCutting up assembly contigs >= 20kbp into 10kbp chunks ... \"\n        cut_up_fasta.py $sampleID.fa -c 10000 -o 0 --merge_last > contigs_10K.fa\n\n        echo -e \"\\nCreating kallisto index ... \"\n        kallisto index contigs_10K.fa -i index.kaix\n\n        mv index.kaix $(dirname {output})\n        \"\"\"\n\n\nrule crossMapParallel:  \n    input:\n        index = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"kallistoIndex\"]}/{{focal}}/index.kaix',\n        R1 = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"qfiltered\"]}/{{IDs}}/{{IDs}}_R1.fastq.gz',\n        R2 = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"qfiltered\"]}/{{IDs}}/{{IDs}}_R2.fastq.gz'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"kallisto\"]}/{{focal}}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{focal}}.{{IDs}}.crossMapParallel.benchmark.txt'\n    message:\n        \"\"\"\n        This rule is an alternative implementation of crossMapSeries, using kallisto \n        instead of bwa for mapping operations. This implementation is recommended for\n        large datasets.\n        \"\"\"\n    shell:\n        \"\"\"\n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Create output folder\n        mkdir -p {output}\n\n        # Make job specific scratch dir\n        focal=$(echo $(basename $(dirname {input.index})))\n        mapping=$(echo $(basename $(dirname {input.R1})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][kallisto]}/${{focal}}_${{mapping}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][kallisto]}/${{focal}}_${{mapping}}\n\n        # Move into tmp dir\n        cd {config[path][scratch]}/{config[folder][kallisto]}/${{focal}}_${{mapping}}\n\n        # Copy files\n        echo -e \"\\nCopying assembly index {input.index} and reads {input.R1} {input.R2} to $(pwd) ... \"\n        cp {input.index} {input.R1} {input.R2} .\n\n        # Run kallisto\n        echo -e \"\\nRunning kallisto ... \"\n        kallisto quant --threads {config[cores][crossMap]} --plaintext -i index.kaix -o . $(basename {input.R1}) $(basename {input.R2})\n        \n        # Zip file\n        echo -e \"\\nZipping abundance file ... \"\n        gzip abundance.tsv\n\n        # Move mapping file out output folder\n        mv abundance.tsv.gz {output}\n\n        # Cleanup temp folder\n        echo -e \"\\nRemoving temporary directory {config[path][scratch]}/{config[folder][kallisto]}/${{focal}}_${{mapping}} ... \"\n        cd -\n        rm -r {config[path][scratch]}/{config[folder][kallisto]}/${{focal}}_${{mapping}}\n        \"\"\"\n\nrule gatherCrossMapParallel: \n    input:\n        expand(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"kallisto\"]}/{{focal}}/{{IDs}}', focal = focal , IDs = IDs)\n    shell:\n        \"\"\"\n        echo \"Gathering cross map jobs ...\" \n        \"\"\"\n\nrule concoct:\n    input:\n        table = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"concoct\"]}/{{IDs}}/cov/coverage_table.tsv',\n        contigs = rules.megahit.output\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"concoct\"]}/{{IDs}}/{{IDs}}.concoct-bins')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.concoct.benchmark.txt'\n    shell:\n        \"\"\"\n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Create output folder\n        mkdir -p $(dirname {output})\n\n        # Make job specific scratch dir\n        sampleID=$(echo $(basename $(dirname {input.contigs})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][concoct]}/${{sampleID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][concoct]}/${{sampleID}}\n\n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][concoct]}/${{sampleID}}\n\n        # Copy files\n        cp {input.contigs} {input.table} .\n\n        echo \"Unzipping assembly ... \"\n        gunzip $(basename {input.contigs})\n\n        echo -e \"Done. \\nCutting up contigs (default 10kbp chunks) ... \"\n        cut_up_fasta.py -c {config[params][cutfasta]} -o 0 -m $(echo $(basename {input.contigs})|sed 's/.gz//') > assembly_c10k.fa\n        \n        echo -e \"\\nRunning CONCOCT ... \"\n        concoct --coverage_file $(basename {input.table}) \\\n            --composition_file assembly_c10k.fa \\\n            -b $(basename $(dirname {output})) \\\n            -t {config[cores][concoct]} \\\n            -c {config[params][concoct]}\n            \n        echo -e \"\\nMerging clustering results into original contigs ... \"\n        merge_cutup_clustering.py $(basename $(dirname {output}))_clustering_gt1000.csv > $(basename $(dirname {output}))_clustering_merged.csv\n        \n        echo -e \"\\nExtracting bins ... \"\n        mkdir -p $(basename {output})\n        extract_fasta_bins.py $(echo $(basename {input.contigs})|sed 's/.gz//') $(basename $(dirname {output}))_clustering_merged.csv --output_path $(basename {output})\n        \n        # Move final result files to output folder\n        mv $(basename {output}) *.txt *.csv $(dirname {output})\n        \"\"\"\n\nrule metabatCross:\n    input:\n        assembly = rules.megahit.output,\n        depth = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"metabat\"]}/{{IDs}}/cov'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"metabat\"]}/{{IDs}}/{{IDs}}.metabat-bins')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.metabat.benchmark.txt'\n    shell:\n        \"\"\"\n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Create output folder\n        mkdir -p {output}\n\n        # Make job specific scratch dir\n        fsampleID=$(echo $(basename $(dirname {input.assembly})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][metabat]}/${{fsampleID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][metabat]}/${{fsampleID}}\n\n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][metabat]}/${{fsampleID}}\n\n        # Copy files to tmp\n        cp {input.assembly} {input.depth}/*.all.depth .\n\n        # Unzip assembly\n        gunzip $(basename {input.assembly})\n\n        # Run metabat2\n        echo -e \"\\nRunning metabat2 ... \"\n        metabat2 -i contigs.fasta -a *.all.depth -s {config[params][metabatMin]} -v --seed {config[params][seed]} -t 0 -m {config[params][minBin]} -o $(basename $(dirname {output}))\n\n        # Move result files to output dir\n        mv *.fa {output}\n        \"\"\"\n\nrule maxbinCross:\n    input:\n        assembly = rules.megahit.output,\n        depth = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"maxbin\"]}/{{IDs}}/cov'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"maxbin\"]}/{{IDs}}/{{IDs}}.maxbin-bins')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.maxbin.benchmark.txt'\n    shell:\n        \"\"\"\n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Create output folder\n        mkdir -p $(dirname {output})\n\n        # Make job specific scratch dir\n        fsampleID=$(echo $(basename $(dirname {input.assembly})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][maxbin]}/${{fsampleID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][maxbin]}/${{fsampleID}}\n\n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][maxbin]}/${{fsampleID}}\n\n        # Copy files to tmp\n        cp -r {input.assembly} {input.depth}/*.depth .\n\n        echo -e \"\\nUnzipping assembly ... \"\n        gunzip $(basename {input.assembly})\n\n        echo -e \"\\nGenerating list of depth files based on crossMapSeries rule output ... \"\n        find . -name \"*.depth\" > abund.list\n        \n        echo -e \"\\nRunning maxbin2 ... \"\n        run_MaxBin.pl -thread {config[cores][maxbin]} -contig contigs.fasta -out $(basename $(dirname {output})) -abund_list abund.list\n        \n        # Clean up un-needed files\n        rm *.depth abund.list contigs.fasta\n\n        # Move files into output dir\n        mkdir -p $(basename {output})\n        while read bin;do mv $bin $(basename {output});done< <(ls|grep fasta)\n        mv * $(dirname {output})\n        \"\"\"\n\nrule binning:\n    input:\n        concoct = expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"concoct\"]+\"/{IDs}/{IDs}.concoct-bins\", IDs = IDs),\n        maxbin = expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"maxbin\"]+\"/{IDs}/{IDs}.maxbin-bins\", IDs = IDs),\n        metabat = expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"metabat\"]+\"/{IDs}/{IDs}.metabat-bins\", IDs = IDs)\n\n\nrule binRefine:\n    input:\n        concoct = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"concoct\"]}/{{IDs}}/{{IDs}}.concoct-bins',\n        metabat = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"metabat\"]}/{{IDs}}/{{IDs}}.metabat-bins',\n        maxbin = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"maxbin\"]}/{{IDs}}/{{IDs}}.maxbin-bins'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"refined\"]}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.binRefine.benchmark.txt'\n    shell:\n        \"\"\"\n        # Activate metawrap environment\n        set +u;source activate {config[envs][metawrap]};set -u;\n\n        # Create output folder\n        mkdir -p {output}\n\n        # Make job specific scratch dir\n        fsampleID=$(echo $(basename $(dirname {input.concoct})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][refined]}/${{fsampleID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][refined]}/${{fsampleID}}\n\n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][refined]}/${{fsampleID}}\n\n        # Copy files to tmp\n        echo \"Copying bins from CONCOCT, metabat2, and maxbin2 to {config[path][scratch]} ... \"\n        cp -r {input.concoct} {input.metabat} {input.maxbin} .\n\n        echo \"Renaming bin folders to avoid errors with metaWRAP ... \"\n        mv $(basename {input.concoct}) $(echo $(basename {input.concoct})|sed 's/-bins//g')\n        mv $(basename {input.metabat}) $(echo $(basename {input.metabat})|sed 's/-bins//g')\n        mv $(basename {input.maxbin}) $(echo $(basename {input.maxbin})|sed 's/-bins//g')\n        \n        echo \"Running metaWRAP bin refinement module ... \"\n        metaWRAP bin_refinement -o . \\\n            -A $(echo $(basename {input.concoct})|sed 's/-bins//g') \\\n            -B $(echo $(basename {input.metabat})|sed 's/-bins//g') \\\n            -C $(echo $(basename {input.maxbin})|sed 's/-bins//g') \\\n            -t {config[cores][refine]} \\\n            -m {config[params][refineMem]} \\\n            -c {config[params][refineComp]} \\\n            -x {config[params][refineCont]}\n \n        rm -r $(echo $(basename {input.concoct})|sed 's/-bins//g') $(echo $(basename {input.metabat})|sed 's/-bins//g') $(echo $(basename {input.maxbin})|sed 's/-bins//g') work_files\n        mv * {output}\n        \"\"\"\n\n\nrule binReassemble:\n    input:\n        R1 = rules.qfilter.output.R1, \n        R2 = rules.qfilter.output.R2,\n        refinedBins = rules.binRefine.output\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"reassembled\"]}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.binReassemble.benchmark.txt'\n    shell:\n        \"\"\"\n        # Activate metawrap environment\n        set +u;source activate {config[envs][metawrap]};set -u;\n\n        # Prevents spades from using just one thread\n        export OMP_NUM_THREADS={config[cores][reassemble]}\n\n        # Create output folder\n        mkdir -p {output}\n\n        # Make job specific scratch dir\n        fsampleID=$(echo $(basename $(dirname {input.R1})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][reassembled]}/${{fsampleID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][reassembled]}/${{fsampleID}}\n\n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][reassembled]}/${{fsampleID}}\n\n        # Copy files to tmp   \n        cp -r {input.refinedBins}/metawrap_*_bins {input.R1} {input.R2} .\n        \n        echo \"Running metaWRAP bin reassembly ... \"\n        metaWRAP reassemble_bins --parallel -o $(basename {output}) \\\n            -b metawrap_*_bins \\\n            -1 $(basename {input.R1}) \\\n            -2 $(basename {input.R2}) \\\n            -t {config[cores][reassemble]} \\\n            -m {config[params][reassembleMem]} \\\n            -c {config[params][reassembleComp]} \\\n            -x {config[params][reassembleCont]}\n        \n        # Cleaning up files\n        rm -r metawrap_*_bins\n        rm -r $(basename {output})/work_files\n        rm *.fastq.gz \n\n        # Move results to output folder\n        mv * $(dirname {output})\n        \"\"\"\n\nrule binEvaluation:\n    input: \n        refined = expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"refined\"]+\"/{IDs}\", IDs = IDs),\n        reassembled = expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"reassembled\"]+\"/{IDs}\", IDs = IDs)\n\n\nrule binningVis:\n    input: \n        f'{config[\"path\"][\"root\"]}'\n    output: \n        text = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/reassembled_bins.stats',\n        plot = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/binningVis.pdf'\n    message:\n        \"\"\"\n        Generate bar plot with number of bins and density plot of bin contigs, \n        total length, completeness, and contamination across different tools.\n        \"\"\"\n    shell:\n        \"\"\"\n        # Activate metagem env\n        set +u;source activate {config[envs][metagem]};set -u;\n        \n        # Read CONCOCT bins\n        echo \"Generating concoct_bins.stats file containing bin ID, number of contigs, and length ... \"\n        cd {input}/{config[folder][concoct]}\n        for folder in */;do \n\n            # Define sample name\n            var=$(echo $folder|sed 's|/||g');\n            for bin in $folder*concoct-bins/*.fa;do \n\n                # Define bin name\n                name=$(echo $bin | sed \"s|^.*/|$var.bin.|g\" | sed 's/.fa//g');\n\n                # Count contigs\n                N=$(less $bin | grep -c \">\");\n\n                # Sum length\n                L=$(less $bin |grep \">\"|cut -d '-' -f4|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}')\n\n                # Print values to terminal and write to stats file\n                echo \"Reading bin $bin ... Contigs: $N , Length: $L \"\n                echo $name $N $L >> concoct_bins.stats;\n            done;\n        done\n        mv *.stats {input}/{config[folder][reassembled]}\n\n        # Read MetaBAT2 bins\n        echo \"Generating metabat_bins.stats file containing bin ID, number of contigs, and length ... \"\n        cd {input}/{config[folder][metabat]}\n        for folder in */;do \n\n            # Define sample name\n            var=$(echo $folder | sed 's|/||');\n            for bin in $folder*metabat-bins/*.fa;do \n\n                # Define bin name\n                name=$(echo $bin|sed 's/.fa//g'|sed 's|^.*/||g'|sed \"s/^/$var./g\"); \n\n                # Count contigs\n                N=$(less $bin | grep -c \">\");\n\n                # Sum length\n                L=$(less $bin |grep \">\"|cut -d '-' -f4|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}')\n\n                # Print values to terminal and write to stats file\n                echo \"Reading bin $bin ... Contigs: $N , Length: $L \"\n                echo $name $N $L >> metabat_bins.stats;\n            done;\n        done\n        mv *.stats {input}/{config[folder][reassembled]}\n\n        # Read MaxBin2 bins\n        echo \"Generating maxbin_bins.stats file containing bin ID, number of contigs, and length ... \"\n        cd {input}/{config[folder][maxbin]}\n        for folder in */;do\n            for bin in $folder*maxbin-bins/*.fasta;do \n\n                # Define bin name\n                name=$(echo $bin | sed 's/.fasta//g' | sed 's|^.*/||g');\n\n                # Count contigs\n                N=$(less $bin | grep -c \">\");\n\n                # Sum length\n                L=$(less $bin |grep \">\"|cut -d '-' -f4|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}')\n\n                # Print values to terminal and write to stats file\n                echo \"Reading bin $bin ... Contigs: $N , Length: $L \"\n                echo $name $N $L >> maxbin_bins.stats;\n            done;\n        done\n        mv *.stats {input}/{config[folder][reassembled]}\n\n        # Read metaWRAP refined bins\n        echo \"Generating refined_bins.stats file containing bin ID, number of contigs, and length ... \"\n        cd {input}/{config[folder][refined]}\n        for folder in */;do \n\n            # Define sample name \n            samp=$(echo $folder | sed 's|/||');\n            for bin in $folder*metawrap_*_bins/*.fa;do \n\n                # Define bin name\n                name=$(echo $bin | sed 's/.fa//g'|sed 's|^.*/||g'|sed \"s/^/$samp./g\");\n\n                # Count contigs\n                N=$(less $bin | grep -c \">\");\n\n                # Sum length\n                L=$(less $bin |grep \">\"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}')\n\n                # Print values to terminal and write to stats file\n                echo \"Reading bin $bin ... Contigs: $N , Length: $L \"\n                echo $name $N $L >> refined_bins.stats;\n            done;\n        done\n\n        # Compile CONCOCT, MetaBAT2, MaxBin2, and metaWRAP checkM files\n        echo \"Generating CheckM summary files across samples: concoct.checkm, metabat.checkm, maxbin.checkm, and refined.checkm ... \"\n        for folder in */;do \n\n            # Define sample name\n            var=$(echo $folder|sed 's|/||g');\n\n            # Write values to checkm files\n            paste $folder*concoct.stats|tail -n +2 | sed \"s/^/$var.bin./g\" >> concoct.checkm\n            paste $folder*metabat.stats|tail -n +2 | sed \"s/^/$var./g\" >> metabat.checkm\n            paste $folder*maxbin.stats|tail -n +2 >> maxbin.checkm\n            paste $folder*metawrap_*_bins.stats|tail -n +2|sed \"s/^/$var./g\" >> refined.checkm\n        done \n        mv *.stats *.checkm {input}/{config[folder][reassembled]}\n\n        # Read metaWRAP reassembled bins\n        echo \"Generating reassembled_bins.stats file containing bin ID, number of contigs, and length ... \"\n        cd {input}/{config[folder][reassembled]}\n        for folder in */;do \n\n            # Define sample name \n            samp=$(echo $folder | sed 's|/||');\n            for bin in $folder*reassembled_bins/*.fa;do \n\n                # Define bin name\n                name=$(echo $bin | sed 's/.fa//g' | sed 's|^.*/||g' | sed \"s/^/$samp./g\");\n                N=$(less $bin | grep -c \">\");\n\n                # Check if bins are original (megahit-assembled) or strict/permissive (metaspades-assembled)\n                if [[ $name == *.strict ]] || [[ $name == *.permissive ]];then\n                    L=$(less $bin |grep \">\"|cut -d '_' -f4|awk '{{sum+=$1}}END{{print sum}}')\n                else\n                    L=$(less $bin |grep \">\"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}')\n                fi\n\n                # Print values to terminal and write to stats file\n                echo \"Reading bin $bin ... Contigs: $N , Length: $L \"\n                echo $name $N $L >> reassembled_bins.stats;\n            done;\n        done\n        echo \"Done reading metawrap reassembled bins ... \"\n\n        # Read metaWRAP reassembled checkM file\n        echo \"Generating CheckM summary file reassembled.checkm across samples for reassembled bins ... \"\n        for folder in */;do \n\n            # Define sample name\n            var=$(echo $folder|sed 's|/||g');\n\n            # Write values to checkM file\n            paste $folder*reassembled_bins.stats|tail -n +2|sed \"s/^/$var./g\";\n        done >> reassembled.checkm\n        echo \"Done generating all statistics files for binning results ... running plotting script ... \"\n\n        # Move files and cd to stats folder\n        mv *.stats *.checkm {config[path][root]}/{config[folder][stats]}\n        cd {config[path][root]}/{config[folder][stats]}\n\n        # Run Rscript\n        Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][binningVis]}\n\n        # Delete redundant pdf file\n        rm Rplots.pdf\n        \"\"\"\n\nrule abundance:\n    input:\n        bins = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"reassembled\"]}/{{IDs}}/reassembled_bins',\n        R1 = rules.qfilter.output.R1, \n        R2 = rules.qfilter.output.R2\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"abundance\"]}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.abundance.benchmark.txt'\n    message:\n        \"\"\"\n        Calculate bin abundance fraction using the following:\n\n        binAbundanceFraction = ( X / Y / Z) * 1000000\n\n        X = # of reads mapped to bin_i from sample_k\n        Y = length of bin_i (bp)\n        Z = # of reads mapped to all bins in sample_k\n\n        Note: 1000000 scaling factor converts length in bp to Mbp\n\n        \"\"\"\n    shell:\n        \"\"\"     \n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Make sure output folder exists\n        mkdir -p {output}\n\n        # Make job specific scratch dir\n        sampleID=$(echo $(basename $(dirname {input.R1})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][abundance]}/${{sampleID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][abundance]}/${{sampleID}}\n\n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][abundance]}/${{sampleID}}\n\n        # Copy files\n        echo -e \"\\nCopying quality filtered paired end reads and generated MAGs to {config[path][scratch]} ... \"\n        cp {input.R1} {input.R2} {input.bins}/* .\n\n        echo -e \"\\nConcatenating all bins into one FASTA file ... \"\n        cat *.fa > $(basename {output}).fa\n\n        echo -e \"\\nCreating bwa index for concatenated FASTA file ... \"\n        bwa index $(basename {output}).fa\n\n        echo -e \"\\nMapping quality filtered paired end reads to concatenated FASTA file with bwa mem ... \"\n        bwa mem -t {config[cores][abundance]} $(basename {output}).fa \\\n            $(basename {input.R1}) $(basename {input.R2}) > $(basename {output}).sam\n\n        echo -e \"\\nConverting SAM to BAM with samtools view ... \"\n        samtools view -@ {config[cores][abundance]} -Sb $(basename {output}).sam > $(basename {output}).bam\n\n        echo -e \"\\nSorting BAM file with samtools sort ... \"\n        samtools sort -@ {config[cores][abundance]} -o $(basename {output}).sort.bam $(basename {output}).bam\n\n        echo -e \"\\nExtracting stats from sorted BAM file with samtools flagstat ... \"\n        samtools flagstat $(basename {output}).sort.bam > map.stats\n\n        echo -e \"\\nCopying sample_map.stats file to root/abundance/sample for bin concatenation and deleting temporary FASTA file ... \"\n        cp map.stats {output}/$(basename {output})_map.stats\n        rm $(basename {output}).fa\n        \n        echo -e \"\\nRepeat procedure for each bin ... \"\n        for bin in *.fa;do\n\n            echo -e \"\\nSetting up temporary sub-directory to map against bin $bin ... \"\n            mkdir -p $(echo \"$bin\"| sed \"s/.fa//\")\n\n            # Move bin into subirectory\n            mv $bin $(echo \"$bin\"| sed \"s/.fa//\")\n            cd $(echo \"$bin\"| sed \"s/.fa//\")\n\n            echo -e \"\\nCreating bwa index for bin $bin ... \"\n            bwa index $bin\n\n            echo -e \"\\nMapping quality filtered paired end reads to bin $bin with bwa mem ... \"\n            bwa mem -t {config[cores][abundance]} $bin \\\n                ../$(basename {input.R1}) ../$(basename {input.R2}) > $(echo \"$bin\"|sed \"s/.fa/.sam/\")\n\n            echo -e \"\\nConverting SAM to BAM with samtools view ... \"\n            samtools view -@ {config[cores][abundance]} -Sb $(echo \"$bin\"|sed \"s/.fa/.sam/\") > $(echo \"$bin\"|sed \"s/.fa/.bam/\")\n\n            echo -e \"\\nSorting BAM file with samtools sort ... \"\n            samtools sort -@ {config[cores][abundance]} -o $(echo \"$bin\"|sed \"s/.fa/.sort.bam/\") $(echo \"$bin\"|sed \"s/.fa/.bam/\")\n\n            echo -e \"\\nExtracting stats from sorted BAM file with samtools flagstat ... \"\n            samtools flagstat $(echo \"$bin\"|sed \"s/.fa/.sort.bam/\") > $(echo \"$bin\"|sed \"s/.fa/.map/\")\n\n            echo -e \"\\nAppending bin length to bin.map stats file ... \"\n            echo -n \"Bin Length = \" >> $(echo \"$bin\"|sed \"s/.fa/.map/\")\n\n            # Check if bins are original (megahit-assembled) or strict/permissive (metaspades-assembled)\n            if [[ $bin == *.strict.fa ]] || [[ $bin == *.permissive.fa ]] || [[ $bin == *.s.fa ]] || [[ $bin == *.p.fa ]];then\n                less $bin |grep \">\"|cut -d '_' -f4|awk '{{sum+=$1}}END{{print sum}}' >> $(echo \"$bin\"|sed \"s/.fa/.map/\")\n            else\n                less $bin |grep \">\"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo \"$bin\"|sed \"s/.fa/.map/\")\n            fi\n\n            paste $(echo \"$bin\"|sed \"s/.fa/.map/\")\n\n            echo -e \"\\nCalculating abundance for bin $bin ... \"\n            echo -n \"$bin\"|sed \"s/.fa//\" >> $(echo \"$bin\"|sed \"s/.fa/.abund/\")\n            echo -n $'\\t' >> $(echo \"$bin\"|sed \"s/.fa/.abund/\")\n\n            X=$(less $(echo \"$bin\"|sed \"s/.fa/.map/\")|grep \"mapped (\"|awk -F' ' '{{print $1}}')\n            Y=$(less $(echo \"$bin\"|sed \"s/.fa/.map/\")|tail -n 1|awk -F' ' '{{print $4}}')\n            Z=$(less \"../map.stats\"|grep \"mapped (\"|awk -F' ' '{{print $1}}')\n            awk -v x=\"$X\" -v y=\"$Y\" -v z=\"$Z\" 'BEGIN{{print (x/y/z) * 1000000}}' >> $(echo \"$bin\"|sed \"s/.fa/.abund/\")\n            \n            paste $(echo \"$bin\"|sed \"s/.fa/.abund/\")\n            \n            echo -e \"\\nRemoving temporary files for bin $bin ... \"\n            rm $bin\n            cp $(echo \"$bin\"|sed \"s/.fa/.map/\") {output}\n            mv $(echo \"$bin\"|sed \"s/.fa/.abund/\") ../\n            cd ..\n            rm -r $(echo \"$bin\"| sed \"s/.fa//\")\n        done\n\n        echo -e \"\\nDone processing all bins, summarizing results into sample.abund file ... \"\n        cat *.abund > $(basename {output}).abund\n\n        echo -ne \"\\nSumming calculated abundances to obtain normalization value ... \"\n        norm=$(less $(basename {output}).abund |awk '{{sum+=$2}}END{{print sum}}');\n        echo $norm\n\n        echo -e \"\\nGenerating column with abundances normalized between 0 and 1 ... \"\n        awk -v NORM=\"$norm\" '{{printf $1\"\\t\"$2\"\\t\"$2/NORM\"\\\\n\"}}' $(basename {output}).abund > abundance.txt\n\n        rm $(basename {output}).abund\n        mv abundance.txt $(basename {output}).abund\n\n        mv $(basename {output}).abund {output}\n        \"\"\"\n\nrule GTDBTk:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"reassembled\"]}/{{IDs}}/reassembled_bins'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/GTDBTk/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.GTDBTk.benchmark.txt'\n    message:\n        \"\"\"\n        Please make sure that the GTDB-Tk database was downloaded and configured.\n        \"\"\"\n    shell:\n        \"\"\"\n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Make sure output folder exists\n        mkdir -p {output}\n\n        # Make job specific scratch dir\n        sampleID=$(echo $(basename $(dirname {input})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][classification]}/${{sampleID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][classification]}/${{sampleID}}\n\n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][classification]}/${{sampleID}}\n\n        # Copy files\n        echo -e \"\\nCopying files to tmp dir ... \"\n        cp -r {input} .\n        \n        # In case you GTDBTk is not properly configured you may need to export the GTDBTK_DATA_PATH variable,\n        # Simply uncomment the following line and fill in the path to your GTDBTk database:\n        # export GTDBTK_DATA_PATH=/path/to/the/gtdbtk/database/you/downloaded\n\n        # Run GTDBTk\n        gtdbtk classify_wf --genome_dir $(basename {input}) --out_dir GTDBTk -x fa --cpus {config[cores][gtdbtk]}\n\n        mv GTDBTk/* {output}\n        \"\"\"\n\nrule compositionVis:\n    input:\n        taxonomy = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"classification\"]}' ,\n        abundance = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"abundance\"]}'\n    output:\n        #file = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/composition.tsv',\n        plot = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/compositionVis.pdf'\n    message:\n        \"\"\"\n        Summarize and visualize abundance + taxonomy of MAGs across samples.\n        Note: compositionVis should only be run after the gtdbtk and abundance rules.\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metagem]};set -u\n\n        # Generate summary abundance file\n\n        cd {input.abundance}\n        for folder in */;do\n            # Define sample ID\n            sample=$(echo $folder|sed 's|/||g')\n            # Same as in taxonomyVis rule, modify bin names by adding sample ID and shortening metaWRAP naming scheme (orig/permissive/strict)\n            paste $sample/$sample.abund | sed 's/orig/o/g' | sed 's/permissive/p/g' | sed 's/strict/s/g' | sed \"s/^/$sample./g\" >> abundance.stats\n        done\n        mv abundance.stats {config[path][root]}/{config[folder][stats]}\n\n        # Generate summary taxonomy file\n\n        cd {input.taxonomy}\n        # Summarize GTDBTk output across samples\n        for folder in */;do \n            samp=$(echo $folder|sed 's|^.*/||');\n            cat $folder/classify/*summary.tsv| sed 's/orig/o/g' | sed 's/permissive/p/g' | sed 's/strict/s/g' | sed \"s/^/$samp./g\";\n        done > GTDBTk.stats\n        # Clean up stats file\n        header=$(head -n 1 GTDBTk.stats | sed 's/^.*\\.//g')\n        sed -i '/other_related_references(genome_id,species_name,radius,ANI,AF)/d' GTDBTk.stats\n        sed -i \"1i$header\" GTDBTk.stats\n        mv GTDBTk.stats {config[path][root]}/{config[folder][stats]}\n\n        cd {config[path][root]}/{config[folder][stats]}\n        Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][compositionVis]}\n        \"\"\"\n\nrule extractProteinBins:\n    message:\n        \"\"\"\n        Extract ORF annotated protein fasta files for each bin from reassembly checkm files,\n        place into sample specific subdirectories within protein_bins folder. \n        \"\"\"\n    shell:\n        \"\"\"\n        # Move to root directory\n        cd {config[path][root]}\n\n        # Make sure protein bins folder exists\n        mkdir -p {config[folder][proteinBins]}\n\n        echo -e \"Begin moving and renaming ORF annotated protein fasta bins from reassembled_bins/ to protein_bins/ ... \\n\"\n        for folder in reassembled_bins/*/;do \n            #Loop through each sample\n            echo \"Copying bins from sample $(echo $(basename $folder)) ... \"\n            for bin in $folder*reassembled_bins.checkm/bins/*;do \n                # Loop through each bin\n                var=$(echo $bin/genes.faa | sed 's|reassembled_bins/||g'|sed 's|/reassembled_bins.checkm/bins||'|sed 's|/genes||g'|sed 's|/|_|g'|sed 's/permissive/p/g'|sed 's/orig/o/g'|sed 's/strict/s/g');\n                cp $bin/*.faa {config[path][root]}/{config[folder][proteinBins]}/$var;\n            done;\n        done\n        \"\"\"\n\n\nrule carveme:\n    input:\n        bin = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"proteinBins\"]}/{{binIDs}}.faa',\n        media = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"scripts\"]}/{config[\"scripts\"][\"carveme\"]}'\n    output:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"GEMs\"]}/{{binIDs}}.xml'\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{binIDs}}.carveme.benchmark.txt'\n    message:\n        \"\"\"\n        Make sure that the input files are ORF annotated and preferably protein fasta.\n        If given raw fasta files, Carveme will run without errors but each contig will be treated as one gene.\n        \"\"\"\n    shell:\n        \"\"\"\n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Make sure output folder exists\n        mkdir -p $(dirname {output})\n\n        # Make job specific scratch dir\n        binID=$(echo $(basename {input})|sed 's/.faa//g')\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][GEMs]}/${{binID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][GEMs]}/${{binID}}\n\n        # Move into tmp dir\n        cd {config[path][scratch]}/{config[folder][GEMs]}/${{binID}}\n\n        # Copy files\n        cp {input.bin} {input.media} .\n        \n        echo \"Begin carving GEM ... \"\n        carve -g {config[params][carveMedia]} \\\n            -v \\\n            --mediadb $(basename {input.media}) \\\n            --fbc2 \\\n            -o $(echo $(basename {input.bin}) | sed 's/.faa/.xml/g') $(basename {input.bin})\n        \n        echo \"Done carving GEM. \"\n        [ -f *.xml ] && mv *.xml $(dirname {output})\n        \"\"\"\n\n\nrule modelVis:\n    input: \n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"GEMs\"]}'\n    output: \n        text = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/GEMs.stats',\n        plot = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/modelVis.pdf'\n    message:\n        \"\"\"\n        Generate bar plot with GEMs generated across samples and density plots showing number of \n        unique metabolites, reactions, and genes across GEMs.\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metagem]};set -u;\n        cd {input}\n\n        echo -e \"\\nBegin reading models ... \\n\"\n        while read model;do \n            id=$(echo $(basename $model)|sed 's/.xml//g'); \n            mets=$(less $model| grep \"species id=\"|cut -d ' ' -f 8|sed 's/..$//g'|sort|uniq|wc -l);\n            rxns=$(less $model|grep -c 'reaction id=');\n            genes=$(less $model|grep 'fbc:geneProduct fbc:id='|grep -vic spontaneous);\n            echo \"Model: $id has $mets mets, $rxns reactions, and $genes genes ... \"\n            echo \"$id $mets $rxns $genes\" >> GEMs.stats;\n        done< <(find . -name \"*.xml\")\n\n        echo -e \"\\nDone generating GEMs.stats summary file, moving to stats/ folder and running modelVis.R script ... \"\n        mv GEMs.stats {config[path][root]}/{config[folder][stats]}\n        cd {config[path][root]}/{config[folder][stats]}\n\n        Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][modelVis]}\n        rm Rplots.pdf # Delete redundant pdf file\n        echo \"Done. \"\n        \"\"\"\n\nrule ECvis:\n    input: \n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"GEMs\"]}'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/ecfiles')\n    message:\n        \"\"\"\n        Get EC information from GEMs. \n        Switch the input folder and grep|sed expressions to match the ec numbers in you model sets.\n        Currently configured for UHGG GEM set.\n        \"\"\"\n    shell:\n        \"\"\"\n        echo -e \"\\nCopying GEMs from specified input directory to {config[path][scratch]} ... \"\n        cp -r {input} {config[path][scratch]}\n\n        cd {config[path][scratch]}\n        mkdir ecfiles\n\n        while read model; do\n\n            # Read E.C. numbers from  each sbml file and write to a unique file, note that grep expression is hardcoded for specific GEM batches           \n            less $(basename {input})/$model|\n                grep 'EC Number'| \\\n                sed 's/^.*: //g'| \\\n                sed 's/<.*$//g'| \\\n                sed '/-/d'|sed '/N\\/A/d' | \\\n                sort|uniq -c \\\n            > ecfiles/$model.ec\n\n            echo -ne \"Reading E.C. numbers in model $model, unique E.C. : \"\n            ECNUM=$(less ecfiles/$model.ec|wc -l)\n            echo $ECNUM\n\n        done< <(ls $(basename {input}))\n\n        echo -e \"\\nMoving ecfiles folder back to {config[path][root]}\"\n        mv ecfiles {config[path][root]}\n        cd {config[path][root]}\n\n        echo -e \"\\nCreating sorted unique file EC.summary for easy EC inspection ... \"\n        cat ecfiles/*.ec|awk '{{print $NF}}'|sort|uniq -c > EC.summary\n\n        paste EC.summary\n\n        \"\"\"\n\nrule organizeGEMs:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"refined\"]}'\n    message:\n        \"\"\"\n        Organizes GEMs into sample specific subfolders, assumes that the refined_bins folder has sample-specific subfolders. \n        Necessary to run smetana per sample using the IDs wildcard.\n        \"\"\"\n    shell:\n        \"\"\"\n        cd {input}\n        for folder in */;do\n            echo -n \"Creating GEM subfolder for sample $folder ... \"\n            mkdir -p ../{config[folder][GEMs]}/$folder;\n            echo -n \"moving GEMs ... \"\n            mv ../{config[folder][GEMs]}/$(echo $folder|sed 's|/||')_*.xml ../{config[folder][GEMs]}/$folder;\n            echo \"done. \"\n        done\n        \"\"\"\n\n\nrule smetana:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"GEMs\"]}/{{IDs}}'\n    output:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"SMETANA\"]}/{{IDs}}_detailed.tsv'\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.smetana.benchmark.txt'\n    shell:\n        \"\"\"\n        # Activate metagem env\n        set +u;source activate {config[envs][metagem]};set -u\n\n        # Make sure output folder exists\n        mkdir -p $(dirname {output})\n\n        # Make job specific scratch dir\n        sampleID=$(echo $(basename {input}))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][SMETANA]}/${{sampleID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][SMETANA]}/${{sampleID}}\n\n        # Move to tmp dir\n        cd {config[path][scratch]}/{config[folder][SMETANA]}/${{sampleID}}\n\n        # Copy media db and GEMs\n        cp {config[path][root]}/{config[folder][scripts]}/{config[scripts][carveme]} {input}/*.xml .\n\n        # Run SMETANA        \n        smetana -o $(basename {input}) --flavor fbc2 \\\n            --mediadb media_db.tsv -m {config[params][smetanaMedia]} \\\n            --detailed \\\n            --solver {config[params][smetanaSolver]} -v *.xml\n        \n        # Copy results to output folder\n        cp *.tsv $(dirname {output})\n        \"\"\"\n\n\nrule interactionVis:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"SMETANA\"]}'\n    shell:\n        \"\"\"\n        cd {input}\n        mv media_db.tsv ../scripts/\n        cat *.tsv|sed '/community/d' > smetana.all\n        less smetana.all |cut -f2|sort|uniq > media.txt\n        ll|grep tsv|awk '{print $NF}'|sed 's/_.*$//g'>samples.txt\n        while read sample;do echo -n \"$sample \";while read media;do var=$(less smetana.all|grep $sample|grep -c $media); echo -n \"$var \" ;done < media.txt; echo \"\";done < samples.txt > sampleMedia.stats\n        \"\"\"\n\n\nrule memote:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"GEMs\"]}/{{gemIDs}}.xml'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"memote\"]}/{{gemIDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{gemIDs}}.memote.benchmark.txt'\n    shell:\n        \"\"\"\n        # Activate metagem env\n        set +u;source activate {config[envs][metagem]};set -u\n\n        # Make sure output folder exists\n        mkdir -p {output}\n\n        # Make job specific scratch dir\n        gemID=$(echo $(basename {input})|sed 's/.xml//g')\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][memote]}/${{gemID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][memote]}/${{gemID}}\n\n        # Move to tmp dir\n        cd {config[path][scratch]}/{config[folder][memote]}/${{gemID}}\n\n        # Copy GEM to tmp\n        cp {input} .\n\n        # Uncomment the following line in case errors are raised about missing git module,\n        # also ensure that module name matches that of your cluster\n        # module load git\n\n        # Run memote\n        memote report snapshot --skip test_find_metabolites_produced_with_closed_bounds --skip test_find_metabolites_consumed_with_closed_bounds --skip test_find_metabolites_not_produced_with_open_bounds --skip test_find_metabolites_not_consumed_with_open_bounds --skip test_find_incorrect_thermodynamic_reversibility --filename $(echo $(basename {input})|sed 's/.xml/.html/') *.xml\n        memote run --skip test_find_metabolites_produced_with_closed_bounds --skip test_find_metabolites_consumed_with_closed_bounds --skip test_find_metabolites_not_produced_with_open_bounds --skip test_find_metabolites_not_consumed_with_open_bounds --skip test_find_incorrect_thermodynamic_reversibility *.xml\n\n        # Rename output file with sample ID\n        mv result.json.gz $(echo $(basename {input})|sed 's/.xml/.json.gz/')\n\n        # Move results to output folder\n        mv *.gz *.html {output}\n        \"\"\"\n\n\nrule grid:\n    input:\n        bins = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"reassembled\"]}/{{IDs}}/reassembled_bins',\n        R1 = rules.qfilter.output.R1, \n        R2 = rules.qfilter.output.R2\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"GRiD\"]}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.grid.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metagem]};set -u\n\n        cp -r {input.bins} {input.R1} {input.R2} {config[path][scratch]}\n        cd {config[path][scratch]}\n\n        cat *.gz > $(basename $(dirname {input.bins})).fastq.gz\n        rm $(basename {input.R1}) $(basename {input.R2})\n\n        mkdir MAGdb out\n        update_database -d MAGdb -g $(basename {input.bins}) -p MAGdb\n        rm -r $(basename {input.bins})\n\n        grid multiplex -r . -e fastq.gz -d MAGdb -p -c 0.2 -o out -n {config[cores][grid]}\n\n        rm $(basename $(dirname {input.bins})).fastq.gz\n        mkdir {output}\n        mv out/* {output}\n        \"\"\"\n\n\nrule extractDnaBins:\n    message:\n        \"\"\"\n        Extract dna fasta files for each bin from reassembly output, place into sample specific\n        subdirectories within the dna_bins folder\n        \"\"\"\n    shell:\n        \"\"\"\n        # Move into root dir\n        cd {config[path][root]}\n\n        # Make sure dnaBins folder exists\n        mkdir -p {config[folder][dnaBins]}\n\n        # Copy files\n        echo -e \"Begin copying and renaming dna fasta bins from reassembled_bins/ to dna_bins/ ... \\n\"\n        for folder in reassembled_bins/*/;do\n            # Loop through each sample\n            sample=$(echo $(basename $folder));\n            mkdir -p {config[path][root]}/{config[folder][dnaBins]}/$sample\n            echo \"Copying bins from sample $sample ... \"\n            for bin in $folder*reassembled_bins/*;do \n                # Loop through each bin\n                var=$(echo $bin| sed 's|reassembled_bins/||g'|sed 's|/|_|g'|sed 's/permissive/p/g'|sed 's/orig/o/g'|sed 's/strict/s/g');\n                cp $bin {config[path][root]}/{config[folder][dnaBins]}/$sample/$var;\n            done;\n        done\n        \"\"\"\n\n\nrule prokka:\n    input:\n        bins = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"dnaBins\"]}/{{binIDs}}.fa'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"pangenome\"]}/prokka/unorganized/{{binIDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{binIDs}}.prokka.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][prokkaroary]};set -u\n        mkdir -p $(dirname $(dirname {output}))\n        mkdir -p $(dirname {output})\n\n        cp {input} {config[path][scratch]}\n        cd {config[path][scratch]}\n\n        id=$(echo $(basename {input})|sed \"s/.fa//g\")\n        prokka -locustag $id --cpus {config[cores][prokka]} --centre MAG --compliant -outdir prokka/$id -prefix $id $(basename {input})\n\n        mv prokka/$id $(dirname {output})\n        \"\"\"\n\nrule roary:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"pangenome\"]}/prokka/organized/{{speciesIDs}}/'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"pangenome\"]}/roary/{{speciesIDs}}/')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{speciesIDs}}.roary.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][prokkaroary]};set -u\n        mkdir -p $(dirname {output})\n        cd {config[path][scratch]}\n        cp -r {input} .\n                \n        roary -s -p {config[cores][roary]} -i {config[params][roaryI]} -cd {config[params][roaryCD]} -f yes_al -e -v $(basename {input})/*.gff\n        cd yes_al\n        create_pan_genome_plots.R \n        cd ..\n        mkdir -p {output}\n\n        mv yes_al/* {output}\n        \"\"\"\n\nrule run_prodigal:\n    \"\"\"Use Prodigal for coding genes predictions in contigs.\"\"\"\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"assemblies\"]}/{{IDs}}/contigs.fasta.gz'\n    output:\n        gff = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"prodigal\"]}/{{IDs}}/{{IDs}}_genes.gff',\n        faa = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"prodigal\"]}/{{IDs}}/{{IDs}}_genes_prot.fa',\n        fna = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"prodigal\"]}/{{IDs}}/{{IDs}}_genes_nucl.fa',\n        log = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"prodigal\"]}/{{IDs}}/{{IDs}}_log.out'\n    shell:\n        \"\"\"\n        mkdir -p $(dirname {output.gff})\n        prodigal -i <(gunzip -c {input}) -o {output.gff} -a {output.faa} -d {output.fna} -p meta  &> {output.log}\n        \"\"\"\n\nrule run_blastp:\n    \"\"\"Use Diamond blastp for searching against coding genes predictions from contigs. Uses snakemake wrapper: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/diamond/blastp.html\"\"\"\n    input:\n        fname_fasta=f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"prodigal\"]}/{{IDs}}/{{IDs}}_genes_prot.fa',\n        fname_db=f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"blastp_db\"]}'\n    output:\n        fname=f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"blastp\"]}/{{IDs}}.xml'\n    threads: 8\n    wrapper:\n        \"https://github.com/snakemake/snakemake-wrappers/raw/0.80.1/bio/diamond/blastp\"\n"
  },
  {
    "path": "workflow/envs/metaGEM_env.yml",
    "content": "name: metagem\nchannels:\n  - conda-forge\n  - bioconda\n  - defaults\ndependencies:\n  - bedtools>=2.29.2\n  - bwa>=0.7.17\n  - concoct>=1.1.0\n  - diamond>=2.0.6\n  - fastp>=0.20.1\n  - gtdbtk>=1.4.0\n  - maxbin2>=2.2.7\n  - megahit>=1.2.9\n  - metabat2>=2.15\n  - r-base>=3.5.1\n  - r-gridextra>=2.2.1\n  - r-tidyverse\n  - r-tidytext\n  - samtools>=1.9\n  - snakemake>=5.10.0,<5.31.1\n"
  },
  {
    "path": "workflow/envs/metaGEM_env_long.yml",
    "content": "name: metagem\nchannels:\n  - conda-forge\n  - bioconda\n  - defaults\ndependencies:\n  - _libgcc_mutex=0.1=conda_forge\n  - _openmp_mutex=4.5=1_gnu\n  - _r-mutex=1.0.1=anacondar_1\n  - aioeasywebdav=2.4.0=py38h32f6830_1001\n  - aiohttp=3.7.3=py38h497a2fe_0\n  - amply=0.1.4=py_0\n  - appdirs=1.4.4=pyh9f0ad1d_0\n  - aragorn=1.2.38=h516909a_3\n  - async-timeout=3.0.1=py_1000\n  - attrs=20.3.0=pyhd3deb0d_0\n  - backports=1.0=py_2\n  - backports.functools_lru_cache=1.6.1=py_0\n  - bamtools=2.5.1=he513fc3_6\n  - barrnap=0.9=2\n  - bcrypt=3.2.0=py38h1e0a361_1\n  - bedtools=2.29.2=hc088bd4_0\n  - binutils_impl_linux-64=2.35.1=h193b22a_1\n  - binutils_linux-64=2.35=hc3fd857_29\n  - biopython=1.78=py38h25fe258_1\n  - blas=1.0=mkl\n  - blast=2.10.1=pl526he19e7b1_3\n  - boost=1.70.0=py38h9de70de_1\n  - boost-cpp=1.70.0=h8e57a91_2\n  - boto3=1.16.50=pyhd8ed1ab_0\n  - botocore=1.19.50=pyhd8ed1ab_0\n  - bowtie2=2.4.2=py38h1c8e9b9_1\n  - brotlipy=0.7.0=py38h8df0ef7_1001\n  - bwa=0.7.17=hed695b0_7\n  - bwidget=1.9.14=0\n  - bzip2=1.0.8=h7f98852_4\n  - c-ares=1.17.1=h36c2ea0_0\n  - ca-certificates=2020.12.8=h06a4308_0\n  - cachetools=4.2.0=pyhd3eb1b0_0\n  - cairo=1.16.0=hcf35c78_1003\n  - capnproto=0.6.1=hfc679d8_1\n  - cd-hit=4.8.1=h8b12597_3\n  - certifi=2020.12.5=py38h578d9bd_1\n  - cffi=1.14.4=py38ha312104_0\n  - chardet=3.0.4=py38h924ce5b_1008\n  - coincbc=2.10.5=hab63836_1\n  - concoct=1.1.0=py38h7be5676_2\n  - conda=4.9.2=py38h578d9bd_0\n  - conda-package-handling=1.7.2=py38h8df0ef7_0\n  - configargparse=1.2.3=pyh9f0ad1d_0\n  - cryptography=3.3.1=py38h2b97feb_0\n  - curl=7.71.1=he644dc0_8\n  - cython=0.29.21=py38h348cfbe_1\n  - datrie=0.8.2=py38h1e0a361_1\n  - dbus=1.13.18=hb2f20db_0\n  - decorator=4.4.2=py_0\n  - dendropy=4.5.1=pyh3252c3a_0\n  - diamond=2.0.6=h56fc30b_0\n  - docutils=0.16=py38h924ce5b_2\n  - dropbox=10.10.0=py38h06a4308_0\n  - entrez-direct=13.9=pl526h375a9b1_0\n  - ete3=3.1.2=pyh9f0ad1d_0\n  - eukcc=0.2=py_0\n  - eukrep=0.6.7=pyh864c0ab_1\n  - expat=2.2.10=he6710b0_2\n  - fastani=1.32=he1c1bb9_0\n  - fastp=0.20.1=h8b12597_0\n  - fasttree=2.1.10=h516909a_4\n  - fftw=3.3.9=h27cfd23_1\n  - filechunkio=1.8=py_2\n  - font-ttf-dejavu-sans-mono=2.37=hab24e00_0\n  - font-ttf-inconsolata=2.001=hab24e00_0\n  - font-ttf-source-code-pro=2.030=hab24e00_0\n  - font-ttf-ubuntu=0.83=hab24e00_0\n  - fontconfig=2.13.1=h86ecdb6_1001\n  - fonts-conda-forge=1=0\n  - fraggenescan=1.31=h516909a_2\n  - freetype=2.10.4=he06d7ca_0\n  - fribidi=1.0.10=h516909a_0\n  - ftputil=4.0.0=py_0\n  - future=0.18.2=py38h578d9bd_2\n  - gcc_impl_linux-64=7.5.0=hd9e1a51_17\n  - gcc_linux-64=7.5.0=he2a3fca_29\n  - gdk-pixbuf=2.42.2=h19a9c64_1\n  - gettext=0.19.8.1=hf34092f_1004\n  - gfortran_impl_linux-64=7.5.0=hfca37b7_17\n  - gfortran_linux-64=7.5.0=ha081f1e_29\n  - ghostscript=9.53.3=h58526e2_2\n  - giflib=5.2.1=h516909a_2\n  - gitdb=4.0.5=py_0\n  - gitpython=3.1.12=pyhd8ed1ab_0\n  - glib=2.66.3=h58526e2_0\n  - google-api-core=1.24.1=pyhd3deb0d_0\n  - google-api-python-client=1.12.8=pyhd3deb0d_0\n  - google-auth=1.24.0=pyhd3deb0d_0\n  - google-auth-httplib2=0.0.4=pyh9f0ad1d_0\n  - google-cloud-core=1.5.0=pyhd3deb0d_0\n  - google-cloud-storage=1.35.0=pyhd3deb0d_0\n  - google-crc32c=1.1.0=py38h8838a9a_0\n  - google-resumable-media=1.2.0=pyhd3deb0d_0\n  - googleapis-common-protos=1.52.0=py38h578d9bd_1\n  - graphite2=1.3.14=h23475e2_0\n  - graphviz=2.42.3=h0511662_0\n  - grpcio=1.34.0=py38hdd6454d_0\n  - gsl=2.6=he838d99_1\n  - gst-plugins-base=1.14.5=h0935bb2_2\n  - gstreamer=1.14.5=h36ae1b5_2\n  - gtdbtk=1.4.0=py_0\n  - gxx_impl_linux-64=7.5.0=h7ea4de1_17\n  - gxx_linux-64=7.5.0=h547f3ba_29\n  - h5py=2.10.0=nompi_py38h513d04c_102\n  - harfbuzz=2.4.0=h9f30f68_3\n  - hdf5=1.10.5=nompi_h7c3c948_1111\n  - hmmer=3.1b2=3\n  - htslib=1.9=h244ad75_9\n  - httplib2=0.18.1=pyh9f0ad1d_0\n  - icu=64.2=he1b5a44_1\n  - idba=1.1.3=1\n  - idna=2.10=pyh9f0ad1d_0\n  - imagemagick=7.0.10_6=pl526ha9fe49d_0\n  - importlib-metadata=3.3.0=py38h578d9bd_2\n  - importlib_metadata=3.3.0=hd8ed1ab_2\n  - infernal=1.1.3=h516909a_0\n  - ipython_genutils=0.2.0=py_1\n  - jbig=2.1=h516909a_2002\n  - jinja2=2.11.2=pyh9f0ad1d_0\n  - jmespath=0.10.0=pyh9f0ad1d_0\n  - joblib=1.0.0=pyhd8ed1ab_0\n  - jpeg=9d=h516909a_0\n  - jsonschema=3.2.0=py38h32f6830_1\n  - jupyter_core=4.7.0=py38h578d9bd_0\n  - kallisto=0.46.2=h4f7b962_1\n  - kernel-headers_linux-64=2.6.32=h77966d4_13\n  - kpal=2.1.1=pyh864c0ab_3\n  - krb5=1.17.2=h926e7f8_0\n  - ld_impl_linux-64=2.35.1=hea4e1c9_1\n  - libarchive=3.3.3=hddc7a2b_1008\n  - libblas=3.9.0=1_h6e990d7_netlib\n  - libcblas=3.9.0=3_h893e4fe_netlib\n  - libclang=9.0.1=default_hde54327_0\n  - libcrc32c=1.1.1=he1b5a44_2\n  - libcurl=7.71.1=hcdd3856_8\n  - libdeflate=1.3=h516909a_0\n  - libedit=3.1.20191231=h46ee950_2\n  - libev=4.33=h516909a_1\n  - libffi=3.2.1=he1b5a44_1007\n  - libgcc=7.2.0=h69d50b8_2\n  - libgcc-devel_linux-64=7.5.0=h42c25f5_17\n  - libgcc-ng=9.3.0=h5dbcf3e_17\n  - libgd=2.2.5=h307a58e_1007\n  - libgenome=1.3.1=hc9558a2_2\n  - libgfortran-ng=7.5.0=hae1eefd_17\n  - libgfortran4=7.5.0=hae1eefd_17\n  - libglib=2.66.3=hbe7bbb4_0\n  - libgomp=9.3.0=h5dbcf3e_17\n  - libiconv=1.16=h516909a_0\n  - libidn11=1.34=h1cef754_0\n  - liblapack=3.9.0=3_h893e4fe_netlib\n  - libllvm9=9.0.1=hf817b99_2\n  - libmems=1.6.0=h78a066a_2\n  - libmuscle=3.7=h470a237_1\n  - libnghttp2=1.41.0=h8cfc5f6_2\n  - libpng=1.6.37=hed695b0_2\n  - libprotobuf=3.14.0=h780b84a_0\n  - librsvg=2.50.2=h3442318_1\n  - libsodium=1.0.18=h516909a_1\n  - libsolv=0.7.16=h8b12597_0\n  - libssh2=1.9.0=hab1572f_5\n  - libstdcxx-devel_linux-64=7.5.0=h4084dd6_17\n  - libstdcxx-ng=9.3.0=h2ae2ef3_17\n  - libtiff=4.1.0=hc3755c2_3\n  - libtool=2.4.6=h58526e2_1007\n  - libuuid=2.32.1=h14c3975_1000\n  - libwebp=1.0.2=h56121f0_5\n  - libxcb=1.14=h7b6447c_0\n  - libxkbcommon=0.10.0=he1b5a44_0\n  - libxml2=2.9.10=hee79883_0\n  - libxslt=1.1.33=h31b3aaa_0\n  - llvm-openmp=8.0.1=hc9558a2_0\n  - lxml=4.6.2=py38hf1fe3a4_0\n  - lz4-c=1.9.2=he1b5a44_3\n  - lzo=2.10=h516909a_1000\n  - mafft=7.475=h516909a_0\n  - make=4.3=hd18ef5c_1\n  - mamba=0.7.6=py38h4c9354d_0\n  - markupsafe=1.1.1=py38h8df0ef7_2\n  - mash=2.2.2=ha61e061_2\n  - mauve=2.4.0.snapshot_2015_02_13=h2688d6d_2\n  - mauvealigner=1.2.0=h8b68381_1\n  - maxbin2=2.2.7=he1b5a44_2\n  - mcl=14.137=pl526h516909a_5\n  - megahit=1.2.9=h8b12597_0\n  - metabat2=2.15=h137b6e9_0\n  - metasnv=1.0.3=h230ddbb_2\n  - minced=0.4.2=0\n  - motus=2.5.1=py_0\n  - multidict=5.1.0=py38h497a2fe_0\n  - nbformat=5.0.8=py_0\n  - ncurses=6.1=hf484d3e_1002\n  - networkx=2.5=py_0\n  - nose=1.3.7=py38h32f6830_1004\n  - nspr=4.29=he1b5a44_1\n  - nss=3.55=he751ad9_0\n  - numpy=1.19.5=py38h18fd61f_0\n  - oauth2client=4.1.3=py_0\n  - openjdk=8.0.192=h516909a_1005\n  - openjpeg=2.3.1=hf7af979_3\n  - openmp=8.0.1=0\n  - openssl=1.1.1i=h7f98852_0\n  - pandas=1.2.0=py38h51da96c_0\n  - pango=1.42.4=h7062337_4\n  - parallel=20201122=ha770c72_0\n  - paramiko=2.7.2=pyh9f0ad1d_0\n  - pcre=8.44=he1b5a44_0\n  - pcre2=10.35=h032f7d1_2\n  - perl=5.26.2=h36c2ea0_1008\n  - perl-aceperl=1.92=pl526_2\n  - perl-algorithm-diff=1.1903=pl526_2\n  - perl-algorithm-munkres=0.08=pl526_1\n  - perl-apache-test=1.40=pl526_1\n  - perl-app-cpanminus=1.7044=pl526_1\n  - perl-appconfig=1.71=pl526_1\n  - perl-archive-tar=2.32=pl526_0\n  - perl-array-compare=3.0.1=pl526_1\n  - perl-array-utils=0.5=pl526_2\n  - perl-autoloader=5.74=pl526_2\n  - perl-base=2.23=pl526_1\n  - perl-bio-asn1-entrezgene=1.73=pl526_0\n  - perl-bio-featureio=1.6.905=pl526_1\n  - perl-bio-phylo=0.58=pl526_1\n  - perl-bio-samtools=1.43=pl526h1341992_1\n  - perl-bioperl=1.6.924=6\n  - perl-bioperl-core=1.6.924=1\n  - perl-bioperl-run=1.007002=pl526_3\n  - perl-business-isbn=3.004=pl526_0\n  - perl-business-isbn-data=20140910.003=pl526_0\n  - perl-cache-cache=1.08=pl526_0\n  - perl-capture-tiny=0.48=pl526_0\n  - perl-carp=1.38=pl526_3\n  - perl-cgi=4.44=pl526h14c3975_1\n  - perl-class-data-inheritable=0.08=pl526_1\n  - perl-class-inspector=1.34=pl526_0\n  - perl-class-load=0.25=pl526_0\n  - perl-class-load-xs=0.10=pl526h6bb024c_2\n  - perl-class-method-modifiers=2.12=pl526_0\n  - perl-clone=0.42=pl526h516909a_0\n  - perl-common-sense=3.74=pl526_2\n  - perl-compress-raw-bzip2=2.087=pl526he1b5a44_0\n  - perl-compress-raw-zlib=2.087=pl526hc9558a2_0\n  - perl-constant=1.33=pl526_1\n  - perl-convert-binary-c=0.78=pl526h6bb024c_3\n  - perl-convert-binhex=1.125=pl526_1\n  - perl-crypt-rc4=2.02=pl526_1\n  - perl-data-dumper=2.173=pl526_0\n  - perl-data-optlist=0.110=pl526_2\n  - perl-data-stag=0.14=pl526_1\n  - perl-date-format=2.30=pl526_2\n  - perl-dbd-sqlite=1.64=pl526h516909a_0\n  - perl-dbi=1.642=pl526_0\n  - perl-devel-globaldestruction=0.14=pl526_0\n  - perl-devel-overloadinfo=0.005=pl526_0\n  - perl-devel-stacktrace=2.04=pl526_0\n  - perl-digest-hmac=1.03=pl526_3\n  - perl-digest-md5=2.55=pl526_0\n  - perl-digest-perl-md5=1.9=pl526_1\n  - perl-digest-sha1=2.13=pl526h6bb024c_1\n  - perl-dist-checkconflicts=0.11=pl526_2\n  - perl-dynaloader=1.25=pl526_1\n  - perl-email-date-format=1.005=pl526_2\n  - perl-encode=2.88=pl526_1\n  - perl-encode-locale=1.05=pl526_6\n  - perl-env-path=0.19=pl526_2\n  - perl-error=0.17027=pl526_1\n  - perl-eval-closure=0.14=pl526h6bb024c_4\n  - perl-exception-class=1.44=pl526_0\n  - perl-exporter=5.72=pl526_1\n  - perl-exporter-tiny=1.002001=pl526_0\n  - perl-extutils-makemaker=7.36=pl526_1\n  - perl-file-find-rule=0.34=pl526_5\n  - perl-file-grep=0.02=pl526_3\n  - perl-file-listing=6.04=pl526_1\n  - perl-file-path=2.16=pl526_0\n  - perl-file-slurp-tiny=0.004=pl526_1\n  - perl-file-slurper=0.012=pl526_0\n  - perl-file-sort=1.01=pl526_2\n  - perl-file-temp=0.2304=pl526_2\n  - perl-file-which=1.23=pl526_0\n  - perl-font-afm=1.20=pl526_2\n  - perl-font-ttf=1.06=pl526_0\n  - perl-gd=2.71=pl526he860b03_0\n  - perl-getopt-long=2.50=pl526_1\n  - perl-graph=0.9704=pl526_1\n  - perl-graph-readwrite=2.09=pl526_2\n  - perl-graphviz=2.24=pl526h734ff71_0\n  - perl-html-element-extended=1.18=pl526_1\n  - perl-html-entities-numbered=0.04=pl526_1\n  - perl-html-formatter=2.16=pl526_0\n  - perl-html-parser=3.72=pl526h6bb024c_5\n  - perl-html-tableextract=2.13=pl526_2\n  - perl-html-tagset=3.20=pl526_3\n  - perl-html-tidy=1.60=pl526_0\n  - perl-html-tree=5.07=pl526_1\n  - perl-html-treebuilder-xpath=0.14=pl526_1\n  - perl-http-cookies=6.04=pl526_0\n  - perl-http-daemon=6.01=pl526_1\n  - perl-http-date=6.02=pl526_3\n  - perl-http-message=6.18=pl526_0\n  - perl-http-negotiate=6.01=pl526_3\n  - perl-image-info=1.38=pl526_1\n  - perl-image-size=3.300=pl526_2\n  - perl-io-compress=2.087=pl526he1b5a44_0\n  - perl-io-html=1.001=pl526_2\n  - perl-io-sessiondata=1.03=pl526_1\n  - perl-io-socket-ssl=2.066=pl526_0\n  - perl-io-string=1.08=pl526_3\n  - perl-io-stringy=2.111=pl526_1\n  - perl-io-tty=1.12=pl526_1\n  - perl-io-zlib=1.10=pl526_2\n  - perl-ipc-run=20180523.0=pl526_0\n  - perl-ipc-sharelite=0.17=pl526h6bb024c_1\n  - perl-jcode=2.07=pl526_2\n  - perl-json=4.02=pl526_0\n  - perl-json-xs=2.34=pl526h6bb024c_3\n  - perl-lib=0.63=pl526_1\n  - perl-libwww-perl=6.39=pl526_0\n  - perl-libxml-perl=0.08=pl526_2\n  - perl-list-moreutils=0.428=pl526_1\n  - perl-list-moreutils-xs=0.428=pl526_0\n  - perl-log-log4perl=1.49=pl526_0\n  - perl-lwp-mediatypes=6.04=pl526_0\n  - perl-lwp-protocol-https=6.07=pl526_4\n  - perl-lwp-simple=6.15=pl526h470a237_4\n  - perl-mailtools=2.21=pl526_0\n  - perl-math-cdf=0.1=pl526h14c3975_5\n  - perl-math-derivative=1.01=pl526_0\n  - perl-math-random=0.72=pl526h14c3975_2\n  - perl-math-spline=0.02=pl526_2\n  - perl-mime-base64=3.15=pl526_1\n  - perl-mime-lite=3.030=pl526_1\n  - perl-mime-tools=5.508=pl526_1\n  - perl-mime-types=2.17=pl526_0\n  - perl-mldbm=2.05=pl526_1\n  - perl-module-implementation=0.09=pl526_2\n  - perl-module-runtime=0.016=pl526_1\n  - perl-module-runtime-conflicts=0.003=pl526_0\n  - perl-moo=2.003004=pl526_0\n  - perl-moose=2.2011=pl526hf484d3e_1\n  - perl-mozilla-ca=20180117=pl526_1\n  - perl-mro-compat=0.13=pl526_0\n  - perl-net-http=6.19=pl526_0\n  - perl-net-ssleay=1.88=pl526h90d6eec_0\n  - perl-ntlm=1.09=pl526_4\n  - perl-number-compare=0.03=pl526_2\n  - perl-ole-storage_lite=0.19=pl526_3\n  - perl-package-deprecationmanager=0.17=pl526_0\n  - perl-package-stash=0.38=pl526hf484d3e_1\n  - perl-package-stash-xs=0.28=pl526hf484d3e_1\n  - perl-params-util=1.07=pl526h6bb024c_4\n  - perl-parent=0.236=pl526_1\n  - perl-parse-recdescent=1.967015=pl526_0\n  - perl-parse-yapp=1.21=pl526_0\n  - perl-pathtools=3.75=pl526h14c3975_1\n  - perl-pdf-api2=2.035=pl526_0\n  - perl-perlio-utf8_strict=0.007=pl526h6bb024c_1\n  - perl-pod-escapes=1.07=pl526_1\n  - perl-pod-usage=1.69=pl526_1\n  - perl-postscript=0.06=pl526_2\n  - perl-role-tiny=2.000008=pl526_0\n  - perl-scalar-list-utils=1.52=pl526h516909a_0\n  - perl-set-scalar=1.29=pl526_2\n  - perl-soap-lite=1.19=pl526_1\n  - perl-socket=2.027=pl526_1\n  - perl-sort-naturally=1.03=pl526_2\n  - perl-spreadsheet-parseexcel=0.65=pl526_2\n  - perl-spreadsheet-writeexcel=2.40=pl526_2\n  - perl-statistics-descriptive=3.0702=pl526_0\n  - perl-storable=3.15=pl526h14c3975_0\n  - perl-sub-exporter=0.987=pl526_2\n  - perl-sub-exporter-progressive=0.001013=pl526_0\n  - perl-sub-identify=0.14=pl526h14c3975_0\n  - perl-sub-install=0.928=pl526_2\n  - perl-sub-name=0.21=pl526_1\n  - perl-sub-quote=2.006003=pl526_1\n  - perl-sub-uplevel=0.2800=pl526h14c3975_2\n  - perl-svg=2.84=pl526_0\n  - perl-svg-graph=0.02=pl526_3\n  - perl-task-weaken=1.06=pl526_0\n  - perl-template-toolkit=2.26=pl526_1\n  - perl-test=1.26=pl526_1\n  - perl-test-builder-tester=1.23_002=pl526_1\n  - perl-test-deep=1.128=pl526_1\n  - perl-test-differences=0.67=pl526_0\n  - perl-test-exception=0.43=pl526_2\n  - perl-test-files=0.14=pl526_2\n  - perl-test-harness=3.42=pl526_0\n  - perl-test-leaktrace=0.16=pl526h14c3975_2\n  - perl-test-most=0.35=pl526_0\n  - perl-test-output=1.031=pl526_0\n  - perl-test-requiresinternet=0.05=pl526_0\n  - perl-test-warn=0.36=pl526_1\n  - perl-text-csv=2.00=pl526_0\n  - perl-text-diff=1.45=pl526_0\n  - perl-text-glob=0.11=pl526_1\n  - perl-threaded=5.26.0=0\n  - perl-tie-ixhash=1.23=pl526_2\n  - perl-time-hires=1.9760=pl526h14c3975_1\n  - perl-time-local=1.28=pl526_1\n  - perl-timedate=2.30=pl526_1\n  - perl-tree-dag_node=1.31=pl526_0\n  - perl-try-tiny=0.30=pl526_1\n  - perl-type-tiny=1.004004=pl526_0\n  - perl-types-serialiser=1.0=pl526_2\n  - perl-unicode-map=0.112=pl526h6bb024c_3\n  - perl-uri=1.76=pl526_0\n  - perl-www-robotrules=6.02=pl526_3\n  - perl-xml-dom=1.46=pl526_0\n  - perl-xml-dom-xpath=0.14=pl526_1\n  - perl-xml-filter-buffertext=1.01=pl526_2\n  - perl-xml-libxml=2.0132=pl526h7ec2d77_1\n  - perl-xml-libxslt=1.94=pl526_1\n  - perl-xml-namespacesupport=1.12=pl526_0\n  - perl-xml-parser=2.44_01=pl526ha1d75be_1002\n  - perl-xml-regexp=0.04=pl526_2\n  - perl-xml-sax=1.02=pl526_0\n  - perl-xml-sax-base=1.09=pl526_0\n  - perl-xml-sax-expat=0.51=pl526_3\n  - perl-xml-sax-writer=0.57=pl526_0\n  - perl-xml-simple=2.25=pl526_1\n  - perl-xml-twig=3.52=pl526_2\n  - perl-xml-writer=0.625=pl526_2\n  - perl-xml-xpath=1.44=pl526_0\n  - perl-xml-xpathengine=0.14=pl526_2\n  - perl-xsloader=0.24=pl526_0\n  - perl-yaml=1.29=pl526_0\n  - pip=20.3.3=pyhd8ed1ab_0\n  - pixman=0.38.0=h516909a_1003\n  - pkg-config=0.29.2=h516909a_1008\n  - pplacer=1.1.alpha19=1\n  - prank=v.170427=hc9558a2_3\n  - prettytable=2.0.0=pyhd8ed1ab_0\n  - prodigal=2.6.3=h516909a_2\n  - prokka=1.13=2\n  - protobuf=3.14.0=py38h709712a_0\n  - psutil=5.8.0=py38h497a2fe_0\n  - pulp=2.3.1=py38h32f6830_0\n  - pyasn1=0.4.8=py_0\n  - pyasn1-modules=0.2.8=py_0\n  - pycosat=0.6.3=py38h8df0ef7_1005\n  - pycparser=2.20=pyh9f0ad1d_2\n  - pyfaidx=0.5.9.2=pyh3252c3a_0\n  - pygments=2.7.3=pyhd8ed1ab_0\n  - pygmes=0.1.7=py_0\n  - pygraphviz=1.6=py38h25c7686_1\n  - pynacl=1.4.0=py38h1e0a361_2\n  - pyopenssl=20.0.1=pyhd8ed1ab_0\n  - pyparsing=2.4.7=pyh9f0ad1d_0\n  - pyqt=5.12.3=py38ha8c2ead_3\n  - pyrsistent=0.17.3=py38h25fe258_1\n  - pysftp=0.2.9=py_1\n  - pysocks=1.7.1=py38h924ce5b_2\n  - python=3.8.5=h425cb1d_2_cpython\n  - python-dateutil=2.8.1=py_0\n  - python-irodsclient=0.8.2=py_0\n  - python_abi=3.8=1_cp38\n  - pytz=2020.5=pyhd8ed1ab_0\n  - pyyaml=5.3.1=py38h8df0ef7_1\n  - qt=5.12.5=hd8c4c69_1\n  - r-ade4=1.7_16=r40h30ea16f_1\n  - r-ape=5.4_1=r40h51c796c_0\n  - r-assertthat=0.2.1=r40h6115d3f_2\n  - r-backports=1.2.1=r40hcfec24a_0\n  - r-base=4.0.2=h95c6c4b_0\n  - r-bitops=1.0_6=r40hcdcec82_1004\n  - r-brio=1.1.0=r40h9e2df91_1\n  - r-callr=3.5.1=r40h142f84f_0\n  - r-catools=1.18.0=r40h0357c0b_1\n  - r-cli=2.2.0=r40hc72bb7e_0\n  - r-colorspace=2.0_0=r40h9e2df91_0\n  - r-crayon=1.3.4=r40h6115d3f_1003\n  - r-data.table=1.13.6=r40hcfec24a_0\n  - r-desc=1.2.0=r40h6115d3f_1003\n  - r-diffobj=0.3.3=r40hcfec24a_0\n  - r-digest=0.6.27=r40h1b71b39_0\n  - r-dplyr=1.0.2=r40h0357c0b_0\n  - r-dynamictreecut=1.63_1=r40h6115d3f_1003\n  - r-ellipsis=0.3.1=r40hcdcec82_0\n  - r-evaluate=0.14=r40h6115d3f_2\n  - r-fansi=0.4.1=r40hcdcec82_1\n  - r-farver=2.0.3=r40h0357c0b_1\n  - r-generics=0.1.0=r40hc72bb7e_0\n  - r-getopt=1.20.3=r40_2\n  - r-ggplot2=3.3.3=r40hc72bb7e_0\n  - r-glue=1.4.2=r40hcdcec82_0\n  - r-gplots=3.1.1=r40hc72bb7e_0\n  - r-gsubfn=0.7=r40h6115d3f_1002\n  - r-gtable=0.3.0=r40h6115d3f_3\n  - r-gtools=3.8.2=r40hcdcec82_1\n  - r-hms=0.5.3=r40h6115d3f_1\n  - r-isoband=0.2.3=r40h03ef668_0\n  - r-jsonlite=1.7.2=r40hcfec24a_0\n  - r-kernsmooth=2.23_18=r40h7679c2e_0\n  - r-labeling=0.4.2=r40h142f84f_0\n  - r-lattice=0.20_41=r40hcdcec82_2\n  - r-lifecycle=0.2.0=r40h6115d3f_1\n  - r-magrittr=2.0.1=r40h9e2df91_1\n  - r-mass=7.3_53=r40hcdcec82_0\n  - r-matrix=1.3_2=r40he454529_0\n  - r-mgcv=1.8_33=r40h7fa42b6_0\n  - r-munsell=0.5.0=r40h6115d3f_1003\n  - r-nlme=3.1_150=r40h31ca83e_0\n  - r-pillar=1.4.7=r40hc72bb7e_0\n  - r-pixmap=0.4_11=r40h6115d3f_1003\n  - r-pkgbuild=1.2.0=r40hc72bb7e_0\n  - r-pkgconfig=2.0.3=r40h6115d3f_1\n  - r-pkgload=1.1.0=r40h0357c0b_0\n  - r-praise=1.0.0=r40h6115d3f_1004\n  - r-prettyunits=1.1.1=r40h6115d3f_1\n  - r-processx=3.4.5=r40hcfec24a_0\n  - r-progress=1.2.2=r40h6115d3f_2\n  - r-proto=1.0.0=r40_2003\n  - r-ps=1.5.0=r40hcfec24a_0\n  - r-purrr=0.3.4=r40hcdcec82_1\n  - r-r6=2.5.0=r40hc72bb7e_0\n  - r-rcolorbrewer=1.1_2=r40h6115d3f_1003\n  - r-rcpp=1.0.5=r40he524a50_0\n  - r-rematch2=2.1.2=r40h6115d3f_1\n  - r-rlang=0.4.10=r40hcfec24a_0\n  - r-rprojroot=2.0.2=r40hc72bb7e_0\n  - r-rstudioapi=0.13=r40hc72bb7e_0\n  - r-scales=1.1.1=r40h6115d3f_0\n  - r-segmented=1.3_1=r40hc72bb7e_0\n  - r-seqinr=4.2_5=r40hcfec24a_0\n  - r-sp=1.4_2=r40hcdcec82_0\n  - r-testthat=3.0.1=r40h03ef668_0\n  - r-tibble=3.0.4=r40h0eb13af_0\n  - r-tidyselect=1.1.0=r40h6115d3f_0\n  - r-utf8=1.1.4=r40hcdcec82_1003\n  - r-vctrs=0.3.6=r40hcfec24a_0\n  - r-viridislite=0.3.0=r40h6115d3f_1003\n  - r-waldo=0.2.3=r40hc72bb7e_0\n  - r-withr=2.3.0=r40h6115d3f_0\n  - r-zeallot=0.1.0=r40h6115d3f_1002\n  - ratelimiter=1.2.0=py38h32f6830_1001\n  - readline=8.0=h46ee950_1\n  - reproc=14.2.1=h36c2ea0_0\n  - reproc-cpp=14.2.1=h58526e2_0\n  - requests=2.25.1=pyhd3deb0d_0\n  - roary=3.7.0=0\n  - rsa=4.6=pyh9f0ad1d_0\n  - ruamel_yaml=0.15.87=py38h7b6447c_1\n  - s3transfer=0.3.3=py38h32f6830_2\n  - samtools=1.9=h10a08f8_12\n  - scikit-learn=0.24.0=py38h658cfdd_0\n  - scipy=1.5.3=py38h828c644_0\n  - sed=4.8=he412f7d_0\n  - semantic_version=2.8.5=pyh9f0ad1d_0\n  - setuptools=51.0.0=py38h06a4308_2\n  - simplejson=3.17.2=py38h497a2fe_1\n  - six=1.15.0=pyh9f0ad1d_0\n  - slacker=0.14.0=py_0\n  - smeg=1.1.1=0\n  - smmap=3.0.4=pyh9f0ad1d_0\n  - snakemake=5.31.1=0\n  - snakemake-minimal=5.31.1=py_0\n  - sqlite=3.32.3=hcee41ef_1\n  - sysroot_linux-64=2.12=h77966d4_13\n  - tar=1.32=hd4ba37b_0\n  - tbb=2020.3=hfd86e86_0\n  - tbl2asn=25.7=0\n  - threadpoolctl=2.1.0=pyh5ca1d4c_0\n  - tidyp=1.04=h516909a_2\n  - tk=8.6.10=h21135ba_1\n  - tktable=2.10=hb7b940f_3\n  - toposort=1.6=pyhd8ed1ab_0\n  - tqdm=4.55.1=pyhd8ed1ab_0\n  - traitlets=5.0.5=py_0\n  - typing-extensions=3.7.4.3=0\n  - typing_extensions=3.7.4.3=py_0\n  - tzdata=2020f=he74cb21_0\n  - uritemplate=3.0.1=py_0\n  - urllib3=1.26.2=pyhd8ed1ab_0\n  - wcwidth=0.2.5=pyh9f0ad1d_2\n  - wheel=0.36.2=pyhd3deb0d_0\n  - wrapt=1.12.1=py38h25fe258_2\n  - xmlrunner=1.7.7=py_0\n  - xorg-kbproto=1.0.7=h14c3975_1002\n  - xorg-libice=1.0.10=h516909a_0\n  - xorg-libsm=1.2.3=h84519dc_1000\n  - xorg-libx11=1.6.12=h516909a_0\n  - xorg-libxau=1.0.9=h14c3975_0\n  - xorg-libxdmcp=1.1.3=h516909a_0\n  - xorg-libxext=1.3.4=h516909a_0\n  - xorg-libxpm=3.5.13=h516909a_0\n  - xorg-libxrender=0.9.10=h516909a_1002\n  - xorg-libxt=1.1.5=h516909a_1003\n  - xorg-renderproto=0.11.1=h14c3975_1002\n  - xorg-xextproto=7.3.0=h14c3975_1002\n  - xorg-xproto=7.0.31=h14c3975_1007\n  - xz=5.2.5=h516909a_1\n  - yaml=0.2.5=h516909a_0\n  - yarl=1.5.1=py38h1e0a361_0\n  - zipp=3.4.0=py_0\n  - zlib=1.2.11=h516909a_1010\n  - zstd=1.4.8=hdf46e1d_0\n\n"
  },
  {
    "path": "workflow/envs/metaWRAP_env.yml",
    "content": "name: metawrap\nchannels:\n  - ursky\n  - bioconda\n  - conda-forge\n  - defaults\ndependencies:\n  - metawrap-mg>=1.2.3\n  - python=2.7\n  - biopython \n  - bowtie2 \n  - bwa \n  - checkm-genome \n  - matplotlib\n  - megahit \n  - pandas\n  - quast \n  - r-ggplot2 \n  - r-recommended \n  - salmon\n  - samtools\n  - seaborn\n  - spades\n"
  },
  {
    "path": "workflow/envs/prokkaroary_env.yml",
    "content": "name: prokkaroary\nchannels:\n  - conda-forge\n  - bioconda\n  - defaults\ndependencies:\n  - _libgcc_mutex=0.1=conda_forge\n  - _openmp_mutex=4.5=1_gnu\n  - alsa-lib=1.2.3=h516909a_0\n  - aragorn=1.2.38=h516909a_3\n  - barrnap=0.9=3\n  - bedtools=2.29.2=hc088bd4_0\n  - blast=2.10.1=pl526he19e7b1_3\n  - bzip2=1.0.8=h7f98852_4\n  - c-ares=1.17.1=h36c2ea0_0\n  - ca-certificates=2020.12.5=ha878542_0\n  - cairo=1.16.0=h7979940_1007\n  - cd-hit=4.8.1=hdbcaa40_0\n  - certifi=2020.12.5=py37h89c1867_1\n  - clustalw=2.1=hc9558a2_5\n  - curl=7.71.1=he644dc0_8\n  - entrez-direct=13.9=pl526h375a9b1_0\n  - expat=2.2.9=he1b5a44_2\n  - fasttree=2.1.10=h516909a_4\n  - fontconfig=2.13.1=h736d332_1003\n  - freetype=2.10.4=h7ca028e_0\n  - fribidi=1.0.10=h36c2ea0_0\n  - gettext=0.19.8.1=h0b5b191_1005\n  - giflib=5.2.1=h36c2ea0_2\n  - graphite2=1.3.13=h58526e2_1001\n  - graphviz=2.42.3=h0511662_0\n  - harfbuzz=2.7.4=h5cf4720_0\n  - hmmer=3.3.1=he1b5a44_0\n  - icu=68.1=h58526e2_0\n  - infernal=1.1.3=h516909a_0\n  - jpeg=9d=h36c2ea0_0\n  - krb5=1.17.2=h926e7f8_0\n  - lcms2=2.11=hcbb858e_1\n  - ld_impl_linux-64=2.35.1=hea4e1c9_1\n  - libcurl=7.71.1=hcdd3856_8\n  - libdb=6.2.32=h9c3ff4c_0\n  - libedit=3.1.20191231=he28a2e2_2\n  - libev=4.33=h516909a_1\n  - libffi=3.3=h58526e2_2\n  - libgcc-ng=9.3.0=h5dbcf3e_17\n  - libgd=2.3.0=h47910db_1\n  - libglib=2.66.4=h164308a_1\n  - libgomp=9.3.0=h5dbcf3e_17\n  - libiconv=1.16=h516909a_0\n  - libidn11=1.34=h1cef754_0\n  - libnghttp2=1.41.0=h8cfc5f6_2\n  - libpng=1.6.37=h21135ba_2\n  - libssh2=1.9.0=hab1572f_5\n  - libstdcxx-ng=9.3.0=h2ae2ef3_17\n  - libtiff=4.2.0=hdc55705_0\n  - libtool=2.4.6=h58526e2_1007\n  - libuuid=2.32.1=h7f98852_1000\n  - libwebp=1.1.0=h76fa15c_4\n  - libwebp-base=1.1.0=h36c2ea0_3\n  - libxcb=1.13=h14c3975_1002\n  - libxml2=2.9.10=h72842e0_3\n  - libxslt=1.1.33=h15afd5d_2\n  - lz4-c=1.9.3=h9c3ff4c_0\n  - mafft=7.475=h516909a_0\n  - mcl=14.137=pl526h516909a_5\n  - minced=0.4.2=0\n  - ncurses=6.2=h58526e2_4\n  - openjdk=11.0.8=hacce0ff_0\n  - openssl=1.1.1i=h7f98852_0\n  - paml=4.9=h516909a_5\n  - pango=1.42.4=h69149e4_5\n  - parallel=20201122=ha770c72_0\n  - pcre=8.44=he1b5a44_0\n  - perl=5.26.2=h36c2ea0_1008\n  - perl-aceperl=1.92=pl526_2\n  - perl-algorithm-diff=1.1903=pl526_2\n  - perl-algorithm-munkres=0.08=pl526_1\n  - perl-apache-test=1.40=pl526_1\n  - perl-app-cpanminus=1.7044=pl526_1\n  - perl-appconfig=1.71=pl526_1\n  - perl-archive-tar=2.32=pl526_0\n  - perl-array-compare=3.0.1=pl526_1\n  - perl-array-utils=0.5=pl526_2\n  - perl-autoloader=5.74=pl526_2\n  - perl-base=2.23=pl526_1\n  - perl-bio-asn1-entrezgene=1.73=pl526_1\n  - perl-bio-coordinate=1.007001=pl526_1\n  - perl-bio-featureio=1.6.905=pl526_2\n  - perl-bio-phylo=0.58=pl526_2\n  - perl-bio-samtools=1.43=pl526h1341992_1\n  - perl-bio-tools-phylo-paml=1.7.3=pl526_1\n  - perl-bio-tools-run-alignment-clustalw=1.7.4=pl526_1\n  - perl-bio-tools-run-alignment-tcoffee=1.7.4=pl526_2\n  - perl-bioperl=1.7.2=pl526_11\n  - perl-bioperl-core=1.007002=pl526_2\n  - perl-bioperl-run=1.007002=pl526_4\n  - perl-business-isbn=3.004=pl526_0\n  - perl-business-isbn-data=20140910.003=pl526_0\n  - perl-cache-cache=1.08=pl526_0\n  - perl-capture-tiny=0.48=pl526_0\n  - perl-carp=1.38=pl526_3\n  - perl-cgi=4.44=pl526h14c3975_1\n  - perl-class-data-inheritable=0.08=pl526_1\n  - perl-class-inspector=1.34=pl526_0\n  - perl-class-load=0.25=pl526_0\n  - perl-class-load-xs=0.10=pl526h6bb024c_2\n  - perl-class-method-modifiers=2.12=pl526_0\n  - perl-clone=0.42=pl526h516909a_0\n  - perl-common-sense=3.74=pl526_2\n  - perl-compress-raw-bzip2=2.087=pl526he1b5a44_0\n  - perl-compress-raw-zlib=2.087=pl526hc9558a2_0\n  - perl-constant=1.33=pl526_1\n  - perl-convert-binary-c=0.78=pl526h6bb024c_3\n  - perl-convert-binhex=1.125=pl526_1\n  - perl-crypt-rc4=2.02=pl526_1\n  - perl-data-dumper=2.173=pl526_0\n  - perl-data-optlist=0.110=pl526_2\n  - perl-data-stag=0.14=pl526_1\n  - perl-date-format=2.30=pl526_2\n  - perl-db-file=1.855=pl526h516909a_0\n  - perl-dbd-sqlite=1.64=pl526h516909a_0\n  - perl-dbi=1.642=pl526_0\n  - perl-devel-globaldestruction=0.14=pl526_0\n  - perl-devel-overloadinfo=0.005=pl526_0\n  - perl-devel-stacktrace=2.04=pl526_0\n  - perl-digest-hmac=1.03=pl526_3\n  - perl-digest-md5=2.55=pl526_0\n  - perl-digest-md5-file=0.08=pl526_2\n  - perl-digest-perl-md5=1.9=pl526_1\n  - perl-digest-sha1=2.13=pl526h6bb024c_1\n  - perl-dist-checkconflicts=0.11=pl526_2\n  - perl-dynaloader=1.25=pl526_1\n  - perl-email-date-format=1.005=pl526_2\n  - perl-encode=2.88=pl526_1\n  - perl-encode-locale=1.05=pl526_6\n  - perl-error=0.17027=pl526_1\n  - perl-eval-closure=0.14=pl526h6bb024c_4\n  - perl-exception-class=1.44=pl526_0\n  - perl-exporter=5.72=pl526_1\n  - perl-exporter-tiny=1.002001=pl526_0\n  - perl-extutils-makemaker=7.36=pl526_1\n  - perl-file-find-rule=0.34=pl526_5\n  - perl-file-grep=0.02=pl526_3\n  - perl-file-listing=6.04=pl526_1\n  - perl-file-path=2.16=pl526_0\n  - perl-file-slurp-tiny=0.004=pl526_1\n  - perl-file-slurper=0.012=pl526_0\n  - perl-file-sort=1.01=pl526_2\n  - perl-file-temp=0.2304=pl526_2\n  - perl-file-which=1.23=pl526_0\n  - perl-font-afm=1.20=pl526_2\n  - perl-font-ttf=1.06=pl526_0\n  - perl-gd=2.68=pl526he941832_0\n  - perl-getopt-long=2.50=pl526_1\n  - perl-graph=0.9704=pl526_1\n  - perl-graph-readwrite=2.09=pl526_2\n  - perl-graphviz=2.24=pl526h734ff71_0\n  - perl-html-element-extended=1.18=pl526_1\n  - perl-html-entities-numbered=0.04=pl526_1\n  - perl-html-formatter=2.16=pl526_0\n  - perl-html-parser=3.72=pl526h6bb024c_5\n  - perl-html-tableextract=2.13=pl526_2\n  - perl-html-tagset=3.20=pl526_3\n  - perl-html-tidy=1.60=pl526_0\n  - perl-html-tree=5.07=pl526_1\n  - perl-html-treebuilder-xpath=0.14=pl526_1\n  - perl-http-cookies=6.04=pl526_0\n  - perl-http-daemon=6.01=pl526_1\n  - perl-http-date=6.02=pl526_3\n  - perl-http-message=6.18=pl526_0\n  - perl-http-negotiate=6.01=pl526_3\n  - perl-image-info=1.38=pl526_1\n  - perl-image-size=3.300=pl526_2\n  - perl-io-compress=2.087=pl526he1b5a44_0\n  - perl-io-html=1.001=pl526_2\n  - perl-io-sessiondata=1.03=pl526_1\n  - perl-io-socket-ssl=2.066=pl526_0\n  - perl-io-string=1.08=pl526_3\n  - perl-io-stringy=2.111=pl526_1\n  - perl-io-tty=1.12=pl526_1\n  - perl-io-zlib=1.10=pl526_2\n  - perl-ipc-run=20180523.0=pl526_0\n  - perl-ipc-sharelite=0.17=pl526h6bb024c_1\n  - perl-jcode=2.07=pl526_2\n  - perl-json=4.02=pl526_0\n  - perl-json-xs=2.34=pl526h6bb024c_3\n  - perl-lib=0.63=pl526_1\n  - perl-libwww-perl=6.39=pl526_0\n  - perl-libxml-perl=0.08=pl526_2\n  - perl-list-moreutils=0.428=pl526_1\n  - perl-list-moreutils-xs=0.428=pl526_0\n  - perl-log-log4perl=1.49=pl526_0\n  - perl-lwp-mediatypes=6.04=pl526_0\n  - perl-lwp-protocol-https=6.07=pl526_4\n  - perl-lwp-simple=6.15=pl526h470a237_4\n  - perl-mailtools=2.21=pl526_0\n  - perl-math-cdf=0.1=pl526h14c3975_5\n  - perl-math-derivative=1.01=pl526_0\n  - perl-math-random=0.72=pl526h14c3975_2\n  - perl-math-spline=0.02=pl526_2\n  - perl-mime-base64=3.15=pl526_1\n  - perl-mime-lite=3.030=pl526_1\n  - perl-mime-tools=5.508=pl526_1\n  - perl-mime-types=2.17=pl526_0\n  - perl-mldbm=2.05=pl526_1\n  - perl-module-implementation=0.09=pl526_2\n  - perl-module-runtime=0.016=pl526_1\n  - perl-module-runtime-conflicts=0.003=pl526_0\n  - perl-moo=2.003004=pl526_0\n  - perl-moose=2.2011=pl526hf484d3e_1\n  - perl-mozilla-ca=20180117=pl526_1\n  - perl-mro-compat=0.13=pl526_0\n  - perl-net-http=6.19=pl526_0\n  - perl-net-ssleay=1.88=pl526h90d6eec_0\n  - perl-ntlm=1.09=pl526_4\n  - perl-number-compare=0.03=pl526_2\n  - perl-ole-storage_lite=0.19=pl526_3\n  - perl-package-deprecationmanager=0.17=pl526_0\n  - perl-package-stash=0.38=pl526hf484d3e_1\n  - perl-package-stash-xs=0.28=pl526hf484d3e_1\n  - perl-params-util=1.07=pl526h6bb024c_4\n  - perl-parent=0.236=pl526_1\n  - perl-parse-recdescent=1.967015=pl526_0\n  - perl-parse-yapp=1.21=pl526_0\n  - perl-pathtools=3.75=pl526h14c3975_1\n  - perl-pdf-api2=2.035=pl526_0\n  - perl-perlio-utf8_strict=0.007=pl526h6bb024c_1\n  - perl-pod-escapes=1.07=pl526_1\n  - perl-pod-usage=1.69=pl526_1\n  - perl-postscript=0.06=pl526_2\n  - perl-role-tiny=2.000008=pl526_0\n  - perl-scalar-list-utils=1.52=pl526h516909a_0\n  - perl-set-scalar=1.29=pl526_2\n  - perl-soap-lite=1.19=pl526_1\n  - perl-socket=2.027=pl526_1\n  - perl-sort-naturally=1.03=pl526_2\n  - perl-spreadsheet-parseexcel=0.65=pl526_2\n  - perl-spreadsheet-writeexcel=2.40=pl526_2\n  - perl-statistics-descriptive=3.0702=pl526_0\n  - perl-storable=3.15=pl526h14c3975_0\n  - perl-sub-exporter=0.987=pl526_2\n  - perl-sub-exporter-progressive=0.001013=pl526_0\n  - perl-sub-identify=0.14=pl526h14c3975_0\n  - perl-sub-install=0.928=pl526_2\n  - perl-sub-name=0.21=pl526_1\n  - perl-sub-quote=2.006003=pl526_1\n  - perl-sub-uplevel=0.2800=pl526h14c3975_2\n  - perl-svg=2.84=pl526_0\n  - perl-svg-graph=0.02=pl526_3\n  - perl-task-weaken=1.06=pl526_0\n  - perl-template-toolkit=2.26=pl526_1\n  - perl-test=1.26=pl526_1\n  - perl-test-deep=1.128=pl526_1\n  - perl-test-differences=0.67=pl526_0\n  - perl-test-exception=0.43=pl526_2\n  - perl-test-harness=3.42=pl526_0\n  - perl-test-leaktrace=0.16=pl526h14c3975_2\n  - perl-test-most=0.35=pl526_0\n  - perl-test-requiresinternet=0.05=pl526_0\n  - perl-test-warn=0.36=pl526_1\n  - perl-text-csv=2.00=pl526_0\n  - perl-text-diff=1.45=pl526_0\n  - perl-text-glob=0.11=pl526_1\n  - perl-tie-ixhash=1.23=pl526_2\n  - perl-time-hires=1.9760=pl526h14c3975_1\n  - perl-time-local=1.28=pl526_1\n  - perl-timedate=2.30=pl526_1\n  - perl-tree-dag_node=1.31=pl526_0\n  - perl-try-tiny=0.30=pl526_1\n  - perl-type-tiny=1.004004=pl526_0\n  - perl-types-serialiser=1.0=pl526_2\n  - perl-unicode-map=0.112=pl526h6bb024c_3\n  - perl-uri=1.76=pl526_0\n  - perl-www-robotrules=6.02=pl526_3\n  - perl-xml-dom=1.46=pl526_0\n  - perl-xml-dom-xpath=0.14=pl526_1\n  - perl-xml-filter-buffertext=1.01=pl526_2\n  - perl-xml-libxml=2.0132=pl526h7ec2d77_1\n  - perl-xml-libxslt=1.94=pl526_1\n  - perl-xml-namespacesupport=1.12=pl526_0\n  - perl-xml-parser=2.44_01=pl526ha1d75be_1002\n  - perl-xml-regexp=0.04=pl526_2\n  - perl-xml-sax=1.02=pl526_0\n  - perl-xml-sax-base=1.09=pl526_0\n  - perl-xml-sax-expat=0.51=pl526_3\n  - perl-xml-sax-writer=0.57=pl526_0\n  - perl-xml-simple=2.25=pl526_1\n  - perl-xml-twig=3.52=pl526_2\n  - perl-xml-writer=0.625=pl526_2\n  - perl-xml-xpath=1.44=pl526_0\n  - perl-xml-xpathengine=0.14=pl526_2\n  - perl-xsloader=0.24=pl526_0\n  - perl-yaml=1.29=pl526_0\n  - pip=20.3.3=pyhd8ed1ab_0\n  - pixman=0.40.0=h36c2ea0_0\n  - prank=v.170427=hc9558a2_3\n  - prodigal=2.6.3=h516909a_2\n  - prokka=1.14.6=pl526_0\n  - pthread-stubs=0.4=h36c2ea0_1001\n  - python=3.7.9=hffdb5ce_0_cpython\n  - python_abi=3.7=1_cp37m\n  - readline=8.0=he28a2e2_2\n  - roary=3.13.0=pl526h516909a_0\n  - setuptools=49.6.0=py37he5f6b98_2\n  - sqlite=3.34.0=h74cdb3f_0\n  - t_coffee=11.0.8=py37hea885bf_8\n  - tbl2asn-forever=25.7.2f=h516909a_0\n  - tidyp=1.04=h516909a_2\n  - tk=8.6.10=h21135ba_1\n  - wheel=0.36.2=pyhd3deb0d_0\n  - xorg-fixesproto=5.0=h14c3975_1002\n  - xorg-inputproto=2.3.2=h7f98852_1002\n  - xorg-kbproto=1.0.7=h7f98852_1002\n  - xorg-libice=1.0.10=h516909a_0\n  - xorg-libsm=1.2.3=h84519dc_1000\n  - xorg-libx11=1.6.12=h516909a_0\n  - xorg-libxau=1.0.9=h14c3975_0\n  - xorg-libxdmcp=1.1.3=h516909a_0\n  - xorg-libxext=1.3.4=h516909a_0\n  - xorg-libxfixes=5.0.3=h516909a_1004\n  - xorg-libxi=1.7.10=h516909a_0\n  - xorg-libxpm=3.5.13=h516909a_0\n  - xorg-libxrender=0.9.10=h516909a_1002\n  - xorg-libxt=1.1.5=h516909a_1003\n  - xorg-libxtst=1.2.3=h516909a_1002\n  - xorg-recordproto=1.14.2=h516909a_1002\n  - xorg-renderproto=0.11.1=h14c3975_1002\n  - xorg-xextproto=7.3.0=h7f98852_1002\n  - xorg-xproto=7.0.31=h7f98852_1007\n  - xz=5.2.5=h516909a_1\n  - zlib=1.2.11=h516909a_1010\n  - zstd=1.4.8=ha95c52a_1\n\n"
  },
  {
    "path": "workflow/metaGEM.sh",
    "content": "#!/bin/bash\n\n# Version\nVERSION=\"1.0.5\"\n\n# Logo\nprintLogo() {\n\n  echo \"\n=================================================================================================================================\nDeveloped by: Francisco Zorrilla, Kiran R. Patil, and Aleksej Zelezniak___________________________________________________________\nPublication: doi.org/10.1101/2020.12.31.424982___________________________/\\\\\\\\\\\\\\\\\\\\\\\\___/\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\___/\\\\\\\\____________/\\\\\\\\_        \n________________________________________________________________________/\\\\\\//////////___\\/\\\\\\///////////___\\/\\\\\\\\\\\\________/\\\\\\\\\\\\_       \n____________________________________________/\\\\\\________________________/\\\\\\______________\\/\\\\\\______________\\/\\\\\\//\\\\\\____/\\\\\\//\\\\\\_      \n_______/\\\\\\\\\\__/\\\\\\\\\\________/\\\\\\\\\\\\\\\\____/\\\\\\\\\\\\\\\\\\\\\\___/\\\\\\\\\\\\\\\\\\_____\\/\\\\\\____/\\\\\\\\\\\\\\__\\/\\\\\\\\\\\\\\\\\\\\\\______\\/\\\\\\\\///\\\\\\/\\\\\\/_\\/\\\\\\_     \n______/\\\\\\///\\\\\\\\\\///\\\\\\____/\\\\\\/////\\\\\\__\\////\\\\\\////___\\////////\\\\\\____\\/\\\\\\___\\/////\\\\\\__\\/\\\\\\///////_______\\/\\\\\\__\\///\\\\\\/___\\/\\\\\\_    \n______\\/\\\\\\_\\//\\\\\\__\\/\\\\\\___/\\\\\\\\\\\\\\\\\\\\\\______\\/\\\\\\_________/\\\\\\\\\\\\\\\\\\\\___\\/\\\\\\_______\\/\\\\\\__\\/\\\\\\______________\\/\\\\\\____\\///_____\\/\\\\\\_   \n_______\\/\\\\\\__\\/\\\\\\__\\/\\\\\\__\\//\\\\///////_______\\/\\\\\\_/\\\\____/\\\\\\/////\\\\\\___\\/\\\\\\_______\\/\\\\\\__\\/\\\\\\______________\\/\\\\\\_____________\\/\\\\\\_  \n________\\/\\\\\\__\\/\\\\\\__\\/\\\\\\___\\//\\\\\\\\\\\\\\\\\\\\_____\\//\\\\\\\\\\____\\//\\\\\\\\\\\\\\\\/\\\\__\\//\\\\\\\\\\\\\\\\\\\\\\\\/___\\/\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\__\\/\\\\\\_____________\\/\\\\\\_ \n_________\\///___\\///___\\///_____\\//////////_______\\/////______\\////////\\//____\\////////////_____\\///////////////___\\///______________\\///__\n=============================================================================================================================================\nA Snakemake-based pipeline desinged to predict metabolic interactions directly from metagenomics data using high performance computer clusters\n===============================================================================================================================================\nVersion: ${VERSION}\n\"\n\n}\n\n# Helpfile function\nusage() {\n\n  printLogo\n\n  echo -n \"Usage: bash metaGEM.sh [-t|--task TASK] \n                       [-j|--nJobs NUMBER OF JOBS] \n                       [-c|--cores NUMBER OF CORES] \n                       [-m|--mem GB RAM] \n                       [-h|--hours MAX RUNTIME]\n                       [-l|--local]\n\nSnakefile wrapper/parser for metaGEM, for more details visit https://github.com/franciscozorrilla/metaGEM.\n\nOptions:\n  -t, --task        Specify task to complete:\n\n                        SETUP\n                            createFolders\n                            downloadToy\n                            organizeData\n                            check\n\n                        CORE WORKFLOW\n                            fastp \n                            megahit \n                            crossMapSeries\n                            kallistoIndex\n                            crossMapParallel\n                            kallisto2concoct\n                            concoct \n                            metabat\n                            maxbin \n                            binRefine \n                            binReassemble \n                            extractProteinBins\n                            carveme\n                            memote\n                            organizeGEMs\n                            smetana\n                            extractDnaBins\n                            gtdbtk\n                            abundance\n\n                        BONUS\n                            grid\n                            prokka\n                            roary\n                            eukrep\n                            eukcc\n\n                        VISUALIZATION (in development)\n                            stats\n                            qfilterVis\n                            assemblyVis\n                            binningVis\n                            compositionVis\n                            modelVis\n                            interactionVis\n                            growthVis\n\n  -j, --nJobs       Specify number of jobs to run in parallel\n  -c, --nCores      Specify number of cores per job\n  -m, --mem         Specify memory in GB required for job\n  -h, --hours       Specify number of hours to allocated to job runtime\n  -l, --local       Run jobs on local machine for non-cluster usage\n\n\"\n}\n\n# Run check task\nrun_check() {\n\n#check if conda is installed/available\necho -ne \"Checking if conda is available ... \"\ncondatest=$(conda list|wc -l)\nif [[ \"$condatest\" -eq 0 ]]; then\n    echo -e \"WARNING: Conda is not available! Please load your cluster's conda module or install locally.\\n\" && exit\nelif [[ \"$condatest\" -gt 0 ]]; then\n    condav=$(conda --version|cut -d ' ' -f2)\n    echo -e \"detected version $condav!\"\nfi\n\n# check if conda environments are present\necho -ne \"Searching for metaGEM conda environment ... \"\nenvcheck1=$(conda info --envs|grep -w metagem|wc -l)\nif [[ \"$envcheck1\" -ge 1 ]]; then\n    echo \"detected! Activating metagem env ... \"\n    conda activate metagem\nelse\n    echo \"not detected, please run the env_setup.sh script!\"\nfi\n\necho -ne \"Searching for metaWRAP conda environment ... \"\nenvcheck2=$(conda info --envs|grep -w metawrap|wc -l)\nif [[ \"$envcheck2\" -ge 1 ]]; then\n    echo \"detected!\"\nelse\n    echo \"not detected, please run the env_setup.sh script!\"\nfi\n\necho -ne \"Searching for prokka-roary conda environment ... \"\nenvcheck3=$(conda info --envs|grep -w prokkaroary|wc -l)\nif [[ \"$envcheck3\" -ge 1 ]]; then\n    echo -e \"detected!\\n\"\nelse\n    echo -e \"not detected, please run the env_setup.sh script!\\n\"\nfi\n\n# run createFolders rule to create folders in case any of them are missing\necho -e \"Checking folders in workspace $pwd ... \"\nnFolders=$(ls -d */|wc -l)\nif [[ \"$nFolders\" -le 20 ]]; then\n    while true; do\n        read -p \"Some folders appear to be missing, do you wish to run the createFolders Snakefile rule? (y/n)\" yn\n        case $yn in\n            [Yy]* ) echo \"Running the createFolders snakefile rule ... \" && snakemake createFolders -j1; break;;\n            [Nn]* ) echo \"Skipping folder creation ... \"; break;;\n            * ) echo \"Please answer yes or no.\";;\n        esac\n    done\nfi\n\n# search for folders and files with .gz extension within dataset folder\ncount_files=$(find dataset -name \"*.gz\"|wc -l)\ncount_samp=$(ls dataset|grep -v gz|wc -l)\nif [[ \"$count_files\" -eq 0 ]]; then\n    echo -e \"\\nThere are no sequencing files (*.gz) in the dataset folder!\"\n    echo -e \"Please download or move your paired end files into sample specific subfolders within the dataset folder.\\n\"\n    while true; do\n        read -p \"Do you wish to download a 3 sample dataset using the downloadToy Snakefile rule? ~1.8 GB of storage required (y/n)\" yn\n        case $yn in\n            [Yy]* ) echo \"Running the downloadToy snakefile rule ... \" && snakemake downloadToy -j1; break;;\n            [Nn]* ) echo \"Skipping toy dataset download ... \"; break;;\n            * ) echo \"Please answer yes or no.\";;\n        esac\n    done\nelif [[ \"$count_samp\" -eq 0 && \"$count_files\" -ne 0 ]]; then \n    echo -e \"\\nDetected $count_files unorganized files (*.gz) in dataset folder, running organizeData rule ... \"\n    while true; do\n        read -p \"Do you wish to organize your samples using the organizeData Snakefile rule? (y/n)\" yn\n        case $yn in\n            [Yy]* ) echo \"Running the organizeData snakefile rule ... \" && snakemake organizeData -j1; break;;\n            [Nn]* ) echo \"Skipping toy dataset download ... \"; break;;\n            * ) echo \"Please answer yes or no.\";;\n        esac\n    done\nelif [[ \"$count_samp\" -ne 0 && \"$count_files\" -ne 0 ]]; then\n    echo -e \"\\nFiles appear to be organized into sample specific subdirectories within the dataset folder.\"\n    echo -e \"\\nPrinting sample IDs for user verification: \"\n    ls dataset|grep -v gz\n    echo \"\"\nfi\n\n# scratch dir\necho -e \"\\nPlease remember to set the scratch/ path in the config.yaml file\"\necho 'Ideally this path should be set to a job-specific variable that points to a location on your cluster for high I/O operations (e.g. $SCRATCH or $TMPDIR)'\necho \"However, it can also be a static directory and metaGEM will create job specific subdirectories automatically.\"\n\n}\n\n# Run stats task\nrun_stats() {\n\necho -e \"Checking status of current metaGEM analysis ... \\n\"\n\n#dataset: count subfolders to determine total number of samples\nnsamp=$(ls -d dataset/*/|wc -l)\necho \"Raw data: $nsamp samples were identified in the dataset folder ... \"\n\n#qfilter: count .json report files\nnqfilt=$(find qfiltered -name \"*.json\"|wc -l)\necho \"Quality filtering: $nqfilt / $nsamp samples processed ... \"\n    \n#assembly: count .gz fasta files\nnassm=$(find assemblies -name \"*.gz\"|wc -l)\necho \"Assembly: $nassm / $nsamp samples processed ... \"\n    \n#concoct: count *concoct-bins subfolders\nnconc=$(find concoct -name \"*.concoct-bins\"|wc -l)\necho \"Binning (CONCOCT): $nconc / $nsamp samples processed ... \"\n    \n#maxbin2: count *maxbin-bins subfolders\nnmaxb=$(find maxbin -name \"*.maxbin-bins\"|wc -l)\necho \"Binning (MaxBin2): $nmaxb / $nsamp samples processed ... \"\n    \n#metabat2: count *metabat-bins subfolders\nnmetab=$(find metabat -name \"*.metabat-bins\"|wc -l)\necho \"Binning (MetaBAT2): $nmetab / $nsamp samples processed ... \"\n\n#metawrap_refine: count subfolders\nnmwref=$(ls -d refined_bins/*|wc -l)\necho \"Bin refinement: $nmwref / $nsamp samples processed ... \"\n    \n#metawrap_reassemble: count subfolders, also determine total number of final MAGs across samples\nnmwrea=$(ls -d reassembled_bins/*|wc -l)\necho \"Bin reassembly: $nmwrea / $nsamp samples processed ... \"\n\n#taxonomy: count subfolders\nntax=$(ls -d GTDBTk/*|wc -l)\necho \"Taxonomy: $ntax / $nsamp samples processed ... \"\n    \n#abundances: count subfolders\nnabund=$(ls -d abundance/*|wc -l)\necho \"Abundance: $nabund / $nsamp samples processed ... \"\n    \n#models: count subfolders for sample progress and count .xml GEM files for total models generated\nngems=$(find GEMs -name \"*xml\"|wc -l)\nngemsamp=$(ls -d GEMs/*|wc -l)\necho \"GEMs: $ngems models generated from $ngemsamp samples ... \"\n    \n#model reports: count subfolders\nnmemo=$(find memote -name \"*.gz\"|wc -l)\necho \"GEM Reports: $nmemo / $ngems models samples ... \"\n    \n#simulations: count .tsv files\nnsmet=$(find memote -name \"*.gz\"|wc -l)\necho -e \"GEM Reports: $nsmet / $ngemsamp communities simulated ... \\n\"\n\n}\n\n# Prompt user to confirm input parameters/options\ncheckParams() {\n\n    echo \" \"\n    while true; do\n        read -p \"Do you wish to continue with these parameters? (y/n)\" yn\n        case $yn in\n            [Yy]* ) echo \"Proceeding with $task job(s) ... \" ; break;;\n            [Nn]* ) exit;;\n            * ) echo \"Please answer yes or no.\";;\n        esac\n    done\n\n}\n\n# Display config.yaml function for user inspection\nsnakeConfig() {\n\n    # Show config.yaml params\n    echo -e \"\\nPlease verify parameters set in the config.yaml file: \\n\"\n    paste ../config/config.yaml\n    echo -e \"\\nPlease pay close attention to make sure that your paths are properly configured!\"\n\n    while true; do\n        read -p \"Do you wish to proceed with this config.yaml file? (y/n)\" yn\n        case $yn in\n            [Yy]* ) echo \" \"; break;;\n            [Nn]* ) exit;;\n            * ) echo \"Please answer yes or no.\";;\n        esac\n    done\n\n}\n\n# Display cluster_config.json function for user inspection\nclusterConfig() {\n\n    # Show cluster_config.json params\n    echo -e \"Please verify parameters set in the cluster_config.json file: \\n\"\n    paste ../config/cluster_config.json\n    echo \" \"\n\n    while true; do\n        read -p \"Do you wish to proceed with this cluster_config.json file? (y/n)\" yn\n        case $yn in\n            [Yy]* ) echo \" \"; break;;\n            [Nn]* ) exit;;\n            * ) echo \"Please answer yes or no.\";;\n        esac\n    done\n    \n}\n\n# Prepare to submit cluster jobs function: display config files, unlock, and dry run\nsnakePrep() {\n\n    snakeConfig\n\n    clusterConfig\n\n    echo \"Unlocking snakemake ... \"\n    snakemake --unlock -j 1\n\n    echo -e \"\\nDry-running snakemake jobs ... \"\n    snakemake all -j $njobs -n -k --cluster-config ../config/cluster_config.json -c \"sbatch -A {cluster.account} -t {cluster.time} -n {cluster.n} --ntasks {cluster.tasks} --cpus-per-task {cluster.n} --output {cluster.output}\"\n}\n\n# Submit login node function, note that is only works for rules with no wildcard expansion\nsubmitLogin() {\n\n    echo \"No need to parse Snakefile for target rule: $task ... \"\n\n    checkParams\n\n    snakeConfig\n\n    echo \"Unlocking snakemake ... \"\n    snakemake --unlock -j 1\n    echo \" \"\n\n    while true; do\n        read -p \"Do you wish to submit this $task job? (y/n)\" yn\n        case $yn in\n            [Yy]* ) snakemake $task -j 1; break;;\n            [Nn]* ) exit;;\n            * ) echo \"Please answer yes or no.\";;\n        esac\n    done\n\n}\n\n# Submit local function, similar to submitLogin() but can handle wildcard expanded rules for non-cluster usage\nsubmitLocal() {\n\n    # Parse Snakefile rule all (line 22 of Snakefile) input to match output of desired target rule stored in \"$string\". Note: Hardcoded line number.\n    echo \"Parsing Snakefile to target rule: $task ... \"\n    sed  -i \"22s~^.*$~        $string~\" Snakefile\n\n    checkParams\n\n    snakeConfig\n\n    echo \"Unlocking snakemake ... \"\n    snakemake --unlock -j 1\n\n    echo -e \"\\nDry-running snakemake jobs ... \"\n    snakemake all -n\n\n    while true; do\n        read -p \"Do you wish to submit this batch of jobs on your local machine? (y/n)\" yn\n        case $yn in\n            [Yy]* ) echo \"snakemake all -j $njobs -k\"|bash; break;;\n            [Nn]* ) exit;;\n            * ) echo \"Please answer yes or no.\";;\n        esac\n    done\n\n}\n\n# Submit cluster function\nsubmitCluster() {\n\n    # Parse Snakefile rule all (line 22 of Snakefile) input to match output of desired target rule stored in \"$string\". Note: Hardcoded line number.\n    echo \"Parsing Snakefile to target rule: $task ... \"\n    sed  -i \"22s~^.*$~        $string~\" Snakefile\n\n    # Check if the number of jobs flag is specified by user for cluster job\n    if [[ -z \"$njobs\" ]]; then\n        \n        # No number of jobs provided.\n        echo \"WARNING: User is requesting to submit cluster job without specifying the number of jobs parameter (-j) ... \"\n    else   \n        # Number of jobs provided.\n        echo \"Number of jobs to be sumitted to cluster: $njobs ... \"\n    fi   \n\n    # Check if the number of cores flag is specified by user for cluster job\n    if [[ -z \"$ncores\" ]]; then\n        \n        # No number of cores provided.\n        echo \"WARNING: User is requesting to submit cluster job without specifying the number of cores parameter (-c) ... \"\n\n    else\n\n        # Parse cluster_config.json cores (line 5) to match number requested cores stored in \"$ncores\". Note: Hardcoded line number.\n        echo \"Parsing cluster_config.json to match requested number of cores: $ncores ... \"\n        sed -i \"5s/:.*$/: $ncores,/\" ../config/cluster_config.json\n\n    fi \n\n    # Check if the hours flag is specified by user for cluster job\n    if [[ -z \"$hours\" ]]; then\n        \n        # No number of jobs provided.\n        echo \"WARNING: User is requesting to submit cluster job without specifying the number of hours parameter (-h) ... \"\n\n    else\n\n        # Parse cluster_config.json time (line 4) to match number requested hours stored in \"$hours\". Note: Hardcoded line number.\n        echo \"Parsing cluster_config.json to match requested time (hours): $hours ... \"\n        sed -i \"4s/:.*$/: \\\"0-$hours:00:00\\\",/\" ../config/cluster_config.json\n\n    fi \n\n    # Check if memory input argument was provided by user. If so, parse cluster_config.json memory (line 7) to match requested memory stored in \"$mem\". Note: Hardcoded line number.\n    if [[ -z \"$mem\" ]]; then\n\n        # No memory flag provided.\n        echo \"WARNING: User is requesting to submit cluster job without specifying the memory flag (-m) ... \"\n\n        checkParams\n\n        snakePrep\n\n        while true; do\n            read -p \"Do you wish to submit this batch of $task jobs? (y/n)\" yn\n            case $yn in\n                [Yy]* ) echo \"nohup snakemake all -j $njobs -k --cluster-config ../config/cluster_config.json -c 'sbatch -A {cluster.account} -t {cluster.time} -n {cluster.n} --ntasks {cluster.tasks} --cpus-per-task {cluster.n} --output {cluster.output}' &\"|bash; break;;\n                [Nn]* ) exit;;\n                * ) echo \"Please answer yes or no.\";;\n            esac\n        done\n\n    else\n\n        # Memory flag was provided, parse cluster_config.json memory (line 7) to match number requested memory stored in \"$mem\". Note: Hardcoded line number.\n        echo \"Parsing cluster_config.json to match requested memory: $mem ... \"\n        sed -i \"7s/:.*$/: $(echo $mem)G,/\" ../config/cluster_config.json\n\n        checkParams\n\n        snakePrep\n\n        while true; do\n            read -p \"Do you wish to submit this batch of jobs? (y/n)\" yn\n            case $yn in\n                [Yy]* ) echo \"nohup snakemake all -j $njobs -k --cluster-config ../config/cluster_config.json -c 'sbatch -A {cluster.account} -t {cluster.time} --mem {cluster.mem} -n {cluster.n} --ntasks {cluster.tasks} --cpus-per-task {cluster.n} --output {cluster.output}' &\"|bash; break;;\n                [Nn]* ) exit;;\n                * ) echo \"Please answer yes or no.\";;\n            esac\n        done\n    fi\n\n}\n\n# Parse function\nparse() {\n\n  printLogo\n\n  # Set root folder\n  echo -e \"Setting current directory to root in config.yaml file ... \\n\"\n  root=$(pwd)\n  sed  -i \"2s~/.*$~$root~\" config.yaml # hardcoded line for root, change the number 2 if any new lines are added to the start of config.yaml\n\n  # No need to parse snakefile for login node jobs, submit the following locally\n  if [ $task == \"createFolders\" ] || [ $task == \"downloadToy\" ] || [ $task == \"organizeData\" ] || [ $task == \"qfilterVis\" ] || [ $task == \"assemblyVis\" ] || [ $task == \"binningVis\" ] || [ $task == \"compositionVis\" ] || [ $task == \"abundanceVis\" ] || [ $task == \"extractProteinBins\" ] || [ $task == \"extractDnaBins\" ] || [ $task == \"organizeGEMs\" ] || [ $task == \"modelVis\" ] || [ $task == \"interactionVis\" ] || [ $task == \"growthVis\" ] || [ $task == \"binning\" ] || [ $task == \"binEvaluation\" ] || [ $task == \"prepareRoary\" ]; then\n    submitLogin\n\n  elif [ $task == \"check\" ]; then\n    run_check\n\n  elif [ $task == \"stats\" ]; then\n    run_stats\n\n # Parse snakefile for cluster/local jobs\n  elif [ $task == \"fastp\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"qfiltered\"]+\"/{IDs}/{IDs}_R1.fastq.gz\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"megahit\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"assemblies\"]+\"/{IDs}/contigs.fasta.gz\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"crossMapSeries\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"concoct\"]+\"/{IDs}/cov\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"kallistoIndex\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"kallistoIndex\"]+\"/{focal}/index.kaix\", focal = focal)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"kallistoIndex\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"kallistoIndex\"]+\"/{focal}/index.kaix\", focal = focal)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"crossMapParallel\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"kallisto\"]+\"/{focal}/{IDs}\", focal = focal , IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"run_prodigal\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"prodigal\"]+\"/{IDs}/{IDs}_genes.gff\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n\n  elif [ $task == \"run_blastp\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"blastp\"]+\"/{IDs}.xml\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"concoct\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"concoct\"]+\"/{IDs}/{IDs}.concoct-bins\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"metabat\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"metabat\"]+\"/{IDs}/{IDs}.metabat-bins\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"maxbin\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"maxbin\"]+\"/{IDs}/{IDs}.maxbin-bins\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"binning\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"benchmarks\"]+\"/{IDs}/{IDs}.binning.benchmark.txt\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"binRefine\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"refined\"]+\"/{IDs}\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"binReassemble\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"reassembled\"]+\"/{IDs}\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"gtdbtk\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"classification\"]+\"/{IDs}\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"abundance\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"abundance\"]+\"/{IDs}\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"carveme\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"GEMs\"]+\"/{binIDs}.xml\", binIDs = binIDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"smetana\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"SMETANA\"]+\"/{IDs}_detailed.tsv\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"memote\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"memote\"]+\"/{gemIDs}\", gemIDs = gemIDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"grid\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"GRiD\"]+\"/{IDs}\", IDs = IDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"prokka\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"pangenome\"]+\"/prokka/unorganized/{binIDs}\", binIDs = binIDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  elif [ $task == \"roary\" ]; then\n    string='expand(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"pangenome\"]+\"/roary/{speciesIDs}/\", speciesIDs = speciesIDs)'\n    if [ $local == \"true\" ]; then\n        submitLocal\n    else\n        submitCluster\n    fi\n\n  else\n    echo \"Task not recognized.\"\n    usage\n  fi\n\n}\n\n# Read input arguments\nif [ $# -eq 0 ]; then\n    echo \"No arguments provided ... \"\n    usage\nelse\n    local=false;\n    # Read in options\n    while [[ $1 = -?* ]]; do\n      case $1 in\n        -t|--task) shift; task=${1} ;;\n        -j|--nJobs) shift; njobs=${1} ;;\n        -c|--nCores) shift; ncores=${1} ;;\n        -m|--mem) shift; mem=${1} ;;\n        -h|--hours) shift; hours=${1} ;;\n        -l|--local) shift; local=true;;\n        --endopts) shift; break ;;\n        * ) echo \"Unknown option(s) provided, please read helpfile ... \" && usage && exit 1;;\n      esac\n      shift\n    done\n    parse\n\nfi\n"
  },
  {
    "path": "workflow/rules/Snakefile_experimental.smk.py",
    "content": "rule metaspades: \n    input:\n        R1 = rules.qfilter.output.R1, \n        R2 = rules.qfilter.output.R2\n    output:\n        config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"assemblies\"]+\"/{IDs}/contigs.fasta.gz\"\n    benchmark:\n        config[\"path\"][\"root\"]+\"/\"+\"benchmarks/{IDs}.metaspades.benchmark.txt\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        cp {input.R1} {input.R2} $TMPDIR    \n        cd $TMPDIR\n        metaspades.py --only-assembler -1 $(basename {input.R1}) -2 $(basename {input.R2}) -t {config[cores][metaspades]} -o .\n        gzip contigs.fasta\n        mkdir -p $(dirname {output})\n        rm $(basename {input.R1}) $(basename {input.R2})\n        mv -v contigs.fasta.gz spades.log $(dirname {output})\n        \"\"\"\n\nrule megahitCoassembly:\n    input:\n        R1 = f'/scratch/zorrilla/soil/coassembly/data/{{borkSoil}}/R1', \n        R2 = f'/scratch/zorrilla/soil/coassembly/data/{{borkSoil}}/R2'\n    output:\n        f'{config[\"path\"][\"root\"]}/coassembly/coassemblies/{{borkSoil}}/contigs.fasta.gz'\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/coassembly.{{borkSoil}}.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        cd $SCRATCHDIR\n\n        echo -n \"Copying qfiltered reads to $SCRATCHDIR ... \"\n        cp -r {input.R1} {input.R2} $SCRATCHDIR\n        echo \"done. \"\n\n        R1=$(ls R1/|tr '\\n' ','|sed 's/,$//g')\n        R2=$(ls R2/|tr '\\n' ','|sed 's/,$//g')\n\n        mv R1/* .\n        mv R2/* .\n\n        echo -n \"Running megahit ... \"\n        megahit -t {config[cores][megahit]} \\\n            --presets {config[params][assemblyPreset]} \\\n            --min-contig-len {config[params][assemblyMin]}\\\n            --verbose \\\n            -1 $R1 \\\n            -2 $R2 \\\n            -o tmp;\n        echo \"done. \"\n\n        echo \"Renaming assembly ... \"\n        mv tmp/final.contigs.fa contigs.fasta\n        \n        echo \"Fixing contig header names: replacing spaces with hyphens ... \"\n        sed -i 's/ /-/g' contigs.fasta\n\n        echo \"Zipping and moving assembly ... \"\n        gzip contigs.fasta\n        mkdir -p $(dirname {output})\n        mv contigs.fasta.gz $(dirname {output})\n        echo \"Done. \"\n        \"\"\"\n\nrule metabatMultiSample:\n    input:\n        assembly=config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"assemblies\"]+\"/{IDs}/contigs.fasta.gz\",\n        reads=config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"qfiltered\"]\n    output:\n        directory(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"metabat\"]+\"/{IDs}/{IDs}.metabat-bins\")\n    benchmark:\n        config[\"path\"][\"root\"]+\"/\"+\"benchmarks/{IDs}.metabat.benchmark.txt\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        mkdir -p $(dirname $(dirname {output}))\n        mkdir -p $(dirname {output})\n        cp {input.assembly} $TMPDIR\n        cd $TMPDIR\n        mv $(basename {input.assembly}) $(basename $(dirname {input.assembly})).gz\n        gunzip $(basename $(dirname {input.assembly})).gz\n        bwa index $(basename $(dirname {input.assembly}))\n\n        for sample in {input.reads}/*;do\n            echo \"Mapping sample $sample ... \"\n            ID=$(basename $sample);\n            bwa mem -t {config[cores][metabat]} $(basename $(dirname {input.assembly})) $sample/*_1.fastq.gz $sample/*_2.fastq.gz > $ID.sam\n            samtools view -@ {config[cores][metabat]} -Sb $ID.sam > $ID.bam\n            samtools sort -@ {config[cores][metabat]} $ID.bam $ID.sort\n            rm $ID.bam $ID.sam\n            echo \"Done mapping sample $sample !\"\n            echo \"Creating depth file for sample $sample ... \"\n            jgi_summarize_bam_contig_depths --outputDepth depth.txt $ID.sort.bam\n            echo \"Done creating depth file for sample $sample !\"\n            rm $ID.sort.bam\n            paste $sample.depth.txt\n        done\n\n        runMetaBat.sh $(basename $(dirname {input.assembly})) *.sort.bam\n        mv *.txt *.tab $(basename {output}) $(dirname {output})\n        \"\"\"\n\n\nrule crossMap2:  \n    input:\n        contigs = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"assemblies\"]}/{{focal}}/contigs.fasta.gz',\n        R1 = rules.qfilter.output.R1,\n        R2 = rules.qfilter.output.R2\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"crossMap\"]}/{{focal}}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{focal}}.{{IDs}}.crossMap.benchmark.txt'\n    message:\n        \"\"\"\n        This rule is an alternative implementation of the rule crossMap.\n        Instead of taking each focal sample as a job and cross mapping in series using a for loop,\n        here the cross mapping is done completely in parallel. \n        This implementation is not recommended, as it wastefully recreates a bwa index for each mapping\n        operation. Use crossMap for smaller datasets or crossMap3 for larger datasets.\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        cd $SCRATCHDIR\n\n        echo -e \"\\nCopying assembly {input.contigs} and reads {input.R1} {input.R2} to $SCRATCHDIR\"\n        cp {input.contigs} {input.R1} {input.R2} .\n\n        mkdir -p {output}\n\n        # Define the focal sample ID, fsample: \n        # The one sample's assembly that reads will be mapped against\n        fsampleID=$(echo $(basename $(dirname {input.contigs})))\n        echo -e \"\\nFocal sample: $fsampleID ... \"\n\n        echo \"Renaming and unzipping assembly ... \"\n        mv $(basename {input.contigs}) $(echo $fsampleID|sed 's/$/.fa.gz/g')\n        gunzip $(echo $fsampleID|sed 's/$/.fa.gz/g')\n\n        echo -e \"\\nIndexing assembly ... \"\n        bwa index $fsampleID.fa\n\n        id=$(basename {output})\n        echo -e \"\\nMapping reads from sample $id against assembly of focal sample $fsampleID ...\"\n        bwa mem -t {config[cores][crossMap]} $fsampleID.fa *.fastq.gz > $id.sam\n\n        echo -e \"\\nDeleting no-longer-needed fastq files ... \"\n        rm *.gz\n                \n        echo -e \"\\nConverting SAM to BAM with samtools view ... \" \n        samtools view -@ {config[cores][crossMap]} -Sb $id.sam > $id.bam\n\n        echo -e \"\\nDeleting no-longer-needed sam file ... \"\n        rm $id.sam\n\n        echo -e \"\\nSorting BAM file with samtools sort ... \" \n        samtools sort -@ {config[cores][crossMap]} -o $id.sort $id.bam\n\n        echo -e \"\\nDeleting no-longer-needed bam file ... \"\n        rm $id.bam\n\n        echo -e \"\\nIndexing sorted BAM file with samtools index for CONCOCT input table generation ... \" \n        samtools index $id.sort\n\n        echo -e \"\\nCutting up assembly contigs >= 20kbp into 10kbp chunks and creating bedfile ... \"\n        cut_up_fasta.py $fsampleID.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa\n\n        echo -e \"\\nGenerating CONCOCT individual/intermediate coverage table ... \"\n        concoct_coverage_table.py contigs_10K.bed *.sort > ${{fsampleID}}_${{id}}_individual.tsv\n\n        echo -e \"\\nCompressing CONCOCT coverage table ... \"\n        gzip ${{fsampleID}}_${{id}}_individual.tsv\n\n        echo -e \"\\nRunning jgi_summarize_bam_contig_depths script to generate contig abundance/depth file for maxbin2 input ... \"\n        jgi_summarize_bam_contig_depths --outputDepth ${{fsampleID}}_${{id}}_individual.depth $id.sort\n\n        echo -e \"\\nCompressing maxbin2/metabat2 depth file ... \"\n        gzip ${{fsampleID}}_${{id}}_individual.depth\n\n        echo -e \"\\nMoving relevant files to {output}\"\n        mv *.gz {output}\n\n        \"\"\"\n\nrule gatherCrossMap2: \n    input:\n        expand(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"crossMap\"]}/{{focal}}/{{IDs}}', focal = focal , IDs = IDs)\n    shell:\n        \"\"\"\n        echo idk\n        \"\"\"\n\nrule maxbinMultiSample:\n    input:\n        assembly=config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"assemblies\"]+\"/{IDs}/contigs.fasta.gz\",\n        reads=config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"qfiltered\"]\n    output:\n        directory(config[\"path\"][\"root\"]+\"/\"+config[\"folder\"][\"maxbin\"]+\"/{IDs}/{IDs}.maxbin-bins\")\n    benchmark:\n        config[\"path\"][\"root\"]+\"/\"+\"benchmarks/{IDs}.maxbin.benchmark.txt\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n\n        mkdir -p $(dirname $(dirname {output}))\n        mkdir -p $(dirname {output})\n        cp {input.assembly} $TMPDIR\n        cd $TMPDIR\n\n        focal=$(basename $(dirname {input.assembly}))\n        gunzip contigs.fasta.gz\n\n        echo \"Creating kallisto index for focal sample $focal ... \"\n        #kallisto index contigs.fasta -i $focal.kaix\n        cp /home/zorrilla/workspace/straya/test/*.kaix .\n        echo \"Done creating kallisto index!\"\n\n        echo \"Begin cross mapping samples ... \"\n        for folder in {input.reads}/*;do\n            var=$(basename $folder)\n            echo \"Mapping sample $var to focal sample $focal using kallisto quant ... \"\n            kallisto quant --threads {config[cores][kallisto]} --plaintext -i $focal.kaix -o . $folder/*_1.fastq.gz $folder/*_2.fastq.gz;\n            #tail -n +2 abundance.tsv > $(basename $folder)_abundance.tsv\n            #rm abundance.tsv\n            echo \"Done mapping sample $var to focal sample!\"\n        done\n        echo \"Done cross mapping all samples! \"\n\n        find . -name \"*_abundance.tsv\" > abund_list.txt\n\n        echo \"Begin running maxbin2 algorithm ... \"\n        run_MaxBin.pl -contig contigs.fasta -out $focal -abund_list abund_list.txt -thread {config[cores][maxbin]}\n        echo \"Done running maxbin2!\"\n\n        rm contigs.fasta\n        mkdir $(basename {output})\n        mv *.fasta $(basename {output})\n        mv $(basename {output}) $(dirname {output})\n        \"\"\" \n\nrule mOTUs2classifyGenomes:\n    input:\n        bins = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"reassembled\"]}/{{IDs}}/reassembled_bins',\n        script = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"scripts\"]}/classify-genomes'\n    output:\n        #directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"classification\"]}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.classify-genomes.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        mkdir -p {output}\n        cd $SCRATCHDIR\n        cp -r {input.script}/* {input.bins}/* .\n\n        echo \"Begin classifying bins ... \"\n        for bin in *.fa; do\n            echo -e \"\\nClassifying $bin ... \"\n            $PWD/classify-genomes $bin -t {config[cores][classify]} -o $(echo $bin|sed 's/.fa/.taxonomy/')\n            cp *.taxonomy {output}\n            rm *.taxonomy\n            rm $bin \n        done\n        echo \"Done classifying bins. \"\n        \"\"\"\n\nrule taxonomyVis:\n    input: \n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"classification\"]}'\n    output: \n        text = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/classification.stats',\n        plot = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/taxonomyVis.pdf'\n    message:\n        \"\"\"\n        mOTUs2 taxonomy visualization.\n        Generate bar plot with most common taxa (n>15) and density plots with mapping statistics.\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        cd {input}\n\n        echo -e \"\\nBegin reading classification result files ... \\n\"\n        for folder in */;do \n\n            for file in $folder*.taxonomy;do\n\n                # Define sample ID to append to start of each bin name in summary file\n                sample=$(echo $folder|sed 's|/||')\n\n                # Define bin name with sample ID, shorten metaWRAP naming scheme (orig/permissive/strict)\n                fasta=$(echo $file | sed 's|^.*/||' | sed 's/.taxonomy//g' | sed 's/orig/o/g' | sed 's/permissive/p/g' | sed 's/strict/s/g' | sed \"s/^/$sample./g\");\n\n                # Extract NCBI ID \n                NCBI=$(less $file | grep NCBI | cut -d ' ' -f4);\n\n                # Extract consensus taxonomy\n                tax=$(less $file | grep tax | sed 's/Consensus taxonomy: //g');\n\n                # Extract consensus motus\n                motu=$(less $file | grep mOTUs | sed 's/Consensus mOTUs: //g');\n\n                # Extract number of detected genes\n                detect=$(less $file | grep detected | sed 's/Number of detected genes: //g');\n\n                # Extract percentage of agreeing genes\n                percent=$(less $file | grep agreeing | sed 's/Percentage of agreeing genes: //g' | sed 's/%//g');\n\n                # Extract number of mapped genes\n                map=$(less $file | grep mapped | sed 's/Number of mapped genes: //g');\n                \n                # Extract COG IDs, need to use set +e;...;set -e to avoid erroring out when reading .taxonomy result file for bin with no taxonomic annotation\n                set +e\n                cog=$(less $file | grep COG | cut -d$'\\t' -f1 | tr '\\n' ',' | sed 's/,$//g');\n                set -e\n                \n                # Display and store extracted results\n                echo -e \"$fasta \\t $NCBI \\t $tax \\t $motu \\t $detect \\t $map \\t $percent \\t $cog\"\n                echo -e \"$fasta \\t $NCBI \\t $tax \\t $motu \\t $detect \\t $map \\t $percent \\t $cog\" >> classification.stats;\n            \n            done;\n        \n        done\n\n        echo -e \"\\nDone generating classification.stats summary file, moving to stats/ directory and running taxonomyVis.R script ... \"\n        mv classification.stats {config[path][root]}/{config[folder][stats]}\n        cd {config[path][root]}/{config[folder][stats]}\n\n        Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][taxonomyVis]}\n        rm Rplots.pdf # Delete redundant pdf file\n        echo \"Done. \"\n        \"\"\"\n\n\nrule parseTaxAb:\n    input:\n        taxonomy = rules.taxonomyVis.output.text ,\n        abundance = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"abundance\"]}'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/MAG.table')\n    message:\n        \"\"\"\n        Parses an abundance table with MAG taxonomy for rows and samples for columns.\n        Note: parseTaxAb should only be run after the classifyGenomes, taxonomyVis, and abundance rules.\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u\n        cd {input.abundance}\n\n        for folder in */;do\n\n            # Define sample ID\n            sample=$(echo $folder|sed 's|/||g')\n            \n            # Same as in taxonomyVis rule, modify bin names by adding sample ID and shortening metaWRAP naming scheme (orig/permissive/strict)\n            paste $sample/$sample.abund | sed 's/orig/o/g' | sed 's/permissive/p/g' | sed 's/strict/s/g' | sed \"s/^/$sample./g\" >> abundance.stats\n       \n        done\n\n        mv abundance.stats {config[path][root]}/{config[folder][stats]}\n        cd {config[path][root]}/{config[folder][stats]}\n\n        \"\"\"\n\nrule prepareRoary:\n    input:\n        taxonomy = rules.GTDBtkVis.output.text,\n        binning = rules.binningVis.output.text,\n        script = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"scripts\"]}/{config[\"scripts\"][\"prepRoary\"]}'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"pangenome\"]}/speciesBinIDs')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/prepareRoary.benchmark.txt'\n    message:\n        \"\"\"\n        This rule matches the results from classifyGenomes->taxonomyVis with the completeness & contamination\n        CheckM results from the metaWRAP reassembly->binningVis results, identifies speceies represented by \n        at least 10 high quality MAGs (completeness >= 90 & contamination <= 10), and outputs text files \n        with bin IDs for each such species. Also organizes the prokka output folders based on taxonomy.\n        Note: Do not run this before finishing all prokka jobs!\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u\n        cd $(dirname {input.taxonomy})\n\n        echo -e \"\\nCreating speciesBinIDs folder containing.txt files with binIDs for each species that is represented by at least 10 high quality MAGs (completeness >= 90 & contamination <= 10) ... \"\n        Rscript {input.script}\n\n        nSpecies=$(ls $(basename {output})|wc -l)\n        nSpeciesTot=$(cat $(basename {output})/*|wc -l)\n        nMAGsTot=$(paste {input.binning}|wc -l)\n        echo -e \"\\nIdentified $nSpecies species represented by at least 10 high quality MAGs, totaling $nSpeciesTot MAGs out of $nMAGsTot total MAGs generated ... \"\n\n        echo -e \"\\nMoving speciesBinIDs folder to pangenome directory: $(dirname {output})\"\n        mv $(basename {output}) $(dirname {output})\n\n        echo -e \"\\nOrganizing prokka folder according to taxonomy ... \"\n        echo -e \"\\nGFF files of identified species with at least 10 HQ MAGs will be copied to prokka/organzied/speciesSubfolder for roary input ... \"\n        cd $(dirname {output})\n        mkdir -p prokka/organized\n\n        for species in speciesBinIDs/*.txt;do\n\n            speciesID=$(echo $(basename $species)|sed 's/.txt//g');\n            echo -e \"\\nCreating folder and organizing prokka output for species $speciesID ... \"\n            mkdir -p prokka/organized/$speciesID\n\n            while read line;do\n                \n                binID=$(echo $line|sed 's/.bin/_bin/g')\n                echo \"Copying GFF prokka output of bin $binID\"\n                cp prokka/unorganized/$binID/*.gff prokka/organized/$speciesID/\n\n            done< $species\n        done\n\n        echo -e \"\\nDone\"\n        \"\"\" \n\n\nrule prepareRoaryMOTUS2:\n    input:\n        taxonomy = rules.taxonomyVis.output.text,\n        binning = rules.binningVis.output.text,\n        script = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"scripts\"]}/{config[\"scripts\"][\"prepRoary\"]}'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"pangenome\"]}/speciesBinIDs')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/prepareRoary.benchmark.txt'\n    message:\n        \"\"\"\n        This rule matches the results from classifyGenomes->taxonomyVis with the completeness & contamination\n        CheckM results from the metaWRAP reassembly->binningVis results, identifies speceies represented by \n        at least 10 high quality MAGs (completeness >= 90 & contamination <= 10), and outputs text files \n        with bin IDs for each such species. Also organizes the prokka output folders based on taxonomy.\n        Note: Do not run this before finishing all prokka jobs!\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u\n        cd $(dirname {input.taxonomy})\n\n        echo -e \"\\nCreating speciesBinIDs folder containing.txt files with binIDs for each species that is represented by at least 10 high quality MAGs (completeness >= 90 & contamination <= 10) ... \"\n        Rscript {input.script}\n\n        nSpecies=$(ls $(basename {output})|wc -l)\n        nSpeciesTot=$(cat $(basename {output})/*|wc -l)\n        nMAGsTot=$(paste {input.binning}|wc -l)\n        echo -e \"\\nIdentified $nSpecies species represented by at least 10 high quality MAGs, totaling $nSpeciesTot MAGs out of $nMAGsTot total MAGs generated ... \"\n\n        echo -e \"\\nMoving speciesBinIDs folder to pangenome directory: $(dirname {output})\"\n        mv $(basename {output}) $(dirname {output})\n\n        echo -e \"\\nOrganizing prokka folder according to taxonomy ... \"\n        echo -e \"\\nGFF files of identified species with at least 10 HQ MAGs will be copied to prokka/organzied/speciesSubfolder for roary input ... \"\n        cd $(dirname {output})\n        mkdir -p prokka/organized\n\n        for species in speciesBinIDs/*.txt;do\n\n            speciesID=$(echo $(basename $species)|sed 's/.txt//g');\n            echo -e \"\\nCreating folder and organizing prokka output for species $speciesID ... \"\n            mkdir -p prokka/organized/$speciesID\n\n            while read line;do\n                \n                binID=$(echo $line|sed 's/.bin/_bin/g')\n                echo \"Copying GFF prokka output of bin $binID\"\n                cp prokka/unorganized/$binID/*.gff prokka/organized/$speciesID/\n\n            done< $species\n        done\n\n        echo -e \"\\nDone\"\n        \"\"\" \n\nrule roaryTop10:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"pangenome\"]}/prokka/organized/'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"pangenome\"]}/roary/top10/')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/roaryTop10.roary.benchmark.txt'\n    message:\n        \"\"\"\n        Runs pangenome for ~692 MAGs belonging to 10 species:\n        Agathobacter rectale, Bacteroides uniformis, \n        Ruminococcus_E bromii_B, Gemmiger sp003476825, \n        Blautia_A wexlerae, Dialister invisus,\n        Anaerostipes hadrus, Fusicatenibacter saccharivorans,\n        Eubacterium_E hallii, and NA\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate prokkaroary;set -u\n        mkdir -p $(dirname {output})\n        cd $SCRATCHDIR\n\n        cp -r {input}/Agathobacter_rectale/* . \n        cp -r {input}/Bacteroides_uniformis/* . \n        cp -r {input}/Ruminococcus_E_bromii_B/* . \n        cp -r {input}/Gemmiger_sp003476825/* . \n        cp -r {input}/Blautia_A_wexlerae/* .\n        cp -r {input}/Dialister_invisus/* . \n        cp -r {input}/Anaerostipes_hadrus/* . \n        cp -r {input}/Fusicatenibacter_saccharivorans/* . \n        cp -r {input}/Eubacterium_E_hallii/* . \n        cp -r {input}/NA/* .\n                \n        roary -s -p {config[cores][roary]} -i {config[params][roaryI]} -cd {config[params][roaryCD]} -f yes_al -e -v *.gff\n        cd yes_al\n        create_pan_genome_plots.R \n        cd ..\n        mkdir -p {output}\n\n        mv yes_al/* {output}\n        \"\"\"\n\nrule phylophlan:\n    input:\n        f'/home/zorrilla/workspace/european/dna_bins'\n    output:\n        directory(f'/scratch/zorrilla/phlan/out')\n    benchmark:\n        f'/scratch/zorrilla/phlan/logs/bench.txt'\n    shell:\n        \"\"\"\n        cd $SCRATCHDIR\n        cp -r {input} . \n        cp $(dirname {output})/*.cfg .\n        mkdir -p logs\n\n        phylophlan -i dna_bins \\\n                    -d phylophlan \\\n                    -f 02_tol.cfg \\\n                    --genome_extension fa \\\n                    --diversity low \\\n                    --fast \\\n                    -o out \\\n                    --nproc 128 \\\n                    --verbose 2>&1 | tee logs/phylophlan.logs\n\n        cp -r out $(dirname {output})\n        \"\"\"\n\nrule phylophlanPlant:\n    input:\n        f'/home/zorrilla/workspace/china_soil/dna_bins'\n    output:\n        directory(f'/home/zorrilla/workspace/china_soil/phlan/')\n    benchmark:\n        f'/scratch/zorrilla/phlan/logs/benchPlant.txt'\n    shell:\n        \"\"\"\n        cd $SCRATCHDIR\n        cp -r {input} . \n        cp /scratch/zorrilla/phlan/*.cfg .\n        mkdir -p logs\n\n        phylophlan -i dna_bins \\\n                    -d phylophlan \\\n                    -f 02_tol.cfg \\\n                    --genome_extension fa \\\n                    --diversity low \\\n                    --fast \\\n                    -o $(basename {output}) \\\n                    --nproc 128 \\\n                    --verbose 2>&1 | tee logs/phylophlan.logs\n\n        cp -r $(basename {output}) $(dirname {output})\n        \"\"\"\n        \nrule phylophlanMeta:\n    input:\n        f'/home/zorrilla/workspace/european/dna_bins'\n    output:\n        directory(f'/home/zorrilla/workspace/european/phlan/dist')\n    benchmark:\n        f'/scratch/zorrilla/phlan/logs/bench.txt'\n    shell:\n        \"\"\"\n        cd {input}\n        cd ../\n\n        phylophlan_metagenomic -i $(basename {input}) -o $(basename {output})_dist --nproc 2 --only_input\n    \n        mv $(basename {output})_dist $(basename {output})\n        mv -r $(basename {output}) $(dirname {output})\n        \"\"\"\n\n\nrule phylophlanMetaAll:\n    input:\n        lab=f'/home/zorrilla/workspace/korem/dna_bins',\n        gut=f'/home/zorrilla/workspace/european/dna_bins' ,\n        plant=f'/home/zorrilla/workspace/china_soil/dna_bins' ,\n        soil=f'/home/zorrilla/workspace/straya/dna_bins' ,\n        ocean=f'/scratch/zorrilla/dna_bins' \n    output:\n        directory(f'/home/zorrilla/workspace/european/phlan/all')\n    benchmark:\n        f'/scratch/zorrilla/phlan/logs/allMetaBench.txt'\n    shell:\n        \"\"\"\n        mkdir -p {output}\n        cd $SCRATCHDIR\n\n        mkdir allMAGs\n        cp {input.lab}/* allMAGs\n        cp {input.gut}/* allMAGs\n        cp {input.plant}/* allMAGs\n        cp {input.soil}/* allMAGs\n        cp {input.ocean}/* allMAGs\n\n        phylophlan_metagenomic -i allMAGs -o all --nproc 4 --only_input\n        mv all_distmat.tsv $(dirname {output})\n        \"\"\"\n\nrule drawTree:\n    input:\n        f'/home/zorrilla/workspace/china_soil/phlan'\n    shell:\n        \"\"\"\n        cd {input}\n        graphlan.py dna_bins.tre.iqtree tree.out\n        \"\"\"\n\nrule makePCA:\n    input:\n        f'/home/zorrilla/workspace/european/phlan'\n    shell:\n        \"\"\"\n        cd $SCRATCHDIR\n        echo -e \"\\nCopying files to scratch dir: $SCRATCHDIR\"\n        cp {input}/*.tsv {input}/*.ids {input}/*.R .\n\n        echo -e \"\\nRunning nmds.R script ... \"\n        Rscript nmds.R\n        rm *.tsv *.ids *.R\n\n        mkdir -p nmds\n        mv *.pdf nmds\n        mv nmds {input}\n        \"\"\"\n\nrule drep:\n    input:\n        f'{config[\"path\"][\"root\"]}/dna_bins'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/drep_drep')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/drep_drep.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate drep;set -u\n        cp -r {input} $SCRATCHDIR\n        cd $SCRATCHDIR\n\n        dRep dereplicate drep_drep -g $(basename {input})/*.fa -p 48 -comp 50 -con 10\n        mv drep_drep $(dirname {input})\n        \"\"\"\n\nrule drepComp:\n    input:\n        f'{config[\"path\"][\"root\"]}/dna_bins'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/drep_comp')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/drep_comp.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate drep;set -u\n        cp -r {input} $SCRATCHDIR\n        cd $SCRATCHDIR\n\n        dRep compare drep_comp -g $(basename {input})/*.fa -p 48\n        mv drep_comp $(dirname {input})\n        \"\"\"\n"
  },
  {
    "path": "workflow/rules/Snakefile_single_end.smk.py",
    "content": "configfile: \"config.yaml\"\n\nimport os\nimport glob\n\ndef get_ids_from_path_pattern(path_pattern):\n    ids = sorted([os.path.basename(os.path.splitext(val)[0])\n                  for val in (glob.glob(path_pattern))])\n    return ids\n\n# Make sure that final_bins/ folder contains all bins in single folder for binIDs\n# wildcard to work. Use extractProteinBins rule or perform manually.\nbinIDs = get_ids_from_path_pattern('final_bins/*.faa')\nIDs = get_ids_from_path_pattern('assemblies/*')\n\nDATA_READS = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"data\"]}/{{IDs}}/{{IDs}}.fastq.gz'\n\n# Inserting space here to avoid having to change the hardcoded line 22 edit in the metabagpipes parser to expand wildcards\n\nrule all:\n    input:\n        expand(f'{config[\"path\"][\"root\"]}/GTDBtk/{{IDs}}', IDs=IDs)\n    message:\n        \"\"\"\n        WARNING: Be very careful when adding/removing any lines above this message.\n        The metaBAGpipes.sh parser is presently hardcoded to edit line 22 of this Snakefile to expand target rules accordingly,\n        therefore adding/removing any lines before this message will likely result in parser malfunction.\n        \"\"\"\n    shell:\n        \"\"\"\n        echo {input}\n        \"\"\"\n\n\nrule createFolders:\n    input:\n        config[\"path\"][\"root\"]\n    message:\n        \"\"\"\n        Very simple rule to check that the metaBAGpipes.sh parser, Snakefile, and config.yaml file are set up correctly. \n        Generates folders from config.yaml config file, not strictly necessary to run this rule.\n        \"\"\"\n    shell:\n        \"\"\"\n        cd {input}\n        echo -e \"Setting up result folders in the following work directory: $(echo {input}) \\n\"\n\n        # Generate folders.txt by extracting folder names from config.yaml file\n        paste config.yaml |cut -d':' -f2|tail -n +4|head -n 18|sed '/^$/d' > folders.txt # NOTE: hardcoded number (18) for folder names, increase number if new folders are introduced.\n        \n        while read line;do \n            echo \"Creating $line folder ... \"\n            mkdir -p $line;\n        done < folders.txt\n        \n        echo -e \"\\nDone creating folders. \\n\"\n\n        rm folders.txt\n        \"\"\"\n\n\nrule downloadToy:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"scripts\"]}/{config[\"scripts\"][\"toy\"]}'\n    message:\n        \"\"\"\n        Downloads toy dataset into config.yaml data folder and organizes into sample-specific sub-folders.\n        Requires download_toydata.txt to be present in scripts folder.\n        Modify this rule to download a real dataset by replacing the links in the download_toydata.txt file with links to files from your dataset of intertest.\n        \"\"\"\n    shell:\n        \"\"\"\n        cd {config[path][root]}/{config[folder][data]}\n\n        # Download each link in download_toydata.txt\n        echo -e \"\\nBegin downloading toy dataset ... \\n\"\n        while read line;do \n            wget $line;\n        done < {input}\n        echo -e \"\\nDone donwloading dataset.\\n\"\n        \n        # Rename downloaded files, this is only necessary for toy dataset (will cause error if used for real dataset)\n        echo -ne \"\\nRenaming downloaded files ... \"\n        for file in *;do \n            mv $file ./$(echo $file|sed 's/?download=1//g');\n        done\n        echo -e \" done. \\n\"\n\n        # Organize data into sample specific sub-folders\n\n        echo -ne \"\\nGenerating list of unique sample IDs ... \"\n        for file in *.gz; do \n            echo $file; \n        done | sed 's/_.*$//g' | sed 's/.fastq.gz//g' | uniq > ID_samples.txt\n        echo -e \" done.\\n $(less ID_samples.txt|wc -l) samples identified.\\n\"\n\n        echo -ne \"\\nOrganizing downloaded files into sample specific sub-folders ... \"\n        while read line; do \n            mkdir -p $line; \n            mv $line*.gz $line; \n        done < ID_samples.txt\n        echo -e \" done. \\n\"\n        \n        rm ID_samples.txt\n        \"\"\"\n\n\nrule organizeData:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"data\"]}'\n    message:\n        \"\"\"\n        Sorts paired end raw reads into sample specific sub folders within the dataset folder specified in the config.yaml file.\n        Assumes all samples are present in abovementioned dataset folder.\n        \n        Note: This rule is meant to be run on real datasets. \n        Do not run for toy dataset, as downloadToy rule above sorts the downloaded data already.\n        \"\"\"\n    shell:\n        \"\"\"\n        cd {input}\n    \n        echo -ne \"\\nGenerating list of unique sample IDs ... \"\n\n        # Create list of unique sample IDs\n        for file in *.gz; do \n            echo $file; \n        done | sed 's/_.*$//g' | sed 's/.fastq.gz//g' | uniq > ID_samples.txt\n\n        echo -e \" done.\\n $(less ID_samples.txt|wc -l) samples identified.\\n\"\n\n        # Create folder and move corresponding files for each sample\n\n        echo -ne \"\\nOrganizing dataset into sample specific sub-folders ... \"\n        while read line; do \n            mkdir -p $line; \n            mv $line*.gz $line; \n        done < ID_samples.txt\n        echo -e \" done. \\n\"\n        \n        rm ID_samples.txt\n        \"\"\"\n\n\nrule qfilter: \n    input:\n        READS = DATA_READS\n    output:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"qfiltered\"]}/{{IDs}}/{{IDs}}.fastq.gz', \n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n\n        mkdir -p $(dirname $(dirname {output}))\n        mkdir -p $(dirname {output})\n\n        fastp --thread {config[cores][fastp]} \\\n            -i {input} \\\n            -o {output} \\\n            -j $(dirname {output})/$(echo $(basename $(dirname {output}))).json \\\n            -h $(dirname {output})/$(echo $(basename $(dirname {output}))).html\n\n        \"\"\"\n\n\nrule qfilterVis:\n    input: \n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"qfiltered\"]}'\n    output: \n        text = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/qfilter.stats',\n        plot = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/qfilterVis.pdf'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        mkdir -p $(dirname {output.text})\n        cd {input}\n\n        echo -e \"\\nGenerating quality filtering results file qfilter.stats: ... \"\n        for folder in */;do\n            for file in $folder*json;do\n                ID=$(echo $file|sed 's|/.*$||g')\n                readsBF=$(head -n 25 $file|grep total_reads|cut -d ':' -f2|sed 's/,//g'|head -n 1)\n                readsAF=$(head -n 25 $file|grep total_reads|cut -d ':' -f2|sed 's/,//g'|tail -n 1)\n                basesBF=$(head -n 25 $file|grep total_bases|cut -d ':' -f2|sed 's/,//g'|head -n 1)\n                basesAF=$(head -n 25 $file|grep total_bases|cut -d ':' -f2|sed 's/,//g'|tail -n 1)\n                q20BF=$(head -n 25 $file|grep q20_rate|cut -d ':' -f2|sed 's/,//g'|head -n 1)\n                q20AF=$(head -n 25 $file|grep q20_rate|cut -d ':' -f2|sed 's/,//g'|tail -n 1)\n                q30BF=$(head -n 25 $file|grep q30_rate|cut -d ':' -f2|sed 's/,//g'|head -n 1)\n                q30AF=$(head -n 25 $file|grep q30_rate|cut -d ':' -f2|sed 's/,//g'|tail -n 1)\n                percent=$(awk -v RBF=\"$readsBF\" -v RAF=\"$readsAF\" 'BEGIN{{print RAF/RBF}}' )\n                echo \"$ID $readsBF $readsAF $basesBF $basesAF $percent $q20BF $q20AF $q30BF $q30AF\" >> qfilter.stats\n                echo \"Sample $ID retained $percent * 100 % of reads ... \"\n            done\n        done\n\n        echo \"Done summarizing quality filtering results ... \\nMoving to /stats/ folder and running plotting script ... \"\n        mv qfilter.stats {config[path][root]}/{config[folder][stats]}\n        cd {config[path][root]}/{config[folder][stats]}\n\n        Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][qfilterVis]}\n        echo \"Done. \"\n        rm Rplots.pdf\n        \"\"\"\n\n\nrule megahit:\n    input:\n        rules.qfilter.output\n    output:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"assemblies\"]}/{{IDs}}/contigs.fasta.gz'\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.megahit.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        cd $TMPDIR\n\n        echo -n \"Copying qfiltered reads to $TMPDIR ... \"\n        cp {input} $TMPDIR\n        echo \"done. \"\n\n        echo -n \"Running megahit ... \"\n        megahit -t {config[cores][megahit]} \\\n            --verbose \\\n            -r $(basename {input}) \\\n            -o tmp;\n        echo \"done. \"\n\n        echo \"Renaming assembly ... \"\n        mv tmp/final.contigs.fa contigs.fasta\n        \n        echo \"Fixing contig header names: replacing spaces with hyphens ... \"\n        sed -i 's/ /-/g' contigs.fasta\n\n        echo \"Zipping and moving assembly ... \"\n        gzip contigs.fasta\n        mkdir -p $(dirname {output})\n        mv contigs.fasta.gz $(dirname {output})\n        echo \"Done. \"\n        \"\"\"\n\n\nrule assemblyVis:\n    input: \n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"assemblies\"]}'\n    output: \n        text = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/assembly.stats',\n        plot = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/assemblyVis.pdf',\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        mkdir -p $(dirname {output.text})\n        cd {input}\n    \n        echo -e \"\\nGenerating assembly results file assembly.stats: ... \"\n        for folder in */;do\n            for file in $folder*.gz;do\n                ID=$(echo $file|sed 's|/contigs.fasta.gz||g')\n                N=$(less $file|grep -c \">\");\n                L=$(less $file|grep \">\"|cut -d '-' -f4|sed 's/len=//'|awk '{{sum+=$1}}END{{print sum}}');\n                T=$(less $file|grep \">\"|cut -d '-' -f4|sed 's/len=//'|awk '$1>=1000{{c++}} END{{print c+0}}');\n                S=$(less $file|grep \">\"|cut -d '-' -f4|sed 's/len=//'|awk '$1>=1000'|awk '{{sum+=$1}}END{{print sum}}');\n                echo $ID $N $L $T $S>> assembly.stats;\n                echo -e \"Sample $ID has a total of $L bp across $N contigs, with $S bp present in $T contigs >= 1000 bp ... \"\n            done;\n        done\n\n        echo \"Done summarizing assembly results ... \\nMoving to /stats/ folder and running plotting script ... \"\n        mv assembly.stats {config[path][root]}/{config[folder][stats]}\n        cd {config[path][root]}/{config[folder][stats]}\n\n        Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][assemblyVis]}\n        echo \"Done. \"\n        rm Rplots.pdf\n        \"\"\"\n\n\nrule metabat:\n    input:\n        contigs = rules.megahit.output,\n        READS = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"qfiltered\"]}'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"metabat\"]}/{{IDs}}/{{IDs}}.metabat-bins')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.metabat.benchmark.txt'\n    message:\n        \"\"\"\n        Cross map all samples with bwa then use the output of this rule to create contig abundance/depth files \n        to be used for binning with metabat2 and maxbin2. After depth files are copied back to workspace and \n        metabat2 finishes we avoid the need to copy bam files back to workspace saving space as well as \n        reducing total nubmer of jobs to run.\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        cd $TMPDIR\n        cp {input.contigs} .\n        mkdir -p {output}\n\n        # Define the focal sample ID, fsample: \n        # The one sample that all other samples will be mapped against mapping sample msampleID in for loop\n        fsampleID=$(echo $(basename $(dirname {input.contigs})))\n        echo -e \"\\nFocal sample: $fsampleID ... \"\n\n        echo \"Renaming and unzipping assembly ... \"\n        mv $(basename {input.contigs}) $(echo $fsampleID|sed 's/$/.fa.gz/g')\n        gunzip $(echo $fsampleID|sed 's/$/.fa.gz/g')\n\n        echo -e \"\\nIndexing assembly ... \"\n        bwa index $fsampleID.fa\n        \n        for folder in {input.READS}/*;do \n\n                id=$(basename $folder)\n\n                echo -e \"\\nCopying sample $id to be mapped againts the focal sample $fsampleID ...\"\n                cp $folder/*.gz .\n                \n                # Maybe I should be piping the lines below to reduce I/O ?\n\n                echo -e \"\\nMapping sample to assembly ... \"\n                bwa mem -t {config[cores][metabat]} $fsampleID.fa *.fastq.gz > $id.sam\n                \n                echo -e \"\\nConverting SAM to BAM with samtools view ... \" \n                samtools view -@ {config[cores][metabat]} -Sb $id.sam > $id.bam\n\n                echo -e \"\\nSorting BAM file with samtools sort ... \" \n                samtools sort -@ {config[cores][metabat]} -o $id.sort $id.bam\n\n                echo -e \"\\nRunning jgi_summarize_bam_contig_depths script to generate contig abundance/depth file ... \"\n                jgi_summarize_bam_contig_depths --outputDepth $id.depth $id.sort\n\n                echo -e \"\\nCopying depth file to workspace\"\n                mv $id.depth {output}\n\n                echo -e \"\\nRemoving temporary files ... \"\n                rm *.fastq.gz *.sam *.bam\n\n        done\n        \n        nSamples=$(ls {input.READS}|wc -l)\n        echo -e \"\\nDone mapping focal sample $fsampleID agains $nSamples samples in dataset folder.\"\n\n        echo -e \"\\nRunning jgi_summarize_bam_contig_depths for all sorted bam files ... \"\n        jgi_summarize_bam_contig_depths --outputDepth $id.all.depth *.sort\n\n        echo -e \"\\nRunning metabat2 ... \"\n        metabat2 -i $fsampleID.fa -a $id.all.depth -o $fsampleID\n\n        mv *.fa $id.all.depth $(dirname {output})\n\n        \"\"\"\n\n\nrule maxbin:\n    input:\n        assembly = rules.megahit.output,\n        depth = rules.metabat.output\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"maxbin\"]}/{{IDs}}/{{IDs}}.maxbin-bins')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.maxbin.benchmark.txt'\n    message:\n        \"\"\"\n        Note that this rule uses of the output depth of metabat2 as an input to bin using maxbin2.\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        cp -r {input.assembly} {input.depth} $TMPDIR\n        mkdir -p $(dirname $(dirname {output}))\n        cd $TMPDIR\n\n        echo -e \"\\nUnzipping assembly ... \"\n        gunzip contigs.fasta.gz\n\n        echo -e \"\\nGenerating list of depth files based on metabat2 output ... \"\n        find $(basename {input.depth}) -name \"*.depth\" > abund.list\n        \n        echo -e \"\\nRunning maxbin2 ... \"\n        run_MaxBin.pl -contig contigs.fasta -out $(basename $(dirname {output})) -abund_list abund.list\n        \n        rm contigs.fasta *.gz\n\n        mkdir $(basename {output})\n        mkdir -p $(dirname {output})\n\n        mv *.fasta $(basename {output})\n        mv $(basename {output}) *.summary *.abundance $(dirname {output})\n        \"\"\"\n\n\nrule concoct:\n    input:\n        contigs = rules.megahit.output,\n        reads = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"qfiltered\"]}'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"concoct\"]}/{{IDs}}/{{IDs}}.concoct-bins')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.concoct.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        mkdir -p $(dirname $(dirname {output}))\n\n        fsampleID=$(echo $(basename $(dirname {input.contigs})))\n        echo -e \"\\nCopying focal sample assembly $fsampleID to TMPDIR ... \"\n\n        cp {input.contigs} $TMPDIR\n        cd $TMPDIR\n\n        echo \"Unzipping assembly ... \"\n        gunzip $(basename {input.contigs})\n\n        echo -e \"Done. \\nCutting up contigs to 10kbp chunks (default), do not use this for mapping!\"\n        cut_up_fasta.py -c {config[params][cutfasta]} -o 0 -m contigs.fasta -b assembly_c10k.bed > assembly_c10k.fa\n        \n        echo -e \"\\nIndexing assembly of original contigs for mapping (not 10kbp chunks assembly file) ... \"\n        bwa index contigs.fasta\n\n        echo -e \"Done. \\nPreparing to map focal sample against other samples ... \"\n        for folder in {input.reads}/*;do \n\n                id=$(basename $folder)\n                echo -e \"\\nCopying sample $id to be mapped againts the focal sample $fsampleID ...\"\n                cp $folder/*.gz .\n                \n                # Maybe I should be piping the lines below to reduce I/O ?\n\n                echo -e \"\\nMapping sample to assembly ... \"\n                bwa mem -t {config[cores][concoct]} contigs.fasta *.fastq.gz > $id.sam\n                \n                echo -e \"\\nConverting SAM to BAM with samtools view ... \" \n                samtools view -@ {config[cores][concoct]} -Sb $id.sam > $id.bam\n\n                echo -e \"\\nSorting BAM file with samtools sort ... \" \n                samtools sort -@ {config[cores][concoct]} -o $id.sort $id.bam\n\n                echo -e \"\\nIndexing sorted BAM file with samtools index ... \" \n                samtools index $id.sort\n\n                echo -e \"\\nRemoving temporary files ... \"\n                rm *.fastq.gz *.sam *.bam\n\n        done\n\n        echo -e \"\\nSummarizing sorted and indexed BAM files with concoct_coverage_table.py ... \" \n        concoct_coverage_table.py assembly_c10k.bed *.sort > coverage_table.tsv\n\n        echo -e \"\\nRunning CONCOCT ... \"\n        concoct --coverage_file coverage_table.tsv --composition_file assembly_c10k.fa \\\n            -b $(basename $(dirname {output})) \\\n            -t {config[cores][concoct]} \\\n            -c {config[params][concoct]}\n            \n        echo -e \"\\nMerging clustering results into original contigs with merge_cutup_clustering.py ... \"\n        merge_cutup_clustering.py $(basename $(dirname {output}))_clustering_gt1000.csv > $(basename $(dirname {output}))_clustering_merged.csv\n        \n        echo -e \"\\nExtracting bins ... \"\n        mkdir -p $(basename {output})\n        extract_fasta_bins.py contigs.fasta $(basename $(dirname {output}))_clustering_merged.csv --output_path $(basename {output})\n        \n        mkdir -p $(dirname {output})\n        mv $(basename {output}) *.txt *.csv $(dirname {output})\n        \"\"\"\n\n\nrule binRefine:\n    input:\n        concoct = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"concoct\"]}/{{IDs}}/{{IDs}}.concoct-bins',\n        metabat = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"metabat\"]}/{{IDs}}/{{IDs}}.metabat-bins',\n        maxbin = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"maxbin\"]}/{{IDs}}/{{IDs}}.maxbin-bins'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"refined\"]}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.binRefine.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metawrap]};set -u;\n        mkdir -p $(dirname {output})\n        mkdir -p {output}\n        cd $TMPDIR\n\n        echo \"Copying bins from CONCOCT, metabat2, and maxbin2 to tmpdir ... \"\n        cp -r {input.concoct} {input.metabat} {input.maxbin} $TMPDIR\n\n        echo \"Renaming bin folders to avoid errors with metaWRAP ... \"\n        mv $(basename {input.concoct}) $(echo $(basename {input.concoct})|sed 's/-bins//g')\n        mv $(basename {input.metabat}) $(echo $(basename {input.metabat})|sed 's/-bins//g')\n        mv $(basename {input.maxbin}) $(echo $(basename {input.maxbin})|sed 's/-bins//g')\n        \n        echo \"Running metaWRAP bin refinement module ... \"\n        metaWRAP bin_refinement -o . \\\n            -A $(echo $(basename {input.concoct})|sed 's/-bins//g') \\\n            -B $(echo $(basename {input.metabat})|sed 's/-bins//g') \\\n            -C $(echo $(basename {input.maxbin})|sed 's/-bins//g') \\\n            -t {config[cores][refine]} \\\n            -m {config[params][refineMem]} \\\n            -c {config[params][refineComp]} \\\n            -x {config[params][refineCont]}\n \n        rm -r $(echo $(basename {input.concoct})|sed 's/-bins//g') $(echo $(basename {input.metabat})|sed 's/-bins//g') $(echo $(basename {input.maxbin})|sed 's/-bins//g') work_files\n        mv * {output}\n        \"\"\"\n\n\nrule binReassemble:\n    input:\n        READS = rules.qfilter.output,\n        refinedBins = rules.binRefine.output\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"reassembled\"]}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.binReassemble.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metawrap]};set -u;\n        mkdir -p $(dirname {output})\n        cp -r {input.refinedBins}/metawrap_*_bins {input.READS} $TMPDIR\n        cd $TMPDIR\n        \n        echo \"Running metaWRAP bin reassembly ... \"\n        metaWRAP reassemble_bins -o $(basename {output}) \\\n            -b metawrap_*_bins \\\n            -1 $(basename {input.READS}) \\\n            -2 $(basename {input.READS}) \\\n            -t {config[cores][reassemble]} \\\n            -m {config[params][reassembleMem]} \\\n            -c {config[params][reassembleComp]} \\\n            -x {config[params][reassembleCont]}\n        \n        rm -r metawrap_*_bins\n        rm -r $(basename {output})/work_files\n        rm *.fastq.gz \n        mv * $(dirname {output})\n        \"\"\"\n\n\nrule binningVis:\n    input: \n        f'{config[\"path\"][\"root\"]}'\n    output: \n        text = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/reassembled_bins.stats',\n        plot = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/binningVis.pdf'\n    message:\n        \"\"\"\n        Generate bar plot with number of bins and density plot of bin contigs, \n        total length, completeness, and contamination across different tools.\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        \n        # READ CONCOCT BINS\n\n        echo \"Generating concoct_bins.stats file containing bin ID, number of contigs, and length ... \"\n        cd {input}/{config[folder][concoct]}\n        for folder in */;do \n            var=$(echo $folder|sed 's|/||g'); # Define sample name\n            for bin in $folder*concoct-bins/*.fa;do \n                name=$(echo $bin | sed \"s|^.*/|$var.bin.|g\" | sed 's/.fa//g'); # Define bin name\n                N=$(less $bin | grep -c \">\");\n                L=$(less $bin |grep \">\"|cut -d '-' -f4|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}')\n                echo \"Reading bin $bin ... Contigs: $N , Length: $L \"\n                echo $name $N $L >> concoct_bins.stats;\n            done;\n        done\n        mv *.stats {input}/{config[folder][reassembled]}\n        echo \"Done reading CONCOCT bins, moving concoct_bins.stats file to $(echo {input}/{config[folder][reassembled]}) .\"\n\n        # READ METABAT2 BINS\n\n        echo \"Generating metabat_bins.stats file containing bin ID, number of contigs, and length ... \"\n        cd {input}/{config[folder][metabat]}\n        for folder in */;do \n            var=$(echo $folder | sed 's|/||'); # Define sample name\n            for bin in $folder*metabat-bins/*.fa;do \n                name=$(echo $bin|sed 's/.fa//g'|sed 's|^.*/||g'|sed \"s/^/$var./g\"); # Define bin name\n                N=$(less $bin | grep -c \">\");\n                L=$(less $bin |grep \">\"|cut -d '-' -f4|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}')\n                echo \"Reading bin $bin ... Contigs: $N , Length: $L \"\n                echo $name $N $L >> metabat_bins.stats;\n            done;\n        done\n        mv *.stats {input}/{config[folder][reassembled]}\n        echo \"Done reading metabat2 bins, moving metabat_bins.stats file to $(echo {input}/{config[folder][reassembled]}) .\"\n\n        # READ MAXBIN2 BINS\n\n        echo \"Generating maxbin_bins.stats file containing bin ID, number of contigs, and length ... \"\n        cd {input}/{config[folder][maxbin]}\n        for folder in */;do\n            for bin in $folder*maxbin-bins/*.fasta;do \n                name=$(echo $bin | sed 's/.fasta//g' | sed 's|^.*/||g');  # Define bin name\n                N=$(less $bin | grep -c \">\");\n                L=$(less $bin |grep \">\"|cut -d '-' -f4|sed 's/len=//g'|awk '{{sum+=$1}}END{{print sum}}')\n                echo \"Reading bin $bin ... Contigs: $N , Length: $L \"\n                echo $name $N $L >> maxbin_bins.stats;\n            done;\n        done\n        mv *.stats {input}/{config[folder][reassembled]}\n        echo \"Done reading maxbin2 bins, moving maxbin_bins.stats file to $(echo {input}/{config[folder][reassembled]}) .\"\n\n        # READ METAWRAP REFINED BINS\n\n        echo \"Generating refined_bins.stats file containing bin ID, number of contigs, and length ... \"\n        cd {input}/{config[folder][refined]}\n        for folder in */;do \n            samp=$(echo $folder | sed 's|/||'); # Define sample name \n            for bin in $folder*metawrap_*_bins/*.fa;do \n                name=$(echo $bin | sed 's/.fa//g'|sed 's|^.*/||g'|sed \"s/^/$samp./g\"); # Define bin name\n                N=$(less $bin | grep -c \">\");\n                L=$(less $bin |grep \">\"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}')\n                echo \"Reading bin $bin ... Contigs: $N , Length: $L \"\n                echo $name $N $L >> refined_bins.stats;\n            done;\n        done\n        echo \"Done reading metawrap refined bins ... \"\n\n        # READ METAWRAP REFINED CHECKM OUTPUT        \n        \n        echo \"Generating CheckM summary files across samples: concoct.checkm, metabat.checkm, maxbin.checkm, and refined.checkm ... \"\n        for folder in */;do \n            var=$(echo $folder|sed 's|/||g'); # Define sample name\n            paste $folder*concoct.stats|tail -n +2 | sed \"s/^/$var.bin./g\" >> concoct.checkm\n            paste $folder*metabat.stats|tail -n +2 | sed \"s/^/$var./g\" >> metabat.checkm\n            paste $folder*maxbin.stats|tail -n +2 >> maxbin.checkm\n            paste $folder*metawrap_*_bins.stats|tail -n +2|sed \"s/^/$var./g\" >> refined.checkm\n        done \n        echo \"Done reading metawrap refined output, moving refined_bins.stats, concoct.checkm, metabat.checkm, maxbin.checkm, and refined.checkm files to $(echo {input}/{config[folder][reassembled]}) .\"\n        mv *.stats *.checkm {input}/{config[folder][reassembled]}\n\n        # READ METAWRAP REASSEMBLED BINS\n\n        echo \"Generating reassembled_bins.stats file containing bin ID, number of contigs, and length ... \"\n        cd {input}/{config[folder][reassembled]}\n        for folder in */;do \n            samp=$(echo $folder | sed 's|/||'); # Define sample name \n            for bin in $folder*reassembled_bins/*.fa;do \n                name=$(echo $bin | sed 's/.fa//g' | sed 's|^.*/||g' | sed \"s/^/$samp./g\"); # Define bin name\n                N=$(less $bin | grep -c \">\");\n\n                # Need to check if bins are original (megahit-assembled) or strict/permissive (metaspades-assembled)\n                if [[ $name == *.strict ]] || [[ $name == *.permissive ]];then\n                    L=$(less $bin |grep \">\"|cut -d '_' -f4|awk '{{sum+=$1}}END{{print sum}}')\n                else\n                    L=$(less $bin |grep \">\"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}')\n                fi\n\n                echo \"Reading bin $bin ... Contigs: $N , Length: $L \"\n                echo $name $N $L >> reassembled_bins.stats;\n            done;\n        done\n        echo \"Done reading metawrap reassembled bins ... \"\n\n        # READ METAWRAP REFINED CHECKM OUTPUT  \n\n        echo \"Generating CheckM summary file reassembled.checkm across samples for reassembled bins ... \"\n        for folder in */;do \n            var=$(echo $folder|sed 's|/||g');\n            paste $folder*reassembled_bins.stats|tail -n +2|sed \"s/^/$var./g\";\n        done >> reassembled.checkm\n        echo \"Done generating all statistics files for binning results ... running plotting script ... \"\n\n        # RUN PLOTTING R SCRIPT\n\n        mv *.stats *.checkm {config[path][root]}/{config[folder][stats]}\n        cd {config[path][root]}/{config[folder][stats]}\n\n        Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][binningVis]}\n        rm Rplots.pdf # Delete redundant pdf file\n        echo \"Done. \"\n        \"\"\"\n\nrule GTDBtk:\n    input: \n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"reassembled\"]}/{{IDs}}/reassembled_bins'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/GTDBtk/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.GTDBtk.benchmark.txt'\n    message:\n        \"\"\"\n        The folder dna_bins_organized assumes subfolders containing dna bins for refined and reassembled bins.\n        Note: slightly modified inputs/outputs for european dataset.\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate gtdbtk-tmp;set -u;\n        export GTDBTK_DATA_PATH=/g/scb2/patil/zorrilla/conda/envs/gtdbtk/share/gtdbtk-1.1.0/db/\n\n        cd $SCRATCHDIR\n        cp -r {input} .\n\n        gtdbtk classify_wf --genome_dir $(basename {input}) --out_dir GTDBtk -x fa --cpus {config[cores][gtdbtk]}\n        mkdir -p {output}\n        mv GTDBtk/* {output}\n\n        \"\"\"\n\nrule classifyGenomes:\n    input:\n        bins = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"reassembled\"]}/{{IDs}}/reassembled_bins',\n        script = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"scripts\"]}/classify-genomes'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"classification\"]}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.classify-genomes.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        mkdir -p {output}\n        cd $TMPDIR\n        cp -r {input.script}/* {input.bins}/* .\n\n        echo \"Begin classifying bins ... \"\n        for bin in *.fa; do\n            echo -e \"\\nClassifying $bin ... \"\n            $PWD/classify-genomes $bin -t {config[cores][classify]} -o $(echo $bin|sed 's/.fa/.taxonomy/')\n            cp *.taxonomy {output}\n            rm *.taxonomy\n            rm $bin \n        done\n        echo \"Done classifying bins. \"\n        \"\"\"\n\n\nrule taxonomyVis:\n    input: \n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"classification\"]}'\n    output: \n        text = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/classification.stats',\n        plot = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/taxonomyVis.pdf'\n    message:\n        \"\"\"\n        Generate bar plot with most common taxa (n>15) and density plots with mapping statistics.\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        cd {input}\n\n        echo -e \"\\nBegin reading classification result files ... \\n\"\n        for folder in */;do \n\n            for file in $folder*.taxonomy;do\n\n                # Define sample ID to append to start of each bin name in summary file\n                sample=$(echo $folder|sed 's|/||')\n\n                # Define bin name with sample ID, shorten metaWRAP naming scheme (orig/permissive/strict)\n                fasta=$(echo $file | sed 's|^.*/||' | sed 's/.taxonomy//g' | sed 's/orig/o/g' | sed 's/permissive/p/g' | sed 's/strict/s/g' | sed \"s/^/$sample./g\");\n\n                # Extract NCBI ID \n                NCBI=$(less $file | grep NCBI | cut -d ' ' -f4);\n\n                # Extract consensus taxonomy\n                tax=$(less $file | grep tax | sed 's/Consensus taxonomy: //g');\n\n                # Extract consensus motus\n                motu=$(less $file | grep mOTUs | sed 's/Consensus mOTUs: //g');\n\n                # Extract number of detected genes\n                detect=$(less $file | grep detected | sed 's/Number of detected genes: //g');\n\n                # Extract percentage of agreeing genes\n                percent=$(less $file | grep agreeing | sed 's/Percentage of agreeing genes: //g' | sed 's/%//g');\n\n                # Extract number of mapped genes\n                map=$(less $file | grep mapped | sed 's/Number of mapped genes: //g');\n                \n                # Extract COG IDs, need to use set +e;...;set -e to avoid erroring out when reading .taxonomy result file for bin with no taxonomic annotation\n                set +e\n                cog=$(less $file | grep COG | cut -d$'\\t' -f1 | tr '\\n' ',' | sed 's/,$//g');\n                set -e\n                \n                # Display and store extracted results\n                echo -e \"$fasta \\t $NCBI \\t $tax \\t $motu \\t $detect \\t $map \\t $percent \\t $cog\"\n                echo -e \"$fasta \\t $NCBI \\t $tax \\t $motu \\t $detect \\t $map \\t $percent \\t $cog\" >> classification.stats;\n            \n            done;\n        \n        done\n\n        echo -e \"\\nDone generating classification.stats summary file, moving to stats/ directory and running taxonomyVis.R script ... \"\n        mv classification.stats {config[path][root]}/{config[folder][stats]}\n        cd {config[path][root]}/{config[folder][stats]}\n\n        Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][taxonomyVis]}\n        rm Rplots.pdf # Delete redundant pdf file\n        echo \"Done. \"\n        \"\"\"\n\n\nrule abundance:\n    input:\n        bins = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"reassembled\"]}/{{IDs}}/reassembled_bins',\n        READS = rules.qfilter.output\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"abundance\"]}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.abundance.benchmark.txt'\n    message:\n        \"\"\"\n        Calculate bin abundance fraction using the following:\n\n        binAbundanceFraction = ( X / Y / Z) * 1000000\n\n        X = # of reads mapped to bin_i from sample_k\n        Y = length of bin_i (bp)\n        Z = # of reads mapped to all bins in sample_k\n\n        Note: 1000000 scaling factor converts length in bp to Mbp\n\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        mkdir -p {output}\n        cd $TMPDIR\n\n        echo -e \"\\nCopying quality filtered single end reads and generated MAGs to TMPDIR ... \"\n        cp {input.READS} {input.bins}/* .\n\n        echo -e \"\\nConcatenating all bins into one FASTA file ... \"\n        cat *.fa > $(basename {output}).fa\n\n        echo -e \"\\nCreating bwa index for concatenated FASTA file ... \"\n        bwa index $(basename {output}).fa\n\n        echo -e \"\\nMapping quality filtered single end reads to concatenated FASTA file with bwa mem ... \"\n        bwa mem -t {config[cores][abundance]} $(basename {output}).fa \\\n            $(basename {input.READS}) > $(basename {output}).sam\n\n        echo -e \"\\nConverting SAM to BAM with samtools view ... \"\n        samtools view -@ {config[cores][abundance]} -Sb $(basename {output}).sam > $(basename {output}).bam\n\n        echo -e \"\\nSorting BAM file with samtools sort ... \"\n        samtools sort -@ {config[cores][abundance]} -o $(basename {output}).sort.bam $(basename {output}).bam\n\n        echo -e \"\\nExtracting stats from sorted BAM file with samtools flagstat ... \"\n        samtools flagstat $(basename {output}).sort.bam > map.stats\n\n        echo -e \"\\nCopying sample_map.stats file to root/abundance/sample for bin concatenation and deleting temporary FASTA file ... \"\n        cp map.stats {output}/$(basename {output})_map.stats\n        rm $(basename {output}).fa\n        \n        echo -e \"\\nRepeat procedure for each bin ... \"\n        for bin in *.fa;do\n\n            echo -e \"\\nSetting up temporary sub-directory to map against bin $bin ... \"\n            mkdir -p $(echo \"$bin\"| sed \"s/.fa//\")\n            mv $bin $(echo \"$bin\"| sed \"s/.fa//\")\n            cd $(echo \"$bin\"| sed \"s/.fa//\")\n\n            echo -e \"\\nCreating bwa index for bin $bin ... \"\n            bwa index $bin\n\n            echo -e \"\\nMapping quality filtered single end reads to bin $bin with bwa mem ... \"\n            bwa mem -t {config[cores][abundance]} $bin ../$(basename {input.READS}) > $(echo \"$bin\"|sed \"s/.fa/.sam/\")\n\n            echo -e \"\\nConverting SAM to BAM with samtools view ... \"\n            samtools view -@ {config[cores][abundance]} -Sb $(echo \"$bin\"|sed \"s/.fa/.sam/\") > $(echo \"$bin\"|sed \"s/.fa/.bam/\")\n\n            echo -e \"\\nSorting BAM file with samtools sort ... \"\n            samtools sort -@ {config[cores][abundance]} -o $(echo \"$bin\"|sed \"s/.fa/.sort.bam/\") $(echo \"$bin\"|sed \"s/.fa/.bam/\")\n\n            echo -e \"\\nExtracting stats from sorted BAM file with samtools flagstat ... \"\n            samtools flagstat $(echo \"$bin\"|sed \"s/.fa/.sort.bam/\") > $(echo \"$bin\"|sed \"s/.fa/.map/\")\n\n            echo -e \"\\nAppending bin length to bin.map stats file ... \"\n            echo -n \"Bin Length = \" >> $(echo \"$bin\"|sed \"s/.fa/.map/\")\n\n            # Need to check if bins are original (megahit-assembled) or strict/permissive (metaspades-assembled)\n            if [[ $bin == *.strict.fa ]] || [[ $bin == *.permissive.fa ]];then\n                less $bin |grep \">\"|cut -d '_' -f4|awk '{{sum+=$1}}END{{print sum}}' >> $(echo \"$bin\"|sed \"s/.fa/.map/\")\n            else\n                less $bin |grep \">\"|cut -d '-' -f4|sed 's/len_//g'|awk '{{sum+=$1}}END{{print sum}}' >> $(echo \"$bin\"|sed \"s/.fa/.map/\")\n            fi\n\n            paste $(echo \"$bin\"|sed \"s/.fa/.map/\")\n\n            echo -e \"\\nCalculating abundance for bin $bin ... \"\n            echo -n \"$bin\"|sed \"s/.fa//\" >> $(echo \"$bin\"|sed \"s/.fa/.abund/\")\n            echo -n $'\\t' >> $(echo \"$bin\"|sed \"s/.fa/.abund/\")\n\n            X=$(less $(echo \"$bin\"|sed \"s/.fa/.map/\")|grep \"mapped (\"|awk -F' ' '{{print $1}}')\n            Y=$(less $(echo \"$bin\"|sed \"s/.fa/.map/\")|tail -n 1|awk -F' ' '{{print $4}}')\n            Z=$(less \"../map.stats\"|grep \"mapped (\"|awk -F' ' '{{print $1}}')\n            awk -v x=\"$X\" -v y=\"$Y\" -v z=\"$Z\" 'BEGIN{{print (x/y/z) * 1000000}}' >> $(echo \"$bin\"|sed \"s/.fa/.abund/\")\n            \n            paste $(echo \"$bin\"|sed \"s/.fa/.abund/\")\n            \n            echo -e \"\\nRemoving temporary files for bin $bin ... \"\n            rm $bin\n            cp $(echo \"$bin\"|sed \"s/.fa/.map/\") {output}\n            mv $(echo \"$bin\"|sed \"s/.fa/.abund/\") ../\n            cd ..\n            rm -r $(echo \"$bin\"| sed \"s/.fa//\")\n        done\n\n        echo -e \"\\nDone processing all bins, summarizing results into sample.abund file ... \"\n        cat *.abund > $(basename {output}).abund\n\n        echo -ne \"\\nSumming calculated abundances to obtain normalization value ... \"\n        norm=$(less $(basename {output}).abund |awk '{{sum+=$2}}END{{print sum}}');\n        echo $norm\n\n        echo -e \"\\nGenerating column with abundances normalized between 0 and 1 ... \"\n        awk -v NORM=\"$norm\" '{{printf $1\"\\t\"$2\"\\t\"$2/NORM\"\\\\n\"}}' $(basename {output}).abund > abundance.txt\n\n        rm $(basename {output}).abund\n        mv abundance.txt $(basename {output}).abund\n\n        mv $(basename {output}).abund {output}\n        \"\"\"\n\nrule abundanceVis:\n    input:\n        abundance = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"abundance\"]}',\n        taxonomy = rules.taxonomyVis.output.text\n    output: \n        text = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/abundance.stats',\n        plot = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/abundanceVis.pdf'\n    message:\n        \"\"\"\n        Generate stacked bar plots showing composition of samples\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u\n        cd {input.abundance}\n\n        for folder in */;do\n\n            # Define sample ID\n            sample=$(echo $folder|sed 's|/||g')\n            \n            # Same as in taxonomyVis rule, modify bin names by adding sample ID and shortening metaWRAP naming scheme (orig/permissive/strict)\n            paste $sample/$sample.abund | sed 's/orig/o/g' | sed 's/permissive/p/g' | sed 's/strict/s/g' | sed \"s/^/$sample./g\" >> abundance.stats\n       \n        done\n\n        mv abundance.stats {config[path][root]}/{config[folder][stats]}\n        cd {config[path][root]}/{config[folder][stats]}\n        Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][abundanceVis]}\n\n        \"\"\"\n\n\nrule extractProteinBins:\n    message:\n        \"Extract ORF annotated protein fasta files for each bin from reassembly checkm files.\"\n    shell:\n        \"\"\"\n        cd {config[path][root]}\n        mkdir -p {config[folder][proteinBins]}\n\n        echo -e \"Begin moving and renaming ORF annotated protein fasta bins from reassembled_bins/ to final_bins/ ... \\n\"\n        for folder in reassembled_bins/*/;do \n            echo \"Moving bins from sample $(echo $(basename $folder)) ... \"\n            for bin in $folder*reassembled_bins.checkm/bins/*;do \n                var=$(echo $bin/genes.faa | sed 's|reassembled_bins/||g'|sed 's|/reassembled_bins.checkm/bins||'|sed 's|/genes||g'|sed 's|/|_|g'|sed 's/permissive/p/g'|sed 's/orig/o/g'|sed 's/strict/s/g');\n                cp $bin/*.faa {config[path][root]}/{config[folder][proteinBins]}/$var;\n            done;\n        done\n        \"\"\"\n\n\nrule carveme:\n    input:\n        bin = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"proteinBins\"]}/{{binIDs}}.faa',\n        media = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"scripts\"]}/{config[\"scripts\"][\"carveme\"]}'\n    output:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"GEMs\"]}/{{binIDs}}.xml'\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{binIDs}}.carveme.benchmark.txt'\n    message:\n        \"\"\"\n        Make sure that the input files are ORF annotated and preferably protein fasta.\n        If given raw fasta files, Carveme will run without errors but each contig will be treated as one gene.\n        \"\"\"\n    shell:\n        \"\"\"\n        echo \"Activating {config[envs][metabagpipes]} conda environment ... \"\n        set +u;source activate {config[envs][metabagpipes]};set -u\n        \n        mkdir -p $(dirname {output})\n        mkdir -p logs\n\n        cp {input.bin} {input.media} $TMPDIR\n        cd $TMPDIR\n        \n        echo \"Begin carving GEM ... \"\n        #carve -g {config[params][carveMedia]} \\\n        #    -v \\\n        #    --mediadb $(basename {input.media}) \\\n        #    --fbc2 \\\n        #    -o $(echo $(basename {input.bin}) | sed 's/.faa/.xml/g') $(basename {input.bin})\n\n        carve -v \\\n            --fbc2 \\\n            -o $(echo $(basename {input.bin}) | sed 's/.faa/.xml/g') $(basename {input.bin})\n        \n        echo \"Done carving GEM. \"\n        [ -f *.xml ] && mv *.xml $(dirname {output})\n        \"\"\"\n\n\nrule modelVis:\n    input: \n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"GEMs\"]}'\n    output: \n        text = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/GEMs.stats',\n        plot = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"stats\"]}/modelVis.pdf'\n    message:\n        \"\"\"\n        Generate bar plot with GEMs generated across samples and density plots showing number of \n        unique metabolites, reactions, and genes across GEMs.\n        \"\"\"\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u;\n        cd {input}\n\n        echo -e \"\\nBegin reading models ... \\n\"\n        for model in *.xml;do \n            id=$(echo $model|sed 's/.xml//g'); \n            mets=$(less $model| grep \"species id=\"|cut -d ' ' -f 8|sed 's/..$//g'|sort|uniq|wc -l);\n            rxns=$(less $model|grep -c 'reaction id=');\n            genes=$(less $model|grep 'fbc:geneProduct fbc:id='|grep -vic spontaneous);\n            echo \"Model: $id has $mets mets, $rxns reactions, and $genes genes ... \"\n            echo \"$id $mets $rxns $genes\" >> GEMs.stats;\n        done\n\n        echo -e \"\\nDone generating GEMs.stats summary file, moving to stats/ folder and running modelVis.R script ... \"\n        mv GEMs.stats {config[path][root]}/{config[folder][stats]}\n        cd {config[path][root]}/{config[folder][stats]}\n\n        Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][modelVis]}\n        rm Rplots.pdf # Delete redundant pdf file\n        echo \"Done. \"\n        \"\"\"\n\n\nrule organizeGEMs:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"refined\"]}'\n    message:\n        \"\"\"\n        Organizes GEMs into sample specific subfolders. \n        Necessary to run smetana per sample using the IDs wildcard.\n        \"\"\"\n    shell:\n        \"\"\"\n        cd {input}\n        for folder in */;do\n            echo -n \"Creating GEM subfolder for sample $folder ... \"\n            mkdir -p ../{config[folder][GEMs]}/$folder;\n            echo -n \"moving GEMs ... \"\n            mv ../{config[folder][GEMs]}/$(echo $folder|sed 's|/||')_*.xml ../{config[folder][GEMs]}/$folder;\n            echo \"done. \"\n        done\n        \"\"\"\n\nrule memote:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"GEMs\"]}/{{IDs}}'\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"memote\"]}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.memote.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u\n\n        mkdir -p $(dirname {output})\n        mkdir -p {output}\n        \n        cp {input}/*.xml $TMPDIR\n        cd $TMPDIR\n        \n        for model in *.xml;do\n            memote report snapshot --filename $(echo $model|sed 's/.xml/.html/') $model\n            memote run $model > $(echo $model|sed 's/.xml/-summary.txt/')\n            mv *.txt *.html {output}\n            rm $model\n        done\n        \"\"\"\n\nrule smetana:\n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"GEMs\"]}/{{IDs}}'\n    output:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"SMETANA\"]}/{{IDs}}_detailed.tsv'\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.smetana.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u\n        mkdir -p {config[path][root]}/{config[folder][SMETANA]}\n        cp {config[path][root]}/{config[folder][scripts]}/{config[scripts][carveme]} {input}/*.xml $TMPDIR\n        cd $TMPDIR\n        \n        smetana -o $(basename {input}) --flavor fbc2 \\\n            --mediadb media_db.tsv -m {config[params][smetanaMedia]} \\\n            --detailed \\\n            --solver {config[params][smetanaSolver]} -v *.xml\n        \n        mv *.tsv $(dirname {output})\n        \"\"\"\n\nrule motus2:\n    input: \n        rules.qfilter.output\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/test/motus2/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.motus2.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u\n        cp {input} $TMPDIR\n        cd $TMPDIR\n\n        motus profile -s $(basename {input}) -o $(basename {input}).motus2 -t 12\n        mkdir -p {output}\n        rm $(basename {input})\n        mv * {output}\n        \"\"\"\n\nrule grid:\n    input:\n        bins = f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"reassembled\"]}/{{IDs}}/reassembled_bins',\n        reads = rules.qfilter.output\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"GRiD\"]}/{{IDs}}')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/benchmarks/{{IDs}}.grid.benchmark.txt'\n    shell:\n        \"\"\"\n        set +u;source activate {config[envs][metabagpipes]};set -u\n\n        cp -r {input.bins} {input.reads} $TMPDIR\n        cd $TMPDIR\n\n        mkdir MAGdb out\n        update_database -d MAGdb -g $(basename {input.bins}) -p MAGdb\n        rm -r $(basename {input.bins})\n\n        grid multiplex -r . -e fastq.gz -d MAGdb -p -c 0.2 -o out -n {config[cores][grid]}\n\n        rm $(basename {input.reads})\n        mkdir {output}\n        mv out/* {output}\n        \"\"\"\n"
  },
  {
    "path": "workflow/rules/kallisto2concoctTable.smk",
    "content": "rule kallisto2concoctTable: \n    input:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"kallisto\"]}/{{focal}}/'\n    output: \n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"concoct\"]}/{{focal}}/cov/coverage_table.tsv'\n    message:\n        \"\"\"\n        This rule is necessary for the crossMapParallel implementation subworkflow.\n        It summarizes the individual concoct input tables for a given focal sample.\n        Note: silence output if not using parallel mapping approach\n        \"\"\"\n    shell:\n        \"\"\"\n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Create output folder\n        mkdir -p $(dirname {output})\n\n        # Compile individual mapping results into coverage table for given assembly\n        python {config[path][root]}/{config[folder][scripts]}/{config[scripts][kallisto2concoct]} \\\n            --samplenames <(for s in {input}/*; do echo $s|sed 's|^.*/||'; done) \\\n            $(find {input} -name \"*.gz\") > {output}\n    \n        \"\"\""
  },
  {
    "path": "workflow/rules/maxbin_single.smk",
    "content": "rule maxbin_single:\n    input:\n        assembly = rules.megahit.output,\n        R1 = rules.qfilter.output.R1,\n        R2 = rules.qfilter.output.R2\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"maxbin\"]}/{{IDs}}/{{IDs}}.maxbin-bins')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.maxbin.benchmark.txt'\n    shell:\n        \"\"\"\n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n\n        # Create output folder\n        mkdir -p $(dirname {output})\n\n        # Make job specific scratch dir\n        fsampleID=$(echo $(basename $(dirname {input.assembly})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][maxbin]}/${{fsampleID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][maxbin]}/${{fsampleID}}\n\n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][maxbin]}/${{fsampleID}}\n\n        # Copy files to tmp\n        cp -r {input.assembly} {input.R1} {input.R2} .\n\n        echo -e \"\\nUnzipping assembly ... \"\n        gunzip $(basename {input.assembly})\n        \n        echo -e \"\\nRunning maxbin2 ... \"\n        run_MaxBin.pl -contig $(echo $(basename {input.assembly})|sed 's/.gz//') \\\n            -out $(basename $(dirname {output})) \\\n            -reads $(basename {input.R1}) \\\n            -reads2 $(basename {input.R2}) \\\n            -thread {config[cores][maxbin]}\n\n        rm $(echo $(basename {input.assembly})|sed 's/.gz//')\n        \n        mkdir $(basename {output})\n        mv *.fasta *.summary *.abundance *.abund1 *.abund2 $(basename {output})\n        mv $(basename {output}) $(dirname {output})\n\n        \"\"\""
  },
  {
    "path": "workflow/rules/metabat_single.smk",
    "content": "rule metabat_single:\n    input:\n        assembly = rules.megahit.output,\n        R1 = rules.qfilter.output.R1,\n        R2 = rules.qfilter.output.R2\n    output:\n        directory(f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"metabat\"]}/{{IDs}}/{{IDs}}.metabat-bins')\n    benchmark:\n        f'{config[\"path\"][\"root\"]}/{config[\"folder\"][\"benchmarks\"]}/{{IDs}}.metabat.benchmark.txt'\n    message:\n        \"\"\"\n        Implementation of metabat2 where only coverage information from the focal sample is used\n        for binning. Use with the crossMapParallel subworkflow, where cross sample coverage information\n        is only used by CONCOCT.\n        \"\"\"\n    shell:\n        \"\"\"\n        # Activate metagem environment\n        set +u;source activate {config[envs][metagem]};set -u;\n        # Make job specific scratch dir\n        fsampleID=$(echo $(basename $(dirname {input.assembly})))\n        echo -e \"\\nCreating temporary directory {config[path][scratch]}/{config[folder][metabat]}/${{fsampleID}} ... \"\n        mkdir -p {config[path][scratch]}/{config[folder][metabat]}/${{fsampleID}}\n\n        # Move into scratch dir\n        cd {config[path][scratch]}/{config[folder][metabat]}/${{fsampleID}}\n\n        # Copy files\n        cp {input.assembly} {input.R1} {input.R2} .\n\n        echo -e \"\\nFocal sample: $fsampleID ... \"\n\n        echo \"Renaming and unzipping assembly ... \"\n        mv $(basename {input.assembly}) $(echo $fsampleID|sed 's/$/.fa.gz/g')\n        gunzip $(echo $fsampleID|sed 's/$/.fa.gz/g')\n\n        echo -e \"\\nIndexing assembly ... \"\n        bwa index $fsampleID.fa\n\n        id=$(basename {output})\n        echo -e \"\\nMapping reads from sample against assembly $fsampleID ...\"\n        bwa mem -t {config[cores][metabat]} $fsampleID.fa *.fastq.gz > $id.sam\n\n        echo -e \"\\nDeleting no-longer-needed fastq files ... \"\n        rm *.gz\n                \n        echo -e \"\\nConverting SAM to BAM with samtools view ... \" \n        samtools view -@ {config[cores][metabat]} -Sb $id.sam > $id.bam\n\n        echo -e \"\\nDeleting no-longer-needed sam file ... \"\n        rm $id.sam\n\n        echo -e \"\\nSorting BAM file with samtools sort ... \" \n        samtools sort -@ {config[cores][metabat]} -o $id.sort $id.bam\n\n        echo -e \"\\nDeleting no-longer-needed bam file ... \"\n        rm $id.bam\n\n        # Run metabat2\n        echo -e \"\\nRunning metabat2 ... \"\n        jgi_summarize_bam_contig_depths --outputDepth $id.depth.txt $id.sort\n\n        metabat2 -i $fsampleID.fa -a $id.depth.txt -s \\\n            {config[params][metabatMin]} \\\n            -v --seed {config[params][seed]} \\\n            -t 0 -m {config[params][minBin]} \\\n            -o $(basename $(dirname {output}))\n\n        rm $fsampleID.fa\n        rm $id.depth.txt\n\n        # Move files to output dir\n        mv *.fa {output}\n        \"\"\""
  },
  {
    "path": "workflow/scripts/assemblyVis.R",
    "content": "library(ggplot2)\nlibrary(gridExtra)\n\nassembly = read.delim(\"assembly.stats\",stringsAsFactors = FALSE,header = FALSE,sep = \" \")\ncolnames(assembly) = c(\"sample\",\"contigs\",\"length_total\")\n\nassembly$ave = assembly$length_total/assembly$contigs\n\naveplot = ggplot(data=assembly) +\n  geom_density(aes(x=ave,fill=\"All contigs\")) +\n  xlab(\"Average contig length\") +\n  ggtitle(\"Average contig length across samples\") +\n  theme(legend.title = element_blank())+ \n  theme(legend.position = \"none\")\n\ncontplot = ggplot(data=assembly) +\n  geom_density(aes(x=contigs,fill=\"Total contigs\")) +\n  ggtitle(\"Contigs across samples\") +\n  theme(legend.title = element_blank()) + \n  scale_x_log10() + \n  theme(legend.position = \"none\")\n\nscatplot = ggplot(data=assembly) +\n  geom_point(aes(x=length_total,y=contigs)) +\n  ggtitle(\"Total length vs number of contigs\")+\n  xlab(\"Total length\") +\n  ylab(\"Number of contigs\") +\n  expand_limits(x = 0, y = 0)\n\nbarplot = ggplot(data=assembly) +\n  geom_bar(aes(x=reorder(sample,-length_total),y=length_total,fill=\"Contigs\"),stat = \"identity\",color=\"black\",size=0.2) +\n  ggtitle(\"Total length across assemblies\") + \n  ylab(\"Length (bp)\") +\n  xlab(\"Sample ID\") +\n  #scale_y_log10() +\n  coord_flip() + \n  theme(legend.position = \"none\")\n\nassemblyplot=grid.arrange(barplot,arrangeGrob(scatplot,aveplot,contplot,nrow=3),ncol=2,nrow=1)\n\nggsave(\"assemblyVis.pdf\",plot= assemblyplot,device = \"pdf\",dpi = 300, width = 30, height = 40, units = \"cm\")\n"
  },
  {
    "path": "workflow/scripts/assemblyVis_alternative.R",
    "content": "library(ggplot2)\nlibrary(gridExtra)\n\nassembly = read.delim(\"500assembly.stats\",stringsAsFactors = FALSE,header = FALSE,sep = \" \")\ncolnames(assembly) = c(\"sample\",\"contigs\",\"length_total\",\"gt1000\",\"gt1000_total\")\n\nassembly$ave = assembly$length_total/assembly$contigs\nassembly$gt1000_ave = assembly$gt1000_total/assembly$gt1000\n\naveplot = ggplot(data=assembly) +\n  geom_density(aes(x=ave,fill=\"All contigs\")) +\n  geom_density(aes(x=gt1000_ave,fill=\"Contigs >= 1000bp\")) +\n  xlab(\"Average contig length\") +\n  ggtitle(\"Average contig length across samples\") +\n  theme(legend.title = element_blank())\n\ncontplot = ggplot(data=assembly) +\n  geom_density(aes(x=contigs,fill=\"Total contigs\")) +\n  geom_density(aes(x=gt1000,fill=\"Contigs >= 1000\")) +\n  ggtitle(\"Contigs across samples\") +\n  theme(legend.title = element_blank()) + \n  scale_x_log10()\n\ncontplot2 = ggplot(data=assembly) +\n  geom_point(aes(x=contigs,y=gt1000)) +\n  xlab(\"Total contigs\") + \n  ylab(\"Contigs >= 1000bp\") +\n  ggtitle(\"Total contigs vs >=1000bp across samples\") +\n  geom_abline(slope = 1,intercept=0) +\n  expand_limits(x = 0, y = 0)\n\nlenplot = ggplot(data=assembly) +\n  geom_density(aes(x=length_total,fill=\"Total length\")) +\n  geom_density(aes(x=gt1000_total,fill=\"Length >= 1000\")) +\n  xlab(\"Length\") +\n  ggtitle(\"Length across samples\") +\n  theme(legend.title = element_blank()) + \n  scale_x_log10()\n\nlenplot2= ggplot(data=assembly) +\n  geom_point(aes(x=length_total,y=gt1000_total)) +\n  ggtitle(\"Total length vs >= 1000bp across samples\")+\n  xlab(\"Total length\") +\n  ylab(\"Length of contigs >= 1000 bp\") +\n  geom_abline(slope = 1,intercept=0) +\n  expand_limits(x = 0, y = 0)\n\nfracplot = ggplot(data=assembly) +\n  geom_density(aes(100*gt1000/contigs,fill=\"Contigs\")) +\n  geom_density(aes(100*gt1000_total/length_total,fill=\"Length\")) +\n  ggtitle(\"% Information captured by contigs >= 1000 bp\") + \n  xlab(\"% Information\") +\n  theme(legend.title = element_blank())\n\nassemblyplot=grid.arrange(fracplot,contplot2,contplot,aveplot,lenplot2,lenplot,ncol=3,nrow=2)\n\nggsave(\"assemblyVis_old.pdf\",plot= assemblyplot,device = \"pdf\",dpi = 300, width = 40, height = 20, units = \"cm\")\n"
  },
  {
    "path": "workflow/scripts/binFilter.py",
    "content": "#!/usr/bin/env python\n\"\"\"\nBased on the checkm results, approves bins according to the leves of contamination and completeness.\nCopies approved bins to output directory.\n@author: alneberg\n\"\"\"\nfrom __future__ import print_function\nimport sys\nimport os\nimport argparse\nimport pandas as pd\nfrom shutil import copyfile\n\ndef main(args):\n    # Read in the checkm table\n    df = pd.read_table(args.checkm_stats, index_col=0)\n    # extract the ids for all rows that meet the requirements\n    filtered_df = df[(df['Completeness'] >= args.min_completeness) & (df['Contamination'] <= args.max_contamination)] \n    \n    approved_bins = list(filtered_df.index)\n\n    # copy the approved bins to the new output directory\n    for approved_bin_int in approved_bins:\n        approved_bin = str(approved_bin_int)\n        bin_source = os.path.join(args.bin_directory, approved_bin)\n        bin_source += '.' + args.extension\n        bin_destination = os.path.join(args.output_directory)\n        bin_destination += '/' + os.path.basename(bin_source)\n        \n        sys.stderr.write(\"Copying approved bin {} from {} to {}\\n\".format(approved_bin, bin_source, bin_destination))\n        copyfile(bin_source, bin_destination)\n\n    sys.stderr.write(\"\\nApproved {} bins\\n\\n\".format(len(approved_bins)))\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser(description=__doc__)\n    parser.add_argument(\"bin_directory\", help=(\"Input fasta files should be within directory.\"))\n    parser.add_argument(\"checkm_stats\", help=\"Checkm qa stats in tab_table format\")\n    parser.add_argument(\"output_directory\", help=\"Directory where to put approved bins\")\n    parser.add_argument(\"--min_completeness\", default=85, type=float, help=\"default=85\")\n    parser.add_argument(\"--max_contamination\", default=5, type=float, help=\"default=5\")\n    parser.add_argument(\"--extension\", default='fa')\n    args = parser.parse_args()\n\nmain(args)\n"
  },
  {
    "path": "workflow/scripts/binningVis.R",
    "content": "library(gridExtra)\nlibrary(dplyr)\nlibrary(ggplot2)\n\nconcoctCheckm = read.delim(\"concoct.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(concoctCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\",\"set\")\nconcoctBins= read.delim(\"concoct_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(concoctBins) = c(\"bin\",\"contigs\",\"length\")\nconcoct = left_join(concoctCheckm,concoctBins%>%select(-length),by=\"bin\") %>% filter(contamination<=10,completeness>=50) %>% distinct() %>% select(-set)\nconcoct$sample = gsub(\"\\\\..*$\",\"\",concoct$bin)\nconcoct$binner = \"CONCOCT\"\n\nmetabatCheckm = read.delim(\"metabat.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(metabatCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\",\"set\")\nmetabatBins= read.delim(\"metabat_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(metabatBins) = c(\"bin\",\"contigs\",\"length\")\nmetabat = left_join(metabatCheckm,metabatBins%>%select(-length),by=\"bin\") %>% filter(contamination<=10,completeness>=50)%>% distinct() %>% select(-set)\nmetabat$sample = gsub(\"\\\\..*$\",\"\",metabat$bin)\nmetabat$binner = \"MetaBAT2\"\n\nmaxbinCheckm = read.delim(\"maxbin.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(maxbinCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\",\"set\")\nmaxbinBins= read.delim(\"maxbin_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(maxbinBins) = c(\"bin\",\"contigs\",\"length\")\nmaxbin = left_join(maxbinCheckm,maxbinBins%>%select(-length),by=\"bin\") %>% filter(contamination<=10,completeness>=50)%>% distinct() %>% select(-set)\nmaxbin$contigs = as.numeric(maxbin$contigs)\nmaxbin$sample = gsub(\"\\\\..*$\",\"\",maxbin$bin)\nmaxbin$binner = \"MaxBin2\"\n\nrefinedCheckm = read.delim(\"refined.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(refinedCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\",\"set\")\nrefinedBins= read.delim(\"refined_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(refinedBins) = c(\"bin\",\"contigs\",\"length\")\nrefined = left_join(refinedCheckm,refinedBins%>%select(-length),by=\"bin\") %>% filter(contamination<=10,completeness>=50)%>% distinct() %>% select(-set)\nrefined$sample = gsub(\"\\\\..*$\",\"\",refined$bin)\nrefined$binner = \"metaWRAP_refined\"\n\nreassembledCheckm = read.delim(\"reassembled.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(reassembledCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\")\nreassembledBins= read.delim(\"reassembled_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(reassembledBins) = c(\"bin\",\"contigs\",\"length\")\nreassembled = left_join(reassembledCheckm,reassembledBins%>%select(-length),by=\"bin\") %>% filter(contamination<=10,completeness>=50)%>% distinct()\nreassembled$sample = gsub(\"\\\\..*$\",\"\",reassembled$bin)\nreassembled$binner = \"metaWRAP_reassembled\"\n\n#bins <- as.data.frame(matrix(0,nrow = 5,ncol=2))\n#colnames(bins) = c(\"variable\",\"value\")\n#bins$variable = c(\"maxbin2\",\"refined\",\"CONCOCT\",\"metabat2\",\"reassembled\")\n#bins$value = c(as.numeric(dim(maxbin)[1]),as.numeric(dim(refined)[1]),as.numeric(dim(concoct)[1]),as.numeric(dim(metabat)[1]),as.numeric(dim(reassembled)[1]))\n\nrbind(concoct,metabat,maxbin,refined,reassembled) %>% group_by(binner,sample) %>% summarize(count=n()) -> bins\n\nbinplot = ggplot(data = bins,aes(x=reorder(binner,-count),y=count,fill= binner)) +\n  geom_bar(stat = \"identity\",color=\"black\") +\n  ylab(\"Generated bins\") + \n  xlab(\"Binning tool\") +\n  theme(legend.title = element_blank()) +\n  ggtitle(\"Number of MQ bins\") + \n  coord_flip() + \n  theme(legend.position = \"none\")+ \n  facet_wrap(~sample,ncol=1)\n\ncompplot = ggplot() + \n  geom_density(data=concoct,aes(completeness,color=\"CONCOCT\")) +\n  geom_density(data=maxbin,aes(completeness,color=\"maxbin2\")) +\n  geom_density(data=metabat,aes(completeness,color=\"metabat2\")) + \n  geom_density(data=refined,aes(completeness,color=\"refined\")) + \n  geom_density(data=reassembled,aes(completeness,color=\"reassembled\")) +\n  ggtitle(\"Completeness\") +\n  theme(axis.text.y=element_blank()) + \n  theme(legend.position = \"none\")\n\ncontplot = ggplot() + \n  geom_density(data=concoct,aes(contamination,color=\"CONCOCT\")) +\n  geom_density(data=maxbin,aes(contamination,color=\"maxbin2\")) +\n  geom_density(data=metabat,aes(contamination,color=\"metabat2\")) + \n  geom_density(data=refined,aes(contamination,color=\"refined\")) + \n  geom_density(data=reassembled,aes(contamination,color=\"reassembled\")) +\n  ggtitle(\"Contamination\") +\n  theme(axis.text.y=element_blank()) + \n  theme(legend.position = \"none\")\n\nlengthplot = ggplot() + \n  geom_density(data=concoct,aes(size,color=\"CONCOCT\")) +\n  geom_density(data=maxbin,aes(size,color=\"maxbin2\")) +\n  geom_density(data=metabat,aes(size,color=\"metabat2\")) + \n  geom_density(data=refined,aes(size,color=\"refined\")) + \n  geom_density(data=reassembled,aes(size,color=\"reassembled\")) +\n  ggtitle(\"BP Length\") + \n  theme(legend.position = \"none\") +\n  theme(axis.text.y=element_blank())\n\ncontigplot = ggplot() + \n  geom_density(data=concoct,aes(contigs,color=\"CONCOCT\")) +\n  geom_density(data=maxbin,aes(contigs,color=\"maxbin2\")) +\n  geom_density(data=metabat,aes(contigs,color=\"metabat2\")) + \n  geom_density(data=refined,aes(contigs,color=\"refined\")) + \n  geom_density(data=reassembled,aes(contigs,color=\"reassembled\")) +\n  ggtitle(\"Number of contigs\") + \n  theme(legend.position = \"none\") +\n  theme(axis.text.y=element_blank())\n\ndensities1= grid.arrange(compplot,lengthplot,nrow=2,ncol=1)\ndensities2=grid.arrange(contplot,contigplot,nrow=2,ncol=1)\n\nplot=grid.arrange(binplot,densities1,densities2,nrow=1,ncol=3)\n\nggsave(\"binningVis.pdf\",plot=plot, height = 8, width = 12)"
  },
  {
    "path": "workflow/scripts/binningVis_perSample.R",
    "content": "library(gridExtra)\nlibrary(dplyr)\nlibrary(ggplot2)\n\nconcoctCheckm = read.delim(\"concoct.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(concoctCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\",\"set\")\nconcoctBins= read.delim(\"concoct_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(concoctBins) = c(\"bin\",\"contigs\",\"length\")\nconcoct = left_join(concoctCheckm,concoctBins%>%select(-length),by=\"bin\") %>% filter(contamination<=10,completeness>=50) %>% distinct() %>% select(-set)\nconcoct$sample = gsub(\"\\\\..*$\",\"\",concoct$bin)\nconcoct$binner = \"CONCOCT\"\n\nmetabatCheckm = read.delim(\"metabat.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(metabatCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\",\"set\")\nmetabatBins= read.delim(\"metabat_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(metabatBins) = c(\"bin\",\"contigs\",\"length\")\nmetabat = left_join(metabatCheckm,metabatBins%>%select(-length),by=\"bin\") %>% filter(contamination<=10,completeness>=50)%>% distinct() %>% select(-set)\nmetabat$sample = gsub(\"\\\\..*$\",\"\",metabat$bin)\nmetabat$binner = \"MetaBAT2\"\n\nmaxbinCheckm = read.delim(\"maxbin.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(maxbinCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\",\"set\")\nmaxbinBins= read.delim(\"maxbin_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(maxbinBins) = c(\"bin\",\"contigs\",\"length\")\nmaxbin = left_join(maxbinCheckm,maxbinBins%>%select(-length),by=\"bin\") %>% filter(contamination<=10,completeness>=50)%>% distinct() %>% select(-set)\nmaxbin$contigs = as.numeric(maxbin$contigs)\nmaxbin$sample = gsub(\"\\\\..*$\",\"\",maxbin$bin)\nmaxbin$binner = \"MaxBin2\"\n\nrefinedCheckm = read.delim(\"refined.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(refinedCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\",\"set\")\nrefinedBins= read.delim(\"refined_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(refinedBins) = c(\"bin\",\"contigs\",\"length\")\nrefined = left_join(refinedCheckm,refinedBins%>%select(-length),by=\"bin\") %>% filter(contamination<=10,completeness>=50)%>% distinct() %>% select(-set)\nrefined$sample = gsub(\"\\\\..*$\",\"\",refined$bin)\nrefined$binner = \"metaWRAP_refined\"\n\nreassembledCheckm = read.delim(\"reassembled.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(reassembledCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\")\nreassembledBins= read.delim(\"reassembled_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(reassembledBins) = c(\"bin\",\"contigs\",\"length\")\nreassembled = left_join(reassembledCheckm,reassembledBins%>%select(-length),by=\"bin\") %>% filter(contamination<=10,completeness>=50)%>% distinct()\nreassembled$sample = gsub(\"\\\\..*$\",\"\",reassembled$bin)\nreassembled$binner = \"metaWRAP_reassembled\"\n\n#bins <- as.data.frame(matrix(0,nrow = 5,ncol=2))\n#colnames(bins) = c(\"variable\",\"value\")\n#bins$variable = c(\"maxbin2\",\"refined\",\"CONCOCT\",\"metabat2\",\"reassembled\")\n#bins$value = c(as.numeric(dim(maxbin)[1]),as.numeric(dim(refined)[1]),as.numeric(dim(concoct)[1]),as.numeric(dim(metabat)[1]),as.numeric(dim(reassembled)[1]))\n\nrbind(concoct,metabat,maxbin,refined,reassembled) %>% group_by(binner,sample) %>% summarize(count=n()) -> bins\n\nbinplot = ggplot(data = bins,aes(x=reorder(binner,-count),y=count,fill= binner)) +\n  geom_bar(stat = \"identity\",color=\"black\") +\n  ylab(\"Generated bins\") + \n  xlab(\"Binning tool\") +\n  theme(legend.title = element_blank()) +\n  ggtitle(\"Number of MQ bins\") + \n  coord_flip() + \n  theme(legend.position = \"none\")+ \n  facet_wrap(~sample,ncol=1)\n\ncompplot = ggplot() + \n  geom_density(data=concoct,aes(completeness,color=\"CONCOCT\")) +\n  geom_density(data=maxbin,aes(completeness,color=\"maxbin2\")) +\n  geom_density(data=metabat,aes(completeness,color=\"metabat2\")) + \n  geom_density(data=refined,aes(completeness,color=\"refined\")) + \n  geom_density(data=reassembled,aes(completeness,color=\"reassembled\")) +\n  ggtitle(\"Completeness\") +\n  theme(axis.text.y=element_blank()) + \n  theme(legend.position = \"none\")\n\ncontplot = ggplot() + \n  geom_density(data=concoct,aes(contamination,color=\"CONCOCT\")) +\n  geom_density(data=maxbin,aes(contamination,color=\"maxbin2\")) +\n  geom_density(data=metabat,aes(contamination,color=\"metabat2\")) + \n  geom_density(data=refined,aes(contamination,color=\"refined\")) + \n  geom_density(data=reassembled,aes(contamination,color=\"reassembled\")) +\n  ggtitle(\"Contamination\") +\n  theme(axis.text.y=element_blank()) + \n  theme(legend.position = \"none\")\n\nlengthplot = ggplot() + \n  geom_density(data=concoct,aes(size,color=\"CONCOCT\")) +\n  geom_density(data=maxbin,aes(size,color=\"maxbin2\")) +\n  geom_density(data=metabat,aes(size,color=\"metabat2\")) + \n  geom_density(data=refined,aes(size,color=\"refined\")) + \n  geom_density(data=reassembled,aes(size,color=\"reassembled\")) +\n  ggtitle(\"BP Length\") + \n  theme(legend.position = \"none\") +\n  theme(axis.text.y=element_blank())\n\ncontigplot = ggplot() + \n  geom_density(data=concoct,aes(contigs,color=\"CONCOCT\")) +\n  geom_density(data=maxbin,aes(contigs,color=\"maxbin2\")) +\n  geom_density(data=metabat,aes(contigs,color=\"metabat2\")) + \n  geom_density(data=refined,aes(contigs,color=\"refined\")) + \n  geom_density(data=reassembled,aes(contigs,color=\"reassembled\")) +\n  ggtitle(\"Number of contigs\") + \n  theme(legend.position = \"none\") +\n  theme(axis.text.y=element_blank())\n\ndensities1= grid.arrange(compplot,lengthplot,nrow=2,ncol=1)\ndensities2=grid.arrange(contplot,contigplot,nrow=2,ncol=1)\n\nplot=grid.arrange(binplot,densities1,densities2,nrow=1,ncol=3)\n\nggsave(\"binningVis.pdf\",plot=plot, height = 6, width = 12)"
  },
  {
    "path": "workflow/scripts/compositionVis.R",
    "content": "library(tidyverse)\nlibrary(tidytext)\n\ntaxonomy=read.delim(\"GTDBTk.stats\",header=TRUE) %>% \n  select(user_genome,classification) %>% \n  separate(.,classification,into = c(\"kingdom\",\"phylum\",\"class\",\"order\",\"family\",\"genus\",\"species\"),sep = \";\")\n\nabundance=read.delim(\"abundance.stats\",header=FALSE)\ncolnames(abundance)=c(\"user_genome\",\"absolute_ab\",\"rel_ab\")\n\ntaxab = left_join(taxonomy,abundance,by=\"user_genome\")\ntaxab$sample = gsub(\"\\\\..*$\",\"\",taxab$user_genome)\ntaxab$species = gsub(\"s__$\",\"Undefined sp.\",taxab$species)\ntaxab$species = gsub(\"s__\",\"\",taxab$species)\n\nggplot(taxab%>% filter(species!=\"Undefined sp.\")) +\n  geom_bar(aes(x=reorder_within(species,-rel_ab,sample),y=rel_ab*100),stat=\"identity\") + \n  scale_x_reordered() +\n  facet_wrap(~sample,scales = \"free\") + \n  ylab(\"Relative abundance (%)\") + \n  xlab(\"Species\") +\n  coord_flip()\n\nggsave(\"compositionVis.pdf\",width = 12,height=8)"
  },
  {
    "path": "workflow/scripts/compositionVis_old.R",
    "content": "library(gridExtra)\nlibrary(dplyr)\nlibrary(ggplot2)\n\nclassification = read.delim(\"classification.stats\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(classification)=c(\"fasta\",\"NCBI\",\"taxonomy\",\"motu\",\"detect\",\"map\",\"percent\",\"cog\") # add descriptive column names based on classify-genomes output\nclassification$percent=as.numeric(classification$percent) # force percentage to be numeric, will default to chr if any bin in dataset cannot be assigned a taxonomy\nclassification$taxonomy=substr(classification$taxonomy,1,50) # subset taxonomy name, some can get very long, making plot labels obscene\nclassification$percent[is.na(classification$percent)]  <- 0 # replace NAs with zero's (occurs when bin taxonomy cannot be assigned due to no marker genes)\nclassification$fasta=gsub(\" $\",\"\",classification$fasta) # make sure there are not trailing white spaces\nclassification$sample=gsub(\"\\\\..*$\",\"\",classification$fasta) # add sample info from bin name\n\nabundance = read.delim(\"abundance.stats\",stringsAsFactors = FALSE,header=FALSE)\ncolnames(abundance) = c(\"fasta\",\"ab\",\"abNorm\")\n\nclassification = left_join(classification,abundance,by=\"fasta\")\n\nplot = ggplot(classification, aes(x = sample, y = abNorm, fill = taxonomy)) + \n  geom_bar(stat = \"identity\") + \n  ggtitle(\"Taxonomic composition of samples based on MAGs\") +\n  xlab(\"Sample ID\") + \n  ylab(\"Normalized relative abundance\") +\n  coord_flip()\n\nggsave(\"abundanceVis.pdf\",plot=plot, height = 8, width = 12)"
  },
  {
    "path": "workflow/scripts/download_toydata.txt",
    "content": "https://zenodo.org/record/3534949/files/sample1_1.fastq.gz?download=1\nhttps://zenodo.org/record/3534949/files/sample1_2.fastq.gz?download=1\nhttps://zenodo.org/record/3534949/files/sample2_1.fastq.gz?download=1\nhttps://zenodo.org/record/3534949/files/sample2_2.fastq.gz?download=1\nhttps://zenodo.org/record/3534949/files/sample3_1.fastq.gz?download=1\nhttps://zenodo.org/record/3534949/files/sample3_2.fastq.gz?download=1\n"
  },
  {
    "path": "workflow/scripts/env_setup.sh",
    "content": "#!/bin/bash\n\n  echo '\n=================================================================================================================================\nDeveloped by: Francisco Zorrilla, Kiran R. Patil, and Aleksej Zelezniak___________________________________________________________\nPublication: doi.org/10.1101/2020.12.31.424982___________________________/\\\\\\\\\\\\\\\\\\\\\\\\___/\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\___/\\\\\\\\____________/\\\\\\\\_        \n________________________________________________________________________/\\\\\\//////////___\\/\\\\\\///////////___\\/\\\\\\\\\\\\________/\\\\\\\\\\\\_       \n____________________________________________/\\\\\\________________________/\\\\\\______________\\/\\\\\\______________\\/\\\\\\//\\\\\\____/\\\\\\//\\\\\\_      \n_______/\\\\\\\\\\__/\\\\\\\\\\________/\\\\\\\\\\\\\\\\____/\\\\\\\\\\\\\\\\\\\\\\___/\\\\\\\\\\\\\\\\\\_____\\/\\\\\\____/\\\\\\\\\\\\\\__\\/\\\\\\\\\\\\\\\\\\\\\\______\\/\\\\\\\\///\\\\\\/\\\\\\/_\\/\\\\\\_     \n______/\\\\\\///\\\\\\\\\\///\\\\\\____/\\\\\\/////\\\\\\__\\////\\\\\\////___\\////////\\\\\\____\\/\\\\\\___\\/////\\\\\\__\\/\\\\\\///////_______\\/\\\\\\__\\///\\\\\\/___\\/\\\\\\_    \n______\\/\\\\\\_\\//\\\\\\__\\/\\\\\\___/\\\\\\\\\\\\\\\\\\\\\\______\\/\\\\\\_________/\\\\\\\\\\\\\\\\\\\\___\\/\\\\\\_______\\/\\\\\\__\\/\\\\\\______________\\/\\\\\\____\\///_____\\/\\\\\\_   \n_______\\/\\\\\\__\\/\\\\\\__\\/\\\\\\__\\//\\\\///////_______\\/\\\\\\_/\\\\____/\\\\\\/////\\\\\\___\\/\\\\\\_______\\/\\\\\\__\\/\\\\\\______________\\/\\\\\\_____________\\/\\\\\\_  \n________\\/\\\\\\__\\/\\\\\\__\\/\\\\\\___\\//\\\\\\\\\\\\\\\\\\\\_____\\//\\\\\\\\\\____\\//\\\\\\\\\\\\\\\\/\\\\__\\//\\\\\\\\\\\\\\\\\\\\\\\\/___\\/\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\__\\/\\\\\\_____________\\/\\\\\\_ \n_________\\///___\\///___\\///_____\\//////////_______\\/////______\\////////\\//____\\////////////_____\\///////////////___\\///______________\\///__\n=============================================================================================================================================\nA Snakemake-based pipeline desinged to predict metabolic interactions directly from metagenomics data using high performance computer clusters\n===============================================================================================================================================\n'\n\n#check if conda is installed/available\necho -ne \"Checking if conda is available ... \"\n\nif ! command -v conda &> /dev/null ; then\n    echo -e \"\\nWARNING: Conda is not available! Please load your cluster's conda module or install locally and re-run the env_setup.sh script using:\\n\\nbash env_setup.sh\\n\" && exit\nelse\n    condav=$(conda --version | cut -d ' ' -f2)\n    echo -e \"detected version $condav!\"\nfi\n\n#check if mamba or mamba env are available\necho -ne \"Checking if mamba environment is available ... \"\n\nrepl=\"etc\\/profile\\.d\\/conda\\.sh\"\nsource $(which conda | sed -e \"s/condabin\\/conda/${repl}/\" | sed -e \"s/bin\\/conda/${repl}/\")\n\nif conda info --envs | grep -q mamba ; then\n    conda activate mamba\n    if command -v mamba &> /dev/null ; then\n        #mamba env installed and activated\n        mambav=$(mamba --version|head -n1|cut -d ' ' -f2) && echo -e \"detected version $mambav!\\n\"\n    else\n        #mamba not installed in mamba env\n        conda install mamba && echo \"Installed mamba\\n\"\n    fi\nelse\n    while true; do\n        read -p \"Do you wish to create an environment for mamba installation? This is recommended for faster setup (y/n)\" yn\n        case $yn in\n            [Yy]* ) echo \"conda create -n mamba mamba -c conda-forge\"|bash; break;;\n            [Nn]* ) echo -e \"\\nPlease set up mamba before proceeding.\\n\"; exit;;\n            * ) echo \"Please answer yes or no.\";;\n        esac\n    done\nfi\n\nconda activate mamba && echo \"activated mamba environment!\"\n\nwhile true; do\n    read -p \"Do you wish to download and set up metaGEM conda environment? (y/n)\" yn\n    case $yn in\n        [Yy]* ) echo \"mamba env create --prefix ./envs/metagem -f envs/metaGEM_env.yml && source activate envs/metagem && pip install --user memote carveme smetana &&  echo \"|bash; break;;\n        [Nn]* ) echo -e \"\\nSkipping metaGEM env setup, note that you will need this for refinement & reassembly of MAGs.\\n\"; break;;\n        * ) echo \"Please answer yes or no.\";;\n    esac\ndone\n\nwhile true; do\n    read -p \"Do you wish to download the GTDB-tk database (~25 Gb)? (y/n)\" yn\n    case $yn in\n        [Yy]* ) echo \"download-db.sh && source deactivate && source activate mamba\"|bash; break;;\n        [Nn]* ) echo -e \"\\nSkipping GTDB-tk database download, note that you will need this for taxonomic classification of MAGs.\\n\"; break;;\n        * ) echo \"Please answer yes or no.\";;\n    esac\ndone\n\nwhile true; do\n    read -p \"Do you wish to download and set up metaWRAP conda environment? (y/n)\" yn\n    case $yn in\n        [Yy]* ) echo \"mamba env create --prefix ./envs/metawrap -f envs/metaWRAP_env.yml\"|bash; break;;\n        [Nn]* ) echo -e \"\\nSkipping metaWRAP env setup, note that you will need this for refinement & reassembly of MAGs.\\n\"; break;;\n        * ) echo \"Please answer yes or no.\";;\n    esac\ndone\n\nwhile true; do\n    read -p \"Do you wish to download the CheckM database (~275 Mb)? (y/n)\" yn\n    case $yn in\n        [Yy]* ) echo \"wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz\"|bash; break;;\n        [Nn]* ) echo -e \"\\nSkipping CheckM database download, note that you will need this for bin refinement & reassembly.\\n\"; break;;\n        * ) echo \"Please answer yes or no.\";;\n    esac\ndone\n\nwhile true; do\n    read -p \"Do you wish to download and set up prokka + roary conda environment? (y/n)\" yn\n    case $yn in\n        [Yy]* ) echo \"mamba env create --prefix ./envs/prokkaroary -f envs/prokkaroary_env.yml\"|bash; break;;\n        [Nn]* ) echo -e \"\\nSkipping prokka-roary env setup, note that you will need this for pangenome analysis of MAGs.\\n\"; break;;\n        * ) echo \"Please answer yes or no.\";;\n    esac\ndone\n\necho 'Please ensure that the installation directory is present in your $PATH variable if installation issues arise with any tools.'\necho \"\"\n"
  },
  {
    "path": "workflow/scripts/kallisto2concoct.py",
    "content": "#!/usr/bin/env python\n\"\"\"A script to create a concoct input table from kallisto abundance.txt output files. https://github.com/EnvGen/toolbox/blob/master/scripts/kallisto_concoct/input_table.py\"\"\"\n\nimport argparse\nimport pandas as pd\nimport os\nimport sys\n\ndef samplenames_from_file(name_file):\n    if name_file:\n        with open(name_file, 'r') as name_file_h:\n            return [l.strip() for l in name_file_h]\n    else:\n        return None\n\ndef main(args):\n    sample_dfs = []\n\n    samplenames = samplenames_from_file(args.samplenames)\n    for i, sample in enumerate(args.quantfiles):\n        if samplenames:\n            samplename = samplenames[i]\n        else:\n            samplename = os.path.basename(sample)        \n        sample_df = pd.read_table(sample, index_col=0)\n        \n        sample_dfs.append((samplename, sample_df))\n    kallisto_df = pd.DataFrame(index=sample_df.index)\n\n    for sample, sample_df in sample_dfs:\n        kallisto_df['kallisto_coverage_{0}'.format(sample)] = 200*sample_df['est_counts'].divide(sample_df['length'])\n    kallisto_df.to_csv(sys.stdout, sep=\"\\t\", float_format=\"%.6f\")\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser(description=__doc__)\n    parser.add_argument(\"quantfiles\", nargs='+', help=\"Kallisto abundance.txt files\")\n    parser.add_argument(\"--samplenames\", default=None, help=\"File with sample names, one line each, Should be the same order and the same number as the abundance.txt files\")\n    args = parser.parse_args()\n    \n    main(args)\n\n"
  },
  {
    "path": "workflow/scripts/media_db.tsv",
    "content": "medium\tdescription\tcompound\tname\nM1\tM1\t3mb\t3mb\nM1\tM1\t4abz\t4abz\nM1\tM1\tac\tac\nM1\tM1\tbtn\tbtn\nM1\tM1\tbut\tbut\nM1\tM1\tca2\tca2\nM1\tM1\tcbl1\tcbl1\nM1\tM1\tcbl2\tcbl2\nM1\tM1\tcellb\tcellb\nM1\tM1\tcl\tcl\nM1\tM1\tcobalt2\tcobalt2\nM1\tM1\tcu2\tcu2\nM1\tM1\tcys__L\tcys__L\nM1\tM1\tfe2\tfe2\nM1\tM1\tfe3\tfe3\nM1\tM1\tfol\tfol\nM1\tM1\tfru\tfru\nM1\tM1\tglc__D\tglc__D\nM1\tM1\th\th\nM1\tM1\th2o\th2o\nM1\tM1\thco3\thco3\nM1\tM1\this__L\this__L\nM1\tM1\tk\tk\nM1\tM1\tlipoate\tlipoate\nM1\tM1\tmalt\tmalt\nM1\tM1\tmg2\tmg2\nM1\tM1\tmn2\tmn2\nM1\tM1\tmndn\tmndn\nM1\tM1\tmobd\tmobd\nM1\tM1\tna1\tna1\nM1\tM1\tnac\tnac\nM1\tM1\tni2\tni2\nM1\tM1\tno3\tno3\nM1\tM1\tpheme\tpheme\nM1\tM1\tpi\tpi\nM1\tM1\tpnto__R\tpnto__R\nM1\tM1\tppa\tppa\nM1\tM1\tpydxn\tpydxn\nM1\tM1\tribflv\tribflv\nM1\tM1\tslnt\tslnt\nM1\tM1\tso4\tso4\nM1\tM1\tthm\tthm\nM1\tM1\ttungs\ttungs\nM1\tM1\tzn2\tzn2\nM2\tM2\t4abz\t4abz\nM2\tM2\tac\tac\nM2\tM2\tade\tade\nM2\tM2\tala__L\tala__L\nM2\tM2\targ__L\targ__L\nM2\tM2\tascb__L\tascb__L\nM2\tM2\tasn__L\tasn__L\nM2\tM2\tasp__L\tasp__L\nM2\tM2\tbtn\tbtn\nM2\tM2\tca2\tca2\nM2\tM2\tcit\tcit\nM2\tM2\tcl\tcl\nM2\tM2\tcobalt2\tcobalt2\nM2\tM2\tcu2\tcu2\nM2\tM2\tcys__L\tcys__L\nM2\tM2\tfe2\tfe2\nM2\tM2\tfe3\tfe3\nM2\tM2\tfol\tfol\nM2\tM2\tglc__D\tglc__D\nM2\tM2\tgln__L\tgln__L\nM2\tM2\tglu__L\tglu__L\nM2\tM2\tgly\tgly\nM2\tM2\tgthrd\tgthrd\nM2\tM2\tgua\tgua\nM2\tM2\th\th\nM2\tM2\th2o\th2o\nM2\tM2\this__L\this__L\nM2\tM2\tile__L\tile__L\nM2\tM2\tinost\tinost\nM2\tM2\tk\tk\nM2\tM2\tleu__L\tleu__L\nM2\tM2\tlipoate\tlipoate\nM2\tM2\tlys__L\tlys__L\nM2\tM2\tmet__L\tmet__L\nM2\tM2\tmg2\tmg2\nM2\tM2\tmn2\tmn2\nM2\tM2\tmobd\tmobd\nM2\tM2\tna1\tna1\nM2\tM2\tnac\tnac\nM2\tM2\tnh4\tnh4\nM2\tM2\tni2\tni2\nM2\tM2\tphe__L\tphe__L\nM2\tM2\tpi\tpi\nM2\tM2\tpnto__R\tpnto__R\nM2\tM2\tpro__L\tpro__L\nM2\tM2\tpydam\tpydam\nM2\tM2\tpydxn\tpydxn\nM2\tM2\tribflv\tribflv\nM2\tM2\tser__L\tser__L\nM2\tM2\tso4\tso4\nM2\tM2\tthm\tthm\nM2\tM2\tthr__L\tthr__L\nM2\tM2\ttrp__L\ttrp__L\nM2\tM2\ttyr__L\ttyr__L\nM2\tM2\tura\tura\nM2\tM2\tval__L\tval__L\nM2\tM2\txan\txan\nM2\tM2\tzn2\tzn2\nM3\tM3\t3mb\t3mb\nM3\tM3\t4abz\t4abz\nM3\tM3\tac\tac\nM3\tM3\tade\tade\nM3\tM3\tala__L\tala__L\nM3\tM3\targ__L\targ__L\nM3\tM3\tascb__L\tascb__L\nM3\tM3\tasn__L\tasn__L\nM3\tM3\tasp__L\tasp__L\nM3\tM3\tbtn\tbtn\nM3\tM3\tbut\tbut\nM3\tM3\tca2\tca2\nM3\tM3\tcbl1\tcbl1\nM3\tM3\tcbl2\tcbl2\nM3\tM3\tcellb\tcellb\nM3\tM3\tcit\tcit\nM3\tM3\tcl\tcl\nM3\tM3\tcobalt2\tcobalt2\nM3\tM3\tcu2\tcu2\nM3\tM3\tcys__L\tcys__L\nM3\tM3\tfe2\tfe2\nM3\tM3\tfe3\tfe3\nM3\tM3\tfol\tfol\nM3\tM3\tfru\tfru\nM3\tM3\tglc__D\tglc__D\nM3\tM3\tgln__L\tgln__L\nM3\tM3\tglu__L\tglu__L\nM3\tM3\tgly\tgly\nM3\tM3\tgthrd\tgthrd\nM3\tM3\tgua\tgua\nM3\tM3\th\th\nM3\tM3\th2o\th2o\nM3\tM3\thco3\thco3\nM3\tM3\this__L\this__L\nM3\tM3\tile__L\tile__L\nM3\tM3\tinost\tinost\nM3\tM3\tk\tk\nM3\tM3\tlcts\tlcts\nM3\tM3\tleu__L\tleu__L\nM3\tM3\tlipoate\tlipoate\nM3\tM3\tlys__L\tlys__L\nM3\tM3\tmalt\tmalt\nM3\tM3\tmet__L\tmet__L\nM3\tM3\tmg2\tmg2\nM3\tM3\tmn2\tmn2\nM3\tM3\tmndn\tmndn\nM3\tM3\tmobd\tmobd\nM3\tM3\tna1\tna1\nM3\tM3\tnac\tnac\nM3\tM3\tnad\tnad\nM3\tM3\tnh4\tnh4\nM3\tM3\tni2\tni2\nM3\tM3\tno3\tno3\nM3\tM3\tphe__L\tphe__L\nM3\tM3\tpheme\tpheme\nM3\tM3\tpi\tpi\nM3\tM3\tpnto__R\tpnto__R\nM3\tM3\tppa\tppa\nM3\tM3\tpro__L\tpro__L\nM3\tM3\tpydam\tpydam\nM3\tM3\tpydxn\tpydxn\nM3\tM3\tribflv\tribflv\nM3\tM3\tser__L\tser__L\nM3\tM3\tslnt\tslnt\nM3\tM3\tso4\tso4\nM3\tM3\tthm\tthm\nM3\tM3\tthr__L\tthr__L\nM3\tM3\ttrp__L\ttrp__L\nM3\tM3\ttungs\ttungs\nM3\tM3\ttyr__L\ttyr__L\nM3\tM3\tura\tura\nM3\tM3\tval__L\tval__L\nM3\tM3\txan\txan\nM3\tM3\tzn2\tzn2\nM4\tM4\t3mb\t3mb\nM4\tM4\t4abz\t4abz\nM4\tM4\tac\tac\nM4\tM4\tade\tade\nM4\tM4\tala__L\tala__L\nM4\tM4\targ__L\targ__L\nM4\tM4\tascb__L\tascb__L\nM4\tM4\tasn__L\tasn__L\nM4\tM4\tasp__L\tasp__L\nM4\tM4\tbtn\tbtn\nM4\tM4\tbut\tbut\nM4\tM4\tca2\tca2\nM4\tM4\tcbl1\tcbl1\nM4\tM4\tcbl2\tcbl2\nM4\tM4\tcellb\tcellb\nM4\tM4\tcit\tcit\nM4\tM4\tcl\tcl\nM4\tM4\tcobalt2\tcobalt2\nM4\tM4\tcu2\tcu2\nM4\tM4\tcys__L\tcys__L\nM4\tM4\tfe2\tfe2\nM4\tM4\tfe3\tfe3\nM4\tM4\tfol\tfol\nM4\tM4\tfru\tfru\nM4\tM4\tglc__D\tglc__D\nM4\tM4\tgln__L\tgln__L\nM4\tM4\tglu__L\tglu__L\nM4\tM4\tgly\tgly\nM4\tM4\tgthrd\tgthrd\nM4\tM4\tgua\tgua\nM4\tM4\th\th\nM4\tM4\th2o\th2o\nM4\tM4\thco3\thco3\nM4\tM4\this__L\this__L\nM4\tM4\tile__L\tile__L\nM4\tM4\tinost\tinost\nM4\tM4\tk\tk\nM4\tM4\tlcts\tlcts\nM4\tM4\tleu__L\tleu__L\nM4\tM4\tlipoate\tlipoate\nM4\tM4\tlys__L\tlys__L\nM4\tM4\tmalt\tmalt\nM4\tM4\tmet__L\tmet__L\nM4\tM4\tmg2\tmg2\nM4\tM4\tmn2\tmn2\nM4\tM4\tmndn\tmndn\nM4\tM4\tmobd\tmobd\nM4\tM4\tna1\tna1\nM4\tM4\tnac\tnac\nM4\tM4\tnad\tnad\nM4\tM4\tnh4\tnh4\nM4\tM4\tni2\tni2\nM4\tM4\tno3\tno3\nM4\tM4\tphe__L\tphe__L\nM4\tM4\tpheme\tpheme\nM4\tM4\tpi\tpi\nM4\tM4\tpnto__R\tpnto__R\nM4\tM4\tppa\tppa\nM4\tM4\tpro__L\tpro__L\nM4\tM4\tpydam\tpydam\nM4\tM4\tpydxn\tpydxn\nM4\tM4\tribflv\tribflv\nM4\tM4\tser__L\tser__L\nM4\tM4\tslnt\tslnt\nM4\tM4\tso4\tso4\nM4\tM4\tthm\tthm\nM4\tM4\tthr__L\tthr__L\nM4\tM4\ttrp__L\ttrp__L\nM4\tM4\ttungs\ttungs\nM4\tM4\ttyr__L\ttyr__L\nM4\tM4\tura\tura\nM4\tM4\tval__L\tval__L\nM4\tM4\txan\txan\nM4\tM4\tzn2\tzn2\nM5\tM5\t4abz\t4abz\nM5\tM5\tac\tac\nM5\tM5\tade\tade\nM5\tM5\tala__L\tala__L\nM5\tM5\targ__L\targ__L\nM5\tM5\tascb__L\tascb__L\nM5\tM5\tasn__L\tasn__L\nM5\tM5\tasp__L\tasp__L\nM5\tM5\tbtn\tbtn\nM5\tM5\tca2\tca2\nM5\tM5\tcbl1\tcbl1\nM5\tM5\tcbl2\tcbl2\nM5\tM5\tcellb\tcellb\nM5\tM5\tcit\tcit\nM5\tM5\tcl\tcl\nM5\tM5\tcobalt2\tcobalt2\nM5\tM5\tcu2\tcu2\nM5\tM5\tcys__L\tcys__L\nM5\tM5\tfe2\tfe2\nM5\tM5\tfe3\tfe3\nM5\tM5\tfol\tfol\nM5\tM5\tfru\tfru\nM5\tM5\tglc__D\tglc__D\nM5\tM5\tgln__L\tgln__L\nM5\tM5\tglu__L\tglu__L\nM5\tM5\tgly\tgly\nM5\tM5\tgthrd\tgthrd\nM5\tM5\tgua\tgua\nM5\tM5\th\th\nM5\tM5\th2o\th2o\nM5\tM5\thco3\thco3\nM5\tM5\this__L\this__L\nM5\tM5\tile__L\tile__L\nM5\tM5\tinost\tinost\nM5\tM5\tk\tk\nM5\tM5\tlcts\tlcts\nM5\tM5\tleu__L\tleu__L\nM5\tM5\tlipoate\tlipoate\nM5\tM5\tlys__L\tlys__L\nM5\tM5\tmalt\tmalt\nM5\tM5\tmet__L\tmet__L\nM5\tM5\tmg2\tmg2\nM5\tM5\tmn2\tmn2\nM5\tM5\tmndn\tmndn\nM5\tM5\tmobd\tmobd\nM5\tM5\tna1\tna1\nM5\tM5\tnac\tnac\nM5\tM5\tnad\tnad\nM5\tM5\tnh4\tnh4\nM5\tM5\tni2\tni2\nM5\tM5\tno3\tno3\nM5\tM5\tphe__L\tphe__L\nM5\tM5\tpheme\tpheme\nM5\tM5\tpi\tpi\nM5\tM5\tpnto__R\tpnto__R\nM5\tM5\tpro__L\tpro__L\nM5\tM5\tpydam\tpydam\nM5\tM5\tpydxn\tpydxn\nM5\tM5\tribflv\tribflv\nM5\tM5\tser__L\tser__L\nM5\tM5\tslnt\tslnt\nM5\tM5\tso4\tso4\nM5\tM5\tthm\tthm\nM5\tM5\tthr__L\tthr__L\nM5\tM5\ttrp__L\ttrp__L\nM5\tM5\ttungs\ttungs\nM5\tM5\ttyr__L\ttyr__L\nM5\tM5\tura\tura\nM5\tM5\tval__L\tval__L\nM5\tM5\txan\txan\nM5\tM5\tzn2\tzn2\nM7\tM7\t3mb\t3mb\nM7\tM7\t4abz\t4abz\nM7\tM7\tac\tac\nM7\tM7\tade\tade\nM7\tM7\tala__L\tala__L\nM7\tM7\targ__L\targ__L\nM7\tM7\tascb__L\tascb__L\nM7\tM7\tasn__L\tasn__L\nM7\tM7\tasp__L\tasp__L\nM7\tM7\tbtn\tbtn\nM7\tM7\tbut\tbut\nM7\tM7\tca2\tca2\nM7\tM7\tcbl1\tcbl1\nM7\tM7\tcbl2\tcbl2\nM7\tM7\tcit\tcit\nM7\tM7\tcl\tcl\nM7\tM7\tcobalt2\tcobalt2\nM7\tM7\tcu2\tcu2\nM7\tM7\tcys__L\tcys__L\nM7\tM7\tfe2\tfe2\nM7\tM7\tfe3\tfe3\nM7\tM7\tfol\tfol\nM7\tM7\tfru\tfru\nM7\tM7\tglc__D\tglc__D\nM7\tM7\tgln__L\tgln__L\nM7\tM7\tglu__L\tglu__L\nM7\tM7\tgly\tgly\nM7\tM7\tgthrd\tgthrd\nM7\tM7\tgua\tgua\nM7\tM7\th\th\nM7\tM7\th2o\th2o\nM7\tM7\thco3\thco3\nM7\tM7\this__L\this__L\nM7\tM7\tile__L\tile__L\nM7\tM7\tinost\tinost\nM7\tM7\tk\tk\nM7\tM7\tleu__L\tleu__L\nM7\tM7\tlipoate\tlipoate\nM7\tM7\tlys__L\tlys__L\nM7\tM7\tmet__L\tmet__L\nM7\tM7\tmg2\tmg2\nM7\tM7\tmn2\tmn2\nM7\tM7\tmndn\tmndn\nM7\tM7\tmobd\tmobd\nM7\tM7\tna1\tna1\nM7\tM7\tnac\tnac\nM7\tM7\tnad\tnad\nM7\tM7\tnh4\tnh4\nM7\tM7\tni2\tni2\nM7\tM7\tno3\tno3\nM7\tM7\tphe__L\tphe__L\nM7\tM7\tpheme\tpheme\nM7\tM7\tpi\tpi\nM7\tM7\tpnto__R\tpnto__R\nM7\tM7\tppa\tppa\nM7\tM7\tpro__L\tpro__L\nM7\tM7\tpydam\tpydam\nM7\tM7\tpydxn\tpydxn\nM7\tM7\tribflv\tribflv\nM7\tM7\tser__L\tser__L\nM7\tM7\tslnt\tslnt\nM7\tM7\tso4\tso4\nM7\tM7\tthm\tthm\nM7\tM7\tthr__L\tthr__L\nM7\tM7\ttrp__L\ttrp__L\nM7\tM7\ttungs\ttungs\nM7\tM7\ttyr__L\ttyr__L\nM7\tM7\tura\tura\nM7\tM7\tval__L\tval__L\nM7\tM7\txan\txan\nM7\tM7\tzn2\tzn2\nM8\tM8\t3mb\t3mb\nM8\tM8\t4abz\t4abz\nM8\tM8\tac\tac\nM8\tM8\tade\tade\nM8\tM8\tala__L\tala__L\nM8\tM8\targ__L\targ__L\nM8\tM8\tascb__L\tascb__L\nM8\tM8\tasn__L\tasn__L\nM8\tM8\tasp__L\tasp__L\nM8\tM8\tbtn\tbtn\nM8\tM8\tbut\tbut\nM8\tM8\tca2\tca2\nM8\tM8\tcbl1\tcbl1\nM8\tM8\tcbl2\tcbl2\nM8\tM8\tcellb\tcellb\nM8\tM8\tcit\tcit\nM8\tM8\tcl\tcl\nM8\tM8\tcobalt2\tcobalt2\nM8\tM8\tcu2\tcu2\nM8\tM8\tcys__L\tcys__L\nM8\tM8\tfe2\tfe2\nM8\tM8\tfe3\tfe3\nM8\tM8\tfol\tfol\nM8\tM8\tfru\tfru\nM8\tM8\tglc__D\tglc__D\nM8\tM8\tgln__L\tgln__L\nM8\tM8\tglu__L\tglu__L\nM8\tM8\tgly\tgly\nM8\tM8\tgthrd\tgthrd\nM8\tM8\tgua\tgua\nM8\tM8\th\th\nM8\tM8\th2o\th2o\nM8\tM8\thco3\thco3\nM8\tM8\this__L\this__L\nM8\tM8\tile__L\tile__L\nM8\tM8\tinost\tinost\nM8\tM8\tk\tk\nM8\tM8\tlcts\tlcts\nM8\tM8\tleu__L\tleu__L\nM8\tM8\tlipoate\tlipoate\nM8\tM8\tlys__L\tlys__L\nM8\tM8\tmalt\tmalt\nM8\tM8\tmet__L\tmet__L\nM8\tM8\tmg2\tmg2\nM8\tM8\tmn2\tmn2\nM8\tM8\tmndn\tmndn\nM8\tM8\tmobd\tmobd\nM8\tM8\tna1\tna1\nM8\tM8\tnac\tnac\nM8\tM8\tnad\tnad\nM8\tM8\tnh4\tnh4\nM8\tM8\tni2\tni2\nM8\tM8\tno3\tno3\nM8\tM8\tphe__L\tphe__L\nM8\tM8\tpheme\tpheme\nM8\tM8\tpi\tpi\nM8\tM8\tpnto__R\tpnto__R\nM8\tM8\tppa\tppa\nM8\tM8\tpro__L\tpro__L\nM8\tM8\tpydam\tpydam\nM8\tM8\tpydxn\tpydxn\nM8\tM8\tribflv\tribflv\nM8\tM8\tser__L\tser__L\nM8\tM8\tslnt\tslnt\nM8\tM8\tso4\tso4\nM8\tM8\tthm\tthm\nM8\tM8\tthr__L\tthr__L\nM8\tM8\ttrp__L\ttrp__L\nM8\tM8\ttungs\ttungs\nM8\tM8\ttyr__L\ttyr__L\nM8\tM8\tura\tura\nM8\tM8\tval__L\tval__L\nM8\tM8\txan\txan\nM8\tM8\tzn2\tzn2\nM9\tM9\t3mb\t3mb\nM9\tM9\t4abz\t4abz\nM9\tM9\tac\tac\nM9\tM9\tade\tade\nM9\tM9\tala__L\tala__L\nM9\tM9\targ__L\targ__L\nM9\tM9\tascb__L\tascb__L\nM9\tM9\tasn__L\tasn__L\nM9\tM9\tasp__L\tasp__L\nM9\tM9\tbtn\tbtn\nM9\tM9\tbut\tbut\nM9\tM9\tca2\tca2\nM9\tM9\tcbl1\tcbl1\nM9\tM9\tcbl2\tcbl2\nM9\tM9\tcit\tcit\nM9\tM9\tcl\tcl\nM9\tM9\tcobalt2\tcobalt2\nM9\tM9\tcu2\tcu2\nM9\tM9\tcys__L\tcys__L\nM9\tM9\tfe2\tfe2\nM9\tM9\tfe3\tfe3\nM9\tM9\tfol\tfol\nM9\tM9\tgln__L\tgln__L\nM9\tM9\tglu__L\tglu__L\nM9\tM9\tgly\tgly\nM9\tM9\tgthrd\tgthrd\nM9\tM9\tgua\tgua\nM9\tM9\th\th\nM9\tM9\th2o\th2o\nM9\tM9\thco3\thco3\nM9\tM9\this__L\this__L\nM9\tM9\tile__L\tile__L\nM9\tM9\tinost\tinost\nM9\tM9\tk\tk\nM9\tM9\tleu__L\tleu__L\nM9\tM9\tlipoate\tlipoate\nM9\tM9\tlys__L\tlys__L\nM9\tM9\tmet__L\tmet__L\nM9\tM9\tmg2\tmg2\nM9\tM9\tmn2\tmn2\nM9\tM9\tmndn\tmndn\nM9\tM9\tmobd\tmobd\nM9\tM9\tna1\tna1\nM9\tM9\tnac\tnac\nM9\tM9\tnad\tnad\nM9\tM9\tnh4\tnh4\nM9\tM9\tni2\tni2\nM9\tM9\tno3\tno3\nM9\tM9\tphe__L\tphe__L\nM9\tM9\tpheme\tpheme\nM9\tM9\tpi\tpi\nM9\tM9\tpnto__R\tpnto__R\nM9\tM9\tppa\tppa\nM9\tM9\tpro__L\tpro__L\nM9\tM9\tpydam\tpydam\nM9\tM9\tpydxn\tpydxn\nM9\tM9\tribflv\tribflv\nM9\tM9\tser__L\tser__L\nM9\tM9\tslnt\tslnt\nM9\tM9\tso4\tso4\nM9\tM9\tthm\tthm\nM9\tM9\tthr__L\tthr__L\nM9\tM9\ttrp__L\ttrp__L\nM9\tM9\ttungs\ttungs\nM9\tM9\ttyr__L\ttyr__L\nM9\tM9\tura\tura\nM9\tM9\tval__L\tval__L\nM9\tM9\txan\txan\nM9\tM9\tzn2\tzn2\nM10\tM10\t3mb\t3mb\nM10\tM10\t4abz\t4abz\nM10\tM10\tac\tac\nM10\tM10\tade\tade\nM10\tM10\tala__L\tala__L\nM10\tM10\targ__L\targ__L\nM10\tM10\tascb__L\tascb__L\nM10\tM10\tasn__L\tasn__L\nM10\tM10\tasp__L\tasp__L\nM10\tM10\tbtn\tbtn\nM10\tM10\tbut\tbut\nM10\tM10\tca2\tca2\nM10\tM10\tcbl1\tcbl1\nM10\tM10\tcbl2\tcbl2\nM10\tM10\tcellb\tcellb\nM10\tM10\tcit\tcit\nM10\tM10\tcl\tcl\nM10\tM10\tcobalt2\tcobalt2\nM10\tM10\tcu2\tcu2\nM10\tM10\tcys__L\tcys__L\nM10\tM10\tfe2\tfe2\nM10\tM10\tfe3\tfe3\nM10\tM10\tfol\tfol\nM10\tM10\tfru\tfru\nM10\tM10\tglc__D\tglc__D\nM10\tM10\tgln__L\tgln__L\nM10\tM10\tglu__L\tglu__L\nM10\tM10\tgly\tgly\nM10\tM10\tgthrd\tgthrd\nM10\tM10\tgua\tgua\nM10\tM10\th\th\nM10\tM10\th2o\th2o\nM10\tM10\thco3\thco3\nM10\tM10\this__L\this__L\nM10\tM10\tile__L\tile__L\nM10\tM10\tinost\tinost\nM10\tM10\tk\tk\nM10\tM10\tlcts\tlcts\nM10\tM10\tleu__L\tleu__L\nM10\tM10\tlipoate\tlipoate\nM10\tM10\tlys__L\tlys__L\nM10\tM10\tmalt\tmalt\nM10\tM10\tmet__L\tmet__L\nM10\tM10\tmg2\tmg2\nM10\tM10\tmn2\tmn2\nM10\tM10\tmndn\tmndn\nM10\tM10\tmobd\tmobd\nM10\tM10\tna1\tna1\nM10\tM10\tnac\tnac\nM10\tM10\tnad\tnad\nM10\tM10\tnh4\tnh4\nM10\tM10\tni2\tni2\nM10\tM10\tno3\tno3\nM10\tM10\tphe__L\tphe__L\nM10\tM10\tpheme\tpheme\nM10\tM10\tpi\tpi\nM10\tM10\tpnto__R\tpnto__R\nM10\tM10\tppa\tppa\nM10\tM10\tpro__L\tpro__L\nM10\tM10\tpydam\tpydam\nM10\tM10\tpydxn\tpydxn\nM10\tM10\tribflv\tribflv\nM10\tM10\tser__L\tser__L\nM10\tM10\tslnt\tslnt\nM10\tM10\tso4\tso4\nM10\tM10\tthm\tthm\nM10\tM10\tthr__L\tthr__L\nM10\tM10\ttrp__L\ttrp__L\nM10\tM10\ttungs\ttungs\nM10\tM10\ttyr__L\ttyr__L\nM10\tM10\tura\tura\nM10\tM10\tval__L\tval__L\nM10\tM10\txan\txan\nM10\tM10\tzn2\tzn2\nM11\tM11\t3mb\t3mb\nM11\tM11\t4abz\t4abz\nM11\tM11\tac\tac\nM11\tM11\tade\tade\nM11\tM11\tala__L\tala__L\nM11\tM11\targ__L\targ__L\nM11\tM11\tascb__L\tascb__L\nM11\tM11\tasn__L\tasn__L\nM11\tM11\tasp__L\tasp__L\nM11\tM11\tbtn\tbtn\nM11\tM11\tbut\tbut\nM11\tM11\tca2\tca2\nM11\tM11\tcbl1\tcbl1\nM11\tM11\tcbl2\tcbl2\nM11\tM11\tcellb\tcellb\nM11\tM11\tcit\tcit\nM11\tM11\tcl\tcl\nM11\tM11\tcobalt2\tcobalt2\nM11\tM11\tcu2\tcu2\nM11\tM11\tcys__L\tcys__L\nM11\tM11\tfe2\tfe2\nM11\tM11\tfe3\tfe3\nM11\tM11\tfol\tfol\nM11\tM11\tfru\tfru\nM11\tM11\tglc__D\tglc__D\nM11\tM11\tgln__L\tgln__L\nM11\tM11\tglu__L\tglu__L\nM11\tM11\tgly\tgly\nM11\tM11\tgthrd\tgthrd\nM11\tM11\tgua\tgua\nM11\tM11\th\th\nM11\tM11\th2o\th2o\nM11\tM11\thco3\thco3\nM11\tM11\this__L\this__L\nM11\tM11\tile__L\tile__L\nM11\tM11\tinost\tinost\nM11\tM11\tk\tk\nM11\tM11\tlcts\tlcts\nM11\tM11\tleu__L\tleu__L\nM11\tM11\tlipoate\tlipoate\nM11\tM11\tlys__L\tlys__L\nM11\tM11\tmalt\tmalt\nM11\tM11\tmet__L\tmet__L\nM11\tM11\tmg2\tmg2\nM11\tM11\tmn2\tmn2\nM11\tM11\tmndn\tmndn\nM11\tM11\tmobd\tmobd\nM11\tM11\tna1\tna1\nM11\tM11\tnac\tnac\nM11\tM11\tnad\tnad\nM11\tM11\tnh4\tnh4\nM11\tM11\tni2\tni2\nM11\tM11\tno3\tno3\nM11\tM11\tpheme\tpheme\nM11\tM11\tpi\tpi\nM11\tM11\tpnto__R\tpnto__R\nM11\tM11\tppa\tppa\nM11\tM11\tpro__L\tpro__L\nM11\tM11\tpydam\tpydam\nM11\tM11\tpydxn\tpydxn\nM11\tM11\tribflv\tribflv\nM11\tM11\tser__L\tser__L\nM11\tM11\tslnt\tslnt\nM11\tM11\tso4\tso4\nM11\tM11\tthm\tthm\nM11\tM11\tthr__L\tthr__L\nM11\tM11\ttungs\ttungs\nM11\tM11\tura\tura\nM11\tM11\tval__L\tval__L\nM11\tM11\txan\txan\nM11\tM11\tzn2\tzn2\nM13\tM13\tca2\tca2\nM13\tM13\tcbl1\tcbl1\nM13\tM13\tcbl2\tcbl2\nM13\tM13\tcl\tcl\nM13\tM13\tcobalt2\tcobalt2\nM13\tM13\tcu2\tcu2\nM13\tM13\tcys__L\tcys__L\nM13\tM13\tfe2\tfe2\nM13\tM13\tfe3\tfe3\nM13\tM13\tglc__D\tglc__D\nM13\tM13\th\th\nM13\tM13\th2o\th2o\nM13\tM13\this__L\this__L\nM13\tM13\tk\tk\nM13\tM13\tmg2\tmg2\nM13\tM13\tmn2\tmn2\nM13\tM13\tmndn\tmndn\nM13\tM13\tna1\tna1\nM13\tM13\tnh4\tnh4\nM13\tM13\tni2\tni2\nM13\tM13\tpheme\tpheme\nM13\tM13\tpi\tpi\nM13\tM13\tso4\tso4\nM13\tM13\tzn2\tzn2\nM14\tM14\tade\tade\nM14\tM14\tala__L\tala__L\nM14\tM14\targ__L\targ__L\nM14\tM14\tascb__L\tascb__L\nM14\tM14\tasp__L\tasp__L\nM14\tM14\tbtn\tbtn\nM14\tM14\tca2\tca2\nM14\tM14\tcl\tcl\nM14\tM14\tcobalt2\tcobalt2\nM14\tM14\tcu2\tcu2\nM14\tM14\tcys__L\tcys__L\nM14\tM14\tfe2\tfe2\nM14\tM14\tfe3\tfe3\nM14\tM14\tglc__D\tglc__D\nM14\tM14\tglu__L\tglu__L\nM14\tM14\tgly\tgly\nM14\tM14\th\th\nM14\tM14\th2o\th2o\nM14\tM14\this__L\this__L\nM14\tM14\tile__L\tile__L\nM14\tM14\tk\tk\nM14\tM14\tleu__L\tleu__L\nM14\tM14\tlys__L\tlys__L\nM14\tM14\tmet__L\tmet__L\nM14\tM14\tmg2\tmg2\nM14\tM14\tmn2\tmn2\nM14\tM14\tna1\tna1\nM14\tM14\tnac\tnac\nM14\tM14\tni2\tni2\nM14\tM14\tphe__L\tphe__L\nM14\tM14\tpi\tpi\nM14\tM14\tpnto__R\tpnto__R\nM14\tM14\tpro__L\tpro__L\nM14\tM14\tpydam\tpydam\nM14\tM14\tser__L\tser__L\nM14\tM14\tso4\tso4\nM14\tM14\tthiog\tthiog\nM14\tM14\tthm\tthm\nM14\tM14\tthr__L\tthr__L\nM14\tM14\ttrp__L\ttrp__L\nM14\tM14\ttyr__L\ttyr__L\nM14\tM14\tura\tura\nM14\tM14\tval__L\tval__L\nM14\tM14\tzn2\tzn2\nM15A\tM15A\tca2\tca2\nM15A\tM15A\tcl\tcl\nM15A\tM15A\tcobalt2\tcobalt2\nM15A\tM15A\tcu2\tcu2\nM15A\tM15A\tfe2\tfe2\nM15A\tM15A\tfe3\tfe3\nM15A\tM15A\tglc__D\tglc__D\nM15A\tM15A\th\th\nM15A\tM15A\th2o\th2o\nM15A\tM15A\tk\tk\nM15A\tM15A\tmg2\tmg2\nM15A\tM15A\tmn2\tmn2\nM15A\tM15A\tmobd\tmobd\nM15A\tM15A\tna1\tna1\nM15A\tM15A\tnh4\tnh4\nM15A\tM15A\tni2\tni2\nM15A\tM15A\tpi\tpi\nM15A\tM15A\tso4\tso4\nM15A\tM15A\tzn2\tzn2\nM15B\tM15B\tca2\tca2\nM15B\tM15B\tcl\tcl\nM15B\tM15B\tcobalt2\tcobalt2\nM15B\tM15B\tcu2\tcu2\nM15B\tM15B\tfe2\tfe2\nM15B\tM15B\tfe3\tfe3\nM15B\tM15B\tglc__D\tglc__D\nM15B\tM15B\th\th\nM15B\tM15B\th2o\th2o\nM15B\tM15B\tk\tk\nM15B\tM15B\tmg2\tmg2\nM15B\tM15B\tmn2\tmn2\nM15B\tM15B\tmobd\tmobd\nM15B\tM15B\tna1\tna1\nM15B\tM15B\tnh4\tnh4\nM15B\tM15B\tni2\tni2\nM15B\tM15B\tpi\tpi\nM15B\tM15B\tso4\tso4\nM15B\tM15B\tzn2\tzn2\nM16\tM16\t4abz\t4abz\nM16\tM16\tasp__L\tasp__L\nM16\tM16\tbtn\tbtn\nM16\tM16\tca2\tca2\nM16\tM16\tcl\tcl\nM16\tM16\tcobalt2\tcobalt2\nM16\tM16\tcu2\tcu2\nM16\tM16\tcys__L\tcys__L\nM16\tM16\tfe2\tfe2\nM16\tM16\tfe3\tfe3\nM16\tM16\tglu__L\tglu__L\nM16\tM16\th\th\nM16\tM16\th2co3\th2co3\nM16\tM16\th2o\th2o\nM16\tM16\tk\tk\nM16\tM16\tlac__L\tlac__L\nM16\tM16\tmg2\tmg2\nM16\tM16\tmn2\tmn2\nM16\tM16\tmobd\tmobd\nM16\tM16\tna1\tna1\nM16\tM16\tnac\tnac\nM16\tM16\tnh4\tnh4\nM16\tM16\tni2\tni2\nM16\tM16\tpi\tpi\nM16\tM16\tpnto__R\tpnto__R\nM16\tM16\tptrc\tptrc\nM16\tM16\tpydx\tpydx\nM16\tM16\tser__L\tser__L\nM16\tM16\tso4\tso4\nM16\tM16\tthiog\tthiog\nM16\tM16\tthm\tthm\nM16\tM16\ttyr__L\ttyr__L\nM16\tM16\tzn2\tzn2\nMILK \tMILK \th2o \tH2O\nMILK \tMILK \to2 \tO2\nMILK \tMILK \tco2 \tCO2\nMILK \tMILK \tca2 \tCa2+\nMILK \tMILK \tcl \tCl-\nMILK \tMILK \tcobalt2 \tCo2+\nMILK \tMILK \tcu2 \tCu2+\nMILK \tMILK \tfe2 \tFe2+\nMILK \tMILK \tfe3 \tFe3+\nMILK \tMILK \th \tH+\nMILK \tMILK \tk \tK+\nMILK \tMILK \tmg2 \tMg\nMILK \tMILK \tmn2 \tMn2+\nMILK \tMILK \tmobd \tMolybdate\nMILK \tMILK \tna1 \tNa+\nMILK \tMILK \tnh4 \tAmmonium\nMILK \tMILK \tni2 \tNi2+\nMILK \tMILK \tpi \tPhosphate\nMILK \tMILK \tso4 \tSulfate\nMILK \tMILK \tzn2 \tZn2+\nMILK \tMILK \tala__L \tL-Alanine\nMILK \tMILK \tasn__L \tL-Asparagine\nMILK \tMILK \tasp__L \tL-Aspartate\nMILK \tMILK \tglu__L \tL-Glutamate\nMILK \tMILK \tgln__L \tL-Glutamine\nMILK \tMILK \tgly \tGlycine\nMILK \tMILK \this__L \tL-Histidine\nMILK \tMILK \tile__L \tL-Isoleucine\nMILK \tMILK \tleu__L \tL-Leucine\nMILK \tMILK \tlys__L \tL-Lysine\nMILK \tMILK \torn \tOrnithine\nMILK \tMILK \tphe__L \tL-Phenylalanine\nMILK \tMILK \tpeamn \tPhylethylamine\nMILK \tMILK \tpro__L \tL-Proline\nMILK \tMILK \tser__L \tL-Serine\nMILK \tMILK \tthr__L \tL-Threonine\nMILK \tMILK \ttrp__L \ttryptophan\nMILK \tMILK \ttyr__L \tL-Tyrosine\nMILK \tMILK \tval__L \tL-Valine\nMILK \tMILK \tlcts \tLactose\nMILK \tMILK \tglc__D \tD-Glucose\nMILK \tMILK \tgal \tD-Galactose\nMILK \tMILK \tgal_bD \tBeta D-Galactose\nMILK \tMILK \tcit \tCitrate\nMILK \tMILK \tlac__D \tD-Lactate\nMILK \tMILK \tlac__L \tL-Lactate\nMILK \tMILK \tfor \tFormate\nMILK \tMILK \tac \tAcetate\nMILK \tMILK \toxa \tOxalate\nMILK \tMILK \tpydx \tPyridoxal\nMILK \tMILK \tcbl1 \tVitamin B12\nMILK \tMILK \tthm \tThiamin\nMILK \tMILK \tpnto__R \t(R)-Pantothenate\nMILK \tMILK \tfol \tFolate\nMILK \tMILK \tribflv \tRiboflavin\nMILK \tMILK \tnac \tIsonicotinic acid\nMILK \tMILK \tbtn \tBiotin\nMILK \tMILK \tbut \tButyrate\nMILK \tMILK \tcaproic \tCaproic acid\nMILK \tMILK \tocta \toctanoate\nMILK \tMILK \tdca \tdecanoate\nMILK \tMILK \tddca \tdodecanoate\nMILK \tMILK \tttdca \ttetradecanoate\nMILK \tMILK \tptdca \tPentadecanoate\nMILK \tMILK \thdca \thexadecanoate\nMILK \tMILK \tocdca \toctadecanoate\nMILK \tMILK \tarach \tarachidic acid\nMILK \tMILK \tttdcea \tTetradecenoate (n-C14:1)\nMILK \tMILK \thdcea \thexadecanoate (n-C16:1)\nMILK \tMILK \tocdcea \toctadecanoate (n-C18:1)\nMILK \tMILK \tlnlc \tlinoleic acid\nMILK \tMILK \tarachd \tarachidonic acid\nMILK \tMILK \tade \tAdenine\nMILK \tMILK \tgua \tGuanine\nMILK \tMILK \tins \tInosine\nMILK \tMILK \tthymd \tThymidine\nMILK \tMILK \tura \tUracil\nMILK \tMILK \txan \tXanthine\n"
  },
  {
    "path": "workflow/scripts/modelVis.R",
    "content": "library(gridExtra)\nlibrary(dplyr)\nlibrary(ggplot2)\n\ngems = read.delim(\"GEMs.stats\",stringsAsFactors = FALSE,header=FALSE, sep = \" \")\ngems$V5 = gsub(\"_.*$\",\"\",gems$V1)\ncolnames(gems) = c(\"bin\",\"mets\",\"rxns\",\"genes\",\"sample\")\n\nsamplesplot = gems %>% \n  count(sample) %>% \n  ggplot(aes(x=reorder(sample,-n),y=n)) + \n  geom_bar(stat = \"identity\") + \n  coord_flip() + \n  ggtitle(\"Number of GEMs across samples\") + \n  ylab(\"Number of GEMs carved\") +\n  xlab(\"Sample ID\")\n\nmetplot = ggplot() + \n  geom_density(data=gems,aes(mets),fill=\"#7fc97f\") +\n  ggtitle(\"Unique metabolites across GEMs\") + \n  theme(legend.position = \"none\") +\n  theme(axis.text.y=element_blank())\n\nrxnplot = ggplot() + \n  geom_density(data=gems,aes(rxns),fill=\"#beaed4\") +\n  ggtitle(\"Reactions across GEMs\") + \n  theme(legend.position = \"none\") +\n  theme(axis.text.y=element_blank())\n\ngeneplot = ggplot() + \n  geom_density(data=gems,aes(genes),fill=\"#fdc086\") +\n  ggtitle(\"Genes across GEMs\") + \n  theme(legend.position = \"none\") +\n  theme(axis.text.y=element_blank())\n\nplot=grid.arrange(samplesplot,arrangeGrob(metplot,rxnplot,geneplot,nrow=3,ncol=1),nrow=1,ncol=2,heights=c(60),widths=c(30,30))\n\nggsave(\"modelVis.pdf\",plot=plot, height = 8, width = 12)"
  },
  {
    "path": "workflow/scripts/prepareRoaryInput.R",
    "content": "# Prepare roary input script\n\nlibrary(dplyr)\nlibrary(tidyr)\n\n# Load in classification just as in taxonomyVis.R\nclassification = read.delim(\"GTDBtk.stats\",stringsAsFactors = FALSE,header = TRUE)\nclassification$bin = classification$user_genome\ngtdbtk_class = classification[,c(2,20)]\ngtdbtk_class$classification = gsub(\"*.__\",\"\",gtdbtk_class$classification)\ngtdbtk_class %>% separate(classification,c(\"domain\",\"phylum\",\"class\",\"order\",\"family\",\"genus\",\"species\"),sep = \";\") -> gtdbtk_class\ngtdbtk_class[gtdbtk_class==\"\"]<-'NA'\ngtdbtk_class$sample = gsub(\"_.*$\",\"\",gtdbtk_class$bin)\n\n# Load in refined+reassembled consensus bins just as in binningVis.R\nreassembledCheckm = read.delim(\"reassembled.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(reassembledCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\")\nreassembledBins= read.delim(\"reassembled_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(reassembledBins) = c(\"bin\",\"contigs\",\"length\")\nreassembled = left_join(reassembledCheckm,reassembledBins,by=\"bin\")\nreassembled$bin=gsub(\" $\",\"\",reassembled$bin)\nreassembled$bin=gsub(\"permissive\",\"p\",reassembled$bin)\nreassembled$bin=gsub(\"orig\",\"o\",reassembled$bin)\nreassembled$bin=gsub(\"strict\",\"s\",reassembled$bin)\nreassembled$bin=gsub(\"\\\\.bin\",\"_bin\",reassembled$bin)\n\n# Join dataframes by bin and filter out low completeness bins\nbins = left_join(reassembled,gtdbtk_class,by=\"bin\") %>% filter(completeness >= 90) %>% filter(contamination <= 5)\n\n# Identify which species are represented by at least 10 high quality bins\nbins %>% group_by(species) %>% count() %>% filter(n>=10) -> species\n\n# Run for loop to generate text file with bin IDs for each identified species\ndir.create(gsub(\"$\",\"/speciesBinIDs\",getwd()))\n\nfor (i in species$species) {\n  #Remove any forbidden characters if present:\n  #(spaces, forward slashes, square brackets, parentheses)\n  name = gsub(\" \",\"_\",i)\n  name = gsub(\"/\",\"_\",name)\n  name = gsub(\"\\\\[\",\"_\",name)\n  name = gsub(\"]\",\"_\",name)\n  name = gsub(\"\\\\(\",\"_\",name)\n  name = gsub(\")\",\"_\",name)\n  write.table(bins %>% \n                filter(species == i) %>%\n                select(bin),paste0(paste0(\"speciesBinIDs/\",name),\".txt\"),sep=\"\\n\",row.names=FALSE,col.names = FALSE,quote = FALSE)\n}"
  },
  {
    "path": "workflow/scripts/prepareRoaryInputGTDBtk.R",
    "content": "# Prepare roary input script\n\nlibrary(dplyr)\n\n# Load in classification just as in taxonomyVis.R\nclassification = read.delim(\"GTDBtk.stats\",stringsAsFactors = FALSE,header = TRUE)\nclassification$bin = classification$user_genome\n\n# Load in refined+reassembled consensus bins just as in binningVis.R\nreassembledCheckm = read.delim(\"reassembled.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(reassembledCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\")\nreassembledBins= read.delim(\"reassembled_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(reassembledBins) = c(\"bin\",\"contigs\",\"length\")\nreassembled = left_join(reassembledCheckm,reassembledBins,by=\"bin\")\nreassembled$bin=gsub(\" $\",\"\",reassembled$bin)\nreassembled$bin=gsub(\"permissive\",\"p\",reassembled$bin)\nreassembled$bin=gsub(\"orig\",\"o\",reassembled$bin)\nreassembled$bin=gsub(\"strict\",\"s\",reassembled$bin)\nreassembled$bin=gsub(\"\\\\.bin\",\"_bin\",reassembled$bin)\n\n# Join dataframes by bin and filter out low completeness bins\nbins = left_join(reassembled,classification,by=\"bin\") %>% filter(completeness >= 90) %>% filter(contamination <= 5)\nbins$taxonomy = gsub(\"^ \",\"\",bins$taxonomy)\nbins$taxonomy = gsub(\" $\",\"\",bins$taxonomy)\n\n# Identify which species are represented by at least 10 high quality bins\nbins %>% group_by(taxonomy) %>% count() %>% filter(n>=10) -> species\n\n# Run for loop to generate text file with bin IDs for each identified species\ndir.create(gsub(\"$\",\"/speciesBinIDs\",getwd()))\n\nfor (i in species$taxonomy) {\n  name = gsub(\" \",\"_\",i)\n  name = gsub(\"/\",\"_\",name)\n  name = gsub(\"\\\\[\",\"_\",name)\n  name = gsub(\"]\",\"_\",name)\n  name = gsub(\"\\\\(\",\"_\",name)\n  name = gsub(\")\",\"_\",name)\n  write.table(bins %>% \n                filter(taxonomy == i) %>%\n                select(bin),paste0(paste0(\"speciesBinIDs/\",name),\".txt\"),sep=\"\\n\",row.names=FALSE,col.names = FALSE,quote = FALSE)\n}"
  },
  {
    "path": "workflow/scripts/prepareRoaryInput_old.R",
    "content": "# Prepare roary input script\n\nlibrary(dplyr)\n\n# Load in classification just as in taxonomyVis.R\nclassification = read.delim(\"classification.stats\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(classification)=c(\"bin\",\"NCBI\",\"taxonomy\",\"motu\",\"detect\",\"map\",\"percent\",\"cog\")\nclassification$percent=as.numeric(classification$percent)\nclassification$taxonomy=substr(classification$taxonomy,1,40)\nclassification$percent[is.na(classification$percent)]  <- 0\nclassification$bin=gsub(\" $\",\"\",classification$bin)\n\n# Load in refined+reassembled consensus bins just as in binningVis.R\nreassembledCheckm = read.delim(\"reassembled.checkm\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(reassembledCheckm) = c(\"bin\",\"completeness\",\"contamination\",\"GC\",\"lineage\",\"N50\",\"size\")\nreassembledBins= read.delim(\"reassembled_bins.stats\",stringsAsFactors = FALSE,header = FALSE, sep = \" \")\ncolnames(reassembledBins) = c(\"bin\",\"contigs\",\"length\")\nreassembled = left_join(reassembledCheckm,reassembledBins,by=\"bin\")\nreassembled$bin=gsub(\" $\",\"\",reassembled$bin)\nreassembled$bin=gsub(\"permissive\",\"p\",reassembled$bin)\nreassembled$bin=gsub(\"orig\",\"o\",reassembled$bin)\nreassembled$bin=gsub(\"strict\",\"s\",reassembled$bin)\n\n# Join dataframes by bin and filter out low completeness bins\nbins = left_join(reassembled,classification,by=\"bin\") %>% filter(completeness >= 90)\nbins$taxonomy = gsub(\"^ \",\"\",bins$taxonomy)\nbins$taxonomy = gsub(\" $\",\"\",bins$taxonomy)\n\n# Identify which species are represented by at least 10 high quality bins\nbins %>% group_by(taxonomy) %>% count() %>% filter(n>=10) -> species\n\n# Run for loop to generate text file with bin IDs for each identified species\ndir.create(gsub(\"$\",\"/speciesBinIDs\",getwd()))\n\nfor (i in species$taxonomy) {\n  name = gsub(\" \",\"_\",i)\n  name = gsub(\"/\",\"_\",name)\n  name = gsub(\"\\\\[\",\"_\",name)\n  name = gsub(\"]\",\"_\",name)\n  name = gsub(\"\\\\(\",\"_\",name)\n  name = gsub(\")\",\"_\",name)\n  write.table(bins %>% \n                filter(taxonomy == i) %>%\n                select(bin),paste0(paste0(\"speciesBinIDs/\",name),\".txt\"),sep=\"\\n\",row.names=FALSE,col.names = FALSE,quote = FALSE)\n}"
  },
  {
    "path": "workflow/scripts/qfilterVis.R",
    "content": "library(gridExtra)\nlibrary(dplyr)\nlibrary(ggplot2)\n\nqfilter = read.delim(\"qfilter.stats\",stringsAsFactors = FALSE, header = FALSE, sep = \" \")\ncolnames(qfilter) = c(\"ID\",\"readsBF\",\"readsAF\",\"basesBF\",\"basesAF\",\"percentReads\",\"q20BF\",\"q20AF\",\"q30BF\",\"q30AF\")\n\nreads = ggplot(data = qfilter) + \n  geom_density(aes(readsBF,fill=\"Pre-filtering\"),alpha=0.8) + \n  geom_density(aes(readsAF,fill=\"Post-filtering\"),alpha=0.8) +\n  ggtitle(\"Number of reads across samples\") + \n  xlab(\"Total reads\") + \n  theme(legend.title = element_blank())\n\nbases = ggplot(data = qfilter) + \n  geom_density(aes(basesBF,fill=\"Pre-filtering\"),alpha=0.8) + \n  geom_density(aes(basesAF,fill=\"Post-filtering\"),alpha=0.8) +\n  ggtitle(\"Number of bases across samples\") + \n  xlab(\"Total bases\") + \n  theme(legend.title = element_blank())\n\nq20 = ggplot(data = qfilter) + \n  geom_density(aes(q20BF*100,fill=\"Pre-filtering\"),alpha=0.8) + \n  geom_density(aes(q20AF*100,fill=\"Post-filtering\"),alpha=0.8) +\n  ggtitle(\"Percent Q20 bases across samples\") + \n  xlab(\"Percent of Q20 bases\") + \n  theme(legend.title = element_blank())\n\nq30 = ggplot(data = qfilter) + \n  geom_density(aes(q30BF*100,fill=\"Pre-filtering\"),alpha=0.8) + \n  geom_density(aes(q30AF*100,fill=\"Post-filtering\"),alpha=0.8) +\n  ggtitle(\"Percent Q30 bases across samples\") + \n  xlab(\"Percent of Q30 bases\") + \n  theme(legend.title = element_blank())\n\nbar = ggplot(data = qfilter) +\n  geom_bar(aes(x=reorder(ID,-basesBF),y=basesBF,fill=\"Pre-filtering\"),color=\"black\",stat = \"identity\") + \n  geom_bar(aes(x=reorder(ID,-basesBF),y=basesAF,fill=\"Post-filtering\"),color=\"black\",stat = \"identity\") + \n  geom_bar(aes(x=reorder(ID,-basesBF),y=q20AF*basesAF,fill=\"Q20 bases\"),color=\"black\",stat = \"identity\") + \n  geom_bar(aes(x=reorder(ID,-basesBF),y=q30AF*basesAF,fill=\"Q30 bases\"),color=\"black\",stat = \"identity\") + \n  coord_flip() +\n  ggtitle(\"Raw read QC summary stacked bar plot\") + \n  xlab(\"Sample ID\") + \n  ylab(\"Base pairs\") +\n  theme(legend.title = element_blank())\n\nqfilt=grid.arrange(bar,arrangeGrob(reads,bases,q20,q30,nrow=4,ncol=1),ncol =2,nrow=1)\nggsave(\"qfilterVis.pdf\",plot= qfilt,device = \"pdf\",height = 6, width=8)\n"
  },
  {
    "path": "workflow/scripts/taxonomyVis.R",
    "content": "library(gridExtra)\nlibrary(dplyr)\nlibrary(ggplot2)\n\nclassification = read.delim(\"classification.stats\",stringsAsFactors = FALSE,header = FALSE)\ncolnames(classification)=c(\"fasta\",\"NCBI\",\"taxonomy\",\"motu\",\"detect\",\"map\",\"percent\",\"cog\")\nclassification$percent=as.numeric(classification$percent)\nclassification$taxonomy=substr(classification$taxonomy,1,40)\nclassification$percent[is.na(classification$percent)]  <- 0\nclassification$fasta=gsub(\" $\",\"\",classification$fasta)\n\ntaxplot = classification %>% \n  count(taxonomy) %>% filter(n>10) %>% \n  ggplot(aes(x=reorder(taxonomy,-n),y=n)) +\n  geom_bar(stat = \"identity\") + \n  ggtitle(\"Taxonomy of reconstructed MAGs\") +\n  xlab(\"Taxonomy\") +\n  ylab(\"Count\") +\n  coord_flip()\n\nmapplot=ggplot(classification)+\n  geom_density(aes(map),fill=\"#7fc97f\") + \n  ggtitle(\"Density of marker genes mapped \") +\n  xlab(\"Number of marker genes mapped to MAG\") +\n  ylab(\"Density\")\n\ndetplot=ggplot(classification)+\n  geom_density(aes(detect),fill=\"#beaed4\") + \n  ggtitle(\"Density of marker genes detected\") +\n  xlab(\"Number of marker genes detected in MAG\") +\n  ylab(\"Density\")\n\nperplot=ggplot(classification)+\n  geom_density(aes(percent),fill=\"#fdc086\") + \n  ggtitle(\"Density of agreeing percentage of marker genes\") +\n  xlab(\"Percentage of mapped marker genes agreeing with assigned taxomy\") +\n  ylab(\"Density\")\n\nplotax=grid.arrange(taxplot,arrangeGrob(detplot,mapplot,perplot,nrow=3,ncol = 1),nrow=1,ncol=2)\n\nggsave(\"taxonomyVis.pdf\",plot= plotax,device = \"pdf\",dpi = 300, width = 40, height = 20, units = \"cm\")\n"
  }
]